Just wondering, but couldn't hash tables be implemented as "clever" association lists? I mean, the interpreter / compiler could switch by itself to a hash-table representation when the user calls 'assoc or 'alref on something like
((a b) (c d) (e f) ...)
In other words, a list of lists-of-size-2. The new representation would of course have to share the same semantics as alists (it must be possible to cons onto it, to take the car, the cddadr, ...) but with performance as a bonus.
Of course, it would still have to be able to go back to a "regular" list: I could do
(push 3 (car l))
-> ((3 a b) (c d) ...)
There, the compiler would have to see that I don't want an association list anymore and switch back to a more traditional representation. In a way, this is the same idea as almkglor's about lists secretly implemented as arrays.
This way, we would have one less primitive, with all the benefits of hash tables. I don't know if it is possible, but as of now I don't see why it wouldn't be.
Hmm, right, but that's just a design issue. If pg wanted to simplify the axioms, he could remove hash tables, make alists behave the way I explained, and make the syntax foo!x look up the 'x key in the alist 'foo.
Edit: there would be a problem, however, if we try foo!3: do we mean the third element, or the element associated with the key 3?
So that we don't step on each other's toes, I propose to work on the following myself (when I have the time and the opportunity):
1) annotate and friends, so as to make macros possible (as a macro is an annotated list).
2) chars and strings (and Unicode symbols too. For now it's not a problem, as they can't be destructured: a UTF-8 symbol just looks like a bunch of bytes. But with strings, since you can access individual chars, that won't work anymore).
3) bignums, rationals and inexact numbers. Maybe complex numbers, too.
4) hash tables. Maybe I'll use Lua's approach so as to make them more array-friendly.
Ok, I've got annotations working, with annotate, type and rep tuned to make everything work. Displaying annotated data gives the same output as canonical Arc.
pr also displays lists the same way as canonical Arc: instead of
(foo . (bar . (baz . nil)))
it displays
(foo bar baz)
And
(foo . (bar . baz))
is displayed
(foo bar . baz)
I'm fighting with strings now. I can display individual characters, as long as they are regular ASCII characters. A character is just a Unicode codepoint, encoded as a long. A string is internally an array of codepoints, so it occupies a lot of space (each character uses 8 bytes), but accessing / modifying characters is fast & easy. On output, strings are converted to UTF-8 (that's where it does not work anymore :).
Edit: it now works, as long as you only use ASCII characters, which is, I admit, rather stupid, but I had to do that first. And the len primitive is now implemented and works on cons cells and strings.
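Since the posts above describe strings as arrays of codepoints that get converted to UTF-8 only on output, here is a minimal sketch of that conversion step for one codepoint. The function name and interface are invented for illustration; this is not the actual arc2c code:

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Unicode codepoint (stored as a long, as in the string
   representation described above) into UTF-8. Writes up to 4 bytes
   into buf and returns the number of bytes written, or 0 for an
   invalid codepoint. Illustrative sketch, not the real code. */
static size_t utf8_encode(long cp, unsigned char *buf) {
    if (cp < 0) return 0;
    if (cp < 0x80) {                 /* 1 byte: plain ASCII */
        buf[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {         /* 2 bytes */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {       /* 3 bytes */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp < 0x110000) {      /* 4 bytes */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Internally every character stays a fixed-size long; only at output time does the variable byte length of UTF-8 come into play.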
Re: symbols - perhaps it's better to leave them in UTF-8, then only convert them to UTF-32 if and only if they are coerced to strings.
I suggest making tables work first before the really weird numbers ^^
Edit: there's also the problem of making I/O ports redirect to strings under construction. 'tostring is pretty much the idiomatic way of creating strings in Arc.
Edit2: As an aside, here's what I intend to do with ac.scm built-ins:
1) Remove ac.scm xdef'ed functions from the mac* stuff in xe.arc
2) Create a special lib-ac.scm.arc, where we can access primitives:
(set car
  (fn (x)
    (%car x))) ; will access primitive %car, but only in lib-ac.scm.arc
Code that isn't in lib-ac.scm.arc will not be able to access the primitives %foo.
The above will of course allow code like:
(map car (pair lst))
3) Finish up my inliner, so that user code of the form:
Yep. I left symbols encoded in UTF-8. When a string is output to anything (including when it is transformed into a symbol) it is translated to UTF-8. And I was planning to implement tables before the numeric tower. I think there's more fun in tables than in numbers :)
Just a question, but I wonder if you could give eds write access to the repo? And of course, if there's anyone else out there who would like to contribute, just let us know, I think sacado would be glad to accommodate you ^^
I'd like to have access too. Currently I don't have much free time to contribute, but if I find some time I'll be glad to contribute. My github username is, surprisingly, 'stefano'.
Now I've got first-class functions: they have a type tag and can thus be printed (it only displays #<procedure>, like canonical Arc), asked their type, etc. As soon as I did that, Boehm GC started complaining a lot (always saying "hmm, is that a pointer or not? I dunno, I'm lost...").
Not very dramatic, as these are just warning messages. But as Boehm seems almost broken on some machines, and as it was starting to bother me, I did it again: I implemented a new version of my own, hand-made garbage collector. It was easier with real first-class closures, and it is much faster this time. On the test code I've written, the program runs twice as fast with my buggy GC.
The code still partially relies on Boehm GC, as I have to tell it how every data type has to be collected, but it currently collects tagged objects, cons cells and closures. It will be pushed on Monday.
We'll all be waiting ^^. How'd you implement closures? As a structure or just an array? Boehm might get confused in the array case if you're using the first entry of the array as a type tag (which isn't a pointer). Or maybe not; I haven't studied Boehm GC very well.
As for the GC: what kind did you write? Copying or marking? If it's marking, I'd suggest a mark-and-don't-sweep collector. I think most incremental and thread-friendly modern GCs are copying, though.
Edit: as for me doing the macro hacking stuff, well, it looks like I'm all hacked out. Hehehe^^
Hmm... I'm not very good at terminology, but I'm almost sure it's a mark-and-sweep. The implementation relies on the system malloc. Every time some memory is required, the user calls gc_malloc. This function calls malloc, stores the returned pointer in an array, and returns that pointer. Once the array is full (we're not talking about consumed memory here, but about the number of live references, so it can break down if you build very big objects), collection is performed: everything not reachable from the stack (or recursively from reachable objects) is freed. It has to be improved, but for now it's working.
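For readers following along, here is a toy version of the bookkeeping just described: gc_malloc wraps malloc and records every pointer in a fixed-size table, and a full table is what triggers collection. Real collection scans the C stack for roots, which is hard to show portably, so in this sketch roots are marked explicitly; all names are invented and this is not sacado's actual collector:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy mark-and-sweep bookkeeping: every allocation is recorded in a
   fixed-size pointer table; when the table fills, a collection would
   be triggered. Stack scanning is omitted: gc_mark() stands in for it. */
#define GC_MAX 256

static void *gc_ptrs[GC_MAX];
static int   gc_marked[GC_MAX];
static int   gc_count = 0;

void *gc_malloc(size_t n) {
    if (gc_count == GC_MAX) return NULL;  /* real code would collect here */
    void *p = malloc(n);
    gc_ptrs[gc_count] = p;
    gc_marked[gc_count] = 0;
    gc_count++;
    return p;
}

void gc_mark(void *p) {                   /* mark one reachable object */
    for (int i = 0; i < gc_count; i++)
        if (gc_ptrs[i] == p) gc_marked[i] = 1;
}

int gc_sweep(void) {                      /* free unmarked; return # freed */
    int freed = 0, j = 0;
    for (int i = 0; i < gc_count; i++) {
        if (gc_marked[i]) {
            gc_ptrs[j] = gc_ptrs[i];      /* compact the live entries */
            gc_marked[j] = 0;
            j++;
        } else {
            free(gc_ptrs[i]);
            freed++;
        }
    }
    gc_count = j;
    return freed;
}
```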
I implemented closures as an array of long, which is very easy to deal with. The first element is the type tag, the second is the goto label, the third is the size of the array (we need it for garbage collection), and all the others are the captured arguments (well, they are objs, but an obj is implemented as a long).
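The closure layout described here could look like the following in C. The tag value and helper names are made up for the example; only the slot layout (tag, label, size, then captured values) comes from the description above:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of a closure as an array of long:
   slot 0: type tag, slot 1: goto label, slot 2: total array size
   (needed by the GC), slots 3..: captured values. */
#define TAG_CLOSURE 3L   /* invented tag value */

long *closure_new(long label, int ncaptured) {
    long *c = malloc((3 + ncaptured) * sizeof(long));
    c[0] = TAG_CLOSURE;       /* type tag */
    c[1] = label;             /* label to jump to when called */
    c[2] = 3 + ncaptured;     /* size, for the collector */
    return c;
}

long closure_ref(long *c, int i) { return c[3 + i]; } /* i-th captured value */
```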
It does indeed seem to be a mark-and-sweep. Generally, though, most GCs manage the heap themselves: they allocate one big chunk of memory via malloc() and allocate from that.
"Mark" means to determine if a memory area is accessible. Usually this means setting some sort of bit or variable for each memory area. After you've marked all reachable memory, you perform a "sweep": any unmarked memory is freed.
A slightly more efficient algorithm than mark-and-sweep is mark-and-don't-sweep (obviously, because you skip the "sweep" step), but it requires us to handle the heap directly. Here's an explanation:
Each memory area in the heap has a "free/in-use" bit. This bit's sense of "free" can vary. For example, at any one time, all "free/in-use" bits may have the meaning:
0 = FREE
1 = IN-USE
At another time, however, the meaning might be:
0 = IN-USE
1 = FREE
The magic here is the way the free/in-use bit is interpreted by the memory manager.
Now suppose we ask to allocate a bit of memory that is too large for the free memory currently pointed at by the alloc pointer:
|-------| <- I need something this big
+---+-----+--------------+---+------------+---------------+-------+
| 1 | 0 | 1 | 1 | 0 | 1 | 1 |
+---+-----+--------------+---+------------+---------------+-------+
^
alloc pointer
Obviously, we have to skip the free memory that's too small. However, let me introduce an invariant: everything to the left of the alloc pointer must be in-use. So whenever we skip free memory that's too small, we still mark it in-use, but we don't return it (obviously: it's too small!). Instead we continue on to the next free memory and see if that is large enough, and so on.
In this case the very next portion of memory is available:
|-------| <- I need something this big
+---+-----+--------------+---+-------+----+---------------+-------+
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
+---+-----+--------------+---+-------+----+---------------+-------+
| ^alloc pointer
v
returned
...and afterwards... uhh... well... we just allocate as normal, except that the meaning of 0/1 in the free/in-use bit has flipped. "Don't sweep": the sweep step has become part of allocation.
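To make the bit-flipping concrete, here is a toy allocator with equal-sized blocks. It assumes collection only runs when the heap is exhausted (so every bit currently reads in-use), which lets the flip alone free the dead blocks; it is purely illustrative and not arc2c's allocator:

```c
#include <assert.h>

/* Toy mark-and-don't-sweep: bit[i] is each block's free/in-use bit,
   and free_sense says which value currently means FREE. A collection
   marks the live blocks with the *opposite* sense and flips
   free_sense; no pass ever touches the dead blocks. */
#define NBLOCKS 8

static int bit[NBLOCKS];
static int free_sense = 0;   /* blocks with bit == free_sense are free */

int toy_alloc(void) {                  /* return a free block index, or -1 */
    for (int i = 0; i < NBLOCKS; i++)
        if (bit[i] == free_sense) {
            bit[i] = !free_sense;      /* now in-use */
            return i;
        }
    return -1;                         /* heap exhausted: time to collect */
}

/* Called only when the heap is exhausted, so every bit reads in-use.
   live[] lists the reachable blocks; everything else becomes free
   simply because the meaning of the bit flips. */
void toy_collect(const int *live, int nlive) {
    int new_free = !free_sense;        /* old IN-USE value will mean FREE */
    for (int i = 0; i < nlive; i++)
        bit[live[i]] = !new_free;      /* re-mark live blocks as IN-USE */
    free_sense = new_free;
}
```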
As an aside: I've started writing an 'eval function for use with macros in arc2c. This is done by creating a new "eval" function using (make-eval) in make-eval.arc. It's not done yet though.
My plan is that for each compilation run, we (make-eval) a new version of 'eval. Why? Because we want to protect the global environment.
For example, the user code might want to use the following macro:
(mac xe body
  `(tag (div class 'xe)
     ,@body))
Unfortunately, 'xe is a function defined and used by arc2c. If we were to simply 'eval all 'mac forms, then user code could trash arc2c.
Instead, we create a "protected" eval. This eval, when used, will prevent writes to global variables. Instead, writes to global variables will mutate a global-variable table, not the "real" global variables.
However, it's not done yet, there are a bunch of TODO's floating around. And unfortunately, I might not be able to do this for a week. Or maybe two weeks, or maybe a month.
A friend of mine has a pretty big personal Real Life(TM) problem (it involves, like nearly every big personal RL problem, a member of the opposite sex). I'll need to help him for now. Sorry.
(the guy will, usa embassy willing, be in san francisco, california, usa a month from now. he's had to borrow quite a bit from his friends too, so we're all pretty tapped out and can't accompany him. err. just wondering if someone near there could keep an eye on him.)
The code for the 'eval interpreter is on github. Anyone who wants to try continuing it is welcome to do so. You're even welcome to completely dismantle that bit and start some other way of doing macros.
That looks clever, and not too complicated... I'll try to implement it when I have enough time... As a matter of fact, dealing with heap space myself would let me reduce the size of closure objects (I wouldn't need to store the # of arguments they hold anymore).
Well, good luck with your friend, and see you soon!
Okay. Although I'd like to ask if you can implement the Anarki 'defcall and 'call* too?
Basically this means that instead of updating the pc C-var at END_JUMP(), our jump: C-label does the updating. If it's an ordinary function, we extract pc from CLOSURE_REF(LOCAL(0),0). If it's not, we look up its type in the call* table and rearrange the stack:
(obj k . ind)
=>
((call* (type obj)) k obj . ind)
This also means that closures need type tags now too.
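A C-level sketch of that dispatch, with invented types and names: closures are entered directly, while any other tagged object is rerouted through a call* table with itself pushed as an extra argument, mirroring the (obj k . ind) => ((call* (type obj)) k obj . ind) rewrite. This is a simplification of what the generated code would do, not actual arc2c output:

```c
#include <assert.h>

enum { T_FN, T_STRING, NTYPES };      /* invented type tags */

typedef struct Obj Obj;
typedef long (*handler)(Obj *self, long arg);

struct Obj {
    int tag;
    handler fn;          /* used when tag == T_FN */
    const char *data;    /* used when tag == T_STRING */
};

static handler call_star[NTYPES];     /* the call* table, indexed by type */

/* Calling any object: closures jump straight in; everything else is
   rerouted through call* with the object itself as an argument. */
long apply1(Obj *o, long arg) {
    if (o->tag == T_FN)
        return o->fn(o, arg);
    return call_star[o->tag](o, arg);
}

/* example call* entry: a "string" called with an index returns a char */
static long string_call(Obj *self, long i) { return self->data[i]; }

static long twice_fn(Obj *self, long x) { (void)self; return 2 * x; }
```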
Ok, I'll work on this too. I need to type closures now anyway, as I am implementing full support for things like pr and type.
Btw, about type tags and the pointer-as-fixnum hack: I read a paper about the implementation of Lua (for tips on implementing tables). In Lua, everything, including numbers, is implemented as a structure containing first the type, then the data for the actual object (somebody already mentioned this on the forum). The reason is that the ANSI C standard does not allow the pointer-as-fixnum trick: you cannot know for sure that addresses will have a 0 low bit. In practice it works on the most common architectures, but it's not fully portable (which is a problem given Lua's target). Hmm, you were right; maybe later we could add another version of codegen that would be slower and more memory-consuming but completely portable.
True, which is why I was always a bit leery of the trick. Besides, if you're going to implement bignums anyway, you might as well start off "fixnums" as very small bignums ^^. LOL. Of course there's the obvious problem that most applications won't use bignums, and in applications that do, most numbers still aren't bignums.
But then, if you add two fixnums together and the result won't fit in a fixnum...
As an aside, in the current mzscheme implementation, it seems fixnums are type 'int and everything else is type 'num
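For comparison, here are the two representations under discussion side by side: the low-bit fixnum trick (which relies on real pointers having a 0 low bit, guaranteed in practice by malloc alignment but not by ANSI C) and a Lua-style struct that carries its type explicitly. All names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Low-bit tagging: a fixnum n is stored as (n << 1) | 1, so anything
   with a 1 low bit is a number and anything with a 0 low bit is
   assumed to be a pointer. Not sanctioned by the C standard. */
typedef uintptr_t obj;

static obj  fix_box(long n)  { return ((uintptr_t)n << 1) | 1; }
static long fix_unbox(obj o) { return (long)((intptr_t)o >> 1); }
static int  is_fixnum(obj o) { return (int)(o & 1); }

/* Lua-style portable alternative: everything is boxed with an
   explicit type field, at the cost of memory and indirection. */
typedef struct {
    int type;                     /* explicit tag, no bit tricks */
    union { long n; void *p; } u;
} boxed;
```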
Finally: if you need to access call*, it may be possible to determine its position in GLOBAL() and add a C-global CALL_STAR:
You're welcome. You can prolly review bits of the code via github if you don't have access to your own computer this week.
Wonder how eds is doing on arc2c? BTW, have you requested to mentor his GSoC application? If you already did, I'll withdraw my request.
The bit about strings is: how do we represent them? UTF-8? UTF-32? As an array or a list of characters?
Arc's underlying mzscheme divides strings into code points; each code point is representable by a single 32-bit number (I think). An individual "character" in mzscheme is thus a code point (from what I gather), although in Unicode a character could be represented by several code points (or so I hear).
Now the point is that the following is quite valid:
arc> (= p "asdf")
"asdf"
arc> (= (p 0) #\c)
#\c
arc> p
"csdf"
So obviously access to individual characters should be easy, and replacing individual characters should probably not cause us to mess too much with memory. This almost prevents the use of UTF-8, even though all I/O will pretty much just use UTF-8.
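A sketch of why a codepoint array wins here: the (= (p 0) #\c) mutation from the session above becomes a single array store, whereas in UTF-8 the replacement character could have a different byte length and force the whole tail of the string to move. Types and names are invented for illustration:

```c
#include <assert.h>

/* String as an array of fixed-width codepoints: indexing and
   replacement are O(1), at the cost of 8 bytes per character
   when the codepoint type is a long. */
typedef struct {
    long *cp;   /* codepoints */
    int   len;  /* character count */
} arcstr;

long str_ref(const arcstr *s, int i)       { return s->cp[i]; }
void str_set(arcstr *s, int i, long c)     { s->cp[i] = c; }  /* O(1) */
```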
Yes, I think UTF-8 would be a disaster with modifiable strings. mzscheme uses UCS-4 (UTF-32) internally, and that would be the simplest approach. If you are willing to ignore Unicode characters above 0xFFFF, then UCS-2 would be okay, with half the memory usage. When you talk about a character represented by several code points, are you talking about Unicode surrogates for characters above 0xFFFF? (Oversimplifying, two UTF-16 surrogate code units are used to represent one Unicode code point above 0xFFFF.) I think you'd be better off with UTF-32 than UTF-16 and surrogates, as surrogates look like a nightmare that you'd only want if you need backwards compatibility; see Java's character support: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character....
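The surrogate arithmetic being avoided looks like this; it is shown only to illustrate why UTF-32 is simpler to work with, and is not code from any of the implementations discussed:

```c
#include <assert.h>

/* UTF-16 surrogates: a codepoint above 0xFFFF is offset by 0x10000
   and split into a 10-bit high half and a 10-bit low half. */
void to_surrogates(long cp, unsigned *hi, unsigned *lo) {
    long v = cp - 0x10000;
    *hi = 0xD800 | (unsigned)(v >> 10);     /* top 10 bits */
    *lo = 0xDC00 | (unsigned)(v & 0x3FF);   /* bottom 10 bits */
}

long from_surrogates(unsigned hi, unsigned lo) {
    return 0x10000 + (((long)(hi & 0x3FF) << 10) | (lo & 0x3FF));
}
```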
Oh, Unicode combining characters and normalization. I classify that as "somebody else's problem." Specifically, if you're writing a font rendering engine, it's your problem. If you're writing an Arc compiler, it's not your problem. If you want complete Unicode library support in your language (like MzScheme's normalization functions string-normalize-nfd, etc.), then you just use an existing library such as ICU, and it's not your problem. ICU: http://www-306.ibm.com/software/globalization/icu/index.jsp
I've been following up on the forum threads but I haven't had time to actually read the code yet. (And the last time I checked, the arc2c executable gave me a segfault.)
The current arc2c output assumes that you have a proper Boehm GC installation, but since I can't seem to get a good install here (prolly something to do with being AMD64 again) I just disable the GC for now.
I finally got arc2c to work (without GC as you suggested). One note though: apparently arc2c relies on rm-global.arc, but doesn't load it by default. So under the current version of the compiler, 'compile-file will error until you (load "rm-global.arc").
Oops. Must have forgotten to add it to arc2c.arc then ^^. Unfortunately I won't be able to fix this until maybe 7 hours from now T.T, haha, I'm in the office ^^
The thing about GC working: well, you need to somehow download the development version of Boehm GC, and, well, that's what's stopping me for now T.T
When I run (nsv) I get the same "ready to serve port 8080," which is a good sign, though I'm almost sure that my server (via shared hosting) isn't going to route everything automatically and properly...
In other words, I expect that there's some config I have to do and I'm not sure what it is.
The port for HTTP should really be port 80. However, I think most of the people here who have deployed the Arc server generally put Apache (serving on port 80) between the real world and the Arc server (serving on port 8080). This is generally done because nobody trusts the Arc server to be hackproof yet, especially not the Anarki version, which has been touched by quite a few people and may not be perfectly secure (but at least has a few more abilities than the Arc2 version).
If you're willing to risk it, then try (nsv 80), which forces the Arc server to listen on 80; you might need to run as root though.
I don't have root access, so naturally (nsv 80) gives "listen on 80 failed (Permission denied..." error.
Apache is up, but I'm not sure what I can do with it. Is there a simple redirect I need to run?
To be precise, I've loaded arc2 into a directory [mydomain]/news corresponding to news.[mydomain].com, with the news.arc specifying "news.[mydomain].com" as the site url; I launch the Arc REPL from there ("mzscheme -m -f as.scm") and do "(nsv)"... to no avail.
Try accessing news.[mydomain].com:8080. If you have some sort of terminal access on your remote computer, try lynx http://news.mydomain.com:8080/
As for the Apache <-> Arc talking thing, you'll have to wait for someone else to answer that question; I honestly don't know, because I haven't deployed one yet.
Wow, congratulations. I was planning to add an arc2jvm module to arc2c once it was working (as code generation is just one of the many transformations performed by the compiler, and not the hardest one), but now it is done. And in French :) :) :)
So now we've got more or less complete implementations of the language in JavaScript, CL, Java and Arc itself, plus the official one. I like to see that Arc is simple enough to be implemented quite easily in other languages. Now, a PHP implementation would be great, as it would enable us to host Arc programs almost anywhere :).
I think that's an excellent idea too. Boehm doesn't seem to be very good with threads, and green threads would make things like Erlang-style extended threading much easier. IO would probably be a problem, but that's something that can be deferred.
Yes, true. It would also be a pretty simple solution IMO, although potentially fraught with some dangers, especially when checking dead threads.
In any case I'm currently thinking of adding a library, where if a global variable is not read (only assigned to), it is simply removed.
Basically: extract the set of global variables that are read, and the set that are written to. If the two sets are not equal, eliminate all set operations on members of the written set that are not members of the read set (replace them with their (car ast!subx)). Then eliminate any hanging constants: naked fn definitions and other constants that sit in sequences in non-tail position. Repeat until the set of globals that are written is a subset of the set that are read; if the sets are still not equal, raise an unbound-variable error (i.e. a global is read but never assigned to, which, as you mentioned, would segfault). Not perfect of course, because reading a global before it is assigned will still crash, but it's better than nothing.
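The fixpoint just described can be sketched with globals as bit positions. This toy assumes an eliminated assignment can be dropped wholesale (i.e. its right-hand side has no other effects), which is exactly why removal can cascade and the loop must repeat; it is a model of the idea, not the actual xe.arc pass:

```c
#include <assert.h>
#include <stdint.h>

/* Globals are bits 0..63; each statement optionally writes one global
   and reads a set of them. Removing a write to a never-read global
   can make further globals unread, so we iterate to a fixed point. */
#define NO_WRITE (-1)

typedef struct {
    int      writes;  /* global index written, or NO_WRITE */
    uint64_t reads;   /* bitmask of globals read */
    int      dead;    /* set once the statement is eliminated */
} stmt;

/* returns the number of statements eliminated */
int eliminate_dead_globals(stmt *s, int n) {
    int removed = 0, changed = 1;
    while (changed) {
        changed = 0;
        uint64_t read = 0;                  /* globals read by live code */
        for (int i = 0; i < n; i++)
            if (!s[i].dead) read |= s[i].reads;
        for (int i = 0; i < n; i++)
            if (!s[i].dead && s[i].writes != NO_WRITE
                && !(read & (1ULL << s[i].writes))) {
                s[i].dead = 1;              /* write to a never-read global */
                removed++;
                changed = 1;
            }
    }
    return removed;
}
```

With globals a=0, b=1: (set a 42) then (set b a) then unrelated top-level code. Nothing reads b, so its write dies; once it is gone nothing reads a either, so the second pass kills that too.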
Wow, congratulations, you are much more productive than I am! I won't work a lot on the compiler this week, as I am far from my own computer, so I guess many things will have changed by the next time I explore the code :)
I've been circling this problem for some time now, getting nowhere quick.
The problem is that the current style of implementing closures, where local and closure variables are copied into the closure structure, strikes me as a premature optimization. A useful one: if closures are not nested, then all variable references are O(1). However, for shared mutable variables we need a different strategy. I'm trying to figure out some way of retaining the current closure style for variables that are never mutated, versus the cases where closure variables are.
BTW, how come no one else seems to want to work on the compiler? It seems that aside from you I'm the only committer?
Well, I'm watching with fascination, but it's over my head. I'm working through Essentials of Programming Languages (Friedman, Wand, and Haynes), though, so maybe soon ;)
It looks like a bug... Has anybody got pg's email address :) ? He doesn't appear to come here very often, and if it really is a bug, that's the kind of feedback he said he really wanted...
I suspect that parsing a float (or a complex) is an unimplemented feature. According to my documentation http://arcfn.com/foundation-doc.html#coerce a string can be coerced to a sym, cons, or int only.
Well, parsing a float is implemented, in the sense that read will parse it for you, so that's not the problem. I suspect it doesn't work because num is just a placeholder for "this kind of number is not yet fully implemented; don't expect much". It will probably die in a future release.