I personally think that strings and symbols should be separate, largely because of their different uses.
That said, I did recently write an app which used symbols as de facto strings; the text file format the user used as a configuration/task-description file was just an s-expr format. The app wasn't written in Lisp or anything like it, which was why I settled for using symbols as strings (to make it easier on my reader function - I just didn't have a separate readable string type).
Given that experience, well, I really must insist that having separate string and symbol types is better. In fact, in that app the config/taskdesc file was just a glorified association list (where the value is the 'cdr of each entry, not the 'cadr)!
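To make the cdr-vs-cadr point concrete, here's the shape of it (the 'target entry is just a made-up illustration, not from the real app):

  (= config '((target "foo" "bar") (verbose t)))
  (cdr (assoc 'target config))   ; => ("foo" "bar") - the whole tail
  (cadr (assoc 'target config))  ; => "foo"         - just the first arg

Keeping the value in the 'cdr means an entry can carry any number of arguments.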
As for strings being lists/arrays of characters, yes, that's a good idea. We might hack into the writer and have it scan through each list it finds, checking whether all elements are characters, and if so print it as a string. We might add an 'astring function which does this checking (obviously with circular-list protection) to use instead of [isa _ 'string].
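Here's a minimal sketch of what 'astring could look like, assuming chars stay a distinct type; the tortoise-and-hare stepping is the circular-list protection:

  (def astring (x)
    ; t iff x is a proper, acyclic list whose elements are all chars
    (and (acons x)
         ((afn (slow fast phase)
            (if (no fast)               t    ; clean end of list
                (~acons fast)           nil  ; dotted tail
                (~isa (car fast) 'char) nil  ; non-char element
                (let nxt (cdr fast)
                  (if (is nxt slow)
                      nil                    ; fast lapped slow: a cycle
                      (self (if phase (cdr slow) slow) nxt (no phase))))))
          x x nil)))

The obvious cost is that the writer still has to walk every list it prints.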
I think the strongest reason for separate strings and symbols is that you don't want all strings to be interned - that would just kill performance.
About lists of chars: rather than analyzing lists every time to see if they are strings, what about tagging them? I've mentioned before that I think Arc needs better support for user-defined types built from cons cells. Strings would be one such specialized, typed use of lists.
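Arc's existing 'annotate / 'type / 'rep look like the raw material for this; a small sketch (the 'charlist tag is just an illustrative name):

  (= s (annotate 'charlist '(#\h #\i)))
  (type s)  ; => charlist
  (rep s)   ; => (#\h #\i), the plain list underneath

Then the writer could dispatch on (type x) instead of scanning each list.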
Also, how do you feel about using symbols of length 1 to represent characters? The number one reason I can see not to is if you want chars to be Unicode and symbols to be ASCII-only.
From the implementation point of view representing characters as symbols is a real performance issue, because you would have to allocate every character on the heap, and a single character would then take more than 32 bytes of memory.
I think that's an implementation detail. You could still more or less keep the character type in the implementation, but write characters as "x" (or 'x) instead of #\x and make (type c) return 'string (or 'sym).
Or, if you take the problem the other way, you could say "length-1 symbols are quite frequent and shouldn't take too much memory -- let's represent them a special way where they would only take 4 bytes".
This would require some kind of automatic type conversions (probably at runtime), but characters-as-symbols seems doable without the overhead I thought it would lead to.
Personally I think memory should be managed by refcounts, with GC only when the cyclic garbage adds up. However, adding refcounts is somewhat harder, since every state-mutating 'set, 'sref, 'scar, 'scdr, and 'cons needs to decrement the old object's refcount and increment the new object's refcount.
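To see the shape of the bookkeeping, here's a sketch at the Arc level of what a refcounting 'scar would have to do ('incref and 'decref are hypothetical runtime hooks; in a real implementation this would live inside the primitive itself):

  (def rc-scar (cell new)
    (decref (car cell))  ; the value being overwritten loses a reference
    (incref new)         ; the incoming value gains one
    (scar cell new))     ; then do the actual mutation

The same pattern has to be repeated in every mutating primitive, which is where the tedium comes from.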
I also suppose that currently the time taken by GC isn't actually very big yet, since all we've been compiling are a few dorky, simple bits of Arc code.
One day a student came to Moon and said, "I understand how to make a better garbage collector. We must keep a reference count of the pointers to each cons." Moon patiently told the student the following story:
"One day a student came to Moon and said, "I understand how to make a better garbage collector...
Hmm, interesting. I'm not really fond of refcounting. It makes FFIs or C extensions really hard. That's what I don't like about Python's FFI. You always have to think: "Do I have to increment the refcount? Decrement it? Leave it alone?" If you don't get it right, you have random bugs. The sad story is that Python makes C programming harder than it already is.
By contrast, playing with mzscheme's or Lua's FFI is a real pleasure: you don't have to bother with GC. You even (sometimes) get your malloced objects collected for you.
But if we can centralize the refcount operations in a single place (or a very small number of places), I'm OK... Their discussion of stack_push / stack_pop is rather inspiring...
For information: on a GC-relatively-intensive program (mainly calculating (fib 40), which generates a lot of garbage), and with a heap of 50 million possible references, for a total run time of 228000 ms, I got the following GC info:
total time: 177 ms, number of cycles: 17, max time: 42 ms, avg time: 10 ms
That's far from perfect, of course, but it doesn't look so bad to me.
Btw, docstrings are a real performance killer: they are useless, but they are allocated on the heap and fill it up really quickly (ah, recursive functions...). We should add something to the code that removes immediate values in functions.
> Btw, docstrings are a real performance killer: they are useless, but they are allocated on the heap and fill it up really quickly (ah, recursive functions...). We should add something to the code that removes immediate values in functions.
Really? You've tried it? Because docstrings are supposed to be removed by the unused global removal step.
> Btw, docstrings are a real performance killer: they are useless, but they are allocated on the heap and fill it up really quickly (ah, recursive functions...
Are you saying that you alloc a docstring at every function call?
Well, for the moment, yes. Every object appearing in the program has to be allocated (it's not an optimizing compiler yet). Useless objects are not detected, so every time the compiler sees a string, it generates code to allocate it, and it is freed on the next GC cycle. Every time you call the function, that code is executed. Well, that's an easy optimisation, so I'll work on it very soon, I guess.
Yes, it's not difficult. You just have to find all constant values, create a global var for each constant, assign the constant value to the global var once, and substitute each occurrence of the constant with the global var's name.
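Conceptually the pass rewrites something like this (const0 is a hypothetical generated name, and the string stands in for any immediate value):

  ; before: the string is allocated on every call - deadly for (fib 40)
  (def fib (n)
    "naive fibonacci"
    (if (< n 2) n (+ (fib (- n 1)) (fib (- n 2)))))

  ; after: the constant is allocated once, into a generated global
  (= const0 "naive fibonacci")
  (def fib (n)
    const0
    (if (< n 2) n (+ (fib (- n 1)) (fib (- n 2)))))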
Refcounting performs a lot worse than a generational GC. When dealing with many deep data structures, it becomes even worse. And a simple generational GC is not very hard to implement.
Eh. It's attitude, not job description :) And anyway, you should probably take that with a relatively enormous grain of salt, as small sample sizes aren't conducive to accurate data.
Where pg fails.... --warning-blatant-self-promotion-- Anarki! Whee!
Hmm. We probably need a "Report on Anarki" as a spec of standard Anarki, particularly 'call* and 'defcall, which may very well be the most important extensions in Anarki.
Let me tell you a story about a language called "Stutter". It's a full m-expr language. Function calls have the syntax
f[x], math quite sensibly uses a + b * c notation, etc. The only
weird thing is that assigning a variable uses the set special form -
set[x 42] - because of some weird stuff in the parser that is only
tangentially related to our discussion.
Now I'll introduce to you a special syntax in Stutter. In Stutter, ()
introduces a data constant called an array, which is just like any
other sequential data collection in any other language. So it's
possible to do something like this in Stutter:
set[var (1 2 3 4)]
Stutter is a dynamically-typed language, and arrays can contain
strings, numbers, or even arrays:
set[var (1 "hello" ("sub-array" 4 ))]
Let me introduce to you something new about Stutter. In Stutter,
variable names are themselves data types. Let's call them labels
(this is slightly related to why we have a set[] special form). And
like any other data type, they can be kept in arrays:
set[var (hello world)]
Remember that () introduces an array constant. So (hello world)
will be an array of two labels, hello and world. It won't suddenly
become (42 54) or anything else even if hello is 42 and world is 54.
The variable's name is a label, but the label is not the variable (at
least not in the array syntax; it was a bit of a bug in the original
implementation, but some guys went and made code that used labels in
that manner and it got stuck in the language spec).
The array is just like any other array in any other language. You can
concatenate them like so:
set[var append[(array 1) (array 2)]]
=> now var is (array 1 array 2)
You can add an element in front of it like so:
set[var cons[1 (some array)] ]
=> now var is (1 some array)
Array access syntax is not very nice, but it does exist:
nth[1 (this is the array)]
=> this returns the label "is"
You could create an empty array with:
nil[]
And you could create an array with a single element with:
array["hello"]
=> returns the array ("hello")
Oh, and remember those guys who abused the labels in array syntax I
told you about? Well, they created a sort-of Stutter interpreter, in
Stutter. However, they sucked at parsing, so instead of accepting
files or strings or stuff like that, their Stutter interpreter
accepted arrays. They were going to make the parser later, but they
just really sucked at parsing.
They called their Stutter interpreter "lave", because they were hippie
wannabes and were tripping at the time they were choosing the name.
It was supposed to be "love", but like I said, they were tripping.
Of course, since lave accepted arrays, it couldn't get at the nice
f[x] syntax. So they decided that the first element of an array would
be the function name as a label. f[x] would become the array (f x).
lave had some limitations. For example, instead of Stutter's nice
infix syntax a + b, lave needed (plus a b). Fortunately lave included
a plus[x y] function which was simply:
define plus[x y]
  x + y
So how come these guys became so influential? You see, Stutter's BDFL
is a bit of a lazy guy. He is so lazy that he didn't even bother to
fix up the syntax for if-then-else. In fact, there was no
if-then-else. What was in fact present was a ridiculously ugly cond
syntax:
cond[
  { x == y
    ...your then code...
  }
  ;yes, what can I say, Stutter's BDFL is lazy
  { !(x == y)
    ...your else code...
  }
]
lave's creators pointed out that you could in fact represent the above
code, in lave-interpretable arrays, as:
(cond
  ( (eq x y)
    ...your then code...)
  ( (not (eq x y))
    ...your else code...))
Then they created a new Stutter function which would accept 3 arrays, like so:
if[
  (eq x y)
  (...your then code...)
  (...your else code...)]
You could then use an if-then-else syntax like this:
lave[
  if[ (eq x y)
      (...your then code...)
      (...your else code...)
  ]
]
Then they thought, hmm, maybe we can integrate this into our lave
function. So they wisely decided to create a new feature in lave,
called "orcam". I think it was supposed to be "okra", but
unfortunately I asked them about it while they were tripping, so maybe
I just got confused.
Basically, you could tell lave that certain Stutter functions would be
treated specially in their lave-syntax. These functions would have
the "orcam" property set in some private data of lave. Instead of
just running the function, lave would extract the array components,
pass them to the function, and then run whatever array that function
returned. So you could simply say:
lave_set_orcam_property[(if)]
lave[
  (if (eq x y)
      (...your then code...)
      (...your else code...)
  )
]
Because of this, people started creating all sorts of
orcam-property-functions. For example, there was only a while loop in
the language (lazy, lazy). Someone created an orcam-property-function
called for:
define for[s c u code]
  append[ (begin)    // begin{} is just a compound statement
    cons[ s
      array[         // wrap the while form so it stays a single nested element
        append[ (while)
          cons[ c
            cons[ code array[u] ]
          ]
        ]
      ]
    ]
  ]
So you could do:
for[(set i 0) (less i 42) (preincrement i)
    (begin (print i))]
And it would look like:
(begin
  (set i 0)
  (while (less i 42)
    (begin (print i))
    (preincrement i)
  )
)
So if you wanted something like a C for loop you could do:
lave_set_orcam_property[(for)]
lave[
  (for (set i 0) (less i 42) (preincrement i)
    (begin
      (print i)
    )
  )
]
It was particularly difficult to create nice orcam-property-functions,
but it was easier than trying to get Stutter's BDFL to move.
Soon after lave's creators added orcam-property-functions, Stutter's
BDFL decided to do something about the language. He was always
bothered about the bug in Stutter array syntax where something like
(hello world) would return, well, the array (hello world), instead of
sensibly returning an array with the values of hello and world. So he
introduced the `, syntax. An array constant prefixed with "`" would
have a special meaning. It would not be completely a constant.
Instead, when it saw a "," Stutter would evaluate the expression
following the comma and insert that element into the array being
created. So `(,hello ,world) could now become (42 54), if hello was
42 and world was 54.
Some of the top orcam-property-function writers realized that Stutter's
new `, syntax would really, really help. For example, instead of the
kludgy, virtually unreadable for code, you could just write:
define for[s c u code]
  `(begin
     ,s
     (while ,c
       ,code
       ,u
     )
   )
However, you weren't limited to just the `, syntax. It was usually
the best choice, but if there was a lave-expression array you wanted
that couldn't exactly be given by "`,", you could still use the good
old append[] and cons[] functions. In fact, for really complex
lave-expression arrays, a combination of the new `, syntax and the old
append[] and cons[] functions would do quite well.
Because of this, creating orcam-property-functions became easier and
their power skyrocketed. Other languages which automatically
evaluated variable labels in their arrays couldn't imitate it (so what
was originally a bug - that (hello world) did not evaluate the
variables hello and world - became a feature). Worse, those other
languages' arrays sometimes couldn't themselves contain arrays, or
even hold elements of mixed types.
Their arrays just weren't powerful enough to hold code, so other
languages never managed to create a powerful orcam-property syntax.
Eventually, people were writing Stutter programs like so:
lave[
  (define (fn x)
    (if (less x 1)
        1
        (times x (fn (minus x 1)))
    )
  )
]
And so, Stutter's BDFL decided to be even more lazy and simply wrote Stutter as:
while[true]{
  print[ lave[ read[] ] ]
}
so that everyone didn't have to keep writing "lave[]".