Concerning bloom filters, they are only helpful if knowing the fact that a particular process has a reference matters when deciding to copy. So, it's useful for the case when you're sending an object to a process. Otherwise ref counting is better, because you aren't broadcasting "do you have this?" messages.
Bloom filters are probably only a good idea if your copy is a large one, rather than just a small variable.
Here's an interesting idea: for extremely large messages, have some sort of "lazy copying" where the process only copies pieces just before it uses them, or when there isn't a cost associated with the transfer. That way large copies don't slow down your system. You'd still increment the reference counter, if you were using one, and any mutations would have to change a different copy.
Actually, a good question is what the relative costs of sharing vs. copying are. I would think that for small things copying is less expensive than keeping track of who has it, etc., just to avoid copying. For bigger items, it might be worth avoiding a premature copy. I don't know.
Edit 1: About the k hash functions, I think that if there is low enough correlation between bytes in the result you can break up one hash function into several. Also, most good hash functions, (the ones that might be too slow for this purpose) have enough of a difference in the result with slight changes to the input that you could add just 1 to the value several times, and get a dozen hashes. So you could use:
(hash val)
(hash (+ val 1))
(hash (+ val 2))
...
and get "3" hash functions. If you break each hash into bytes as well, and each produced multiple bytes, you could easily have 6, 9, or 27 "hash functions"
> So, it's useful for the case when you're sending an object to a process.
Yes, but it is in fact the only case where we copy objects anyway (even reading a global variable would be effectively sending the object in the global to the process in question)
> for extremely large messages...
#define N I don't know
if(msg.total_size() > N){
msg.share(process);
} else {
msg.copy(process);
}
I think this will end up having twice the overhead though: you need mechanisms to support both sharing with copy-on-write-and-if-receiver-already-has-it and copying.
Take note also that sending messages is recursive: we might be sending a list and thus end up sending several items.
> I would think that for small things copying is less expensive than keeping track of who has it, etc., just to avoid copying.
This was the justification in the Erlang BEAM and JAM machines to always copy.
Unless we make some special immutable types as mentioned above; then we could share those. Maybe I'm just prematurely optimizing, and should just give up on trying to find ways to avoid the copy. Life is much simpler just copying. :)
How about just starting with always-copy, and later when we have performance data, and experience using it, we can code exceptions? I don't know how easy it would be to code such exceptions, but it would be less likely to break if we didn't make these decisions without empirical evidence.
>twice the overhead...
I didn't think it would create any extra mechanisms, just a short circuit rule when deciding to copy or not. Rather than checking the ref counts each time the item is mutated or sent, if it is smaller than a certain amount, we know it was copied.
Now, maybe ref counts are faster than total_size(), but I don't know.
Certainly, always copying would avoid this whole overhead business. :)