Friday 12 March 2010

In which an error is realised

I’ve been giving some deeper thought to the multiple-has table I described in my last post, and I’ve figured out where the problem is. All the usual actions can be performed in constant time for any specific table, as usual… the issue is what kind of time they require. For n-dimensional tables, the number of operations for find increases linearly with n, but the operations required for add and delete increases factorially. For n=2, add requires three actions, but for n=3, add takes seven. n=4 causes add to take twenty-five, unless I’ve mucked up my math.

This is, of course, bad. Unless you can magically prepopulate the hashtable, and you never have to modify it… and that’s exceedingly unlikely. I think I need to find a better way of storing the indices. I think the problem that’s causing the factorial growth is the fact that I was originally thinking of storing the sub-hashtables (say, where n=2, the tables one dimension deep) as independent of each other, and relying on encapsulation. Didn’t I specifically mention that the annoying part about classic multidimensional arrays is that they’re based on encapsulation?!

Back to the drawing board.

Sunday 7 March 2010

In which an algorithm is partially puzzled out

This past week, I was working on a small project at work that seemed, at first glance, to need a unique type of multidimensional array—I needed to be able to store a collection of data that had a composite key, where the values of each component of the key were bounded sets. I realised after an hour or so of trying to figure it out that I was overthinking the problem (a regular multidimensional array would suffice, because as the programmer I can fully control which key component would be the primary axis of the array), but it left me with an interesting idea in my head that, after a brief bit of research, I've realised nobody's really approached... and while I thought of the idea at work, I also tossed it out as being unnecessary, so I don't think they can really lay claim to the intellectual property, because I'm not doing any development of the idea on their dime! Only part of what I'm going to get into was something done at the office, so the vast bulk of this is my own work, for my own amusement.

That idea is, in essence, an n-dimensional multi-hash table (which I'll just refer to as an MHT, for brevity's sake). The problem that I thought I had was that I would need to allow the user to specify either of the components, then update the available choices from the other component appropriately, or retrieve a subset of the data that had that key (like I said, it turned out to not be the case), but I've realised since then, what if, someday, I really need to do that, and I want to be able to do it quickly? My approach to the problem follows, but this is fairly off-the-cuff, and hasn't been implemented. I'm really just trying to think this out, and I want to make sure that if this turns out to be Important, that there's a published record of when I came up with it.

Let S be a set of data. Each item s in S contains (for the purpose of this example) a binary composite key, made of values j and k, where j and k are members of sets J and K. s has a unique key value n that may or may not be exposed to the user, but is available to the software. Instead, the user wishes to access s by specifying j and k, but may need to obtain a list of all items in S with a specific value of either j or k. Since S may be of arbitrary size, maintaining a low order for the algorithm is important. Hash tables are particularly well-suited to this, as they can typically maintain O(1) performance for most access functions.

In order to accomplish this, multiple hash tables must be maintained (it's actually quite similar to maintaining hash indices in a DBMS; perhaps I should look into the specific algorithms used there). A hash table of all J values must be maintained, and each item in the J hashtable must point to a hash table with values as its key. These relationships must in turn be transposed, so that the user can access all the s in S with a particular k.

The issue, of course, becomes one of implementation. We also want to maintain a low storage requirement, so each item in the J-to-K hash table should simply be a pointer into S; this is why I mentioned that each s in S has a unique key that is programmatically available, but not necessarily exposed to the user.

So, what's the best approach to doing this? It seems to me that n+1 hash tables are required, where n is the number of dimensions that the user needs to use to access a given s. A master S hash table will exist with all the items. Then each dimension will have its own hash table. For dimension J, each record would be a key-value pair with the key j, and the value being a hash table where the keys are values from K, and the values are pointers into S. Dimension K would exist similarly. When adding an item x to S, after calculating its unique key value, the add function would need to add xk to J and xj to K. The reverse would be necessary for deletions. So, for a 2-dimensional MHT, it seems that adds and deletes could be performed in 3 O(1) steps. It's not perfect, but it's better than O(n). Fortunately, the data storage requirements for the two-dimensional are 3n (so still O(n))—each record would exist one time in each of the S, J, and K hash tables.

The difficulty that I'm realising, as I write this, could very easily come up, is what do you do when you need to create an n-dimensional multihash table where n>2? The mechanism I've described above works fine where each j in J refers to a simple hash table, but what happens when you have keys J, K, and L? What would each j in J refer to? A multidimensional array? Then updating J may necessitate updating multiple tables with the same data (which is something I'd like to avoid). I think the answer lies in what each j in J points to, but I haven't worked out how that's going to work, and I tend to do my best figuring out of this type of thing when I can see the problem—and being 12:30 in the morning, at home, I don't exactly have access to a gigantic whiteboard on which to work, or the opportunity to try hacking on the problem to see what works best.

But there's my approach to a 2-dimensional hash table. This is definitely something I want to try implementing, and that I want to continue working with in a bid to make an n-dimensional, any-addressable, MHT. I don't know if it will give any kind of accolades, but I think it'll be a fun problem to solve... and considering the fact that I can't find any evidence of anyone else trying to approach it this way on The Google, I'm starting to think that it might even be patentable.. so I think this would qualify as prior art!

Now I'm even more excited.