Some Notes on Distributed Key Stores

Last week I ended up building a distributed keystore for a client. That wasn’t my original intention, but after testing just about every project out there, it turned out to be the best (only?) solution for our needs.

Specifically: a production environment handling at least 100M items with an accelerating growth curve, very low-latency retrievals, and the ability to handle 100s of inserts/s w/ variable-sized data (avg 1K, but in many cases well beyond that) … on EC2 hardware. The previous system had been using S3 (since SDB is limited to 1K values) – err, the lesson there, BTW, is: don’t do that.

So, these requirements are decent – something that actually requires a distributed system, but nothing that should be beyond what a few nodes can handle. My assumption was that I’d just be doing some load testing and documenting installation of the keystore the client had picked out, and that would be that. This was not the case.

I’m still catching up on a number of other projects, so I don’t have a great deal of time to do a formal writeup; however, the work I’ve done may be useful for those who actually need to implement a production keystore.

Some other useful recent starting points are Richard Jones’ Anti-RDBMS roundup and Bob Ippolito’s Drop ACID and think about data PyCon talk.

  • MySQL – while the BDB backend is being phased out, MySQL is a good baseline. In my testing on a single m1.large, I was able to store 20M items in one table at 400 inserts/s (with key indexes). Key retrievals were decently fast, but sometimes variable. Very large production keystores are being run on MySQL setups: FriendFeed has an interesting writeup of something they’re doing, and I have it on good authority that there are others running very big key stores w/ very simple distribution schemes (simple hashing into smaller table buckets – see the sketch after this list). If you can’t beat this, you should probably take your ball and go home.
  • Project Voldemort – Voldemort has a lot of velocity, and seems to be the de facto recommendation for distributed keystores. A friend had used this recently on a similar-scale (read-only) project, and this was what I spent the majority of my time initially working with. However, some issues…
    • Single-node local testing was quite fast – 1000+ inserts/s – however, once run in a distributed setup, it was much slower. After about 50M insertions, a multinode cluster was running at <150 inserts/s. This… was bad, and ultimately led me to abandon Voldemort, although there were other issues…
    • There is currently only a partially complete Python client. I added persistent connections as well as client-side routing w/ the RouteToAll strategy, but, well, see above.
    • Embedded in the previous statement is something worth mentioning – server-side routing currently doesn’t exist.
    • While I’m mentioning important things that don’t exist: there is currently no way to rebalance or migrate partitions, either online or, as far as I could tell, even offline. This puts a damper on things, no?
    • As a Dynamo implementation, Voldemort uses a VectorClock (automatic versioning) on each value – this is potentially a good thing for a large distributed infrastructure, but without the ability to add nodes or rebalance, a write-heavy load would lead to huge growth with no way to clean up old/unused items (which, of course, is also not implemented)
  • LightCloud – this is a simple layer on top of Tokyo Tyrant, but the use of two hash rings was a bit confusing, and the lack of production usage beyond the author’s own (on a whopping 2 machines containing “millions” of items) didn’t exactly inspire confidence. Another problem was that its setup was predicated on master-master replication, which requires update-logs to be turned on (again, storing all updates == bad for my use case). This was, of course, discovered by rooting through the source code, as documentation (basic setup, recommendations for # of lookup & storage nodes, etc.) is nonexistent. The manager itself was pretty weak, requiring setup and management on a per-machine basis. I just couldn’t really figure out how it was useful.
  • There were a number of projects, including Cassandra (which actually has some life to it now – lots of checkins recently), Dynomite, and Hypertable, that I tried and could not get compiled and/or set up – my rule of thumb is that if I’m not smart enough to get something up and running without a problem, the chances that I’ll be able to keep it running w/o problems are pretty much nil.
  • There were a number of other projects that were dismissed out of hand as unsuitable due to their non-distributed nature or other issues like lack of durable storage or general skeeviness: Scalaris (no storage), memcachedb (not distributed, weird issues/skeeviness, issues compiling), and redis (quite interesting, but way too alpha). Oh, and although it wasn’t in consideration at all because of previous testing with a much smaller data set: on the skeeviness factor, I’ll give CouchDB a special shout-out for having a completely aspirational (read: vaporware) architectural post-it note on its homepage. Not cool, guys.
  • Also, there were one or two projects I didn’t touch because I had already settled on a working approach (despite the sound of it, the timeline was super compressed – most of my testing was done in parallel with lots of EC2 test instances spun up, since loading millions of items and watching for performance degradation just takes a long time no matter how you slice it). One was MongoDB, a promising document-based store, although I’d wait until the auto-sharding bits get released to see how it really works. The other was Flare, another Japanese project, which sort of scares me – my eyes glazed over while looking at the setup tutorial (although having a detailed doc was definitely a pleasant step up). Again, I’d finished working on my solution by then, but the release notes also gave me a chuckle:

    released 1.0.8 (very stable)

    • fixed random infinite loop and segfault under heavy load
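
Backing up for a second: the MySQL bucketing scheme mentioned at the top of the list is simple enough to sketch. This is a minimal illustration, not anyone’s actual production setup – the table count, naming, schema, and hash choice here are all placeholder assumptions:

    import hashlib
    import MySQLdb  # any DB-API driver works the same way

    N_BUCKETS = 64  # assumes pre-created tables kv_00 .. kv_63

    def bucket_for(key):
        # hash the key to pick one of the smaller bucket tables
        h = int(hashlib.md5(key).hexdigest(), 16)
        return 'kv_%02d' % (h % N_BUCKETS)

    def put(conn, key, value):
        conn.cursor().execute(
            'REPLACE INTO %s (k, v) VALUES (%%s, %%s)' % bucket_for(key),
            (key, value))

    def get(conn, key):
        cur = conn.cursor()
        cur.execute('SELECT v FROM %s WHERE k = %%s' % bucket_for(key), (key,))
        row = cur.fetchone()
        return row[0] if row else None

The same hash can pick a server as well as a table, which is about all there is to those “very simple distribution schemes.”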

OK, so enough with all that. What did I end up with, you might ask? Well, while going through all this half-baked crap, the one thing I found that impressed me (a lot) was Tokyo Cabinet and its network server, Tokyo Tyrant. Here was something fast, mature, and very well documented, with multiple mature language bindings. Performance testing showed that storage size per item was 1/4 of Voldemort’s – and actually 1/2 of the raw data size, since Tokyo Cabinet comes with built-in ZLIB deflation.

Additionally, Tokyo Tyrant came with built-in threading, and I was able to push 1600+ inserts/s (5 threads) over the network without breaking a sweat. With a large enough bucket size, it promised to average O(1) lookups and the memory footprint was tiny.
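
To give a flavor of the test setup: the bucket count and compression are set when the database file is created (at ttserver startup), and the client side is just dict-style puts. Here’s a rough sketch of the threaded insert test (not my actual harness – it assumes the pytyrant client’s dict-style interface, and the counts are placeholders):

    # server side, for reference – bnum is the hash bucket count,
    # opts=ld enables large files + ZLIB deflate:
    #   ttserver -port 1978 'data.tch#bnum=40000000#opts=ld'

    import threading
    import pytyrant  # pure-Python Tokyo Tyrant client

    N_THREADS = 5

    def writer(thread_id, count=10000):
        # one connection per thread; PyTyrant exposes a dict-like API
        db = pytyrant.PyTyrant.open('127.0.0.1', 1978)
        for i in range(count):
            db['key-%d-%d' % (thread_id, i)] = 'x' * 1024  # ~1K values

    threads = [threading.Thread(target=writer, args=(t,))
               for t in range(N_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()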

So, it turns out the easiest thing to do was just throw up a thin layer to consistently hash the keys across a set of nodes (starting out with 8 nodes w/ a bucket-size of 40M – which means O(1) access on 80% of keys at 160M items). There’s a fair amount of headroom – I/O bottlenecks can be balanced out with more dedicated EC2 instances/EBS volumes, and the eventual need to add more nodes shouldn’t be too painful (i.e. adding nodes and either backfilling the 1/n items or adding inline moves).
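
For the curious, that thin layer is just consistent hashing – nothing you couldn’t write yourself in an afternoon. A minimal sketch (the node addresses and virtual-node count below are placeholders, not my production values):

    import bisect
    import hashlib

    class HashRing(object):
        """Keys map to the nearest node clockwise on the ring, so
        adding a node only remaps ~1/n of the keys."""

        def __init__(self, nodes, replicas=100):
            self.ring = []  # sorted list of (hash, node)
            for node in nodes:
                for i in range(replicas):  # virtual nodes smooth the balance
                    h = self._hash('%s:%d' % (node, i))
                    bisect.insort(self.ring, (h, node))

        def _hash(self, key):
            return int(hashlib.md5(key).hexdigest(), 16)

        def node_for(self, key):
            # first virtual node at or after the key's hash, wrapping around
            idx = bisect.bisect(self.ring, (self._hash(key), ''))
            return self.ring[idx % len(self.ring)][1]

    # e.g. eight Tyrant nodes, as in the setup described above
    ring = HashRing(['10.0.0.%d:1978' % i for i in range(1, 9)])
    print(ring.node_for('some-key'))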

There are some issues (e.g. hanging on idle sockets), but current gets are at about 1.2–3ms across the network (ping is about 1ms) and it otherwise seems to be doing OK.

Anyway, if you made it this far, the takeaways:

  1. The distributed stores out there are currently pretty half-baked, at best. Your comfort level running them in production may vary, but for most sane people, I doubt you’d want to.
  2. If you’re dealing w/ a reasonable number of items (<50M), Tokyo Tyrant is crazy fast. If you’re looking for a known quantity, MySQL is probably an acceptable solution.
  3. Don’t believe the hype. There’s a lot of talk, but I didn’t find any public project that came close to the (implied?) promise of tossing nodes in and having it figure things out.
  4. Based on the maturity of the projects out there, you could write your own in less than a day. It’ll perform just as well, and at least when it breaks, you’ll be more fond of it. Alternatively, you could go on the conference circuit and talk about how awesome your half-baked distributed keystore is.

UPDATE: I’d be remiss if I didn’t stress that you should know your requirements and do your own testing. Any numbers I toss around are very specific to the hardware and (more importantly) the data set. Furthermore, most of these projects are moving at a fast clip, so this may be out of date soon.

And, when you do your testing, publish the results – there’s almost nothing out there currently so additional data points would be a big help for everyone.