random($foo)

Some Notes on Distributed Key Stores #

random($foo) is the personal site of Leonard Lin, where I collect shiny things and publish original writing and code. more »

Last week I ended up building a distributed keystore for a client. That wasn’t my original intention, but after doing testing on just about every project out there, it turned out to be the best (only?) solution for our needs.

Specifically, a production environment handling at least 100M items with an accelerating growth curve, very low latency retrievals, and the ability to handle 100s of inserts/s w/ variable-sized data (avg 1K, but up in many cases well beyond) … on EC2 hardware. The previous system had been using S3 (since SDB is limited to 1K values) – err, the lesson there, BTW is don’t do that.

So, these requirements are decent – something that actually requires a distributed system, but something that shouldn’t be beyond what can be handled by a few nodes. My assumption was that I’d actually just be doing some load testing and documenting installation on the keystore the client picked out, and that would be that. This was not the case.

I’m still catching up on a number of other projects, so I don’t have a great deal of time to do a formal writeup, hoewver, the work I’ve done may be useful for those who might actually need to implement a production keystore.

Some other recent useful starting points may be Richard Jones’ Anti-RDBMS roundup and Bob Ippolito’s Drop ACID and think about data Pycon talk.

  • MySQL – while the BDB backend is being phased out, MySQL is a good baseline. With my testing, on a single m1.large, I was able to store 20M items within one table at 400 inserts/s (with key indexes). Key retrievals were decently fast but sometimes variable. There are very large production keystores are being run on MySQL setups. Friendfeed has an interesting writeup of something they’re doing, and I have it on good authority that there are others running very big key stores w/ very simple distribution schemes (simple hashing into smaller table buckets). If you can’t beat this, you should probably take your ball and go home.
  • Project Voldemort – Voldemort has a lot of velocity, and seems to be the de facto recommendation for distributed keystores. A friend had used this recently on a similar-scale (read-only) project, and this was what I spent the majority of my time initially working with. However, some issues…
    • Single node local testing was quite fast – 1000+ inserts/s, however, once run in a distributed setup, it was much slower. After about 50M insertions, a multinode cluster was running at <150 inserts/s. This… was bad and led me to ultimately abandon Voldemort, although there were other issues…
    • There is currently only a partially complete Python client. I added persistent connections in as well as client-side routing w/ the RouteToAll strategy, but well, see above
    • Embedded in the previous statement is something worth mentioning – server-side routing currently doesn’t exist.
    • While I’m mentioning important things that don’t exist, there is currently no way to rebalance or migrate partitions, either online, or, as far as I could tell, even offline. This puts a damper on things, no?
    • As a Dynamo implementation, a VectorClock (automatic versioning) is used – this is potentially a good thing for a large distributed infrastructure, but without the ability to add nodes or rebalance, it means that for a write-heavy load, it would lead to huge growth with no way for cleanup of old/unused items (this of course, also is not implemented)
  • LightCloud – this is a simple layer on top of Tokyo Tyrant but the use of two hash rings was a bit confusing and the lack of production usage beyond by the author (on a whopping 2 machines containing “millions” of items) didn’t exactly inspire confidence. Another problem was that it’s setup was predicated on using master-master replication which requires update-logs to be turned on (again, storing all updates == bad for my use case). This was of course, discovered rooting through the source code, as the documentation (including basic setup or recommendations for # of lookup & storage nodes, etc is nonexistent). The actual manager itself was pretty weak, requiring setup and management on a per-machine basis. I just couldn’t really figure out how it was useful.
  • There were a number of projects that I tried, including Cassandra (actually has some life to it now, lots of checkins recently), Dynomite and Hypertable that I tried and could not get compiled and or set up – my rule of thumb is that if I’m not smart enough to get it up and running without a problem, the chances that I’ll be able to keep it running w/o problems are pretty much nil.
  • There were a number of other projects that were unsuitable due to non-distributed nature or other issues like lack of durable storage or general skeeviness and so were dismissed out of hand, like Scalaris (no storage), memcachedb (not distributed, weird issues/skeeviness, issues compiling) and redis (quite interesting but way too alpha). Oh, although not in consideration at all because of previous testing with a much smaller data set, on the skeeviness factor, I’ll give CouchDB a special shout out for having a completely aspirational (read: vaporware) architectural post-it note on its homepage. Not cool, guys.
  • Also, there were one or two projects I didn’t touch because I had settled on a working approach (despite the sound of it, the timeline was super compressed – most of my testing was done in parallel with lots of EC2 test instances spun up (loading millions of nodes and watching for performance degradation just takes a long time no matter how you slice it). One was MongoDB, a promising document-based store, although I’d wait until the auto-sharding bits get released to see how it really works. The other was Flare, another Japanese project that sort of scares me. My eyes sort of glazed over while looking at the setup tutorial (although having a detailed doc was definitely a pleasant step up). Again, I’d finished working on my solution by then, but the release notes also gave me a chuckle:

    released 1.0.8 (very stable)

    • fixed random infinite loop and segfault under heavy load

OK, so enough with all that, What did I end up with you might ask? Well, while going through all this half-baked crap, what I did find that impressed me (a lot), was Tokyo Cabinet and its network server, Tokyo Tyrant. Here was something fast, mature, and very well documented with multiple mature language bindings. Testing performance showed that storage-size/item was 1/4 of Voldemort’s, and actually 1/2 of actual size (Tokyo Cabinet comes with built-in ZLIB deflation).

Additionally, Tokyo Tyrant came with built-in threading, and I was able to push 1600+ inserts/s (5 threads) over the network without breaking a sweat. With a large enough bucket size, it promised to average O(1) lookups and the memory footprint was tiny.

So, it turns out the easiest thing to do was just throw up a thin layer to consistently hash the keys across a set of nodes (starting out with 8 nodes w/ a bucket-size of 40M – which means O(1) access on 80% of keys at 160M items). There’s a fair amount of headroom – I/O bottlenecks can be balanced out with more dedicated EC2 instances/EBS volumes, and the eventual need to add more nodes shouldn’t be too painful (i.e. adding nodes and either backfilling the 1/n items or adding inline moves).

There are some issues (an issue w/ hanging on idle sockets) but current gets are at about 1.2-3ms across the network (ping is about 1ms) and it seems to otherwise be doing OK.

Anyway, if you made it this far, the takeaways:

  1. The distributed stores out there is currently pretty half-baked at best right now. Your comfort-level running in prod may vary, but for most sane people, I doubt you’d want to.
  2. If you’re dealing w/ a reasonable number of items (<50M), Tokyo Tyrant is crazy fast. If you're looking for a known, MySQL is probably an acceptable solution.
  3. Don’t believe the hype. There’s a lot of talk, but I didn’t find any public project that came close to the (implied?) promise of tossing nodes in and having it figure things out.
  4. Based on the maturity of projects out there, you could write your own in less than a day. It’ll perform as well and at least when it breaks, you’ll be more fond of it. Alternatively, you could go on the conference circuit and talk about how awesome your half-baked distributed keystore is.

UPDATE: I’d be remiss if I didn’t stress that you should know your requirements and do your own testing. Any numbers I toss around are very specific to the hardware and (more importantly) the data set. Furthermore, most of these projects are moving at a fast clip so this may be out of date soon.

And, when you do your testing, publish the results – there’s almost nothing out there currently so additional data points would be a big help for everyone.

Tags: , , ,

  • anildash

    70% of this was above my head but all of it was entertaining to read. :)

  • http://randomfoo.net/ lhl

    Well, at least some good has come out of this then. ;)

  • http://parand.com/say/ Parand

    Any thoughts of open sourcing the thin layer of consistent hashing business? Will save me writing my own.

  • http://randomfoo.net/ lhl

    Sorry Parand, as this was work for a client (so not actually mine to open source). Like I mention though, the actual coding for the thin layer was literally hours. In this case, I used the a custom hash implementation, but you can check out Amir S's hash_ring implementation (although I was pretty meh about LightCloud, I'm actually a big fan of Amir's blog and what he's been up to).

    I may do a clean-room rewrite and add in the dynamic expansion features and all that, but based on the ridiculousness of my near-term schedule… that probably wouldn't be anytime soon.

  • http://taint.org/ Justin Mason

    Thanks — great writeup! I hadn't paid much attention to Tokyo Tyrant, but I'll be changing that.

    Were you looking for any backup capability? can you snapshot the state of the Tokyo Cabinet store to take a backup of that? or are you just relying on doing that via EBS?

    (also: S3 as a k-v store: slooooow)

  • http://randomfoo.net/ lhl

    Yeah, for backup the decision was that it'd be good enough to snapshot (TC atomic copy and EBS snapshot) – for the Master-Master replication, maybe someone who's done more w/ it can chime in w/ when the ulogs can be removed.

    For people that couldn't afford the data loss, I'm assuming that they can run M-M (w/ ulog'ing on another EBS volume if disk I/O is write-limited) and create a cleanup daemon that will check the rts's and delete expired ulogs? Based on my understanding, anything older than the rts timestamp on the corresponding master could be safely dispensed of? I didn't really test and I couldn't find that in the documentation so I punted. But if there are any Tokyo Cabinet experts reading this (or people that have tested) it'd be great to hear.

  • http://neo4j.org Emil Eifrem

    Great writeup. Did you consider a graph database like Neo4j? If so, why did it fall short? In Neo4j, you can set arbitrary key-value properties on both nodes and the relationships between them. Neo4j ships as an embedded db but if you add a thin REST layer, you basically get a key-value store but with full-blown support for relationships. In my experience, how entities are related is often a very important part of a domain.

    http://neo4j.org
    http://github.com/andreasronge/neo4j/tree/master (Ruby bindings with REST support)

    (Disclaimer: I'm involved.)

    -EE

  • http://amix.dk/ amix

    Using consistent hashing and using Tokyo Tyrant as the backend _is_ basically what LightCloud does…! The only reason why you need two hash rings is if you want to dynamically add and remove nodes to the system.

    Some issues with your system:
    - what happens if a node fails? Without using Tokyo Tyrant's master-master replication you are pretty doomed
    - what happens if you need to scale beyond 150M keys?

  • http://gavinbell.com/ zzgavin

    Like Anil, much of this was uncharted territory to me, but thanks for writing such a helpful overview.
    I know a lot more about non-rdbms than I did 20 minutes ago.
    cheers Gavin

  • http://project-voldemort.com jay

    Hi Leonard,

    WRT voldemort. You are correct that all of these systems are very alpha. However, you should not see an decrease in performance based on the number of nodes. We are using a number of voldemort clusters at LinkedIn and we have not seen this problem, it sounds like a bug. Could you send me a little more information about your test setup so we can try to reproduce?

    Thanks!

  • Abraham

    Its funny you say redis is way to alpha so you reinvented the wheel instead, like that you built is not alpha :)

  • Bojan

    Thank you for this post, I'm dealing with the exact same problem now, it seems. Although the amount of data (and the bandwidth I expect on write) is probably one step above compared to yours.
    interestingly enough, i came to the exact same conclusion – Tokyo Cabinet/Tyrant with custom routing to multiple nodes is the only available solution (that doesn't cost an arm and a leg) to be fast enough and rock-solid.

  • csb

    Which persistence backend did you use for your Project Voldemort testing?

  • http://invece.org antirez

    Hello! (disclaimer I'm the author of Redis).

    Redis is surely a beta product, there are people using it in production but still we are entering in few days the feature-freeze stage now that Redis-git includes non-blocking replication. After we enter the feature freeze stage Redis will be stress-tested for weeks, then 1.0.0-rc1 will be released. My goal is to provide a rock solid product to the market.

    So my hint is: handle with care since it's young code, but we are moving very fast, and feature-freeze stage is near. Also to make people safer Redis 1.0 will include a tool to dump a Redis DB into SQL format.

    Redis apart, 1600 inserts / second are very poor performances. I think Tokyo cabinet is ways faster, probably it's the networking layer that is slow? Even MySQL is capable of 1600 inserts/second so if you really care about stability, replication, and things like this and you can live without very fast performances a table with an unique key ID and a blob value can really be a good alternative, especially in contexts where all you need is a plain key-value DB like Tokyo.

    This is why Redis is stressing a lot on the data structures bit, that are things that are hard to model otherwise.

  • Zak

    What an awesome post! Thanks very much for writing it.

  • klaus

    Is there a stable Python client for TokyoCabinet?

  • Jan Lehnardt

    Hi,

    thanks for writing this up. Can you elaborate on the CouchDB vaporware bit, though?

    Cheers
    Jan

  • http://spyced.blogspot.com Jonathan Ellis

    Yes, Cassandra is starting to get its act together under the Apache umbrella now.

    What problems did you run into? We recently fleshed out the docs at http://incubator.apache.org/cassandra/; we'd appreciate feedback as to what needs to be added.

  • http://spyced.blogspot.com Jonathan Ellis

    oops, the semicolon is being included in the url — that's http://incubator.apache.org/cassandra/

  • http://randomfoo.net/ lhl

    Hey Amir, I'm aware that what I built is similar (but simpler) to LightCloud – by the time that I went through testing multinode TT performance and figured out the setup I had built the thin layer that we needed. My understanding is that the addition of the lookup ring means that you end up with a lookup record for every single key?

    As I mentioned given the tradeoff in write i/o, snapshotting was acceptable for the client, so not doom, but more like minor inconvenience w/ acceptable data loss.

    For scaling, O(2) or O(3) lookups will probably be OK as the nodes until additional nodes are added and redistributed.

  • http://randomfoo.net/ lhl

    Hey antirez,

    Redis looks cool so it'll definitely be something I'll keep an eye out in the coming months. The data structure approach is interesting…

    You're correct that the insert/s numbers are much lower than the typically published numbers. Part of it is that it is going over the network, another part is that the items sizes are much bigger than those typically used in the benchmarks published. And it's EC2, so the I/O is crap. You're right that MySQL is the baseline there – I think lots of people don't know how fast it can be w/ simple queries — although it tends to like lots of spindles. Lots and lots of them.

  • http://randomfoo.net/ lhl

    Jan, if I recall correctly, when I tried out CouchDB last year, Lucene wasn't in releases or trunk (and the branch didn't build) and the replication was a joke. While there has been consistently big talk about CouchDB scaling, I could not find any actual distributed features for dealing with large datasets. CouchDB choked on a relatively modest data set when generating views – many minutes to generate one on a small 100K item/2GB data set. It also took 5+GB of storage for that.

    I also couldn't really wrap my head about the benefit of not having indexes but having to recalculate a view anytime the data changed, but I'd say mostly that at the time (and based on the Q&A at Bob Ippolito's talk maybe still) that CouchDB fanboys and developers were all over the Internet taking up oxygen about Couch while like I mentioned, what I assumed were core components didn't exist workably, much less being suitable for anything but the most toy test projects.

  • http://randomfoo.net/ lhl

    Jonathan, the Getting Started Guide was useful, but I'd recommend a more comprehensive step-by-step? I know that's a PITA to document, but I suspect would really help in both getting people up and running and getting specific feedback on problems people are encountering (eg, in Step 12 on RHEL5 I had to do x).

    If you'd like to see what I ran into, you could spin up an EC2 instance running a Rightscale Debian or Alestic Ubuntu instance and make sure that a user is able to get a blank system up and running. I was able to compile Cassandra, but the Thrift bits gave me some trouble. Once that was supposedly all up and running, I couldn't actually talk to the Cassandra server to test, so my assumption was that I'd missed something in setup.

    I got distracted by some other tests at that point and ended up never pushing further having found a better solution. Also, I'm not sure if your intentions are for widespread production use at this point, but if so, I think that I'm like most sane people in getting totally skeeved out by running trunk checkouts in prod. Some packages/releases would probably also be really helpful in that regard.

  • CLRS

    O(2) = O(3) = O(1) = O(1000000000000000000)

  • http://randomfoo.net/ lhl

    Actually, you're right but not for that reason – in the real world there is a big difference between O(1) and O(bazillion) – however, my assumption was that the chaining for collisions was a fixed value but it's chained w/ a binary search tree, so O(log n).

  • http://randomfoo.net/ lhl

    Yeah, I'm using Bob Ippolito's (he's everywhere :) pytyrant, a pure python implementation that's *very* active. There is also a wrapper for the C API: python-tokyotyrant.

  • http://randomfoo.net/ lhl

    I know that's slightly tongue and cheek, and there's some truth to that, but I think it's worth pointing out the difference between the solid data storage and retrieval part and the distributed part. In a comparison between redis and tc/tt for the former, I don't think there's any question (certainly not in my mind) which one is more battle-tested. So it's not like I went out and built my own keystore. For the distributed part, it was a matter of putting together the simplest thing that could work after it turned out that I there wasn't a black box solution to be had.

  • Jason Dusek

    I would really like to know how this discussion turns out.

  • http://spyced.blogspot.com Jonathan Ellis

    Good feedback, thanks.

    We're about to turn the corner from mostly-developer-focused to trying to get something that works out of the box for most people. Getting a release out is part of that. Thrift is a bitch and there's unfortunately not much we can do about that, but maybe providing a vmware image with sample single-node and 3-node configurations (for instance) would help there.

  • http://anyall.org brendano

    Why aren't there Debian packages for Thrift? Yes, packaging is a pain, but Thrift is intended to be a lowish-level service that many different apps use; therefore having .rpm's and .deb's makes a lot of sense. That would certainly make Cassandra installation easier. (I had similar issues.)

  • http://anyall.org brendano

    Hey, I just wanted to say, thanks for doing all this legwork and posting the results. It's very hard to find evaluations of these systems that weren't done by their authors.

  • evgen

    CouchDB does take a long time to generate the view the first time you access a view, but each subsequent access uses the pre-generated version and returns quite quickly. When the dataset from which a view is created gets updated the view is also updated with the new data.

    Next time, when you have no clue how a system works it would be best to refrain from talking trash about it and revealing the depth of your ignorance. CouchDB is not the greatest thing since sliced bread, is not a key-value datastore (which makes me wonder why it was even in your list other than to justify your petty little rant), and has a ways to go to meet some of its design goals, but within its niche it is a rather interesting tool that people should pay attention to.

  • http://spyced.blogspot.com Jonathan Ellis

    My understanding is that debian prefers to wait until a project has an official release and then package that rather than a svn snapshot. I work with a debian developer and he has a deb ready to roll as soon as Thrift gets out their 0.1 or whatever they are going to tag it. (They are actually making an effort towards that now, so hopefully soon.)

    Of course then there's the whole RPM side of things, not to mention things like Gentoo or even (shudder) Windows. :)

  • http://randomfoo.net/ lhl

    Oh BURN! (not) I'm perfectly happy to reveal the depth of my ignorance if the ensuing discussion can help shed light on it (although my patience for other people's asshattery on *my blog* is finite). You're right that CouchDB isn't a kv store, but since every conversation about any of these subjects ineveitably brings up the “What about x?” where x invariably includes Couch, it'd be worth pre-empting. Personally, I think the critique I give in the posting (one line != rant) is pretty valid. The detailed response to Jan was because he asked. I'm sure he and the rest of the Couch team are good peoples, but my sentiments aren't unique – some have suggested that I should have made the CouchDB line it's own bullet-point.

    As for trash talk, I have to say that you've been engaging in a fair amount of it. I'm posting my experiences (and I don't claim it to be anything more than that). It's not rocket science, but it's real data w/ real world usage in an area where there's significantly more smoke than fire (or published results). So, what's your skin in the game, and what's your contribution?

  • http://randomfoo.net/ lhl

    Sample single-node and multi-node AMIs would be *huge*. I think that that, and some sample schemas would be great. If you created an empty table on the wiki for people to post up their testing results, I'd have to believe that it'd also fill up pretty quickly. I think there are a lot of people that are reviewing these things, but probably getting hung up getting started/wrapping their heads around deploying.

  • http://randomfoo.net/ lhl

    Nope, I'd actually looked at Neo4j in the past but didn't actually even think about it for this. Hopefully someone takes a look at it and posts some results.

  • http://randomfoo.net/ lhl

    FYI, I sent Jay some details on my setup and dataset. Hopefully that's enough to help replicate, otherwise may be a bit slow going since I'm juggling a few other balls atm.

  • http://randomfoo.net/ lhl

    Yeah, I'll be happy if this helps some people getting started, but even happier if this encourages more people out there to publish their findings/results, even if it's like mine where I could only get a few of them running (that in itself maybe a useful datapoint).

    It'd be nice to get the s/n ratio up a bit for people that actually need to run something into production (I mean sure it's the Internet, but the amount of fandom/hater hot air has been pretty insufferable in the area).

  • http://www.toluju.com Toby Jungen

    Good roundup, thanks. Was linked her from Alex Miller (https://twitter.com/puredanger)

  • http://jimpick.com/ Jim Pick

    I'd be interested in seeing how Bycast's StorageGRID product stacks up. It's a proprietary solution sold to hospitals through HP and IBM, as well as through direct sales. It isn't being marketed as a keystore, but that's what it is.

    http://www.bycast.com/

    I used to work there, so I know the product intimately. But I haven't compared it to the open source stuff out there. I doubt any of the free solutions are the type of thing you'd run a hospital on (due to support, documentation, etc.)

  • Jan Lehnardt

    Hey, thanks for the feedback. CouchDB documentation is still coming along and there's a lot of things you can do wrong if you don't understand it as it is different in a lot of ways. So lucene is the only “vapor that's on the frontpage”, by now there's a decent GitHub branch that we're looking to integrate. Replication is solid and has been for quite some time, I wonder what didn't work for you. View Indexes only recalculate what changed in a DB, it's incremental, they do not reindex all data when things change. Also, the view behaviour you saw is most likely wrong usage or outdate dependencies. We're happy to address any issues on the user@ mailing list, but I understand that getting under the skin of every project out there is not your priority.

  • Jan Lehnardt

    In addition, the CouchDB devs never did any “big talk about scaling” auto-sharding et.al is a future feature. Just by the fact that it comes with a HTTP interface makes key-based partitioning a snap (see couchdb-lounge on Google Code for a 3rd party project that does just that). CouchDB has “alpha software” written all over the place :)

  • http://invece.org antirez

    I can confirm that EC2 performed poorly compared even to low-end linux boxes when we run tests against EC2. In the smaller instances the redis benchmark returned 15000 queries/second, with the largest instance they provide it was 50000 q/s that is a number that's trivial to get with any kind of old Linux box.

    Btw thank you for this article, I understand your findings can be complete or always accurate, but to find non biased data on this stuff is really hard. I hope many other guys in the field will try different key-value stores under real world load and publish their findings. This is the only way all this projects can mature faster, start to be more reliable, and understand what the real user feeling is.

  • http://randomfoo.net/ lhl

    No worries, will continue to keep an eye out. I think part of the problem is the cool stuff CouchDB is tackling (non-relational, document-based, built w/ erlang, map-reduce processing) is catnip for devs and tends to make them forget about the “alpha software” bit unless it's big and blinking. The inevitable backlash/eye-rolling when it gets brought up everywhere isn't necessarily your fault, but something to be aware of.

    Cheers!

  • http://ekschi.com/technology/2009/04/22/distributed-key-stores-roll-your-own/ Distributed Key Stores: Roll your own! – Technology

    [...] Lin has some interesting notes on distributed key stores. He implemented a distributed key store for a client based on Tokyo Cabinet and a consistent [...]

  • http://ilya.us ilya haykinson

    I'm surprised that you didn't try HBase. It uses Hadoop as the backing store, and has both serious production use as well as high performance going for it. While its native API is Java, the thing does come with a Thrift-based interface that I found to be just as fast.

  • Fred

    The point is that O(bazillion) is TOTALLY WITHOUT MEANING. A bazillion what? There's no units, no variables, and no point.

    If you meant to say that operation *foo* causes 3 lookups to be made, then say it — don't abuse notation.

  • http://randomfoo.net/ lhl

    On the one hand, sure “technically” you're right. On the other hand, do I really need the Big-O Notation police commenting here? Is this really the kind of conversation I want in this thread?

  • http://randomfoo.net/ lhl

    Ilya, do you have numbers on your HBase setup? How's the latency for queries? My understanding (and this applied to Hypertable too if I would've been able to get it up and running) that as BigTable clones, they're oriented about fast sequential requests, but not as good on the random. Would be interesting to get actual #s from your testing (ms latency, qps, on #/kind of nodes, w what kind of data set).

  • http://twitter.com/andriijas Andreas
  • http://www.tardis.ed.ac.uk/~fmc Frazer

    Hi,
    Good writeup.
    Just to add to the list of 'what about X' posts : You mention MySQL, but not MySQL Cluster.
    (I am a MySQL Cluster developer btw.)
    MySQL Cluster is at heart a distributed key-value store using hash based partitioning.
    It supports in-memory or disk storage of key-value pairs as primary key and attributes.
    Additionally it supports :
    Multi kv pair transactional reads/updates
    Synchronous replication of updates within a cluster
    Disk-persistence of in-memory data via Redo and checkpointing.
    Automatic node failure and recovery handling
    Asynchronous replication of updates to other clusters/MySQL databases, including Master-Master with conflict detection/resolution
    Online addition of storage nodes and data repartitioning (from version 7.0)
    Secondary indices on data (unique, ordered).
    SQL access supporting MySQL SQL syntax
    Access from all MySQL supported connectors (JDBC, PHP, Perl, etc..)
    Latency+throughput optimised API for remote clients
    Online snapshot backup, optionally compressed
    e.t.c…

    It is open source, licensed through GPL, with support available if required.
    I suspect that MySQL Cluster could meet or beat the latencies and throughputs of the other systems discussed here, especially when accessed via a native API rather than through MySQLD. Internally it uses a message-passing state machine architecture (similar to the CSP style of Erlang) which gives really nice properties w.r.t. latency, throughput and system efficiency.

    Perhaps because MySQL Cluster is associated with MySQL it appears to be 'relational' and therefore does not get included in open-source kv store comparisons?
    Hope this doesn't sound too much like an advert :) ,
    Frazer

  • http://anyall.org/blog/2009/04/language-model-storage-timings/ Language model storage timings – Brendan O’Connor’s Blog

    [...] it would be nice to have a distributed key-value store with a heavy caching layer. Check out Leonard Lin’s post on the subject. Unfortunately, these experiments were limited to single-node with a small [...]

  • http://anyall.org/blog/2009/04/performance-comparison-keyvalue-stores-for-language-model-counts/ Performance comparison: key/value stores for language model counts – Brendan O’Connor’s Blog

    [...] it would be nice to have a distributed key-value store with a heavy caching layer. Check out Leonard Lin’s post on the subject. Unfortunately, these experiments were limited to single node with a small [...]

  • EllisGL

    How about JavascriptDB? (he he)
    http://www.persvr.org/

  • http://dancres.wordpress.com/2009/04/22/links-for-2009-04-22/ links for 2009-04-22 « Dan Creswell’s Linkblog

    [...] Some Notes on Distributed Key Stores (tags: distributed storage) [...]

  • http://anyall.org brendano

    I did a small-scale performance comparison related to the task I was worrying about (term counts). http://anyall.org/blog/2009/04/performance-comp…

  • scoop

    Hey Leo, thanks for the great write up. There is a lot of hype around these k-v dbs. By the time you write a serious domain application around most of them, you begin to understand why “traditional” persistent stores are not as fast. If you want to use these as the primary persistent back end for a domain app, you'll soon realize that most of these “databases” push the messy details to the programmer.

    “Partionable”, and “distributed” are also tall claims for most of them. I looked at redis too and can't understand where the distributed part comes in.

    “Based on the maturity of projects out there, you could write your own in less than a day. It’ll perform as well and at least when it breaks, you’ll be more fond of it. Alternatively, you could go on the conference circuit and talk about how awesome your half-baked distributed keystore is”

    Completely agree. At the end of the day, its not rocket science to write your own memory hash-map and have a thread write backups to a disk file or just embed BDB and be done with it. And you can tune it to do exactly what you need for your own domain, including managing relationships if necessary.

  • http://parand.com/say/ Parand

    I haven't personally used HBase, but a friend is using it in production in a fairly large site and tells me query speed is definitely not fast enough to be user facing (they have a huge memcached farm in front of it).

    Also, my understanding is that HBase mostly uses HDFS (the distributed file system) as opposed to Hadoop.

  • http://randomfoo.net/ lhl

    Here's your chance EllisGL, do some tests and post some numbers.

  • coder

    On Voldemort store; Bob Ippilito has no idea what he is talking about. I asked VM folks about his claims on how VM leaves deleted objects around, and they flatly denied his claims. See

    http://groups.google.com/group/project-voldemort/browse_thread/thread/2746cd8bd3700162

    Rebalancing, lI believe, is the next feature to be released.

  • http://randomfoo.net/ lhl

    Frazer, I've played around w/ NDBCLUSTER a bit, which is what MySQL Cluster is running on, right? Does it have durability now? My understanding at the time I played w/ it was that it was neat but didn't have storage – for the disk-persistence you mention, how does check-pointing affect performance? It sounds interesting, although one of the appelas of running a “simple” system is not needing a dedicated DBA or data-wrangler…

    Hopefully someone gives MySQL Cluster a spin, would love to see how it compares.

  • http://twitter.com/mediajunkie xian

    99% of it was over my head but reading you Leonard always makes me *feel* smarter.

  • http://www.tardis.ed.ac.uk/~fmc Frazer

    Hi,
    Yes, MySQL Cluster is the name we give a system of MySQL servers connected to an Ndb Cluster.
    I think there's some confusion with the definition of disk-persistence and durability.
    MySQL Cluster has always had disk-persistence. All changes are redo-logged to disk and checkpoints to disk are used to allow the Redo log to be trimmed. Checkpointing has a few percent impact on achievable throughput – the disk write bandwidth used can be traded off against checkpoint duration and hence redo log size. The redo log is not fsynced at every transaction commit, but periodically – usually every 2s, and down to every 100millis. This tradeoff allows high throughput on Cluster's internal 2PC.
    This window means that committed transactions are not immediately disk-durable, but when running with 2 or more replicas, all data is synchronously replicated at commit time, so committed transactions are machine-failure durable, and become disk durable (on all replicas) within ~2s. This is a three-way trade off between tolerance to total cluster failure (requiring disk durability), tolerance to machine failure (requiring machine-failure durability) and throughput (requiring control of fsyncs/s).

    Prior to MySQL 5.1, all data was held (and had to fit) in memory.
    From MySQL 5.1, non indexed data (i.e. the values in a kvp) can be stored on disk. This means that when they are read/written they are fetched from disk into an in-memory LRU cache in the same way as most databases. This allows data sets larger than the memory size to be handled by a single cluster node, at the cost of some performance. Persistence/Durability is the same, with Redo log flushed periodically etc.

    Over time we will add support for disk-storage of indexed data (keys in a kvp), disk-durable transactions etc.

    I take your point about complexity. Getting a system that has 'just enough' complexity to meet your needs is always hard. I think MySQL Cluster could suit some folks but it's not the simplest system out there.
    Frazer

  • http://feeds.fluidinfo.com/fluidinfo terrycojones

    Thanks for the fast-paced, interesting, and amusing write-up. Sounds like a fairly intense weekend :-)

    Terry

  • http://randomfoo.net/ lhl

    coder, that's a useful link, however saying that Bob has no idea what he's talking about is going a bit too far I think, seeing as he did explore Voldemort enough to write (the only) python binding for it… (your post btw also nears the line where I start with comment smackdowns – if you're gonna blast people, you need to man up and put your Real Name and Reputation on it; I find that helps to keep conversation constructive and civilized),

    Both the partition-rebalance2 and protobuf branches look promising, so we'll just have to see.

  • dready

    Thanks for the nice practical roundup.

    For the idle socket hanging issue, do you think it's an issue with pytyrant or tyrant itself?

    I submitted a patch to pytyrant that could potentially be related to it, basically the client hangs when the socket is closed (which could happen on idle connections.)

    http://code.google.com/p/pytyrant/issues/detail…

    See if it fixes your issue?

    =wil

  • Colin Howe

    Great post!

    Rough comparison of MySQL performance against Redis performance
    http://colinhowe.wordpress.com/2009/04/27/redis…
    Probably similar numbers for other KVS… but yet to find out.

    Will be looking at Tokyo Cabinet later and adding in a similar test :)

  • http://www.marko.anastasov.name/blog/2009/05/02/know-your-storage-options-benchmarking-tokyo-cabinet/ Know Your Storage Options – Benchmarking Tokyo Cabinet « Markov Blog

    [...] Some Notes on Distributed Key Stores [...]

  • http://taint.org/2009/04/21/220502a.html Links for 2009-04-21 / taint.org: Justin Mason’s Weblog

    [...] Some Notes on Distributed Key Stores : great investigation from Leonard Lin; Tokyo Tyrant gets a strong thumbs-up. also: ‘based on the maturity of projects out there, you could write your own in less than a day. It’ll perform as well and at least when it breaks, you’ll be more fond of it. Alternatively, you could go on the conference circuit and talk about how awesome your half-baked distributed keystore is.’ ha! (tags: scaling storage distcomp k-v-stores tokyocabinet tokyotyrant voldemort mysql databases cassandra) [...]

  • coder

    Hi; I am a firm believer in “knowing what you don't know”. Getting up there and talking in front of people and spewing bunch of misinformation is not cool. I don't think Bob did the research, to his credit, I don't think any one man could do it given the wide range of products he was looking at; he was looking at bunch of stuff, and he confused things, and wrote some wrong stuff. I understand this. But I also want to set the record straight. Writing some JSON library does not make you an automatic authority on all things IT.

  • http://randomfoo.net/ lhl

    Fair enough on making sure the data is accurate. Still, the hyperbole (“has no idea what he's talking about” or “spewing misinformation”) does a disservice if free information flow is your goal. And from your tone and borderline ad hominem attacks, it sounds you have an axe to grind (unless your last comment is simply confusion- ie, I wasn't referring to his writing simplejson, but rather writing voldemort_client).

    Personally, not knowing Bob at all, I found his presentation to be useful as a good overview for people that haven't been playing around with the various packages (and it jibed well enough with my own experiences) – I think he was pretty up front about where he was approaching it from (as someone who needed a solution that worked and his experiences – not as any domain expert, whatever that means). Most of the data is going to be out of date anyway since projects have been moving pretty quickly. And the plain fact of the matter is that his presentation has gotten attention precisely because there's so little published out there. In that respect, I think that it's a pretty big contribution to the community and I wish there was *more* of that out there, not less.

    Anyway, if want to correct the errata in the presentation, why not just drop a line and it'd get fixed? If it just offends your sensibilities that “anyone” can go around, test some stuff and talk about it… well, that's err, usually how that works. At least he's put his real name to it (that's what I'm a firm believer in).

    Anyway, since we've all said our pieces, I'd like to consider this conversation closed unless Bob jumps in. Life's too short.

    cheers,

  • schemafree

    Hi everyone, here is yet another key/value store: Schemafree.

    http://code.google.com/p/schemafree/

    It uses Mysql as storage, has key based distribution, versioning, being able to make incremental changes to lists and other features. It has built-in integration with Memcached.

  • przemyslaw

    Let me be the next to announce another K/V store ;) http://www.subrecord.org
    From what I have just learned seems I have been working on kind of Voldemort clone. I mean, working in isolation I've come to very similiar concept/architecture however still different. Nice :)

  • http://twitter.com/communicating Chris

    Great write ip, chocked full of good insights.

    I know the pain of having to build my own fundamental library as what existed is just not quite what I needed.

    On the topic of key/value stores you missed one which is a little more obscure but I like it a lot. SkipDB from the author of IO.

  • http://patg.net/ CaptTofu

    Wow! Thank you for this write-up and all the various comments. I'm curious – what kind of application is this that the kv store is required for? What do you use for your various tests?

  • http://randomfoo.net/ lhl

    kvstores are particularly good for anything where you want to pull stuff quickly and randomly by id – canonical storage for documents perhaps, or pointers to media files. also, just about anything you would use something like memcache for, but that requires persistence.

  • http://twitter.com/belapatkai Bela Patkai

    Thanks for the great write-up, we need more of these. I started using #cloud_nohype on twitter to tag posts that are realistic and have hard work behind them. In a new project I need a key-value store – or a two column table – with very high and variable demand both in terms of size and queries. Your post made me reconsider starting S3 testing, but also made me reconsider an earlier idea of a HSQLDB cluster on EC2. MySQL Cluster sounds good too, but it is not marketed well so I didn't know about it, even if I go to the mysql site sometimes. The sad fact is that sql dbs are not very exciting anymore – we were using them too long :)

  • http://twitter.com/belapatkai Bela Patkai

    Thanks for the great write-up, we need more of these. I started using #cloud_nohype on twitter to tag posts that are realistic and have hard work behind them. In a new project I need a key-value store – or a two column table – with very high and variable demand both in terms of size and queries. Your post made me reconsider starting S3 testing, but also made me reconsider an earlier idea of a HSQLDB cluster on EC2. MySQL Cluster sounds good too, but it is not marketed well so I didn't know about it, even if I go to the mysql site sometimes. The sad fact is that sql dbs are not very exciting anymore – we were using them too long :)

  • findchris

    Are those redis numbers local or over a network?

  • findchris

    Any luck getting a master master config running?

  • Gandalf62

    Was your client only considering open source databases, or was that more your choice? There are other very mature products out there that might be a better option… consider the Caché database (by InterSystems) for example.

  • http://randomfoo.net/ lhl

    The client wasn't considering commercial systems – but that's probably a shared bias of most people building things in the web startup space – which I also happen to share.

    But not entirely w/o good reason. For example, I've never heard of Caché (nor has it come up in any of the NoSQL conversations in the community I track until now) – the chances are that any startup will be able to hire someone w/ expertise in that system is probably pretty low. I ended up putting together a system in less time that it would have taken to schedule a sales call, much less go through the evaluation process (built and launched for client over the weekend). Certainly for less cost than the evaluation time as well. Not to mention the inevitable incremental costs for upgrades, additional licensing, and support that comes w/ enterprise software.

    In fact, I can't think of any large internet companies that isn't based primarily on an open source infrastructure supplemented by their own code (MS is an exception, but not in terms of depending on 3rd party technology, just avoiding most of the open source stuff).

    I think for most startups the risk/expense/agility curve is just isn't a good fit for most niche 3rd party software infrastructure.

    (Oh, also, peeking at the docs Caché doesn't support language bindings for the languages that the startup used, so that's another strike.)

  • GPN

    Thanks for the useful write up. Certainly helped me go through a lot of options fast.

    I am facing kind of an different angle on the distribution problem, was wondering what you may recommend for it. Description:

    A SMALL set of data that is both read & write. Each server holds a FULL copy of the data (since its small). However must be distributed between MANY MANY servers. When one server updates its local copy, the update must be propagated to the others. “Eventual consistency” is good enough but should be seconds not hours. Nice to have: bindings to PHP, C/C++.

    The catch – do not want to add a data layer (no NFS, no DB) but rather keep it all between the symmetric identical “many many servers”.

    Any suggestions other than “roll your own” ?

  • http://randomfoo.net/ lhl

    Well, you could try something like repcached I suppose, which as long as you can repopulate from a canonical store if everything goes down, should work.

  • http://randomfoo.net/ lhl

    Well, you could try something like repcached I suppose, which as long as you can repopulate from a canonical store if everything goes down, should work.