<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>random($foo) &#187; distributed</title>
	<atom:link href="http://randomfoo.net/tag/distributed/feed" rel="self" type="application/rss+xml" />
	<link>http://randomfoo.net</link>
	<description>blog blog blog</description>
	<lastBuildDate>Sun, 15 Jan 2012 20:26:30 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Some Notes on Distributed Key Stores</title>
		<link>http://randomfoo.net/2009/04/20/some-notes-on-distributed-key-stores</link>
		<comments>http://randomfoo.net/2009/04/20/some-notes-on-distributed-key-stores#comments</comments>
		<pubDate>Tue, 21 Apr 2009 05:39:51 +0000</pubDate>
		<dc:creator>lhl</dc:creator>
				<category><![CDATA[Tech]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[distributed]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[keystore]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://randomfoo.net/?p=5364</guid>
		<description><![CDATA[Last week I ended up building a distributed keystore for a client. That wasn&#8217;t my original intention, but after doing testing on just about every project out there, it turned out to be the best (only?) solution for our needs. Specifically, a production environment handling at least 100M items with an accelerating growth curve, very [...]]]></description>
			<content:encoded><![CDATA[<p>Last week I ended up building a distributed keystore for a client.  That wasn&#8217;t my original intention, but after doing testing on just about every project out there, it turned out to be the best (only?) solution for our needs.</p>
<p>Specifically, a production environment handling at least 100M items with an accelerating growth curve, very low latency retrievals, and the ability to handle 100s of inserts/s w/ variable-sized data (avg 1K, but in many cases well beyond) &#8230; on EC2 hardware.  The previous system had been using S3 (since SDB is limited to 1K values) &#8211; err, the lesson there, BTW, is don&#8217;t do that.</p>
<p>So, these requirements are decent &#8211; something that actually requires a distributed system, but something that shouldn&#8217;t be beyond what can be handled by a few nodes. My assumption was that I&#8217;d actually just be doing some load testing and documenting installation on the keystore the client picked out, and that would be that.  This <b>was not the case</b>.</p>
<p>I&#8217;m still catching up on a number of other projects, so I don&#8217;t have a great deal of time to do a formal writeup. However, the work I&#8217;ve done may be useful for those who might actually need to <b>implement</b> a production keystore.</p>
<p>Some other recent useful starting points may be <a href="http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/">Richard Jones&#8217; Anti-RDBMS roundup</a> and <a href="http://blip.tv/file/1949416/">Bob Ippolito&#8217;s Drop ACID and think about data Pycon talk</a>.</p>
<ul>
<li>MySQL &#8211; while the BDB backend is being phased out, MySQL is a good baseline.  With my testing, on a single m1.large, I was able to store 20M items within one table at 400 inserts/s (with key indexes).  Key retrievals were decently fast but sometimes variable. There are very large production keystores being run on MySQL setups. Friendfeed has <a href="http://bret.appspot.com/entry/how-friendfeed-uses-mysql">an interesting writeup</a> of something they&#8217;re doing, and I have it on good authority that there are others running very big key stores w/ very simple distribution schemes (simple hashing into smaller table buckets).  If you can&#8217;t beat this, you should probably take your ball and go home.</li>
<li><a href="http://project-voldemort.com/">Project Voldemort</a> &#8211; Voldemort has a lot of velocity, and seems to be the de facto recommendation for distributed keystores.  A friend had used this recently on a similar-scale (read-only) project, and this was what I spent the majority of my time initially working with.  However, some issues&#8230;
<ul>
<li>Single node local testing was quite fast &#8211; 1000+ inserts/s, however, once run in a distributed setup, it was much slower.  After about 50M insertions, a multinode cluster was running at &lt;150 inserts/s.  This&#8230; was bad and led me to ultimately abandon Voldemort, although there were other issues&#8230;</li>
<li>There is currently only a <a href="http://groups.google.com/group/project-voldemort/browse_thread/thread/e2bdca1f924493cf">partially complete</a> Python client.  I added persistent connections in as well as client-side routing w/ the RouteToAll strategy, but well, see above.</li>
<li>Embedded in the previous statement is something worth mentioning &#8211; <a href="http://groups.google.com/group/project-voldemort/browse_thread/thread/cb7252ed7a2f9fca">server-side routing currently doesn&#8217;t exist</a>.</li>
<li>While I&#8217;m mentioning important things that don&#8217;t exist, there is <a href="http://groups.google.com/group/project-voldemort/browse_thread/thread/685cc2623025c557">currently no way to rebalance or migrate partitions</a>, either online, or, as far as I could tell, even offline.  This puts a damper on things, no?</li>
<li>As a Dynamo implementation, a VectorClock (automatic versioning) is used &#8211; this is potentially a good thing for a large distributed infrastructure, but without the ability to add nodes or rebalance, it means that for a write-heavy load, it would lead to huge growth with no way for cleanup of old/unused items (this of course, also is not implemented)</li>
</ul>
</li>
<li><a href="http://opensource.plurk.com/LightCloud/">LightCloud</a> &#8211; this is a simple layer on top of <a href="http://tokyocabinet.sourceforge.net/tyrantdoc/">Tokyo Tyrant</a> but the use of two hash rings was a bit confusing and the lack of production usage beyond the author&#8217;s own (on <a href="http://news.ycombinator.com/item?id=498914">a whopping 2 machines</a> containing &#8220;millions&#8221; of items) didn&#8217;t exactly inspire confidence.  Another problem was that its setup was predicated on using master-master replication, which requires update-logs to be turned on (again, storing all updates == bad for my use case).  This was, of course, discovered by rooting through the source code, as the documentation (including basic setup or recommendations for # of lookup &amp; storage nodes, etc.) is nonexistent.  The actual manager itself was pretty weak, requiring setup and management on a per-machine basis.  I just couldn&#8217;t really figure out how it was useful.</li>
<li>There were a number of projects, including <a href="http://incubator.apache.org/cassandra">Cassandra</a> (actually has some life to it now, lots of checkins recently), <a href="http://github.com/cliffmoon/dynomite">Dynomite</a> and <a href="http://hypertable.org/">Hypertable</a>, that I tried and could not get compiled and/or set up &#8211; my rule of thumb is that if I&#8217;m not smart enough to get it up and running without a problem, the chances that I&#8217;ll be able to keep it running w/o problems are pretty much nil.</li>
<li>There were a number of other projects that were unsuitable due to non-distributed nature or other issues like lack of durable storage or general skeeviness and so were dismissed out of hand, like <a href="http://code.google.com/p/scalaris/">Scalaris</a> (no storage), <a href="http://memcachedb.org/">memcachedb</a> (not distributed, weird issues/skeeviness, issues compiling) and <a href="http://code.google.com/p/redis/">redis</a> (quite interesting but way too alpha).  Oh, although not in consideration at all because of previous testing with a much smaller data set, on the skeeviness factor, I&#8217;ll give <a href="http://couchdb.apache.org/">CouchDB</a> a special shout out for having a completely aspirational (read: vaporware) architectural post-it note on its homepage. Not cool, guys.</li>
<li>Also, there were one or two projects I didn&#8217;t touch because I had settled on a working approach (despite the sound of it, the timeline was super compressed &#8211; most of my testing was done in parallel with lots of EC2 test instances spun up (loading millions of nodes and watching for performance degradation just takes a long time no matter how you slice it)).  One was <a href="http://www.mongodb.org/">MongoDB</a>, a promising document-based store, although I&#8217;d wait until the auto-sharding bits get released to see how it really works.  The other was <a href="http://labs.gree.jp/Top/OpenSource/Flare-en.html">Flare</a>, another Japanese project that sort of scares me.  My eyes sort of glazed over while looking at the <a href="http://labs.gree.jp/Top/OpenSource/Flare/Document/Tutorial-en.html">setup tutorial</a> (although having a detailed doc was definitely a pleasant step up).  Again, I&#8217;d finished working on my solution by then, but the release notes also gave me a chuckle:<br />
<blockquote><p>released 1.0.8 (very stable)</p>
<ul>
<li>fixed random infinite loop and segfault under heavy load</li>
</ul>
</blockquote>
</li>
</ul>
<p>OK, so enough with all that. What did I end up with, you might ask?  Well, while going through all this half-baked crap, what I <em>did</em> find that impressed me (<b>a lot</b>) was <a href="http://tokyocabinet.sourceforge.net/index.html">Tokyo Cabinet</a> and its network server, Tokyo Tyrant.  Here was something fast, mature, and <a href="http://tokyocabinet.sourceforge.net/spex-en.html">very well documented</a> with multiple mature language bindings.  Testing performance showed that storage-size/item was 1/4 of Voldemort&#8217;s, and actually 1/2 of actual size (Tokyo Cabinet comes with built-in ZLIB deflation).</p>
<p>Additionally, Tokyo Tyrant came with built-in threading, and I was able to push 1600+ inserts/s (5 threads) over the network without breaking a sweat.  With a large enough bucket size, it promised to average O(1) lookups and the memory footprint was tiny.</p>
<p>So, it turns out the easiest thing to do was just throw up a thin layer to consistently hash the keys across a set of nodes (starting out with 8 nodes w/ a bucket-size of 40M &#8211; which means O(1) access on 80% of keys at 160M items). There&#8217;s a fair amount of headroom &#8211; I/O bottlenecks can be balanced out with more dedicated EC2 instances/EBS volumes, and the eventual need to add more nodes shouldn&#8217;t be too painful (i.e. adding nodes and either backfilling the 1/n items or adding inline moves). </p>
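<p>For illustration, here is a minimal Python sketch of that kind of thin consistent-hashing layer. It is not the code that went to the client: the <code>HashRing</code> class, the 100 virtual points per node and the <code>tt0</code>&#8230;<code>tt8</code> host names are all assumptions made up for this example, and it only picks a node for a key (talking to Tokyo Tyrant is left out).</p>

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring mapping keys to node names (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual points per node (assumed value)
        self._points = []         # sorted hash positions on the ring
        self._owner = {}          # hash position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread keys evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        # Each node owns several points so load evens out across nodes.
        for i in range(self.replicas):
            point = self._hash("%s#%d" % (node, i))
            self._owner[point] = node
            bisect.insort(self._points, point)

    def get_node(self, key):
        # Walk clockwise to the first point after the key, wrapping around.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    # Hypothetical host names; 1978 is Tokyo Tyrant's default port.
    ring = HashRing(["tt%d:1978" % n for n in range(8)])
    keys = ["item:%d" % i for i in range(10000)]
    before = dict((k, ring.get_node(k)) for k in keys)
    ring.add_node("tt8:1978")
    moved = sum(1 for k in keys if ring.get_node(k) != before[k])
    print("keys that changed node after adding a 9th: %d of %d" % (moved, len(keys)))
```

<p>The point of the ring (as opposed to a plain hash-mod-n) is that adding a node only reassigns keys to the new node, roughly 1/n of them, which is what keeps the backfill manageable.</p>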
<p>There are some issues (an issue w/ hanging on idle sockets) but current gets are at about 1.2-3ms across the network (ping is about 1ms) and it seems to otherwise be doing OK.</p>
<p>Anyway, if you made it this far, the takeaways:</p>
<ol>
<li>The distributed stores out there are pretty half-baked at best right now.  Your comfort-level running in prod may vary, but for most sane people, I doubt you&#8217;d want to.</li>
<li>If you&#8217;re dealing w/ a reasonable number of items (&lt;50M), Tokyo Tyrant is crazy fast.  If you&#8217;re looking for a known quantity, MySQL is probably an acceptable solution.</li>
<li>Don&#8217;t believe the hype.  There&#8217;s a lot of talk, but I didn&#8217;t find any public project that came close to the (implied?) promise of tossing nodes in and having it figure things out.</li>
<li>Based on the maturity of projects out there, you could write your own in less than a day.  It&#8217;ll perform as well and at least when it breaks, you&#8217;ll be more fond of it.  Alternatively, you could go on the conference circuit and talk about how awesome your half-baked distributed keystore is.</li>
</ol>
<p>UPDATE: I&#8217;d be remiss if I didn&#8217;t stress that you should know your requirements and do your own testing.  Any numbers I toss around are very specific to the hardware and (more importantly) the data set.  Furthermore, most of these projects are moving at a fast clip so this may be out of date soon.  </p>
<p>And, when you do your testing, publish the results &#8211; there&#8217;s almost nothing out there currently so additional data points would be a big help for everyone.</p>
<ul>
<li>2009-04-22: <a href="http://anyall.org/blog/2009/04/performance-comparison-keyvalue-stores-for-language-model-counts/">Performance comparison: key/value stores for language model counts</a> &#8211; BDB, TC, TT, memcache</li>
<li>2009-04-24: <a href="http://michalfrackowiak.com/blog:redis-performance">Redis Performance on EC2</a> &#8211; tests on a couple EC2 instance sizes and vs real hardware</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://randomfoo.net/2009/04/20/some-notes-on-distributed-key-stores/feed</wfw:commentRss>
		<slash:comments>85</slash:comments>
		</item>
	</channel>
</rss>

