Internet Asshattery, Armchair Scaling Experts Edition

Posted on April 25, 2008 (updated March 5, 2009) by lhl

I know it’s never good to pay attention to the nattering classes, but Mike Arrington launched a pretty high-profile fusillade at Blaine Cook that seemed to bring out the armchair experts in full force in the comments. Now, while I think that Arrington’s post is way out of line (I’ll explain that in a bit), I’m almost not as bothered by it (as long as he’s not too bothered about being called out on it)… What really bugs me is the number of clueless “developers” throwing in their two cents. That includes Arrington’s two Rails developers with their “finger on the pulse of the rails community” (ha!). My discontent was further exacerbated by this (unrelated) completely clueless piece on The Register. Is this the best that tech journalism has to offer?

First, a disclaimer: I don’t know Blaine very well, and I don’t have any privileged info on Twitter or Obvious Corp.

There’s no question that Twitter has suffered, and continues to suffer, from capacity, load, and other stability issues, and pointing that out is fair game. However, using Blaine’s scaling talk as a personal dig is a disservice to everyone, especially since:

The advice in the slides is generally good, and the “It’s Easy” is obviously snark (just look at the failcat in the next slide). It’d be easy to confirm that by asking anyone who was at the talk (like me or several hundred other people) instead of projecting prideful boasting onto it to justify the attack. I’ll avoid ascribing motivations to why Arrington chose to do this.
More crucially, the slides themselves point to the issues that a proper tech journalist would be able to spot and follow up on to find out what was really going on (assuming he cared about that).

For example, 600 QPS on 8 machines is pretty decent, but it raises the question of utilization and capacity planning. You can see from the 1×1 MySQL structure and the note on DRb that there were many single points of failure; again, this raises questions of BCP and redundancy. With the constant bumping of limits, you could guess that they were running really hot (and from a single data center, even after the move, probably w/o backup routers, etc.). All of these issues are as much business/financial decisions as architectural/technological ones (if not more so, since the technical side is a no-brainer).
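To make that back-of-the-envelope reasoning concrete, here’s a rough sketch in Python. Only the 600 QPS and the 8 machines come from the slides; the per-machine ceiling (`ASSUMED_CAPACITY_QPS`) is a number I made up purely for illustration, and the function names are mine. Nobody outside Twitter knows the real figures.

```python
def per_machine_qps(total_qps, machines):
    """Average load each machine carries if traffic is spread evenly."""
    return total_qps / machines


def utilization(total_qps, machines, capacity_per_machine):
    """Fraction of total capacity in use (1.0 means completely saturated)."""
    return total_qps / (machines * capacity_per_machine)


def survives_loss(total_qps, machines, capacity_per_machine, lost=1):
    """Can the remaining machines absorb the load if `lost` machines die?"""
    remaining = machines - lost
    return remaining > 0 and total_qps <= remaining * capacity_per_machine


TOTAL_QPS = 600            # from the slides
MACHINES = 8               # from the slides
ASSUMED_CAPACITY_QPS = 80  # hypothetical per-machine ceiling, NOT from the slides

print(per_machine_qps(TOTAL_QPS, MACHINES))                      # 75.0
print(utilization(TOTAL_QPS, MACHINES, ASSUMED_CAPACITY_QPS))    # 0.9375
print(survives_loss(TOTAL_QPS, MACHINES, ASSUMED_CAPACITY_QPS))  # False
```

With that made-up ceiling, the cluster is running at about 94% and can’t lose a single box. Whether the real numbers looked anything like that is exactly the question a follow-up would have answered.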

Now, I don’t know what happened between ops, management, and engineering, but guess what? Arrington doesn’t either, and he never bothered to follow up, kicking Blaine in the head instead, even when such clues obviously raise significant doubts about whether that’s appropriate. I agree with Arrington’s point about accountability, which is why I say now that Arrington wasn’t posting journalistically (the minimal followup with someone w/ half a clue would have pointed out exactly what I did), and Blaine deserves an apology. If you’re gonna shit on someone and start pointing fingers, you better have the goods to back it up. Whiny, uninformed personal attacks belong on Arrington’s LiveJournal or (wait for it…) Twitter stream.

Now, onto the clueless comments from wannabe developers. First, in the entire thread I saw only one half-decent attempt at a technical critique, and even that falls down when you look at it. I don’t want to belabor the point, but the poster, Jordan, actually raises technical points worth addressing (and refuting):

On indexing: while it’s true you don’t want to index willy-nilly and it’s incomplete to say “index everything”, if your ORM isn’t automatically indexing frequently used keys, you can be sure as heck that you’ll want to make a point of indexing them, especially if you’re doing joins. Yeah, you don’t index what you don’t need, but even if you have frequent writes, you need to eat the write cost if you’re ever going to query. Far more people suffer from a lack of indexes than from too many: as long as you examine the results each time you add an index, you’re not gonna have a problem “over indexing.” I don’t know the exact fan-out/pub-sub architecture, but you can be sure you’ll be doing a lot more reads even if you cache the hell out of it. If you’re thrashing, you’re more likely looking at mis-configured index caches than anything else.
DRb: This is a case where it looks like he just misread. It’s easy to do if you don’t have the context of the talk. DRb was good enough… until it wasn’t, which is why Starling was written to replace it. Now, we still don’t know if it’s a single point of failure, but it obviates that whole rant (as to why DRb was chosen in the first place, more on that later).
Caching: again, the same thing as with indexes. Of course over-caching is bad, but that’s never going to be your problem, because you start with no caching and you add caching until you start losing performance. Also, the “no substitute for fixing the underlying problem” is naive: most of the time, the underlying problem is that you’re doing complex queries or processing you have no need to do, since the data doesn’t change and should be cached. durrf.
Profiling: ok, this I’d sorta agree with. Mentioning ruby-prof would probably be good, but honestly, 90%+ of optimizations can be done with simple timers, EXPLAINs, and logs alone. (And also, performance tuning doesn’t have all that much to do with scaling anyway.)
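Here’s a minimal sketch of the “add an index and examine” loop using nothing but a query plan and a simple timer. To keep it self-contained it uses Python’s built-in `sqlite3` rather than MySQL (so it’s `EXPLAIN QUERY PLAN` instead of MySQL’s `EXPLAIN`), and the `statuses` table, its columns, and the index name are all hypothetical, not Twitter’s schema.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Hypothetical table, for illustration only.
conn.execute(
    "CREATE TABLE statuses (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO statuses (user_id, body) VALUES (?, ?)",
    [(i % 1000, "msg %d" % i) for i in range(50000)],
)

QUERY = "SELECT id, body FROM statuses WHERE user_id = ?"


def plan(sql, params=()):
    """Return the human-readable detail column of the query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


def timed(sql, params=(), repeat=50):
    """Simple timer: total seconds to run the query `repeat` times."""
    start = time.perf_counter()
    for _ in range(repeat):
        conn.execute(sql, params).fetchall()
    return time.perf_counter() - start


before_plan = plan(QUERY, (42,))
before_time = timed(QUERY, (42,))

conn.execute("CREATE INDEX idx_statuses_user_id ON statuses (user_id)")

after_plan = plan(QUERY, (42,))
after_time = timed(QUERY, (42,))

# Exact plan wording varies by SQLite version: expect a full table scan
# before the index and a search using idx_statuses_user_id after it.
print(before_plan, "%.4fs" % before_time)
print(after_plan, "%.4fs" % after_time)
```

If the plan doesn’t change or the timer doesn’t improve, you drop the index and move on. That’s the whole “examining” step, and it’s why over-indexing is rarely the real problem.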

As to the rest of the wannabes, it really is true that if you haven’t done it (that is, been intimately involved in growing a social web app from prototype to Internet-scale on a UNIX stack), then you really don’t know shit. (I know more than my fair share of people that have, and I didn’t see any of them posting armchair bs in the comments.) I’m not saying this just to be dismissive, but only to say that you really, really don’t understand the technical challenges involved. Generating target sets on social objects is extremely expensive and ill-suited to traditional 4NF data models in RDBMSs. So is social activity fan-out and any number of activities core to Twitter’s message routing/storage and to social web apps in general. These are not traditional problems, and standard HA solutions just aren’t available.
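For anyone who hasn’t run into fan-out before, here’s a toy in-memory sketch in Python of the two usual approaches: copying each message into every follower’s inbox at write time, versus gathering and merging at read time. This is purely illustrative; the class and method names are mine, and it is not a description of how Twitter actually routes or stores messages (which, again, I don’t know).

```python
from collections import defaultdict


class FanOutOnWrite:
    """Copy each post into every follower's inbox at write time."""

    def __init__(self):
        self.followers = defaultdict(set)  # author -> set of followers
        self.inboxes = defaultdict(list)   # user -> [(seq, author, text)]
        self.writes = 0                    # inbox writes performed so far
        self._seq = 0

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def post(self, author, text):
        self._seq += 1
        for follower in self.followers[author]:
            self.inboxes[follower].append((self._seq, author, text))
            self.writes += 1

    def timeline(self, user):
        # Cheap read: the inbox is already in posting order.
        return [(a, t) for _, a, t in reversed(self.inboxes[user])]


class FanOutOnRead:
    """Store each post once; gather and merge at read time."""

    def __init__(self):
        self.following = defaultdict(set)  # user -> set of authors
        self.posts = defaultdict(list)     # author -> [(seq, author, text)]
        self._seq = 0

    def follow(self, follower, author):
        self.following[follower].add(author)

    def post(self, author, text):
        self._seq += 1
        self.posts[author].append((self._seq, author, text))

    def timeline(self, user):
        # Expensive read: touch every followed author's posts, then sort.
        merged = [p for a in self.following[user] for p in self.posts[a]]
        merged.sort(reverse=True)
        return [(a, t) for _, a, t in merged]


push = FanOutOnWrite()
for i in range(1000):
    push.follow("fan%d" % i, "popular")
push.post("popular", "hello")
print(push.writes)  # 1000 inbox writes for a single post
```

Either way somebody pays: one post from a popular account turns into a thousand writes, or every timeline read turns into a multi-way merge. Neither maps nicely onto a normalized schema with a couple of joins.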

Even if you’re architecturally sound, you’re developing under extremely tight timelines/pressures, so you have to pick things that will work now but will probably need to be replaced eventually (e.g. DRb for Twitter). Usually you won’t know when, or which component will be the limiting factor, since you don’t know what the use cases will be to begin with. Development from prototype on is a series of compromises against the limited resources of man-hours and equipment. In a perfect world, you’d have perfect capacity planning and infinite resources, but if you’ve ever experienced real-world hockey-stick growth on a startup shoestring, you know that’s not the case. If you have, you understand that scaling is the brick that hits you when you’ve gone far beyond your capacity limits and your machines hit double- or triple-digit loads. Architecture doesn’t help you one bit there.

And the people that have experienced this and lived to tell the tale also know that it’s impossible to critique the technical/operational decisions that were made with any sort of authority w/o first seeing and understanding the QPS targets, load graphs, profiling data/sar info, and all manner of other architectural/technical data and details (that none of us are privy to).

Anyway, if you were given the choice between working with/hiring someone like Blaine, who has had firsthand, full life-cycle scaling experience, and any random developer (and definitely anyone from the TechCrunch comments), I think it’s fairly obvious what the right decision would be. I guess I’ll leave it at that.

This leads to Part Deux of my rant… this lead-paint baby of an article entitled “Backlash starts against ‘sexy’ databases”, which has the following quote, I shit you not:

“The bottom line is don’t tell me RDMBS [sic] can’t scale if you can’t write a decent query or design a normalized database schema.”

This is by one John Holland. Now, no doubt the WordPress code can be pretty shitty (although sometimes there are good reasons for the multiple queries to support various hooks/plugins), but you will never hit the type of performance problems in a WP (non-mu) installation that have people looking for MySQL alternatives, because WP just doesn’t have the types of queries that destroy RDBMSs.

I can understand that it’s not really the fault of the article’s author (Phil Manchester) that he conflates the “cons” (arguments that WP is badly coded) with the “pros” (correct!) that you can’t write the kinds of queries you need for social apps on an RDBMS. If he’s like the reporters I know, he probably doesn’t actually understand it at all and is just doing his beat writeup. But dammit, can’t he get some decent frickin’ technical advisors to explain this if he’s doing tech journalism? The entire article is based on characterizing a misinformed blog post as a “brewing controversy.”

I mean, I don’t want to be meaner than I have to about this, but John Holland just has no idea what he’s talking about. He picks up on Atwood’s post on WP inefficiency, and then uses that to (completely incorrectly, and not without a tinge of reverse elitism) generalize on why the “cool kids” are hyping non-relational data stores. He goes on to boldly state “Relational databases are not the bottleneck” due to his complete lack of understanding of the actual problem set (hint: I don’t know anyone who’s suggesting WP should be switched off MySQL). This then leads to a horribly ignorant article being published by a writer who is, in the best case, lazy and doesn’t understand what he’s reporting (just show two equal sides and do a writeup) or, in the worst case, simply looking for a manufactured conflict that will only serve to stir controversy and confuse the non-savvy reader.

(The reasons for alternative data stores actually fall along a few axes: one is more development flexibility, or the ability to change functionality w/o expensive downtime (schemaless); one is issues of scale and availability (distributed); and then there’s a whole bunch for supporting social queries that are just horribly suited to RDBMSs (multi-attribute, inverted index, mq/pubsub, etc.). Many of the alternatives are combinations of various axes.)