
Hadoop World Roundup

OK, most of you who will read this were probably at Hadoop World NY yourselves, but here’s my roundup nonetheless.  There was a lot to go to and some very interesting things to hear.  If you want to know how I felt about every detail, at least what I could post until my phone needed to be hooked up to life support around halftime, check out the tweets from the day.


  • Cloudera announced Cloudera Desktop, aimed at making development and debugging a little bit easier for newcomers.  My take: it’s a little lackluster at the moment.  Given a few more months of development, it could be a very useful tool, but right now it seems a bit gimmicky to me.  Karmasphere, the next tool, I think delivers a lot more.
  • Amazon showed off Karmasphere, an IDE built on top of NetBeans specifically designed for developing Map/Reduce jobs.  This seems like a really great tool, even though I don’t really appreciate NetBeans too much.  Check it out for lots of details, but it makes debugging and deploying a lot easier.
  • This one’s pretty far off still, but Doug Cutting talked about adding a sort of spreadsheet-like interface to Hadoop so that non-programmers can do data analytics.
  • Paul Brown from Booz Allen, who was talking about protein alignment or something or other (sorry, I’m not a bioinformatics guy, but it’s comparing proteins for similarity, mutation, things like that), briefly showed off a tool they developed for their own needs to visualize and debug activity in the cloud.  If I can find a video, I’ll be sure to post it for all of you who didn’t see it.

New Tech

  • The one thing I saw that was really new was Avro.  Avro is basically a serialization and data container format, like Thrift or Protocol Buffers, but better.  At least, that’s what Doug was saying.  I can’t even tell you what the difference is, since it’s been a couple of days and I was a little fuzzy to begin with, so we’ll have to wait for his slides to be posted to find out (unless someone else knows).
  • An encrypted, secure file system in Hadoop using Tahoe was shown off by Aaron Cordova from Booz Allen.  This is a great idea, and I think it will make some people willing to put their data on a public cloud who weren’t before.

Interesting Ideas

  • One of the ideas that popped up in a couple of places was to use a sort of tiered or phased architecture, where you have clouds built for one function or another, and they work together in a workflow-like fashion.  The best example I saw was in the Facebook talk by Zheng Shao, where they have a processing cluster, a quick storage cluster, and then a long-term storage architecture.  When something needs to be processed, the quick storage cluster fetches the data from the archive cluster if it doesn’t already have it, and then serves as the storage for the processing cluster, instead of the processing cluster having its own storage.
  • Instead of keeping data in triplicate across the cloud, set it up more like RAID 5, where you store a parity block instead of the extra copies.
  • One other thing I wanted to mention was some of the stuff the business analytics guy Bradford Stephens talked about, which was using Katta and Bobo (can’t find a link to Bobo, and I haven’t heard of it before) to create almost real-time (less than 5 seconds) map/reduce jobs.  The key to these is that they’re not really performing any additive operations, just simple things like counts.
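
For anyone who hasn’t played with RAID 5, here’s a rough Python sketch of the parity idea from the list above.  To be clear, this is my own toy illustration and not how Hadoop or anyone at the conference implements it: the function names (xor_blocks, make_parity, recover_block) and the sample blocks are made up, and I’m assuming fixed-size blocks and only one lost block at a time.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Build one parity block for a group of data blocks (hypothetical helper)."""
    return xor_blocks(data_blocks)


def recover_block(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors plus the parity block."""
    return xor_blocks(surviving_blocks + [parity])


# Three made-up 8-byte "blocks" that would live on three different nodes.
data = [b"hadoop__", b"world_ny", b"roundup_"]
parity = make_parity(data)

# Pretend the node holding the middle block died.
recovered = recover_block([data[0], data[2]], parity)
assert recovered == data[1]
```

With three data blocks and one parity block you store four blocks instead of nine, at the cost of having to read the other blocks to rebuild a lost one, and of only surviving a single failure per group.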

There are probably several other things to add to the list, but I haven’t been able to get enough information out of the flood of tweets to reliably talk about any of them yet.  All the other geeks and nerds in the room were spamming Twitter as much as I was, so don’t let this be the be-all and end-all of everything that happened.


Posted in Open Source, Programming.
