On MapReduce, Java and patterns

Some thoughts after reading the "MapReduce Design Patterns" book

Many years ago I read the GoF book on design patterns, and I liked it a lot: it showed me many things I had invented months or even years before (and thus already knew), but the GoF book gave them names and provided acknowledgment that my inventions were reasonable. Since then, I have liked the pattern-based approach to development...

I came across MapReduce when checking out CouchDB. CouchDB uses MapReduce to generate materialized views, and views are the only way to fetch data from CouchDB unless you know the ID of the individual document. I found that elegant and consistent (much better than MongoDB's optional usage of MapReduce), but I found it challenging to come up with MapReduce routines.
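For illustration, here is a minimal sketch of the CouchDB style of MapReduce in plain JavaScript: a map function emits key/value rows per document, and the view engine indexes them. The document shape (blog posts with tags) and the tiny stand-in view engine are my own assumptions for the sketch, not CouchDB's actual internals:

```javascript
// A CouchDB-style map function: called once per document,
// it emits zero or more key/value rows into the view index.
// The "post with tags" document shape is a made-up example.
var emit; // provided by the view engine at indexing time

function map(doc) {
  if (doc.type === "post") {
    doc.tags.forEach(function (tag) {
      emit(tag, 1); // key: the tag, value: one occurrence
    });
  }
}

// A tiny stand-in for the view engine, just so the sketch runs:
function runView(docs, mapFn) {
  var rows = [];
  emit = function (key, value) { rows.push({ key: key, value: value }); };
  docs.forEach(function (doc) { mapFn(doc); });
  return rows;
}

var docs = [
  { type: "post", tags: ["couchdb", "mapreduce"] },
  { type: "post", tags: ["mapreduce"] },
];
var rows = runView(docs, map);
// rows holds [{key:"couchdb",value:1}, {key:"mapreduce",value:1}, {key:"mapreduce",value:1}]
```

In real CouchDB a reduce function (or the built-in `_count`) would then collapse these rows per key; the point is how little ceremony the map side needs.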

I found that designing MapReduce routines requires a mental shift in my sequentially-oriented mind, and I started looking for material to get me there and show me an elegant way to invent my MapReduce routines. Then I found that book, MapReduce Design Patterns, and it sounded just right.

But what I found in the book was quite disappointing:

  • MapReduce in the book is not presented as an elegant technology - it is a heavy, cumbersome technique for fighting performance problems; there was not much about the conceptual decomposition of an algorithm into two functions - map and reduce - but more about the technical decomposition of the load between mappers and reducers
  • The clumsy MapReduce examples written for Hadoop also show the clumsiness of Java - in this context Java is a festival of boilerplate code and a flood of unnecessary verbosity. The straightforward and elegant solutions I hoped to find, in the JavaScript style of MapReduce for CouchDB, are simply not possible in Hadoop and Java
  • In general, the entire topic is a heavy and cumbersome (sorry to repeat the same words, but that is how it is) fight with performance. It leaves the impression that the entire Big Data topic is only about this: how to make actually very simple things, like a total sum or an average, viable on a reasonable time-scale for many data points. While I admit this might be the actual topic, it takes away all the attractiveness of the area for me
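To be fair, the "simple average" mentioned above does illustrate the needed mental shift: because reduce may run on partial results from different reducers (CouchDB calls this rereduce), it must combine (sum, count) pairs, never averages themselves. A plain-JavaScript sketch with made-up numbers:

```javascript
// Even an average needs care in MapReduce: the reduce step must be
// combinable, so each value is mapped to a (sum, count) pair and
// reduce merges pairs - averaging only happens at the very end.
function mapValue(v) {
  return { sum: v, count: 1 };
}

function reduce(partials) {
  return partials.reduce(function (acc, p) {
    return { sum: acc.sum + p.sum, count: acc.count + p.count };
  }, { sum: 0, count: 0 });
}

var values = [10, 20, 30, 40];
// Simulate two "reducers" working on halves of the data,
// then combine their partial results (the rereduce step):
var left = reduce(values.slice(0, 2).map(mapValue));
var right = reduce(values.slice(2).map(mapValue));
var total = reduce([left, right]);
var average = total.sum / total.count; // 25
```

The conceptual core fits in a dozen lines; it is the Hadoop/Java packaging around it that the book makes feel so heavy.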

For me, Big Data should be about new algorithms, surprising understanding of data, elegant computations and visualisation. But the current world of MapReduce in the context of Hadoop and Java is far from that.

I guess I need to check out Pig to find some more elegant approaches, although MapReduce in Pig is hidden and kind of invisible... But maybe there is no way to make MapReduce elegant, and so hiding it might be the only proper way forward...

Don't get me wrong: I was OK with the style and writing of the book; it is the topic that missed me completely...