Saturday, January 12, 2008

MapReduce for Mere Mortals

My areas of expertise do not extend broadly... and it's fair to say that MapReduce is probably not among them. Yet, for some reason, I feel the need to try to explain it. So be it.

At this point, MapReduce has become a mythical catch-all explanation for Google's superiority. How do they make those searches so fast? MapReduce! What makes Google brilliant? MapReduce! Who caters the Google cafeteria? MapReduce!

Maybe not. MapReduce is pretty darn simple if explained right.

MapReduce is a function. This function takes three parameters:

  1. a map of key/value pairs
  2. a function to perform on each entry in the map, which returns a new map
  3. a function to perform on the result of step #2

Let's walk through this with an example. Say you want to take every letter in the alphabet and figure out how many words in the language start with that letter. Perhaps you've secured a spot on Wheel of Fortune and you simply must prove that R, S, T, L, N, and E really are the most common English letters.

Where do you start? With an English dictionary, of course. A dictionary is a map of keys (words) to values (definitions). This dictionary will be argument number 1 to our mapReduce function.

Now what? There are any number of ways to divide up the work to be done. Let's think about how to simplify the problem. A dictionary is big, so let's make it smaller and give ourselves an easier time working with it. Let's remove all the definitions and replace each one with the first letter of its word. The result is just another map of words and their starting letters. This trimming of the dictionary is the function we'll use for argument number 2, which we'll refer to as the "map function".
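For what it's worth, here's a rough sketch (not the post's code, just an illustration) of what that trimming produces on a two-word dictionary:

// a tiny dictionary, trimmed: the definitions disappear, the first letters remain
def sample = [abacus: "a device for making arithmetic calculations",
              beaver: "a large, amphibious rodent of the genus Castor"]
def trimmed = [:]
sample.each { word, definition -> trimmed[word] = word[0] }
assert trimmed == [abacus: "a", beaver: "b"]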

The only thing left to do is tally up how many times each letter appears. This counting is the function we'll use for argument number 3, which we'll call the "reduce function".
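On its own, the tally is nothing fancier than this kind of loop (again, just a sketch; the post's actual reduce function appears further down):

// a rough sketch of the tally: given a list of first letters, count each one
def letters = ["a", "a", "b"]
def counts = [:]
letters.each { letter -> counts[letter] = (counts[letter] ?: 0) + 1 }
assert counts == [a: 2, b: 1]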

That's all there is to it. We have a dataset of key/values, we map a function across each element to simplify the problem, and then we reduce the answer into something we want to know using the last function.

The Groovy code is pretty simple, too:
def mapReduce = { data, mapFunction, reduceFunction ->
    def mappedData = data.collect(mapFunction)
    return reduceFunction(mappedData)
}

mapReduce is a function with three parameters... the mapFunction is run against each entry in data, resulting in a new map... and the reduceFunction is run against the result.

Here is the example usage:
//create a dictionary
def dictionary = [abacus: "a device for making arithmetic calculations",
                  arc: "any unbroken part of the circumference of a circle",
                  beaver: "a large, amphibious rodent of the genus Castor"]

// the "map" of "mapReduce" will transform each
// value into the first character of the key
def startingCharacter = {
    it.value = it.key[0]
    return it
}

// the "reduce" of "mapReduce" will count the frequency
// of each character, giving us a result
def countCharacters = { data ->
    def result = [a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0, h: 0, i: 0,
                  j: 0, k: 0, l: 0, m: 0, n: 0, o: 0, p: 0, q: 0, r: 0,
                  s: 0, t: 0, u: 0, v: 0, w: 0, x: 0, y: 0, z: 0]
    data.each {
        def initialValue = result.get(it.value)
        result.put(it.value, initialValue + 1)
    }
    return result
}

println mapReduce(dictionary, startingCharacter, countCharacters)

There you have it. The result is the number of words starting with each letter of the alphabet: for the three-word dictionary above, that's 2 for "a" (abacus, arc), 1 for "b" (beaver), and 0 for everything else.

Does it sound so great? Probably not as described here. The real benefit comes when you have multiple processors and multiple machines, each able to do a portion of the mapFunction's work. If each visit to a dictionary entry takes one second, and your dictionary holds 26,000 words, then this mapReduce is going to take over 7 hours to run. But if you have 26,000 computers that can each analyze a single word, then the problem takes one second to solve (plus the overhead of distributing the work, of course). That does sound great. It's not mapReduce itself that gives Google the advantage; it's the framework they've built around the concept that is so unique.
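As a rough illustration only (this is nothing like Google's actual framework, just a sketch that farms the map step out to a plain Java thread pool from Groovy), the reduce step stays the same while the map tasks run concurrently:

import java.util.concurrent.Callable
import java.util.concurrent.Executors

// a sketch of a parallel mapReduce: each dictionary entry becomes its own
// map task, and the reduce runs once every task has finished.
// (the pool size of 4 is arbitrary; imagine 26,000 machines instead)
def parallelMapReduce = { data, mapFunction, reduceFunction ->
    def pool = Executors.newFixedThreadPool(4)
    try {
        // submit one map task per entry and collect the Futures
        def futures = data.collect { entry ->
            pool.submit({ mapFunction(entry) } as Callable)
        }
        // wait for every task to finish, then reduce the mapped entries as before
        def mappedData = futures.collect { it.get() }
        return reduceFunction(mappedData)
    } finally {
        pool.shutdown()
    }
}

// works with the same dictionary, startingCharacter, and countCharacters as above
println parallelMapReduce(dictionary, startingCharacter, countCharacters)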

Another take on Groovy MapReduce can be found on Ken Barclay's blog, and the original discussion that triggered all this lives on the Groovy Users of MN discussion group. Code is here for fiddling.

4 comments:

kodeninja said...

Hamlet,

Shouldn't the key used in the code for countCharacters, when accessing the result map, be it.value rather than it?

Using it causes the following exception:
java.lang.NullPointerException: Cannot invoke method plus() on null object

-kodeninja

kodeninja said...

Ohh, I see you've got the right code in the attached MapReduce.groovy file.

-kodeninja

Hamlet D'Arcy said...

fixed, sorry about that.

paulk_asert said...

A very slightly tweaked version:

def dictionary = [
  abacus: "a device for making arithmetic calculations",
  arc: "any unbroken part of the circumference of a circle",
  beaver: "a large, amphibious rodent of the genus Castor"
]

def startingCharacter = {
  it.value = it.key[0]
  it
}

def countCharacters = { data ->
  def result = [:]
  ('a'..'z').each{ result[it] = 0 }
  data.each { result[it.value]++ }
  result
}

def mapReduce = { data, mapFunction, reduceFunction ->
  def mappedData = data.collect(mapFunction)
  reduceFunction(mappedData)
}

println mapReduce(dictionary, startingCharacter, countCharacters)