Just about the first thing you’re going to do when you learn MapReduce is write the word frequency job. It just reads in a bunch of text and reports how many times each word is used. The mapper spits out a pair of “word” and “1” for every word it sees, and the reducer adds up the 1s for each word.
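Here is a minimal sketch of that job in plain Python, with no Hadoop involved. The function names (mapper, shuffle, reducer, word_count) and the regex used to split words are my own assumptions for illustration; a real framework does the grouping step for you and spreads the work across machines.

```python
import re
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    # Assumption: a "word" is a run of letters or apostrophes, lowercased.
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(word, counts):
    """Add up the 1s emitted for a single word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in a single process."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the cat and the hat", "The end"]))
```

The mapper never needs to know about any other line, and the reducer never needs to know about any other word, which is what lets the real thing scale out.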
That seemed like a toy problem when I first did it. What I didn’t realize at the time was that some words and phrases can be used to determine the date a particular text was written. For example, the frequency of the word “gnarly” increased significantly around 1982 when Fast Times at Ridgemont High hit the theaters.
Now it seems that historians are using this technique to date texts so that they can properly order events. A great overview of this process is given at The Physics arXiv Blog. From the article:
For example, Tilahun and co say that the phrase “amicorum meorum vivorum et mortuorum”, which means “of my friends living and dead”, was popular between the years 1150 and 1240 but not at other times. And the phrase “Francis et Anglicis”, which is a form of address meaning “to French and English”, was phased out when England lost Normandy to the French in 1204.
The other thing a word and phrase frequency map does is form a sort of high-level fingerprint for a collection of text. In addition to classifying text according to date, this fingerprint can also be used to gauge the likelihood that a certain author wrote a particular document.
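One rough way to compare two such fingerprints is cosine similarity over relative word frequencies. To be clear, this is my own choice of measure for illustration, not the method from the article, and it gives a similarity score rather than a true likelihood. The function names and the word-splitting regex below are assumptions too.

```python
import math
import re
from collections import Counter


def fingerprint(text):
    """Map each word to its share of all the words in the text."""
    # Assumption: a "word" is a run of letters or apostrophes, lowercased.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {}
    total = len(words)
    return {word: count / total for word, count in Counter(words).items()}


def cosine_similarity(a, b):
    """Score two fingerprints: 0.0 for no shared words, near 1.0 for the same mix."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has nothing to compare
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    known = fingerprint("totally gnarly waves today, totally gnarly")
    unknown = fingerprint("those waves were totally gnarly")
    print(cosine_similarity(known, unknown))
```

Anyone doing this for real would want far more text per fingerprint and would probably count phrases as well as single words, but the shape of the idea is the same.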
All from the “Hello World” of MapReduce. Dude!