Friend Recommendations using MapReduce

July 4, 2013 John Berryman
Category: Relevancy

So Jonathan, one of our interns this summer, asked an interesting question today about MapReduce. He said, “Lets say you download the entire data set of whos following who from Twitter. Can you use MapReduce to make recommendations about who any particular individual should follow?” And as Jonathans mentor this summer, and as one of the OpenSource Connections MapReduce experts I dutifully said, “uuuhhhhh…”

And then in a stoke of genius … I found a way to stall for time. “Well, young Padawan,” I said to Jonathan, “first you must more precisely define your problem… and only then will the answer be revealed to you.” And then darn it if he didnt ask me what I meant! Left with no viable alternatives, I squeezed my brain real hard, and this is what came out:

How to make “follow” recommendations on Twitter

In my mind, the more precise definition of the problem goes something like this:

If Jonathan is following Frank, Fred, Frida, and Fernando,  and if Frank, Fred, Frida, and Fernando are all following Zeke,    and if Jonathan is **not** currently following Zeke      **then Jonathan should follow Zeke.**

Or a bit more generally, “if a lot of my friends are following someone who I am not yet following, then I might be interested in following that person”.

Now, given my only somewhat more precise statement of the problem, lets look at the inputs and outputs to the problems. Lets say that capital letters stand for people on Twitter. The information that we download from Twitter can be formatted to look something like this:

A -follows-> [B C D M P Q X Y Z]B -follows-> [A C D F P Q X Y Z]C -follows-> [A B D F G M X Y]D -follows-> [A B C F G M Q]...

The ultimate output might then be an ordered list of who each person should follow

A -should-follow-> [F G]B -should-follow-> [M]C -should-follow-> [Q P]D -should-follow-> [X Y Z]...

And the big question is how do we generate this list? Now I was unfortunately too slow to answer this question on the spot. However after staring at the ceiling for several hours, the answer came to me in a flash of insight:

We can do this with MapReduce, but were going to need to spread it out into two iterations.

Iteration 1: Find friends of friends.
Iteration 2: Count and sort the friends of frieds on a per person basis, and find friends of friends that arent already being followed by said person.

Friend recommendations in pseudocode:

To reiterate, before the we run the map reduce, we have the following input rows

people       listOfWhoTheyFollowA            [B C D M P Q X Y Z]B            [A C D F P Q X Y Z]C            [A B D F G M X Y]D            [A B C F G M Q]...

Iteration 1: Find Friends of Friends: Mapper

map(person,listOfWhoTheyFollow) :  emit [person, 'follows'] -> listOfWhoTheyFollow  for(friend in listOfWhoTheyFollow) :      emit [friend, 'followed_by'] -> person

This mapper will emit things that look like this:

    [A, 'follows'] -> [B C D M P Q X Y Z]    [B, 'followed_by'] -> A    [C, 'followed_by'] -> A    [D, 'followed_by'] -> A    ...    [B, 'follows'] -> [A C D F P Q X Y Z]    [A, 'followed_by'] -> B    [C, 'followed_by'] -> B    [D, 'followed_by'] -> B    ...

Here, the keys are compound, containing a person and then their relationship to the people mentioned in the values. Why would I want to do this? Because Im getting ready to take advantage of some less commonly used feature of MapReduce – custom partitioning, custom grouping, and custom sorting. Heres how it will work:

Iteration 1: Find Friends of Friends: Shuffle Sort

Define a custom partitioner that partitions on the first element of the mapper outputs keys. This will ensure that [A, 'follows'] -> [B C D M P Q X Y Z] goes to the same machine as [A, 'followed_by'] -> D.
Define a custom sort that sorts on person first and on the relationship second. Make sure that followed_by comes after follows. Now [A, 'follows'] -> [B C D M P Q X Y Z] goes to the same machine and comes before [A, 'followed_by'] -> D.
Define a custom grouping so that all keys with the same person go to the same reducer. In this case [A, 'follows'] -> [B C D M P Q X Y Z] goes to the same reducer as [A, 'followed_by'] -> D. This seems like just the same thing as the custom partitioner, but if we hadnt taken this extra step, these two key value pairs would have gone to different reducers on the same machine.

(For more details about custom partitioning, sorting, and grouping, see the Secondary Sorting section of Chapter 8 of Tom Whites Hadoop: The Definitive Guide. Alternatively, look at our post: Custom Partitioning, Sorting, and Grouping in Hadoop MapReduce.) After the shuffle and sort, then a single reducer will get all the key values associated with a person sorted so that the “follows” relationship comes first. For instance, for A the reducer will receive these inputs:

[A, 'follows'] -> [B C D M P Q X Y Z][A, 'followed_by'] -> [B C D ... ]

And for B the reducer will receive these inputs:

[B, 'follows'] -> [A C D F P Q X Y Z][B, 'followed_by'] -> [A C D ... ]

Iteration 1: Find Friends of Friends: Reducer

For each person, the associated reducer will be called twice. The first time with the “follows” list and the second time with the “followed_by” list. The goal of the reducer is to convert this information to a list of friends of friends.

listOfWhoTheyFollow = nullreduce(compoundKey,people) :    if(compoundKey[1] = 'follows') :        listOfWhoTheyFollow = people        # and do nothing else because the next call to the        # reducer will have the corresponding 'followed_by' list    else if(compoundKey[1] = 'followed_by') :        listOfPeopleThatFollowThem = people        for(followed in listOfWhoTheyFollow) :            for(follower in listOfPeopleThatFollowThem) :                emit follower -> followed

As seen here, whenever the reducer is first called it sets aside the list of people that this person follows. On the subsequent call to this reducer, the reducer iterates through all of the people that follows this person and all of the people that this person follows, and for each pair, the follower -> followed key,value is emitted.

Effectively we are iterating through all possible friends of friends that exist in Twitter. Considering that there are roughly 500M Twitterers and the average number of followers (and followees?) is 200, and considering that the id of users are probably stored as longs, then the result of this iteration is roughly 730TB of stored data. Is this a problem (well maybe!) of course not! This is Big Data! (…whimper…)

Half way there.

This post is getting long, so Ill truncate it here for the time being. In the next iteration of this algorithm, well count up all of the friend-of-friend relationship, weed out the ones that theyre already following, and present the users with the best people to follow – namely, those friends of friends who are most popular among friends.

Interested in seeing the rest of the algorithm? Comment below!

Check out my LinkedIn Follow me on Twitter

Improvements:

First reducer or second mapper to eliminate friends of friends who are already being followed which will be plentiful.
When FoF relationships are redundant – just output them once, with a count. Combiner.