Quick Start with Neo4J using YOUR Twitter Data

John Berryman — November 27, 2013 | 13 Comments | Filed in: solr

When learning a new technology it’s best to have a toy problem in mind so that you’re not just reimplementing another glorified “Hello World” project. Also, if you need lots of data, it’s best to pull in a fun data set that you already have some familiarity with. This allows you to lean upon already established intuition of the data set so that you can more quickly make use of the technology. (And as an aside, this is just why we so regularly use the StackExchange SciFi data set when presenting our new ideas about Solr.)

When approaching a graph database technology like Neo4J, if you’re as avid a Twitter user as I am then POOF, you already have the best possible data set for becoming familiar with the technology: your own social network. This blog post will help you download and set up Neo4J, set up a Twitter app (needed to access the Twitter API), and pull down your social network as well as any other social network you might be interested in. At that point we’ll interrogate the network using Neo4J and the Cypher query language. Let’s go!

Installing and setting up Neo4J

Since we’re not setting Neo4J up for production use, this part’s real easy. Just go to the Neo4J download page, click on that giant blue download button, and 36.1M later you’ll have your very own copy of Neo4J. Unzip it to some reasonable place on your machine, cd into that directory, and simply issue the command bin/neo4j start. (Once you’re finished, a bin/neo4j stop will shut Neo4J down.) Now if you point your browser at http://localhost:7474 and see stuff (rather than lack of stuff), then you’re ready to start shoveling data into Neo4J.
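If you’d rather check from a script than from the browser, something like the following should do. It’s only a minimal sketch, written for Python 3: the function name neo4j_is_up and its opener argument are my own inventions for illustration, and it assumes the default http://localhost:7474 address mentioned above.

```python
from urllib.request import urlopen
from urllib.error import URLError


def neo4j_is_up(url="http://localhost:7474", opener=urlopen, timeout=5):
    """Return True if an HTTP GET against `url` answers with status 200.

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    function can be exercised without a running server.
    """
    try:
        response = opener(url, timeout=timeout)
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error status, etc.
        return False
    try:
        return response.getcode() == 200
    finally:
        response.close()


if __name__ == "__main__":
    print("Neo4J is up" if neo4j_is_up() else "Nothing is listening yet")
```

A False result right after bin/neo4j start may just mean the server is still starting, so give it a few seconds before worrying.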

Prepping Twitter

You’ll need to create a Twitter app before you can start pulling down your connections because you need the app’s credentials in order to access Twitter’s API. But don’t sweat it, this literally takes less than a minute. Just go to the Twitter developer apps page, sign in, and there will be yet another big blue button, this time labeled “Create a new application” — click it! After filling out a really short form, checking the “I blindly agree to whatever is included in this legal contract” checkbox, entering a CAPTCHA string, and clicking the “Create your own Twitter application” button, you will indeed have your very own Twitter app. You’ll be taken to a screen that contains the details for your new app, but most importantly the OAuth credentials. Initially, you won’t have the access tokens, but you can click the “Create access tokens” button at the bottom and next time you refresh the page (wait a few seconds) you’ll see that the access keys are available. Keep track of the credentials here because you’ll need to refer to them soon.

Scraping Your Social Circles from Twitter

Check out my Python TwitterScraper script. Though it’s not yet the most beautiful code, it doesn’t really matter, because there’s not much here! Let’s take a moment to walk through it. The first section is where you set up Twitter and Neo4J. Naturally you’ll need to pip install the Tweepy and Py2Neo libraries, but they don’t have any weird dependencies, so this shouldn’t be a problem. Also notice, this is where all the access keys for your Twitter app should be used. Go ahead and copy and paste your credentials there. Now you should be ready to go.

The remaining code includes two functions. The first, create_or_get_node, creates or gets a node (in this case a Twitter user) from Neo4J by id_str, and if it’s creating the node for the first time, it also inserts all of the relevant user metadata into Neo4J. create_or_get_node also optionally takes a list of labels that will later be used to group certain users together. The second function, insert_user_with_friends, takes a Twitter user (via their screen name), pulls all relevant metadata for that user from the Twitter API, and inserts it into Neo4J. This function will then do the same thing for all the individuals that this Twitter user follows. And finally, insert_user_with_friends will establish a FOLLOWS relationship linking the source Twitter user to those that she follows. Here again, insert_user_with_friends takes an optional list of labels that can be used to group the seed nodes (those that are followed do not get labeled).
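To make the create-or-get idea concrete, here is a rough sketch of how such a query string could be assembled. It is not the TwitterScraper code itself: build_merge_query and the property names are illustrative assumptions, and only the MERGE (u:User {id_str:{id_str}}) pattern and the ON CREATE SET / ON MATCH SET clauses are taken from fragments of the script quoted in the comments at the end of this post. The {id_str} parameter style is the Neo4J 2.0 syntax in use at the time.

```python
def build_merge_query(properties, labels=None):
    """Build a Cypher 2.0 'create or get' statement for a Twitter user node.

    properties: names of user fields to store when the node is first created
                (hypothetical; the real script decides its own field list).
    labels:     optional extra labels, e.g. ["SeedNode", "Neo"].
    """
    lines = ["MERGE (u:User {id_str:{id_str}})"]

    # Metadata is written only when the node is created for the first time.
    on_create = ["u.%s = {%s}" % (name, name) for name in properties]
    # Labels are applied whether the node is new or already existed.
    label_items = ["u:%s" % label for label in (labels or [])]

    if on_create or label_items:
        lines.append("ON CREATE SET " + ", ".join(on_create + label_items))
    if label_items:
        lines.append("ON MATCH SET " + ", ".join(label_items))
    lines.append("RETURN u")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_merge_query(["screen_name", "name"], ["SeedNode", "Neo"]))
```

The values themselves (id_str, screen_name and so on) are not pasted into the string; they would be passed as query parameters when the statement is executed. Labels cannot be parameters in Cypher, which is why they are written into the query text, so they should only ever come from your own code.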

The last bit of the script is the fun part. This is where you programmatically lay out the social networks and individuals that you want to stalk… er, uh… observe. For your convenience, I’ve added all of the OpenSource Connections team, as well as several notable individuals from the Neo4J community. I’ve also included grouping labels that I thought were pretty reasonable descriptors for these individuals and groups. As that last comment in the code states, make sure to add several people that you follow as well. Remember, the goal here is to create a data set that you are eminently familiar with. Once you’re happy with the data set, then run it: python TwitterScraper.py. It will pull down Twitter users 200 at a time and insert them into Neo4J as fast as possible. Soon the program will hit Twitter’s rate limit, at which point the script will wait until the limit has been lifted and then continue pulling down the rest of the data. Altogether, you can plan on getting around 200 updates per minute.
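The wait-and-resume behavior can be sketched as follows. This is an illustration, not the script’s code: RateLimitError, fetch_page, and fetch_all_pages are hypothetical names standing in for whatever the Twitter client actually raises and calls. The default 16-minute pause mirrors the time.sleep(60 * 16) line that a commenter quotes from the script.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error a Twitter client raises when throttled."""


def fetch_all_pages(fetch_page, wait_seconds=60 * 16, sleep=time.sleep):
    """Collect every page of results, pausing whenever the rate limit is hit.

    fetch_page(cursor) must return (items, next_cursor); a falsy next_cursor
    means there are no more pages. It may raise RateLimitError.
    """
    items = []
    cursor = -1  # Twitter's cursored endpoints start at -1 and finish at 0
    while cursor:
        try:
            page, next_cursor = fetch_page(cursor)
        except RateLimitError:
            # Keep the same cursor so the failed page is retried after the wait.
            sleep(wait_seconds)
            continue
        items.extend(page)
        cursor = next_cursor
    return items
```

If each page holds 200 users and the limit is 15 requests per 15-minute window (that is my recollection of the limit at the time, so treat it as an assumption), the arithmetic works out to 15 × 200 / 15 = 200 users per minute, which matches the rate quoted above.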

Start Infiltrating the Social Network!

Now for the fun part; let’s start putting some queries together and pulling back interesting data. In all of the examples below, we will be using the default Neo4J browser, which you’ll still find at http://localhost:7474/. Here we’re using the Cypher query language. This blog post won’t go into too much detail about Cypher syntax itself, but feel free to look at the very rich Neo4J documentation. Also, I’ll be using my own Twitter screen name “JnBrymn” as an example, so feel free to replace my screen name with your own and try the queries for yourself.

First off, let’s make sure the data we’ve ingested seems reasonable. The most obvious thing to do is to make sure we’re actually in the data set:

MATCH (n {screen_name:"JnBrymn" }) 
RETURN n

Up pops an orange node representing me. And if I click on the node, I see a list of all my metadata.

[Screenshot: the Neo4J browser showing the selected node and its metadata]

I wonder just how many users we have indexed now?

MATCH (n) 
RETURN count(*)

7098 users, not bad. How many am I following?

MATCH (n {screen_name:"JnBrymn"})-[:FOLLOWS]->(o)
RETURN count(*)

371 – yep, that looks right. And check out how easy Cypher is — you’re basically drawing ASCII art of the node connections. So it’s easy to ask the next obvious question: How many are following me? Here I just switch the direction of the relationship arrow:

MATCH (n {screen_name:"JnBrymn"})<-[:FOLLOWS]-(o) 
RETURN count(*)

Hmm… only 10 followers. Am I really that unpopular? (Checking Twitter now.) No, it says I’ve got 460 followers. Oh, that’s right, if you’ll remember, we’re only collecting outbound FOLLOWS relationships from our seed users (labeled as SeedNode), so only followers who are themselves seed users show up here. The reason is that some people, Justin Bieber for example, are followed by millions of Twitter users! And we certainly don’t want to keep track of that for now.

But all this makes me think, of the seed users that I follow, who does not follow me back?

MATCH (n {screen_name:"JnBrymn"})-[:FOLLOWS]->(o:SeedNode)
WHERE NOT (o)-[:FOLLOWS]->(n)
RETURN o.screen_name

This returns a single name: mesirii. This is Michael Hunger, one of the Neo4J hot shots. If he’s not following me back, then I’m definitely not doing a good job of infiltrating the Neo4J community yet. No matter… I bet he’s a @justinbieber follower anyway… let’s check:

MATCH (n:SeedNode)-[:FOLLOWS]->(o {screen_name:"justinbieber"})
RETURN n.screen_name

Sadly… no one on our list follows Justin Bieber… I was sure I would have some good blackmail fodder there! (But hey, maybe you’ll discover some Beliebers in your own data set :P )

Hmm… well, if I’m going to break into the Neo4J community, I need to find my likely vectors. Let’s create a list of the Neo4J people who follow me and order them by the number of other Neo4J people that they follow. Maybe I can get introductions through these friends:

MATCH (n:Neo)-[:FOLLOWS]->(m:SeedNode {screen_name:"JnBrymn"}),
      (n)-[:FOLLOWS]->(o:Neo)
RETURN count(*), n.screen_name
ORDER BY count(*) desc
LIMIT 10

This returns:

count(*) |  n.screen_name
---------+---------------
13       |  wefreema
11       |  technige

Sweet, so my friends wefreema and technige look like my gatekeepers to the Neo4J community. The only thing left to determine is which people I need to connect to.

MATCH (n:Neo)-[:FOLLOWS]->(o)
RETURN count(*), o.screen_name
ORDER BY count(*) desc
LIMIT 10

This query enumerates the most popular people among the Neo4J community based upon who my Neo seed nodes are following. And the results of this query look like this:

count(*) |  o.screen_name
---------+---------------
13       |  mesirii
12       |  emileifrem
12       |  jimwebber
12       |  digitalstain
11       |  apcj
11       |  cleishm
11       |  pandamonial
11       |  iansrobinson
11       |  p3rnilla
11       |  neo4j

As expected, plenty of these people are SeedNodes that I selected because I already knew them to be leaders in the community: mesirii, emileifrem, jimwebber, p3rnilla, neo4j. But who are these guys: digitalstain, apcj, cleishm, pandamonial, iansrobinson? After quickly looking them up on Twitter, I think we’ve discovered some new, key players in the Neo4J space.
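For readers who think more easily in Python than in Cypher, the popularity query above amounts to a simple tally. The sketch below is only an analogy under my own assumptions: most_followed is a hypothetical helper, and the graph is represented as a plain dictionary rather than anything pulled from Neo4J.

```python
from collections import Counter


def most_followed(follows, group, limit=10):
    """Rank accounts by how many members of `group` follow them.

    follows: dict mapping a screen name to the set of names it follows.
    group:   screen names carrying the label of interest (Neo in the post).
    Returns a list of (screen_name, count) pairs, highest count first.
    """
    tally = Counter()
    for member in group:                          # (n:Neo)
        for followed in follows.get(member, ()):  # -[:FOLLOWS]->(o)
            tally[followed] += 1                  # count(*) grouped by o
    # Sort by count descending, then by name so ties come out in a stable order.
    ranked = sorted(tally.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:limit]


if __name__ == "__main__":
    # Made-up toy graph, not real Twitter data.
    toy_follows = {"a": {"x", "y"}, "b": {"x"}, "c": {"x", "y", "z"}}
    print(most_followed(toy_follows, {"a", "b", "c"}, limit=2))
```

One difference worth noting: the Cypher query orders only by count(*), so the order among accounts with equal counts is not guaranteed there, whereas this sketch breaks ties alphabetically.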

Conclusion

This is only an intro to Neo4J. There are plenty of things that we could have talked about here: I could have gone into much more detail about the Cypher query syntax, I could have added indexes to speed up query times, and I could have put together some even crazier Cypher queries that make use of the broader Cypher syntax. But this is a good start. I think that you’ll agree: by looking at your own Twitter social graph, you’ll immediately think of questions that you want to ask and you’ll get a better understanding of what possibilities are out there.

Want to learn more about Cypher? Well I might just be co-authoring a book on that very subject! Stay tuned.

Update – Crowdsourcing a Collection of Key Community Figures

Apparently some people are already using this post to search through their own communities of interest. Let’s help each other out. If you’re tracking a community, then comment below with the Twitter screen names of the key figures from the community. I’ll edit the comments later to coalesce clean lists.



13 comments on “Quick Start with Neo4J using YOUR Twitter Data”

  1. Hmm, I’m getting the following error on the twitter scrapper script (which I’ve called neo4j.py here)

    LifeintheAirAge:python Administrator$ python neo4j.py
    Traceback (most recent call last):
    File “neo4j.py”, line 123, in <module>
    insert_user_with_friends(‘softwaredoug’,[“OSC”])
    File “neo4j.py”, line 100, in insert_user_with_friends
    create_or_get_node(twitter_user,user_labels)
    File “neo4j.py”, line 84, in create_or_get_node
    n=neo4j.CypherQuery(graph_db,query_string).execute_one(data)
    File “/Library/Python/2.7/site-packages/py2neo/neo4j.py”, line 1070, in execute_one
    return self.execute(
    params).data[0][0]
    File “/Library/Python/2.7/site-packages/py2neo/neo4j.py”, line 1061, in execute
    return CypherResults(self._execute(**params))
    File “/Library/Python/2.7/site-packages/py2neo/neo4j.py”, line 1043, in _execute
    raise CustomCypherError(e)
    py2neo.neo4j.SyntaxException: Invalid input ‘u': expected whitespace, comment or SET (line 3, column 19)
    ” ON CREATE u SET”
    ^

  2. I think the fix is,

    Line 62 should read

    ON CREATE SET (rather than ON CREATE U SET )

    Line 81 should read

    """ + (("ON MATCH SET\n u:"+',u:'.join(labels)) if labels else '') +"""

    rather than """ + (("ON MATCH u SET\n u:"+',u:'.join(labels)) if labels else '') +"""

  3. @AndyC Yup, you nailed it. This was one of the subtle changes that happened between 2.0.0-M06 and 2.0.0-RC1. Another nice syntax change is that you can now include simple matching in the MATCH clause and avoid the need for WHERE clauses. I’ll fix the blog post shortly.

  4. Key Community Figures

    OpenSource Connections

    jnbrymn dep4b danielbeach o19s scottstults softwaredoug patriciagorla jwoodell

    Solr

    treygrainger thelabdude gsingers otisg heismark _hossman irnnr LuceneSolrRev tflobbe dep4b jnbrymn softwaredoug sstults ErikHatcher lucidworks lucene_solr o19s

    Neo4J

    mesirii emileifrem jimwebber p3rnilla neo4j digitalstain apcj cleishm pandamonial iansrobinson peterneubauer

    Cassandra

    TheLastPickle PlanetCassandra pcmanus shirleman aaronmorton PatrickMcFadin DataStax patriciagorla jnbrymn AndyCobley spyced

    Data Science

    fivethirtyeight hmason John4man josh_wills peteskomoroch ptwobrussell strataconf ted_dunning wesmckinn

    WANTED

    • Hadoop
    • Angular
    • ElasticSearch

  5. Great Post! I’m currently populating the database with TwitterScraper.py and waiting on twitter limits-so no graphing yet. Everything else very easy to follow, and I’m brand new to Neo4J. The hardest part for me was installing Java SDK, actually. :) Thank you for the great tutorial!

  6. Hi John,

    I’m getting the error below on the twitter scrapper script – only change made to the code you provided was to swap in my keys.

    Environment is Mac running 10.8.5 with Neo4j installed via brew and Python 2.7 is via Anaconda.

    Traceback (most recent call last):
    File “TwitterScraper.py”, line 118, in <module>
    insert_user_with_friends(‘softwaredoug’,[“OSC”])
    File “TwitterScraper.py”, line 87, in insert_user_with_friends
    create_or_get_node(twitter_user,user_labels)
    File “TwitterScraper.py”, line 74, in create_or_get_node
    n=neo4j.CypherQuery(graph_db,query_string).execute_one(data)
    File “/Users/x/anaconda/lib/python2.7/site-packages/py2neo/neo4j.py”, line 1070, in execute_one
    return self.execute(
    params).data[0][0]
    File “/Users/x/anaconda/lib/python2.7/site-packages/py2neo/neo4j.py”, line 1061, in execute
    return CypherResults(self._execute(**params))
    File “/Users/x/anaconda/lib/python2.7/site-packages/py2neo/neo4j.py”, line 1043, in _execute
    raise CustomCypherError(e)
    py2neo.neo4j.SyntaxException: expected START or CREATE
    ” MERGE (u:User {id_str:{id_str}}) ”
    ^

  7. It looks like it’s erroring on the syntax. I suspect you’re using a different version of Neo than me? I’m using neo4j-community-2.0.0-RC1

    Also, just in case, here’s my requirements.txt file
    py2neo==1.6.1
    tweepy==2.1

  8. That was indeed the problem – the brew version is 1.9.5. Manual install on Mac is the only way to go until 2.0.0 is official I guess. Loading now. Thanks!

  9. Great! Once you’ve got it up and running, tell me if you come up with anything interesting. I came up with this last night.

    //crazy friend recommender
    //collect similar people from Neo community based upon who we follow
    match (a:SeedNode)-[:FOLLOWS]->(b)<-[:FOLLOWS]-(c:Neo)
    where a.screen_name="JnBrymn"
    with count(*) as count,c
    //sort them by number of friends we share and limit to top 10
    order by count desc
    limit 20
    //now find all the friends that all of them follow
    //so long as I don't already follow them
    match (c)-[:FOLLOWS]->(d),(f)
    where f.screen_name="JnBrymn"
    and not (f)-[:FOLLOWS]->(d)
    return count(*), d.screen_name
    order by count(*) desc
    limit 100
    

    It finds the people within the Neo community who are most similar to me based upon who we mutually follow. And then it finds a group of people that they most commonly follow – but I don’t. And then it sends me a list ordered by popularity.

  10. Pingback: How-to Guide: Explore your Twitter Network with Neo4j - Neo Technology

  11. Hey,

    I’m getting the following error when running the scraper script:

    Traceback (most recent call last):
    File “TwitterScraper.py”, line 26, in <module>
    “””).execute()
    File “/usr/local/lib/python2.7/site-packages/py2neo/neo4j.py”, line 1061, in execute
    return CypherResults(self._execute(**params))
    File “/usr/local/lib/python2.7/site-packages/py2neo/neo4j.py”, line 1043, in _execute
    raise CustomCypherError(e)
    py2neo.neo4j.SyntaxException: string matching regex `$' expected but `O' found

    Think we should have better error message here? Help us by sending this query to .

    Thank you, the Neo4j Team.

    ” CREATE CONSTRAINT ON (u:User) ”
    ^

  12. Oh good call. You get that error the second time you run the script because there already exists a constraint upon unique users. To keep things simple, just surround the statement with a try/except. I’ll fix the script.

  13. Hi,

    I found this tutorial, and although it looks exciting, I am having an issue with the TwitterScraper.py script.

    I have installed Python 2.7 and Neo4j Community 2.0.1, and set up Tweepy and Py2Neo on Windows 7. I have started the Neo4J server on localhost:7474. When I run the TwitterScraper.py script, it just freezes. If I Ctrl-C it, I get that it was interrupted at line 83, time.sleep(60 * 16). If I decrease the timeout to just 60, then it tells me that friends is used without being initialized first.

    Any help to get this running is appreciated.
    Thanks,
    Boris


Developed in Charlottesville, VA | ©2013 – OpenSource Connections, LLC