You've got a problem: you have 1 buzzillion documents that must all be classified. Naturally, tagging them by hand is completely infeasible. However, you are fortunate enough to have several thousand documents that have already been tagged. So why not…
Build a k-Nearest Neighbors Classifier!
The concept of a k-NN document classifier is actually quite simple. Basically, given a new document, find the k most similar documents within the tagged collection, retrieve the tags from those documents, and declare the input document to have whichever tag was most common among the similar documents. Now, taking a page from Taming Text (page 189 to be precise), do you know of any open-source products that are really good at similarity-based document retrieval? That's right: Solr! Basically, given a new input document, all we have to do is scoop out the "statistically interesting" terms, submit a search composed of these terms, and count the tags that come back. And it even turns out that Solr takes care of identifying the "statistically interesting" terms. All we have to do is submit the document to the Solr MoreLikeThis handler. MoreLikeThis then scans through the document and extracts "Goldilocks" terms: those terms that are not too long, not too short, not too common, and not too rare… they're all just right.
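To make the voting step concrete, here is a toy sketch (with made-up neighbor tags) of what "count the tags and take the most common" means; the real, Solr-backed version comes later in the post:

```python
from collections import Counter

# Pretend these are the tags of the k=5 most similar tagged documents:
neighbor_tags = ["star-trek", "star-trek", "holodeck", "star-wars", "star-trek"]

# Majority vote: the most common tag wins.
print(Counter(neighbor_tags).most_common(1))  # [('star-trek', 3)]
```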
Finding a Data Set and Prepping Solr
For this little experiment we need a good example data set. Let's use the same Sci-Fi StackExchange dataset that we've used several times in the past. This is a great dataset; it contains roughly 18,000 question-and-answer documents that cover every corner of sci-fi imaginable. It's nerdy. I love it. And importantly for the task at hand, of the 18,000 documents, roughly 6,000 are questions that come with a Tags field. This is precisely what we need to build our Solr/Python-powered document classifier. To give you a little idea of what you'll find in the dataset, consider document 8723, which we will use further down as our example:
- Title: Would Star Trek holodecks physically affect you once you exit the Holodeck?
- Body: Would Star Trek holodecks physically affect you once you exit the Holodeck? Meaning, if someone programmed a holodeck to dump a bucket of water over your head, would you have wet hair (outside later on)?
- Tags: [star-trek] [holodeck]
… I warned you that this is deep nerdery!
If you wish to follow along, pull down the Solr Sci-Fi git repo. In the README you'll find the directions for indexing the documents. The only additional setup required is to enable the MoreLikeThis requestHandler within solrconfig.xml.
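For reference, in Solr versions of this era the registration in solrconfig.xml can be as simple as the line below; the handler name /mlt matches the URL used in the next section, but check the Solr wiki for the exact form your version expects:

```xml
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler"/>
```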
Using Solr's MoreLikeThis Handler
Now (after having restarted Solr) you can issue MoreLikeThis queries like this:
```
http://localhost:8983/solr/mlt    # Use MoreLikeThis
  ?q=Id:8723                      # Classify document 8723
  &mlt.fl=Title Body              # Use these text fields for classification
  &fl=Tags                        # Only display the Tags field
  &rows=10                        # Get the 10 most similar documents
  &fq=Tags:*                      # Only retrieve documents that have Tags
  &mlt.interestingTerms=list      # Display statistically interesting words
```
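If you'd rather follow along from Python (as we will when we build the classifier), the same query can be issued with the requests library; everything here simply mirrors the URL above:

```python
import requests

# The same MoreLikeThis query as above, expressed as request parameters.
params = {
    "q": "Id:8723",                  # classify document 8723
    "mlt.fl": "Title Body",          # fields to mine for interesting terms
    "fl": "Tags",                    # only return the Tags field
    "rows": 10,                      # the 10 most similar documents
    "fq": "Tags:*",                  # only documents that have tags
    "mlt.interestingTerms": "list",  # show the interesting terms
    "wt": "json",                    # JSON instead of XML
}
resp = requests.get("http://localhost:8983/solr/mlt", params=params)
print(resp.json()["response"]["docs"])
```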
This query basically identifies which document we intend to classify, specifies which fields to use for classification, and displays all the tags for the 10 most similar documents. Here's what the results look like:
```
match:
  star-trek holodeck

response (Tags of the 10 most similar documents):
  star-trek star-trek-tng holodeck
  star-trek star-trek-tng holodeck
  star-trek holodeck
  star-trek
  star-trek star-trek-tng technology holodeck
  star-trek holodeck
  star-trek holodeck
  star-trek time-travel
  harry-potter diagon-alley magical-transportation
  star-trek technology weapon

interestingTerms:
  holodeck exit affect physic trek star
```
The first section, named "match", contains the Tags field of the document that we're searching for. Obviously, in practice our documents aren't going to go into the tagger already pre-tagged (duh), but when building a simple classifier for the purpose of testing the concept, this is the easiest way to go. In production you would post text to the MoreLikeThis handler as a content stream, as mentioned in the Solr wiki.
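For instance, something along these lines should work for raw text (a sketch, not tested against every Solr release; depending on your version, content streams may need to be enabled in solrconfig.xml):

```python
import requests

# Classify raw, untagged text by POSTing it to the MoreLikeThis handler
# as a content stream instead of looking up an indexed document by id.
doc_text = "Would Star Trek holodecks physically affect you once you exit?"
params = {
    "mlt.fl": "Title Body",
    "fl": "Tags",
    "rows": 10,
    "fq": "Tags:*",
    "wt": "json",
}
resp = requests.post("http://localhost:8983/solr/mlt",
                     params=params,
                     data=doc_text.encode("utf-8"),
                     headers={"Content-Type": "text/plain"})
print(resp.json()["response"]["docs"])
```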
The next section, named "response", lists the Tags field of the 10 most similar documents. And just as we'd hoped, the most common tag is [star-trek], and the second most common tag is [holodeck]! Now obviously, this isn't doing any classification just yet; that's what we're going to use Python for soon.
The final section lists all of the statistically interesting terms. Though not strictly necessary, this provides insights into how MoreLikeThis is working and how it may be tuned for better precision or recall. For more information on tuning MoreLikeThis, please again refer to the Solr wiki.
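For example, parameters like the following can be appended to the query to control which terms qualify as "interesting" (the values here are illustrations, not recommendations):

```
&mlt.mintf=2     # Term must appear at least twice in the source document
&mlt.mindf=5     # ...and in at least 5 documents in the index
&mlt.minwl=3     # Ignore words shorter than 3 characters
&mlt.maxqt=25    # Cap the number of query terms
&mlt.boost=true  # Boost query terms by their "interestingness"
```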
Building and Demonstrating the Actual Classifier
Now for the fun part! Having indexed the documents and demonstrated the functionality of the MoreLikeThis handler, it's conceptually straightforward to put it all together. While you can find all of the code required to create the classifier in my IPython notebook, the crux of it is here:
```python
from collections import defaultdict

import requests

sess = requests.Session()  # shared HTTP session for talking to Solr

def classifyDoc(self, docId, method="best"):  # or "sorted" or "details"
    # Send the MLT query to Solr
    params = {"q": self.idField + ":" + docId,
              "mlt.fl": self.mltFields,
              "fl": self.tagField,
              "fq": self.filterQuery,
              "rows": self.numNearest,
              "wt": "json"}
    resp = sess.get(url=self.mltUrl, params=params)

    # Perform error checking
    if resp.status_code != 200:
        raise IOError("HTTP Status " + str(resp.status_code))
    json = resp.json()
    if int(json["match"]["numFound"]) == 0:
        raise RuntimeError("no document with that id")
    if int(json["response"]["numFound"]) == 0:
        raise RuntimeError("no interesting terms in document")

    # If no errors, then collect and count tags for each similar document
    tagDict = defaultdict(int)
    for tagList in json["response"]["docs"]:
        for tag in tagList[self.tagField].split(' '):
            tagDict[tag] += 1

    # Return the best tag, all of the tags sorted best to worst,
    # or the dict of tags and their counts
    if method == "best":
        return max(tagDict, key=tagDict.get)
    elif method == "sorted":
        return sorted(tagDict, key=lambda x: tagDict[x], reverse=True)
    elif method == "details":
        return tagDict
```
Upon quickly reading through the code, you'll see that there's really not much to it. It pulls down the same MoreLikeThis results as listed above, counts up the tags, and returns the most common tag.
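In the notebook this method lives on a small classifier class that isn't reproduced above, so here is a hypothetical wrapper consistent with the attribute names used in classifyDoc (the class name, constructor, and argument values are my assumptions, chosen to match the query we ran earlier):

```python
# Hypothetical wrapper class; the notebook's actual constructor may differ.
class SolrTagClassifier:
    def __init__(self, mltUrl, idField, mltFields, tagField,
                 filterQuery, numNearest):
        self.mltUrl = mltUrl            # URL of the MoreLikeThis handler
        self.idField = idField          # unique-key field in the schema
        self.mltFields = mltFields      # fields MLT mines for terms
        self.tagField = tagField        # field holding the tags
        self.filterQuery = filterQuery  # restrict results to tagged docs
        self.numNearest = numNearest    # k, the number of neighbors

    classifyDoc = classifyDoc  # attach the function above as a method

c = SolrTagClassifier(mltUrl="http://localhost:8983/solr/mlt",
                      idField="Id",
                      mltFields="Title Body",
                      tagField="Tags",
                      filterQuery="Tags:*",
                      numNearest=10)
```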
But the question remains… does it work? To test this, I first instantiated our SciFi classifier and checked it against our example document from above:
```python
print(c.classifyDoc("8723", method="sorted"))
```
Sure enough, the first two items in the list are [star-trek] and [holodeck]. Great, now let's test this against a larger set. First, though, we need to clearly define what a correct classification entails so that we will understand how to interpret our results. Therefore, given a classifier tag and a list of true tags, we define a correct classification to have occurred whenever the classifier tag is contained within the true tags. (And referring to the IPython notebook, you can see the classifierTester function in which this is all defined.)
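The classifierTester function itself isn't reproduced here, but a minimal sketch of the idea might look like this (taggedDocs and its (docId, trueTags) shape are assumptions, not the notebook's actual code):

```python
# Minimal sketch of a classifierTester-style loop. Assumes taggedDocs is a
# list of (docId, trueTags) pairs pulled from the tagged questions.
def testClassifier(classifier, taggedDocs):
    correct = 0
    for docId, trueTags in taggedDocs:
        try:
            if classifier.classifyDoc(docId, method="best") in trueTags:
                correct += 1  # the predicted tag is among the true tags
        except RuntimeError:
            pass  # skip documents with no usable terms
    pct = 100.0 * correct / len(taggedDocs)
    print("%d out of %d correct. That's %s%%" % (correct, len(taggedDocs), pct))
```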
Running our classifier over all tagged documents, we see the following printout:

```
4041 out of 5805 correct. That's 69.6124031008%
```
70%: not bad, especially when you consider the vagaries of some of these tags, and the relatively high possibility that some of these questions were themselves tagged incorrectly to begin with. To get an idea of how accurate the classifier can be, I performed several tests on more constrained document sets. I found that if we only look at the questions for the more popular topics (Harry Potter, story identification, Star Trek, and Star Wars), then the accuracy jumps up considerably:
```
2136 out of 2344 correct. That's 91.1262798635%
```
Obviously, we can't know a priori that a given document is in this more popular subset, but this does bring up an important point: k-NN classification will be most accurate for documents that are representative of the training data set.
Conclusions
Easy, wasn't it? But there are still several things that you can do to improve the classifier for your purposes. There is still further information available in the data that we're not making use of! In particular, it shouldn't be difficult to extend the current classifier so that it not only classifies the document, but also conveys information about the certainty of its prediction. This would be especially useful in building a human-in-the-loop classifier: basically, you allow the computer to classify all the documents that are "easy" and then have a human review all the documents that the classifier is less certain about.
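As a starting point, here is one simple way certainty could be estimated from the classifier we already have: take the winning tag's share of all neighbor votes via the "details" output (the 0.5 threshold below is purely illustrative):

```python
# Estimate certainty as the winning tag's share of all neighbor votes.
def classifyWithConfidence(classifier, docId):
    counts = classifier.classifyDoc(docId, method="details")
    best = max(counts, key=counts.get)
    certainty = float(counts[best]) / sum(counts.values())
    return best, certainty

tag, certainty = classifyWithConfidence(c, "8723")
if certainty < 0.5:  # illustrative threshold; tune for your data
    print("route to human review:", tag, certainty)
else:
    print("auto-classified as:", tag)
```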
Tell us what you think! Do you have a classification problem that you're trying to solve? What is it? Have you found this article helpful? We want to hear about it!