Classifying Non-Patent Literature to Aid in Prior Art Searches

September 29, 2013 John Berryman

Before a patent can be granted, it must be established that the innovation outlined by the patent application is indeed novel. Similarly, when defending one's own intellectual property against a non-practicing entity (NPE, also known as a patent troll), one often attempts to prove that the patent held by the accuser is invalid by showing that relevant prior art already exists and that the patent is actually not that novel.

Finding Prior Art

So where does one get hold of pertinent prior art? The most obvious place to look is in the text of earlier patent grants. If you can identify a set of reasonably related grants that covers the claims of the patent in question, then the patent may not be valid. In fact, if you are considering the validity of a patent application, then reviewing existing patents is certainly the first approach you should take. However, if you're using this route to identify prior art for a patent held by an NPE, then you may be fighting an uphill battle. Consider that a very bright patent examiner has already taken this approach, and after an in-depth examination process that turned up no relevant prior art, the patent office granted the very patent that you seek to invalidate.

But there is hope. For a patent to be granted, it must not only be novel among the roughly 10 million US patents that currently exist, but it must also be novel among all media published prior to the application date, the so-called non-patent literature (NPL). This includes conference proceedings, academic articles, weblogs, and even YouTube videos. And if anyone, including the applicant, publicly discloses information critical to the patent's claims, then the patent may be rendered invalid. As a corollary, if you are looking to invalidate a patent, then looking for prior art in non-patent literature is a good idea! While tools are available to systematically search through patent grants, it is much more difficult to search through NPL. And if the patent in question truly is not novel, then evidence must surely exist, if only you knew where to look.

Locating Prior Art in Non-Patent Literature via Automatic Classification

Knowing where to look for prior art within NPL is the big challenge. The goal of this post is to put you on the right path by proposing a method for automatically classifying NPL as if it were part of the corpus of patents. First, a little background on patent classification: patents are a very well curated set of documents. Each patent is tagged with one or more classifications from a very large, hierarchical set of patent classifications. The set of classifications is so large, in fact, that the specification text of the classifications alone consumes more than 100MB of disk space. Large and cumbersome though it is, this classification system brings order to the chaos and makes it much easier for patent examiners to find the documents they are looking for and to do their job.

If NPL were tagged with the same classifications as the actual patents, then it stands to reason that patent researchers would have a much easier time identifying pertinent prior art. Rather than having to shovel through all NPL, the researcher could at least narrow the corpus down to only those areas of art that are pertinent to the IP they are investigating.
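To make the narrowing step concrete, here is a minimal sketch of filtering a tagged corpus by a branch of a hierarchical classification. The slash-separated class codes, the document titles, and the helper names `in_class` and `narrow` are all hypothetical, invented for illustration; real patent classification codes use a different notation.

```python
# Hypothetical NPL corpus, each document already tagged with made-up
# hierarchical class codes (path-like strings, most general part first).
CORPUS = [
    {"title": "Conference paper on lithium anodes",
     "classes": ["ELEC/BATTERY/LITHIUM"]},
    {"title": "Blog post on search ranking",
     "classes": ["COMPUTING/SEARCH"]},
    {"title": "Video on battery chargers",
     "classes": ["ELEC/BATTERY/CHARGING", "ELEC/POWER"]},
    {"title": "Article on passive sensors",
     "classes": ["ELEC/SENSORS"]},
]


def in_class(code, prefix):
    """True if `code` is `prefix` itself or sits anywhere beneath it."""
    return code == prefix or code.startswith(prefix + "/")


def narrow(corpus, prefix):
    """Keep only documents with at least one class under `prefix`."""
    return [doc for doc in corpus
            if any(in_class(code, prefix) for code in doc["classes"])]


if __name__ == "__main__":
    for doc in narrow(CORPUS, "ELEC/BATTERY"):
        print(doc["title"])
```

Matching on the prefix plus a separator, rather than on the bare prefix, keeps a sibling class whose name merely starts with the same characters from slipping into the results.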

It turns out that automatically tagging NPL may not be as hard as you would expect. We already have a training set: the 10 million hand-tagged US patents serve this purpose well! So, taking a chapter out of Taming Text, it is fairly straightforward to create a classification engine using Solr and a programming language of your choice. As a matter of fact, we've already created a post detailing exactly how to accomplish this.
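As a rough illustration of the idea, the sketch below classifies a new document by finding the most similar already-tagged documents and letting them vote. It is not the Solr-based engine from the post mentioned above: it assumes a k-nearest-neighbor approach and swaps Solr for a tiny in-memory TF-IDF index so that it runs on its own. The training texts are made up, the class codes are only illustrative labels, and the function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Made-up stand-ins for hand-tagged patents: (text, classification label).
TRAINING = [
    ("rechargeable lithium battery cell with anode and cathode electrode", "H01M"),
    ("battery electrode coating for lithium ion cell", "H01M"),
    ("search engine index ranking documents for a text query", "G06F"),
    ("database query index for retrieving text documents", "G06F"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def weigh(counts, idf):
    """Turn term counts into a unit-length TF-IDF vector (a dict)."""
    vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(training):
    """Compute IDF weights and one TF-IDF vector per training document."""
    docs = [(Counter(tokenize(text)), label) for text, label in training]
    doc_freq = Counter()
    for counts, _ in docs:
        doc_freq.update(counts.keys())
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return idf, [(weigh(counts, idf), label) for counts, label in docs]


def classify(text, idf, vectors, k=3):
    """Rank labels by the summed cosine similarity of the k nearest docs."""
    query = weigh(Counter(tokenize(text)), idf)
    scored = sorted(
        ((sum(w * vec.get(t, 0.0) for t, w in query.items()), label)
         for vec, label in vectors),
        reverse=True,
    )
    votes = defaultdict(float)
    for score, label in scored[:k]:
        votes[label] += score
    ranked = sorted(votes.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, score) for label, score in ranked if score > 0]


if __name__ == "__main__":
    idf, vectors = build_index(TRAINING)
    print(classify("a new anode material for lithium battery cells", idf, vectors))
```

With a real search engine in place of the in-memory index, the same shape holds: submit the NPL text as a query against the indexed patents, take the top hits, and tally their classifications.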

Naturally, building a full-fledged non-patent literature research engine will take plenty of effort. Part of building this particular classifier is to first index all available patents into Solr, and from our experience, this is a grand challenge in and of itself. Patent storage formats have changed significantly over the years. And it is currently not even feasible to index all documents, because patents prior to 1970, if they're digitized at all, are just images of the patent pages themselves. Indexing these documents would require a preliminary OCR process before there is even text to index. Once the classifier is built, you then need to set up a pipeline to get NPL documents to the classifier. This likely means pulling in documents from even more disparate sources, starting with peer-reviewed publications. And finally, once documents are being classified, getting things "just right" will take a significant amount of tuning. For instance, you may need to build a significant list of synonyms (though we have another post that might help there).
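To give a feel for the synonym tuning mentioned above, here is a minimal sketch that maps variant terms onto one canonical term before text reaches a classifier. The synonym table and the `normalize` function are hypothetical; a real list would be far larger and would also need to handle multi-word phrases.

```python
import re

# Hypothetical synonym table: variant term -> canonical term.
SYNONYMS = {
    "accumulator": "battery",
    "cell": "battery",
    "automobile": "vehicle",
    "car": "vehicle",
}


def normalize(text, synonyms=SYNONYMS):
    """Tokenize the text and replace each known variant with its canonical term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [synonyms.get(token, token) for token in tokens]


if __name__ == "__main__":
    print(normalize("Accumulator for an automobile"))
```

Applying the same normalization to both the indexed patents and the incoming NPL text helps documents that use different vocabulary for the same concept land near each other.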

The main point is not that we can make this easy, but rather that we can at least help make the process more manageable. Are you working through similar problems related to IP research? We would love to hear about it!
