Indexing Information about people needs a “time axis”

Eric Pugh, December 9, 2008

I bet most people have done a vanity search on Google for themselves; I know I have. The problem with most indexing systems is that they go out and collect lots of information, but most of that information doesn’t have any sense of time. The results are just random points of data. But as the years pass and we put more of ourselves on the Internet, what we mostly want is not a picture built from ALL the data about a person, but a picture of that person based on the data that is applicable RIGHT NOW. For example, in my vanity search today, my blog on JRoller comes up first. But I haven’t blogged there since March 2007, and the things I am interested in right now are better represented by links 2, 8, and 9: respectively, my company OpenSource Connections, Open Source in the Federal Government, and Ruby on Rails. We need to be able to cluster and show data about people, but also to plot it on a timeline so that older data doesn’t overwhelm newer information. On Amazon, I am still getting recommendations for Java books, even though I am a Rubyist now!
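One common way to keep older data from overwhelming newer data is to multiply each result's relevance by a weight that decays with age. The sketch below is only an illustration of that idea, not how Google or Amazon actually rank anything: the function names, the one-year half-life, and the relevance numbers and dates in the example are all made-up assumptions.

```python
from datetime import date

def recency_weight(observed: date, today: date, half_life_days: float = 365.0) -> float:
    """Return a weight in (0, 1] that halves every `half_life_days` of age."""
    # Items dated in the future are treated as brand new (age 0).
    age_days = max((today - observed).days, 0)
    return 0.5 ** (age_days / half_life_days)

def rank_by_recency(items, today: date, half_life_days: float = 365.0):
    """Sort (label, relevance, observed_date) tuples by relevance * recency weight."""
    return sorted(
        items,
        key=lambda item: item[1] * recency_weight(item[2], today, half_life_days),
        reverse=True,
    )

# Hypothetical data: the relevance scores and last-activity dates are invented.
today = date(2008, 12, 9)
results = [
    ("JRoller blog", 1.0, date(2007, 3, 1)),
    ("OpenSource Connections", 0.6, date(2008, 11, 1)),
]
ranked = rank_by_recency(results, today)
```

With these invented numbers, the stale blog's higher raw relevance is discounted enough that the more recent company link sorts first. The half-life is the knob: a shorter one favors "right now" more aggressively, while a longer one behaves more like a plain relevance sort.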

Of course, adding “time” to data is hard. For some data, you could derive the date from the context it appears in: “I graduated in 1994”. Alternatively, you could try to infer the date from when the content was created, as with an RSS article. Or, hopefully, there is some sort of meta tag specifying when the content was created.
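The three options above (a meta tag, an RSS publication date, a year mentioned in the text) could be tried in turn, keeping track of which one produced the date. This is a minimal sketch using only the Python standard library. The function name, the source labels, and the priority order are assumptions made for illustration, and note that a year found in the text describes the event being talked about, while the other two describe when the document was created.

```python
import re
from datetime import date, datetime
from email.utils import parsedate_to_datetime
from typing import Optional, Tuple

# Matches a standalone four-digit year from 1900 to 2099.
YEAR_IN_TEXT = re.compile(r"\b(19\d{2}|20\d{2})\b")

def infer_date(
    text: str,
    rss_pub_date: Optional[str] = None,
    meta_date: Optional[str] = None,
) -> Optional[Tuple[date, str]]:
    """Return (date, source_label) from the first source that parses, else None."""
    if meta_date:
        try:
            # Assumes the meta tag holds an ISO 8601 value such as "2008-12-09".
            return datetime.fromisoformat(meta_date).date(), "meta-tag"
        except ValueError:
            pass
    if rss_pub_date:
        try:
            # RSS pubDate uses the RFC 822 style, e.g. "Tue, 09 Dec 2008 10:00:00 GMT".
            return parsedate_to_datetime(rss_pub_date).date(), "rss-pubdate"
        except (TypeError, ValueError):
            pass
    match = YEAR_IN_TEXT.search(text)
    if match:
        # Only the year is known, so January 1 is a placeholder, not a real date.
        return date(int(match.group(1)), 1, 1), "year-in-text"
    return None

guess = infer_date("I graduated in 1994")
```

Returning the source label alongside the date lets later code treat a year pulled out of a sentence as much fuzzier than an explicit timestamp.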

For HighTechCville, my research project, I am struggling with the fact that a lot of the company address data in HTC is out of date, because the data source, a survey taken by CBIC, is a couple of years old and is itself based on older tax records. While I am preserving metadata about when information is added and changed, that doesn’t really give me a “true” sense of what “date” goes with each data source.

A recent article on O’Reilly by Nick Bilton talks about the value of Twitter having a constant stream of information that CAN be dated: it’s all real-time “what am I interested in Right Now”, and so it can provide that timeline of changing user data. But for a project like HTC that is trying to infer that information after the fact, it’s a lot harder, and a lot fuzzier!

If you have any great suggestions, please leave them in the comments!



