As a search guy, I think of data as something to exploit in order to solve some business objective, with a focus on moving data along the axis of data becoming information in order to facilitate understanding.
This week I had the opportunity to finally visit Austin, TX, a city I've been hearing great things about, when I attended the Enterprise Data World conference.
I was a bit of an oddball at the conference, since the focus was very much on the state of your data, and much less on the exploitation of your data to drive your business. The primary concern of the data folks is to make data as perfect as possible, in and of itself, with the idea that the rest comes out of that naturally. I see a list of mailing addresses and think “Hey, can we figure out which cities are represented the most?”, whereas a data person is focused more on “here is a list of addresses, are they accurate and up to date?”. As someone who deals with messy data, I applaud folks who try to fix it, though it seems like a Sisyphean task!
My talk was on GPSN, a search engine for Chinese patents run by the USPTO. While the audience was small, I had some wonderful questions at the end, and the discussion turned to eDiscovery and what is preventing many of the concepts that have been so successful in eDiscovery from spreading to other areas of search. I didn't catch the name of the woman who brought up eDiscovery, but she had some great ideas, and I appreciated the discussion. My slides, for some reason, didn't show up on the website, so I've posted the PDF of them on Slideshare:
In talking to other attendees, I saw a difference between the world I normally operate in and the world most of the other attendees operate in: most of my data is very messy, and I focus on massaging that messy data to pull out reasonable metadata that gives me more structure, and thus moves my data along the axis to the “information” stage. The concern of most of the attendees, by contrast, isn't to extract metadata, but to corral all the various forms of what is fairly structured data into one cohesive whole. They are often drowning in too much metadata! Hence topics like Master Data Management came up frequently. In search we typically just throw it all into a single index, and let the queries sort it out! That level of care and detail in curating all the data isn't normally part of the scope of work.
A couple of other things jumped out at me. I attended the panel “Enterprise Semantics: Can Smarter Data Really Add Value?” to see if the Semantic Web has become any easier to use. I shuddered when one of the panelists, when asked about success stories, listed some companies who “went all in on semantic web technologies”. There remains no reasonable path, in general, for folks to dip a toe into the semantic web and then gradually adopt it. It remains an all or nothing proposition. There are amazing things we could do if we had a better understanding of what people are asking for, but there isn't any reasonable path forward. And yes, Siri is a great example of semantics being applied. But most companies are not Apple! I remain interested in seeing if some of the efforts like Schema.org lower the barrier to using semantic technologies.
The other session that I really enjoyed was “MDM: Master Data Management or Massive Data Mistake?”. While the end of it was a marketing pitch for EnterpriseWeb, the focus of the talk was on the fact that the world is not a static place. And yet, many of the approaches that we have for dealing with complexity attempt to push the world into a more static place. As Jason Bloomberg (@TheEbizWizard) put it, we define contracts between systems. We come up with DTDs. Ideally everything is very flexible, but building something in a very flexible manner is expensive. Expensive to test. Expensive to code. Expensive to think about! His solution is something they call “Extreme Late Binding”, where every time a system interacts with another system, you look up all the schemas for your data, at that immediate time, and then use that to figure out your execution path. Think of the goal processing logic of Ant, applied to data movements instead of compiling code. We've been doing some Apache Camel work, and while it is powerful, I can see how at a certain point the massive number of interactions causes everything to break down. This “extreme late binding” makes me think of HATEOAS (or hypermedia-driven systems), and he did confirm that much of the lookup logic in their product is based on HATEOAS principles. What struck me about his talk was his point that the world is not static, yet we whiteboard out systems as if it were. Search brings some very powerful tools to deal with the messiness of data.
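To make the “look up the schema at the moment of interaction” idea concrete, here is a toy Python sketch. It is only my illustration of the general late-binding idea, not how EnterpriseWeb works: the `SCHEMA_REGISTRY` dictionary, the `lookup_schema` and `send` functions, and the field names are all made up for the example, and a real system would presumably query a live service instead of a dictionary.

```python
from typing import Any, Dict

# Hypothetical in-memory schema registry, standing in for a live lookup service.
# Each entry maps a source field name to the field name the receiver expects.
SCHEMA_REGISTRY: Dict[str, Dict[str, str]] = {
    "customer": {"name": "full_name", "zip": "postal_code"},
}


def lookup_schema(entity: str) -> Dict[str, str]:
    """Fetch the field mapping as it exists right now; nothing is cached."""
    return dict(SCHEMA_REGISTRY[entity])


def send(entity: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a record using whatever schema is registered at call time."""
    schema = lookup_schema(entity)
    return {
        target: record[source]
        for source, target in schema.items()
        if source in record
    }


record = {"name": "Ada", "zip": "78701"}
first = send("customer", record)

# The receiving system changes its contract; the caller's code does not change.
SCHEMA_REGISTRY["customer"]["zip"] = "zip_code"
second = send("customer", record)
```

The second call picks up the renamed field without any redeploy, which is the appeal. The price is a lookup on every single interaction, which is presumably where the engineering effort in a real product goes.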
Lastly, I had a great conversation with a guy who does significant data work with McDonald's. I learned a lot about a company that I thought I knew because I see the Golden Arches everywhere, but I had no idea how much goes into making a hamburger and fries from a data perspective!
I think I got a bit on my soapbox about how we need to be able to put a “confidence” ranking on our data. A frequent issue that we see is that we have one batch of data that has lots of noise, but a couple of golden nuggets. Another batch of data is fairly clean, but fairly boring. Right now, when we return results, unless you are really familiar with the data, you don't know that the hits from the noisy dataset may be much more suspect than the hits from the boring one. We need to have a confidence ranking on our results that tells our users that this result may be perfect, or a complete red herring! To get to this level of sophistication, we all need to become data people. Business users who are data savvy will succeed over those who just accept a report at face value.
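Here is a rough Python sketch of what I mean by a confidence ranking, under some big assumptions: every hit knows which batch it came from, and someone who knows the data has assigned each batch a trust value between 0 and 1. The `SOURCE_CONFIDENCE` table, the `Hit` class, the `annotate` function, and the numbers in them are all hypothetical, not something any search engine gives you out of the box.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical per-source trust values in [0, 1], assigned by people who
# know the data, not computed by the search engine.
SOURCE_CONFIDENCE: Dict[str, float] = {
    "noisy_batch": 0.4,
    "clean_batch": 0.9,
}


@dataclass
class Hit:
    doc_id: str
    source: str       # which batch of data the document came from
    relevance: float  # score returned by the search engine


def annotate(hits: List[Hit], default: float = 0.5) -> List[dict]:
    """Attach a confidence value to each hit without changing the ranking."""
    return [
        {
            "doc_id": h.doc_id,
            "relevance": h.relevance,
            "confidence": SOURCE_CONFIDENCE.get(h.source, default),
        }
        for h in hits
    ]


hits = [
    Hit("a", "noisy_batch", 9.1),
    Hit("b", "clean_batch", 7.3),
    Hit("c", "unknown_batch", 1.0),
]
results = annotate(hits)
```

I deliberately left the ranking alone rather than multiplying relevance by confidence. The golden nuggets in the noisy batch should still be able to reach the top; the user just needs to see that they come with a warning label.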
Austin was a great city to host a conference, though I wish the hotel had been more centrally located. I really enjoyed my time there, and it was great to catch up with folks!