Charlottesville's Big Data Meetup – What is a Data Scientist?

January 31, 2013 Doug Turnbull
Category: Big Data

Scott and I ventured out of the office yesterday evening to check out a new group starting up: Charlottesville's Big Data Group. The most exciting thing about the group was how diverse it was. Folks came from all over the spectrum, from hardcore science, to bioinformatics, to entrepreneurs trying to build an awesome product, to folks like Scott and me focused on customer solutions.

The group engaged in a fascinating discussion on what exactly a “Data Scientist” does. Data science is a crucial facet of the Big Data world, so the conversation covered the roles and tools of the data scientist as well as the ways engineers support data scientists in their work.

What is a Data Scientist?

Something that really stood out from this discussion is how much a data scientist differs from the traditional “experimental” scientist. The traditional scientist starts from a hypothesis and builds an experimental design in which all the independent variables (those that can influence the outcome) can be controlled. So we have a closely watched, sterile lab where every variable that might influence a result, from humidity to temperature to smell, is tightly controlled. By changing just one or two of these variables while holding all the others steady, we can observe a resulting “dependent” variable and claim, with a strong likelihood and a measured magnitude, a causal relationship between the independent variable(s) we tweaked and the dependent variable we observed. So, for example, we could say that for every 10% increase in humidity, with all other variables steady, 5% more rats fail to solve a maze.

How does a data scientist differ? Well, it's still hypothesis-driven research. The difference is in the nature of the experimental design. Instead of running an experiment in a sterile lab where we carefully control everything from the humidity to the temperature to the expression on the experimenter's face, we have a giant mass of data, potentially from uncontrolled conditions. The experiment becomes a matter of combing through massive amounts of data after the fact. For example, we look for enough cases where rats ran through the maze with temperature, light level, and all other variables constant, and only the humidity varying. Because we simply have massive amounts of data, there are enough instances of all things being equal except humidity that we can go back and make a statistically significant assertion about what rats do when the humidity changes.
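To make the "all things being equal except humidity" idea concrete, here is a minimal Python sketch. The records, field names, and the `failure_rate_by_humidity` helper are all made up for illustration, not data or code from the meetup. It groups observations into strata that share every variable except humidity, then compares failure rates within each stratum. A real analysis would need far more data and a proper significance test.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observational records (invented for this sketch):
# (temperature_c, light_level, humidity_pct, solved_maze)
observations = [
    (20, "dim", 40, True),
    (20, "dim", 40, True),
    (20, "dim", 60, False),
    (20, "dim", 60, True),
    (25, "bright", 40, True),
    (25, "bright", 60, False),
    (30, "dim", 50, True),  # no matching run at another humidity
]

def failure_rate_by_humidity(records):
    """Group records into strata that share every variable except humidity,
    then return the maze failure rate per humidity level in each stratum."""
    strata = defaultdict(lambda: defaultdict(list))
    for temperature, light, humidity, solved in records:
        # 1 marks a failed run, 0 a solved one, so the mean is a failure rate.
        strata[(temperature, light)][humidity].append(0 if solved else 1)

    result = {}
    for key, by_humidity in strata.items():
        # A stratum only supports a comparison if humidity actually varies in it.
        if len(by_humidity) < 2:
            continue
        result[key] = {h: mean(fails) for h, fails in sorted(by_humidity.items())}
    return result

rates = failure_rate_by_humidity(observations)
for stratum, by_humidity in rates.items():
    print(stratum, by_humidity)
```

Strata where humidity never varies are dropped, which is why the approach leans on having a huge pile of data: most records have no usable partner.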

This is particularly useful when there is simply no way to control all the independent variables, and notions of causality are weaker. Take, for example, tracking children through education programs. There is simply no way to set up an experimental design where we force one set of children to undergo one set of circumstances and force another set of children to go through another. Moreover, we can't ethically create an experiment that ensures each child has the exact same socioeconomic background, home life, cultural background, exercise, environment, and all the other dozens of factors that might influence their educational outcome. So our only option is to collect tons and tons of data about kids and see what shakes out. There may be enough cases where every variable but one is held steady that a definite outcome can be measured.

Exploring the Data Scientist's Toolbox

All this talk about experimental design is fairly abstract, so it's important to keep in mind that warehousing and processing massive amounts of data efficiently to answer the data scientist's questions is in itself a hard problem. Through the Charlottesville Big Data group, Scott and I are very excited to learn more about how the diverse groups in the Big Data world deal with Big Data at scale, and how that contrasts with some of the customer-facing Big Data and distributed search problems we've encountered.

At the core of the problem is knowing which data structures are the best tool for the job. At OpenSource Connections, we have a pretty broad understanding of the strengths and weaknesses of various solutions such as distributed search, NoSQL databases, relational databases, and plain flat-file logs in Hadoop using MapReduce. Matching up hard data problems with the right solution using the right data structure requires collaboration between everyone. The discussions around what tools such a diverse group has used to solve its problems were, and will continue to be, a very powerful component of the group. For example, what are the similarities and differences between natural language processing and search at scale vs. collecting raw numeric statistics? Are there things that the two groups could learn from each other when it comes to how data is stored and processed?

Another crucial tool in the toolbox is data visualization. We're excited to contrast our experience in user-focused, discovery-oriented UI design with the visualizations scientists use to explore their data sets and reach interesting conclusions. One potentially fascinating question that came out of our discussions: how does one visualize the complete data set while still allowing exploration of features that are orders of magnitude smaller? This is of extreme interest when it comes to visualizing DNA or exploring features in astronomy. The issues of scale are massive and difficult for the human mind to comprehend spatially. Are there lessons about how this problem was solved that can inform others doing Big Data visualization?

Next Time – Tools & Workflow

We're thrilled to continue exploring these issues. At the next meetup, we're hoping to explore how everyone's workflow differs. What does the pipeline of a bioinformatics person look like? How does it differ from the pipeline we use when analyzing text before putting it into Lucene? Just as important, what tools are used along that pipeline? Are any of the tools favored by one domain potentially applicable to another?

In short, it's a very good time to be a data scientist or big data software engineer in Charlottesville. We hope you'll join us for the next meetup!