Recap of VLDS Insights Conference - OpenSource Connections

July 7, 2015 Eric Pugh
Category: Lucene

VLDS Insights Conference Recap

Yesterday I had the privilege of attending the VLDS Insights conference. VLDS is the Virginia Longitudinal Data System, which provides researchers and policy makers anonymized data collected from Virginia’s schools and workforce training initiatives.

We do quite a bit in the analytics space, even in our search engine work, so I was curious to learn more about this very large data set, and how such valuable data is disseminated without violating the valid privacy concerns around school data. I learned about different approaches used by the researchers working with the data, and how the data was hashed to provide a “double blind” data set.
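The general idea of linking records on hashed identifiers can be sketched as follows. This is a minimal illustration, not the actual VLDS scheme: the keyed hash (HMAC-SHA256), the shared key, the `name|birthdate` identifier format, and all record contents are assumptions made up for the example.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Return a keyed hash of an identifier (hypothetical scheme, not VLDS's)."""
    # Normalize so trivial formatting differences do not break the match.
    normalized = identifier.strip().lower()
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key that the data-contributing agencies would share.
shared_key = b"hypothetical-shared-secret"

# Each agency hashes its own identifiers before contributing records.
school_records = {
    pseudonymize("Jane Doe|2001-04-12", shared_key): {"grade": 8},
}
workforce_records = {
    pseudonymize("jane doe|2001-04-12 ", shared_key): {"program": "welding"},
}

# Records are joined on the hashed key; the raw identity never appears.
linked = {
    key: (school_records[key], workforce_records[key])
    for key in school_records.keys() & workforce_records.keys()
}
```

The researcher sees only the hashed key and the linked attributes, which is the property a scheme like this is meant to provide.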

Learning about the VR ROI Project.

I sat in on a fascinating talk about how to measure the return on investment for some of these large programs. My first challenge was to figure out what the acronym VR meant! It turns out VR means Vocational Rehabilitation.

The project started in 2010, and was meant to: test the applicability of a valid, methodologically rigorous process for assessing ROI at the state agency level.

The speaker also stated that this was:

Not your grandfather’s ROI.

The project focuses on program-level ROI, wrapping in all of the costs. It looks at the impact of VR from the very beginning, from the applicant’s perspective, rather than being based only on the people who graduate from the program. It is able to provide an ROI for each individual person, not just for groups of people.

This translates into having a LOT of depth in your ROI calculations. For example, when you look at “Training”, it turns out that in the short run, measured by quarterly earnings, training is bad. But on a longer term basis, it’s very valuable. That makes sense, because while a person is in training, they are not earning as much. While this might be a “blindingly obvious” conclusion, it means that someone embarking on training needs to plan for reduced income.

Being able to calculate ROI allows you to make an elevator pitch like this:

80% of VR applicants in 2000 earned more as a result of VR services. For every $1,000 spent by DARS, the average (median) consumer earned $7,100 more over 10 years. The top 10% earned $45,100 (or more) over the same period.

This is empirical knowledge about the value of the programs, not just intuitive knowledge.
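To make the shape of that pitch concrete, here is a minimal sketch of how per-person figures could roll up into a median “earnings gain per $1,000 spent” and a share of applicants who came out ahead. The ten applicants and their numbers are invented for illustration and are not the study’s data; the function name and the simple gain-per-cost definition are also assumptions, not the project’s actual methodology.

```python
from statistics import median


def gain_per_1000(cost: float, earnings_gain: float) -> float:
    """Earnings gain attributed to services per $1,000 of program cost."""
    return 1000 * earnings_gain / cost


# Hypothetical applicants: (program cost, 10-year earnings gain) in dollars.
applicants = [
    (2000, 14000),
    (1000, -500),
    (4000, 8000),
    (500, 3000),
    (2500, 50000),
    (1000, 9000),
    (3000, -3000),
    (2000, 16000),
    (1500, 4500),
    (1000, 12000),
]

# Because ROI is computed per person, any summary statistic can be
# derived afterwards instead of being limited to one group average.
per_person = [gain_per_1000(cost, gain) for cost, gain in applicants]

median_gain = median(per_person)
share_better_off = sum(1 for g in per_person if g > 0) / len(per_person)

print(f"Median gain per $1,000 spent: ${median_gain:,.0f}")
print(f"Share of applicants who earned more: {share_better_off:.0%}")
```

Starting from individual-level values is what makes statements about the median person or the top 10% possible, rather than only a program-wide average.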

Using VLDS to Predict 8th Grade Outcomes for Virginia’s Preschoolers

I initially went to this session because my colleague @softwaredoug built a simple Student Dropout Predictor using Elasticsearch and a public data set (source available at http://github.com/o19s/student-dropout-predictor).

I was curious to learn more about what a Research Scientist, working with these large data sets, actually does. It was interesting to hear, yet again, that much of being a Research Scientist is actually being a Research Data Janitor, since significant effort is put into cleaning up the data. The older VLDS data sets are of lower quality than the newer ones, and the underlying data models have evolved over the years since the data was collected.

I did get the feedback that working with the VLDS data sets is relatively cumbersome, and that this is potentially by design. You have to be an accredited researcher. You need to be fairly specific about which data, within the full universe of information that VLDS contains, you want to use in your research. There is no easy search engine style interface that lets you quickly click around and pick some facets, like schools in southwest Virginia and schools in northern Virginia, and then compare information like dropout rates. That is probably because computed data points like “dropout rate” don’t exist in VLDS, just the raw underlying data.
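As a rough illustration of what deriving such a computed data point from raw records might involve, here is a small sketch that groups student records by region and calculates a dropout rate for each. The field names (`region`, `exit_status`), the status values, and the records themselves are hypothetical and do not reflect the actual VLDS data model.

```python
from collections import defaultdict


def dropout_rate_by_region(records):
    """Compute dropped-out students / all students for each region."""
    enrolled = defaultdict(int)
    dropped = defaultdict(int)
    for record in records:
        enrolled[record["region"]] += 1
        if record["exit_status"] == "dropped_out":
            dropped[record["region"]] += 1
    return {region: dropped[region] / enrolled[region] for region in enrolled}


# Hypothetical raw records, one per student.
records = [
    {"region": "southwest", "exit_status": "graduated"},
    {"region": "southwest", "exit_status": "dropped_out"},
    {"region": "southwest", "exit_status": "graduated"},
    {"region": "southwest", "exit_status": "graduated"},
    {"region": "northern", "exit_status": "graduated"},
    {"region": "northern", "exit_status": "graduated"},
    {"region": "northern", "exit_status": "dropped_out"},
    {"region": "northern", "exit_status": "graduated"},
    {"region": "northern", "exit_status": "graduated"},
]

rates = dropout_rate_by_region(records)
for region, rate in sorted(rates.items()):
    print(f"{region}: {rate:.0%}")
```

A faceted search interface would do essentially this aggregation on the fly, with the region acting as the facet.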