
Expanding data from a frequency table into case form in R and STATA

I’ve been learning a bit of R this summer. R is a pretty powerful language and environment used for statistical computing. I chose it because it’s open-source (and therefore free!). I’ve been using RStudio in conjunction with R — RStudio has a more intuitive GUI that’s a little easier for me to use as a beginner.

As I mentioned in earlier posts, part of my project involves regression analysis. This type of analysis is usually done on case data. However, my datasets yielded frequency tables, and I found myself needing a way to expand these tables into case form.

The answer is a function called expand.dft that is part of the vcdExtra package for R. Downloading vcdExtra is simple, but keep in mind that you may need to install a few dependencies to get it to work correctly.

Likewise, using vcdExtra is pretty simple. Make sure you know which column in your raw data contains the frequency numbers, and name that column Frequency for simplicity’s sake. I started with a frequency table that looked a little like this:

Gender Enrollment Frequency
Male 0 50
Male 1 25
Female 0 50
Female 1 15

Here’s the code:

> FrequencyTable <- read.csv("rawdata.csv", header = TRUE)
> library(vcdExtra)
> CaseForm <- expand.dft(FrequencyTable, freq = "Frequency")

Easy peasy! The end result is a table of 140 rows (50+25+50+15), one for each person represented in the original frequency data.
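If installing vcdExtra and its dependencies turns out to be a hassle, the same expansion can be sketched in base R by repeating each row index as many times as its frequency. This is an alternative technique, not necessarily what expand.dft does under the hood. The data frame below re-creates the example table by hand instead of reading a CSV, and the names row_index and CaseFormBase are made up for illustration.

```r
# Re-create the example frequency table by hand (a stand-in for read.csv)
FrequencyTable <- data.frame(
  Gender     = c("Male", "Male", "Female", "Female"),
  Enrollment = c(0, 1, 0, 1),
  Frequency  = c(50, 25, 50, 15)
)

# Repeat each row number Frequency times: row 1 x 50, row 2 x 25, and so on
row_index <- rep(seq_len(nrow(FrequencyTable)), times = FrequencyTable$Frequency)

# Pull out those rows and keep only the case variables
CaseFormBase <- FrequencyTable[row_index, c("Gender", "Enrollment")]
rownames(CaseFormBase) <- NULL

nrow(CaseFormBase)  # 140, i.e. 50 + 25 + 50 + 15
```

Because the trick relies only on rep() and row indexing, it should work on any data frame whose frequency column holds non-negative whole numbers.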

Another note: I’m working with a large dataset that contains millions of points. Expanding that data to case form and then running a regression on it can be quite a task for a computer that doesn’t have a ton of RAM. So an alternative, if you are lucky enough to have access to STATA, is to simply open your dataset and type “expand Frequency” into the command area. STATA then replaces each row with as many copies as the number in its Frequency column, which gives you case data.

To check whether your data’s been expanded properly, simply ensure that the number of rows in your new dataset is equal to the sum of the cells in your frequency column.
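In R, that check can be sketched in a couple of lines. The example below rebuilds the small table from earlier and expands it with base R so that it runs on its own. With real data you would compare your expanded data frame against your original frequency table instead. The second check, which tallies the cases back into a table, goes a step beyond the row count. The names rows_match and cells_match are just illustrative.

```r
# Small example table and a base-R expansion, so the check is self-contained
FrequencyTable <- data.frame(
  Gender     = c("Male", "Male", "Female", "Female"),
  Enrollment = c(0, 1, 0, 1),
  Frequency  = c(50, 25, 50, 15)
)
CaseForm <- FrequencyTable[
  rep(seq_len(nrow(FrequencyTable)), times = FrequencyTable$Frequency),
  c("Gender", "Enrollment")
]

# Check 1: number of rows equals the sum of the frequency column
rows_match <- nrow(CaseForm) == sum(FrequencyTable$Frequency)

# Check 2: tallying the cases gives back the original cell counts
original <- xtabs(Frequency ~ Gender + Enrollment, data = FrequencyTable)
rebuilt  <- table(Gender = CaseForm$Gender, Enrollment = CaseForm$Enrollment)
cells_match <- all(original == rebuilt)

rows_match
cells_match
```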