I’m Samia, and I’m analyzing racial, gender, and ethnic representation in cancer clinical trials.

My name is Samia, and I am being jointly hosted by OSC and Sartography, LLC as a DataStart intern for the next couple of months. You can read more about the brand-new DataStart program here. There are six of us interns scattered about the Southern states this summer, and we’re all working on different data-intensive problems in various sectors.

I’m a master’s student at the University of Georgia, where I study biomanufacturing and bioprocessing with a focus on pharmaceutical and medical device regulatory affairs. My undergraduate and early career background is in biochemistry and organic chemistry, but I also have business experience in project coordination, scientific and technical writing, and research product consulting. All this to say that I am no data scientist, so I am fully prepared to leave my comfort zone this summer and pick up some new skills!

My project is an analysis of racial, gender, and ethnic representation in cancer clinical trials conducted between 2002 and 2012. This topic is near and dear to my heart, not just for personal reasons, but also as someone who’s worked in clinical development. I believe strongly that all people have a right to equitable access to the benefits of cancer research, and one of the first steps is ensuring that new drugs are tested in the people they are meant to help.

At OSC, I will be learning to use data management tools to get a basic handle on my data before I can begin my analysis proper. I’ll be wrangling three main datasets:

Aggregate Analysis of ClinicalTrials.gov (AACT): This is a relational database made available by the Clinical Trials Transformation Initiative, a consortium of about 60 federal and private organizations, specifically for the purpose of facilitating analysis.

Surveillance, Epidemiology, and End Results (SEER) data: The National Cancer Institute’s SEER program collects and distributes data on cancer incidence and survival in the United States from cancer registries covering about 28% of the United States population. This data is made available to researchers upon written request. The NCI helpfully bundles it with a proprietary statistical software package.

U.S. Census Population Data (With Bridged Race Categories) for 2002-2012: This is census data that has been standardized a bit to fit the government’s incredibly clunky set of racial categories. The numbers for people 85 and over are kept in separate tables, so I had to contact a scientist at the CDC for those!

I’m really excited to work on this project and journal some of my experiences as a complete newbie to the data science field!