My summer project is supporting our Data Science intern Samia in her quest to understand if the racial, gender and ethnic representation in cancer clinical trials reflects the general distribution in the population at large who have these cancers.
One of the key datasets that she is working with is the data from ClinicalTrials.gov. The database is called the Aggregate Analysis of ClinicalTrials.gov or AACT database. Initially we looked at some pipe delimited dumps of the data, but then I realized to query over that was going to be a pain of reverse engineering the schemas and primary and foreign keys. Fortunantly there is a dump of the original database available. So, it was time for me to take the crash course on learning how to restore Oracle databases! This blog post covers the steps that I took.
What Version of Oracle? How do I install Oracle on OSX?
I was almost defeated before I began, because I started googling around to find out how to install Oracle on OSX, and it turned out to be a daunting task. There are freaking 10 different guides about this on Oracle’s site! But then I remembered that I don’t actually care about the nitty gritty of Oracle. I just want to have it running as a black box. Enter Docker, specifically a Oracle 11G docker image kindly made by wnameless.
I read the README and fired up Oracle via Docker:
docker run -d -p 49160:22 -p 49161:1521 --name oracle -e ORACLE_ALLOW_REMOTE=true wnameless/oracle-xe-11g
Notice the proxying of the ports? The SSH port is proxied to 49160 and the SQL driver port is proxied from 1521 to 49161. The magic variable ORACLE_ALLOW_REMOTE was passed in as well. You can log into the running process via:
docker exec -it oracle bash
The next challenge was that to interact with Oracle you need sqlplus, a tool I haven’t used in years. I tried a quick
brew install sqlplus but no joy. Again, Docker to the rescue:
docker run --interactive guywithnose/sqlplus sqlplus system/[email protected]:49161/xe
Thank you guywithnose for packaging up sqlplus and making it easy to connect to my locally running image.
Okay, time to restore an Oracle dump file
This turned out to be quite cumbersome, and the documentation on AACT wasn’t perfect. However, here are the set of steps that I did to make it work.
First I had to get the dump file downloaded. I ran a series of
docker exec commands to install some prerequisities and then download the file.
docker exec oracle apt-get install unzip docker exec oracle wget http://library.dcri.duke.edu/dtmi/ctti/2016_Refresher/AACT201603_oracle.ZIP docker exec oracle unzip AACT201603_oracle.ZIP
Then create the file system structure that the new database will be restored into:
docker exec oracle mkdir -p /data docker exec oracle chmod -R 777 /data docker exec oracle mv aact.dmp /data
Now it’s time to set things up in the Oracle database side of things. Fire up sqlplus again via
docker run --interactive guywithnose/sqlplus sqlplus system/[email protected]:49161/xe. And then run these SQL commands:
create directory data as '/data'; create tablespace CLINTRIALSGOV_OUT_D datafile '/data/data.dbf' size 800M extent management local autoallocate; ALTER DATABASE DATAFILE '/data/data.dbf' AUTOEXTEND ON NEXT 1M MAXSIZE 12024M; create profile DCRI_SERVICE limit;
Okay, now you are ready to run the import program. I tried to run it via a
docker exec but had various difficulties. So just logged in and then ran it.
docker exec -it oracle bash impdp system/[email protected] directory=data dumpfile=aact.dmp schemas=clintrialsgov_out logfile=import_aact.log
You can monitor the progress via some queries:
SELECT FILE_NAME, BYTES FROM DBA_DATA_FILES WHERE TABLESPACE_NAME = 'CLINTRIALSGOV_OUT_D';
Okay, but now it’s just running on my local laptop! How is Samia going to access it?
This also stumped me a bit. I started out messing around with Google Compute Engine and Kubernetes, but honestly, since I am running a simple 1 node database, I don’t need all the magic of Kubernetes. The easiest path to putting Oracle online was to use Docker Cloud.
First off I went to the Cloud Settings and gave Docker Cloud my AWS keys.
Second I went and create a new Service. I searched Docker Hub for wnameless/oracle-xe-11g and found it.
In the General Settings I gave it an alias of “oracle”, and kept all the other defaults. Under Ports I said to publish the Oracle ports of 1521 and 22. I also added the
ORACLE_ALLOW_REMOTE=true as a Environment Variable. Clicked Create and Deploy and in a few minutes, like around 5 or 7, I had a online Oracle node.
To load the data, I followed all the steps above, except that instead of doing
docker exec I had to do
docker-cloud exec. To connect to the oracle container running in Docker Cloud you need it’s name, which you can figure out via:
docker-cloud container ps
To log on you don’t have a
docker-cloud exec oracle-xe-11g-13db8b82-1 bash
Do all of that and you should have the AACT database online! This isn’t a perfect solution, for example I’m not mounting a persistent Volume to store the data on. So if the container goes down, so does the dataset. But this is good enough for playing with the data. In the next edition, I’ll show you how I connect up Apache Zeppelin and some Spark queries!