How cloud computing is revolutionizing pharma and genomics

Yesterday I attended an event hosted by Booz Allen/Amazon on Big Data and Cloud Computing for the life sciences. It was a fascinating event that brought together folks from data science, software, and pharma backgrounds.

The takeaway was clear: cloud computing, full-text search, and Big Data are revolutionizing genomics and pharmaceutical research. So much computing power and so much data are in the hands of today's researchers. Tools like Hadoop, Mahout, and Solr stand ready to ensure that the medicine of tomorrow will look nothing like the medicine of today.

Below is a breakdown of the technologies presented.


What a computer might have looked like when the human genome was originally sequenced.

Dr. Alex Dickinson from Illumina showed off their BaseSpace cloud platform for genomics. Instead of needing to create and manage a massive supercomputing cluster like the original crackers of the human genome, researchers and clinicians get a seamless, scalable, EC2-backed platform for cloud analytics of gene sequences.

BaseSpace transforms the advanced, ivory-tower field of genomics computing into something accessible to patients, clinicians, and researchers. They even have an app store for custom genomics applications. So if you want to write an application that uses the latest bioinformatics algorithms and tools, it might be a fun weekend hack instead of a million-dollar research project. The potential for improving lives and patient outcomes is massive.

I had a fascinating discussion with Dr. Dickinson and Sriram Sridhar at Booz about how Lucene might also be a potential tool for analyzing gene sequences. Lucene provides a set of data structures ideal for dealing with sequences of symbols, allowing efficient lookup and pattern matching over gene sequences. For example, if a researcher wanted to find every sample containing a particular gene sequence, they could index all the genes into Lucene and issue a query to look up those samples. Additionally, Lucene/Solr can add parameters such as Levenshtein distance to a query, allowing a researcher to look up DNA sequences very similar to a query sequence. Hopefully this is a conversation we can continue; there's definite potential for analytics tools here.
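To make the idea concrete, here is a toy sketch in plain Python (not actual Lucene code) of the two operations we discussed: an inverted index over fixed-length k-mers for exact subsequence lookup, and the Levenshtein edit distance that powers Lucene's fuzzy queries. The k-mer length and sample sequences are my own illustrative choices.

```python
from collections import defaultdict

K = 8  # k-mer length; an illustrative choice, tuned per use case in practice

def build_kmer_index(samples):
    """Map each k-mer to the sample IDs whose sequence contains it,
    mimicking Lucene's inverted index over sequence 'terms'."""
    index = defaultdict(set)
    for sample_id, seq in samples.items():
        for i in range(len(seq) - K + 1):
            index[seq[i:i + K]].add(sample_id)
    return index

def lookup(index, samples, query):
    """Exact-subsequence lookup: intersect the posting lists for the
    query's k-mers, then verify candidates against the full sequence."""
    kmers = [query[i:i + K] for i in range(len(query) - K + 1)]
    if not kmers:
        return set()
    candidates = set.intersection(*(index[k] for k in kmers))
    return {s for s in candidates if query in samples[s]}

def levenshtein(a, b):
    """Classic edit distance, the same metric behind fuzzy queries."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]
```

A real Lucene deployment gets both operations (plus scoring, compression, and distribution via Solr) out of the box; the sketch just shows why an inverted index is such a natural fit for symbol sequences.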

Partnering with LucidWorks and Booz Allen, I worked with a cross-company team to develop a new drug recommender for pharma researchers using LucidWorks' Big Data platform. In the demo, users search for an indication such as “headache” against drug product labels from DailyMed's database of product labels. Based on active ingredients already approved for headache (for example, acetaminophen or ibuprofen), the demo uses a similarity metric to identify candidate compounds from the PubChem dataset, scoring each compound by how similar it is to currently approved compounds.
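The demo's exact similarity metric isn't detailed here, but Tanimoto (Jaccard) similarity over structural fingerprints is a common cheminformatics choice; the sketch below shows how candidate scoring might work under that assumption, with made-up fingerprint bit sets standing in for real PubChem data.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    shared bits divided by total distinct bits."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def score_candidates(approved, candidates):
    """Score each candidate by its best similarity to any approved
    compound, returning (compound_id, score) pairs, best first."""
    return sorted(
        ((cid, max(tanimoto(fp, a) for a in approved.values()))
         for cid, fp in candidates.items()),
        key=lambda pair: pair[1], reverse=True)
```

Real fingerprints are bit vectors derived from molecular structure (hundreds to thousands of bits); the tiny integer sets here are purely illustrative.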


One of these could be the next big headache drug!

Additionally, through natural language processing, we've linked additional data sets such as safety data from FAERS, NIH Clinical Trials, and NIH funding reports to give researchers additional information about candidate compounds at a single glance. This part of the project was fun, as it involved building out some D3 visualizations to give users a responsive, discovery-oriented user interface. This demonstrates a core part of our mission at OpenSource Connections: linking backend data structures with rich, user-focused front-ends.

The demo gives researchers powerful leads into potential new areas of research, potentially having a real impact on people's lives. It demonstrates that full-text search and big data can be brought together to enhance one another: big data provides much of the backend scale, smarts, and data structures, while search provides a natural invitation for users to interact with large data sets. This demo also shows that full-text search is more than just a dumb search box. It provides friendly, natural-language-aware ways to perform analytics and discovery, with features such as facets, deep text analysis, and the ability to index most, if not all, of your data.
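As a rough illustration of the "more than a dumb search box" point, here is what a faceted Solr request behind such a demo might look like, sketched in Python. The collection name and field names are hypothetical, not the demo's actual schema.

```python
from urllib.parse import urlencode

def build_facet_query(indication, rows=10):
    """Build the query string for a Solr /select request that searches
    product labels for an indication and facets on active ingredient.
    The collection ('labels') and fields ('indications_txt',
    'active_ingredient_s') are hypothetical placeholders."""
    params = {
        "q": f"indications_txt:{indication}",
        "rows": rows,
        "facet": "true",                       # turn faceting on
        "facet.field": "active_ingredient_s",  # bucket hits by ingredient
        "facet.mincount": 1,                   # hide empty buckets
        "wt": "json",
    }
    return "/solr/labels/select?" + urlencode(params)
```

A single request like this returns both the matching labels and the per-ingredient counts, which is what lets a search UI double as a discovery and analytics tool.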

Cycle Computing makes supercomputing cheap and available

Cycle Computing showed how its platform affordably gathers EC2 spot instances to bring massive computational power to bear on computationally hard problems in the pharma space. Their platform deals with the "spotty" nature of these instances, adding fault tolerance and managing their dynamic comings and goings. Cycle's platform can also balance stable local resources against spotty cloud resources. I don't think anyone puts it better than Cycle themselves:

Utility Supercomputing is the on-demand creation and use of high performance computational environments equivalent to those on the Top 500 Supercomputer list, as a metered, pay for use service.

Cycle provides a nice suite of tools to help manage these large clusters. They showed off how their UI can display the status of 10,000 simultaneously running EC2 instances in a clean grid-based visualization. Additional features include deployment integration of spot instances with Chef. It's fascinating how simple and accessible massive amounts of computational power are becoming. Elastic supercomputing capacity is truly becoming an everyday phenomenon, and Cycle is showing us how to maximize what's out there to perform amazing analytics.
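As a toy illustration of the fault-tolerance idea, the sketch below models a work queue where any task attempt can be interrupted by a spot-instance reclaim and is simply requeued for another worker. Cycle's real scheduler is of course far more sophisticated; the reclaim probability here is a made-up parameter.

```python
import random
from collections import deque

def run_with_retries(tasks, reclaim_prob=0.3, seed=42):
    """Process tasks on 'spot' capacity: each attempt may be reclaimed
    mid-run (the instance disappears), in which case the task goes back
    on the queue.  Returns the completed tasks and total attempts made."""
    rng = random.Random(seed)   # seeded for reproducibility in this sketch
    queue = deque(tasks)
    done, attempts = [], 0
    while queue:
        task = queue.popleft()
        attempts += 1
        if rng.random() < reclaim_prob:
            queue.append(task)  # instance reclaimed: requeue the task
        else:
            done.append(task)   # attempt survived: task complete
    return done, attempts
```

The key property is that every task eventually completes even though individual instances vanish unpredictably, which is what makes cheap, interruptible spot capacity usable for serious computation.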

To sum up

This was a great event, and it was amazing to meet professionals from so many diverse backgrounds. Thanks so much to Booz Allen Hamilton and Amazon for hosting; I certainly learned a lot. I'm really excited about where the future is going to take us now that so much computational power is in the hands of so many smart researchers. The implications are simply revolutionary.