Open Genomics @open-genomics - Tumblr Blog

Global Alliance for Genomics and Health at UCSC is a pioneering organization working with genomics data.

(via https://www.youtube.com/watch?v=7vky0HWxMiE)

This is a video that our friend Kathleen Hayes at Typesafe helped to make. It summarizes the ideas of Open Genomics clearly and concisely. The biology side is brilliantly explained by the folks from Driver, and AMPLab's open-source approach to it by Matt Massie and Frank Nothaft. The key opportunity, how OSS developers can help, is outlined by David Patterson, Matei Zaharia, and Martin Odersky. Alexy Khrabrov invites developers to join Open Genomics to learn more about the space and figure out ways to contribute.

#video #scala #spark #amplab #driver

Open Genomics is coming together

Here's a quick history of our community, its motivation, the people who started it, and plans for the future at the moment of inception.

It all started with John St. John of Driver Genomics connecting around Scala By the Bay 2014 and sharing his plans to use Scala, Play, and Spark to organize genomics data. I learned a lot, shared some pointers, and we welcomed several Driver developers at the conference and at its Fast Track to Scala and Fast Track to Spark courses -- the former run together with Typesafe, the latter together with Databricks. Stewart Stewart from Driver (yes, Stewart^2) took the Spark one.

Come 2015, Stewart Stewart organized an SF Scala Puzzle Night at Driver's office near UCSF Mission Bay. As an intro to Driver, we got a tour of the facilities. Part of it was the cofounders telling us what they do and why they do it. A lot of people got quiet listening to it, seeing cancer patients on screen who could have been treated if personalized medicine existed -- but whose genomes weren't even sequenced. I was personally moved by those stories, and realized that Driver's work, together with a broader set of related efforts in biotech, including AMPLab, is the best way forward to defeat cancer faster. Having started and run four open-source meetups -- SF Scala, SF Spark, SF Text, and Reactive Systems -- I know how to mobilize developer communities, and I immediately resolved to help the Open Genomics tribe come together from our meetups and the broader Bay Area.

John recommended Matt Massie's group at AMPLab as the next stop. I visited Matt and his colleague Frank Nothaft, and once we learned that we all prefer Spark for compute and Parquet+Avro as our data format, we realized we were brothers. We planned a series of joint activities, including talks at the upcoming Text By the Bay conference and a working group meeting preceding it.

It turns out that Natural Language Processing, which is all about strings, has developed technologies directly applicable to genomics. UPenn, one of the leaders in NLP, has a long history of knowledge transfer between Computational Linguistics and Bioinformatics. My UPenn Ph.D. thesis advisor, Prof. Lyle Ungar, and a member of my Ph.D. committee, Mark Liberman, who is also the Director of the Linguistic Data Consortium, were both speakers at Text By the Bay, with Mark, one of the key authorities in the field, delivering the opening keynote. We all met on April 23, 2015, at Galvanize, for a whole-day session to form the working group. Other members included John St. John of Driver, Frank Nothaft of AMPLab, Jeff Lerman of QIAGEN, Malcolm Greaves from Nitro, and Mike Tamir and Nir Kaldero from Galvanize.

We discussed what's needed to help OSS developers help fight cancer. They need to ramp up on biology first, which takes more than perusing a README file. Conversely, getting biologists to the point of appreciating why Parquet+Avro is a reasonable data format probably takes a similar ramp-up. We decided to continue working together and to hold a series of meetups to extend the effort to a group of volunteers who will help gather the resources needed.

Frank's and John's talks at Text By the Bay were fantastic, and had the same effect on the conference attendees as they had on me.

The first Open Genomics meetup was held at the SF Spark and Friends meetup on July 23rd, 2015, at Chartboost, and 20 members joined through our signup form. The path forward will include a stable group of volunteers gathering links and resources on the web site, collaboration with universities and startups on hackathons, a Kaggle competition -- or a series -- and perhaps forming a nonprofit around this work.

As a meetup organizer, I want to dedicate my time and effort to the Open Genomics community as a way of giving back, and our working group is looking for a standing committee of volunteers to both move this forward and decide the formal shape of this work. Please sign up for the group and we'll make progress together.

Welcome to the first Open Genomics meetup at Chartboost. Please fill out the form to keep in touch and help the researchers defeat cancer with open-source software! Share this link!

Join Open Genomics

http://bit.ly/join-open-genomics

Open Source communities come together because they want to improve the world. In the Scala community, we have tremendous potential not only to improve the way the software industry works, but also to save human lives -- literally. Scala and Spark are increasingly becoming the tools of choice for data mining in genomics.

There are several startups in the SF Scala, SF Text, and SF Spark and Friends meetups using Scala and Spark for genomics. Furthermore, AMPLab, the UC Berkeley lab where Spark was born, has a group devoted to solving genomics problems with Scala and Spark -- defining the data formats and creating algorithms for common genomics tasks. At Text By the Bay we created a working group for Open Genomics with Scala and Spark, spanning two universities and three startups. Two of the participants are presenting at this meetup. We encourage everyone who wants to help defeat cancer and other illnesses with Scala and Spark to join forces and collaborate on it.

(1) Scalable Genome Analysis With ADAM

Thanks to substantial improvements in the cost and throughput of DNA sequencing machines, genomic data may soon make personalized medicine a reality. However, significant processing is needed to turn raw DNA strings captured by sequencers into clinically useful data, and modern DNA processing software can take up to a week to run. In this talk, we'll look at how we reconstruct genomes from the raw sequence data, and we introduce ADAM, an Apache Spark-based API for accelerating genome processing pipelines.

Frank Austin Nothaft is an MS/PhD student in Computer Science at UC Berkeley. Frank's research focuses on optimizing commodity distributed systems for scientific applications, and then using these systems to explore biological phenomena. Frank works with Professor David Patterson in the AMPLab and the ASPIRE lab, and is supported by the NSF Graduate Research Fellowship. Frank has also been an IC Design engineer at Broadcom Corporation since 2011, focusing on mixed-signal design automation. Frank completed his Bachelor of Science with Honors in Electrical Engineering at Stanford University, where he was advised by Professor William J. Dally.

(2) A High Level Overview of Genomics in Personalized Medicine

Nearly twenty years ago President Clinton announced the completion of one of the largest public/private collaborative efforts in history, the first draft of the human genome. This work promised to bring forth a new era of totally personalized medicine, where the unique blueprint for your body is used to determine the most effective treatment options for you as an individual. Finally, this promise is starting to be realized in the field of oncology, among others. I will give a high-level overview of medical genomics with an emphasis on my area of expertise, using genomics to guide decision making in oncology.

John St. John is the Director of Bioinformatics at Driver Group, a new startup in the cancer genomics and therapeutics space. Driver Group is currently delivering cutting-edge personalized drug recommendations to cancer patients, and identifying opportunities to bring new kinds of drugs to cancer patients when it discovers a need.
