Spurring Next-Generation Genomics Analysis with Apache Spark
Lightning-Fast Unified Analytics Engine
The Apache Spark/Hadoop stack is emerging as a standard framework for genomic data processing. Apache Spark is an open-source, general-purpose cluster computing engine with built-in modules for streaming, SQL, machine learning, and graph processing. ATGENOMIX data parallelization combines these technologies into one complete workflow that accelerates your existing pipelines by orders of magnitude while adding scalability and reliability to the open-source tools.
Big Data Genomics
ADAM is a library and command-line tool that uses Apache Spark to parallelize genomic data analysis across cluster and cloud computing environments. ATGENOMIX builds fast alignment sorting, partitioning, and region queries on top of the ADAM APIs and stores data in the columnar Apache Parquet format. This design maximizes parallel data I/O performance while eliminating the cost and bottleneck of legacy genomic file access.
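To make the idea of partitioning and region query concrete, here is a minimal single-machine sketch in plain Python. It is not the ADAM API or the ATGENOMIX implementation: the `Alignment` record, the `BIN_SIZE` value, and both function names are hypothetical and chosen only for illustration. The sketch groups alignments into fixed-width genomic bins so that a region query reads only the bins it overlaps, which is roughly the benefit a partitioned, sorted store aims for.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical minimal alignment record (not an ADAM type)."""
    contig: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str


BIN_SIZE = 1_000_000  # assumed partition width in bases


def partition_by_bin(alignments, bin_size=BIN_SIZE):
    """Group alignments by (contig, bin index) and sort each bin by position."""
    bins = defaultdict(list)
    for aln in alignments:
        # An alignment that crosses a bin boundary is registered in every bin it touches.
        first = aln.start // bin_size
        last = (aln.end - 1) // bin_size
        for b in range(first, last + 1):
            bins[(aln.contig, b)].append(aln)
    for key in bins:
        bins[key].sort(key=lambda a: (a.start, a.end))
    return dict(bins)


def region_query(bins, contig, start, end, bin_size=BIN_SIZE):
    """Return alignments overlapping [start, end) on contig, scanning only relevant bins."""
    hits = set()  # a set removes duplicates of alignments stored in two bins
    for b in range(start // bin_size, (end - 1) // bin_size + 1):
        for aln in bins.get((contig, b), []):
            if aln.start < end and aln.end > start:
                hits.add(aln)
    return sorted(hits, key=lambda a: (a.start, a.end))


reads = [
    Alignment("chr1", 100, 250, "r1"),
    Alignment("chr1", 999_950, 1_000_100, "r2"),  # spans two bins
    Alignment("chr1", 2_500_000, 2_500_150, "r3"),
    Alignment("chr2", 100, 250, "r4"),
]
bins = partition_by_bin(reads)
boundary_hits = region_query(bins, "chr1", 999_000, 1_000_050)
```

In a distributed setting each bin would typically map to a partition or a group of rows in a columnar file, so a query can skip data outside the requested region rather than scanning the whole dataset.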
A Puzzle with Millions to Billions of Pieces from Scratch
De novo assembly is one of the most fundamental problems in bioinformatics, and the string graph is the lossless data representation used by many best-practice assemblers, such as SGA and LSG. Compared with the de Bruijn graph, the string graph is more computationally intensive to build, so any solution that accelerates string graph construction is valuable.
Apache Spark is a fast, general-purpose engine for large-scale data processing. It extends the MapReduce model into an in-memory computing framework that runs parallel and distributed algorithms efficiently. Atgenomix data scientists have implemented a novel approach, called GraphSeq, that constructs the string graph for a genome in parallel, based on a proprietary parallel suffix array algorithm on Spark. GraphSeq is more than 13X faster than the SGA overlap implementation and computes the string graph of the 38X WGS paired-end data (NA12878, provided by 10X Genomics) in about 2 hours.
The GraphSeq toolkit is publicly available on the Atgenomix GitHub repository for noncommercial use.
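The role a suffix array plays in finding read overlaps can be illustrated with a toy example. The sketch below is not GraphSeq and does not reproduce its proprietary algorithm; the function names and the `min_overlap` parameter are assumptions made for illustration. It sorts the suffixes of all reads, then uses binary search to find every suffix of one read that exactly matches a prefix of another. Those suffix-prefix matches are the candidate edges of a string graph. Transitive edge reduction, reverse complements, and sequencing errors are deliberately left out, and storing full suffix strings is only practical for tiny inputs.

```python
from bisect import bisect_left


def build_suffix_array(reads, min_overlap):
    """Return a sorted list of (suffix, read_index) for proper suffixes of length >= min_overlap."""
    entries = []
    for i, read in enumerate(reads):
        # off starts at 1 so a read is never treated as a suffix of itself.
        for off in range(1, len(read) - min_overlap + 1):
            entries.append((read[off:], i))
    entries.sort()
    return entries


def find_overlaps(reads, min_overlap):
    """Return sorted edges (i, j, length): a suffix of reads[i] equals a prefix of reads[j]."""
    suffixes = build_suffix_array(reads, min_overlap)
    edges = []
    for j, read in enumerate(reads):
        for length in range(min_overlap, len(read)):
            prefix = read[:length]
            # (prefix, -1) sorts before every (prefix, read_index) entry.
            k = bisect_left(suffixes, (prefix, -1))
            while k < len(suffixes) and suffixes[k][0] == prefix:
                i = suffixes[k][1]
                if i != j:
                    edges.append((i, j, length))
                k += 1
    return sorted(edges)


reads = ["ACGTAC", "GTACGG", "ACGGTT"]
overlap_edges = find_overlaps(reads, min_overlap=3)
```

Because each read's prefix lookups are independent of the others, the search step parallelizes naturally once the sorted suffix structure is available, which is one reason suffix-array approaches suit a cluster framework.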
Cohort-based Annotation of Genetic Variants
Researchers need to easily annotate and interpret genetic variants derived from a large number of personal genomes. ATGENOMIX develops an integrated interface that annotates variants using curated databases as well as in silico estimation of variant effects. We adopt the scalable cluster computing framework Apache Spark and incorporate several parallel algorithms to speed up variant annotation and interpretation. The key advances include efficient annotation of large structural variations, diverse combinations of variant filters, easy integration with a large number of public databases, and a scalable architecture for analyzing hundreds of human whole genomes simultaneously. The generated annotations are then stored in Elasticsearch for real-time query and exploratory analysis.
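One way to picture "diverse combinations of variant filters" is as small predicates that can be composed freely. The following plain-Python sketch is an illustration under stated assumptions, not the ATGENOMIX interface: the `Variant` fields, the impact labels, the gene names, and all helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Variant:
    """Hypothetical annotated variant record."""
    chrom: str
    pos: int
    ref: str
    alt: str
    gene: str
    allele_freq: float  # assumed population allele frequency annotation
    impact: str         # assumed predicted-effect label, e.g. "HIGH" or "LOW"


Filter = Callable[[Variant], bool]


def all_of(*filters: Filter) -> Filter:
    """Pass only variants accepted by every filter."""
    return lambda v: all(f(v) for f in filters)


def any_of(*filters: Filter) -> Filter:
    """Pass variants accepted by at least one filter."""
    return lambda v: any(f(v) for f in filters)


def negate(flt: Filter) -> Filter:
    """Invert a filter."""
    return lambda v: not flt(v)


def max_allele_freq(threshold: float) -> Filter:
    return lambda v: v.allele_freq <= threshold


def impact_in(*labels: str) -> Filter:
    allowed = set(labels)
    return lambda v: v.impact in allowed


def in_genes(*genes: str) -> Filter:
    wanted = set(genes)
    return lambda v: v.gene in wanted


def apply_filter(variants, flt: Filter):
    """Return the variants accepted by flt, preserving input order."""
    return [v for v in variants if flt(v)]


variants = [
    Variant("chr1", 1000, "A", "G", "GENE_A", 0.0005, "HIGH"),
    Variant("chr1", 2000, "C", "T", "GENE_A", 0.2, "HIGH"),
    Variant("chr2", 3000, "G", "A", "GENE_B", 0.001, "LOW"),
    Variant("chr3", 4000, "T", "C", "GENE_C", 0.0001, "MODERATE"),
]
rare_damaging = all_of(max_allele_freq(0.01), impact_in("HIGH", "MODERATE"))
selected = apply_filter(variants, rare_damaging)
```

Each predicate looks at one record at a time, so the same composed filter could be handed to a distributed filter transformation (for example, `filter` on a Spark RDD) without changing the filter logic.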