Discover Top Posts Tagged with #variant calling

Key points

- More samples and more data to analyze mean heavier computation and I/O load, that is, higher hardware requirements.

- The post considers genome analysis in terms of distributed systems, multi-core machines, and clusters.

- Steps are classified as multi-core, single-core, or per-sample cluster according to whether they can be parallelized, so that each step gets the hardware setup that gives it maximum efficiency.

- Early on, analysis was split by chromosome, as our institute does; it is now split by callable region. (The split is made at regions that are hard to align, such as repeats, or by target regions and the like.)

applying base recalibration, de-duplication, realignment and variant calling

- Another point is the use of streaming (pipes), which is also an approach we have already adopted. It reduces I/O load.

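- Distributed file systems are also considered. File systems such as GlusterFS, which we are currently testing ourselves, are used (MooseFS and others serve a similar purpose). Trying SSDs is also worth considering.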
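The callable-region split noted above can be sketched in a few lines of Python. This is a minimal illustration, not the original post's implementation: the function name `group_callable_regions`, the `max_block_bases` parameter, and the sample intervals are all hypothetical. It assumes callable intervals are already available as BED-style (chromosome, start, end) tuples, and it only ever cuts between intervals, i.e. inside non-callable gaps such as repeats.

```python
from typing import List, Tuple

# BED-style half-open interval: (chromosome, start, end)
Interval = Tuple[str, int, int]


def group_callable_regions(
    regions: List[Interval], max_block_bases: int
) -> List[List[Interval]]:
    """Group callable intervals into blocks that can be processed independently.

    A new block starts when the chromosome changes or when adding the next
    interval would push the block past max_block_bases. Cuts therefore fall
    only in the gaps between callable intervals, never inside one.
    """
    blocks: List[List[Interval]] = []
    current: List[Interval] = []
    current_bases = 0

    # sorted() orders chromosomes lexicographically ("chr10" before "chr2"),
    # which is fine for grouping but is not karyotype order.
    for chrom, start, end in sorted(regions):
        size = end - start
        new_chrom = bool(current) and current[-1][0] != chrom
        too_big = bool(current) and current_bases + size > max_block_bases
        if new_chrom or too_big:
            blocks.append(current)
            current, current_bases = [], 0
        current.append((chrom, start, end))
        current_bases += size

    if current:
        blocks.append(current)
    return blocks


# Hypothetical callable intervals; the gaps between them stand in for
# hard-to-align regions.
example_regions = [
    ("chr1", 0, 100),
    ("chr1", 150, 300),
    ("chr1", 400, 450),
    ("chr2", 0, 500),
]
example_blocks = group_callable_regions(example_regions, max_block_bases=260)
```

Each resulting block could then be handed to a separate core or cluster job for the per-region steps. A single interval larger than `max_block_bases` is kept whole in its own block rather than cut.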

Interesting points and the post's own conclusions (from the original post)

Some interesting conclusions:

Scaling single samples to additional cores (16 to 96) provides a 40% reduction in processing time due to increased parallelism during post-processing and variant calling.

Lustre provides the best scale out from 1 to 30 samples, with 30 sample concurrent processing taking only 1.5x as long as a single sample.

NFS provides slightly better performance than Lustre for single sample scaling.

In contrast, NFS runs into scaling issues at 30 samples, proceeding 5.5 times slower during the IO intensive alignment post-processing step.
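To put the quoted figures in perspective, the arithmetic below turns them into a speedup, a parallel efficiency, and a throughput gain. It is a rough sketch that takes the numbers at face value and assumes they refer to total wall-clock time; the function and variable names are mine, not the original post's. The 5.5x NFS figure is left out because it applies to one step only and cannot be converted into an overall number.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional reduction in run time into a speedup factor."""
    return 1.0 / (1.0 - reduction)


def parallel_efficiency(speedup: float, core_ratio: float) -> float:
    """Speedup achieved per unit of added hardware (1.0 would be ideal)."""
    return speedup / core_ratio


# Single sample, 16 -> 96 cores, 40% less processing time.
single_sample_speedup = speedup_from_reduction(0.40)
single_sample_efficiency = parallel_efficiency(single_sample_speedup, 96 / 16)

# Lustre: 30 concurrent samples take 1.5x the time of one sample.
# Compared with running the 30 samples one after another (30x the time),
# that is a 30 / 1.5 throughput gain.
lustre_throughput_gain = 30 / 1.5

print(f"single-sample speedup: {single_sample_speedup:.2f}x")
print(f"parallel efficiency:   {single_sample_efficiency:.0%}")
print(f"Lustre throughput:     {lustre_throughput_gain:.0f}x")
```

Under these assumptions, six times the cores buys roughly a 1.67x speedup for one sample, while running many samples concurrently on Lustre scales far better, which suggests that adding samples uses the hardware more effectively than adding cores to a single sample.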

#pipeline #bioinformatics #parallel #variant calling

The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyse next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

#GATK #bioinformatics #sequencing #variant calling

Key points