Discover Top Posts Tagged with #ashleybateman

#ashleybateman #microbialTransplants

Bioinformatics Pipeline

Raw Data -> OTU table & IDE

Using: ACISS, QIIME, FastX Toolkit

I received 137.9 GB of sequence data from the Genomics Core at the University of Oregon after submitting 36 samples for paired end 150 bp reads on the HiSeq run in MiSeq mode.

It was run with a companion of genomic DNA, due to the fact that a HiSeq can’t cope without sufficient sequence complexity. At first the sequence data was provided to me already demultiplexed, and this initially presented a problem because of the pipeline that I had originally vetted through QIIME.

I talked to Nick Stiffler, and he concluded the following: “The qiime docs show this when using files that were already demultiplexed, we need to create a special fasta label that includes the fields from the mapping file: http://qiime.org/documentation/file_formats.html-already-demultiplexed-samples. It doesn’t look like there is a script to handle this, but we could write one fairly easily.”

In the meantime, he gave me the raw data to use. In the future, I would have to use the QIIME script referenced above, along with a custom script to create a special fasta label.

It is important to remember that when logging on to the htseq server to download your data, you need to do so from your ACISS account, otherwise you will get an error: permission denied.

starting over with raw data (not demultiplexed)

Log into htseq server where the sequence data is, and download all relevant files.

scp [email protected]:/raid6p1/hiseq/ 140117_SN747_0364_BH8D14ADXX/ashley/Project_H8D14ADXX/ Sample_lane1/*.gz

Concatenate Read 1, Read 2 (barcode reads), and Read 3

cat *_R1_*.fastq < R1combined.fastq (Read 1) cat *_R2_*.fastq < R2combined.fastq (Barcodes) cat *_R3_*.fastq < R3combined.fastq (Read 3)

paste barcodes onto Read 1 sequence file

paste -d '\0' RawData_1.29.14/R2combined.fastq RawData_1.29.14/R1combined.fastq > PSG_R2R1.fastq

gets rid of double plusses

perl -i.bak -p -e 's/^\+\+\n/\+\n/' PSG_R2R1.fastq

Barcode split Read 1 with FastX:

PSG_R2R1.fastq (Barcodes + Read1 sequence)

Barcodes_rc8.txt (list of barcodes)

mkdir bcfile_PSG_R2R1

cat PSG_R2R1.fastq | fastx_barcode_splitter.pl --bcfile Barcodes_rc8.txt --prefix bcfile_PSG_R2R1/ --bol --mismatches 0

Barcode Count Location EdtpdT0 95656 bcfile_PSG_R2R1/EdtpdT0 EdtpdT2 158291 bcfile_PSG_R2R1/EdtpdT2 EdtpdT4 77924 bcfile_PSG_R2R1/EdtpdT4 EdtpdT8 115113 bcfile_PSG_R2R1/EdtpdT8 EdtpmT0 88206 bcfile_PSG_R2R1/EdtpmT0 EdtpmT2 132743 bcfile_PSG_R2R1/EdtpmT2 EdtpmT4 80860 bcfile_PSG_R2R1/EdtpmT4 EdtpmT8 73725 bcfile_PSG_R2R1/EdtpmT8 EdtpsT0 156315 bcfile_PSG_R2R1/EdtpsT0 EdtpsT2 82321 bcfile_PSG_R2R1/EdtpsT2 EdtpsT4 79881 bcfile_PSG_R2R1/EdtpsT4 EdtpsT8 76491 bcfile_PSG_R2R1/EdtpsT8 EmtpdT0 254247 bcfile_PSG_R2R1/EmtpdT0 EmtpdT2 100645 bcfile_PSG_R2R1/EmtpdT2 EmtpdT4 45179 bcfile_PSG_R2R1/EmtpdT4 EmtpdT8 107926 bcfile_PSG_R2R1/EmtpdT8 EmtpmT0 137704 bcfile_PSG_R2R1/EmtpmT0 EmtpmT2 223040 bcfile_PSG_R2R1/EmtpmT2 EmtpmT4 40359 bcfile_PSG_R2R1/EmtpmT4 EmtpmT8 258230 bcfile_PSG_R2R1/EmtpmT8 EmtpsT0 79289 bcfile_PSG_R2R1/EmtpsT0 EmtpsT2 118713 bcfile_PSG_R2R1/EmtpsT2 EmtpsT4 40528 bcfile_PSG_R2R1/EmtpsT4 EmtpsT8 59409 bcfile_PSG_R2R1/EmtpsT8 EstpdT0 102280 bcfile_PSG_R2R1/EstpdT0 EstpdT2 77238 bcfile_PSG_R2R1/EstpdT2 EstpdT4 136706 bcfile_PSG_R2R1/EstpdT4 EstpdT8 76222 bcfile_PSG_R2R1/EstpdT8 EstpmT0 60302 bcfile_PSG_R2R1/EstpmT0 EstpmT2 40097 bcfile_PSG_R2R1/EstpmT2 EstpmT4 42017 bcfile_PSG_R2R1/EstpmT4 EstpmT8 77362 bcfile_PSG_R2R1/EstpmT8 EstpsT0 51933 bcfile_PSG_R2R1/EstpsT0 EstpsT2 15610 bcfile_PSG_R2R1/EstpsT2 EstpsT4 155800 bcfile_PSG_R2R1/EstpsT4 EstpsT8 45135 bcfile_PSG_R2R1/EstpsT8 unmatched 169988647 bcfile_PSG_R2R1/unmatched total 173552144

Looks good! (consider that unmatched includes my genomic companion in the lane, and phiX).

Used prinseq to look at quality of Read 1: http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi?report=1

concatenate all barcode split reads (read 1 + barcode) together

cat *.fastq > Read1and2CombinedPSG.fastq

perl prinseq-lite.pl -verbose -fastq bcfile_PSG_R2R1/Read1and2CombinedPSG.fastq -graph_data Read1and2CombinedPSG.fastq.gd -out_good null -out_bad null

quality of read 1 looks good.

Repeat same process with Read 3

Paste barcodes onto Read 3 sequence file

paste -d '\0' RawData_1.29.14/R2combined.fastq RawData_1.29.14/R3combined.fastq > PSG_R2R3.fastq

Gets rid of double plusses

perl -i.bak -p -e 's/^\+\+\n/\+\n/' PSG_R2R3.fastq

Barcode split with FastX: * PSG_R2R3.fastq (Barcodes + Read3 sequence) * Barcodes_rc8.txt (list of barcodes)

mkdir bcfile_PSG_R2R3

cat PSG_R2R3.fastq | fastx_barcode_splitter.pl --bcfile Barcodes_rc8.txt --prefix bcfile_PSG_R2R3/ --bol --mismatches 0

abateman@cn144 PilotSamplingGrad2]$ cat PSG_R2R3.fastq | fastx_barcode_splitter.pl --bcfile Barcodes_rc8.txt --prefix bcfile_PSG_R2R3/ --bol --mismatches 0

Barcode Count Location EdtpdT0 95656 bcfile_PSG_R2R3/EdtpdT0 EdtpdT2 158291 bcfile_PSG_R2R3/EdtpdT2 EdtpdT4 77924 bcfile_PSG_R2R3/EdtpdT4 EdtpdT8 115113 bcfile_PSG_R2R3/EdtpdT8 EdtpmT0 88206 bcfile_PSG_R2R3/EdtpmT0 EdtpmT2 132743 bcfile_PSG_R2R3/EdtpmT2 EdtpmT4 80860 bcfile_PSG_R2R3/EdtpmT4 EdtpmT8 73725 bcfile_PSG_R2R3/EdtpmT8 EdtpsT0 156315 bcfile_PSG_R2R3/EdtpsT0 EdtpsT2 82321 bcfile_PSG_R2R3/EdtpsT2 EdtpsT4 79881 bcfile_PSG_R2R3/EdtpsT4 EdtpsT8 76491 bcfile_PSG_R2R3/EdtpsT8 EmtpdT0 254247 bcfile_PSG_R2R3/EmtpdT0 EmtpdT2 100645 bcfile_PSG_R2R3/EmtpdT2 EmtpdT4 45179 bcfile_PSG_R2R3/EmtpdT4 EmtpdT8 107926 bcfile_PSG_R2R3/EmtpdT8 EmtpmT0 137704 bcfile_PSG_R2R3/EmtpmT0 EmtpmT2 223040 bcfile_PSG_R2R3/EmtpmT2 EmtpmT4 40359 bcfile_PSG_R2R3/EmtpmT4 EmtpmT8 258230 bcfile_PSG_R2R3/EmtpmT8 EmtpsT0 79289 bcfile_PSG_R2R3/EmtpsT0 EmtpsT2 118713 bcfile_PSG_R2R3/EmtpsT2 EmtpsT4 40528 bcfile_PSG_R2R3/EmtpsT4 EmtpsT8 59409 bcfile_PSG_R2R3/EmtpsT8 EstpdT0 102280 bcfile_PSG_R2R3/EstpdT0 EstpdT2 77238 bcfile_PSG_R2R3/EstpdT2 EstpdT4 136706 bcfile_PSG_R2R3/EstpdT4 EstpdT8 76222 bcfile_PSG_R2R3/EstpdT8 EstpmT0 60302 bcfile_PSG_R2R3/EstpmT0 EstpmT2 40097 bcfile_PSG_R2R3/EstpmT2 EstpmT4 42017 bcfile_PSG_R2R3/EstpmT4 EstpmT8 77362 bcfile_PSG_R2R3/EstpmT8 EstpsT0 51933 bcfile_PSG_R2R3/EstpsT0 EstpsT2 15610 bcfile_PSG_R2R3/EstpsT2 EstpsT4 155800 bcfile_PSG_R2R3/EstpsT4 EstpsT8 45135 bcfile_PSG_R2R3/EstpsT8 unmatched 169988647 bcfile_PSG_R2R3/unmatched total 173552144

Used prinseq to look at quality of Read 3: http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi?report=1

Concatenate all barcode split reads (read 1 + barcode) together

cat *.fastq > Read1and3CombinedPSG.fastq

perl prinseq-lite.pl -verbose -fastq bcfile_PSG_R2R3/Read1and3CombinedPSG.fastq -graph_data Read1and3CombinedPSG.fastq.gd -out_good null -out_bad null

quality looks HORRIBLE.

Moved forward using Read 1 only for analysis.

Run QIIME barcode splitting to move forward with QIIME pipeline

This script performs demultiplexing of fastq sequence data where barcodes and sequences are contained in separate fastq runs (common on Illumina).

split_libraries_fastq.py --barcode_type 8 --max_barcode_errors 0 -q 25 -i RawData_1.29.14/R1combined.fastq -b RawData_1.29.14/R2combined.fastq -o PSG_split_lib_output/ -m MappingFilePSG.txt

Pick open reference otus

The preferred method by QIIME developers, this script clusters reads against a reference sequence collection, and any reads which do not hit in the reference collection are subsequently clustered de novo. It includes taxonomy assignmnet, sequence alignment, and tree-building steps.

pick_open_reference_otus.py -i PSG_split_lib_output/seqs.fna -o PSG_picked_otus/ -r gg_97_otus_4feb2011.fasta -f

Because this script will take more than 4 hours to run on ACISS, it is necessary to use a long node, as shown below.

qsub -q longgen OTUpickscript.sh

OTUpickscript.sh:

!/bin/bash -N Ashley_OTU -q longgen ### -l nodes=1:ppn=4 -d /home16/abateman/PilotSamplingGrad2 -e /home16/abateman/PilotSamplingGrad2 -o /home16/abateman/PilotSamplingGrad2 ## -ea ## -M [email protected]``<br />

module load qiime/1.7.0 echo "A" mkdir rm -rf /tmp/PSG_picked_otus mkdir /tmp/PSG_picked_otus echo "B" pick_open_reference_otus.py -i PSG_split_lib_output/seqs.fna -o /tmp/PSG_picked_otus/ -r gg_97_otus_4feb2011.fasta -f

echo "C" cp -r /tmp/PSG_picked_otus PSG_picked_otus_1 echo "D"

To check on progress while running on ACISS:

qstat -anu abateman ssh bn67 whatever bn number it says ls /tmp/"PSG_picked_otus/" top``<br />

Result:

All other files created as normal, but the log file said an error was raised during pick_open_reference_otus.py

I think next time I need to change the QIIME config and/or parameter file in order to point QIIME to the appropriate (-t) template.

Below is the step that the error was raised on:

align_seqs.py -i /tmp/PSG_picked_otus//rep_set.fna -o /tmp/PSG_picked_otus//pynast_aligned_seqs

So instead I tried running these last few steps by themselves, because I knew that I wanted to use at least a tree for my next script, core_diversity_analyses.py

align_seqs.py -i PSG_picked_otus_1/rep_set.fna -o PSG_picked_otus_1/pynast_aligned_seqs -t core_set_aligned.fasta.imputed

filter_alignment.py -i PSG_picked_otus_1/pynast_aligned_seqs/rep_set_aligned.fasta -m gg_lanemask_in_1s_and_0s.txt -o PSG_picked_otus_1/filtered_alignment/

make_phylogeny.py -i PSG_picked_otus_1/filtered_alignment/rep_set_aligned_pfiltered.fasta -o PSG_picked_otus_1/rep_phylo.tre

Running core diversity analysis is a good way to do some basic data exploration.

core_diversity_analyses.py -i PSG_picked_otus_1/otu_table_mc2_w_tax.biom -t PSG_picked_otus_1/rep_phylo.tre -o PSG_corediv1000/ -m MappingFilePSG.txt -e 1000

Not working: missing script on ACISS servers in order to run this sucessfully. Ran it locally on MacQiime instead.

core_diversity_analyses.py -i otu_table_mc2_w_tax.biom -t rep_phylo.tre -o PSG_corediv1000/ -m MappingFilePSG.txt -e 1000

#ashleybateman

Trending Tags

Recently Viewed Tags

#ashleybateman