#ashleybateman #microbialTransplants
Raw Data -> OTU table & IDE
Using: ACISS, QIIME, FastX Toolkit
I received 137.9 GB of sequence data from the Genomics Core at the University of Oregon after submitting 36 samples for paired end 150 bp reads on the HiSeq run in MiSeq mode.
It was run with a companion of genomic DNA, due to the fact that a HiSeq can’t cope without sufficient sequence complexity. At first the sequence data was provided to me already demultiplexed, and this initially presented a problem because of the pipeline that I had originally vetted through QIIME.
I talked to Nick Stiffler, and he concluded the following: “The qiime docs show this when using files that were already demultiplexed, we need to create a special fasta label that includes the fields from the mapping file: http://qiime.org/documentation/file_formats.html-already-demultiplexed-samples. It doesn’t look like there is a script to handle this, but we could write one fairly easily.”
In the meantime, he gave me the raw data to use. In the future, I would have to use the QIIME script referenced above, along with a custom script to create a special fasta label.
It is important to remember that when logging on to the htseq server to download your data, you need to do so from your ACISS account, otherwise you will get an error: permission denied.
starting over with raw data (not demultiplexed)
Log into htseq server where the sequence data is, and download all relevant files.
scp [email protected] :/raid6p1/hiseq/
140117_SN747_0364_BH8D14ADXX/ashley/Project_H8D14ADXX/
Sample_lane1/*.gz
Concatenate Read 1, Read 2 (barcode reads), and Read 3
cat *_R1_*.fastq < R1combined.fastq (Read 1)
cat *_R2_*.fastq < R2combined.fastq (Barcodes)
cat *_R3_*.fastq < R3combined.fastq (Read 3)
paste barcodes onto Read 1 sequence file
paste -d '\0' RawData_1.29.14/R2combined.fastq RawData_1.29.14/R1combined.fastq > PSG_R2R1.fastq
gets rid of double plusses
perl -i.bak -p -e 's/^\+\+\n/\+\n/' PSG_R2R1.fastq
Expand
Barcode split Read 1 with FastX:
PSG_R2R1.fastq (Barcodes + Read1 sequence)
Barcodes_rc8.txt (list of barcodes)
cat PSG_R2R1.fastq | fastx_barcode_splitter.pl --bcfile Barcodes_rc8.txt --prefix bcfile_PSG_R2R1/ --bol --mismatches 0
Barcode Count Location
EdtpdT0 95656 bcfile_PSG_R2R1/EdtpdT0
EdtpdT2 158291 bcfile_PSG_R2R1/EdtpdT2
EdtpdT4 77924 bcfile_PSG_R2R1/EdtpdT4
EdtpdT8 115113 bcfile_PSG_R2R1/EdtpdT8
EdtpmT0 88206 bcfile_PSG_R2R1/EdtpmT0
EdtpmT2 132743 bcfile_PSG_R2R1/EdtpmT2
EdtpmT4 80860 bcfile_PSG_R2R1/EdtpmT4
EdtpmT8 73725 bcfile_PSG_R2R1/EdtpmT8
EdtpsT0 156315 bcfile_PSG_R2R1/EdtpsT0
EdtpsT2 82321 bcfile_PSG_R2R1/EdtpsT2
EdtpsT4 79881 bcfile_PSG_R2R1/EdtpsT4
EdtpsT8 76491 bcfile_PSG_R2R1/EdtpsT8
EmtpdT0 254247 bcfile_PSG_R2R1/EmtpdT0
EmtpdT2 100645 bcfile_PSG_R2R1/EmtpdT2
EmtpdT4 45179 bcfile_PSG_R2R1/EmtpdT4
EmtpdT8 107926 bcfile_PSG_R2R1/EmtpdT8
EmtpmT0 137704 bcfile_PSG_R2R1/EmtpmT0
EmtpmT2 223040 bcfile_PSG_R2R1/EmtpmT2
EmtpmT4 40359 bcfile_PSG_R2R1/EmtpmT4
EmtpmT8 258230 bcfile_PSG_R2R1/EmtpmT8
EmtpsT0 79289 bcfile_PSG_R2R1/EmtpsT0
EmtpsT2 118713 bcfile_PSG_R2R1/EmtpsT2
EmtpsT4 40528 bcfile_PSG_R2R1/EmtpsT4
EmtpsT8 59409 bcfile_PSG_R2R1/EmtpsT8
EstpdT0 102280 bcfile_PSG_R2R1/EstpdT0
EstpdT2 77238 bcfile_PSG_R2R1/EstpdT2
EstpdT4 136706 bcfile_PSG_R2R1/EstpdT4
EstpdT8 76222 bcfile_PSG_R2R1/EstpdT8
EstpmT0 60302 bcfile_PSG_R2R1/EstpmT0
EstpmT2 40097 bcfile_PSG_R2R1/EstpmT2
EstpmT4 42017 bcfile_PSG_R2R1/EstpmT4
EstpmT8 77362 bcfile_PSG_R2R1/EstpmT8
EstpsT0 51933 bcfile_PSG_R2R1/EstpsT0
EstpsT2 15610 bcfile_PSG_R2R1/EstpsT2
EstpsT4 155800 bcfile_PSG_R2R1/EstpsT4
EstpsT8 45135 bcfile_PSG_R2R1/EstpsT8
unmatched 169988647 bcfile_PSG_R2R1/unmatched
total 173552144
Looks good! (consider that unmatched includes my genomic companion in the lane, and phiX).
Used prinseq to look at quality of Read 1: http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi?report=1
concatenate all barcode split reads (read 1 + barcode) together
cat *.fastq > Read1and2CombinedPSG.fastq
perl prinseq-lite.pl -verbose -fastq bcfile_PSG_R2R1/Read1and2CombinedPSG.fastq -graph_data Read1and2CombinedPSG.fastq.gd -out_good null -out_bad null
quality of read 1 looks good.
Repeat same process with Read 3
Paste barcodes onto Read 3 sequence file
paste -d '\0' RawData_1.29.14/R2combined.fastq RawData_1.29.14/R3combined.fastq > PSG_R2R3.fastq
Gets rid of double plusses
perl -i.bak -p -e 's/^\+\+\n/\+\n/' PSG_R2R3.fastq
Barcode split with FastX: * PSG_R2R3.fastq (Barcodes + Read3 sequence) * Barcodes_rc8.txt (list of barcodes)
cat PSG_R2R3.fastq | fastx_barcode_splitter.pl --bcfile Barcodes_rc8.txt --prefix bcfile_PSG_R2R3/ --bol --mismatches 0
abateman@cn144 PilotSamplingGrad2]$ cat PSG_R2R3.fastq | fastx_barcode_splitter.pl --bcfile Barcodes_rc8.txt --prefix bcfile_PSG_R2R3/ --bol --mismatches 0
Barcode Count Location
EdtpdT0 95656 bcfile_PSG_R2R3/EdtpdT0
EdtpdT2 158291 bcfile_PSG_R2R3/EdtpdT2
EdtpdT4 77924 bcfile_PSG_R2R3/EdtpdT4
EdtpdT8 115113 bcfile_PSG_R2R3/EdtpdT8
EdtpmT0 88206 bcfile_PSG_R2R3/EdtpmT0
EdtpmT2 132743 bcfile_PSG_R2R3/EdtpmT2
EdtpmT4 80860 bcfile_PSG_R2R3/EdtpmT4
EdtpmT8 73725 bcfile_PSG_R2R3/EdtpmT8
EdtpsT0 156315 bcfile_PSG_R2R3/EdtpsT0
EdtpsT2 82321 bcfile_PSG_R2R3/EdtpsT2
EdtpsT4 79881 bcfile_PSG_R2R3/EdtpsT4
EdtpsT8 76491 bcfile_PSG_R2R3/EdtpsT8
EmtpdT0 254247 bcfile_PSG_R2R3/EmtpdT0
EmtpdT2 100645 bcfile_PSG_R2R3/EmtpdT2
EmtpdT4 45179 bcfile_PSG_R2R3/EmtpdT4
EmtpdT8 107926 bcfile_PSG_R2R3/EmtpdT8
EmtpmT0 137704 bcfile_PSG_R2R3/EmtpmT0
EmtpmT2 223040 bcfile_PSG_R2R3/EmtpmT2
EmtpmT4 40359 bcfile_PSG_R2R3/EmtpmT4
EmtpmT8 258230 bcfile_PSG_R2R3/EmtpmT8
EmtpsT0 79289 bcfile_PSG_R2R3/EmtpsT0
EmtpsT2 118713 bcfile_PSG_R2R3/EmtpsT2
EmtpsT4 40528 bcfile_PSG_R2R3/EmtpsT4
EmtpsT8 59409 bcfile_PSG_R2R3/EmtpsT8
EstpdT0 102280 bcfile_PSG_R2R3/EstpdT0
EstpdT2 77238 bcfile_PSG_R2R3/EstpdT2
EstpdT4 136706 bcfile_PSG_R2R3/EstpdT4
EstpdT8 76222 bcfile_PSG_R2R3/EstpdT8
EstpmT0 60302 bcfile_PSG_R2R3/EstpmT0
EstpmT2 40097 bcfile_PSG_R2R3/EstpmT2
EstpmT4 42017 bcfile_PSG_R2R3/EstpmT4
EstpmT8 77362 bcfile_PSG_R2R3/EstpmT8
EstpsT0 51933 bcfile_PSG_R2R3/EstpsT0
EstpsT2 15610 bcfile_PSG_R2R3/EstpsT2
EstpsT4 155800 bcfile_PSG_R2R3/EstpsT4
EstpsT8 45135 bcfile_PSG_R2R3/EstpsT8
unmatched 169988647 bcfile_PSG_R2R3/unmatched
total 173552144
Used prinseq to look at quality of Read 3: http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi?report=1
Concatenate all barcode split reads (read 1 + barcode) together
cat *.fastq > Read1and3CombinedPSG.fastq
perl prinseq-lite.pl -verbose -fastq bcfile_PSG_R2R3/Read1and3CombinedPSG.fastq -graph_data Read1and3CombinedPSG.fastq.gd -out_good null -out_bad null
Moved forward using Read 1 only for analysis.
Run QIIME barcode splitting to move forward with QIIME pipeline
This script performs demultiplexing of fastq sequence data where barcodes and sequences are contained in separate fastq runs (common on Illumina).
split_libraries_fastq.py --barcode_type 8 --max_barcode_errors 0 -q 25 -i RawData_1.29.14/R1combined.fastq -b RawData_1.29.14/R2combined.fastq -o PSG_split_lib_output/ -m MappingFilePSG.txt
The preferred method by QIIME developers, this script clusters reads against a reference sequence collection, and any reads which do not hit in the reference collection are subsequently clustered de novo. It includes taxonomy assignmnet, sequence alignment, and tree-building steps.
pick_open_reference_otus.py -i PSG_split_lib_output/seqs.fna -o PSG_picked_otus/ -r gg_97_otus_4feb2011.fasta -f
Because this script will take more than 4 hours to run on ACISS, it is necessary to use a long node, as shown below.
qsub -q longgen OTUpickscript.sh
!/bin/bash
-N Ashley_OTU
-q longgen
### -l nodes=1:ppn=4
-d /home16/abateman/PilotSamplingGrad2
-e /home16/abateman/PilotSamplingGrad2
-o /home16/abateman/PilotSamplingGrad2
## -ea
## -M [email protected] ``<br />
module load qiime/1.7.0
echo "A"
mkdir rm -rf /tmp/PSG_picked_otus
mkdir /tmp/PSG_picked_otus
echo "B"
pick_open_reference_otus.py -i PSG_split_lib_output/seqs.fna
-o /tmp/PSG_picked_otus/ -r gg_97_otus_4feb2011.fasta -f
echo "C"
cp -r /tmp/PSG_picked_otus PSG_picked_otus_1
echo "D"
To check on progress while running on ACISS:
qstat -anu abateman
ssh bn67 whatever bn number it says
ls /tmp/"PSG_picked_otus/"
top``<br />
All other files created as normal, but the log file said an error was raised during pick_open_reference_otus.py
I think next time I need to change the QIIME config and/or parameter file in order to point QIIME to the appropriate (-t) template.
Below is the step that the error was raised on:
align_seqs.py -i /tmp/PSG_picked_otus//rep_set.fna -o /tmp/PSG_picked_otus//pynast_aligned_seqs
So instead I tried running these last few steps by themselves, because I knew that I wanted to use at least a tree for my next script, core_diversity_analyses.py
align_seqs.py -i PSG_picked_otus_1/rep_set.fna -o PSG_picked_otus_1/pynast_aligned_seqs -t core_set_aligned.fasta.imputed
filter_alignment.py -i PSG_picked_otus_1/pynast_aligned_seqs/rep_set_aligned.fasta -m gg_lanemask_in_1s_and_0s.txt -o PSG_picked_otus_1/filtered_alignment/
make_phylogeny.py -i PSG_picked_otus_1/filtered_alignment/rep_set_aligned_pfiltered.fasta -o PSG_picked_otus_1/rep_phylo.tre
Running core diversity analysis is a good way to do some basic data exploration.
core_diversity_analyses.py -i PSG_picked_otus_1/otu_table_mc2_w_tax.biom -t PSG_picked_otus_1/rep_phylo.tre -o PSG_corediv1000/ -m MappingFilePSG.txt -e 1000
Not working: missing script on ACISS servers in order to run this sucessfully. Ran it locally on MacQiime instead.
core_diversity_analyses.py -i otu_table_mc2_w_tax.biom -t rep_phylo.tre -o PSG_corediv1000/ -m MappingFilePSG.txt -e 1000