Spent most of last week in Berkeley working with BE collaborators on a big meta-analysis project using the QIIME database. Getting data in was really tough, for multiple reasons, but much progress was made, and we came up with a reproducible workflow. Will post later.
Along the way I had to replace all of the barcodes in a fastq file with new barcodes, and after flailing at an attempt to code this in bash, I punted and wrote it in R:
setwd('~/Dropbox/Meadow_to_qiimedb/Dust/sequences/2')

# copy so old version is saved
system("cp Kembel_etal_Dust2_INDEX.fastq index.fastq")

# read both in as vector
bcin <- read.table('old.txt', header=FALSE)
bcout <- read.table('new.txt', header=FALSE)

# format to vector
bci <- as.character(bcin$V1)
bco <- as.character(bcout$V1)
n <- length(bci)

# loop through with system sed command
for(i in seq_len(n)) {
  system(paste("sed s/", bci[i], "/", bco[i], "/ index.fastq > index2.fastq", sep=''))
  system("cp index2.fastq index.fastq")
  print(i)
}

# output index.fastq has all new barcodes
Of course R is not exactly the elegant way to do this, but it worked in the moment and was really quick to code. It did make me realize that I should be more familiar with Python, which is certainly a better tool for this sort of job. So I spent this weekend brushing up on Python and came up with this script to do the same job:
#!/usr/bin/env python
#
# script to replace dna barcodes in a big file:
# this was written and tested on a small test file, but
# also worked on sequence files.
# the test requires 3 files in the same directory
# "roses.txt" <sequence file> that contains these 3 lines:
#
# roses are red,
# violets are blue
# roses are also white
#
# and "old.txt" <old barcodes> with these 3 lines:
#
# roses
# violets
# white
#
# and "new.txt" <new barcodes> with these 3 lines
#
# socks
# shoes
# stinky
#
# Caution!
# since this was written for an index read
# (following Caporaso 2012 methods)
# it does not require that barcodes be at the beginning!!
#

import sys

def Rep(filename):                          # new function to hold it all
    with open(filename) as f:               # open filename that is passed, read only
        seq = f.read()                      # read it and hold all in memory as one string
    with open('old.txt') as o:              # open old words
        old = o.read().split()              # split at whitespace, otherwise '\n' appears in list
    with open('new.txt') as n:              # open new replace words
        new = n.read().split()              # same split; blank lines are dropped
    for i in range(len(old)):               # loop through every barcode using numbers
        seq = seq.replace(old[i], new[i])   # so that 'i' can index 2 things
    with open('rosesnew.txt', 'w') as out:  # create new file that will get filled
        out.write(seq)                      # write new output to file created above

def main():                                 # boilerplate
    Rep(sys.argv[1])

if __name__ == '__main__':
    main()
So there are 2 ways to do this in the future.
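One caveat applies to both versions: because the barcodes are replaced one after another, the substitutions are chained. If a new barcode happens to match an old barcode further down the list, it will be replaced a second time. Below is a minimal sketch of a single-pass alternative in Python. The function name `replace_barcodes` and the toy barcodes are hypothetical and were not part of the workflow above. The sketch assumes the old barcodes are plain strings with no duplicates.

```python
import re

def replace_barcodes(text, old, new):
    """Swap each old barcode for its new one in a single pass over text."""
    mapping = dict(zip(old, new))
    if not mapping:
        return text
    # Try longer barcodes first so a short one cannot shadow a longer one.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Each match is looked up once; replaced text is never scanned again.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

# Toy example: the new barcode for AAAA is also an old barcode.
# Chained replacement would turn both lines into GGGG.
result = replace_barcodes("AAAA\nCCCC\n", ["AAAA", "CCCC"], ["CCCC", "GGGG"])
print(result)
```

Like the scripts above, this does not require that the barcode sit at the beginning of a line.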








