import antigravity @arklenna - Tumblr Blog

Manually setting project for virtualenv

Using virtualenvburrito/virtualenvwrapper, the project associated with a virtualenv is specified in:

~/.virtualenvs/VENVNAME/.project

#virtualenv #virtualenvwrapper #virtualenvburrito #python

Programmer, meet glob.

Copied from a Python library's documentation (details changed to protect the innocent):

"As an example of how useful Python libraries and syntax techniques can be, I will show how to create a list of all 'ext' files in the current directory using the os library:

import os the_files = [file for file in os.listdir(os.getcwd()) if file[-4:]=='.ext']

#glob #python #badpython

GSoC t-shirt

#gsoc #gsoc12 #gsoc2012

GSoC certificate

#gsoc #gsoc2012 #gsoc12

And the summer ends

The coordinate mapper, with updated documentation, is now located on [this branch](https://github.com/lennax/biopython/tree/f_loc4), awaiting the merging of Peter's f_loc4 branch. I've written an [entry](http://biopython.org/wiki/Coordinate_mapping) on coordinate mapping for the Cookbook. Additionally, at Peter's suggestion, I've written a clarification of strand as it relates to transcription and translation. It's available [here](https://docs.google.com/document/d/11R7EOJXn90lN5_SmaPOyN5rFfPQybbCbUBo6EY0R0pA/edit). It's been a great experience working with this project this summer. Thank you to everyone involved.

#BioPython #gsoc #gsoc12 #GSOC2012 #python

Strand

The summer is winding to a close. I've spent this week busy with orientation events and meetings for my upcoming PhD program. I hope to have time to continue to contribute to Biopython in my spare time, and ideally I would like to use and expand Biopython as a portion of my research. I have been considering how to handle gene strandedness. As long as I'm correctly interpreting the following position, my coordinate mapper should produce the correct coordinates with negative strand or mixed strand features. GenBank: join(complement(25..30), 36..40) Biopython: FeatureLocation(24, 30, -1) + FeatureLocation(35, 40) 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 <---------------- -------------> 5 4 3 2 1 0 6 7 8 9 10 I have to admit that it wasn't until I read a BioStar [post](http://biostars.org/post/show/3423/forward-and-reverse-strand-conventions/) earlier this week that I fully understood the relationship between plus/minus forward/reverse sense/antisense coding/template strands. So please let me know as soon as possible if I've made a mistake in the above code. `c2g` yields the correct genome position, but not the strand. I still need to integrate strand information into my `GenomePosition` object and/or partially merge it with `ExactLocation`. This weekend I intend to expand documentation and write a brief cookbook entry.

#BioPython #gsoc #gsoc12 #GSOC2012 #python

Coordinate Mapping update

Following extensive [discussion](http://biopython.org/pipermail/biopython-dev/2012-August/009849.html) on the dev list of the pros and cons of configuration classes/modules, I have refactored my [coordinate mapper](https://gist.github.com/3172753) to keep configuration as isolated as possible. All mapping functions use base 0 internally. Transformation to and from 1-based coords is allowed by custom MapPosition objects. (they are currently separate from the Seq* positions but could probably subclass ExactPosition). The MapPosition objects have to_dialect and from_dialect methods that automatically handle conversion between bases and other formatting details. There are two different ways a user can convert a coordinate from HGVS: # ... assuming cm is an instance of CoordinateMapper # Manually construct position from HGVS CDS_coord = CDSPosition.from_hgvs("6+1") genomic_coord = cm.c2g(CDS_coord) print genomic_coord.to_hgvs() # Pass dialect argument to mapping function genomic_coord = cm.c2g("6+1", dialect="HGVS") print genomic_coord.to_hgvs() Furthermore, the inheritance hierarchy is designed to allow a user to set a default string representation: # Set MapPositions to print as HGVS by default def use_hgvs(self): return str(self.to_hgvs()) MapPosition.__str__ = use_hgvs The [revision](https://gist.github.com/3172753/577b7c383e057b78cdcee64be33f18117a46faaf) as of this writing is passing tests using base 0. I have not yet implemented tests for `from_hgvs` or `to_hgvs`, but that's next on my list. I'm hoping to have time for strand and mixed strand, too. Update: The latest [revision](https://gist.github.com/3172753/7c5f285634b124b2ba2f65fd96114c441382a12d) now tests with default settings and HGVS settings.

#python #BioPython #gsoc #gsoc12 #GSOC2012

Coordinate Mapping

I have been expanding the [coordinate mapper](http://lists.open-bio.org/pipermail/biopython/2010-June/006598.html) Reece posted to the dev list a couple of years ago. It's currently living as a [gist](https://gist.github.com/3172753), although it has grown rather precipitously (over 300 lines each of code and testing). I may have gotten slightly carried away with the concept of "test-driven development," but in this case, extensive testing is extremely critical. Note that as of this writing, the code is in disarray; it was much less messy [before](https://gist.github.com/3172753/f12878bb9d34c524f7427fe7d0bde5747e7eb6d1) I started testing it with 1-based output. (More on that below.) I have modified it to work with the new `CompoundLocation` Peter is working on, while retaining the ability to provide exons as a sequence of pairs. ### Representing intron locations ### One of the more complicated operations in coordinate mapping is converting a genomic intron position to CDS coords. I am using the HGVS [conventions](http://www.hgvs.org/mutnomen/refseq_figure.html), in which positions are shown in the format 123+x and 124-y, where 123 is the last base of one exon in CDS coords and 124 is the first base of the next exon. I'm using a custom object called `NonExonPosition`. Are there standards that use an alternative display for intron positions relative to CDS coords? ### Configuration ### I'm currently using a configuration class to store customization for display of positions, most importantly conversion between the 0-based, Pythonic internal representation and 1-based representation used by many formats (GenBank and HGVS, to name just a few). However, I have read a variety of StackOverflow threads that imply that if I'm trying to use globals or a singleton class, I'm doing something wrong. Is it unwise to use a class simply as a storage object? Does anyone have a "more Pythonic" way to handle module-wide configuration? On a related note, I have been fiddling with adding and subtracting the pesky 1s all day, and I am thinking that I need to adjust my design and refactor the code slightly to maintain internal 0-based coordinates while adjusting the string representation to match 0- or 1-based coords. I'll likely use a class similar to `ExactPosition` (i.e. subclass `int`). Actually, would it be useful to add to the *Position objects a method for representation in GenBank and other formats? ### Other considerations ### * I have a few questions about negative integers. Can genomic or protein coordinates ever have negative coordinates? For CDS, as it be confusing for the integer -1 and the string '-1' to be treated differently, my code interprets both as a downstream position. Using the pattern of the Python list index doesn't make biological sense. * I haven't yet tackled circular genomes. Are they handled by `SeqFeature`/`FeatureLocation` yet? In GFF, the end is represented as an index greater than the "length" of the entire sequence. In order to handle this, it should only be a matter of `except`ing an `IndexError`. * I also haven't considered how strand will influence coordinate mapping. Any pointers in this direction would be appreciated. * Are there Biopython objects other than `FeatureLocation` that should be transformable? For example, is it worth attempting to map non-exact locations across coordinates? ------- All that said, I hope to have a production-ready coordinate mapper by the end of the week. Mailing list: * http://lists.open-bio.org/pipermail/biopython-dev/2012-August/009847.html * http://lists.open-bio.org/pipermail/biopython-dev/2012-August/009849.html

#python #BioPython #GSOC2012 #gsoc #gsoc12

Week 8.5

I previously proposed the implementation of a method for PyVCF that would quickly scan the entire file and provide useful summary statistics. The idea is shamelessly copied from [Brad's GFF parser](https://github.com/chapmanb/bcbb/tree/master/gff "github"); for GFF, this method is helpful because the annotations on a sequence can vary widely. However, I no longer think this would be useful for VCF: 1. Most importantly, the VCF headers generally contain a complete listing of all of the types of information contained in the file. It's technically optional, but I hope that the most commonly used variant callers produce accurate headers. However, if there is a prevalence of files with a mismatch between headers and actual INFO/FORMAT fields, please let me know. 2. Next, any listing of ranges of data such as POS or QUAL might as well be coupled with actual filtering. This would be different if a presentation of the distribution of quality scores would be necessary to set an appropriate threshold. It would also depend on the ratio of speed between the range scan and the filtering (i.e. whether a possible second filter would be unacceptably time consuming). 3. Finally, and perhaps most importantly, many files are so large that scanning an entire file would take too long. Setting a limit and displaying updated information in real time (i.e. writing to `sys.stdout` with '\r', [example](https://gist.github.com/3161269 "github gist")) could overcome this issue. If anyone can think of a great reason to scan a VCF file before filtering it, please get in touch. ------- I added the method `as_SeqFeature()` to my basic variant class, but it's still incomplete. Some of this is in flux due to forthcoming changes to FeatureLocation. I'm currently working on expanding the [coordinate mapper](http://biopython.org/pipermail/biopython/2010-June/006598.html) Reece posted to the dev list a couple years ago. Expect an update on that very soon.

#gsoc12 #gsoc #GSOC2012 #python

Week 7: Variations

The highlight of this week was getting strep throat. I also gave myself a quick lesson in setuptools (working on an unrelated project), which is helpful for understanding how various projects do their building and testing. The fragmentation of standards is certainly not one of the things I love about Python (e.g. getopt/optparse/argparse; distutils/setuptools/distribute/distutils2). Reece suggested trying to represent a variety of variants with just five identifiers: accession, start, stop, pre_seq, and post_seq. I've started a very minimal Variant object ([source](https://github.com/lennax/biopython/blob/variant2/Bio/Variant/variant.py)), using `FeatureLocation` for its location. This uses zero-based, right-open coordinates, similar to array counting in Python. In contrast, HGVS and VCF both count from 1. I've created a list of variant types each represented in HGVS, VCF (if possible), and my new Python representation. Please let me know if there are any errors in my interpretation of these variant types. ### single SNP: g.303G>C #CHROM POS ID REF ALT [...] 19 303 . G C [...] (DNAaccession, 302, 303, G, C) ### compound het: g.[303G>A];g.[303G>C] #CHROM POS ID REF ALT [...] FORMAT SAMPNAME 19 303 . G A,C [...] GT 1|2 [(DNAaccession, 302, 303, G, A), (DNAaccession, 303, 304, G, C)] phases = [True] ### RNA variant: r.78u>a # can VCF represent RNA? (RNAaccession, 77, 78, u, a) ### variant in CDS coords (coding region): c.3G>C # can VCF represent CDS? (CDSaccession, 2, 3, G, C) # Given the FeatureLocations of the CDS, convert to genomic coords ### variant in protein: p.Trp26Cys # can VCF represent CDS? (proteinaccession, 25, 26, Trp, Cys) ### trinucleotide repeat: g.77_79dup #CHROM POS ID REF ALT 19 79 . G GCTG # OR 19 77 . CTG CTGCTG (DNAaccession, 79, 79, "", "CTG") # OR (DNAaccession, 76, 79, CTG, CTGCTG)

#gsoc #gsoc12 #python #gsoc2012

Week 6: vertical slicing

This week, I wrote a script for PyVCF that can filter a file by sample as it's being parsed. It's currently named `vcf_sample_filter.py`. It's designed to be functional from the command line, the Python interpreter, or as a module. I've written basic unit tests for the command line form. To minimize impact on the existing parser, I'm adding the necessary methods to the Reader object at runtime of the filtering script. I believe the technical term for this is "monkey patching." I see no reason this script could not be used in combination with the filter by row script, although I'm not sure what's the best way to combine the two. On the other hand, the unit tests indicate that vcf_filter is broken and I can't get it to accept custom arguments. I want to move the object into the vcf module and leave the command line portion in the scripts directory. I will also add unit tests/examples for the module form. Next up: come up with a generic-via-extensibility representation of a variant. I was thinking about trying to store a HGVS variant in PyVCF's objects but with all the properties it could get ugly and would no doubt be very confusing/misleading. I'm working through some examples and should have a basic outline soon.

#gsoc #gsoc12 #python #gsoc2012

Weeks 4 and 5

Given the choice of attending an aerospace conference or spending three days writing Python in the lobby of a hotel, I chose the latter, which turned out to be rather productive. I finished a prototype writer that reverses the VCF to SQL trip, discovering more of the peculiarities of the meta-format along the way. However, it seems that my SQL project may have been relegated to being a time-consuming primer in the maddeningly data-dense nature of VCF. I've been convinced that file to python to sql to python to file is not going to be particularly efficient.

And so back to the drawing board.

In re(re-re-re)considering the decision of making a dedicated Python variant object versus using SeqFeature directly, I've emailed the Biopython list to ask for feedback. For now, I intend to make a variant object and the ability to convert it to SeqFeature.

I've made a new branch (variant2) which has a very skeletal outline of a set of Python objects designed to store variants. One might note many similarities to the organization of PyVCF. One thing SQL did neatly was store per-allele data with the allele, rather than with the site, and I'm envisioning doing this in Python, as well.

For a Python variant object, are there any organizational choices that would make it easier for possible future conversion of a variant to HGVS syntax? (this is primarily directed at Reece but I'm open to all suggestions)

Another question that may reveal my complete ignorance of haplotypes and such: could a polyploid site ever be partially phased? e.g. a triploid genotype of 0/1|0?

Looking forward to any and all questions, comments, concerns, etc.

Mailing list: http://lists.open-bio.org/pipermail/gsoc/2012/000141.html

#gsoc #gsoc12 #python #gsoc2012

Weeks 2 and 3: More SQL

James raised some [concerns](http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009688.html) about the difficulty of representing the VCF "metaformat" in SQL. I've taken these into consideration and am forging ahead. So far, some of the types of data fit more neatly into SQL than into a VCF row. I have redesigned my SQL schema with a two-pronged approach to tackle the flexibility of VCF: 1. For the site, alt, and genotype tables, there are columns for the reserved info/format keywords in the VCF spec (so far only for non-SV). 2. For new info and format keywords (both in the header and in the body), I am storing the values in a "narrow table." This table stores a foreign key to the key's row and the key-value pair. The narrow table is also good for storing reserved keys that are lists (but not per-allele or per-genotype). Note: this diagram only has the FKs listed for simplicity. ![](http://media.tumblr.com/tumblr_m5sp9736xi1qznvq8.png) Interestingly, despite the increase in the number of tables and thus insert statements, the current script is considerably faster than the previous version. Evidently JSON serialization is slow. There are a few things I haven't figured out: 1. Can an info field be per-genotype? The spec implies that wouldn't make sense, but doesn't forbid it. 2. Is there a safe way to find out if a VCF 4.0 field is per-allele or per-genotype? 3. Will my SQL representation be able to handle SV? ------- I'll be out of town for the next week but I will have plenty of time for Python. Mailing list: http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009725.html

#biopython #gsoc #gsoc12

Weeks 0 and 1: SQL

I started implementing storage of VCF data in SeqRecord and SeqFeature. I digressed, spending a few days experimenting with overloading __getattr__() in lieu of manually writing properties. Then it occurred to me that if, as Reece pointed out, a variant doesn't contain the actual sequence but a reference to the sequence, the advantages to using SeqRecord are minimal or possibly negative.

In my experience, the highest performance for filtering large amounts of data is SQL. SQL has the advantage of scalability: SQLite now ships with Python, users can choose to run their own MySQL/PGSQL server, and I've read about a few approaches to GPU accelerated SQL.

My initial glances at BioSQL, GMOD, etc. didn't show anything specifically designed for variants (again, a focus on storage of the sequence itself) so I implemented my own interface. Currently, the parse_all() method is very slow (approximately 260 seconds for a file with 240,000 variants when the parsing takes 5-10 seconds) and I am investigating why. My first step will be to reduce commit frequency. Update: Reducing commit frequency has knocked this step to 40s.

With a SQL backend, it seems superfluous to have a dedicated variant representation within Python. The SQL result object should allow for straightforward retrieval of data by name. I'm storing "misc" data in a SQL text field using JSON, which is also easy to access.

Looking at BioSQL/GMOD etc to see if there is an existing standard I should be using/following

Deciding the extent of the convenience functions I wish to implement

Thinking about the most efficient way to filter records on the way into the SQL database

Mailing list: http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009682.html

#BioPython #GSOC2012 #gsoc #openbio #SQL

Github issues

I'm going to use Github issues/milestones instead of trying to use some sort of calendar-based timeline. I have a very limited ability to predict how long things will take, so a date-linked timeline seemed like it would end up being "Lenna shuffles due dates around for half an hour every week." Plus, the auto-closing and referencing of issues is really nice. Links: * Variant dev is on the `variant` branch: https://github.com/lennax/biopython/tree/variant * Issues are here: https://github.com/lennax/biopython/issues As an aside, I just saw that Github released a Windows GUI. As a Mac/Linux user who's tried to teach Windows people how to use TortoiseGit, I am really excited to test it out. Has anybody tried it yet?

#BioPython #GSOC2012 #gsoc #openbio #github

A couple quick things: Met a grad student today who has written his own Python PDB parser - he only found out about Biopython two months ago. He has a collection of variations in alphabets used by proprietary software and I am trying to convince him to contribute it to Biopython. His lab does primarily NMR and molecular dynamics. Started a github [branch](https://github.com/lennax/biopython/tree/variant) for the variant project. I did subclass Bio.SeqIO.SequenceIterator but I ended up overloading all of the methods. Got somewhat distracted again reading about lambdas and closures; still not sold. Right now I'm using dictionary function dispatching and I think it's pretty nice. If it needs to get messier, it's possible closures would help. Although I still don't understand closures well enough to explain them to another programmer, much less a six year old.

#biopython #gsoc #gsoc2012

New Vim setup

Vim on Mac has some very odd quirks. I've seen a version refuse to understand backspace, so that the only methods of deleting text are using x and d commands. Mouse support is limited at best, and explore windows aren't resizable. I tried MacVim but got some odd errors ("error detected while processing function"). [This post](http://jeffkreeftmeijer.com/2011/vim-is-hard-i-just-want-to-click-around/) mentions the same error. I used the tried-and-true method of copying someone else's ~/.vimrc to avoid figuring out the real problem. Mouse works, explore windows are resizable and clickable; I'm happy. To resize the font, in ~/.vimrc: `set guifont=Menlo\ Regular:h13` For some reason, MacVim uses different syntax coloring than vim but I haven't investigated why. Then I decided I'd like to alias `vi` to `mvim`. Interestingly, a symlink doesn't work, regardless of user path settings. To work around that, I made /usr/bin/vi be a bash script, as follows: #!/bin/bash mvim $* `$*` passes all command line arguments to mvim. But wait, there's more! git commit messages didn't work with mvim, so I set git core-editor to vim (which is still a symlink to vim72). However, I still got the message indicating there was a problem with the editor. Found [a post](http://tooky.co.uk/2010/04/08/there-was-a-problem-with-the-editor-vi-git-on-mac-os-x.html) which describes a somewhat baffling fix: set the editor to the full path to vim. $ git config --global core.editor /usr/bin/vim It would also be possible to set the $EDITOR environment variable, as I may not want to use mvim for most applications. In summary: * /usr/bin/vim -> vim72 * MacVim from MacPorts * ~/.vimrc from someone whose MacVim works (todo: repository for config files) * /usr/bin/vi : bash script that opens mvim * git core.editor set to /usr/bin/vim

Trending Blogs

Recently Viewed Blogs

import antigravity