I have been expanding the [coordinate mapper](http://lists.open-bio.org/pipermail/biopython/2010-June/006598.html) Reece posted to the dev list a couple of years ago. It's currently living as a [gist](https://gist.github.com/3172753), although it has grown rather precipitously (over 300 lines each of code and testing). I may have gotten slightly carried away with the concept of "test-driven development," but in this case, extensive testing is extremely critical. Note that as of this writing, the code is in disarray; it was much less messy [before](https://gist.github.com/3172753/f12878bb9d34c524f7427fe7d0bde5747e7eb6d1) I started testing it with 1-based output. (More on that below.) I have modified it to work with the new `CompoundLocation` Peter is working on, while retaining the ability to provide exons as a sequence of pairs. ### Representing intron locations ### One of the more complicated operations in coordinate mapping is converting a genomic intron position to CDS coords. I am using the HGVS [conventions](http://www.hgvs.org/mutnomen/refseq_figure.html), in which positions are shown in the format 123+x and 124-y, where 123 is the last base of one exon in CDS coords and 124 is the first base of the next exon. I'm using a custom object called `NonExonPosition`. Are there standards that use an alternative display for intron positions relative to CDS coords? ### Configuration ### I'm currently using a configuration class to store customization for display of positions, most importantly conversion between the 0-based, Pythonic internal representation and 1-based representation used by many formats (GenBank and HGVS, to name just a few). However, I have read a variety of StackOverflow threads that imply that if I'm trying to use globals or a singleton class, I'm doing something wrong. Is it unwise to use a class simply as a storage object? Does anyone have a "more Pythonic" way to handle module-wide configuration? On a related note, I have been fiddling with adding and subtracting the pesky 1s all day, and I am thinking that I need to adjust my design and refactor the code slightly to maintain internal 0-based coordinates while adjusting the string representation to match 0- or 1-based coords. I'll likely use a class similar to `ExactPosition` (i.e. subclass `int`). Actually, would it be useful to add to the *Position objects a method for representation in GenBank and other formats? ### Other considerations ### * I have a few questions about negative integers. Can genomic or protein coordinates ever have negative coordinates? For CDS, as it be confusing for the integer -1 and the string '-1' to be treated differently, my code interprets both as a downstream position. Using the pattern of the Python list index doesn't make biological sense. * I haven't yet tackled circular genomes. Are they handled by `SeqFeature`/`FeatureLocation` yet? In GFF, the end is represented as an index greater than the "length" of the entire sequence. In order to handle this, it should only be a matter of `except`ing an `IndexError`. * I also haven't considered how strand will influence coordinate mapping. Any pointers in this direction would be appreciated. * Are there Biopython objects other than `FeatureLocation` that should be transformable? For example, is it worth attempting to map non-exact locations across coordinates? ------- All that said, I hope to have a production-ready coordinate mapper by the end of the week. Mailing list: * http://lists.open-bio.org/pipermail/biopython-dev/2012-August/009847.html * http://lists.open-bio.org/pipermail/biopython-dev/2012-August/009849.html