Jonathan Sick @jonathansick - Tumblr Blog

ADS to BibDesk: Command Line & PDF Ingest

In the last few weeks I've been rolling out improvements to the venerable [*ADS to BibDesk*](http://jsick.net/adsbibdesk) service. Today I'm announcing version 3.0.6. What's new?

1. A full-fledged command line edition, installable with `pip`,
2. A PDF ingest mode, great for getting your legacy folder of PDFs into BibDesk, and
3. Lots of bug fixes to make *ADS to BibDesk* more robust against the peculiarities of some papers.

## The Command Line Edition

It is now possible to run *ADS to BibDesk* from the command line. This opens up new possibilities for hacking your own workflows: from automatic scripts to integration with Mac OS X launchers like [Alfred][].

To get started, you can `pip`-install the latest release (you may need to run this with `sudo`):

    pip install adsbibdesk

Then check out the help:

    adsbibdesk --help

The command line edition takes the very same tokens as the *Service* edition: an ADS or arXiv URL, an ADS bibcode, an arXiv pre-print ID, or a DOI. For example:

    adsbibdesk 1998ApJ...500..525S

## Ingesting a Folder of PDFs

[BibDesk](http://bibdesk.sourceforge.net) is becoming more popular with astronomers. One request I've received from new users is an easier way to add folders full of papers downloaded from ADS and arXiv into BibDesk (with the matching BibTeX and abstract data). *ADS to BibDesk* is good at downloading papers, BibTeX and abstracts; the challenge here is reliably identifying a paper given its PDF.

The approach I've taken is borrowed from an older [script by Dr Lucy Kim][kim]. The first step is to extract text from a PDF, and the second is to extract a DOI string from that text. *ADS to BibDesk* can then act on that DOI as usual.

To extract text from a PDF, I've opted for the [pdf2json][] program.[^1] It can easily be installed with [Homebrew][] on your Mac. Before you try the PDF ingest mode, go ahead and install `pdf2json`.

Next, we need to extract a DOI from the paper's text: a perfect job for regular expressions.
The solution is [written by Alix Axel in this excellent StackOverflow post][doiregex], and the Python implementation is:

    import re

    # paperTxtData holds the text extracted from the PDF
    regStr = r'\b(10[.][0-9]{4,}(?:[.][0-9]+)*/(?:(?!["&\'<>])\S)+)\b'
    pattern = re.compile(regStr)
    doiMatches = pattern.findall(paperTxtData)

Reading through that [StackOverflow post][doiregex], it appears that DOI is a tricky format to parse. Fortunately, this regular expression seems to work with the astronomical literature.

You can give this PDF ingest workflow a try via:

    adsbibdesk -p my_pdf_dir/

where `my_pdf_dir/` is a directory containing PDFs that you want to ingest into BibDesk. Note that DOIs are not present in all papers, particularly ones more than a few years old. You can easily find the DOI text on the first page of newer papers.

## Bug Fixes

Personally, I'm most excited about some of the bugs we've been able to fix (mostly with the prodding of [Issues](https://github.com/jonathansick/ads_bibdesk/) posted on GitHub).

First, we've fixed a lot of problems caused by Unicode characters and LaTeX markup in BibTeX data. The point of failure was how this data was escaped and passed via pipes between the Python scraper code and the AppleScript interface script to BibDesk. The solution was simple: don't try to escape characters passed on the command line; just pass data through a temporary file.

The second bug was harder to identify. Some papers would work fine with the command line edition, but crash the Service edition. Thanks to a [bug report](https://github.com/jonathansick/ads_bibdesk/issues/17) we determined that the problem is triggered by papers with quotation marks in the paper title, such as [*The "True" Column Density Distribution in Star-Forming Molecular Clouds*](http://adsabs.harvard.edu/abs/2009ApJ...692...91G).

It turns out the problem was ultimately with the HTML served by ADS. Abstract pages are laden with helpful metadata, but these metadata fields are *not* escaped!
Thus the header of the aforementioned paper's HTML page contains a metadata line in which the title's quotation marks appear unescaped. Those extra unescaped quotation marks break the `HTMLParser` module in Python, though not always. With the command line edition I run Python 2.7.3, whose `HTMLParser` is robust against this type of malformed HTML. But the Service edition uses the default Python provided by Apple (version 2.7.1 for Mountain Lion). In this version of Python, `HTMLParser` is stopped cold by such HTML errors. To make `HTMLParser` happy, *ADS to BibDesk* pre-processes the ADS HTML to remove these metadata lines.

## Roadmap

My wish-list for future updates includes: integrating the arXiv-updater script into the command line interface, and being more careful when updating papers to not lose BibTeX data (*e.g.* the `notes` field).

In the meantime, I have papers *to write*. But do tweet me, @jonathansick, or [post an Issue to GitHub](https://github.com/jonathansick/ads_bibdesk/issues) if you have problems or suggestions.

[^1]: I find that `pdf2json` loses word spaces in its output. If you know of a better text extraction program, I'm open to suggestions. Tweet @jonathansick.

[Alfred]: http://www.alfredapp.com
[kim]: http://www.mit.edu/people/lucylim/BibDesk.html
[pdf2json]: http://code.google.com/p/pdf2json/
[Homebrew]: http://mxcl.github.com/homebrew/
[doiregex]: http://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page
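The pre-processing step from the Bug Fixes section could look roughly like the sketch below. It is an illustration only, not the code shipped in *ADS to BibDesk*: it assumes each metadata tag sits on its own line, it targets Python 3's `html.parser` (the tool itself used the Python 2 `HTMLParser` module), and the `dc.title` attribute, the sample page, and the names `strip_meta_lines` and `TitleParser` are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Matches a whole line holding a <meta ...> tag (assumes one tag per line).
META_LINE = re.compile(r"^[ \t]*<meta\b.*$\n?", re.IGNORECASE | re.MULTILINE)


def strip_meta_lines(html):
    """Return the HTML with every line that starts with a <meta> tag removed."""
    return META_LINE.sub("", html)


class TitleParser(HTMLParser):
    """Collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Hypothetical page fragment: the meta tag's content holds raw quotation marks.
page = (
    "<html><head>\n"
    '<meta name="dc.title" content="The "True" Column Density Distribution">\n'
    "<title>Example abstract page</title>\n"
    "</head><body></body></html>\n"
)

# Drop the troublesome metadata lines first, then parse what remains.
parser = TitleParser()
parser.feed(strip_meta_lines(page))
print(parser.title)
```

Removing whole lines is crude but keeps the parser away from the malformed attributes entirely, which matches the "remove these metadata lines" strategy described in the post.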

#adsbibdesk

This weekend we're commissioning a new 14" Celestron for the [Queen's Observatory](http://observatory.phy.queensu.ca). Photo by Prof [Stéphane Courteau](http://www.astro.queensu.ca/people/Stephane_Courteau/main.php).

At last, you can continue to efficiently ingest astro papers into BibDesk thanks to the latest update to [*ADS to BibDesk*](http://www.jonathansick.ca/adsbibdesk/index.html).[^1] I won't say much about this release, aside from reports by beta-testers that it does indeed work. But I will write again soon about some updates to *ADS to BibDesk*: shell usage, and a cool method of bulk PDF ingesting to help those getting started with BibDesk. Stay tuned.

[^1]: The bug, if you care to know, was a mysterious consequence of Mountain Lion's handling of Automator actions that passed data from a Python shell script into an AppleScript. The solution was to embed the AppleScript directly into the Python script (obviating the step of passing the data within Automator). Now when you run *ADS to BibDesk*, it installs the AppleScript necessary for talking to BibDesk. Look for `~/adsbibdesk_injector.scpt` in your home directory. To see the details, check out the source on [GitHub](https://github.com/jonathansick/ads_bibdesk) as always.

Continuous LaTeX Compilation with a Python Watchdog

I recently came across the [Watchdog][] Python package that allows scripts to act on changes in the filesystem. An obvious application is continuous integration: running `make` whenever a source file changes.[^1] Even more pertinent for academics is continuous compilation of LaTeX documents. My script borrows ideas from the [Watchdog example][] and this [GITS Blog post][].

To run, simply execute the script from the same directory as your LaTeX project. Whenever a file changes in the directory watched by the `Observer` instance, the `on_any_event()` method of the `FileSystemEventHandler` instance is called. If the event is due to a `*.tex` file, the `subprocess` module is used to call `make`.

If you don't use make files to manage your LaTeX compilation, perhaps a direct call to something like [latexmk][] with

    subprocess.call('latexmk -f -pdf -bibtex-cond paper.tex', shell=True)

would work.

[Watchdog]: http://packages.python.org/watchdog/index.html
[Watchdog example]: http://packages.python.org/watchdog/quickstart.html#a-simple-example
[GITS Blog post]: http://ginstrom.com/scribbles/2012/05/10/continuous-integration-in-python-using-watchdog/
[latexmk]: http://www.phys.psu.edu/~collins/software/latexmk-jcc/

[^1]: Other applications are numerous; a Dropbox-style uploader is also possible, for example.
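As a rough sketch of such a watcher (not the original script), the following uses Watchdog's `Observer` and `FileSystemEventHandler` as described in the post. The names `is_tex_source`, `watch`, and `TexChangeHandler` are my own, and the default `make` command assumes a Makefile in the watched directory.

```python
import subprocess
import time


def is_tex_source(path):
    """Return True when the changed path is a LaTeX source file."""
    return str(path).endswith(".tex")


def watch(directory=".", command=("make",)):
    """Re-run `command` whenever a *.tex file under `directory` changes."""
    # Imported here so is_tex_source() stays usable without watchdog installed.
    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class TexChangeHandler(FileSystemEventHandler):
        def on_any_event(self, event):
            # Ignore directory events and build products (.aux, .log, .pdf).
            if not event.is_directory and is_tex_source(event.src_path):
                subprocess.call(list(command), cwd=directory)

    observer = Observer()
    observer.schedule(TexChangeHandler(), directory, recursive=True)
    observer.start()
    try:
        # Keep the main thread alive; the observer runs in its own thread.
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```

Calling `watch()` from the LaTeX project directory starts the loop; Ctrl-C stops it. Filtering on the `.tex` suffix keeps the files written by the build itself from triggering another build.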

#python

For many of us, the most shocking revelation to come out of CERN's Higgs boson announcement today was quite unrelated to the science itself. Rather, we were blown away by the fact that a team made up of some of the most undoubtedly brilliant people in the world believe that Comic Sans is an appropriate font for such a historic occasion.

[Sam Byford](http://www.theverge.com/2012/7/4/3136652/cern-scientists-comic-sans-higgs-boson). I concur (and hat tip to the [Panda](http://blog.jonathansick.ca/post/26026556172/why-yes-those-are-roasted-marshmallow-and-smores)).

Herbig-Haro 110, seen by [HST](http://hubblesite.org/newscenter/archive/releases/2012/30/).

The IAC has [posted videos](http://iactalks.iac.es/talks/serie/15) from the Secular Evolution Winter school on their new [IACTalks](http://iactalks.iac.es/) website. It was a really fantastic school, and this curriculum could easily constitute the basis for graduate reading courses on galactic and extragalactic astronomy. The rock star lecturers include Kormendy, Binney, Scoville, Calzetti, Peletier, Buta, van Gorkom, and Bosma.[^1] I enjoyed attending the school [in Tenerife](http://blog.jonathansick.ca/post/13079909768/puerto-de-la-cruz-tenerife) last year, and I think I'll be referring to these lectures again.

[^1]: For some reason, Athanassoula's talks aren't included. A shame; her talks on the dynamics of bar systems were some of the best lectures in the school.

Why yes, those are roasted marshmallow and s'mores milkshakes. Thanks [Stand 4](http://www.standburger.com/).

Jean-Eric Vergne launches his Toro Rosso. Reminds me of the opening scenes from *Top Gun*.

Narain Karthikeyan locks up his HRT into Turn 1.

Sebastian Vettel in the Senna 'S'. Notice how he unloads the front-right tire.

Lewis Hamilton at the Turn 2 apex.

I got to see Lewis Hamilton take victory at the Montreal Grand Prix last weekend. My dad loaned me a 500 mm lens for a few photos during second free practice.

Hey! I just met you, and this is crazy.

Sculpture is always closer than architecture to pure form, being mostly liberated from all the obvious constraints (environmental, economic, technological and political) that shape any building’s design. Architecture is a contaminated art in this sense, but that is also a virtue. It’s a social art. It creates social spaces. The best architecture embraces and pushes beyond this, formally. There’s a metaphor at play. A free society consensually accepts its governing burdens and principles. Constraint and freedom: the essence of good architecture and a healthy culture.

— Michael Kimmelman in [A Canopy as Social Cathedral](http://www.nytimes.com/2012/06/05/arts/design/architectural-canopy-shines-in-battery-park-city.html)

Dennis Overbye writes:

The phone call came like a bolt out of the blue, so to speak, in January 2011. On the other end of the line was someone from the National Reconnaissance Office, which operates the nation’s fleet of spy satellites. They had some spare, unused “hardware” to get rid of. Was NASA interested?

Also read Jon Timmerman's article at Ars, where Marshall Perrin (STScI) comments on ITAR.

Visit for the prophecies,[^1] stay for the design.

[^1]: God knows I need #13, The Anti-Slouch Screen.

#nytimes
