GIFing the 1040 and other notes on hacking the IRS website
I published a story on Thursday about the complexity of the US tax code over time, using the length of the IRS Form 1040 as a proxy. It led to responses like this one on Facebook:
Imagine the human being who took the time to make this. We must deeply honor the focus of that person.
Apparently it seemed like a crazy task to filter through almost 100 years of documents and tabulate information about them. Let’s assume they thought I was doing this by hand.
I wasn’t. Not even the GIF. Here’s how:
There’s a command line tool called ImageMagick that will both turn a PDF into a series of images and then turn that series of images into a GIF. These are the two ImageMagick commands I used to accomplish this:
$ for infile in *.pdf; do convert -density 400 -resize 400 -trim -extent 500x700 -gravity north $infile jpeg/$infile.jpg; done
$ convert -delay 25 -loop 0 jpeg/f1040--*-0.jpg animated1040.gif
The first line tells ImageMagick to look at every PDF in the current directory, convert it to a 400px-wide JPEG using a resolution of 400ppi for vector data, trim the blank border, extend the edges of the image to 500px by 700px (anchoring the image to the top center of the new bounds), and save it in the folder named jpeg. The second line tells ImageMagick to merge every .jpg file in the jpeg folder (i.e., every file I just created) with a file name ending in “-0.jpg” (this is the first page of the former PDF) into a GIF called “animated1040.gif” that shows each image for 25 hundredths of a second and loops continuously.
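If you would rather drive those two commands from Python than from a shell loop, a minimal sketch could look like the following. The flags are copied from the commands above in the same order; the function names are made up for this example, and it assumes the `convert` binary is on your PATH (newer ImageMagick releases prefer `magick`).

```python
import subprocess
from pathlib import Path


def pdf_to_jpeg_args(pdf_path, out_dir="jpeg"):
    """Build the first command for one PDF, flags in the order the post uses."""
    # Same naming as the shell loop: jpeg/<original file name>.jpg
    out_path = Path(out_dir) / (Path(pdf_path).name + ".jpg")
    return [
        "convert",
        "-density", "400",
        "-resize", "400",
        "-trim",
        "-extent", "500x700",
        "-gravity", "north",
        str(pdf_path),
        str(out_path),
    ]


def gif_args(frames, out_path="animated1040.gif", delay=25):
    """Build the second command: one GIF frame per image, looping forever."""
    return (
        ["convert", "-delay", str(delay), "-loop", "0"]
        + [str(frame) for frame in frames]
        + [str(out_path)]
    )


def run(args):
    """Run one command and raise if it exits with an error."""
    subprocess.run(args, check=True)
```

Passing the arguments as a list means file names with spaces need no quoting. To rebuild the GIF you would call `run(pdf_to_jpeg_args(p))` for each PDF and then `run(gif_args(sorted(Path("jpeg").glob("*-0.jpg"))))`, where the sort keeps the frames in year order.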
After cleaning and optimizing it in Photoshop I had this.
All of this was dependent on having all of these 1040s. When I started looking for them, I was hoping some think tank or library would have an archive of the documents.
I decided to start simple. The current form is easy to find: a web search for “1040” returned a PDF hosted by the IRS as the top result. Now what about the old forms? A web search for “2010 Form 1040” also returned a PDF on the IRS website, but it had a slightly different URL: www.irs.gov/pub/irs-prior/f1040--2010.pdf. “irs-prior” (I like the look of that) and “f1040--2010.pdf.” Could all of the filenames be systematized?
Yes! A couple minutes of URL manipulation in my browser allowed me to find that there were files at this URL dating back to 1913 (though there were no forms for 1914 and 1915, since those years used the same 1913 form).
The next step was to download all of the files. Should I change the year in each URL and “save as” from my browser? TERRIBLE IDEA. I opened up my command line and used the interactive prompt of Python to download all the files super quick. It went something like this:
$ python
>>> import urllib
>>> years = range(1916,2012)
>>> for y in years:
...     urllib.urlretrieve("http://www.irs.gov/pub/irs-prior/f1040--%s.pdf" % (y,), "f1040--%s.pdf" % (y,))
...
Here is what each line does:
Load the library I need to download files.
Create a list of the years I want to download, starting in 1916 and ending in 2011 (one year before 2012), and call it “years.”
Cycle through every year in that list, calling the current year “y.”
Download the file using the URL and naming system I figured out before, and save the file using the same system.
Two minutes later there were 96 PDFs in my folder for this project. I opened up the 1913 form in my browser and downloaded it, which made 97. BOOM. Every 1040 ever.
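The session above is Python 2. For anyone repeating this today, here is a rough Python 3 sketch of the same loop. The URL pattern is the one from this post and may no longer resolve; the function names and the skip-on-error behavior are additions for this example, not part of what I ran.

```python
import urllib.request
from pathlib import Path

# URL pattern taken from the post; the site may redirect or change over time.
BASE_URL = "http://www.irs.gov/pub/irs-prior/f1040--%s.pdf"


def form_url(year, base_url=BASE_URL):
    """Return the URL for one year's form under the pattern above."""
    return base_url % (year,)


def download_forms(years, dest_dir=".", base_url=BASE_URL):
    """Download one PDF per year, skipping years that fail.

    Returns the list of years that were saved successfully.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for year in years:
        url = form_url(year, base_url)
        # Save under the same file name the URL ends with.
        target = dest / url.rsplit("/", 1)[-1]
        try:
            urllib.request.urlretrieve(url, str(target))
        except OSError as err:  # URLError and HTTPError are OSError subclasses
            print("skipped %s: %s" % (year, err))
            continue
        saved.append(year)
    return saved


# Example call (hits the network, so it is left commented out):
# download_forms([1913] + list(range(1916, 2012)))
```

Catching the error per year means one missing file (say, a year with no form) does not stop the whole run, and the returned list tells you which years actually landed on disk.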
So now I had all these files and had to quantify exactly how much more complex they got over time. My first idea was to use the amount of ink used on each document as a proxy for complexity: I wanted to count the number of black pixels in each document. I used ImageMagick to convert all the PDFs to images so I could start counting pixels.
Using a Python library called PIL, I opened up each file, converted it to grayscale, counted the number of black pixels, calculated the ratio of black to total pixels, associated that JPEG with the appropriate year, and saved that information as a JSON blob and a CSV spreadsheet. I’ll spare you that code here, but you can see it on GitHub.
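To give a flavor of the idea without reproducing the real script, here is a minimal sketch of the black-to-total ratio. It is not the code on GitHub: the function names are invented, the cutoff of 128 for what counts as “black” is an arbitrary assumption, and it uses Pillow, the maintained fork of PIL.

```python
def black_ratio(gray_pixels, threshold=128):
    """Return the share of pixels darker than `threshold`.

    `gray_pixels` is any iterable of 0-255 grayscale values.
    The default threshold of 128 is an arbitrary choice for this sketch.
    """
    total = 0
    black = 0
    for value in gray_pixels:
        total += 1
        if value < threshold:
            black += 1
    return black / total if total else 0.0


def black_ratio_of_file(path, threshold=128):
    """Open an image with Pillow, convert it to grayscale, and measure it."""
    from PIL import Image  # third-party; imported lazily so black_ratio has no dependency

    with Image.open(path) as img:
        # In mode "L" each pixel is one byte, so iterating the bytes yields 0-255 ints.
        return black_ratio(img.convert("L").tobytes(), threshold)
```

Calling `black_ratio_of_file` on each first-page JPEG and pairing the result with the year in the file name would give rows ready for a CSV.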
Using the CSV I got out of that, I made this chart showing the amount of “ink” on the form over time:
It was the opposite of what we knew was true. If the amount of printing were a proxy for complexity, this chart would show that the tax code is less complex now than in the first years of the system.
Were the older documents just bigger? Did they use larger type? I charted the same information but as a ratio of amount of black per page. Same story. Then I realized that the older documents have more instructions on them! (More recent 1040s include instructions in a separate appendix.) What if we just looked at the tabulation page? No luck. Apparently today’s documents are more ink-efficient than those of yesteryear.
I crafted a new strategy: count the number of line items on the form. (Our methodology for what we counted is recounted in the piece.) The slow way to do this would be to double-click each file in a document viewer, count how many lines were on each page and input that into a spreadsheet, hoping I don’t miss anything or make a typo. (It was beer o’clock in the office.)
The fast way is to write more code. I created another Python script that would open up each page of every document individually and prompt me to enter how many lines were on that page and whether I should overwrite the current number of lines I recorded or add these to the number of lines already recorded. Once complete, the script saved a spreadsheet of the recorded information.
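The bookkeeping part of a script like that is small. The sketch below is a reconstruction of the idea, not the script I ran: the names are invented, the “o”/“a” answers are an assumed convention, and the step that opens each page image for viewing is left out.

```python
import csv


def record(counts, year, lines, mode):
    """Update counts[year] with `lines` and return the new total.

    mode "o" overwrites the stored number; mode "a" adds to it.
    """
    if mode == "o":
        counts[year] = lines
    elif mode == "a":
        counts[year] = counts.get(year, 0) + lines
    else:
        raise ValueError("mode must be 'o' or 'a', got %r" % (mode,))
    return counts[year]


def tally(pages, ask=input):
    """Prompt once per (year, page_label) pair and return {year: total}."""
    counts = {}
    for year, page_label in pages:
        lines = int(ask("%s %s: how many lines? " % (year, page_label)))
        mode = ask("(o)verwrite or (a)dd? ").strip().lower()
        record(counts, year, lines, mode)
    return counts


def save_counts(counts, out):
    """Write the tallies as CSV rows to an open text file."""
    writer = csv.writer(out)
    writer.writerow(["year", "lines"])
    for year in sorted(counts):
        writer.writerow([year, counts[year]])
```

Taking the prompt function as a parameter (`ask`) keeps the counting logic separate from the keyboard, so the add-versus-overwrite behavior can be checked with canned answers before you sit down with a stack of forms.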
Thinking about the transfer of instructions from the form to a separate document, I decided to take a look at the instructions booklet and see how those have changed over time. I used the same exact scripts as above to download all of the instruction files by changing the URL slightly to “i1040” from “f1040.” (This naming convention was also revealed by a web search for “1992 form 1040 instructions.”)
The recent documents were long: 2011 is nearly 200 pages. I used some more code to count the number of pages, and it looked like this:
$ python
>>> from pyPdf import PdfFileReader
>>> years = range(1939,2012)
>>> for y in years:
...     print y, PdfFileReader(file("i1040--%s.pdf" % (y,), "rb")).getNumPages()
...
I copied the data from the output (I didn’t save this into a file, for speed’s sake), pasted it into Excel, and made this chart:
Fifteen years ago, tax instructions were half the size! More striking, the booklets from the ’80s had smaller pages but were still significantly shorter than today’s.
So now I had a GIF, three charts, and a whole bunch of data. All that was left was words.
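The page-counting session earlier uses pyPdf under Python 2. A rough Python 3 equivalent, assuming the third-party pypdf package is installed and the instruction files keep the “i1040--YEAR.pdf” names, might look like this. The function names are made up for this sketch.

```python
def page_count(path):
    """Return the number of pages in one PDF, using pypdf."""
    from pypdf import PdfReader  # third-party; imported lazily

    return len(PdfReader(path).pages)


def page_counts_by_year(years, pattern="i1040--%s.pdf", count=page_count):
    """Return [(year, pages), ...] for files named by `pattern`."""
    return [(year, count(pattern % (year,))) for year in years]


def print_counts(rows):
    """Print one 'year pages' line per row, like the earlier session."""
    for year, pages in rows:
        print(year, pages)
```

Running `print_counts(page_counts_by_year(range(1939, 2012)))` in the folder with the instruction PDFs would reproduce the list that went into the chart.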
Read them all here: Line for line, US income taxes are more complex than ever