From the Hands of David Yanofsky @yanofsky - Tumblr Blog

There appears to be some disagreement on the location of Alaska

All of a sudden cartograms of the US keep popping up in my twitter feed.

The New York Times recently made this one

538 made this one

NPR made this one

(Which was apparently a precursor or later exploration of this one they published)

Update, Here’s Propublica’s:

Update, Here’s Bloomberg’s:

Update, now National Journal published one:

(It’s identical to the NYT’s)

Update: Here’s one the Guardian made:

Update: Here’s one from the Wall Street Journal:

Update: Here’s one from WNYC

Update: this is the Marshall Project’s

Fine, cartograms are great for this kind of each-state-counts-the-same-no-matter-what-size visual analysis. And obviously this type of visualization forces the designer to balance the actual arrangement of states with what fits in a gridded US-like shape.

Here are some considerations that seem to be taken in all three:

Hawaii and Alaska aren’t in the continental US

Florida is a dongle

Maine/The north east is a dongle

The Canadian border is straight until the Great Lakes

The Mexican border slopes north west to south east

Outside of these rules, it would appear that every news org in town is trying to reinvent the wheel with its arrangement of states in these things. Scroll back up and compare the locations of the following states:

California

Alaska

Rhode Island

Wisconsin

Illinois

Chartbuilder 0.6

For the last month or so reporters at Quartz have been building charts with the bleeding edge version of Chartbuilder. Have you noticed? Didn't think so.

Last night I was able to merge these changes into the master branch—it includes some long long long outstanding pull requests and bug fixes.

You can try it out using the hosted version of Chartbuilder, but as always, Chartbuilder really sings only after you customize it and host it yourself.

The version number is 0.6 as I still can't get myself to bump it to 1.0 until it has proper documentation, a complete suite of unit tests, a less-than-utilitarian interface and fewer situations where the whole thing breaks down.

I hope you find these changes helpful. As you come across bugs or desire more functionality, please let me know by submitting them on github or by email or twitter or bike messenger. Better yet, take a stab at fixing them yourself, or any of the other open issues; currently there are 34 waiting to be fixed.

Here are the ones that were just closed:

Bug Fixes:

Prevent series colors from changing when the type of series is changed

Prevent the plague of the skinny columns

Strange behavior with titles of bargrids

Auto title a chart with only one series works again

Enhancements:

the input accepts numbers that have $ £ € or % on them

the input accepts numbers with your region's decimal and thousands separators, e.g. 1,300.10 is valid on machines with a US locale and 1'3000,10 is valid on machines with a French locale (thanks Parker Shelton)

the input accepts excel error cells e.g. #N/A (and plots them as blanks)

there are lots more semicolons in the code

the automatic date axis format is much much better (thanks Parker Shelton)

you can take a regular date series and format it as quarters

Up to 10 y-axis ticks by default

font loading to support https (thanks to Imran Nathani)

completely integrate Bower and include installation instructions for using it. (thanks to Alan Palazzolo)

the HTML table output uses the number formats of a user's locale

Inline styling isn't overwritten on save if there is no new rule. (thanks to Alan Palazzolo)

#chartbuilder #opennews #opensource #charts #ddj

Changing the author of track changes comments in Word

A friend of mine needed to submit a Word document with track changes for school, but she needed it to be anonymized. The default behavior for Word is to attach the name listed in Word preferences to every change when track changes is turned on. However, if you change your name in the preferences, all previous edits still remain under the old name in the document and are not editable.

With a series of command line functions you can change all the track changes in a Word document to any name you need, because the docx format is actually just a zip file of xml documents that contain all of the Word doc's content and metadata. (It's specified here)

Here's the code you can run in your Mac's terminal to change all of the track changes author names. It assumes the file you're editing is on your desktop.

cd ~/Desktop
unzip myDocument.docx -d anonDocument/
grep -rl "w:author" ./anonDocument | xargs sed -i '' 's/w:author="[a-zA-Z0-9 ]*"/w:author="anonymous edit"/g'
cd anonDocument
zip -r ../cleanDocument.docx .
cd ..
rm -r anonDocument
open cleanDocument.docx

This is what that does:

change the working directory to the Desktop

unzip the word doc into a new directory called anonDocument

search all of the files in the word doc package and replace any comment or track changes author with anonymous edit

change the working directory to the anonDocument directory

create a new word doc on the Desktop called cleanDocument.docx

change the working directory back to the Desktop

remove unzipped document folder from the desktop

open the new document

The code is on github, as well.
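If you'd rather not shell out, roughly the same idea can be sketched in Python with the standard library. This is not the code on github, just a minimal sketch under a few assumptions: the function name anonymize_docx and its parameters are made up for illustration, and the pattern here matches any quoted author value, which is broader than the letters, digits and spaces the sed command above matches.

```python
import re
import zipfile
from xml.sax.saxutils import quoteattr

# Matches a w:author attribute with any double-quoted value.
AUTHOR_RE = re.compile(rb'w:author="[^"]*"')


def anonymize_docx(src_path, dst_path, new_author="anonymous edit"):
    """Copy src_path to dst_path, rewriting every w:author attribute."""
    # quoteattr escapes XML special characters and adds the quotes.
    replacement = ("w:author=" + quoteattr(new_author)).encode("utf-8")
    with zipfile.ZipFile(src_path) as zin, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename.endswith(".xml"):
                # A function replacement avoids backslash-escape processing.
                data = AUTHOR_RE.sub(lambda match: replacement, data)
            zout.writestr(item, data)
```

Calling anonymize_docx("myDocument.docx", "cleanDocument.docx") from the Desktop would stand in for the unzip, sed and zip steps. Only the .xml parts are touched, so images and other binary parts are copied through unchanged.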

That was terrifying, exhilarating, and distracting

I released Chartbuilder last week not knowing what to expect. I had thrown up some gists in the past, but I'd never open-sourced an ongoing project.

I had a fear that it would be shrugged off with a whimper of notice. I was worried that people who I admired in the industry would see my efforts and think of them as trivial.

I still view Chartbuilder as a simple solution to a simple problem, why shouldn't they think so too? (I also wrestle with the "I'm not a developer" complex described perfectly by Noah Veltman)

I wrote up a piece about how the tool has made Quartz better, and the internet–or at least the little corner of the internet that I operate in–went nuts.

Turns out, lots of people in news and elsewhere have been dying for a way to easily create and export charts as images.

Here are some people who I've never met (well, with the exception of the one who I went to middle school with...TRIVIA!) whose work and opinions I admire and respect saying nice things about Chartbuilder:

Chartbuilder: http://t.co/77QbyTm96A

— Daring Fireball (@daringfireball)

August 1, 2013

Chartbuilder, a D3-based frontend from @qz https://t.co/oOJ8kpnyda

— Nathan Yau (@flowingdata)

July 31, 2013

If I were an online journalist, I'd definitely want to work for @qz, who are tearing it up right now. http://t.co/9y1fbO78Vm

— Downtown Josh Brown (@ReformedBroker)

July 31, 2013

Chartbuilder is everything I've ever hoped for. Nice work @YAN0 and @QZ http://t.co/hVukmWqH9F

— Jared Keller (@jaredbkeller)

July 30, 2013

Journalists developing tools to spread better journalism is a great trend, and @YAN0 helps us all: http://t.co/NuiWapMXSf

— Erin Sparling (@everyplace)

July 30, 2013

@ezraklein @qz It's great, right? Simple, nice, easy.

— Jeremy Bowers (@jeremybowers)

July 30, 2013

I love, love, love how @qz made a chart-building tool for their reporters, instead of just making charts for them http://t.co/dwFci1Apto

— jonathanstray (@jonathanstray)

July 30, 2013

great stuff. thanks. MT @YAN0: The charting tool I made for @qz is now open source! Read about it here: http://t.co/KlNeZ9SISP

— gabriel dance (@gabrieldance)

July 31, 2013

Chartbuilder is already being used in other newsrooms, and has gained fabulous contributors on github.

Now that it's open source, the version of Chartbuilder today is significantly better than the version that only existed on Quartz servers last month.

Terrifying, exhilarating, and distracting. Also, incredibly fun.

#chartbuilder dataviz news opensource github charts quartz qz newsapps

quartzthings

We’ve just open-sourced Chartbuilder, the tool that all reporters use at Quartz to quickly make simple charts at graphics-desk quality. Read more about how Chartbuilder came to be and how we use it in David Yanofsky’s piece for the Nieman Journalism Lab.

theannotationlayer

A couple of hours ago, over beer, I was telling a colleague of mine how good this Periscopic graphic about gun deaths is. If you haven’t seen it, you should check it out right now. The way that the dots (people) just drop off of their potential lifespans, and how, once the animation gets up to full speed, the whole thing looks like a machine gun firing…it’s super affecting.

But I’m starting to question the editorial judgement a little bit. I took another look at the graphic tonight after finding it to share the link with my colleague. I hadn’t actually realized that you can click on any one of those lines—which, of course, represent real individuals—and be taken to the news story about the corresponding person’s death.

After filtering out all but the deaths in the past seven days, I found and clicked on one that had taken place in my own borough of Brooklyn. Apparently, the victim had stabbed somebody, and then lunged with his knife at the cops who arrived on the scene. The cops ended up shooting and killing him.

I’m not sure that including gun deaths like this one in the graphic was a sound decision. Clearly the graphic was intended to inform the debate about gun regulation in the US. It was published when Sandy Hook was very fresh in everyone’s mind and Wayne LaPierre was on TV almost every day.

So, in addition to tacitly arguing for tighter gun control, is it also arguing that police officers shouldn’t have guns? And, is it really fair to say that someone who gets shot after threatening a group of cops with a knife has had his life stolen from him? He played some role in his demise, no?

Obviously, I have no idea what actually happened that night. The cops could have been trigger-happy or bigoted or just a bunch of dumbasses. Maybe they did fire without cause and maybe they did steal a life. I’m not sure.

But the point is that Periscopic isn’t either. They made the decision to include all gun deaths and to declare the consequent lost years of the victim “stolen” regardless of who fired the gun and whether or not it was self-defense.

And I understand—there are a lot of gun deaths in this country, unfortunately, and going through every individual death probably isn’t all that feasible for the Periscopic team.

But if you’re going to take on a project this ambitious and important, I think that you should do your best not to be misleading. A simple way to do that would be to not include cases where a cop was the shooter. Surely, police officers have caused a slew of unnecessary gun deaths. But save that injustice for a different graphic.

A final fiscal cliff wall story

A friend of mine was at a party last night. The hosts pull her aside and say,

“Lauren, we need to talk.”

They take her into the bathroom.

“For the last month we've looked out this window in our shower and saw this big 'No' taped on that wall over there…we couldn't figure out what it was or why it would be there. We were obsessed. We would try to figure it out all the time. But now it's gone.

“The other day we were on Facebook and we saw that you liked a picture of the wall!

“WHAT IS THIS WALL?”

http://www.hastheusgoneoffthefiscalcliff.com/

quartzthings

How we built hastheusgoneoffthefiscalcliff.com

Today we put hastheusgoneoffthefiscalcliff.com to sleep after a month of service, so we wanted to explain how it came to be.

How did the site work?

We had a DSLR plugged into an AC power supply, on a tripod, hooked up to a Mac Mini with a USB cable:

The Mac Mini ran a bash script every 5 minutes through the crontab

the script triggered a camera capture through the USB cable and downloaded the image

the script created two smaller resized copies of the image (one for the site one for social media use)

the script uploaded those images to our web server, replacing the previous captures

the script put a timestamp in the full sized image’s filename and moved it to an archive on the Mac Mini (for posterity)

The webpage was hard coded to the location of the image file and had some javascript that would update the image every 2.5 minutes (faster than images are taken to reduce the lag between what one user may see and what is actually in the office). That script used jQuery and looked like this:

setInterval(function() {
  $("#camimg").attr(
    "src",
    "http://hastheusgoneoffthefiscalcliff.com/imagecapture_1000.jpg?timestamp=" + (new Date()).getTime()
  );
}, 1000*60*2.5);

This redefined the path to the image every 2.5 minutes, with the current timestamp appended as a query parameter to make sure we don't get a cached version of the image.
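The cache-busting trick isn't specific to jQuery. Here is a minimal sketch of the same idea in Python, assuming a made-up helper called cache_busted (nothing like it was part of the site): it sets a timestamp query parameter so each request looks like a new URL to any cache in between.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_busted(url, now_ms=None):
    """Return url with a timestamp query parameter set to the current time."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # milliseconds, like Date.getTime()
    parts = urlsplit(url)
    # Drop any earlier timestamp so repeated calls do not stack parameters.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "timestamp"]
    query.append(("timestamp", str(now_ms)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The now_ms argument is only there so the function can be checked with a fixed value; in normal use you would leave it out.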

The clickable areas were defined in a Google spreadsheet that was loaded in every time the page loaded and on each subsequent image replacement. We updated this document by hand every time we changed the wall.

How did the wall work?

The hard way: Every morning we got up, printed out some headlines, tweets, quotes and pictures, tiled them together and taped them to the wall.

Why??!

The idea for the single serving site from the beginning was Zach’s. Sometime in October he noticed that the hastheusgoneoffthefiscalcliff.com domain was available to register, and he brought it up as something maybe worth pursuing. The first idea he sketched out was this:

A man who slowly inches towards the edge of a cliff paired with links to stories around the web about the topic and an explanation of what the fiscal cliff was. That conversation devolved into the merits of different depictions of cliffs:

After seeing Brian Rea’s coverage of the US Presidential Election Night on Instagram and remembering the website for Sagmeister and Walsh, I thought about making a webcam of a wall of stuff. I made this as a proof of concept:

Everyone got on board, and this is what resulted over the next month:

- David Yanofsky

View the code on github

Development time: 2 days

quartzthings

GIFing the 1040 and other notes on hacking the IRS website

I published a story on Thursday about the complexity of the US tax code, using the length of IRS Form 1040 over time as a proxy. It led to responses like this one on Facebook:

Imagine the human being who took the time to make this. We must deeply honor the focus of that person.

Apparently it seemed like a crazy task to filter through almost 100 years of documents and tabulate information about them. Let’s assume they thought I was doing this by hand.

I wasn’t. Not even the GIF. Here’s how:

Making the GIF

There’s a command line tool called ImageMagick that will both turn a PDF into a series of images and then turn that series of images into a GIF. These are the two ImageMagick commands I used to accomplish this:

$ for infile in *.pdf; do convert -density 400 -resize 400 -trim -extent 500x700 -gravity north $infile jpeg/$infile.jpg; done
$ convert -delay 25 -loop 0 jpeg/f1040__*-0.jpg animated1040.gif

The first line tells ImageMagick to look at every PDF in the current directory, render it at a resolution of 400 ppi for vector data, resize it to a 400px-wide JPEG, trim the surrounding margin, extend the edges of the image to 500px by 700px (anchoring the image to the top center of the new bounds), and save it in the folder named jpeg. The second line tells ImageMagick to merge every .jpg file in the jpeg folder (i.e., every file I just created) with a file name ending in “-0.jpg” (this is the first page of the former PDF) into a GIF called “animated1040.gif” that shows each image for 25 hundredths of a second and loops continuously.

After cleaning and optimizing it in Photoshop I had this.

Finding the files

All of this was dependent on having all of these 1040s. When I started looking for them, I was hoping some think tank or library would have an archive of the documents.

I decided to start simple. The current form is easy to find. A web search for “1040” revealed the IRS-served PDF as the top result. Now what about the old forms? A web search for “2010 Form 1040” also returned a PDF on the IRS website, but it had a slightly different URL: www.irs.gov/pub/irs-prior/f1040--2010.pdf. “irs-prior” (I like the look of that) and “f1040--2010.pdf”. Could all of the filenames be systematized?

Yes! A couple minutes of URL manipulation in my browser allowed me to find that there were files at this URL dating back to 1913 (though there were no forms for 1914 and 1915, since those years used the same 1913 form).

Downloading the docs

The next step was to download all of the files. Should I change the year in each URL and “save as” from my browser? TERRIBLE IDEA. I opened up my command line and used the interactive prompt of Python to download all the files super quick. It went something like this:

$ python
>>> import urllib
>>> years = range(1916,2012)
>>> for y in years:
...     urllib.urlretrieve("http://www.irs.gov/pub/irs-prior/f1040--%s.pdf" % (y,), "f1040--%s.pdf" % (y,))
...

Here’s what that means:

start python

load the library I need to download files

create a list of years that I want to download: start in 1916 end in 2011 (one year before 2012) call it “years”

cycle through every year in that list calling the current year “y”

download the file using the url and naming system I figured out before, save the file using the same system
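The session above is Python 2. For anyone following along on Python 3, the steps just listed could be sketched like this. The names BASE_URL, form_files and download_all are mine, not from the original session, and I'm assuming the IRS still serves files at the same paths, which may no longer be true.

```python
import urllib.request

# URL prefix taken from the post; whether it still resolves is an assumption.
BASE_URL = "http://www.irs.gov/pub/irs-prior/"


def form_files(first_year, last_year, prefix="f1040"):
    """Yield (url, filename) pairs for every year in the inclusive range."""
    for year in range(first_year, last_year + 1):
        filename = "%s--%d.pdf" % (prefix, year)
        yield BASE_URL + filename, filename


def download_all(first_year, last_year, prefix="f1040"):
    """Fetch each file into the current directory (needs network access)."""
    for url, filename in form_files(first_year, last_year, prefix):
        urllib.request.urlretrieve(url, filename)
```

Calling download_all(1916, 2011) covers the same years as range(1916,2012). Splitting out form_files makes it easy to eyeball the URLs before hitting the server.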

Two minutes later there were 96 PDFs in my folder for this project. I opened up the 1913 form in my browser and downloaded it. BOOM. Every 1040 ever.

Counting pixels

So now we had all these files and we had to quantify exactly how much more complex they got over time. My first idea was to use the amount of ink used on each document as a proxy for complexity. I wanted to count the number of black pixels in each document. I used ImageMagick to convert all the PDFs to images and could start counting pixels.

Using a Python library called PIL, I opened up each file with Python, converted it to grayscale and counted the number of black pixels, calculated the ratio of black-to-total pixels, associated that JPEG with the appropriate year and saved that information as a JSON blob and CSV spreadsheet. I’ll save you from that code here, but you can see it here on github.
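For a rough idea of the counting step, here is a minimal sketch. It is not the script on github: the function names are made up, it assumes Pillow (the maintained fork of PIL) is installed, and the cutoff of 128 for what counts as "black" is my assumption, not necessarily the one the original script used.

```python
def black_ratio(histogram, threshold=128):
    """Share of pixels darker than threshold, given 256 grayscale counts."""
    total = sum(histogram)
    if total == 0:
        return 0.0
    return sum(histogram[:threshold]) / total


def black_ratio_of_image(path, threshold=128):
    """Open an image with Pillow (assumed installed) and measure its ink."""
    from PIL import Image  # imported here so black_ratio works without Pillow

    with Image.open(path) as image:
        # Mode "L" is 8-bit grayscale; its histogram has 256 buckets.
        histogram = image.convert("L").histogram()
    return black_ratio(histogram, threshold)
```

Working from the histogram rather than looping over every pixel keeps the counting fast even on large page scans.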

Using the CSV I got out of that, I made this chart showing the amount of “ink” on the form over time:

It was antithetical to what we knew was true. If the amount of printing was a proxy for complexity, this chart would show that the tax code is less complex now than in the first years of the system.

Were the older documents just bigger? Did they use larger type? I charted the same information but as a ratio of amount of black per page. Same story. Then I realized that the older documents have more instructions on them! (More recent 1040s include instructions in a separate appendix.) What if we just looked at the tabulation page? No luck. Apparently today’s documents are more ink-efficient than those of yesteryear.

Counting lines

I crafted a new strategy: count the number of line items on the form. (Our methodology for what we counted is recounted in the piece.) The slow way to do this would be to double-click each file in a document viewer, count how many lines were on each page and input that into a spreadsheet, hoping I don’t miss anything or make a typo. (It was beer o’clock in the office.)

The fast way is to write more code. I created another Python script that would open up each page of every document individually and prompt me to enter how many lines were on that page and whether I should overwrite the current number of lines I recorded or add these to the number of lines already recorded. Once complete, the script saved a spreadsheet of the recorded information.
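The bookkeeping half of a script like that is small. Here is a minimal sketch of it, with made-up names (record, to_csv) and without the part that opens each page; in a real session the lines and mode values would come from a prompt such as input().

```python
import csv
import io


def record(counts, year, lines, mode="add"):
    """Store a page's line count for a year, adding to or replacing the total."""
    if mode == "overwrite":
        counts[year] = lines
    elif mode == "add":
        counts[year] = counts.get(year, 0) + lines
    else:
        raise ValueError("mode must be 'add' or 'overwrite'")
    return counts


def to_csv(counts):
    """Render the tallies as CSV text with one row per year."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["year", "lines"])
    for year in sorted(counts):
        writer.writerow([year, counts[year]])
    return buffer.getvalue()
```

Keeping "add" and "overwrite" as explicit modes mirrors the two choices described above: a multi-page form adds each page to the year's total, while a miscount can be replaced outright.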

I made this chart.

Thinking about the transfer of instructions from the form to a separate document, I decided to take a look at the instructions booklet and see how those have changed over time. I used the same exact scripts as above to download all of the instruction files by changing the URL slightly to “i1040” from “f1040.” (This naming convention was also revealed by a web search for “1992 form 1040 instructions.”)

Counting pages

The recent documents were long: 2011 is nearly 200 pages. I used some more code to count the number of pages, and it looked like this:

$ python
>>> from pyPdf import PdfFileReader
>>> years = range(1939,2012)
>>> for y in years:
...     print y, PdfFileReader(file("i1040--%s.pdf"%(y,),"rb")).getNumPages()
...

I copied the data from the output (I didn’t save this to a file, for speed’s sake), pasted it into Excel, and made this chart:

Fifteen years ago, tax instructions were half the size! More striking, the booklets from the ’80s have smaller pages, but were still significantly shorter than today’s.

So now I had a GIF, three charts, and a whole bunch of data. All that was left was words.

Read them all here: Line for line, US income taxes are more complex than ever

-David Yanofsky

View the code on github

Development Time: 1 day

My latest interactive is a comparison tool for the Center for Global Development's Commitment to Development Index, a barometer of developed countries' dedication to helping poorer nations improve their standing.

The piece is responsive and accompanied by a bunch of words by Tim Fernholz. It runs on the visualization library d3, making this my second published work to leverage it. (The first was a dashboard for the release of the jobs report that I will write about soon.) This would not have been possible without Scott Murray's tutorial. Inspiration also came from the recently launched State-by-State interactive from the Bloomberg Visual Data group; seeing the HTML and CSS markup in their drop downs was very instructive.

#visualization dataviz graphics quartz d3js responsive

bizweekgraphics

yanofsky

That feeling when you weren't expecting a byline.

"Unfortunately, we have not selected your project to move forward"

Now is when I try to come up with a way to do TableTent with no money. I've been thinking about using ScraperWiki.

newschallenge2

1. What do you propose to do?

We wish to build an aggregator, exploratory tool, and API revealing in-depth information about who is testifying in front of Congress, how often they’re doing it, what they’re saying, and who they work for.

2. How will your project make data more useful?

By...

yanofsky

I submitted this the other day and it has had a great response so far. Head over to the original post and heart it up!

Pyglet Image Upside-down Example Fix

I was doing some video processing in Python today and I couldn't find any good answers on how to save right-side-up images using pyglet. Apparently pyglet saves them upside-down by default. Here's my solution:

import pyglet

#load the video
video = pyglet.media.load("myVideo.wmv")

#get the first frame
frame = video.get_next_video_frame()

#get the image data of the first frame
imageData = frame.get_image_data()

#get the pixels of the frame but invert the pitch
pixels = imageData.get_data(imageData.format,imageData.pitch *-1)

#set inverted pixels to the image data
imageData.set_data(imageData.format,imageData.pitch,pixels)

#save the image
imageData.save("myVideo_capture.png")

#python #code #example #pyglet #video-processing

What are the best undergrad B-Schools in the U.S.? My latest work is an exploration tool of Bloomberg Businessweek's 2012 Undergraduate Business Schools Ranking. Some things you can do with it include...

☞ see the top ranked schools

☞ see the schools rising and falling the most in the ranking

☞ search for a school

☞ find out which students say their workload is too heavy

☞ discover which students spend the most time preparing for class

☞ AND MORE

http://buswk.co/GGmChP

#data-visualization #business schools #college search #college applications #ranking

The final piece in the series of energy interactives I made last week was an exploration of where new wind and solar energy installations were going around the world, how large the installations have been, and what the unit cost of the new equipment was.

It uses data from Bloomberg New Energy Finance from 1990 with their projection through 2030.

The Great Renewable Energy Race

#data-visualization #energy #solar #wind #graphics #charts #interactive #sustainability

Deja Vu? Kind of. Yes, those are the same types of charts, in the same order, on the same type of page, with similar headlines and subject matter, but this is about wind: TOTALLY DIFFERENT.

Wind Innovations Drive Down Costs, Stock Prices
