Ethan Herdrick @herdrick-blog - Tumblr Blog

Very long term trusts will not take over the world

A few years ago Trevor Blackwell wondered why there appears to be a dearth of old family fortunes, given the historical rates of return on investment, and suggested some reasons. Here's another: the law. Paul Collins in Lapham’s Quarterly writes that there have been panics about a similar issue - specifically that trusts (family or otherwise) growing at an exponential rate would take over the world, sink the economy, have all the money, eat all the blue M&Ms, etc. So, for example, perpetual trusts were outlawed in England in 1859, and a century later the IRS claimed in court that a particular long term trust would destroy "the tax base of the nation, if not the world".

None of these trusts had much chance of paying out what their founders calculated or destroying anything. For those of the past ~200 years it looks like unexpectedly high management fees are holding back much of their growth - and why not? With no living founder, who's going to look carefully for inflated fees, i.e. skimming? And this effect would be much stronger for a very long term trust, where after a century or two there would be no one alive who had ever met the founder nor would ever meet any beneficiaries. (The same applies to dynastic fortunes held outside trusts.)

Further, if the trust is meant to eventually pay out to some long-lived institution, like a city or something, then that city could get its hands on the risk adjusted value of it today by getting someone at an investment bank to facilitate a hidden buy out of the windfall. (Hiding that sort of thing is pretty much what Wall Street does, according to Michael Lewis.) So now some investor(s) has, in essence, obfuscated zero-coupon long term bonds, which are steadily growing in value, and no mandate to hold them. Not very different from any other bond. The fact that they would, by the time they mature, be worth $1 trillion or whatever is something that the economy would have had plenty of time to adjust to. No big deal.

Thinking you can force some institution to keep its hands off some money until all its current managers are long dead seems cute in an era of sophisticated finance. And plans that require continuous virtue over generations are probably not going to work. [1]


[1] Betting on accumulated virtue is a much better plan. Which makes me wonder if maybe the reason that universities tend to persist is that there is little to be gained by raiding them or skimming from them, because what they value goes unnoticed by thieves. But the better modern universities do have a lot of what thieves value: portfolio assets. Will they be raided then?

Finding deep insight in 350 year old sayings by de La Rochefoucauld discourages me, as it suggests either that I will not be able to make much progress on those topics, or that too few will listen for progress to result. Am I just relearning what hundreds have already relearned century after century, but were just not able to pass on?

http://www.overcomingbias.com/2009/12/why-read-old-thinkers.html#sthash.cje0nZfK.dpuf

Interest in Spark waning lately? (Updated June 2013)

Spark is far easier to work with than Hadoop, and better for rapid development. Excellent! But the community of Hadoop users is probably two orders of magnitude bigger. That's bad for Spark. But is the Spark community growing fast enough to make that soon irrelevant?

I counted new topics on the Spark users Google Group by time period.

(UPDATE: I wonder if this is a bad measure. Since this isn't merely a forum, but also a mailing list, maybe some people leave when traffic rises.)

Period: new topics
All of 2011: 2
Aug 2012: 1
Sept 2012: 0
Oct 2012: 1
Nov 2012: 5
Dec 2012: 8
Jan 2013: 5
Feb 2013: 101
March 2013: 150
April 2013: 124
May 2013: 107

I was loving that growth until April. It may be that this group didn't become the home of the community until February, meaning that March is the first full month that can be measured. Whether that's true or not, as of April 22, April is on track to see fewer new threads than March. (UPDATE: Indeed, April ended up with 124 new threads. May, 107.)

Amusingly, after counting these up I found that Google Groups have stats pages: https://groups.google.com/forum/?fromgroups#!aboutgroup/spark-users Those are somewhat different measurements but they show the same trend.

New questions on StackOverflow with Hadoop or Hadoop family tags are at ~135 in the past seven days, a monthly rate of 540. So far there are zero Spark or Shark questions on SO.

(Actually it's much worse than that. There are some questions on SO about other Spark projects, especially a new Flash/Flex framework by that name. This illustrates the awfulness of the name 'Spark'. BDAS really blew it in giving their project a name that is a common English word. But that is another post.)

So I am not yet seeing exponential growth in the Spark community. Let's hope that changes.

Charles Dickens, rejecting an invitation from a friend: “‘It is only half an hour’ — ‘It is only an afternoon’ — ‘It is only an evening,’ people say to me over and over again; but they don’t know that it is impossible to command one’s self sometimes to any stipulated and set disposal of five minutes — or that the mere consciousness of an engagement will sometime worry a whole day … Who ever is devoted to an art must be content to deliver himself wholly up to it, and to find his recompense in it. I am grieved if you suspect me of not wanting to see you, but I can’t help it; I must go in my way whether or no.”

The boat engine is worth 33,500 Egyptian slaves

In "Google was worth 1,838,389 workers in 1998, maybe" I proposed measuring the worth of innovations by estimating the equivalent amount of labor 'saved' by using them. But how about the great rapid transportation innovations of the 20th century? Surely those can't be reduced to human power. Much like you can't make a baby in one month with nine women, no number of people can make a vehicle go faster than a running human, right? No. You can do it, and with that insight I'll show you what the human labor equivalent of a 5 horsepower Evinrude boat engine would be in ancient Egypt.

In America, pulling a canal barge with horses or mules on the banks was how Midwest grain and beef got from the Great Lakes (and so the entire upper Midwest) to the Atlantic. But in some sad times and places, labor was so cheap that humans were riverside draft animals [1]. There's even a name for that job in Russian: burlak.

Lots of burlaks meant you could haul lots of freight. But with a rope gear - just two spools of different radii ganged together - on an anchored axle you could "gear up" to convert their slow, strong force to a fast, weak one. A series of these rigs on the banks of a river could, with coordination, pull a long, shallow-draft boat quickly and continuously.

How many burlaks would you need? Let's say you'll need 5 horsepower, sustained. I'm pretty sure I could get a long, slim, light riverboat with a light load to 20 knots with a 5-horse Evinrude. A person can produce, in a short burst, 1.2 horsepower. Friction between the spool and the axle would eat up some of that, but then the boat engine loses a lot of power to turbulence around the prop, too. So let's say we need four burlaks pulling at maximum effort at all times; call it five per rig to be safe.

The ancient Egyptians were fine rope-makers - it looks like they could make 100 meter rope strong enough. [2] Experience suggests that 5 knots (~6 mph or 9 kph) is about the ideal speed for a human to generate maximum power - think of a football or rugby player driving an opponent backward. That's 1/4 the speed we want our boat to go, so the right gearing would have our burlaks surge forward only 25 meters while the boat they are pulling covers 100 meters. Our rig will need 125 meters of rope - call it 150 meters just in case. We'll need one of these rigs (a large double spool anchored into the ground, 150 meters of rope, and 5 burlaks) every 100 meters of the trip - about 6700 of them over the 667 kilometers between Luxor and Alexandria. That's 33,500 burlaks. [3]

So if we waive the manufacturing costs of the outboard engine and the rope-spool-axle system (and the work needed to supply gasoline for the motor and food for the burlaks), the 5 horsepower outboard engine [4], when plopped down into ancient Egypt, does the work of about 33,500 slaves. [5]
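The arithmetic above is easy to check with a few lines of Python. This is only a sketch: the figures are the ones quoted in the post, and the variable names are made up for illustration.

```python
# Figures quoted in the post; names are illustrative only.
ENGINE_HP = 5.0              # output of the outboard engine
BURST_HP_PER_PERSON = 1.2    # short-burst output of one person
BOAT_SPEED_KNOTS = 20
PULL_SPEED_KNOTS = 5
RIG_SPACING_M = 100
ROUTE_KM = 667               # Luxor to Alexandria
BURLAKS_PER_RIG = 5

# How many people pulling flat out match the engine, ignoring losses
pullers_needed = ENGINE_HP / BURST_HP_PER_PERSON

# Gearing: the boat moves 4x as far as the burlaks walk
gear_ratio = BOAT_SPEED_KNOTS / PULL_SPEED_KNOTS
pull_distance_m = RIG_SPACING_M / gear_ratio
rope_needed_m = RIG_SPACING_M + pull_distance_m

# One rig every 100 meters of the route
rigs_exact = ROUTE_KM * 1000 // RIG_SPACING_M
rigs_rounded = round(rigs_exact, -2)      # the post rounds to "about 6700"

burlaks_exact = rigs_exact * BURLAKS_PER_RIG
burlaks_rounded = rigs_rounded * BURLAKS_PER_RIG

print(f"pullers needed: {pullers_needed:.2f}")
print(f"rope per rig: {rope_needed_m:.0f} m")
print(f"rigs: {rigs_exact} (rounded {rigs_rounded})")
print(f"burlaks: {burlaks_exact} (rounded {burlaks_rounded})")
```

Without the rounding, 6,670 rigs gives 33,350 burlaks. The 33,500 figure comes from rounding the rig count up to 6,700 first.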

[1] On the Nile in fact.

[2] The spool size needed seems reasonable for the ancients, too: 19 mm for rope thickness, and 15 inches, 12 inches, and 20 inches for the traverse, barrel diameter, and flange diameter, respectively, gives you a capacity of 137 meters of rope. Close enough.

[3] This setup is good for more than just the occasional trip by the Pharaoh. The burlaks should be able to go all out at least four times an hour, leaving some time to pull the rope back off the spool and lay it out in place, throughout a burlak's 15 hour day. That's 60 trips / day on the trans-Nile high speed boating system. You could use it to go in either direction, although some boats would have to wait in places.

[4] Of course the outboard is worth more as you could use it to cross the Nile, not just go down- or up-river. So we haven't completely found the human labor equivalent of the outboard.

[5] I put the painting of burlaks at the top because it's famous and relevant. But I included photos at the bottom because they are just so tragic. God, human draft animals - the ultimate result of cheap labor. Is there any surer sign that your society has gone wrong?

Google was worth 1,838,389 workers in 1998, maybe

What is an innovation worth? I'm not asking how much money it makes, because that's just part of it. To take an extreme example, if you give an invention away to the public it can still provide value to people, it's just that you're not getting any of it, or no more than anyone else. But the cash value of the uncaptured part is notoriously hard to quantify. How about a different approach?

Human labor has always been a fundamental good. And lots of new technologies have been called "labor saving devices". If we can figure out a way to calculate an innovation's equivalent in human work, we'd have a measure that works across history, even prehistory. Plus we wouldn't care if the invention was 'monetizable', e.g. whether it appeared in a period with a legal system defending private property and maybe patents. [1] Maybe best of all we don't have to worry about the value of various currencies over time, and in fact can value innovations that predate money itself.

The value of some new ideas seems well captured by measuring how much human work they replace. Manufacturing and hanging drywall needs much less effort than lath and plaster. Dynamite and bulldozers remove rock with much less effort than picks and shovels. [2] But what labor is saved by the jet engine? All the laborers in the world couldn't get you from San Francisco to New York in six hours. [3]

Google's search engine seems to be like that - not a labor saver but something that does what was before impossible. Is it though? Could you measure the value of Google by how much labor it would take to replace it? I think you can. What if, instead of Google's new software, you just had people? Could you build such a system that could rival, if not today's Google, then the first Google search engine, from 1998? Google was searching over only 26 million pages at the time. Couldn't you fulfill a query over those pages given enough 'librarians'? If we can value even Google this way, then maybe we've got a useful scale for innovation.

How about if you divided up the web among your librarians? Before reporting for duty, each would read each page in his or her bailiwick and remember, more or less, what they say. It's not so unrealistic if you assign people to pages pertaining to things they already know something about. Of course most pages weren't really about anything, then perhaps more than now. The blog hadn't been formally invented [4] but 'home pages' seemed to make up a majority of the web and few of them were about anything other than the author and his interests. Here's a surviving example that exemplifies the species: http://jerrypournelle.com/ Notice the multiple sections, "Books and Movie reviews", "What's new", "Reader email", etc. Remembering what was mentioned in one of those pages wouldn't be easy.

We can make it easier. Let's give each librarian the software from one of the existing, crummy pre-Google search engines (or maybe just grep) and set it up to search only their 100 pages. That will give the librarian a good quick start, jog his or her memory, and help a lot with the kind of things that unsophisticated software is good at, like finding exact matches of sentence fragments.

If we assign 100 web pages to each person and their search engine, we'd have the 26 million pages covered by 260,000 librarians. But what if you search for something common, like "bill clinton", and most of those 260,000 librarians have results? How to pick among them? This is really what the search engine that Google launched in 1998 did that was so great. Its results were ordered in a way that seemed like magic. You searched for "that sherlock holmes story with the snake" and, sure enough, the first result was The Adventure of the Speckled Band. To replicate that we're going to need more people to sort out the work of those first 260k people. We need editors.

Let's start with a layer of editors above the librarians. We assigned 100 pages to each librarian, so why not 100 librarians per editor? That'd be 2600 editors. When a dump of those "bill clinton" results comes in to an editor, he or she picks the 10 best, in order, and passes them on, declaring them the best 10 results from the 100 librarians he edits. Each of those is assigned 100 web pages, so the editor's top 10 results are the best from the 10,000 pages his librarians cover. Now we've got 10 results from each of our 2600 editors. We need to whittle these down to 10 results to show to the user, who is still staring at the screen, waiting. You can see that all we need are more layers of editors. Log base 100 of 260,000 is about 2.7, so a total of three layers of these editors is enough. That'll give us 260,000 librarians, 2,600 first level editors, 26 second level ones, and 1 chief editor: 262,627 workers. If it takes a minute for each layer to do its work, which seems reasonable, then a user gets a result back in three minutes, and the system can handle one query per minute. [5] That's not much. Luckily this system is easily parallelized. To get another query per minute we simply add another 262,627 workers searching over the same 26 million pages. Apparently Google was doing 10,000 searches per day in 1998. That's about 7 per minute. [6] To handle that at a steady state, we'll need 262,627 * 7 = 1,838,389 workers. [7]
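To make the head count concrete, here is a small Python sketch of the tree described above. It is a rough illustration rather than a real design: the function names are invented, results are assumed to be (score, page) tuples, and the queries-per-minute figure is rounded up with a ceiling where the post simply rounds 6.94 to 7.

```python
import heapq
import math

PAGES = 26_000_000
PAGES_PER_LIBRARIAN = 100
FAN_IN = 100               # librarians (or editors) reporting to each editor
SEARCHES_PER_DAY = 10_000


def staff_per_layer(pages=PAGES):
    """Head count of each layer: librarians first, chief editor last."""
    layers = [math.ceil(pages / PAGES_PER_LIBRARIAN)]
    while layers[-1] > 1:
        layers.append(math.ceil(layers[-1] / FAN_IN))
    return layers


def edit(reports, k=10):
    """One editor's job: pool the scored results of every report
    and pass up only the k best, best first."""
    pooled = (hit for report in reports for hit in report)
    return heapq.nlargest(k, pooled)


layers = staff_per_layer()
tree_size = sum(layers)

# One full tree answers one query per minute, so we need one tree
# per query-per-minute of load.
queries_per_minute = SEARCHES_PER_DAY / (24 * 60)
trees_needed = math.ceil(queries_per_minute)
total_workers = tree_size * trees_needed

print(layers, tree_size, trees_needed, total_workers)
```

The same `edit` step is applied at every level, which is why the tree only needs as many editor layers as it takes to fan 260,000 librarians down to one chief editor.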

There you go. On the day Google launched they were providing, free of charge and with less than 1/120th the latency, what you'd need 1,838,389 smart workers to do the day before.

Does this technique work as a scale of innovation? Well, it’s got the nice advantages I mention above. But it can only give you an upper limit on the value of the innovation, since if it paid to do it the labor intensive way, that would have been happening. [8] It needs improvement. What do you think?


[1] Or whether it appealed to the richer segments of society. Of course this last item is controversial. On one hand perhaps serving the needs of people who are themselves effectively creating (and capturing) value is morally better than otherwise. But that == "it's morally better to benefit the wealthy than the poor" which surely isn't true. It's not surprising that I've waded into this swamp since I'm more or less writing about the labor theory of value, a staple of Marxism.

[2] Of course you need to amortize the work to build the bulldozers and dynamite.

[3] You might be able to do better than you'd think though. The way to see what could be done with enough manpower is to imagine yourself a Pharaoh. Better yet, the Pharaoh's head engineer with unlimited cooperative laborers. Now how fast can you move the Pharaoh from Luxor to Alexandria? I explored that here: The boat engine is worth 33,500 Egyptian slaves.

[4] Although there were proto-bloggers already.

[5] I'm describing the worst-case scenario. Often an editor will have less to do for some searches, as when his reports give him fewer than 100 results. You could take advantage of this and drop the lockstep architecture. But no closed form solution to calculate how much more productive you could make the tree of workers comes to mind. You'd probably do best with a Monte Carlo simulation. This optimization would be an interesting problem.

[6] Actually a lot more than that at peak periods and fewer late at night. But for simplicity we'll stick with this.

[7] There are complications. For one thing, if the user asks for the second page of results, everyone has to do the same thing except each editor must pass up the 20 best results, since there's no way for such an editor to know which, if any, of those could ultimately be in the overall top 20. All his fellow editors at his level do the same and now the editor above him has twice as much work to do. Further clicking deeper into the search results makes it worse (only linearly, though). But most people don't do that and anyway this is supposed to be a first version.

[8] In this case we don't really have an upper limit either, since the army of librarians and editors are so much slower than Google, and speed is so important in a search engine.

Facebook checkin to become the new price for free wifi

Just showing up at Coupa Cafe and connecting to their wifi now automatically does a Facebook checkin there. Good idea for the local business. And when this spreads to embarrassing locations it'll make your Facebook feed a lot more interesting, so it's good for Facebook too!

Seriously, I think it only checks you in if you've already accepted their TOS once. So the loss of privacy is probably in the 'one more step' sweet spot. I predict ubiquity.

My newest hack (with Dave Brushinski)

http://endrank.com/crunchbase

"The Crunchranked. The 217,000 most important companies, financial firms, and people in the startup world, according to an impartial algorithm."

Comments: http://news.ycombinator.com/item?id=3805555

Display git branch and 'dirty' status in fish shell prompt

UPDATE: due to this bug, which exists in the version of fish I got from macports, I decided against fish. It's a bad sign when you find a bug that bad in your first few minutes of using something.

Dissatisfaction with the bash shell, and the feature list and humorous promotion of the new fish shell fork ("Finally, a command line shell for the 90s... You'll have an astonishing 256 colors available for use!") spurred me to try fish out for a few days. It'll have to be a lot better than bash to make it worthwhile to leave the large bash community and its google-able answers. We'll see.

The customization in my bash .profile I can't live without [1] is showing the current git branch in the command prompt. Googling this feature for fish got me most of the way there, with the code found here: https://wiki.archlinux.org/index.php/Fish#Configuration_Suggestions . But it didn't quite work. Below is what I got to work. As it also shows if you have staged or unstaged changes I like it better than what I had in bash.

set fish_git_dirty_color red

function parse_git_dirty
    # git diff --quiet exits with status 1 when the working tree differs from HEAD
    git diff --quiet HEAD 2> /dev/null
    if test $status = 1
        echo (set_color $fish_git_dirty_color)"Δ"(set_color normal)
    end
end

function parse_git_branch
    # git branch outputs lines, the current branch is prefixed with a *
    set -l branch (git branch 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \(.*\)/\1/')
    echo $branch (parse_git_dirty)
end

function fish_prompt
    # Show the branch only when the current directory is inside a git repository
    if test -z (git branch --quiet 2>| awk '/fatal:/ {print "no git"}')
        printf '%s@%s %s%s%s (%s) $ ' (whoami) (hostname|cut -d . -f 1) (set_color $fish_color_cwd) (prompt_pwd) (set_color normal) (parse_git_branch)
    else
        printf '%s@%s %s%s%s $ ' (whoami) (hostname|cut -d . -f 1) (set_color $fish_color_cwd) (prompt_pwd) (set_color normal)
    end
end

[1] Actually the best customization is changing the maximum size and count of the .bash_history file to be really big so that I can keep a lifetime of shell work.

Professor Sebastian Thrun quits Stanford to teach people

The text of his homepage today:

One of the most amazing things I've ever done in my life is to teach a class to 160,000 students. In the Fall of 2011, Peter Norvig and I decided to offer our class "Introduction to Artificial Intelligence" to the world online, free of charge. We spent endless nights recording ourselves on video, and interacting with tens of thousands of students. Volunteer students translated some of our classes into over 40 languages; and in the end we graduated over 23,000 students from 190 countries. In fact, Peter and I taught more students AI, than all AI professors in the world combined. This one class had more educational impact than my entire career. Just watch this video.

Now that I saw the true power of education, there is no turning back. It's like a drug. I won't be able to teach 200 students again, in a conventional classroom setting. I've just peeked through a window into an entire new world, and I am determined to get there.

(and yes, I gave up my tenured position at Stanford)

I could not be more impressed with Sebastian. From a story about this:

"... the physical class at Stanford... dwindled from 200 students to 30 students because the online course was more intimate and better at teaching..."

#udacity #aiclass #stanford #onlinelearning #future #heroic

Kill hashtables, get shorter code

Driving late at night two months ago I wondered, why do we return hashtables from functions? Isn't a hashtable like a function? Instead of calling some function to get a hashtable, then looking up values on that hashtable with keys, why not simply call the function directly each time I want a value? Just make the 'key' the last argument of the new function. So I tried it on the code from my last post, where I explained hierarchical document clustering and showed some Clojure that does it. In this post I'll show how I eliminated hashtables in that code and got a shorter codebase, comparing the already pretty short original to the new code.

The best way to show the change will be to walk you through what happens in a call to euclidean, which calculates the euclidean distance of two documents or an arbitrarily deep tree of documents, based on their word frequencies.

In the original, ordinary version, I get a hashtable of the word frequencies of a tree of documents with a call to freq, which recursively combines the frequencies (hashtables) of the branches of the tree, ultimately calling freq-files on each leaf, which is a document, of the tree. freq-files returns a hashtable of frequencies for that document. We will pass the hashtable we get back from freq into euclidean.

In contrast, in this new version, we just call euclidean before getting any frequencies. euclidean calls freq once for each word in the corpus. Each call to freq calculates the frequency for just that one word, again by recursively combining the frequency of that word of each of the branches, and ultimately calling freq-files on each document in the tree, which calculates just that word's frequency for that document.

This is a little weird. Do you see what has happened here? That data now is no longer represented with a hashtable - instead it's not represented at all.

Doing all those function calls sounds grossly inefficient but it shouldn't. For one thing, who cares? It's just performance. We can always profile and deal with problems like that later. But second, it isn't really any less efficient than getting the whole document worth of frequencies all at once, because now I'm memoizing - caching function return values. I've wrapped all of the functions in that chain of calls down to and including freq-files (and beyond) in memoize [1]. For example, it looks like I'm reading files from disk every time I call freq-files. But memoizing to-words prevents that, so following a first call, every time freq-files is called with the same file tree (but probably a different word) the memoized to-words returns a cached word list. frequencies-m (the memoized version of frequencies) is in turn called with that word list as its argument, a call which would be somewhat computationally costly but since it has a cached value associated with that argument it simply returns that. This frequency needs to be relative, so we've got to divide by the word count of the doc. I've got count wrapped in memoize too, calling it count-m, which I call on the results of calling to-words again, which again returns a cached value.
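Here is a minimal Python sketch of the same idea, not the Clojure from the post. It assumes a tiny in-memory stand-in for the files (DOCS), uses functools.lru_cache in place of Clojure's memoize, and counts one word directly instead of going through a memoized frequency table. A tree is either a file name or a pair of subtrees, and a pair's frequency is taken as the average of its two branches.

```python
from functools import lru_cache
import math

# Hypothetical stand-in for files on disk, so the sketch runs anywhere.
DOCS = {
    "a.txt": "the cat sat on the mat",
    "b.txt": "the dog sat on the log",
}
reads = []  # records every simulated disk read


@lru_cache(maxsize=None)
def to_words(path):
    """Read a document and split it into words (cached after the first call)."""
    reads.append(path)
    return tuple(DOCS[path].split())


@lru_cache(maxsize=None)
def freq(tree, word):
    """Relative frequency of one word in a document or a tree of documents."""
    if isinstance(tree, str):          # a leaf: one document
        words = to_words(tree)
        return words.count(word) / len(words)
    left, right = tree                 # a pairing: average the two branches
    return (freq(left, word) + freq(right, word)) / 2


def word_list(paths):
    """Every word that appears anywhere in the corpus, sorted."""
    return sorted({w for p in paths for w in to_words(p)})


def euclidean(tree1, tree2, words):
    """Distance between two trees, asking for one frequency at a time."""
    return math.sqrt(sum((freq(tree1, w) - freq(tree2, w)) ** 2 for w in words))


words = word_list(DOCS)
distance = euclidean("a.txt", "b.txt", words)
pair = ("a.txt", "b.txt")
print(distance, freq(pair, "cat"), reads)
```

No frequency table is ever built or passed around, yet each file is "read" only once, because the caches hold the data the hashtables used to hold.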

The really cool thing is that that data is still there. It's just that you don't see it or manipulate it in the code anymore. It's implicit.

This change chops the code down by about 20%, from 55 non-comment lines of code to 44 (or 39, not counting the boilerplate function memoization, which would be a 30% reduction) [2].

This style of programming has downsides. For me, wrapping functions in memoize means my editor / IDE can't tell what arguments my functions take any more (a common problem when you have first class functions). That sort of sucks. Further, it makes some functions, like euclidean, hopelessly non-general, since euclidean now has to know what function to call to get a frequency [3]. It makes it harder to track down bugs. When you get all the values you want from each function in a chain of functions, always passing the entire hashtable of them up the chain, it's easier to figure out where a problem has come up than if you ask about a single value all the way down the call chain. But the main disadvantage to this technique is that it doesn't match the way I code. I like to program very interactively. For example, the first thing I did when I started coding this stuff up was to slurp up the documents in question and call frequencies on them. With the resulting hashtables in hand I wrote code to find the distance between two document "vectors". With these distances I wrote the code to find the shortest ones, and so forth. Calling the entire function chain to get each value is only possible after you've written that chain of functions. While you're building them it's easier to pass all the data from each step to the next step. But I haven't really tried coding in this style from scratch yet, so who knows? Maybe it's got some other hidden advantage I don't know about.

But it's hard to argue with brevity. And by the way, there's nothing inherently lisp-ish about this technique. You can do it in any language with first class functions, like Python or Ruby, just as easily.

Comment on this post at News.YC.

In the Bay Area and need help with your machine learning project? Contact me at [email protected] or twitter.com/herdrick

[1] And we've been doing some of that all along anyway, since we often call freq with the same arguments, over multiple calls to euclidean.

[2] There's actually another related change here that isn't so interesting, but definitely helped. In the original version, the word-list - the list of all words that appear somewhere in the documents - is calculated immediately upon calling cluster, and passed into each subsequent recursive call to cluster. From there it's passed into best-pairing, which passes it into euclidean, the only place where it is actually used. I did this because getting this list requires some file reading and computation and it seems inefficient to recalculate it every time I call euclidean, which is a lot. But that was the wrong way to think about this problem. When I went ahead and did the inefficient thing, the right fix became clear: memoization, again. best-pairing's argument is a list of file-trees in which every file in our corpus is represented, so we can calculate our list of words from that - so I did. Now every call to euclidean makes a call to word-list. But this isn't such a problem because word-list and most of the functions it calls are memoized. So the only additional costs of calling that function are several memo lookups - negligible - and flattening and sorting a file tree. Which isn't nothing but I don't think it's a big deal. I should profile it to find out.

[3] Actually I could make euclidean more general by just sticking with the original version. Because Clojure hashtables can be called as functions, I can just call euclidean like this: (euclidean (partial freq pof1) (partial freq pof2) (word-list pofs)), using closures created with partial function application as the frequency arguments. Seemed a little harder to explain though, so I skipped it.
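A rough Python analogue of that more general euclidean, for illustration only. The TABLES data and this freq are hypothetical stand-ins. Unlike Clojure maps, Python dicts are not callable, so the hashtable style needs a small wrapper around the lookup.

```python
from functools import partial
import math


def euclidean(freq1, freq2, words):
    """freq1 and freq2 are any callables that map a word to a frequency."""
    return math.sqrt(sum((freq1(w) - freq2(w)) ** 2 for w in words))


# Hypothetical frequency data standing in for real documents.
TABLES = {
    "doc1": {"the": 0.5, "cat": 0.5},
    "doc2": {"the": 0.5, "dog": 0.5},
}


def freq(doc, word):
    """Function style: look up one word's frequency for one document."""
    return TABLES[doc].get(word, 0.0)


words = ["cat", "dog", "the"]

# Function style: closures created with partial application.
d_partial = euclidean(partial(freq, "doc1"), partial(freq, "doc2"), words)

# Hashtable style: wrap each dict lookup so it can be called like a function.
table1, table2 = TABLES["doc1"], TABLES["doc2"]
d_tables = euclidean(lambda w: table1.get(w, 0.0),
                     lambda w: table2.get(w, 0.0),
                     words)

print(d_partial, d_tables)
```

Either way euclidean no longer needs to know where the frequencies come from.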

#refactoring #clojure

Cluster (with Clojure)

I was watching a video of Berkeley professor Michael Jordan lecturing on the Chinese Restaurant process and for a moment he showed a slide of a tree of documents that were matched up by word frequencies. It seemed cool so I coded up my own version of it, mostly to learn about the topic and to get some practice with Clojure. It turned out there's a name for this: hierarchical clustering. I went with the 'agglomerative' version of it, which is repeatedly pairing up things and pairs of those things until you have a single pairing that, beneath it, contains everything you started with. Usually you choose pairings based on similarity - you pair up the two available things that are most similar.

To make documents into something you can easily compare, I'm converting each into a hashtable of the relative frequencies of the words it contains, like this: {"purple" 0.0015, "it" 0.0023, "this" 0.0083 ...} etc. The code finds the two most similar docs based on those frequencies, and matches them up, making a new hashtable like that by averaging the frequencies of those two docs [1]. This "pairing" is now on the same footing with all the other documents. So we again find the two most similar docs (allowing this new pairing to be treated as a doc), repeating that process until we're left with only a single pairing. It contains every pairing we made, and ultimately every document we started with. This is our finished product, a hierarchical cluster.
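The procedure can be sketched in a few lines of Python. This is not the Clojure code on GitHub, just a small illustration under simple assumptions: words are split on whitespace, distance is plain Euclidean distance over relative frequencies, and the result is a nested tuple of document names. It expects at least one document.

```python
from collections import Counter
from itertools import combinations
import math


def relative_freqs(text):
    """Map each word to its share of the document's words."""
    words = text.lower().split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}


def euclidean(f1, f2):
    """Distance between two frequency tables; missing words count as 0."""
    vocab = f1.keys() | f2.keys()
    return math.sqrt(sum((f1.get(w, 0.0) - f2.get(w, 0.0)) ** 2 for w in vocab))


def average(f1, f2):
    """Frequency table for a new pairing: the mean of its two members."""
    vocab = f1.keys() | f2.keys()
    return {w: (f1.get(w, 0.0) + f2.get(w, 0.0)) / 2 for w in vocab}


def cluster(docs):
    """docs maps a name to its text; returns a nested tuple of names."""
    items = [(name, relative_freqs(text)) for name, text in docs.items()]
    while len(items) > 1:
        # Find the two most similar items (documents or earlier pairings)
        i, j = min(combinations(range(len(items)), 2),
                   key=lambda ij: euclidean(items[ij[0]][1], items[ij[1]][1]))
        (tree_i, f_i), (tree_j, f_j) = items[i], items[j]
        merged = ((tree_i, tree_j), average(f_i, f_j))
        # The new pairing replaces its two members
        items = [it for k, it in enumerate(items) if k not in (i, j)] + [merged]
    return items[0][0]


docs = {
    "en1": "the cat sat on the mat",
    "en2": "the dog sat on the log",
    "es1": "el gato se sienta en la alfombra",
}
tree = cluster(docs)
print(tree)
```

The two English sentences pair up first, and the Spanish one joins only at the last step.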

You can see the code on my GitHub.

I turned it loose on some text files and got the following (plotted with the Protovis Javascript toolkit) [3]

The blue nodes are the documents and the green nodes are pairings. From top to bottom, the docs are:

- the German Wikipedia article on the lambda calculus
- the first several hundred words of a German novel, Wilhelm Meister's Apprenticeship by Johann Wolfgang von Goethe
- an excerpt from Shakespeare's King Lear
- the scifi short story, "They're Made out of Meat"
- the English Wikipedia article on the lambda calculus
- the English Wikipedia article on "The Buzzer" or UVB-76
- the Sherlock Holmes story, "The Red Headed League"
- the Sherlock Holmes story, "A Scandal in Bohemia"
- the scifi short story, "The Long Watch" by Robert Heinlein
- Shakespeare's first and second sonnets
- Shakespeare's third and fourth sonnets
- the Spanish poems, "Candor" and "Reto" from Julio Flórez
- the Spanish Wikipedia article on the lambda calculus
- the Spanish Wikipedia article on functional programming
- the Dutch Wikipedia article on the lambda calculus

Notice that each pairing shows three words. Those are the most 'interesting words': those whose frequencies do the most to make that pairing stand out from the average. For example, if you've got a document that uses the word "mooloolaba" a few times, that's probably going to be one of its interesting words because it's so rare elsewhere. But a word could also be interesting for not showing up, e.g. if the word "the" never shows up in a document, or shows up only a few times in a long text. In that case the word is in parens.
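One plausible way to pick those words, sketched in Python: rank every word by how far its frequency in the pairing sits from the average frequency over all documents, and wrap the under-represented ones in parens. The function name, the use of a plain absolute difference, and the sample numbers are all assumptions for illustration; the real code may score words differently.

```python
def interesting_words(freqs, corpus_average, n=3):
    """Return the n words whose frequency deviates most from the average.

    Words used less than average come back wrapped in parentheses.
    """
    vocab = set(freqs) | set(corpus_average)

    def deviation(word):
        return freqs.get(word, 0.0) - corpus_average.get(word, 0.0)

    # Largest absolute deviation first; ties broken alphabetically.
    top = sorted(vocab, key=lambda w: (-abs(deviation(w)), w))[:n]
    return [w if deviation(w) > 0 else f"({w})" for w in top]


# Made-up numbers: a doc that never says "the" but mentions a rare word.
corpus_average = {"the": 0.06, "of": 0.03, "it": 0.02, "mooloolaba": 0.0001}
doc = {"mooloolaba": 0.02, "of": 0.03, "it": 0.021}
labels = interesting_words(doc, corpus_average)
```

With these numbers the missing "the" ranks first and is shown in parens, followed by "mooloolaba".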

It seems to have done an OK job here. It strongly leans toward matching up docs (and pairings) in the same language when possible. I was hoping that it would be able to pull out the two science fiction stories, but that isn't happening. It's not smart enough for that. I'm pleased that it grouped the Spanish functional programming and lambda calculus articles before including the Spanish poetry. It put Shakespeare's sonnets together, but failed to associate them very closely with the excerpt from King Lear. [4]

I was also hoping the Dutch and German articles would cluster together before being joined with the Spanish or English docs, but this didn't happen. It might be that "de" is a common word in Dutch and Spanish, whereas "is" and "in" are common in Dutch and English. So the classifier might see Dutch documents as partway between English and Spanish ones (even though the opposite is closer to the truth).

Interesting. I think this could be improved by using a statistical distance to measure the distance between vectors, instead of unscaled distance of relative frequencies as I'm doing now.

In playing with this code I stumbled on an interesting way to refactor it. I'll talk about that in the next post.

[1] This is called the vector space model: each word is declared to be a dimension, and each document a point, or vector, in that word-space. Identifying docs by the words they contain and not worrying about word order is in general called using a "bag of words". I'm comparing those frequencies using Euclidean distance. It's often said to be better to use the cosine of the angle between the two vectors. The two measures give the same ranking when every vector has Euclidean length 1. Here the dimensions of any document vector sum to 1 instead (is there a name for such a vector?), so the two rankings are close but not guaranteed to agree. Since I'm only looking for a ranking of distance, I've stayed with Euclidean.
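A quick way to check how Euclidean distance and cosine similarity relate for vectors whose components sum to 1 is to try both on a small example. The three vectors below are made up for the purpose. In this particular case the two measures pick different nearest neighbours, which suggests the equivalence holds exactly only for vectors of unit Euclidean length.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical frequency vectors over a four-word vocabulary; each sums to 1.
query = (0.5, 0.5, 0.0, 0.0)
doc_b = (1.0, 0.0, 0.0, 0.0)
doc_c = (0.2, 0.2, 0.3, 0.3)

# Smaller distance means closer; larger cosine means closer.
closest_by_euclidean = min([doc_b, doc_c], key=lambda d: euclidean(query, d))
closest_by_cosine = max([doc_b, doc_c], key=lambda d: cosine_similarity(query, d))
```

Euclidean distance prefers doc_c (0.6 versus about 0.71), while cosine similarity prefers doc_b (about 0.71 versus about 0.55).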

[2] I omitted the thirty lines of code to translate the s-expression result to the JSON needed for Protovis.

[3] I found in trial runs over all of Shakespeare's sonnets that it did a pretty good job of sorting out the earlier sonnets from the later ones.

[4] The regular expression I used for extracting words is poor for non-English languages, but the algorithm can probably handle it anyway, as the fragments it creates will be unique to the words they came from.

In the Bay Area and need help with your machine learning project? Contact me at twitter.com/herdrick

#machine learning #clojure #hierarchical clustering #document clustering #agglomerative hierarchical clustering

Some speculation on human knowledge

(What I write here includes a lot of confident-sounding guesses. Consider this more like a late night wine-soaked conversation than an exposition about things that I know. This was meant to be an exploration. It turned out to raise more questions than it answered.)

Last night as I was trying to fall asleep I was, for some reason, thinking about mathematics and innovation, and it occurred to me that the point humanity passed a few decades ago, where a single mind [1] can no longer contain all our mathematics, is a major and universal milestone for a civilization. What does it mean? Will the balkanization of math [2] slow progress? Are we approaching Peak Math? [3]

There is something else related to this: the more study required to reach the frontiers of a field, the fewer restless, turbulent personalities will be in that field, because they won't have the patience to quietly learn at the feet of their predecessors for the many years necessary to reach the frontier. And without that sort of person, you won't get the sort of revolutionaries that shake the planet with their work. Thus as the frontier gets further from the starting point of pure ignorance, you should see more incremental advances and fewer revolutions. Do we see that in mathematics? If the revolutionaries are driven out of vast fields, where do they go? I have my suspicions. Is it useful to ask how they are driven out? Do they just find the culture and its practitioners not simpatico?

The distance to the frontier of a field should be related to how much of the field can be contained in a single brain. (It should increase as the square root of the amount of knowledge in the field, if you assume the field grows in all directions.) You've got to have all the preceding stuff that is on the route to your specialty in mind before you can advance the field, and by the time you get there you have less space for the new stuff. This implies that in a big field a bigger chunk of your brain will be filled up with the common stuff shared between subfields than if you were in a smaller field.
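The square root falls out if you picture the field as spreading evenly over a flat plane, which is an assumption of this sketch rather than anything established. With K the amount of knowledge, r the distance from the starting point to the frontier, and ρ a constant density of knowledge per unit area (all three symbols introduced here for illustration):

```latex
% Assumption: the field's knowledge K is spread at uniform density \rho
% over a flat disc of radius r centred on the starting point of ignorance.
\[
  K = \rho \pi r^{2}
  \quad\Longrightarrow\quad
  r = \sqrt{\frac{K}{\rho \pi}} \;\propto\; \sqrt{K}
\]
% In d dimensions the same argument gives r \propto K^{1/d}.
```

If the field grew in more than two directions the frontier would recede even more slowly relative to the total amount known.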

So again, how many brains do you need to contain mathematics? Or physics? How many other fields have exceeded the capacity of a single brain? Economics? Certainly there are fields of Econ that have no contact with other fields - game theory has little in common with econometrics - but is brain-filled-ness the reason for that? Geology - certainly soft rock geologists and hard rock geologists don't fully understand what each other are currently up to but is that because they can't or because they don't want to? [4] How about Computer Science - surely there is more than one brainfull there. I'm pretty sure that most of what's going on in machine learning is not understood at all by the people working in theory, and vice versa. Likewise the PL people can't be current on algorithms. I just don't know enough about these fields to know how many minds you need to encompass the field, just that it's got to be more than one.

What fields aren't like that? New ones. [5] From what I understand, neuroscience is new enough that a neuroscientist can just be a neuroscientist.

So is this the difference between a field and a subfield? Two different fields can share some stuff, like math and physics (or math and anything) but from the beginning they are diverging - there are lots of things that a first year physics undergrad and a first year math undergrad learn together, but already there are things that they don't. Same is true for geologists and physicists - from the first year they are diverging. (Hmm... yet they all have the same pre-college experience.) But budding topologists and analysts are studying exactly the same things for years at college. So are the capabilities of the 18 year old college student the thing that makes electrical engineering and computer science different fields? Maybe. Honestly, academic politics probably has much to do with it. Hmmm... probably I'm seeing this thru my lens shaped by the current divisions in university departments. I mean, what the 18 year olds are studying isn't necessarily what they *should* be studying.

What I'd love to see is some way to measure the size of fields in units of brains. How? I was thinking that maybe a field branches into subfields every time it gets too big to be contained in a single head. Does this happen? Well, if you assume that the amount of knowledge in a leaf on the tree of fields is constant, the increase in fields and subfields should be proportionate to the increase in human knowledge.

How much have fields increased? It seems that there were no departments at universities 800 years ago. This page shows around 1500 of them today, but that list is under-branched. This shows many subfields not mentioned on the first page. Excluding repeats it makes at most a 5-fold increase. If it is underspecified and there are (or should be) similar pages for other topics we might get another order of magnitude. So four orders of magnitude increase in human knowledge in 800 years. For every one thing known 800 years ago do we now know only 10,000 things? No way - that seems too low. Math has done far more than that, even excluding its offshoots (Computer Science, etc). On the other hand, theology was a major area of study at the time and we have probably not even doubled our knowledge of that since then. For example I think most Protestants would say that theological knowledge has been completely static for 2000 years. [6]
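The back-of-the-envelope estimate can be redone explicitly. This Python sketch multiplies out the factors quoted above; treating study 800 years ago as a single undivided field is an assumption of the sketch, and the variable names are made up for the example.

```python
import math

fields_then = 1             # assumption: one undivided body of study 800 years ago
fields_listed = 1500        # fields on the first list
subfield_factor = 5         # "at most a 5-fold increase" from the second list
missing_pages_factor = 10   # "another order of magnitude" if the lists are incomplete

leaves_now = fields_listed * subfield_factor * missing_pages_factor
growth = leaves_now / fields_then
orders_of_magnitude = math.log10(growth)
```

Taken at face value the factors give 75,000 leaves, between four and five orders of magnitude, so "four orders of magnitude" is a round figure on the low side.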

Here's something interesting: if your field is smaller than one brainfull, I don't think you can tell how much smaller. It seems to have come as a surprise to everyone that the arts and sciences combined had gotten too big, and it was only through complaints by some mathematicians that people realized it was true in their field, too.

Related: The Last Days of the Polymath.

FOOTNOTES:

[1] I should clarify that the amount of knowledge that you can fit into one brain is going to be very hard to differentiate from the amount you can learn before you start going senile. Perhaps, in fact, that is the limit. Maybe the brain has no limit of knowledge, but instead is limited by rate of intake. Then that rate times the lifespan of a person equals the size of a brain.

[2] Just how balkanized is math? Do mathematicians in fact know, in absolute terms, less about other subfields than they used to? Of course they know a smaller percentage of the whole field than they used to, but do they also just know less? Would a topologist of 30 years ago know things about analysis that a topologist today wouldn't? Probably if someone wanted to simply know as much as possible about math and did not intend to contribute, he or she could know much more than the typical mathematician. I.e. it might be just that no one ambitious wants to 'waste' any time keeping up with the rest of their field when their subfield is so demanding. But that's OK - couldn't I just change my definition of "too big for a brain to encompass" to "too big for an ambitious brain to bother learning all of"?

[3] I don't think this would hurt us for some time. Our current known reserves of unapplied math should last centuries.

[4] Here's what Wikipedia says are the 'subdisciplines' of geology: Economic geology, Mining geology, Petroleum geology, Engineering geology, Environmental geology, Geochemistry, Geological modeling, Geomorphology, Historical geology, Hydrogeology, Mineralogy, Paleontology, Petrology, Sedimentology, Stratigraphy, and Structural geology. Certainly no ambitious geologist is atop the current trends of all of these. (I have the advantage of knowing some.)

[5] Here's an ancient one that can be pretty well understood by one mind: music. I'm pretty sure one person could understand everything that's currently happening and everything that is known about previous music of humanity. (It's really interesting. You can start here: http://en.wikipedia.org/wiki/Musical_scale .) That's really weird. People have been innovating music forever and yet the field isn't too big to know? Why? Is our innate musical sense just that much better than our ability to comprehend math or science? Probably.

[6] You could compare the religions of the world based on how they see the advance of theological or spiritual knowledge over time. Christians would see it as starting in Eden with minuscule knowledge, then a spike when Eve bites the apple, a long period of occasional increases by the odd prophet here and there until a big jump at Moses, and eventually a giant leap by Jesus, and finally a small increase with the writing of Revelation. After that Catholics and Orthodox would suppose continued small increases through the present day while Protestants would draw a flat line. Islam's line wouldn't look too different except with a giant leap in the early 600s followed by a flat line. I don't really know enough about other religions to say further, even by the loose standard I'm holding myself to here.

Math question

Question: Can an edge of a hypergraph connect to some node more than once in some definition of a graph? If so you could represent a text, for example a book, as a single directed hyperedge running through a graph of words. EDIT: Probably not. If you could you'd lose lots of the useful properties of graphs, I think.
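To make the question concrete, here is a small Python sketch of the idea. In the usual definition a hyperedge is a set of nodes, so it cannot visit a node twice or remember an order. Storing the edge as an ordered sequence instead is an assumption of this sketch, not a standard hypergraph, and the variable names are made up.

```python
from collections import Counter

text = "the cat saw the other cat"
words = text.split()

# Each distinct word is one node of the graph.
nodes = sorted(set(words))

# Usual hyperedge: a set of nodes. Order and repeats are lost.
set_edge = frozenset(words)

# Generalized "directed hyperedge": an ordered sequence that may revisit nodes.
sequence_edge = tuple(words)

# How many times the edge passes through each node.
visits = Counter(sequence_edge)
```

The sequence form can reproduce the text exactly, while the set form only records which four words appear.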

#math #graphs #hypergraphs #hyperedges

Can't I have tasteful with a good keyboard?

I'm writing this in a store that sells laptops, on one of the tackiest computers I've ever seen. I think it's called a Gateway FX or something - imagine a Pontiac Grand Prix with a touchpad. Yet I'm envious and irritated because it has something that I can't have in my wonderfully designed MacBook Pro: fantastic keyboard feel. My MBP has the mushiest, least responsive keyboard I've ever used, at least among those keyboards I've tolerated. It's crap. I'm going to be ordering up a nice IBM Model M keyboard to use while I'm seated at a table, which will be great since that must account for, gosh, at least 5% of my usage. I wonder if I can get a laptop case that will hold a seven pound keyboard?

When should you get on a bandwagon?

My sister and two of my best friends just joined Facebook after years of grouchy protest. Now they love it. Why wouldn't they? All their friends are already there - it's like a personal theme park that's been waiting for them. It's funny, because their early adopter buddies had a much harder time, yet talked the experience up.

So the sticks-in-the-mud have more fun. Is this true of other experiences? I was going to say that that's true of anything with a heavy network effect. But that's not true - I was the first of my circle of friends to get involved with the internet and it was a fantastic experience. The early Twitter had that feel, and blogging. CB radio is said to have been like that too. Why? Simple: that's where the cool kids hang out. Your friends may not be there, but the friends you wish you had are.

So, there you have it. If you want a new crew of hip friends, jump aboard the newest network. They're waiting for you there.

Dropbox viewed from the south

I'm in Mexico right now working around people in the mining industry and I just got an invitation from one of the Mexican geologists to join a Dropbox folder to share files about a mining project. Now, I've been using and telling people about Dropbox since they started beta testing. Hearing about a startup from sources unrelated to the people building it is always really cool and feels like a big milestone. That I'm far from Silicon Valley makes it all the more impressive. Plus these are corporate users who have told me they'd be perfectly happy to upgrade to a paying account when they need to. ¡Dropbox les gusta a todos acá! (Everyone here likes Dropbox!)
