“A squawking man’s biggest regret” and other fake TED talks: How I made a title generator
As part of a project for a Natural Language Processing class, I decided to build a generator that invented fake titles of TED talks. The trick was to get titles that made at least some sense, sounded like titles, and were made of pieces found within a set of TED titles available online. I was determined to make the generator learn from preexisting data only; anyone can write a bunch of Mad-Lib-style templates like “10 things you never knew about ____” or “The surprising beauty of ______” or “The future of ________” and plug in random words. I wanted the generator to work out on its own which syntactic structures were allowed and which weren’t, meaning it could invent a structure without being handed a complete title minus one noun.
First attempt (abysmal)
My first attempt was really straightforward. I used the Stanford Part-of-Speech Tagger to tag all of the existing TED titles and store the possible tag sequences. For example, “Why I do theater” is tagged as:
Wh-determiner, personal pronoun, base-form verb, singular noun

I also stored all of the words of all the titles in a dictionary so that they could be accessed by their tags: separate arrays of singular nouns, prepositions, -ing verbs, etc. To generate a title, all I had to do was pick a random sequence of tags, and then for each tag in that sequence, pick a random word. Makes sense, right?
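Here is a minimal sketch of that first approach, not the original code. It assumes the titles have already been tagged; the tiny hand-tagged corpus below stands in for the tagger's output, its third title is invented purely for illustration, and the names `build_model` and `generate_title` are hypothetical.

```python
import random
from collections import defaultdict

# Hand-tagged toy corpus standing in for real tagger output (Penn Treebank
# style tags). The third title is made up for illustration only.
TAGGED_TITLES = [
    [("Why", "WDT"), ("I", "PRP"), ("do", "VB"), ("theater", "NN")],
    [("The", "DT"), ("beautiful", "JJ"), ("math", "NN"), ("of", "IN"), ("coral", "NN")],
    [("Any", "DT"), ("wrong", "JJ"), ("cancer", "NN"), ("upon", "IN"), ("water", "NN")],
]


def build_model(tagged_titles):
    """Collect the observed tag sequences and a tag -> words lookup."""
    sequences = []
    words_by_tag = defaultdict(list)
    for title in tagged_titles:
        sequences.append(tuple(tag for _, tag in title))
        for word, tag in title:
            words_by_tag[tag].append(word.lower())
    return sequences, words_by_tag


def generate_title(sequences, words_by_tag, rng=random):
    """Pick a random tag sequence, then a random word for each tag in it."""
    tags = rng.choice(sequences)
    return " ".join(rng.choice(words_by_tag[tag]) for tag in tags)


if __name__ == "__main__":
    sequences, words_by_tag = build_model(TAGGED_TITLES)
    for _ in range(3):
        print(generate_title(sequences, words_by_tag))
```

Nothing in this sketch constrains which words may sit next to each other: every word sharing a tag is treated as interchangeable.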
Unfortunately, it turned out not to be that simple. This model didn’t know enough about which words could follow one another. “The beautiful math of coral” has the same tag sequence as “Any wrong cancer upon secrets”, which is nonsense.
Fixing the tagger*
The problem with the random-sequence-random-words approach was the lack of features in the tags. Sure, past tense verbs were distinguished from present tense verbs and plural nouns were distinguished from singular nouns (and that’s a huge deal and very impressive), but the constraints just weren’t enough to generate anything that wasn’t a big bowl of word salad. I tried to identify the recurring problems in tagging and figure out ways around them. I noticed first that a lot of function words (things like prepositions, “the”, “a”, “what”, etc., that don’t carry most of the meaning in the sentence) are not interchangeable but are treated as such by the tagger. For example, all prepositions held the same status and all personal pronouns held the same status. However, “upon” just doesn’t work the same as “of” (as in “Any wrong cancer upon secrets”). For that matter, “any” does not work the same as “the”, even though they’re both technically determiners according to the tagger. “The wrong cancer of secrets” is much closer to making sense than “Any wrong cancer upon secrets”. This kind of thing happened a lot.

To fix it, I took a list of “stop words”, which are very common words that show up all the time and are often removed when doing any kind of text analysis. I then had each stop word tagged as itself. “The” is now just a “the” and “any” is an “any”. Likewise, before this change I could have gotten a sequence like “I has”, because “I” and “he” were indistinguishable as personal pronouns. No longer!
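The stop-word fix can be sketched roughly like this. The stop list here is a tiny illustrative sample rather than the list actually used, and `retag_stop_words` is a hypothetical name; the function simply replaces a stop word's tag with the lowercased word itself.

```python
# Tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "any", "of", "upon", "what", "i", "he"}


def retag_stop_words(tagged_title, stop_words=STOP_WORDS):
    """Give every stop word its own tag (the lowercased word itself)."""
    retagged = []
    for word, tag in tagged_title:
        lowered = word.lower()
        retagged.append((word, lowered if lowered in stop_words else tag))
    return retagged


if __name__ == "__main__":
    title = [("The", "DT"), ("beautiful", "JJ"), ("math", "NN"),
             ("of", "IN"), ("coral", "NN")]
    print(retag_stop_words(title))
```

With these tags, a slot that once held “of” can only ever be filled with “of” again, so “upon” can no longer wander into it.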
This made things better, but there were still some issues that the tagger couldn’t handle. The two big ones were verb transitivity and the mass/count noun distinction. Verb transitivity has to do with whether a verb can take a direct object (whether you can put a noun directly after it to express that the verb is done to that noun). For example: you can kick a ball or pet a cat, but you can’t “arrive” anything. You can arrive AT a place, but that’s an indirect relationship, not a direct one. “Sit” and “lie” are other good examples. The Stanford tagger does not distinguish transitive verbs from intransitive ones. This means the generator can produce an intransitive verb followed by a noun: “How I arrived my piano” (pianos come up a lot in TED talks for some reason). My adhesive-bandage solution (look Ma, no buzz marketing!) was to create a new tag for transitive verbs. Any verb that was immediately followed by an adjective, a noun, or a determiner (e.g. “the” or “a”) was marked as Transitive.
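A rough sketch of that transitivity heuristic follows. It makes several assumptions: it runs on the tagger's raw Penn Treebank style tags (before any stop-word retagging), it marks each verb occurrence separately, and it writes the new tag by appending a made-up `-T` suffix. `mark_transitive` is a hypothetical name.

```python
# Tag prefixes that signal the start of a direct object:
# adjectives (JJ*), nouns (NN*), and determiners (DT).
OBJECT_START_PREFIXES = ("JJ", "NN", "DT")


def mark_transitive(tagged_title):
    """Append '-T' to any verb tag that is immediately followed by an
    adjective, noun, or determiner."""
    marked = []
    for i, (word, tag) in enumerate(tagged_title):
        next_tag = tagged_title[i + 1][1] if i + 1 < len(tagged_title) else ""
        if tag.startswith("VB") and next_tag.startswith(OBJECT_START_PREFIXES):
            tag = tag + "-T"
        marked.append((word, tag))
    return marked


if __name__ == "__main__":
    kicked = [("How", "WRB"), ("I", "PRP"), ("kicked", "VBD"),
              ("the", "DT"), ("ball", "NN")]
    arrived = [("How", "WRB"), ("I", "PRP"), ("arrived", "VBD"),
               ("at", "IN"), ("school", "NN")]
    print(mark_transitive(kicked))   # "kicked" gets the transitive tag
    print(mark_transitive(arrived))  # "arrived" keeps its plain verb tag
```

Because only the next tag is inspected, a verb that happens to be followed by a noun for some other reason would also be marked, which is part of why this is an adhesive-bandage fix.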
The next challenge was mass nouns and count nouns. English (like most languages) has different types of nouns. Count nouns are anything you can put after an indefinite article “a” or “an”. A book. A laptop. A squirrel. Mass nouns are more like substances that are harder to separate into discrete items that you can count. Stuff. Water. Pudding. Independence. We can probably find a scenario in which it’s ok to say “5 puddings”, but it would need to be understood that we mean “cups of pudding” or something.
If you don’t have this distinction in a language generator, you get phrases like “How much squirrel” or “a lot of laptop”. I worked around this much the same way I dealt with the verbs. Any noun that came after a number or an indefinite article got marked as a count noun, and everything else defaulted to a mass noun. A lot of nouns can go either way (“I have a fear of flying” and “I was paralyzed by fear” are both fine), so unless a noun shows up in the corpus behaving as both a mass noun and a count noun, my generator won’t know that it could be either. But such is the problem with generators like this. They only see what’s in the training corpus.
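The count/mass rule might look something like the sketch below. It assumes “came after” means “immediately after”, that numbers carry the Penn Treebank tag `CD`, and that the new tags are written with made-up `-COUNT` and `-MASS` suffixes; `mark_count_mass` is a hypothetical name.

```python
INDEFINITE_ARTICLES = {"a", "an"}


def mark_count_mass(tagged_title):
    """Mark a noun as count if it directly follows a number or an
    indefinite article; every other noun defaults to mass."""
    marked = []
    for i, (word, tag) in enumerate(tagged_title):
        if tag.startswith("NN"):
            prev_word, prev_tag = tagged_title[i - 1] if i > 0 else ("", "")
            is_count = prev_tag == "CD" or prev_word.lower() in INDEFINITE_ARTICLES
            tag = tag + ("-COUNT" if is_count else "-MASS")
        marked.append((word, tag))
    return marked


if __name__ == "__main__":
    print(mark_count_mass([("A", "DT"), ("fear", "NN"),
                           ("of", "IN"), ("flying", "VBG")]))
    print(mark_count_mass([("Paralyzed", "VBN"), ("by", "IN"), ("fear", "NN")]))
```

In the two examples, “fear” ends up with a count tag in one title and a mass tag in the other, which is the only way a generator like this learns that a noun can go both ways.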
Now that there were more types of tags than before, the generator started giving me much better results. The longer titles were still sometimes word salad, but the shorter ones were pretty reliably wonderful. Check out some of the results below.
Stay tuned! I’m going to post the actual generator to the web shortly so you can create your own fake TED titles!
*For those of you with some NLP knowledge, I should mention that I tried to use a trigram language model to fix the problems with the tagger, but it didn’t improve the results for a number of reasons I don’t want to get into here.
Generated TED talk titles (personal favorites in bold)
The high childhood
Animations from tiny people
The social truth of galaxies
An electric government
The crap between killing and taking
The college of mental fractals child-driven cyborgs and new remix
The innovation of kids
The mobile and new legos of designers that can learn
Sculpting the power of a law
Hear your city
Why we need new cats
Why voices should protect the thinking
The ecosystem to fission
How dead africa matter can be new
Complicated origins and the end of science
Illusions from jeopardy business lessons
An ugly chief keyboard
What means more than your roots
The time-lapse war cost
Anatomy + democracy
The true architectural gift
Understanding’s global government
Fashion, the democracy music and humanity on the secret tv
How to learn a morality
A true poem for light reason ... from a mother
My 1000 games of species
Refugees of life
Building for climate
How we can embrace our mistakes
Beyond the underwater science
A squawking man's biggest regret
Carseats designed from capitalism
Make grownups the next poverty!
Food’s music
Get your English piano
Reinventing the innovative
Missing the city
Women through cognitive power
That terrifying autism of law
3 scenes in the synthetic past
A new bold salvation
Green college billions and robots
The lost fashion in wonder mathematics
How instruments do global online
The synthetic spectroscopy underneath wireless power
A surprising web of the strange music
Universal mesh
The world by attitudes
Endangered data, medical nonprofits
Chocolate, inventory, and the ollie about tourism
Found a consciousness?
Domesticate inventing
A broken time
The reasons in my peace










