Autocorrect, word forms and other linguistic tricks
I have recently been tempted to go "all in" on semantic analysis.
I have played around with the Sphinx search engine, even wrote a plugin for it, and checked out its logic for infixes (suffixes, prefixes), word forms, spelling, stemming, etc.
I have to say, the code in there is a mess, though it is still fast. It's hard to make heads or tails of a single 28,000-line .cpp file (yes, that's not a typo: 28k lines). Maybe they keep it that way to stop others from playing around with the code?! What's the point of open source then?
Anyway, for the curious minds, here are the results of my research from the past couple of days:
spelldump - a great utility that comes with Sphinx. Have a read here: http://sphinxsearch.com/docs/1.10/ref-spelldump.html
snowball - a stemming library with stemmers for Romance, Germanic, Scandinavian and Russian languages http://snowball.tartarus.org/
php-stemmer - http://code.google.com/p/php-stemmer/
word lists - a dictionary of all the English words (not really, but close enough) http://wordlist.sourceforge.net
soundex - a phonetic algorithm http://en.wikipedia.org/wiki/Soundex
metaphone - a better phonetic algorithm http://en.wikipedia.org/wiki/Metaphone
double metaphone - an even better (but way slower) phonetic algorithm http://en.wikipedia.org/wiki/Metaphone
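levenshtein distance - the distance between two words, counted as the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other http://en.wikipedia.org/wiki/Levenshtein_distance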
ispell - a spelling and typographical error corrector; non-GNU, but distributed as part of GNU. http://www.gnu.org/software/ispell/ispell.html
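To make the Levenshtein entry a bit more concrete, here is a minimal sketch of the textbook dynamic-programming version. It is written in Python purely for illustration (my choice, not something the tools above require), and the function name is just a label I picked.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of a[:i]
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return dist[-1][-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The table costs len(a) * len(b) time and memory, which is fine for single words but worth keeping in mind before comparing against a whole dictionary.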
So how can you use these?
If the word is not in your dictionary, get its phonetic form (through one of the algorithms above) and calculate the Levenshtein distance to the phonetic representations of all the other words you know. Find the closest match and get its stem. Now check again whether that is in your accepted list.
If it is, let the people submitting the words know that this is a match based on autocorrect (something like "did you mean ...?").
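The steps above could be sketched roughly like this in Python. Treat it as a rough sketch under assumptions of mine rather than a finished implementation: `toy_stem` is a crude suffix stripper standing in for a real Snowball stemmer, the word lists are made up, `lookup` and its `max_code_distance` cutoff are hypothetical names, and ties between equal phonetic distances are broken by plain spelling distance, which is an extra step of my own.

```python
# Soundex digit for each consonant; vowels, h, w and y carry no code.
CODES = {ch: str(d)
         for d, group in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1)
         for ch in group}


def soundex(word):
    """Classic 4-character Soundex key."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key, prev = letters[0].upper(), CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":       # h and w do not separate equal codes
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code          # a vowel resets prev, so a repeat counts again
    return (key + "000")[:4]


def levenshtein(a, b):
    """Edit distance, keeping only the previous row of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def toy_stem(word):
    """Assumption: a toy stand-in for a real stemmer such as Snowball."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lookup(word, known_words, accepted, max_code_distance=1):
    """Return (accepted_word, was_autocorrected), or None if nothing fits."""
    word = word.lower()
    if word in accepted:
        return word, False
    code = soundex(word)
    # Closest phonetic key first; spelling distance only breaks ties.
    best = min(known_words,
               key=lambda w: (levenshtein(code, soundex(w)), levenshtein(word, w)),
               default=None)
    if best is None or levenshtein(code, soundex(best)) > max_code_distance:
        return None
    stem = toy_stem(best)
    return (stem, True) if stem in accepted else None


known_words = ["search", "searching", "indexes", "spelling"]  # made-up sample data
accepted = {"search", "index", "spell"}

result = lookup("serching", known_words, accepted)
if result and result[1]:
    print(f'did you mean "{result[0]}"?')
```

Soundex keys are only four characters long, so many words end up at the same distance; that is why the sketch needs a tie-breaker and a cutoff, and a richer key such as metaphone would probably separate candidates better.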
All of the above are good candidates for C or C++, but even PHP has implementations of these (mostly written in C and made available through native functions):
http://us.php.net/manual/en/function.levenshtein.php
http://us.php.net/manual/en/function.soundex.php
http://us.php.net/manual/en/function.metaphone.php
http://code.google.com/p/php-stemmer/
Hope the above helps you get started with whatever you want to do. There are tons of applications. Please comment and let me know.