Discover Top Posts Tagged with #spell-checker

Fecking spell-checker. You'll get yours when the nukes start flying. Let's see you spell-check your way out of an EMP!

@tyrie2001

#Quote #Tyrie #Tyrie2001 #EMP #Spell-Checker #Skynet #Nukes #PuffBlog

its that moment in the night when my own re-blogs appear in my dash

[CODE] Writing A Simple Spell-Checker in Python

I've written a simple spell checker in Python. It takes a word as input and checks whether it is spelled correctly by looking it up in a dictionary file, words.txt. If the word is not found, it prints the top 5 suggestions. The dictionary file contains the list of words that are considered correct. You can swap in a different file or add words to the existing one; the script reads it each time it starts. Please note that I'm a beginner in Python, so don't expect this code to be highly optimised or accurate. Docstrings are included wherever possible.

import re
import sys

alphabets = "abcdefghijklmnopqrstuvwxyz"
filename = 'words.txt'

def convertToList(text):
    ''' converts input taken from the file to a list of words '''
    return re.findall(r"[a-z']+", text.lower())

# Read the dictionary once; the with-block closes the file afterwards.
with open(filename, "r") as fp:
    ALL_WORDS = convertToList(fp.read())
# A set keeps each membership test fast; the list is still used by the regex fallback.
WORD_SET = set(ALL_WORDS)

def combinations1(word):
    ''' returns a set which contains all the possibilities of what the correct word can be '''
    length = len(word)
    inserts = [word[:i] + c + word[i:] for i in range(length+1) for c in alphabets]
    deletes = [word[:i] + word[i+1:] for i in range(length)]
    replaces = [word[:i] + c + word[i+1:] for i in range(length) for c in alphabets]
    transposes = [word[:i] + word[i+1] + word[i] + word[i+2:] for i in range(length-1)]
    # a second adjacent swap applied to every single transposition
    transposes2 = [w[:i] + w[i+1] + w[i] + w[i+2:] for i in range(length-1) for w in transposes]
    return set(inserts + deletes + replaces + transposes + transposes2)

def inFile(words):
    ''' returns a set of all those words from the iterable given as the argument which are in the file (ALL_WORDS) '''
    return set(a for a in words if a in WORD_SET)

def rankingCoeff(edits, word):
    ''' determines the Ranking Coefficient of the correct words and returns a dictionary of the form {word: rankingCoeff} '''
    wordset = set(word)
    coeff = {}
    for e in edits:
        eset = set(e)
        # doubled Jaccard coefficient of the two letter sets
        similarity = len(eset & wordset) * 2.0 / len(eset | wordset)
        # penalty for the relative difference in length (true division)
        coeff[e] = similarity - abs(len(word) - len(e)) / len(word)
    return coeff

def calculatemax(coeff):
    ''' returns up to five of the correct words with the highest ranking coefficients '''
    return sorted(coeff, key=coeff.get, reverse=True)[:5]

def calculatere(word, top):
    ''' returns a list containing the words which are somewhat similar to the given word using regular expressions -- this function is only called when the words matched using the ranking coefficients are less than 5 '''
    matches = set(top)
    text = " ".join(ALL_WORDS)
    i = len(word) - 1
    # try ever shorter prefixes of the word until there are enough matches
    while len(matches) < 5 and i > 1:
        exp = r"\b" + re.escape(word[:i]) + r"[a-z']*\b"
        matches.update(re.findall(exp, text))
        i -= 1
    return list(matches)

word = input()
word = re.search(r"[a-z']+", word.lower())
if word:
    word = word.group()
if not word or word in WORD_SET:
    print("The word is correctly spelt.")
    sys.exit(0)

candidates = inFile(combinations1(word))
# also try each candidate with its last letter replaced by 's
apostrophes = [a[:-1] + "'s" for a in candidates]
candidates.update(inFile(apostrophes))

top = []
if candidates:
    top = calculatemax(rankingCoeff(candidates, word))
if len(top) < 5:
    retop = calculatere(word, top)
    # calculatere also returns the words already in top, so skip those
    top.extend(w for w in retop if w not in top)

print("The word is incorrectly spelt. The nearest options are :")
# there may be fewer than five suggestions, so slice instead of indexing
for suggestion in top[:5]:
    print(suggestion)
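To try the candidate-generation step on its own, without a words.txt file, here is a minimal Python 3 sketch. It is a simplification, not the script above: it only generates strings a single edit away (no second round of transpositions), and the names SAMPLE_WORDS, single_edits and known are made up for this illustration, with a tiny in-memory set standing in for the dictionary file.

```python
import string

# Hypothetical stand-in for the contents of words.txt.
SAMPLE_WORDS = {"spell", "spelt", "smell", "sell", "shell", "speller"}

def single_edits(word):
    """Return every string one insert, delete, replace or adjacent swap away from word."""
    letters = string.ascii_lowercase
    # Every way of cutting the word into a head and a tail.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = {head + c + tail for head, tail in splits for c in letters}
    deletes = {head + tail[1:] for head, tail in splits if tail}
    replaces = {head + c + tail[1:] for head, tail in splits if tail for c in letters}
    swaps = {head + tail[1] + tail[0] + tail[2:] for head, tail in splits if len(tail) > 1}
    return inserts | deletes | replaces | swaps

def known(candidates, dictionary):
    """Keep only the candidates that appear in the dictionary set."""
    return set(candidates) & dictionary

print(sorted(known(single_edits("spel"), SAMPLE_WORDS)))  # ['spell', 'spelt']
```

Using sets for both the candidates and the dictionary turns the filtering step into a single intersection, which is one reason to load the word list into a set rather than a list.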

Algorithm:
1. Take the word as input. Let's denote it as w.
2. Compute a set A, which contains all the possibilities of what the correct word could be, by applying insertions, deletions, replacements and transpositions to w.
3. Compute a set P, which contains those words from A that are present in the dictionary file.
4. Compute a set X, which contains all the letters of the word w.
5. For each word e in P, compute a set Y, which contains all the letters of e.
5.1 Rank all the words using the formula (derived by me, can be further modified): Rank = 2 * |X intersection Y| / |X union Y| - abs(len(w) - len(e)) / len(w)
6. Select the n words with the highest ranks (n is the number of suggestions you want to display) and display them.
7. If P contains fewer than n words, use regular expressions to find further matches: dictionary words that start with progressively shorter prefixes of w.

Some Key Points:
1. Take care of the case: lowercase letters won't match uppercase ones, so the script lowercases both the dictionary and the input.
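As a worked example of the ranking step, the sketch below scores a few candidates for the misspelling "spel". It follows the expression used in the posted code's rankingCoeff (the letter-set overlap is doubled and the length difference is taken as an absolute value). The function names rank and top_n are invented for this illustration, and it assumes Python 3 and a non-empty input word.

```python
def rank(word, candidate):
    """Score candidate against word: doubled letter-set Jaccard minus relative length gap."""
    x, y = set(word), set(candidate)
    jaccard = len(x & y) / len(x | y)
    # Assumes word is non-empty; an empty word would divide by zero here.
    length_penalty = abs(len(word) - len(candidate)) / len(word)
    return 2.0 * jaccard - length_penalty

def top_n(word, candidates, n=5):
    """Return the n candidates with the highest rank, best first."""
    return sorted(candidates, key=lambda c: rank(word, c), reverse=True)[:n]

for candidate in ["spell", "sell", "spelt"]:
    print(candidate, round(rank("spel", candidate), 2))
# spell 1.75  (same letters, one character longer)
# sell 1.5    (3 of 4 letters shared, same length)
# spelt 1.35  (4 of 5 letters shared, one character longer)

print(top_n("spel", ["spelt", "sell", "spell"], n=2))  # ['spell', 'sell']
```

Because the score works on sets of letters, repeated letters and letter order do not affect the overlap term; only the length penalty separates "spell" from a hypothetical anagram of the same length.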

I can provide further explanation on request. :)