Some tricks for handling non-ASCII text when analyzing tweets
For Twxplorer, we needed to clean up the "terms" in tweets so that emoji and other noise would not clutter the counts.
In the first implementation, we simply forced Unicode strings to ASCII using this Python 2 code:
s = s.encode('ascii', 'replace').replace('?', '').lower()
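This broke down when we wanted to support non-English languages, which often use non-ASCII characters such as ñ, ç and é: the ASCII conversion strips those letters and mangles the terms. Here's a solution which seems to work, at least for the languages we support. It lowercases the string and drops only the characters whose Unicode general category starts with "C" (control, format, surrogate, private-use and unassigned code points), so accented letters are kept. I think it would break with CJKV languages.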
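import unicodedata
# Keep every character whose Unicode general category is not in the "C" (Other) group.
# ''.join(...) returns a string on both Python 2 and Python 3, unlike filter().
s = ''.join(ch for ch in s.lower() if unicodedata.category(ch)[0] != 'C')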
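
To make the difference between the two approaches concrete, here is a minimal Python 3 sketch. The names ascii_only and clean_term and the drop_symbols flag are made up for this example and are not part of Twxplorer. ascii_only uses errors='ignore' instead of the replace-then-strip trick, so real question marks survive. On Python 3, many emoji (for example U+1F600) are reported as category "So" rather than one of the "C" categories, so the sketch also drops "So" characters by default. That extra step is an assumption about what is wanted, not something the original one-liner does.

```python
import unicodedata


def ascii_only(text):
    """Variant of the first approach: drop everything outside ASCII."""
    # errors='ignore' silently discards characters that cannot be encoded.
    return text.encode('ascii', 'ignore').decode('ascii').lower()


def clean_term(text, drop_symbols=True):
    """Category-based filter that keeps letters from any script.

    drop_symbols is a hypothetical switch added for this sketch.
    """
    kept = []
    for ch in text.lower():
        category = unicodedata.category(ch)
        if category[0] == 'C':
            continue  # control, format, surrogate, private use, unassigned
        if drop_symbols and category == 'So':
            continue  # "Symbol, other", which covers many emoji on Python 3
        kept.append(ch)
    return ''.join(kept)


# U+00F1 is n with tilde, U+1F600 is the grinning face emoji.
sample = 'Ma\u00f1ana\U0001F600'
print(ascii_only(sample))   # the accented letter is lost
print(clean_term(sample))   # the accented letter is kept, the emoji is dropped
```

Passing drop_symbols=False gives the behavior of the category test alone, which is useful for checking what the "C" filter does and does not remove on a given interpreter.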













