Sometimes, you just want Python to shut up about Unicode Errors
There are times when you're messing with text and you're like:
"OMG I do not care, please just do whatever you want with the text, it can have weird symbols or question marks in it for all I care!"
And then, even then, you'll see something horrible like this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)
For times when you don't care what the text actually contains, how it's encoded, or what it will look like when you're done with it, you can use the following helpful function:
    def unmessup(text):
        # Try the likeliest encodings first
        for enc in ['utf-8', 'latin-1', 'unicode_escape']:
            try:
                return text.decode(enc)
            except UnicodeError:
                pass
            except AttributeError:
                # if text doesn't have the .decode attribute
                # it is either a Python unicode string or not a string at all
                return text
        # Every codec refused: hand the value back unchanged instead of None
        return text
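The function as written targets Python 2, where a plain str is a byte string. Here's a rough sketch of the same idea for Python 3, where only bytes objects have .decode. The name unmessup_py3 is made up for this example, and the explicit isinstance check is an assumption about how you'd want non-bytes input handled.

```python
def unmessup_py3(data):
    """Best-effort decode: bytes in, str out; anything else comes back as-is."""
    if not isinstance(data, (bytes, bytearray)):
        # Already text (str) or not a string at all: nothing to decode.
        return data
    for enc in ('utf-8', 'latin-1', 'unicode_escape'):
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort if every codec refused: swap bad bytes for U+FFFD.
    return data.decode('utf-8', errors='replace')


print(unmessup_py3(b'caf\xc3\xa9'))   # valid UTF-8
print(unmessup_py3(b'caf\xe9'))       # not valid UTF-8, so latin-1 handles it
print(unmessup_py3(42))               # not bytes, returned untouched
```

Both byte strings come out as the same four-character word, even though they went in with different encodings, which is the whole "I do not care" point.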
This function goes through a few common encodings in order from most to least likely. Odds are good, very good, that you're just looking at a non-unicode string that is utf-8 encoded, so we try that first. If that fails, Latin 1 (also known as ISO 8859-1) is the most likely alternative. And Latin 1 can't fail on a byte string: every one of the 256 possible byte values maps to a character, so even if it's the wrong guess, you get something back.
That makes unicode_escape a backstop that a byte string should never actually reach, the hail mary of text encodings. For ordinary bytes it behaves just like Latin 1; the one thing it adds is that it interprets backslash escapes such as \n or \u5317 when they are spelled out in the text. Either way, the attitude is, "look, whatever bytes there are here, just put them right into the resultant string, one character per byte, without trying to work out what they were supposed to mean."
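Here's a quick sketch for poking at how these three codecs treat arbitrary bytes. It assumes Python 3 (the reprs in this post are Python 2 style), and the variable names are just for illustration.

```python
# Every possible byte value, 0x00 through 0xff.
all_bytes = bytes(range(256))

# latin-1 maps each byte to the code point with the same number,
# so decoding always yields exactly one character per byte.
as_latin1 = all_bytes.decode('latin-1')
print(len(as_latin1) == 256)

# utf-8 is pickier: a lone 0xcb byte is an incomplete sequence.
try:
    b'\xcb'.decode('utf-8')
except UnicodeDecodeError as err:
    print('utf-8 refused:', err.reason)

# unicode_escape treats plain high bytes the same way latin-1 does...
high = b'\xa5_\xa8\xca'
print(high.decode('unicode_escape') == high.decode('latin-1'))

# ...but it also interprets backslash escapes spelled out as ASCII text.
escaped = b'\\u5317\\u4eac'   # 12 ASCII bytes, not 2 characters
print(escaped.decode('unicode_escape') == '\u5317\u4eac')
```

The last two checks are the interesting ones: unicode_escape only differs from Latin 1 when the bytes happen to contain literal backslash sequences.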
The result will be horrific. For instance, the word Beijing written in Chinese, 北京, becomes '\xa5_\xa8\xca' when encoded using Big5. Decoded using Big5, it looks like u'\u5317\u4eac', but decoded with the unicode_escape codec (or with Latin 1, which treats these bytes identically), it becomes u'\xa5_\xa8\xca'. Notice that none of the codes have changed. The only difference is that the string is now prefixed with u, which means "Sure, buddy. This is unicode. Don't worry about it." And when you print it out, it looks like this:
¥_¨Ê
That's not anywhere close to the original. So it's wrong. Which is absolutely why UnicodeErrors exist in the first place: to keep you from writing a travel book that encourages readers to visit China's amazing capital city, Yen Underscore Umlaut E-Circumflex.
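If you want to reproduce the Beijing example, here's a rough Python 3 take on it. Python 3 is an assumption here: the byte string becomes a bytes object and there's no u prefix on the decoded text.

```python
beijing = '\u5317\u4eac'      # the two characters for Beijing

raw = beijing.encode('big5')
print(raw)                    # the Big5 bytes

# Decoding with the codec that produced the bytes round-trips cleanly.
print(raw.decode('big5') == beijing)

# Decoding the same bytes as latin-1 "succeeds" too, but every byte
# turns into its own unrelated character.
wrong = raw.decode('latin-1')
print([hex(ord(ch)) for ch in wrong])
print(wrong == beijing)
```

Nothing raises an exception at any point, which is exactly the trade: no errors, no guarantees.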
However, if you're dealing with situations where you really just don't care, my unmessup function will at least keep you from having to write code riddled with try…except UnicodeError.