Discover Top Posts Tagged with #unicodeerror


Django UnicodeError in Ubuntu 16.04 LTS

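I moved a Django project that I had been testing on Windows over to Ubuntu.

After the move, an error I had never seen before started showing up: 'UnicodeError'.

It occurred whenever Korean text was output.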

'ascii' codec can't encode character '' in position : ordinal not in range(128)

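Like that. At first I assumed the problem was in the Django code I had written.

I searched online for fixes and tried various things, such as converting str to Unicode, but in the end nothing solved it...

While miserably poking around in the shell, I realized that Django was installed under Python 2.x.

On Windows I had of course installed Django on python3, and running python there starts 3.x, so this was something I had completely failed to notice.

Python 2's default encoding is ascii. You can work with utf-8 or unicode there, but it is often cumbersome. Python 3, by contrast, seems to have no such trouble, perhaps because its strings are unified as unicode?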
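For what it's worth, the message above is easy to reproduce. A minimal sketch, assuming Python 3 (where ascii has to be requested explicitly, whereas Python 2 fell back to it on its own); the sample string and the helper name try_encode are made up for illustration:

```python
text = "한글"  # hypothetical sample of Korean text


def try_encode(s, codec):
    """Return the encoded bytes, or the error message if the codec cannot represent s."""
    try:
        return s.encode(codec)
    except UnicodeEncodeError as exc:
        return str(exc)


# ascii has no code for Hangul, so this yields the "ordinal not in range(128)" message
print(try_encode(text, "ascii"))
# utf-8 can represent any Unicode character, so this yields bytes
print(try_encode(text, "utf-8"))
```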

On Linux, pip seems to install modules into 2.x by default. So I decided to reinstall using pip3.

I removed Django from python2.x and installed it with pip3, and this time it landed in python3.x. And Korean text now prints just fine!
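A quick way to catch this kind of mix-up earlier is to ask the interpreter itself what it is and where it finds a module. A small sketch, written so that it should run unchanged under both python and python3; describe_environment is a hypothetical helper, not part of Django or pip:

```python
import sys


def describe_environment(module_name="django"):
    """Report the running interpreter and where (if anywhere) it finds module_name."""
    try:
        module = __import__(module_name)
        location = getattr(module, "__file__", None)
    except ImportError:
        location = None  # not installed for this interpreter
    return {
        "python_version": "%d.%d" % tuple(sys.version_info[:2]),
        "executable": sys.executable,
        "default_encoding": sys.getdefaultencoding(),
        "module_location": location,
    }


for key, value in sorted(describe_environment().items()):
    print("%s = %s" % (key, value))
```

Running it once with each interpreter shows which one actually has Django, and which default encoding that interpreter reports.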

#django #UnicodeError

Sometimes, you just want Python to shut up about Unicode Errors

There are times when you're messing with text and you're like

"OMG I do not care, please just do whatever you want with the text, it can have weird symbols or question marks in it for all I care!"

And then, even then, you'll see something horrible like this:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)

For times when you don't care what the text actually contains, how it's encoded, or what it will look like when you're done with it, you can use the following helpful function:

    def unmessup(input):
        for enc in ['utf-8', 'latin-1', 'unicode_escape']:
            try:
                return input.decode(enc)
            except UnicodeError:
                pass
            except AttributeError:
                # if input doesn't have the .decode attribute
                # it is either a Python unicode string or not a string at all
                return input
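As far as I can tell the same approach carries over to Python 3, where the undecoded type is bytes rather than str. A sketch of it in use, with the argument renamed to data so it no longer shadows the built-in input; the sample values are invented for illustration:

```python
def unmessup(data):
    """Decode bytes with the first encoding that succeeds; return anything else unchanged."""
    for enc in ("utf-8", "latin-1", "unicode_escape"):
        try:
            return data.decode(enc)
        except UnicodeError:
            pass  # wrong guess, try the next encoding
        except AttributeError:
            # no .decode attribute: already text, or not a string at all
            return data


samples = [
    "café".encode("utf-8"),  # valid UTF-8 bytes
    b"caf\xe9",              # Latin-1 bytes, which are invalid as UTF-8
    "already text",          # str has no .decode in Python 3
    42,                      # not a string at all
]
results = [unmessup(s) for s in samples]

for result in results:
    # ascii() keeps the printout safe on terminals that cannot show non-ASCII text
    print(ascii(result))
```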

This function goes through a few common encodings in order from most to least likely. Odds are good, very good, that you're just looking at a non-unicode string that is utf-8 encoded. So we try that first. If that fails, Latin 1 (which corresponds to ISO 8859-1) is the most likely alternative. Latin 1 assigns a character to every possible byte value, so at this point the function has returned something, even if it's wrong.

That makes the last entry, unicode_escape, a just-in-case fallback that the loop should not normally reach. It is basically the hail mary of text encodings. It basically says, "look, whatever characters there are here, just put them, as entered, right into the resultant string, without trying to decode them or alter them in any fashion whatsoever." The one exception is backslash escape sequences, which it does interpret.
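One detail about that ordering is worth checking rather than taking on faith: Latin 1 has a character for every byte value, so it never raises, while unicode_escape interprets backslash escapes and can raise. A quick check, assuming Python 3; the variable names are made up for illustration:

```python
# Latin-1 maps each of the 256 byte values to the code point with the same number,
# so decoding with it cannot fail.
every_byte = bytes(range(256))
as_latin1 = every_byte.decode("latin-1")
latin1_is_total = [ord(ch) for ch in as_latin1] == list(range(256))

# unicode_escape turns backslash escapes into the characters they name...
escaped = rb"\u5317\u4eac".decode("unicode_escape")

# ...and, unlike Latin-1, it can fail, for example on a truncated escape.
try:
    b"\\x".decode("unicode_escape")
    truncated_escape_failed = False
except UnicodeDecodeError:
    truncated_escape_failed = True

print(latin1_is_total, ascii(escaped), truncated_escape_failed)
```

So with 'latin-1' second in the list, the third encoding is only there as a safety net.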

The result will be horrific. For instance, the word Beijing written in Chinese: 北京, when encoded using Big5, becomes '\xa5_\xa8\xca'. When decoded using Big5, it looks like u'\u5317\u4eac', but when using the unicode_escape codec, it becomes: u'\xa5_\xa8\xca'. Notice that none of the codes have changed. The only difference is that the string is now prepended with u, which means "Sure, buddy. This is unicode. Don't worry about it." And when you print it out, it looks like this:

¥_¨Ê

That's not anywhere close to the original. So it's wrong. Which is absolutely why UnicodeErrors exist in the first place: to keep you from writing a travel book that encourages you to visit China's amazing capital city, Yen Underscore Umlaut E-Circumflex.
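The Beijing example can be replayed step by step. A minimal sketch, assuming Python 3 (so the byte string is a bytes object and there is no u prefix); the variable names are made up:

```python
original = "北京"

raw = original.encode("big5")         # the Big5 bytes for the two characters
right = raw.decode("big5")            # decoding with the matching codec restores the text
wrong = raw.decode("unicode_escape")  # no backslash escapes, so every byte becomes one character

# ascii() keeps the printout safe on terminals that cannot show non-ASCII text
print(raw)
print(ascii(right))
print(ascii(wrong))
```

Each byte value survives unchanged as a code point, which is why the wrong result has four characters where the original had two.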

However, if you're dealing with situations where you really just don't care, my unmessup function will at least keep you from having to write code riddled with try…except UnicodeError.
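If question marks really are acceptable, a different technique from the one above may also do the job: the errors argument of decode. This is only a sketch, assuming Python 3 and assuming UTF-8 is the best single guess; decode_lossy is a hypothetical name:

```python
def decode_lossy(data, encoding="utf-8"):
    """Decode bytes, swapping undecodable bytes for U+FFFD instead of raising."""
    if isinstance(data, (bytes, bytearray)):
        return data.decode(encoding, errors="replace")
    return data  # already text, or not a string at all


lossy = decode_lossy(b"caf\xe9 ok")
# ascii() keeps the printout safe on terminals that cannot show U+FFFD
print(ascii(lossy))
```

Unlike falling back to Latin 1, this marks the spots that could not be decoded rather than silently turning them into other characters.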

#UnicodeError #unicode #python #text encoding