Okay so this is a background post about Text encoding, ASCII and Unicode
Text encoding is the process of turning characters into numbers. It lets you save text as computer data and move that data around.
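To make "characters into numbers" concrete, here is a tiny sketch using Python's built-in `ord` and `chr`. The sample text is just an assumption for illustration.

```python
# Turn each character into its number, then turn the numbers back into text.
text = "Hi!"
codes = [ord(ch) for ch in text]            # character -> number
restored = "".join(chr(n) for n in codes)   # number -> character

print(codes)     # [72, 105, 33]
print(restored)  # Hi!
```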
It was understood very early on that if every user defined their own encoding, no computer could use another's data: one computer's "a" would be another computer's "p", and so the text would be read as gibberish.
And so, a long time ago (in the 1960s), in a continent far, far away, a standard for text encoding was invented: the American Standard Code for Information Interchange, or ASCII.
ASCII took advantage of the fact that English needs very few characters, and so it only had to define 128 of them: each character took 7 bits (1s and 0s), and was sent over a wire. (Notice that not everything is a printable character; there are also control characters like "delete" and "go down a line". These aren't for displaying, they're instructions for whatever machine is reading.)
Something to remember for later: the number 0 is the NULL character, basically "nothing". This is useful because sometimes you want to store text of unspecified length, and so you stick a NULL at the end, and the computer reading it reads until it sees a NULL, and all is well. This will be important later.
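The "read until you see a NULL" idea can be sketched like this. The function name `read_until_null` and the sample bytes are made up for illustration; this is not any particular system's routine.

```python
def read_until_null(buffer: bytes) -> bytes:
    """Return the bytes that come before the first NULL (0) byte."""
    end = buffer.find(b"\x00")
    # If there is no NULL at all, this sketch just returns everything.
    return buffer if end == -1 else buffer[:end]


memory = b"hello\x00leftover junk"
print(read_until_null(memory))  # b'hello'
```

Notice the flip side: a stray 0 byte in the middle of the text would cut the reading short.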
Standard explained, technical info for nerds, go to the next red section to pass
ASCII is a wonderful standard. Remember: everything in electronics is easier with powers of 2 (1, 2, 4, 8, 16, 32 etc.) because of the way we save data (if you want I can explain this further). The first 32 characters are the control characters. Want to check if something isn't a control character? Check if it's 32 or bigger, out of 128 total, and you're done (both powers of 2); the one straggler is "delete", which sits at the very end as 127. The lowercase letters are 32 + their uppercase counterparts. All the digits have their top bits in common, and the remaining bits are just the digit's value. Truly, a marvel of engineering.
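Here is a rough sketch of those tricks as bit operations. The helper names are invented for this post, and the functions only handle plain ASCII input.

```python
def is_control(code: int) -> bool:
    """True for ASCII control codes: 0 to 31, plus 127 (delete)."""
    return code < 32 or code == 127


def to_lower(ch: str) -> str:
    """Lowercase an ASCII capital letter by switching on the bit worth 32."""
    return chr(ord(ch) | 0b0100000) if "A" <= ch <= "Z" else ch


def digit_value(ch: str) -> int:
    """'0' to '9' are 0b0110000 plus their value, so keep the low 4 bits."""
    return ord(ch) & 0b0001111


print(is_control(10), is_control(ord("A")))  # True False
print(to_lower("Q"))                         # q
print(digit_value("7"))                      # 7
```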
All was well until computers hit the scene not too long after, and used bytes. A byte is the standard building block of computer memory: it has 8 bits, so it is basically a whole number from 0 to 255. ASCII only needs 7 of those bits, which leaves one spare.
Some countries, like France, used encodings compatible with ASCII, and used the spare eighth bit to encode their language's extra characters. Different countries used different encodings, and some countries (like Japan) had multiple encodings for the same characters. Each encoding used a different number of bits, and different letters for each number.
But that was fine since, well, how often does a computer in London need to talk to one in Tokyo? All is well.
Then the World Wide Web happens, and suddenly computers speaking different languages read and write complete garbage everywhere.
So an organization called the Unicode Consortium tries to solve the problem by creating one unified character set for all languages. They called the standard Unicode, and the encoding of it that took over the web is UTF-8.
This standard supports 1,114,112 different characters. At present, only around 10% of that capacity is actually used, and that includes dead languages and emojis (which is a wonderful story).
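If you are wondering where 1,114,112 comes from, here is a quick check. It assumes a modern Python 3, where `sys.maxunicode` is the largest character number the standard allows; the "17 blocks of 65,536" breakdown is how Unicode's number space is organized.

```python
import sys

total = sys.maxunicode + 1        # numbers run from 0 up to the maximum
print(total)                      # 1114112
print(hex(sys.maxunicode))        # 0x10ffff
print(total == 17 * 2**16)        # True
```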
Standard explained, technical info for nerds, go to the next red section to pass
Issues to tackle in a universal text encoding standard:
The protocol must be backwards compatible with ASCII: English was the language most users wrote in, and ASCII was the standard for it, so English text in your new standard must be readable as ASCII as well.
The protocol must never send 8 zeros in a row (a byte of all zeros), except for the NULL character itself, otherwise old computers will stop reading in the middle.
You must minimize wasted space: to create a universal standard one could just make every character 32 bits long and call it a day, but you would waste a bunch of space that way, and space is expensive.
You must be able to move from character to character easily. No saving the index of each character in some sort of list.
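Here is a rough check of the first three requirements, using Python's built-in `utf-8` codec next to the "just make everything 32 bits" option (`utf-32-be`). The sample strings are arbitrary.

```python
english = "hello"
mixed = "héllo €"

# 1. English text comes out byte-for-byte identical to ASCII.
print(english.encode("utf-8") == english.encode("ascii"))  # True

# 2. No all-zero bytes sneak in, unlike the fixed 32-bit layout.
print(0 in mixed.encode("utf-8"))        # False
print(0 in english.encode("utf-32-be"))  # True

# 3. Space: bytes needed for the same text.
print(len(english.encode("utf-32-be")))  # 20
print(len(english.encode("utf-8")))      # 5
print(len(mixed.encode("utf-8")))        # 10
```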
English characters are just ASCII. No thinking there. The first bit is set to 0, so they are very easy to spot.
For everything else, here's what you do:
The first byte has its first bit set to 1, so it's not ASCII. From that point onwards, you count the number of remaining ones until a zero appears (for a 2-byte character, that's 1). This is how many more bytes will come. After that zero, the rest of the byte is data. Every following byte of the character starts with the 2 bits 10, and the rest of it is data too.
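The counting rule above can be sketched like this. `extra_bytes` is a name I made up, and the limit of 3 extra bytes reflects the longest characters UTF-8 allows today.

```python
def extra_bytes(first_byte: int) -> int:
    """How many more bytes follow the first byte of a character."""
    if first_byte < 0b10000000:
        return 0  # first bit is 0: plain ASCII, nothing follows
    ones = 0
    mask = 0b01000000  # start at the second bit
    while first_byte & mask:
        ones += 1
        mask >>= 1
    if ones == 0 or ones > 3:
        # 10xxxxxx is a middle byte, and more than 3 ones is not valid.
        raise ValueError("not the first byte of a character")
    return ones


print(extra_bytes(ord("a")))                 # 0
print(extra_bytes("é".encode("utf-8")[0]))   # 1
print(extra_bytes("€".encode("utf-8")[0]))   # 2
```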
Let's say your character is 2 bytes long. Here is how you would represent it:
110xxxxx , 10xxxxxx
and when removing the headers, you'll have
xxxxxxxxxxx
which will be some character.
Let's say your character is 3 bytes long. Here is how you would represent it:
1110some , 10charac , 10ter___
and when removing the headers, you'll have
somecharacter___
which will be another character.
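To try this with real characters, here is a hand-rolled sketch of adding and removing the headers, checked against Python's own encoder. `encode_char` and `strip_headers` are hypothetical helpers; they skip 4-byte characters and do no validation.

```python
def encode_char(ch: str) -> bytes:
    """Encode one character into 1, 2 or 3 bytes by adding the headers."""
    code = ord(ch)
    if code < 0x80:
        return bytes([code])                       # 0xxxxxxx
    if code < 0x800:
        return bytes([0b11000000 | (code >> 6),    # 110xxxxx
                      0b10000000 | (code & 0b111111)])
    if code < 0x10000:
        return bytes([0b11100000 | (code >> 12),   # 1110xxxx
                      0b10000000 | ((code >> 6) & 0b111111),
                      0b10000000 | (code & 0b111111)])
    raise ValueError("4-byte characters are left out of this sketch")


def strip_headers(encoded: bytes) -> int:
    """Remove the header bits of one encoded character, return its number."""
    n = len(encoded)
    if n == 1:
        return encoded[0]
    # First byte: n ones and a zero are header, the low bits are data.
    value = encoded[0] & (0xFF >> (n + 1))
    for byte in encoded[1:]:
        value = (value << 6) | (byte & 0b00111111)  # drop the leading 10
    return value


print(" ".join(f"{b:08b}" for b in encode_char("é")))  # 11000011 10101001
print(" ".join(f"{b:08b}" for b in encode_char("€")))  # 11100010 10000010 10101100
print(encode_char("€") == "€".encode("utf-8"))         # True
print(chr(strip_headers(encode_char("€"))))            # €
```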
If you wanna go back 1 character? Just go back byte by byte until you find one that starts with something other than 10.
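Stepping backwards can be sketched in a few lines. `previous_char_start` is an invented name, and the sketch assumes the bytes are valid UTF-8.

```python
def previous_char_start(data: bytes, index: int) -> int:
    """Index where the character just before position `index` begins."""
    if index <= 0:
        raise ValueError("already at the start")
    index -= 1
    # Skip back over middle bytes, which all look like 10xxxxxx.
    while index > 0 and data[index] & 0b11000000 == 0b10000000:
        index -= 1
    return index


data = "a€b".encode("utf-8")         # bytes: 61 e2 82 ac 62
print(previous_char_start(data, 4))  # 1 (from "b" back to the start of "€")
print(previous_char_start(data, 1))  # 0 (from "€" back to "a")
```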
no excess Nulls will appear because the only way to get 8 zeros