this is a series of very common bugs in UTF-16 supplementary character handling
UTF-16 (Unicode Transformation Format, 16-Bit) is an encoding of the Unicode character set in which characters in the Basic Multilingual Plane are represented as 16-bit units and characters in the Supplementary Planes are represented as sequences of two 16-bit units ('surrogates') drawn from a reserved space within the Basic Multilingual Plane: a High Surrogate in the range D800–DBFF followed by a Low Surrogate in the range DC00–DFFF. this works because the surrogates are represented as codepoints that are not themselves characters, and because unpaired surrogates are invalid UTF-16
that's a lot of computer words. what's a codepoint? what's a plane?
the purpose of unicode, to simplify greatly, is to map numbers from 0 to 1,114,111 to things like chinese characters, emoji, and the alphabet. this is because computers know what numbers are[1], but don't know what the alphabet is unless you tell them. there are many ways to tell computers what the alphabet is, but if computers that are trying to talk to each other don't have the same idea of what the alphabet is, problems arise. so there's a standard, which is unicode
in unicode jargon, a number from 0 to 1,114,111 is a 'codepoint': picture unicode as a giant spreadsheet, where each number points at one cell and the character is whatever has been assigned to that cell. (it's trivial to map between numbers and cells in a spreadsheet: if the spreadsheet has N columns, the numbers 0 to N-1 represent the first row, the numbers N to N*2-1 represent the second row, etc.) codepoints are typically written in hexadecimal; if the hexadecimal representation has fewer than four digits, it's padded with zeroes to four. they're typically (not always) prefixed with U+. for example, the letter 'a' is represented in unicode as the codepoint U+0061, which is named (codepoints have names) LATIN SMALL LETTER A, and the shrimp emoji '🦐' is U+1F990 SHRIMP.
the reason we write U+0061 instead of U+61 is that the codepoints are divided into 17 'planes' of 65,536 codepoints each, and the first plane, the Basic Multilingual Plane (BMP), which contains most characters in common use to write most living languages, starts at U+0000 and ends at U+FFFF. 'a', which is U+0061 with four digits, is in the BMP; '🦐', which is U+1F990 with five, is not. if a codepoint has five or six digits, it's in one of the Supplementary Planes, and is referred to as a 'supplementary character'
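to make the notation concrete, here's a minimal sketch in javascript. `toUPlus` and `planeOf` are names made up for illustration, not anything built in, and the sketch assumes each plane holds 65,536 (hex 10000) codepoints, which is why the BMP ends at U+FFFF.

```javascript
// format a codepoint the way unicode people write it: U+ followed by
// at least four uppercase hex digits (hypothetical helper)
function toUPlus(codePoint) {
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

// each plane holds 0x10000 codepoints, so the plane number is the
// codepoint divided by 0x10000, rounded down. plane 0 is the BMP
// (hypothetical helper)
function planeOf(codePoint) {
  return Math.floor(codePoint / 0x10000);
}

for (const ch of ["a", "🦐"]) {
  const cp = ch.codePointAt(0); // the codepoint starting at index 0
  console.log(ch, toUPlus(cp), "plane", planeOf(cp));
}
// a U+0061 plane 0
// 🦐 U+1F990 plane 1
```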
[1] in fact, electronic computers were invented in order to handle numerical calculations faster than mechanical calculators were capable of. by 'numerical calculations' i mean things like computing range tables for the united states army and solving aerodynamic modeling equations for the wehrmacht. do not be confused about this: computers are primarily tools of war
okay, but where did the question mark in the box come from?
remember surrogates? reserved codepoints in the BMP, from U+D800 to U+DFFF, that aren't characters themselves? those exist because UTF-16 is a 16-bit encoding - it only uses numbers from 0 to 65,535 - but the biggest number unicode uses is 1,114,111. so you need ways to represent numbers from 65,536 to 1,114,111 using only numbers from 0 to 65,535, and the way UTF-16 does that is with surrogates: numbers that, in a sequence of numbers that is valid according to the rules of UTF-16, will only appear in certain places. if you have a high surrogate that isn't followed by a low surrogate or a low surrogate that isn't preceded by a high surrogate, the thing that you have is called an unpaired surrogate, which is invalid UTF-16. most font rendering systems will represent unpaired surrogates visually with the glyph (font picture) at U+FFFD � REPLACEMENT CHARACTER.
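the arithmetic behind a surrogate pair is short enough to sketch. this is the standard UTF-16 calculation, but `toSurrogatePair` and `fromSurrogatePair` are hypothetical names for illustration (javascript already does this for you inside strings): subtract 65,536, then put the top ten bits in the high surrogate and the bottom ten bits in the low one.

```javascript
// split a supplementary codepoint (0x10000..0x10FFFF) into a surrogate pair
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("not a supplementary codepoint");
  }
  const offset = codePoint - 0x10000;    // now fits in 20 bits
  const high = 0xd800 + (offset >> 10);  // top 10 bits -> D800..DBFF
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits -> DC00..DFFF
  return [high, low];
}

// the reverse: glue a high and a low surrogate back into one codepoint
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xd800) << 10) + (low - 0xdc00);
}

const hex = (n) => n.toString(16).toUpperCase();
const [hi, lo] = toSurrogatePair(0x1f990);   // the shrimp emoji
console.log(hex(hi), hex(lo));               // D83E DD90
console.log(hex(fromSurrogatePair(hi, lo))); // 1F990
```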
the question mark in the box came from tumblr truncating the tag in the middle of a surrogate pair
it's in a supplementary plane?
yes! the character 𐐮 is U+1042E DESERET SMALL LETTER SHORT I, which is represented in UTF-16 as the sequence of 16-bit numbers D801 DC2E. D801 is between D800 and DBFF, so it's a high surrogate; DC2E is between DC00 and DFFF, so it's a low surrogate. as long as they aren't separated, this is fine. but if you, say, truncate a tag by cutting off everything after the hundredth 16-bit number, you might cut off a low surrogate and leave a dangling high surrogate behind, which is what happened here
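here's a small sketch of the failure and one possible way around it. nobody outside tumblr knows what their code actually looks like, so `truncateCodeUnits` is a hypothetical helper, and backing up by one unit is just one reasonable policy, not the only one.

```javascript
// hypothetical helper: keep at most maxUnits 16-bit numbers, but never
// end on a high surrogate whose low surrogate got cut off
function truncateCodeUnits(str, maxUnits) {
  if (str.length <= maxUnits) return str;
  let end = maxUnits;
  const last = str.charCodeAt(end - 1); // last unit we would keep
  const next = str.charCodeAt(end);     // first unit we would drop
  const lastIsHigh = last >= 0xd800 && last <= 0xdbff;
  const nextIsLow = next >= 0xdc00 && next <= 0xdfff;
  // if the cut lands between the two halves of a pair, back up one unit
  if (lastIsHigh && nextIsLow) end -= 1;
  return str.slice(0, end);
}

const tag = "ab𐐮";              // four 16-bit numbers: a, b, D801, DC2E
const naive = tag.slice(0, 3);  // cuts between D801 and DC2E
console.log(naive.charCodeAt(2).toString(16)); // d801, a dangling high surrogate
console.log(truncateCodeUnits(tag, 3));        // ab
console.log(truncateCodeUnits(tag, 4));        // ab𐐮
```

the safe version sometimes returns one unit fewer than you asked for, which is the price of not leaving half a character behind.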
it's a very easy mistake to make. in general, people do not think about unicode, and for historical reasons many programming languages use UTF-16 under the hood. somebody probably wrote something like `str.slice(0, 100)`, assumed this was a reasonable thing to write, and didn't realize it could split down the middle of a surrogate pair, in a language that doesn't warn you about things like splitting down the middle of a surrogate pair. 'how long is this string (sequence of characters in a computer)?' isn't even a well-defined question
it's not? can't you just count the letters?
no! "letter" isn't even well-defined - sometimes you can represent the exact same letter as either one codepoint or two
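for example, say a1 is 'ä' written as the single codepoint U+00E4 LATIN SMALL LETTER A WITH DIAERESIS, a single character representing an a with an umlaut[1]; and a2 is a normal latin letter a (U+0061) followed by U+0308 COMBINING DIAERESIS, a combining umlaut. they look identical on screen, but they're different sequences of numbers. stacking combining marks like this is also how zalgo text works.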
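in javascript this looks roughly like the following. the names a1 and a2 come from the text above; the specific codepoints (U+00E4, and U+0061 followed by U+0308) are an assumption about what they stand for. `String.prototype.normalize` is the built-in that converts between the two spellings.

```javascript
const a1 = "\u00E4";  // one codepoint: a with the diaeresis built in
const a2 = "a\u0308"; // two codepoints: plain a, then a combining diaeresis

console.log(a1 === a2);                                   // false
console.log(a1.normalize("NFC") === a2.normalize("NFC")); // true
console.log(a1.normalize("NFD") === a2.normalize("NFD")); // true
console.log(a1.length, a2.length);                        // 1 2
```

NFC squashes a2 into the one-codepoint form and NFD expands a1 into the two-codepoint form, so comparing strings only gives the answer a human expects if both sides have been normalized the same way first.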
so what do you mean by 'letter'? what are you counting? the codepoints? the normalized codepoints, where normalization means replacing things like a2 with things like a1 or vice versa? what javascript's `.length` does, and what tumblr most likely does, is count the number of UTF-16 'code units' - that is, the number of numbers between 0 and 65,535. this is usually not what you want: a character in a supplementary plane takes up two 'code units'. so if you're writing a tumblr tag, you can use at most 140 latin letters per tag, but only at most 70 deseret letters. (twitter also has this problem. most applications written in javascript do.) and if tumblr tries to truncate your tag at the hundredth character, well
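the different things you might be counting can be sketched side by side. `lengths` is a made-up helper for illustration; the three counts are the ones named above (code units, codepoints, normalized codepoints), and none of them is guaranteed to match what a reader would call a letter.

```javascript
// three different answers to "how long is this string?"
// (hypothetical helper, for illustration only)
function lengths(str) {
  return {
    codeUnits: str.length,                           // what .length counts
    codePoints: [...str].length,                     // spreading a string walks codepoints
    nfcCodePoints: [...str.normalize("NFC")].length, // codepoints after NFC normalization
  };
}

console.log(lengths("a"));       // { codeUnits: 1, codePoints: 1, nfcCodePoints: 1 }
console.log(lengths("𐐮"));      // { codeUnits: 2, codePoints: 1, nfcCodePoints: 1 }
console.log(lengths("a\u0308")); // { codeUnits: 2, codePoints: 2, nfcCodePoints: 1 }
```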
[1] technically a diaeresis, a separate diacritic which was unified with the umlaut as part of the process of latin unification, analogous to CJK unification but far earlier. the germanic and latin alphabets used to be thought of as two separate writing systems: if you were writing a book in (without loss of generality) english, and you wanted to quote a sentence in (without loss of generality) spanish, you'd stop using the germanic alphabet and start using the latin one. unification basically didn't happen in india, which is why they have so many writing systems that work in almost but not quite the same way and look different; it would be more accurate than not to think of the script situation in europe until recently as analogous to the script situation in india today
is everything in the world this complicated?