# How does UTF-8 turn "😂" into "F09F9882"?

If you're anything like me, you love emojis! Emojis appear like an image on the screen, but they aren't an image like a PNG or JPEG. **What do emojis look like to computers?**
NOTE: I use the term "octet" in this article, which means "a grouping of 8 bits". Today's computers consider 8 bits to be 1 byte, but there used to be systems with a different number of bits per "byte", hence the distinction. For our purposes an octet and a byte are the same thing.
Every octet that can be produced using UTF-8 falls into one of five types. The octet is either a "header" octet specifying a length of 1, 2, 3, or 4 octets, or a "tail" octet which only holds data. You can determine the type of each individual octet by examining its high-order bits (in this representation these bits are left-most in a block and colored green).

Each octet also has "empty" spaces for bits (visualized as `X` in blue) which we'll eventually fill with data. You can also see the 4 unique layouts that a UTF-8 encoded codepoint can use to store different amounts of data (between 7 and 21 bits).

The five prefixes are `0`, `10`, `110`, `1110`, and `11110`. If I start reading an octet from left to right and encounter the bits `1`, `1`, and `0`, I immediately know without reading further that this octet is a "2 octet header" and can't possibly be any other octet type.
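This prefix rule can be sketched as a small classifier. The function name `octet_type` is my own for illustration, not something from a library:

```python
def octet_type(octet: int) -> str:
    """Classify a UTF-8 octet by its high-order prefix bits."""
    if octet >> 7 == 0b0:        # 0xxxxxxx
        return "1 octet header"
    if octet >> 6 == 0b10:       # 10xxxxxx
        return "tail"
    if octet >> 5 == 0b110:      # 110xxxxx
        return "2 octet header"
    if octet >> 4 == 0b1110:     # 1110xxxx
        return "3 octet header"
    if octet >> 3 == 0b11110:    # 11110xxx
        return "4 octet header"
    raise ValueError("not a valid UTF-8 octet")

print(octet_type(0xF0))  # 4 octet header
print(octet_type(0x9F))  # tail
```

Note that each check shifts away everything except the prefix, so a `0x9F` octet (`10011111`) is recognized as a tail before the longer prefixes are ever considered.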
Using Python we're able to see that our target output is `\xf0\x9f\x98\x82`:

```python
>>> emoji = "😂"
>>> emoji.encode("utf-8")
b'\xf0\x9f\x98\x82'
```
Encoding a Unicode codepoint into bytes is a multi-step process. The first step is determining the number of octets required to encode the codepoint. The codepoint value for "Face with Tears of Joy" (😂) is `0x1F602`. From the previous diagram the value `0x1F602` falls in the range for a 4 octet header (between `0x10000` and `0x10FFFF`).
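The range check above can be written out directly. This helper (`octets_required` is my own name for it) mirrors the four layouts in the diagram:

```python
def octets_required(codepoint: int) -> int:
    """Return how many octets UTF-8 needs for a given codepoint."""
    if codepoint <= 0x7F:        # fits in the 7-bit layout
        return 1
    if codepoint <= 0x7FF:      # 11-bit layout
        return 2
    if codepoint <= 0xFFFF:     # 16-bit layout
        return 3
    if codepoint <= 0x10FFFF:   # 21-bit layout
        return 4
    raise ValueError("beyond the Unicode codepoint range")

print(octets_required(0x1F602))  # 4
```

As a sanity check, the result agrees with the length of Python's own encoding: `len(chr(0x1F602).encode("utf-8"))` is also 4.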
The next step is converting the codepoint value `0x1F602` into binary. In Python you can do `f"{0x1F602:b}"` which will return `'11111011000000010'` as a string. This value is padded with zeroes until there are 21 bits to fit the layout for 4 octets. This padded value can be seen at the top of the "UTF-8 encoding" section in the diagram as `000 011111 011000 000010`.
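Putting the steps together, we can pad the bits, slice them into the 4 octet layout (3 data bits for the header, 6 per tail), and OR in the prefixes. This is a sketch of the process described above, not production code:

```python
# Pad the 17 significant bits of 0x1F602 out to 21 bits.
bits = f"{0x1F602:021b}"           # '000011111011000000010'

# First 3 bits go under the 11110 header prefix,
# each following group of 6 goes under a 10 tail prefix.
header = 0b11110_000 | int(bits[:3], 2)
tails = [0b10_000000 | int(bits[i:i + 6], 2) for i in (3, 9, 15)]

encoded = bytes([header] + tails)
print(encoded)  # b'\xf0\x9f\x98\x82'
```

The result matches both the target output from earlier and what `"😂".encode("utf-8")` produces.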
Back when UTF-8 was first introduced there were many systems that didn't understand any character encoding beyond "US-ASCII". This meant whenever data encoded with another Unicode encoding was used then that system would produce garbage, even if the characters were within the US-ASCII range.
Because UTF-8 encodes the range `0x00`-`0x7F` as single octets prefixed with `0`, all characters within the US-ASCII range are encoded exactly as they would have been had they been explicitly encoded using US-ASCII instead of UTF-8.
This was a big win for compatibility as it meant many systems could start using UTF-8 as an encoding immediately. As long as input data wasn't outside of the US-ASCII range the encoded bytes would not change which allowed for incremental adoption within a group of systems instead of having to switch "all at once" to a new encoding.
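This compatibility is easy to verify yourself (a quick check of my own, not from the article):

```python
# Any text restricted to the US-ASCII range produces
# byte-for-byte identical output under both encodings.
text = "Hello, UTF-8!"
assert text.encode("ascii") == text.encode("utf-8")
print(text.encode("utf-8"))  # b'Hello, UTF-8!'
```

A system expecting US-ASCII bytes would read this UTF-8 output without noticing any difference, which is exactly why incremental adoption worked.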