3. Text inside the computer

3.1. Why should I care?

Even though as linguists, we may think we know everything about text, definitely more than programmers, thank you very much – this ain’t Kansas anymore as they say, this is Computerland, and the rules for how text works are strict and sometimes surprising. If we’re going to take advantage of computers to supercharge our linguistic analyses, we need to learn to play by their rules, otherwise we stand a big chance of shooting ourselves in the foot.

On occasion, this stuff might seem ridiculously low-level to a linguist – you’ll probably get that feeling more than once while perusing this chapter. But if you get it wrong, anything that you build on top will crumble like a house of cards and your linguistic analyses will be likely to yield garbage. And just to be extra clear, every single day, there are professional programmers who get some of this wrong. When was the last time an e-shop maltreated the diacritics in your name, for instance?

3.2. Binary and hexadeci-what-now?

Before we get started, humor me as I engage in a short digression on representing numbers, to make sure we’re all on the same page. Why numbers, I hear you ask, I thought this chapter was supposed to be about text? Well, because in a computer, it’s numbers all the way down: even text is ultimately represented as numbers. So we need to be relatively comfortable with the different ways numbers are commonly written down in the context of computers:

as a plain old decimal number, using 10 digits 0–9 (e.g. 12)
as a binary number, using only 2 digits, 0 and 1 (e.g. 1100, which equals decimal 12)
as a hexadecimal number, using 16 digits: 0–9 and a–f (e.g. c, which also equals decimal 12)

All of these numeral systems can represent any number: the binary system doesn’t stop counting at 1, any more than the decimal system stops counting at 9. Beyond these respective points, we just add another digit and carry on. The only difference is that the fewer distinct digits a system uses, the more digits it takes to represent a given number.

If you’re curious how that works, just take a moment to reflect on what you do implicitly each time you read a regular decimal number: you multiply the individual digits by increasing powers of 10, which is the base of the decimal system, starting from the right, and add this all up:

e.g. for (regular decimal) 12: \(2 \times 10^0 + 1 \times 10^1 = 2 \times 1 + 1 \times 10 = 12\)
for binary 1100, this goes: \(0 \times 2^0 + 0 \times 2^1 + 1 \times 2^2 + 1 \times 2^3 = 0 + 0 + 4 + 8 = 12\)
and for hexadecimal c: \(c \times 16^0 = c \times 1 = 12 \times 1 = 12\).
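The same sum can be spelled out in Python. This is only an illustrative sketch: `positional_value` is a made-up helper name, not something from the standard library, and it assumes the digits are drawn from 0–9 and a–f.

```python
def positional_value(digits, base):
    """Evaluate a string of digits in the given base, right to left."""
    alphabet = "0123456789abcdef"
    total = 0
    # enumerate the digits starting from the rightmost one, so that
    # position 0 gets multiplied by base**0, position 1 by base**1, etc.
    for position, digit in enumerate(reversed(digits.lower())):
        total += alphabet.index(digit) * base**position
    return total

print(positional_value("12", 10))
print(positional_value("1100", 2))
print(positional_value("c", 16))
```

All three calls should print 12, mirroring the three sums above.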

Decimal numbers are useful because everybody knows them, which makes them universally easy to read. Binary numbers are useful because they closely map to the underlying hardware: computer memory consists of bits, tiny slots each of which can hold a 0 or a 1. A group of eight adjacent bits is called a byte.

As a consequence, the need often arises to convert between these different systems. If you don’t want to do so by hand, Python has your back! The bin() function gives you a string representation of the binary form of a number:

bin(65)

'0b1000001'

As you can see, in Python, binary numbers are given a 0b prefix to distinguish them from regular (decimal) numbers. The number itself is what follows after (i.e. 1000001).

Similarly, hexadecimal numbers are given a 0x prefix:

hex(65)

'0x41'

To convert in the opposite direction, i.e. to decimal, just evaluate the binary or hexadecimal representation of a number:

0b1000001

65

0x41

65

Number literals using these different bases are mutually compatible, e.g. for comparison purposes:

0b1000001 == 0x41 == 65

True
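Literals only help when you are typing the number into your source code. When the digits arrive as a string instead (say, read from a file), the built-in `int()` accepts the base as a second argument, and `format()` goes the other way without adding a prefix. A small sketch, with variable names chosen just for illustration:

```python
# parse strings of digits in a given base
from_binary = int("1000001", 2)
from_hex = int("41", 16)
print(from_binary, from_hex)

# format() with "b" or "x" yields the digits without the 0b/0x prefix
as_binary = format(65, "b")
as_hex = format(65, "x")
print(as_binary, as_hex)
```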

Why are hexadecimals useful? They’re primarily a more convenient, condensed way of representing sequences of bits: each hexadecimal digit can represent 16 different values, and therefore it can stand in for a sequence of 4 bits, i.e. half a byte (\(2^4 = 16\)).

0xa == 0b1010

True

0xb == 0b1011

True

# if we paste together hexadecimal a and b, it's the same as pasting
# together binary 1010 and 1011
0xab == 0b10101011

True

In other words, instead of binary 10101011, we can just write hexadecimal ab and save ourselves some space. Of course, this only works if shorter binary numbers are padded to a 4-bit width:

0x2 == 0b10

True

0x3 == 0b11

True

# if we paste together hexadecimal 2 and 3, we have to paste together
# binary 0010 and 0011...
0x23 == 0b00100011

True

# ... not just 10 and 11
0x23 == 0b1011

False

The padding has no effect on the value, much like decimal 42 and 00000042 are effectively the same numbers.
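To see the correspondence between hexadecimal digits and 4-bit groups directly, we can ask Python to do the padding for us. Treat this as a sketch: `nibbles` is a hypothetical helper name, and the `"04b"` format directive means "binary, left-padded with 0's to a width of 4".

```python
def nibbles(hex_digits):
    """Return the 4-bit binary group for each hexadecimal digit."""
    return [format(int(digit, 16), "04b") for digit in hex_digits]

print(nibbles("ab"))
print(nibbles("23"))
# pasting the padded groups together gives the binary form of the
# whole number
print(int("".join(nibbles("23")), 2) == 0x23)
```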

3.3. Representing text: a DIY approach

With that out of the way, the best way to understand how computers represent text, and why it works like it does, is to try and come up with our own system to do that. First of all, we’ll need a table of all the characters we want to support, mapping each one of them, you guessed it, to a unique number, since computers only work with numbers. For this toy example, let’s say four characters are enough:

number	character
1	a
2	b
3	c
4	d

Such a table is called a character set, and each row is a codepoint, which basically means a number/character pair. Let’s represent our character set as a Python dictionary; since we’ll mostly be concerned with going from characters to numbers, we’ll use the characters as keys.

charset = dict(a=1, b=2, c=3, d=4)
charset

{'a': 1, 'b': 2, 'c': 3, 'd': 4}

Now, we’ve just established that computers typically know only about two digits, 0 and 1, but our character set goes up to 4, so we’re going to have to come up with a way to encode these numbers into series of 0’s and 1’s so that we can store them in computer memory. We can think of an encoding as a function which decides how to turn a number into a sequence of bits. A very simple encoding would be to turn each number into the corresponding number of 1’s and store those in the memory.

def encoding1(char):
    num = charset[char]
    return "1" * num

Let’s see how our new encoding1 encodes each character in charset.

for char in charset.keys():
    print(char, "->", encoding1(char))

a -> 1
b -> 11
c -> 111
d -> 1111

Looks fine so far, all of the characters map to a different sequence of bits! Now for something more challenging: let’s see how it handles encoding strings which consist of multiple characters.

[encoding1(char) for char in "ac"]

['1', '111']

This seems fine as well, except we have to remember that computer memory is just a long line of contiguous slots, with no boundaries between them. Even when we conceptually split it into bytes, the byte boundaries every eight bits are just imagined, it’s not like there’s some kind of fence in the memory after every eight slots. So we have to join this list of strings in order to get a more accurate idea of what our encoding would actually look like when stored in memory.

"".join(encoding1(char) for char in "ac")

'1111'

Uh-oh. Can you spot the problem? Our encoding works perfectly well one way, to encode characters into bits, but in the other direction, it breaks down, because we don’t really have a way to reconstruct the boundaries between individual characters. Imagine you’re a program whose task it is to show characters on screen based on that chunk of memory. You see those four 1’s. What characters should you display? Well, you could go for aaaa. Or for ac. Or d. Or… You get the gist.

How can we fix this? We’ll definitely have to put to use that other binary digit we have at our disposition, 0, because we just saw our problem is that encoding1 yields an undifferentiated, uninterrupted string of 1’s. One way we could do this is by using the 0 as a character terminator. This encoding2 is basically the same as encoding1, just with a 0 tacked at the end each time.

def encoding2(char):
    return encoding1(char) + "0"

for char in charset.keys():
    print(char, "->", encoding2(char))

"".join(encoding2(char) for char in "ac")

a -> 10
b -> 110
c -> 1110
d -> 11110

'101110'

That’s better! Now we know that each time we encounter a 0, we’ve reached the end of a character, and we can determine which character it was by counting the number of preceding 1’s. So 101110 can only correspond to ac. encoding2 is the first encoding we’ve written worthy of the name because it can go both ways – at least in theory, the function we’ve written obviously works only for encoding; for decoding series of bits into characters, we’d have to write the inverse function. It’s a variable-width encoding, because different characters are encoded using different numbers of bits – a is encoded as 10 for a width of 2, whereas c is 4 bits wide, 1110.
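For the curious, here is one way the inverse function might look. It is a minimal sketch rather than anything definitive: `decoding2` and `number_to_char` are names invented for this example, the character set is repeated so that the snippet runs on its own, and there is no handling of malformed input.

```python
charset = dict(a=1, b=2, c=3, d=4)
# invert the character set so that we can look up characters by number
number_to_char = {num: char for char, num in charset.items()}

def decoding2(bits):
    """Decode a string of bits produced by encoding2 (hypothetical helper)."""
    chars = []
    ones = 0
    for bit in bits:
        if bit == "1":
            ones += 1
        else:
            # a 0 terminates the character: the number of 1's seen
            # since the previous 0 is the codepoint number
            chars.append(number_to_char[ones])
            ones = 0
    return "".join(chars)

print(decoding2("101110"))
```

Running it on 101110 should give back ac.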

The technical term for this kind of encoding is a prefix code. The name comes from the fact that you have to make sure that no character’s encoding can be mistaken for the prefix of another character’s (longer) encoding, which ensures that decoding is possible. By taking into account expected frequencies of the individual characters, there are ways to automatically come up with the optimal coding scheme, i.e. that which yields the shortest encoded messages on average. See Huffman coding, but this blog post or this video might be more digestible than the Wikipedia article.

Unfortunately, encoding2 is not a very good encoding: the number of bits per character quickly gets out of hand. If we had a character set consisting of all 26 letters of the English alphabet, numbered 1 to 26, the last few would take up more than 20 bits per character (z would need 27), and that’s just the lowercase letters.
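A quick back-of-the-envelope check of how fast the widths grow. The sketch assumes the letters are numbered 1 to 26 in alphabetical order, as a straightforward extension of the toy character set; `alphabet_charset` and `encoding2_width` are names made up for this example.

```python
import string

# assumption: letters are numbered 1 to 26 in alphabetical order
alphabet_charset = {
    char: num for num, char in enumerate(string.ascii_lowercase, start=1)
}

def encoding2_width(char):
    # encoding2 uses as many 1's as the codepoint number, plus a final 0
    return alphabet_charset[char] + 1

for char in "atz":
    print(char, "->", encoding2_width(char), "bits")

# total cost of a short word, compared with 5 bits per letter, which
# is enough to tell 26 letters apart (2**5 == 32)
word = "zebra"
print(sum(encoding2_width(char) for char in word), "vs.", 5 * len(word))
```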

The trouble is that we’ve tied 1 and 0 down to a single role: 1’s determine the codepoint number, and 0’s tell us when to stop adding those 1’s. If we could use both to encode the number, say in its usual binary form, then we could squeeze a lot more information into the same space. Something like this:

def encoding3(char):
    num = charset[char]
    binary = bin(num)
    # remove the 0b prefix from the representation
    without_prefix = binary[2:]
    return without_prefix

for char in charset.keys():
    print(char, "->", encoding3(char))

"".join(encoding3(char) for char in "ac")

a -> 1
b -> 10
c -> 11
d -> 100

'111'

Except encoding3 has the same drawback as encoding1: it’s not a proper encoding, because it doesn’t work both ways, the decoding back is ambiguous. 111 could stand for any of aaa, ac or ca. So how do we free the hands of the 0 to use its full potential as a digit? Can we use something else to mark character boundaries? But we’ve established there is nothing else than 1’s and 0’s in computer memory!
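If you want to convince yourself that these really are all the possibilities, the following sketch enumerates every string whose encoding3 output matches a given sequence of bits. The helper name `all_decodings` is invented for this example, and the brute-force recursion is only meant for tiny inputs like this one.

```python
charset = dict(a=1, b=2, c=3, d=4)
# the bit strings encoding3 assigns to each character (no 0b prefix)
codes = {char: bin(num)[2:] for char, num in charset.items()}

def all_decodings(bits):
    """Return every string whose encoding3 output equals bits."""
    if bits == "":
        return [""]
    results = []
    for char, code in codes.items():
        if bits.startswith(code):
            # consume this character's code and decode the remainder
            for rest in all_decodings(bits[len(code):]):
                results.append(char + rest)
    return results

print(all_decodings("111"))
```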

… unless we use a trick. What if we said that each character always has to fit into the same number of bits, e.g. 1 byte? Then we could read memory by chunks of 8 bits and just interpret each chunk as a character. It would work exactly as if there were actual boundaries delimiting characters in the memory, spaced evenly 8 bits apart, even though we know that in actual fact, there are no boundaries between the individual tiny memory slots. This approach is called a fixed-width encoding, in contrast to the variable-width approach we encountered earlier.

For our toy character set, we don’t need 8 bits, 2 are enough to encode 4 different characters, corresponding to the 4 different sequences of 1’s and 0’s you can create with two available slots: 00, 01, 10 and 11. Let’s see what decimal numbers these correspond to – it’s easy, you can probably do it in your head, but let Python tell us anyway.

0b00, 0b01, 0b10, 0b11

(0, 1, 2, 3)

Neat, we can easily get those numbers by just shifting the number values in our character set table by 1 (i.e. subtracting 1). So an encoding4 function could look something like this:

def encoding4(char):
    num = charset[char] - 1
    binary = bin(num)
    without_prefix = binary[2:]
    padded = without_prefix.rjust(2, "0")
    return padded

The .rjust() (and .ljust()) methods pad a string to a desired width with a provided padding character; in our case, we pad with 0’s to a minimum width of 2, and a string that is already 2 characters wide (or wider) is left unchanged. Perhaps confusingly, .rjust() pads on the left and vice versa; this is because padding on the left justifies the text along the right margin (hence .rjust()). Let’s take this baby out for a spin.

for char in charset.keys():
    print(char, "->", encoding4(char))

"".join(encoding4(char) for char in "ac")

a -> 00
b -> 01
c -> 10
d -> 11

'0010'

Looks alright to me! When decoding, we just chop up the memory into chunks two bits wide and interpret each as a separate character, so 00 yields a and 10 yields c. Piece of cake. This is where we leave our toy example character set and encodings as they have no more to teach us, and pick up the thread of the story in the real world.
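The chopping-up step can be sketched in a few lines. As before, `decoding4` and `number_to_char` are hypothetical names, the character set is repeated so that the snippet is self-contained, and the input is assumed to be well-formed (an even number of bits).

```python
charset = dict(a=1, b=2, c=3, d=4)
number_to_char = {num: char for char, num in charset.items()}

def decoding4(bits, width=2):
    """Decode a string of bits produced by encoding4 (hypothetical helper)."""
    chars = []
    # step through the bits in chunks of `width`
    for start in range(0, len(bits), width):
        chunk = bits[start:start + width]
        # undo the shift by 1 that encoding4 applied
        num = int(chunk, 2) + 1
        chars.append(number_to_char[num])
    return "".join(chars)

print(decoding4("0010"))
```

Note that, unlike with a variable-width scheme, we could also jump straight to the n-th character by slicing at `n * width`, without reading anything that comes before it.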

3.4. Fixed-width encodings: ASCII et al.

Obviously, how many different characters your encoding can handle depends on how many bits you allow per character:

with 1 bit you can have \(2^1 = 2\) characters (one is mapped to 0, the other to 1)
with 2 bits you can have \(2^2 = 2 \times 2 = 4\) characters (mapped to 00, 01, 10 and 11)
with 3 bits you can have \(2^3 = 2 \times 2 \times 2 = 8\) characters
etc.
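Going in the opposite direction, we can ask how many bits a fixed-width encoding needs for a given number of characters. The function below is one possible sketch; `bits_needed` is a name made up for this example, and it relies on the built-in `int.bit_length()` method.

```python
def bits_needed(n_characters):
    """Smallest fixed width (in bits) that can tell n_characters apart."""
    # the largest number we need to store is n_characters - 1, because
    # we start counting from 0; int.bit_length() tells us how many
    # binary digits that number takes
    return max(1, (n_characters - 1).bit_length())

for n in (2, 4, 26, 128, 256, 257):
    print(n, "characters ->", bits_needed(n), "bits")
```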

The oldest encoding still in widespread use is called ASCII, which is a 7-bit encoding. What’s the number of different sequences of seven 1’s and 0’s?

# this is how Python spells 2⁷, i.e. 2*2*2*2*2*2*2
2**7

128

This means ASCII can represent 128 different characters, which comfortably fits the basic Latin alphabet (both lowercase and uppercase), Arabic numerals, punctuation and some “control characters” which were primarily useful on the old teletype terminals for which ASCII was designed. For instance, the letter “A” corresponds to the number 65 (1000001 in binary, see above).

Nowadays, ASCII is represented using 8 bits (= 1 byte), because that’s the unit of computer memory which has become ubiquitous (in terms of both hardware and software assumptions), but still uses only 7 bits’ worth of information. That extra bit means that there’s room for another 128 characters in addition to the 128 ASCII ones, coming up to a total of 256.

2**(7+1)

256

What happens in the range [128; 256) is not covered by the ASCII standard. In the 1980s and 1990s, many encodings were standardized which used this range for their own purposes, usually representing additional accented characters used in a particular region. E.g. Czech (and Slovak, Polish…) alphabets can be represented using the ISO Latin-2 encoding (ISO 8859-2), or Microsoft’s CP-1250 (Windows-1250). Encodings which stick to the same character mappings as ASCII in the range [0; 128) and represent them physically in the same way (as 1 byte), while potentially adding more character mappings beyond that, are called ASCII-compatible.

ASCII compatibility is a good thing™, because when you start reading a character stream in a computer, there’s no way to know in advance what encoding it is in (unless it’s a file you’ve encoded yourself and you happen to remember). So in practice, a heuristic has been established to start reading the stream assuming it’s ASCII by default, and switch to a different encoding if evidence becomes available to the contrary. For instance, HTML files describing web pages displayed in your browser should all start with something like this:

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8"/>
  ...

This way, whenever a program wants to read a file like this, it can start off with ASCII, waiting to see if it reaches the charset (i.e. encoding) attribute, and once it does, it can switch from ASCII to that encoding (UTF-8 here) and restart reading the file, now fairly sure that it’s using the correct encoding. This trick works only if we can assume that whatever encoding the rest of the file is in, the first few lines can be considered as ASCII for all practical intents and purposes.

Without the charset attribute, the only way to know if the encoding is right would be for you to look at the rendered text and see if it makes sense; if it did not, you’d have to resort to trial and error, manually switching the encodings and looking for the one in which the numbers behind the characters stop coming out as gibberish and are actually translated into intelligible text.
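Here is a rough sketch of what that trial and error might look like in Python. The sample phrase and the list of candidate encodings are arbitrary choices made for illustration; a real-world search would have to consider many more candidates.

```python
# bytes as they might be stored in a file whose encoding we don't know;
# here we create them ourselves, so we know the right answer is Latin-2
mystery_bytes = "žluťoučký kůň".encode("latin2")

for candidate in ("ascii", "utf8", "latin1", "cp1250", "latin2"):
    try:
        print(candidate, "->", mystery_bytes.decode(candidate))
    except UnicodeDecodeError as err:
        # some encodings reject the bytes outright...
        print(candidate, "-> failed:", err.reason)
    # ... while others happily decode them into the wrong characters,
    # which only a human reader can spot
```

Notice that an encoding which decodes the bytes without raising an error is not necessarily the right one: it just means that every byte happens to map to some character in that encoding.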

Let’s take a look at printable characters in the Latin-2 character set. The character set consists of mappings between non-negative integers (whole numbers starting from 0) and characters; each such mapping is called a codepoint. The Latin-2 encoding then defines how to encode each of these integers as a series of bits (1’s and 0’s) in the computer’s memory.

latin2_printable_characters = []
# the Latin-2 character set has 256 codepoints, corresponding to
# integers from 0 to 255
for codepoint in range(256):
    # the Latin-2 encoding is simple: each codepoint is encoded
    # as the byte corresponding to that integer in binary
    byte = bytes([codepoint])
    character = byte.decode(encoding="latin2")
    if character.isprintable():
        latin2_printable_characters.append((codepoint, character))

latin2_printable_characters

[(32, ' '),
 (33, '!'),
 (34, '"'),
 (35, '#'),
 (36, '$'),
 (37, '%'),
 (38, '&'),
 (39, "'"),
 (40, '('),
 (41, ')'),
 (42, '*'),
 (43, '+'),
 (44, ','),
 (45, '-'),
 (46, '.'),
 (47, '/'),
 (48, '0'),
 (49, '1'),
 (50, '2'),
 (51, '3'),
 (52, '4'),
 (53, '5'),
 (54, '6'),
 (55, '7'),
 (56, '8'),
 (57, '9'),
 (58, ':'),
 (59, ';'),
 (60, '<'),
 (61, '='),
 (62, '>'),
 (63, '?'),
 (64, '@'),
 (65, 'A'),
 (66, 'B'),
 (67, 'C'),
 (68, 'D'),
 (69, 'E'),
 (70, 'F'),
 (71, 'G'),
 (72, 'H'),
 (73, 'I'),
 (74, 'J'),
 (75, 'K'),
 (76, 'L'),
 (77, 'M'),
 (78, 'N'),
 (79, 'O'),
 (80, 'P'),
 (81, 'Q'),
 (82, 'R'),
 (83, 'S'),
 (84, 'T'),
 (85, 'U'),
 (86, 'V'),
 (87, 'W'),
 (88, 'X'),
 (89, 'Y'),
 (90, 'Z'),
 (91, '['),
 (92, '\\'),
 (93, ']'),
 (94, '^'),
 (95, '_'),
 (96, '`'),
 (97, 'a'),
 (98, 'b'),
 (99, 'c'),
 (100, 'd'),
 (101, 'e'),
 (102, 'f'),
 (103, 'g'),
 (104, 'h'),
 (105, 'i'),
 (106, 'j'),
 (107, 'k'),
 (108, 'l'),
 (109, 'm'),
 (110, 'n'),
 (111, 'o'),
 (112, 'p'),
 (113, 'q'),
 (114, 'r'),
 (115, 's'),
 (116, 't'),
 (117, 'u'),
 (118, 'v'),
 (119, 'w'),
 (120, 'x'),
 (121, 'y'),
 (122, 'z'),
 (123, '{'),
 (124, '|'),
 (125, '}'),
 (126, '~'),
 (161, 'Ą'),
 (162, '˘'),
 (163, 'Ł'),
 (164, '¤'),
 (165, 'Ľ'),
 (166, 'Ś'),
 (167, '§'),
 (168, '¨'),
 (169, 'Š'),
 (170, 'Ş'),
 (171, 'Ť'),
 (172, 'Ź'),
 (174, 'Ž'),
 (175, 'Ż'),
 (176, '°'),
 (177, 'ą'),
 (178, '˛'),
 (179, 'ł'),
 (180, '´'),
 (181, 'ľ'),
 (182, 'ś'),
 (183, 'ˇ'),
 (184, '¸'),
 (185, 'š'),
 (186, 'ş'),
 (187, 'ť'),
 (188, 'ź'),
 (189, '˝'),
 (190, 'ž'),
 (191, 'ż'),
 (192, 'Ŕ'),
 (193, 'Á'),
 (194, 'Â'),
 (195, 'Ă'),
 (196, 'Ä'),
 (197, 'Ĺ'),
 (198, 'Ć'),
 (199, 'Ç'),
 (200, 'Č'),
 (201, 'É'),
 (202, 'Ę'),
 (203, 'Ë'),
 (204, 'Ě'),
 (205, 'Í'),
 (206, 'Î'),
 (207, 'Ď'),
 (208, 'Đ'),
 (209, 'Ń'),
 (210, 'Ň'),
 (211, 'Ó'),
 (212, 'Ô'),
 (213, 'Ő'),
 (214, 'Ö'),
 (215, '×'),
 (216, 'Ř'),
 (217, 'Ů'),
 (218, 'Ú'),
 (219, 'Ű'),
 (220, 'Ü'),
 (221, 'Ý'),
 (222, 'Ţ'),
 (223, 'ß'),
 (224, 'ŕ'),
 (225, 'á'),
 (226, 'â'),
 (227, 'ă'),
 (228, 'ä'),
 (229, 'ĺ'),
 (230, 'ć'),
 (231, 'ç'),
 (232, 'č'),
 (233, 'é'),
 (234, 'ę'),
 (235, 'ë'),
 (236, 'ě'),
 (237, 'í'),
 (238, 'î'),
 (239, 'ď'),
 (240, 'đ'),
 (241, 'ń'),
 (242, 'ň'),
 (243, 'ó'),
 (244, 'ô'),
 (245, 'ő'),
 (246, 'ö'),
 (247, '÷'),
 (248, 'ř'),
 (249, 'ů'),
 (250, 'ú'),
 (251, 'ű'),
 (252, 'ü'),
 (253, 'ý'),
 (254, 'ţ'),
 (255, '˙')]

Using the 8th bit (and thus the codepoint range [128; 256)) solves the problem of handling languages with character sets different from that of American English, but introduces a lot of complexity – whenever you come across a text file with an unknown encoding, it might be in one of literally dozens of encodings. Additional problems remain:

how to handle multilingual text with characters from many different alphabets, which are not part of the same 8-bit encoding?
how to handle writing systems which have way more than 256 “characters”, e.g. Chinese, Japanese and Korean (CJK) ideograms?
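The first of these problems is easy to run into in practice. In the sketch below (the sample string and variable names are chosen just for illustration), Python refuses to encode a mixed Czech and Chinese string as Latin-2, because the Chinese characters have no codepoint in that character set.

```python
text = "Básník 李白"

failed_char = None
try:
    text.encode("latin2")
except UnicodeEncodeError as err:
    # err.object is the string being encoded, err.start the index of
    # the first character that could not be encoded
    failed_char = err.object[err.start]
    print("cannot encode", repr(failed_char), "in Latin-2")

# the Czech part on its own is fine
czech_bytes = text[:6].encode("latin2")
print(czech_bytes)
```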

3.5. Unicode and UTF-8

For these purposes, a standard character set known as Unicode was developed which strives for universal coverage of (ultimately) all characters ever used in the history of writing, even adding new ones like emojis. Unicode is much bigger than the character sets we’ve seen so far – its most frequently used subset, the Basic Multilingual Plane, has \(2^{16}\) codepoints, and the codespace as a whole has room for over 1.1 million codepoints (U+0000 through U+10FFFF), most of which are still unassigned.

2**16

65536

Here’s just a small sample of the treasure trove of codepoints that is Unicode.

from unicodedata import name

print("\N{HORIZONTAL ELLIPSIS}")
for sample in (range(0x16a0, 0x16f1), range(0x1f600, 0x1f645)):
    for cp in sample:
        char = chr(cp)
        print(f"U+{cp:x}\t{char}\t{name(char)}")
    print("\N{HORIZONTAL ELLIPSIS}")

…
U+16a0	ᚠ	RUNIC LETTER FEHU FEOH FE F
U+16a1	ᚡ	RUNIC LETTER V
U+16a2	ᚢ	RUNIC LETTER URUZ UR U
U+16a3	ᚣ	RUNIC LETTER YR
U+16a4	ᚤ	RUNIC LETTER Y
U+16a5	ᚥ	RUNIC LETTER W
U+16a6	ᚦ	RUNIC LETTER THURISAZ THURS THORN
U+16a7	ᚧ	RUNIC LETTER ETH
U+16a8	ᚨ	RUNIC LETTER ANSUZ A
U+16a9	ᚩ	RUNIC LETTER OS O
U+16aa	ᚪ	RUNIC LETTER AC A
U+16ab	ᚫ	RUNIC LETTER AESC
U+16ac	ᚬ	RUNIC LETTER LONG-BRANCH-OSS O
U+16ad	ᚭ	RUNIC LETTER SHORT-TWIG-OSS O
U+16ae	ᚮ	RUNIC LETTER O
U+16af	ᚯ	RUNIC LETTER OE
U+16b0	ᚰ	RUNIC LETTER ON
U+16b1	ᚱ	RUNIC LETTER RAIDO RAD REID R
U+16b2	ᚲ	RUNIC LETTER KAUNA
U+16b3	ᚳ	RUNIC LETTER CEN
U+16b4	ᚴ	RUNIC LETTER KAUN K
U+16b5	ᚵ	RUNIC LETTER G
U+16b6	ᚶ	RUNIC LETTER ENG
U+16b7	ᚷ	RUNIC LETTER GEBO GYFU G
U+16b8	ᚸ	RUNIC LETTER GAR
U+16b9	ᚹ	RUNIC LETTER WUNJO WYNN W
U+16ba	ᚺ	RUNIC LETTER HAGLAZ H
U+16bb	ᚻ	RUNIC LETTER HAEGL H
U+16bc	ᚼ	RUNIC LETTER LONG-BRANCH-HAGALL H
U+16bd	ᚽ	RUNIC LETTER SHORT-TWIG-HAGALL H
U+16be	ᚾ	RUNIC LETTER NAUDIZ NYD NAUD N
U+16bf	ᚿ	RUNIC LETTER SHORT-TWIG-NAUD N
U+16c0	ᛀ	RUNIC LETTER DOTTED-N
U+16c1	ᛁ	RUNIC LETTER ISAZ IS ISS I
U+16c2	ᛂ	RUNIC LETTER E
U+16c3	ᛃ	RUNIC LETTER JERAN J
U+16c4	ᛄ	RUNIC LETTER GER
U+16c5	ᛅ	RUNIC LETTER LONG-BRANCH-AR AE
U+16c6	ᛆ	RUNIC LETTER SHORT-TWIG-AR A
U+16c7	ᛇ	RUNIC LETTER IWAZ EOH
U+16c8	ᛈ	RUNIC LETTER PERTHO PEORTH P
U+16c9	ᛉ	RUNIC LETTER ALGIZ EOLHX
U+16ca	ᛊ	RUNIC LETTER SOWILO S
U+16cb	ᛋ	RUNIC LETTER SIGEL LONG-BRANCH-SOL S
U+16cc	ᛌ	RUNIC LETTER SHORT-TWIG-SOL S
U+16cd	ᛍ	RUNIC LETTER C
U+16ce	ᛎ	RUNIC LETTER Z
U+16cf	ᛏ	RUNIC LETTER TIWAZ TIR TYR T
U+16d0	ᛐ	RUNIC LETTER SHORT-TWIG-TYR T
U+16d1	ᛑ	RUNIC LETTER D
U+16d2	ᛒ	RUNIC LETTER BERKANAN BEORC BJARKAN B
U+16d3	ᛓ	RUNIC LETTER SHORT-TWIG-BJARKAN B
U+16d4	ᛔ	RUNIC LETTER DOTTED-P
U+16d5	ᛕ	RUNIC LETTER OPEN-P
U+16d6	ᛖ	RUNIC LETTER EHWAZ EH E
U+16d7	ᛗ	RUNIC LETTER MANNAZ MAN M
U+16d8	ᛘ	RUNIC LETTER LONG-BRANCH-MADR M
U+16d9	ᛙ	RUNIC LETTER SHORT-TWIG-MADR M
U+16da	ᛚ	RUNIC LETTER LAUKAZ LAGU LOGR L
U+16db	ᛛ	RUNIC LETTER DOTTED-L
U+16dc	ᛜ	RUNIC LETTER INGWAZ
U+16dd	ᛝ	RUNIC LETTER ING
U+16de	ᛞ	RUNIC LETTER DAGAZ DAEG D
U+16df	ᛟ	RUNIC LETTER OTHALAN ETHEL O
U+16e0	ᛠ	RUNIC LETTER EAR
U+16e1	ᛡ	RUNIC LETTER IOR
U+16e2	ᛢ	RUNIC LETTER CWEORTH
U+16e3	ᛣ	RUNIC LETTER CALC
U+16e4	ᛤ	RUNIC LETTER CEALC
U+16e5	ᛥ	RUNIC LETTER STAN
U+16e6	ᛦ	RUNIC LETTER LONG-BRANCH-YR
U+16e7	ᛧ	RUNIC LETTER SHORT-TWIG-YR
U+16e8	ᛨ	RUNIC LETTER ICELANDIC-YR
U+16e9	ᛩ	RUNIC LETTER Q
U+16ea	ᛪ	RUNIC LETTER X
U+16eb	᛫	RUNIC SINGLE PUNCTUATION
U+16ec	᛬	RUNIC MULTIPLE PUNCTUATION
U+16ed	᛭	RUNIC CROSS PUNCTUATION
U+16ee	ᛮ	RUNIC ARLAUG SYMBOL
U+16ef	ᛯ	RUNIC TVIMADUR SYMBOL
U+16f0	ᛰ	RUNIC BELGTHOR SYMBOL
…
U+1f600	😀	GRINNING FACE
U+1f601	😁	GRINNING FACE WITH SMILING EYES
U+1f602	😂	FACE WITH TEARS OF JOY
U+1f603	😃	SMILING FACE WITH OPEN MOUTH
U+1f604	😄	SMILING FACE WITH OPEN MOUTH AND SMILING EYES
U+1f605	😅	SMILING FACE WITH OPEN MOUTH AND COLD SWEAT
U+1f606	😆	SMILING FACE WITH OPEN MOUTH AND TIGHTLY-CLOSED EYES
U+1f607	😇	SMILING FACE WITH HALO
U+1f608	😈	SMILING FACE WITH HORNS
U+1f609	😉	WINKING FACE
U+1f60a	😊	SMILING FACE WITH SMILING EYES
U+1f60b	😋	FACE SAVOURING DELICIOUS FOOD
U+1f60c	😌	RELIEVED FACE
U+1f60d	😍	SMILING FACE WITH HEART-SHAPED EYES
U+1f60e	😎	SMILING FACE WITH SUNGLASSES
U+1f60f	😏	SMIRKING FACE
U+1f610	😐	NEUTRAL FACE
U+1f611	😑	EXPRESSIONLESS FACE
U+1f612	😒	UNAMUSED FACE
U+1f613	😓	FACE WITH COLD SWEAT
U+1f614	😔	PENSIVE FACE
U+1f615	😕	CONFUSED FACE
U+1f616	😖	CONFOUNDED FACE
U+1f617	😗	KISSING FACE
U+1f618	😘	FACE THROWING A KISS
U+1f619	😙	KISSING FACE WITH SMILING EYES
U+1f61a	😚	KISSING FACE WITH CLOSED EYES
U+1f61b	😛	FACE WITH STUCK-OUT TONGUE
U+1f61c	😜	FACE WITH STUCK-OUT TONGUE AND WINKING EYE
U+1f61d	😝	FACE WITH STUCK-OUT TONGUE AND TIGHTLY-CLOSED EYES
U+1f61e	😞	DISAPPOINTED FACE
U+1f61f	😟	WORRIED FACE
U+1f620	😠	ANGRY FACE
U+1f621	😡	POUTING FACE
U+1f622	😢	CRYING FACE
U+1f623	😣	PERSEVERING FACE
U+1f624	😤	FACE WITH LOOK OF TRIUMPH
U+1f625	😥	DISAPPOINTED BUT RELIEVED FACE
U+1f626	😦	FROWNING FACE WITH OPEN MOUTH
U+1f627	😧	ANGUISHED FACE
U+1f628	😨	FEARFUL FACE
U+1f629	😩	WEARY FACE
U+1f62a	😪	SLEEPY FACE
U+1f62b	😫	TIRED FACE
U+1f62c	😬	GRIMACING FACE
U+1f62d	😭	LOUDLY CRYING FACE
U+1f62e	😮	FACE WITH OPEN MOUTH
U+1f62f	😯	HUSHED FACE
U+1f630	😰	FACE WITH OPEN MOUTH AND COLD SWEAT
U+1f631	😱	FACE SCREAMING IN FEAR
U+1f632	😲	ASTONISHED FACE
U+1f633	😳	FLUSHED FACE
U+1f634	😴	SLEEPING FACE
U+1f635	😵	DIZZY FACE
U+1f636	😶	FACE WITHOUT MOUTH
U+1f637	😷	FACE WITH MEDICAL MASK
U+1f638	😸	GRINNING CAT FACE WITH SMILING EYES
U+1f639	😹	CAT FACE WITH TEARS OF JOY
U+1f63a	😺	SMILING CAT FACE WITH OPEN MOUTH
U+1f63b	😻	SMILING CAT FACE WITH HEART-SHAPED EYES
U+1f63c	😼	CAT FACE WITH WRY SMILE
U+1f63d	😽	KISSING CAT FACE WITH CLOSED EYES
U+1f63e	😾	POUTING CAT FACE
U+1f63f	😿	CRYING CAT FACE
U+1f640	🙀	WEARY CAT FACE
U+1f641	🙁	SLIGHTLY FROWNING FACE
U+1f642	🙂	SLIGHTLY SMILING FACE
U+1f643	🙃	UPSIDE-DOWN FACE
U+1f644	🙄	FACE WITH ROLLING EYES
…

Now, the most straightforward representation for \(2^{16}\) codepoints is what? Well, it’s simply using 16 bits per character, i.e. 2 bytes. That encoding exists, it’s called UTF-16 (“UTF” stands for “Unicode Transformation Format”), but consider the drawbacks:

we’ve lost ASCII compatibility by the simple fact of using 2 bytes per character instead of 1 (encoding “a” as 01100001 or 00000000|01100001, with the | indicating an imaginary boundary between bytes, is not the same thing)
encoding a string in a language which is mostly written down using basic letters of the Latin alphabet now takes up twice as much space (which is probably not a good idea, given the general dominance of English in electronic communication)
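Both drawbacks are easy to observe. The sketch below uses Python's `"utf-16-be"` codec, i.e. UTF-16 with the more significant byte first and no byte order mark, because that variant matches the 00000000|01100001 example; the sample sentence is arbitrary.

```python
text = "The quick brown fox"

as_ascii = text.encode("ascii")
as_utf16 = text.encode("utf-16-be")

# twice as many bytes for the same text
print(len(as_ascii), "bytes in ASCII")
print(len(as_utf16), "bytes in UTF-16")

# and a different sequence of bits for one and the same letter
a_in_ascii = [f"{byte:08b}" for byte in "a".encode("ascii")]
a_in_utf16 = [f"{byte:08b}" for byte in "a".encode("utf-16-be")]
print(a_in_ascii, a_in_utf16)
```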

Looks like we’ll have to think outside the box. The box in question here is fixed-width encodings – all of the real-world encoding schemes we’ve encountered so far were fixed-width, meaning that each character was represented by the same number of bits: 7, 8 or 16. In other words, you could jump around the string in multiples of 7, 8 or 16 bits and always land at the beginning of a character. (Not exactly true for UTF-16, because it is something more than just a “16-bit ASCII”: it handles codepoints beyond the first \(2^{16}\) using so-called surrogate pairs, i.e. two 16-bit units per character – but you get the gist.)

The smart idea that some bright people have come up with was to use a variable-width encoding, specifically one that doesn’t suck, unlike our encoding2. The most ubiquitous one currently is UTF-8, which we’ve already met in the HTML example above. UTF-8 is ASCII-compatible, i.e. the 1’s and 0’s used to encode text containing only ASCII characters are the same regardless of whether you use ASCII or UTF-8: it’s a sequence of 8-bit bytes. But UTF-8 can also handle many more additional characters, as defined by the Unicode standard, by using progressively longer and longer sequences of bits.

def print_utf8_bytes(char):
    """Prints binary representation of character as encoded by UTF-8.

    """
    # encode the string as UTF-8 and iterate over the bytes;
    # iterating over a sequence of bytes yields integers in the
    # range [0; 256); the formatting directive "{:08b}" does two
    # things:
    #   - "b" prints the integer in its binary representation
    #   - "08" left-pads the binary representation with 0's to a total
    #     width of 8, which is the width of a byte
    binary_bytes = [f"{byte:08b}" for byte in char.encode("utf8")]
    print(f"{char!r} encoded in UTF-8 is: {binary_bytes}")

print_utf8_bytes("A")   # the representations...
print_utf8_bytes("č")   # ... keep...
print_utf8_bytes("字")  # ... getting longer.

'A' encoded in UTF-8 is: ['01000001']
'č' encoded in UTF-8 is: ['11000100', '10001101']
'字' encoded in UTF-8 is: ['11100101', '10101101', '10010111']

How does that even work? The obvious problem here is that with a fixed-width encoding, you just chop up the string at regular intervals (7, 8, 16 bits) and you know that each interval represents one character. So how do you know where to chop up a variable width-encoded string, if each character can take up a different number of bits?

Essentially, the trick is to use some of the bits in the representation of a codepoint to store information not about which character it is (whether it’s an “A” or a “字”), but about how many bits it occupies. This is what we did with our encoding2, albeit in a very primitive way, by simply using 0 as a character delimiter. The price to pay is that if you want to skip ahead 10 characters in a string encoded with a variable-width encoding, you can’t just skip 10 * 7 or 8 or 16 bits; you have to read all the intervening characters to figure out how much space they take up. Take the following example:

for char in "Básník 李白":
    print_utf8_bytes(char)

'B' encoded in UTF-8 is: ['01000010']
'á' encoded in UTF-8 is: ['11000011', '10100001']
's' encoded in UTF-8 is: ['01110011']
'n' encoded in UTF-8 is: ['01101110']
'í' encoded in UTF-8 is: ['11000011', '10101101']
'k' encoded in UTF-8 is: ['01101011']
' ' encoded in UTF-8 is: ['00100000']
'李' encoded in UTF-8 is: ['11100110', '10011101', '10001110']
'白' encoded in UTF-8 is: ['11100111', '10011001', '10111101']

Notice the initial bits in each byte of a character follow a pattern depending on how many bytes in total that character has:

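if it’s a 1-byte character, that byte starts with 0
if it’s a 2-byte character, the first byte starts with 110 and the following one with 10
if it’s a 3-byte character, the first byte starts with 1110 and the following ones with 10
if it’s a 4-byte character (not shown above, but this is how e.g. emojis are encoded), the first byte starts with 11110 and the following ones with 10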

This makes it possible to find out which bytes belong to which characters, and also to spot invalid strings, as the leading byte in a multi-byte sequence always “announces” how many continuation bytes (= starting with 10) should follow.
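To make this concrete, here is a sketch of how a program might use those leading bits to find character boundaries in a sequence of UTF-8 bytes. The function name `split_utf8` is invented for this example, and it assumes the input is valid UTF-8: there is no error checking, which a real decoder would need.

```python
def split_utf8(data):
    """Group UTF-8 bytes into per-character chunks using the leading bits."""
    chunks = []
    for byte in data:
        # continuation bytes look like 10xxxxxx; masking with 11000000
        # keeps just the two leading bits for comparison
        if (byte & 0b11000000) == 0b10000000:
            chunks[-1].append(byte)
        else:
            # anything else starts a new character
            chunks.append([byte])
    return [bytes(chunk) for chunk in chunks]

for chunk in split_utf8("Básník 李白".encode("utf8")):
    print(len(chunk), "byte(s) ->", chunk.decode("utf8"))
```

In everyday code you would of course just call `.decode("utf8")` on the whole sequence and let Python do this for you.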

So much for a quick introduction to UTF-8 (= the encoding), but there’s much more to Unicode (= the character set). While UTF-8 defines only how integer numbers corresponding to codepoints are to be represented as 1’s and 0’s in a computer’s memory, Unicode specifies how those numbers are to be interpreted as characters, what their properties and mutual relationships are, what conversions (i.e. mappings between (sequences of) codepoints) they can undergo, etc.

Consider for instance the various ways diacritics are handled: “č” can be represented either as a single codepoint (LATIN SMALL LETTER C WITH CARON – all Unicode codepoints have cute names like this) or a sequence of two codepoints, the character “c” and a combining diacritic mark (COMBINING CARON). You can search for the codepoints corresponding to Unicode characters e.g. here and play with them in Python using the chr(0xXXXX) built-in function or with the special string escape sequence \uXXXX (where XXXX is the hexadecimal representation of the codepoint) – both are ways to get the character corresponding to the given codepoint:

# "č" as LATIN SMALL LETTER C WITH CARON, codepoint 010d
print(chr(0x010d))
print("\u010d")

č
č

# "č" as a sequence of LATIN SMALL LETTER C, codepoint 0063, and
# COMBINING CARON, codepoint 030c
print(chr(0x0063) + chr(0x030c))
print("\u0063\u030c")

č
č

# of course, chr() also works with decimal numbers
chr(269)

'č'

This means you have to be careful when working with languages that use accents, because to a computer, the two possible representations are of course different strings, even though to you, they’re conceptually the same:

s1 = "\u010d"
s2 = "\u0063\u030c"
# s1 and s2 look the same to the naked eye...
print(s1, s2)

č č

# ... but they're not
s1 == s2

False

Watch out, they even have different lengths! This might come back to bite you if you’re trying to compute the length of a word in letters.

print("s1 is", len(s1), "character(s) long.")
print("s2 is", len(s2), "character(s) long.")

s1 is 1 character(s) long.
s2 is 2 character(s) long.

For this reason, even though we’ve been informally calling these Unicode entities “characters”, it is more accurate and less confusing to use the technical term “codepoints”.
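If what you actually need is a length in letters, one possible workaround is to skip combining marks while counting. This is only a rough sketch: the function name count_base_codepoints is invented here, and the approach is an approximation which does not amount to full grapheme segmentation.

```python
import unicodedata as ud

def count_base_codepoints(word):
    """Count the codepoints in word, skipping combining marks."""
    # ud.combining() returns 0 for codepoints which are not combining marks
    return sum(1 for char in word if ud.combining(char) == 0)

print(count_base_codepoints("\u010d"))        # single-codepoint "č"
print(count_base_codepoints("\u0063\u030c"))  # "c" + COMBINING CARON
```

Both calls print 1, because COMBINING CARON is skipped.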

Generally, most text out there will use the first, single-codepoint approach whenever possible, and pre-packaged linguistic corpora will try to be consistent about this (unless they come from the web, which always warrants being suspicious and defensive about your material). If you’re worried about inconsistencies in your data, you can perform a normalization:

from unicodedata import normalize

# NFC stands for Normal Form C; this normalization applies a canonical
# decomposition (into a multi-codepoint representation) followed by a
# canonical composition (into a single-codepoint representation)
s1 = normalize("NFC", s1)
s2 = normalize("NFC", s2)

s1 == s2

True
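The unicodedata module also offers the opposite direction, NFD (Normal Form D), which stops after the canonical decomposition. The sketch below uses it to show the two-codepoint form, and wraps the comparison in a helper whose name, same_text, is made up for this example:

```python
from unicodedata import normalize

composed = "\u010d"
decomposed = normalize("NFD", composed)

# list the codepoints of the decomposed form
print([f"U+{ord(char):04x}" for char in decomposed])
print(len(composed), len(decomposed))

def same_text(a, b):
    """Compare two strings after bringing both to the same normal form."""
    return normalize("NFC", a) == normalize("NFC", b)

print(same_text(composed, decomposed))
```

This prints the codepoints U+0063 and U+030c, then the lengths 1 and 2, then True. Which normal form you settle on matters less than applying the same one to everything you compare.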

Let’s wrap things up by saying that Python itself uses Unicode internally, but the encoding it defaults to when opening an external file depends on the locale of the system (broadly speaking, the set of region, language and character-encoding related settings of the operating system). On most modern Linux and macOS systems, this will probably be a UTF-8 locale and Python will therefore assume UTF-8 as the encoding by default. Unfortunately, Windows is different: there, the default is often a legacy, non-UTF-8 encoding such as cp1252. To be on the safe side, whenever opening files in Python, you can specify the encoding explicitly:

with open("unicode.ipynb", encoding="utf-8") as file:
    pass
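If you’re curious which encoding your own system would pick when you leave the argument out, the standard library locale module can tell you. The result depends on your machine, so no output is shown here:

```python
import locale

# the encoding open() falls back on when none is given explicitly
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```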

In fact, it’s always a good idea to specify the encoding explicitly, using UTF-8 as a default if you don’t know, for at least two reasons – it makes your code more:

portable – it will work the same across different operating systems which assume different default encodings;
and resistant to data corruption – UTF-8 is more restrictive than fixed-width encodings, in the sense that not all sequences of bytes are valid UTF-8.

That second point probably requires elaboration. For instance, if one byte starts with 11, then the following one must start with 10 (see above); anything else is an error. By contrast, in a single-byte fixed-width encoding such as Latin-1, any sequence of bytes is valid. Decoding will always succeed, but if you picked the wrong encoding, the result will be garbage, which you might not notice. Therefore, it makes sense to default to UTF-8: if it works, then there’s a good chance that the file actually was encoded in UTF-8 and you’ve read the data in correctly; if it fails, you get an explicit error which prompts you to investigate further.

Another good idea, when dealing with Unicode text from an unknown and unreliable source, is to look at the set of codepoints contained in it and eliminate or replace those that look suspicious. Here’s a function to help with that:

import unicodedata as ud
from collections import Counter

import pandas as pd

def inspect_codepoints(string):
    """Create a frequency distribution of the codepoints in a string."""
    char_frequencies = Counter(string)
    return pd.DataFrame.from_records(
        (
            (
                freq,
                char,
                f"U+{ord(char):04x}",
                # some codepoints (e.g. control characters like newline) have
                # no name; use a placeholder instead of raising a ValueError
                ud.name(char, "<unnamed>"),
                ud.category(char),
            )
            for char, freq in char_frequencies.most_common()
        ),
        columns=["freq", "char", "codepoint", "name", "category"],
    )

Depending on your font configuration, it may be very hard to spot the two intruders in the sentence below. The frequency table shows the string contains regular LATIN SMALL LETTER T and LATIN SMALL LETTER G, but also their specialized but visually similar variants MATHEMATICAL SANS-SERIF SMALL T and LATIN SMALL LETTER SCRIPT G. You might want to replace such codepoints before doing further text processing…

inspect_codepoints("Intruders here, good 𝗍hinɡ I checked.")

	freq	char	codepoint	name	category
0	5	e	U+0065	LATIN SMALL LETTER E	Ll
1	5		U+0020	SPACE	Zs
2	3	r	U+0072	LATIN SMALL LETTER R	Ll
3	3	d	U+0064	LATIN SMALL LETTER D	Ll
4	3	h	U+0068	LATIN SMALL LETTER H	Ll
5	2	I	U+0049	LATIN CAPITAL LETTER I	Lu
6	2	n	U+006e	LATIN SMALL LETTER N	Ll
7	2	o	U+006f	LATIN SMALL LETTER O	Ll
8	2	c	U+0063	LATIN SMALL LETTER C	Ll
9	1	t	U+0074	LATIN SMALL LETTER T	Ll
10	1	u	U+0075	LATIN SMALL LETTER U	Ll
11	1	s	U+0073	LATIN SMALL LETTER S	Ll
12	1	,	U+002c	COMMA	Po
13	1	g	U+0067	LATIN SMALL LETTER G	Ll
14	1	𝗍	U+1d5cd	MATHEMATICAL SANS-SERIF SMALL T	Ll
15	1	i	U+0069	LATIN SMALL LETTER I	Ll
16	1	ɡ	U+0261	LATIN SMALL LETTER SCRIPT G	Ll
17	1	k	U+006b	LATIN SMALL LETTER K	Ll
18	1	.	U+002e	FULL STOP	Po

… because of course, for a computer, the word “thing” written with two different variants of “g” is really just two different words, which is probably not what you want:

"thing" == "thinɡ"

False
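One way to replace such lookalikes is sketched below, under a few assumptions. NFKC normalization (a compatibility variant of NFC) folds variants like the mathematical letters into their plain counterparts, but it leaves LATIN SMALL LETTER SCRIPT G alone, so that one needs an explicit replacement table. Both the LOOKALIKES table and the fold_lookalikes function are invented for this example, and you would have to extend the table based on what you find in your own data. Also keep in mind that NFKC is lossy: it changes ligatures, superscripts and the like as well.

```python
import unicodedata as ud

# hand-made table for lookalikes which normalization leaves untouched
LOOKALIKES = str.maketrans({"\u0261": "g"})  # LATIN SMALL LETTER SCRIPT G

def fold_lookalikes(text):
    """Apply NFKC normalization, then the hand-made replacements."""
    return ud.normalize("NFKC", text).translate(LOOKALIKES)

cleaned = fold_lookalikes("Intruders here, good \U0001d5cdhin\u0261 I checked.")
print(cleaned)
print(cleaned == "Intruders here, good thing I checked.")
```

After cleaning, the comparison against the plain-ASCII sentence prints True.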

So to sum up:

Unicode strives to be a universal character set. It contains a lot of characters, many of them very similar-looking yet different. Appearances can be deceptive: when in doubt, examine which codepoints you’re actually dealing with and/or normalize.
Unicode can be encoded using different encodings. Some are fixed-width (UTF-32, which we haven’t mentioned yet), some are almost fixed-width (UTF-16, which uses 2 bytes for most codepoints and 4 bytes for the rest), some are fully variable-width (UTF-8).
UTF-8 has many desirable properties, so you should always use it when saving plain text files, and always assume it as a first try when opening files in an unknown encoding.
Internally, Python uses a custom representation of Unicode, which is none of the encodings we already mentioned.
The following functionality is useful for inspecting Unicode data in Python: the ord() and chr() built-in functions, the unicodedata standard library module, and the regex external package, which, like the standard library re module, implements regular expression support for Python, but with much more extensive Unicode support.
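To make the point about fixed- and variable-width encodings concrete, the following sketch counts how many bytes a few codepoints take up in each of the three encodings. The helper byte_widths is made up for this illustration. The -le (“little-endian”) codec variants are used so that Python doesn’t prepend a byte order mark, which would inflate the counts.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def byte_widths(char):
    """Map each encoding to the number of bytes it needs for char."""
    return {enc: len(char.encode(enc)) for enc in ENCODINGS}

# LATIN SMALL LETTER A, LATIN SMALL LETTER C WITH CARON,
# MATHEMATICAL SANS-SERIF SMALL T
for char in ["a", "\u010d", "\U0001d5cd"]:
    print(f"U+{ord(char):04x}", byte_widths(char))
```

UTF-32 takes 4 bytes every time, UTF-16 takes 2 bytes for the first two characters and 4 for the last one, and UTF-8 takes 1, 2 and 4 bytes, respectively.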
