3. Text inside the computer
3.1. Why should I care?
Even though as linguists, we may think we know everything about text, definitely more than programmers, thank you very much – this ain’t Kansas anymore as they say, this is Computerland, and the rules for how text works are strict and sometimes surprising. If we’re going to take advantage of computers to supercharge our linguistic analyses, we need to learn to play by their rules, otherwise we stand a big chance of shooting ourselves in the foot.
On occasion, this stuff might seem ridiculously low-level to a linguist – you’ll probably get that feeling more than once while perusing this chapter. But if you get it wrong, anything that you build on top will crumble like a house of cards and your linguistic analyses will be likely to yield garbage. And just to be extra clear, every single day, there are professional programmers who get some of this wrong. When was the last time an e-shop maltreated the diacritics in your name, for instance?
3.2. Binary and hexadeci-what-now?
Before we get started, humor me as I engage in a short digression on representing numbers, to make sure we’re all on the same page. Why numbers, I hear you ask, I thought this chapter was supposed to be about text? Well because in a computer, it’s numbers all the way down, even text is ultimately represented as numbers. So we need to be relatively comfortable with the different ways numbers are commonly written down in the context of computers:
as a plain old decimal number, using 10 digits 0–9 (e.g. 12)
as a binary number, using only 2 digits, 0 and 1 (e.g. 1100, which equals decimal 12)
as a hexadecimal number, using 16 digits: 0–9 and a–f (e.g. c, which also equals decimal 12)
All of these numeral systems can represent any number: the binary system doesn’t stop counting at 1, any more than the decimal system stops counting at 9. Beyond these respective points, we just add another digit and carry on. The only difference is that the fewer distinct digits a system uses, the more digits it takes to represent a given number.
If you’re curious how that works, just take a moment to reflect on what you do implicitly each time you read a regular decimal number: you multiply the individual digits with increasing powers of 10, which is the base of the decimal system, starting from the right, and add this all up:
e.g. for (regular decimal) 12: \(2 \times 10^0 + 1 \times 10^1 = 2 \times 1 + 1 \times 10 = 12\)
for binary 1100, this goes: \(0 \times 2^0 + 0 \times 2^1 + 1 \times 2^2 + 1 \times 2^3 = 0 + 0 + 4 + 8 = 12\)
and for hexadecimal c: \(c \times 16^0 = c \times 1 = 12 \times 1 = 12\).
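If you’d like to see that recipe spelled out as code, here is a minimal sketch which computes the same sum of digit-times-power terms for any base up to 16. The names `to_decimal` and `DIGITS` are made up for this illustration; in real code, Python’s built-in `int(text, base)` does the same job.

```python
# Hypothetical helper, for illustration only: evaluate a digit string
# in a given base by summing digit * base**position.
DIGITS = "0123456789abcdef"

def to_decimal(digits, base):
    total = 0
    # reversed() makes position 0 correspond to the rightmost digit
    for position, digit in enumerate(reversed(digits.lower())):
        value = DIGITS.index(digit)
        if value >= base:
            raise ValueError(f"digit {digit!r} is not valid in base {base}")
        total += value * base**position
    return total

print(to_decimal("12", 10))    # 12
print(to_decimal("1100", 2))   # 12
print(to_decimal("c", 16))     # 12
```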
Decimal numbers are useful because everybody knows them, which makes them universally easy to read. Binary numbers are useful because they closely map to the underlying hardware: computer memory consists of bits, tiny slots each of which can hold a 0 or a 1. A group of eight adjacent bits is called a byte.
As a consequence, the need often arises to convert between these different systems. If you don’t want to do so by hand, Python has your back! The bin() function gives you a string representation of the binary form of a number:
bin(65)
'0b1000001'
As you can see, in Python, binary numbers are given a 0b prefix to distinguish them from regular (decimal) numbers. The number itself is what follows after (i.e. 1000001).
Similarly, hexadecimal numbers are given an 0x prefix:
hex(65)
'0x41'
To convert in the opposite direction, i.e. to decimal, just evaluate the binary or hexadecimal representation of a number:
0b1000001
65
0x41
65
Number literals using these different bases are mutually compatible, e.g. for comparison purposes:
0b1000001 == 0x41 == 65
True
Why are hexadecimals useful? They’re primarily a more convenient, condensed way of representing sequences of bits: each hexadecimal digit can represent 16 different values, and therefore it can stand in for a sequence of 4 bits, i.e. half a byte (\(2^4 = 16\)).
0xa == 0b1010
True
0xb == 0b1011
True
# if we paste together hexadecimal a and b, it's the same as pasting
# together binary 1010 and 1011
0xab == 0b10101011
True
In other words, instead of binary 10101011, we can just write hexadecimal ab and save ourselves some space. Of course, this only works if shorter binary numbers are padded to a 4-bit width:
0x2 == 0b10
True
0x3 == 0b11
True
# if we paste together hexadecimal 2 and 3, we have to paste together
# binary 0010 and 0011...
0x23 == 0b00100011
True
# ... not just 10 and 11
0x23 == 0b1011
False
The padding has no effect on the value, much like decimal 42 and 00000042 are effectively the same numbers.
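The correspondence between one hexadecimal digit and one group of four bits can also be checked mechanically. The sketch below uses a made-up helper, `hex_to_bits`, which expands each hexadecimal digit into its zero-padded 4-bit group; the `"04b"` format specification (binary, padded with 0’s to a width of 4) is standard Python.

```python
def hex_to_bits(hex_digits):
    # expand every hexadecimal digit into its own zero-padded group of 4 bits
    return " ".join(format(int(digit, 16), "04b") for digit in hex_digits)

print(hex_to_bits("ab"))   # 1010 1011
print(hex_to_bits("23"))   # 0010 0011
```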
3.3. Representing text: a DIY approach
With that out of the way, the best way to understand how computers represent text, and why it works like it does, is to try and come up with our own system to do that. First of all, we’ll need a table of all the characters we want to support, mapping each one of them, you guessed it, to a unique number, since computers only work with numbers. For this toy example, let’s say four characters are enough:
number | character
---|---
1 | a
2 | b
3 | c
4 | d
Such a table is called a character set, and each row is a codepoint, which basically means a number/character pair. Let’s represent our character set as a Python dictionary; since we’ll mostly be concerned with going from characters to numbers, we’ll use the characters as keys.
charset = dict(a=1, b=2, c=3, d=4)
charset
{'a': 1, 'b': 2, 'c': 3, 'd': 4}
Now, we’ve just established that computers typically know only about two digits, 0 and 1, but our character set goes up to 4, so we’re going to have to come up with a way to encode these numbers into series of 0’s and 1’s so that we can store them in computer memory. We can think of an encoding as a function which decides how to turn a number into a sequence of bits. A very simple encoding would be to turn each number into the corresponding number of 1’s and store those in the memory.
def encoding1(char):
    num = charset[char]
    return "1" * num
Let’s see how our new encoding1 encodes each character in charset.
for char in charset.keys():
    print(char, "->", encoding1(char))
a -> 1
b -> 11
c -> 111
d -> 1111
Looks fine so far, all of the characters map to a different sequence of bits! Now for something more challenging: let’s see how it handles encoding strings which consist of multiple characters.
[encoding1(char) for char in "ac"]
['1', '111']
This seems fine as well, except we have to remember that computer memory is just a long line of contiguous slots, with no boundaries between them. Even when we conceptually split it into bytes, the byte boundaries every eight bits are just imagined, it’s not like there’s some kind of fence in the memory after every eight slots. So we have to join this list of strings in order to get a more accurate idea of what our encoding would actually look like when stored in memory.
"".join(encoding1(char) for char in "ac")
'1111'
Uh-oh. Can you spot the problem? Our encoding works perfectly well one way, to encode characters into bits, but in the other direction, it breaks down, because we don’t really have a way to reconstruct the boundaries between individual characters. Imagine you’re a program whose task it is to show characters on screen based on that chunk of memory. You see those four 1’s. What characters should you display? Well, you could go for aaaa. Or for ac. Or d. Or… You get the gist.
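Just how bad is the ambiguity? As a rough sketch, we can ask Python to list every string which encoding1 would turn into a given sequence of bits. The helper `all_decodings` is made up for this illustration, and the block repeats the definition of `charset` so that it can run on its own.

```python
charset = dict(a=1, b=2, c=3, d=4)

def all_decodings(bits):
    """List every string that encoding1 would turn into `bits`."""
    if not bits:
        # nothing left to decode: exactly one way to do that
        return [""]
    results = []
    for char, num in charset.items():
        chunk = "1" * num
        if bits.startswith(chunk):
            # peel off one candidate character, then decode the rest
            for rest in all_decodings(bits[num:]):
                results.append(char + rest)
    return results

candidates = all_decodings("1111")
print(len(candidates), candidates)
```

Four bits, and already there are eight equally plausible readings.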
How can we fix this? We’ll definitely have to put to use that other binary digit we have at our disposal, 0, because we just saw our problem is that encoding1 yields an undifferentiated, uninterrupted string of 1’s. One way we could do this is by using the 0 as a character terminator. This encoding2 is basically the same as encoding1, just with a 0 tacked on at the end each time.
def encoding2(char):
    return encoding1(char) + "0"

for char in charset.keys():
    print(char, "->", encoding2(char))
"".join(encoding2(char) for char in "ac")
a -> 10
b -> 110
c -> 1110
d -> 11110
'101110'
That’s better! Now we know that each time we encounter a 0, we’ve reached the end of a character, and we can determine which character it was by counting the number of preceding 1’s. So 101110 can only correspond to ac. encoding2 is the first encoding we’ve written worthy of the name because it can go both ways, at least in theory: the function we’ve written obviously works only for encoding; for decoding series of bits into characters, we’d have to write the inverse function. It’s a variable-width encoding, because different characters are encoded using different numbers of bits: a is encoded as 10 for a width of 2, whereas c is 4 bits wide, 1110.
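For the curious, here is one way the inverse function might look. This is a minimal sketch rather than anything definitive: the names `decoding2` and `reverse_charset` are made up for this illustration, and `charset` is repeated so that the block can run on its own.

```python
charset = dict(a=1, b=2, c=3, d=4)
# flip the table around, so that we can look up characters by number
reverse_charset = {num: char for char, num in charset.items()}

def decoding2(bits):
    chars = []
    ones = 0
    for bit in bits:
        if bit == "1":
            # still inside a character: keep counting the 1's
            ones += 1
        else:
            # a 0 terminates the character; the count tells us which one
            chars.append(reverse_charset[ones])
            ones = 0
    if ones:
        raise ValueError("input ends in the middle of a character")
    return "".join(chars)

print(decoding2("101110"))   # ac
```

Note that a count which isn’t in the table (e.g. two 0’s in a row) makes the lookup fail with a KeyError, which is a crude but effective way of rejecting invalid input.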
Unfortunately, it’s not very good. The number of bits per character quickly gets out of hand. If we had a character set consisting of all 26 letters of the English alphabet, the last few would take up more than 20 bits per character, and that’s just the lowercase letters.
The trouble is that we’ve tied 1 and 0 down to a single role: 1’s determine the codepoint number, and 0’s tell us when to stop adding those 1’s. If we could use both to encode the number, say in its usual binary form, then we could squeeze a lot more information into the same space. Something like this:
def encoding3(char):
    num = charset[char]
    binary = bin(num)
    # remove the 0b prefix from the representation
    without_prefix = binary[2:]
    return without_prefix

for char in charset.keys():
    print(char, "->", encoding3(char))
"".join(encoding3(char) for char in "ac")
a -> 1
b -> 10
c -> 11
d -> 100
'111'
Except encoding3 has the same drawback as encoding1: it’s not a proper encoding, because it doesn’t work both ways, the decoding back is ambiguous. 111 could stand for any of aaa, ac or ca. So how do we free the hands of the 0 to use its full potential as a digit? Can we use something else to mark character boundaries? But we’ve established there is nothing else than 1’s and 0’s in computer memory!
… unless we use a trick. What if we said that each character always has to fit into the same number of bits, e.g. 1 byte? Then we could read memory by chunks of 8 bits and just interpret each chunk as a character. It would work exactly as if there were actual boundaries delimiting characters in the memory, spaced evenly 8 bits apart, even though we know that in actual fact, there are no boundaries between the individual tiny memory slots. This approach is called a fixed-width encoding, in contrast to the variable-width approach we encountered earlier.
For our toy character set, we don’t need 8 bits, 2 are enough to encode 4 different characters, corresponding to the 4 different sequences of 1’s and 0’s you can create with two available slots: 00, 01, 10 and 11. Let’s see what decimal numbers these correspond to. It’s easy, you can probably do it in your head, but let Python tell us anyway.
0b00, 0b01, 0b10, 0b11
(0, 1, 2, 3)
Neat, we can easily get those numbers by just shifting the number values in our character set table by 1 (i.e. subtracting 1). So an encoding4 function could look something like this:
def encoding4(char):
    num = charset[char] - 1
    binary = bin(num)
    without_prefix = binary[2:]
    padded = without_prefix.rjust(2, "0")
    return padded
The .rjust() (and .ljust()) methods pad a string to a desired width with a provided padding character; in our case, we pad with 0’s up to a width of 2, unless the string is already 2 characters wide (or wider). Perhaps confusingly, .rjust() pads on the left and vice versa; this is because padding on the left justifies the text along the right margin (hence .rjust()). Let’s take this baby out for a spin.
for char in charset.keys():
    print(char, "->", encoding4(char))
"".join(encoding4(char) for char in "ac")
a -> 00
b -> 01
c -> 10
d -> 11
'0010'
Looks alright to me! When decoding, we just chop up the memory into chunks two bits wide and interpret each as a separate character, so 00 yields a and 10 yields c. Piece of cake. This is where we leave our toy example character set and encodings as they have no more to teach us, and pick up the thread of the story in the real world.
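Before we move on, here’s what that chopping up might look like in code. Again, this is just a sketch: `decoding4` and `reverse_charset` are names made up for this illustration, and `charset` is repeated so that the block can run on its own.

```python
charset = dict(a=1, b=2, c=3, d=4)
reverse_charset = {num: char for char, num in charset.items()}

def decoding4(bits, width=2):
    if len(bits) % width != 0:
        raise ValueError("the number of bits must be a multiple of the width")
    chars = []
    # jump through the bits in steps of `width`: no need to inspect the
    # bits themselves in order to find the character boundaries
    for start in range(0, len(bits), width):
        chunk = bits[start:start + width]
        # undo the shift by 1 which encoding4 applied
        num = int(chunk, 2) + 1
        chars.append(reverse_charset[num])
    return "".join(chars)

print(decoding4("0010"))   # ac
```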
3.4. Fixed-width encodings: ASCII et al.
Obviously, how many different characters your encoding can handle depends on how many bits you allow per character:
with 1 bit you can have \(2^1 = 2\) characters (one is mapped to 0, the other to 1)
with 2 bits you can have \(2^2 = 2 \times 2 = 4\) characters (mapped to 00, 01, 10 and 11)
with 3 bits you can have \(2^3 = 2 \times 2 \times 2 = 8\) characters
etc.
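We can also turn the question around and ask how many bits a fixed-width encoding needs for a character set of a given size. The sketch below relies on the standard `int.bit_length()` method; the helper name `bits_needed` is made up for this illustration.

```python
def bits_needed(num_characters):
    # we start counting at 0, so the largest number we need to store is
    # num_characters - 1; bit_length() says how many bits that takes
    return max(1, (num_characters - 1).bit_length())

for size in (2, 4, 8, 26, 128, 256):
    print(size, "characters fit into", bits_needed(size), "bit(s)")
```

So the 26 lowercase letters of the English alphabet would fit into 5 bits each, a far cry from the 20-odd bits that encoding2 needed for the last few.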
The oldest encoding still in widespread use is called ASCII, which is a 7-bit encoding. What’s the number of different sequences of seven 1’s and 0’s?
# this is how Python spells 2⁷, i.e. 2*2*2*2*2*2*2
2**7
128
This means ASCII can represent 128 different characters, which comfortably fits the basic Latin alphabet (both lowercase and uppercase), Arabic numerals, punctuation and some “control characters” which were primarily useful on the old teletype terminals for which ASCII was designed. For instance, the letter “A” corresponds to the number 65 (1000001 in binary, see above).
Nowadays, ASCII is represented using 8 bits (= 1 byte), because that’s the unit of computer memory which has become ubiquitous (in terms of both hardware and software assumptions), but still uses only 7 bits’ worth of information. That extra bit means that there’s room for another 128 characters in addition to the 128 ASCII ones, coming up to a total of 256.
2**(7+1)
256
What happens in the range [128; 256) is not covered by the ASCII standard. In the 1980s and 1990s, many encodings were standardized which used this range for their own purposes, usually representing additional accented characters used in a particular region. E.g. Czech (and Slovak, Polish…) alphabets can be represented using the ISO latin-2 encoding, or Microsoft’s cp-1250. Encodings which stick to the same character mappings as ASCII in the range [0; 128) and represent them physically in the same way (as 1 byte), while potentially adding more character mappings beyond that, are called ASCII-compatible.
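To get a feel for what this means in practice, we can compare the bytes produced by a few of Python’s built-in codecs. This is only a quick sketch; the sample strings are arbitrary.

```python
text = "Hello!"
# in the ASCII range, all three encodings produce identical bytes...
print(text.encode("ascii") == text.encode("latin2") == text.encode("cp1250"))

# ... but beyond that range, the same character may end up as
# a different byte depending on the encoding
for encoding in ("latin2", "cp1250"):
    print(encoding, "Š".encode(encoding))
```

The first line prints True, whereas the loop shows two different bytes for “Š”: a9 in Latin-2 and 8a in cp-1250.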
ASCII compatibility is a good thing™, because when you start reading a character stream in a computer, there’s no way to know in advance what encoding it is in (unless it’s a file you’ve encoded yourself and you happen to remember). So in practice, a heuristic has been established to start reading the stream assuming it’s ASCII by default, and switch to a different encoding if evidence becomes available to the contrary. For instance, HTML files describing web pages displayed in your browser should all start with something like this:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
...
This way, whenever a program wants to read a file like this, it can start off with ASCII, waiting to see if it reaches the charset (i.e. encoding) attribute, and once it does, it can switch from ASCII to that encoding (UTF-8 here) and restart reading the file, now fairly sure that it’s using the correct encoding. This trick works only if we can assume that whatever encoding the rest of the file is in, the first few lines can be considered as ASCII for all practical intents and purposes.
Without the charset attribute, the only way to know if the encoding is right would be for you to look at the rendered text and see if it makes sense; if it did not, you’d have to resort to trial and error, manually switching the encodings and looking for the one in which the numbers behind the characters stop coming out as gibberish and are actually translated into intelligible text.
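That trial and error procedure might look something like the following sketch. The helper `try_encodings`, the sample word and the list of candidate encodings are all made up for this illustration; the final decision about which result is intelligible still rests with a human reader.

```python
# pretend we got these bytes from somewhere and don't know their encoding
data = "příliš".encode("utf-8")

def try_encodings(data, candidates):
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = data.decode(encoding)
        except UnicodeDecodeError:
            # this encoding can't make sense of the bytes at all
            results[encoding] = None
    return results

results = try_encodings(data, ("ascii", "latin2", "cp1250", "utf-8"))
for encoding, text in results.items():
    print(f"{encoding:8}", repr(text))
```

Some candidates fail outright, which is helpful, but others happily return gibberish without complaining, which is exactly why you have to eyeball the results.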
Let’s take a look at printable characters in the Latin-2 character set. The character set consists of mappings between non-negative integers (whole numbers) and characters; each one of these is called a codepoint. The Latin-2 encoding then defines how to encode each of these integers as a series of bits (1’s and 0’s) in the computer’s memory.
latin2_printable_characters = []
# the Latin-2 character set has 256 codepoints, corresponding to
# integers from 0 to 255
for codepoint in range(256):
    # the Latin-2 encoding is simple: each codepoint is encoded
    # as the byte corresponding to that integer in binary
    byte = bytes([codepoint])
    character = byte.decode(encoding="latin2")
    if character.isprintable():
        latin2_printable_characters.append((codepoint, character))
latin2_printable_characters
[(32, ' '),
(33, '!'),
(34, '"'),
(35, '#'),
(36, '$'),
(37, '%'),
(38, '&'),
(39, "'"),
(40, '('),
(41, ')'),
(42, '*'),
(43, '+'),
(44, ','),
(45, '-'),
(46, '.'),
(47, '/'),
(48, '0'),
(49, '1'),
(50, '2'),
(51, '3'),
(52, '4'),
(53, '5'),
(54, '6'),
(55, '7'),
(56, '8'),
(57, '9'),
(58, ':'),
(59, ';'),
(60, '<'),
(61, '='),
(62, '>'),
(63, '?'),
(64, '@'),
(65, 'A'),
(66, 'B'),
(67, 'C'),
(68, 'D'),
(69, 'E'),
(70, 'F'),
(71, 'G'),
(72, 'H'),
(73, 'I'),
(74, 'J'),
(75, 'K'),
(76, 'L'),
(77, 'M'),
(78, 'N'),
(79, 'O'),
(80, 'P'),
(81, 'Q'),
(82, 'R'),
(83, 'S'),
(84, 'T'),
(85, 'U'),
(86, 'V'),
(87, 'W'),
(88, 'X'),
(89, 'Y'),
(90, 'Z'),
(91, '['),
(92, '\\'),
(93, ']'),
(94, '^'),
(95, '_'),
(96, '`'),
(97, 'a'),
(98, 'b'),
(99, 'c'),
(100, 'd'),
(101, 'e'),
(102, 'f'),
(103, 'g'),
(104, 'h'),
(105, 'i'),
(106, 'j'),
(107, 'k'),
(108, 'l'),
(109, 'm'),
(110, 'n'),
(111, 'o'),
(112, 'p'),
(113, 'q'),
(114, 'r'),
(115, 's'),
(116, 't'),
(117, 'u'),
(118, 'v'),
(119, 'w'),
(120, 'x'),
(121, 'y'),
(122, 'z'),
(123, '{'),
(124, '|'),
(125, '}'),
(126, '~'),
(161, 'Ą'),
(162, '˘'),
(163, 'Ł'),
(164, '¤'),
(165, 'Ľ'),
(166, 'Ś'),
(167, '§'),
(168, '¨'),
(169, 'Š'),
(170, 'Ş'),
(171, 'Ť'),
(172, 'Ź'),
(174, 'Ž'),
(175, 'Ż'),
(176, '°'),
(177, 'ą'),
(178, '˛'),
(179, 'ł'),
(180, '´'),
(181, 'ľ'),
(182, 'ś'),
(183, 'ˇ'),
(184, '¸'),
(185, 'š'),
(186, 'ş'),
(187, 'ť'),
(188, 'ź'),
(189, '˝'),
(190, 'ž'),
(191, 'ż'),
(192, 'Ŕ'),
(193, 'Á'),
(194, 'Â'),
(195, 'Ă'),
(196, 'Ä'),
(197, 'Ĺ'),
(198, 'Ć'),
(199, 'Ç'),
(200, 'Č'),
(201, 'É'),
(202, 'Ę'),
(203, 'Ë'),
(204, 'Ě'),
(205, 'Í'),
(206, 'Î'),
(207, 'Ď'),
(208, 'Đ'),
(209, 'Ń'),
(210, 'Ň'),
(211, 'Ó'),
(212, 'Ô'),
(213, 'Ő'),
(214, 'Ö'),
(215, '×'),
(216, 'Ř'),
(217, 'Ů'),
(218, 'Ú'),
(219, 'Ű'),
(220, 'Ü'),
(221, 'Ý'),
(222, 'Ţ'),
(223, 'ß'),
(224, 'ŕ'),
(225, 'á'),
(226, 'â'),
(227, 'ă'),
(228, 'ä'),
(229, 'ĺ'),
(230, 'ć'),
(231, 'ç'),
(232, 'č'),
(233, 'é'),
(234, 'ę'),
(235, 'ë'),
(236, 'ě'),
(237, 'í'),
(238, 'î'),
(239, 'ď'),
(240, 'đ'),
(241, 'ń'),
(242, 'ň'),
(243, 'ó'),
(244, 'ô'),
(245, 'ő'),
(246, 'ö'),
(247, '÷'),
(248, 'ř'),
(249, 'ů'),
(250, 'ú'),
(251, 'ű'),
(252, 'ü'),
(253, 'ý'),
(254, 'ţ'),
(255, '˙')]
Using the 8th bit (and thus the codepoint range [128; 256)) solves the problem of handling languages with character sets different than that of American English, but introduces a lot of complexity – whenever you come across a text file with an unknown encoding, it might be in one of literally dozens of encodings. Additional drawbacks include:
how to handle multilingual text with characters from many different alphabets, which are not part of the same 8-bit encoding?
how to handle writing systems which have way more than 256 “characters”, e.g. Chinese, Japanese and Korean (CJK) ideograms?
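Both drawbacks are easy to run into in practice. In the sketch below, `can_encode` is a made-up helper which simply reports whether Python’s codec for a given encoding manages to encode a string.

```python
def can_encode(text, encoding):
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        # at least one character has no codepoint in this encoding
        return False
    return True

print(can_encode("Básník", "latin2"))        # True
print(can_encode("Básník 李白", "latin2"))    # False
```

As soon as the Czech word is joined by a Chinese name, Latin-2 throws in the towel.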
3.5. Unicode and UTF-8
For these purposes, a standard character set known as Unicode was developed which strives for universal coverage of (ultimately) all characters ever used in the history of writing, even adding new ones like emojis. Unicode is much bigger than the character sets we’ve seen so far: its most frequently used subset, the Basic Multilingual Plane, has \(2^{16}\) codepoints, but overall there is room for over 1.1 million codepoints, of which only a fraction has been assigned to characters so far, so there’s plenty of space to accommodate many more.
2**16
65536
Here’s just a small sample of the treasure trove of codepoints that is Unicode.
from unicodedata import name

print("\N{HORIZONTAL ELLIPSIS}")
for sample in (range(0x16a0, 0x16f1), range(0x1f600, 0x1f645)):
    for cp in sample:
        char = chr(cp)
        print(f"U+{cp:x}\t{char}\t{name(char)}")
    print("\N{HORIZONTAL ELLIPSIS}")
…
U+16a0 ᚠ RUNIC LETTER FEHU FEOH FE F
U+16a1 ᚡ RUNIC LETTER V
U+16a2 ᚢ RUNIC LETTER URUZ UR U
U+16a3 ᚣ RUNIC LETTER YR
U+16a4 ᚤ RUNIC LETTER Y
U+16a5 ᚥ RUNIC LETTER W
U+16a6 ᚦ RUNIC LETTER THURISAZ THURS THORN
U+16a7 ᚧ RUNIC LETTER ETH
U+16a8 ᚨ RUNIC LETTER ANSUZ A
U+16a9 ᚩ RUNIC LETTER OS O
U+16aa ᚪ RUNIC LETTER AC A
U+16ab ᚫ RUNIC LETTER AESC
U+16ac ᚬ RUNIC LETTER LONG-BRANCH-OSS O
U+16ad ᚭ RUNIC LETTER SHORT-TWIG-OSS O
U+16ae ᚮ RUNIC LETTER O
U+16af ᚯ RUNIC LETTER OE
U+16b0 ᚰ RUNIC LETTER ON
U+16b1 ᚱ RUNIC LETTER RAIDO RAD REID R
U+16b2 ᚲ RUNIC LETTER KAUNA
U+16b3 ᚳ RUNIC LETTER CEN
U+16b4 ᚴ RUNIC LETTER KAUN K
U+16b5 ᚵ RUNIC LETTER G
U+16b6 ᚶ RUNIC LETTER ENG
U+16b7 ᚷ RUNIC LETTER GEBO GYFU G
U+16b8 ᚸ RUNIC LETTER GAR
U+16b9 ᚹ RUNIC LETTER WUNJO WYNN W
U+16ba ᚺ RUNIC LETTER HAGLAZ H
U+16bb ᚻ RUNIC LETTER HAEGL H
U+16bc ᚼ RUNIC LETTER LONG-BRANCH-HAGALL H
U+16bd ᚽ RUNIC LETTER SHORT-TWIG-HAGALL H
U+16be ᚾ RUNIC LETTER NAUDIZ NYD NAUD N
U+16bf ᚿ RUNIC LETTER SHORT-TWIG-NAUD N
U+16c0 ᛀ RUNIC LETTER DOTTED-N
U+16c1 ᛁ RUNIC LETTER ISAZ IS ISS I
U+16c2 ᛂ RUNIC LETTER E
U+16c3 ᛃ RUNIC LETTER JERAN J
U+16c4 ᛄ RUNIC LETTER GER
U+16c5 ᛅ RUNIC LETTER LONG-BRANCH-AR AE
U+16c6 ᛆ RUNIC LETTER SHORT-TWIG-AR A
U+16c7 ᛇ RUNIC LETTER IWAZ EOH
U+16c8 ᛈ RUNIC LETTER PERTHO PEORTH P
U+16c9 ᛉ RUNIC LETTER ALGIZ EOLHX
U+16ca ᛊ RUNIC LETTER SOWILO S
U+16cb ᛋ RUNIC LETTER SIGEL LONG-BRANCH-SOL S
U+16cc ᛌ RUNIC LETTER SHORT-TWIG-SOL S
U+16cd ᛍ RUNIC LETTER C
U+16ce ᛎ RUNIC LETTER Z
U+16cf ᛏ RUNIC LETTER TIWAZ TIR TYR T
U+16d0 ᛐ RUNIC LETTER SHORT-TWIG-TYR T
U+16d1 ᛑ RUNIC LETTER D
U+16d2 ᛒ RUNIC LETTER BERKANAN BEORC BJARKAN B
U+16d3 ᛓ RUNIC LETTER SHORT-TWIG-BJARKAN B
U+16d4 ᛔ RUNIC LETTER DOTTED-P
U+16d5 ᛕ RUNIC LETTER OPEN-P
U+16d6 ᛖ RUNIC LETTER EHWAZ EH E
U+16d7 ᛗ RUNIC LETTER MANNAZ MAN M
U+16d8 ᛘ RUNIC LETTER LONG-BRANCH-MADR M
U+16d9 ᛙ RUNIC LETTER SHORT-TWIG-MADR M
U+16da ᛚ RUNIC LETTER LAUKAZ LAGU LOGR L
U+16db ᛛ RUNIC LETTER DOTTED-L
U+16dc ᛜ RUNIC LETTER INGWAZ
U+16dd ᛝ RUNIC LETTER ING
U+16de ᛞ RUNIC LETTER DAGAZ DAEG D
U+16df ᛟ RUNIC LETTER OTHALAN ETHEL O
U+16e0 ᛠ RUNIC LETTER EAR
U+16e1 ᛡ RUNIC LETTER IOR
U+16e2 ᛢ RUNIC LETTER CWEORTH
U+16e3 ᛣ RUNIC LETTER CALC
U+16e4 ᛤ RUNIC LETTER CEALC
U+16e5 ᛥ RUNIC LETTER STAN
U+16e6 ᛦ RUNIC LETTER LONG-BRANCH-YR
U+16e7 ᛧ RUNIC LETTER SHORT-TWIG-YR
U+16e8 ᛨ RUNIC LETTER ICELANDIC-YR
U+16e9 ᛩ RUNIC LETTER Q
U+16ea ᛪ RUNIC LETTER X
U+16eb ᛫ RUNIC SINGLE PUNCTUATION
U+16ec ᛬ RUNIC MULTIPLE PUNCTUATION
U+16ed ᛭ RUNIC CROSS PUNCTUATION
U+16ee ᛮ RUNIC ARLAUG SYMBOL
U+16ef ᛯ RUNIC TVIMADUR SYMBOL
U+16f0 ᛰ RUNIC BELGTHOR SYMBOL
…
U+1f600 😀 GRINNING FACE
U+1f601 😁 GRINNING FACE WITH SMILING EYES
U+1f602 😂 FACE WITH TEARS OF JOY
U+1f603 😃 SMILING FACE WITH OPEN MOUTH
U+1f604 😄 SMILING FACE WITH OPEN MOUTH AND SMILING EYES
U+1f605 😅 SMILING FACE WITH OPEN MOUTH AND COLD SWEAT
U+1f606 😆 SMILING FACE WITH OPEN MOUTH AND TIGHTLY-CLOSED EYES
U+1f607 😇 SMILING FACE WITH HALO
U+1f608 😈 SMILING FACE WITH HORNS
U+1f609 😉 WINKING FACE
U+1f60a 😊 SMILING FACE WITH SMILING EYES
U+1f60b 😋 FACE SAVOURING DELICIOUS FOOD
U+1f60c 😌 RELIEVED FACE
U+1f60d 😍 SMILING FACE WITH HEART-SHAPED EYES
U+1f60e 😎 SMILING FACE WITH SUNGLASSES
U+1f60f 😏 SMIRKING FACE
U+1f610 😐 NEUTRAL FACE
U+1f611 😑 EXPRESSIONLESS FACE
U+1f612 😒 UNAMUSED FACE
U+1f613 😓 FACE WITH COLD SWEAT
U+1f614 😔 PENSIVE FACE
U+1f615 😕 CONFUSED FACE
U+1f616 😖 CONFOUNDED FACE
U+1f617 😗 KISSING FACE
U+1f618 😘 FACE THROWING A KISS
U+1f619 😙 KISSING FACE WITH SMILING EYES
U+1f61a 😚 KISSING FACE WITH CLOSED EYES
U+1f61b 😛 FACE WITH STUCK-OUT TONGUE
U+1f61c 😜 FACE WITH STUCK-OUT TONGUE AND WINKING EYE
U+1f61d 😝 FACE WITH STUCK-OUT TONGUE AND TIGHTLY-CLOSED EYES
U+1f61e 😞 DISAPPOINTED FACE
U+1f61f 😟 WORRIED FACE
U+1f620 😠 ANGRY FACE
U+1f621 😡 POUTING FACE
U+1f622 😢 CRYING FACE
U+1f623 😣 PERSEVERING FACE
U+1f624 😤 FACE WITH LOOK OF TRIUMPH
U+1f625 😥 DISAPPOINTED BUT RELIEVED FACE
U+1f626 😦 FROWNING FACE WITH OPEN MOUTH
U+1f627 😧 ANGUISHED FACE
U+1f628 😨 FEARFUL FACE
U+1f629 😩 WEARY FACE
U+1f62a 😪 SLEEPY FACE
U+1f62b 😫 TIRED FACE
U+1f62c 😬 GRIMACING FACE
U+1f62d 😭 LOUDLY CRYING FACE
U+1f62e 😮 FACE WITH OPEN MOUTH
U+1f62f 😯 HUSHED FACE
U+1f630 😰 FACE WITH OPEN MOUTH AND COLD SWEAT
U+1f631 😱 FACE SCREAMING IN FEAR
U+1f632 😲 ASTONISHED FACE
U+1f633 😳 FLUSHED FACE
U+1f634 😴 SLEEPING FACE
U+1f635 😵 DIZZY FACE
U+1f636 😶 FACE WITHOUT MOUTH
U+1f637 😷 FACE WITH MEDICAL MASK
U+1f638 😸 GRINNING CAT FACE WITH SMILING EYES
U+1f639 😹 CAT FACE WITH TEARS OF JOY
U+1f63a 😺 SMILING CAT FACE WITH OPEN MOUTH
U+1f63b 😻 SMILING CAT FACE WITH HEART-SHAPED EYES
U+1f63c 😼 CAT FACE WITH WRY SMILE
U+1f63d 😽 KISSING CAT FACE WITH CLOSED EYES
U+1f63e 😾 POUTING CAT FACE
U+1f63f 😿 CRYING CAT FACE
U+1f640 🙀 WEARY CAT FACE
U+1f641 🙁 SLIGHTLY FROWNING FACE
U+1f642 🙂 SLIGHTLY SMILING FACE
U+1f643 🙃 UPSIDE-DOWN FACE
U+1f644 🙄 FACE WITH ROLLING EYES
…
Now, the most straightforward representation for \(2^{16}\) codepoints is what? Well, it’s simply using 16 bits per character, i.e. 2 bytes. That encoding exists, it’s called UTF-16 (“UTF” stands for “Unicode Transformation Format”), but consider the drawbacks:
we’ve lost ASCII compatibility by the simple fact of using 2 bytes per character instead of 1 (encoding “a” as 01100001 or 00000000|01100001, with the | indicating an imaginary boundary between bytes, is not the same thing)
encoding a string in a language which is mostly written down using basic letters of the Latin alphabet now takes up twice as much space (which is probably not a good idea, given the general dominance of English in electronic communication)
Looks like we’ll have to think outside the box. The box in question here is fixed-width encodings: all of the real-world encoding schemes we’ve encountered so far were fixed-width, meaning that each character was represented by either 7, 8 or 16 bits. In other words, you could jump around the string in multiples of 7, 8 or 16 and always land at the beginning of a character. (Not exactly true for UTF-16, because it is something more than just a “16-bit ASCII”: it has ways of handling characters beyond \(2^{16}\) using so-called surrogate sequences, but you get the gist.)
The smart idea that some bright people have come up with was to use a variable-width encoding, specifically one that doesn’t suck, unlike our encoding2. The most ubiquitous one currently is UTF-8, which we’ve already met in the HTML example above. UTF-8 is ASCII-compatible, i.e. the 1’s and 0’s used to encode text containing only ASCII characters are the same regardless of whether you use ASCII or UTF-8: it’s a sequence of 8-bit bytes. But UTF-8 can also handle many more additional characters, as defined by the Unicode standard, by using progressively longer and longer sequences of bits.
def print_utf8_bytes(char):
    """Prints binary representation of character as encoded by UTF-8.
    """
    # encode the string as UTF-8 and iterate over the bytes;
    # iterating over a sequence of bytes yields integers in the
    # range [0; 256); the formatting directive "{:08b}" does two
    # things:
    #   - "b" prints the integer in its binary representation
    #   - "08" left-pads the binary representation with 0's to a total
    #     width of 8, which is the width of a byte
    binary_bytes = [f"{byte:08b}" for byte in char.encode("utf8")]
    print(f"{char!r} encoded in UTF-8 is: {binary_bytes}")

print_utf8_bytes("A")   # the representations...
print_utf8_bytes("č")   # ... keep...
print_utf8_bytes("字")  # ... getting longer.
'A' encoded in UTF-8 is: ['01000001']
'č' encoded in UTF-8 is: ['11000100', '10001101']
'字' encoded in UTF-8 is: ['11100101', '10101101', '10010111']
How does that even work? The obvious problem here is that with a fixed-width encoding, you just chop up the string at regular intervals (7, 8, 16 bits) and you know that each interval represents one character. So how do you know where to chop up a variable width-encoded string, if each character can take up a different number of bits?
Essentially, the trick is to use some of the bits in the representation of a codepoint to store information not about which character it is (whether it’s an “A” or a “字”), but how many bits it occupies. This is what we did with our encoding2, albeit in a very primitive way, by simply using 0 as a character delimiter. The price to pay is that if you want to skip ahead 10 characters in a string encoded with a variable-width encoding, you can’t just skip 10 * 7 or 8 or 16 bits; you have to read all the intervening characters to figure out how much space they take up. Take the following example:
for char in "Básník 李白":
    print_utf8_bytes(char)
'B' encoded in UTF-8 is: ['01000010']
'á' encoded in UTF-8 is: ['11000011', '10100001']
's' encoded in UTF-8 is: ['01110011']
'n' encoded in UTF-8 is: ['01101110']
'í' encoded in UTF-8 is: ['11000011', '10101101']
'k' encoded in UTF-8 is: ['01101011']
' ' encoded in UTF-8 is: ['00100000']
'李' encoded in UTF-8 is: ['11100110', '10011101', '10001110']
'白' encoded in UTF-8 is: ['11100111', '10011001', '10111101']
Notice the initial bits in each byte of a character follow a pattern depending on how many bytes in total that character has:
if it’s a 1-byte character, that byte starts with 0
if it’s a 2-byte character, the first byte starts with 110 and the following one with 10
if it’s a 3-byte character, the first byte starts with 1110 and the following ones with 10
This makes it possible to find out which bytes belong to which characters, and also to spot invalid strings, as the leading byte in a multi-byte sequence always “announces” how many continuation bytes (= starting with 10) should follow.
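To see how a program can exploit those announcements, here is a rough sketch which chops UTF-8 encoded bytes back up into characters using nothing but the leading bits. The names `sequence_length` and `split_utf8` are made up for this illustration. The 4-byte case (leading byte starting with 11110) doesn’t occur in the example above, but it follows the same pattern. Note that this is not a full validator, real UTF-8 decoders perform additional checks.

```python
def sequence_length(lead_byte):
    """Read the announced length of a character off its leading byte."""
    if lead_byte >> 7 == 0b0:
        return 1
    if lead_byte >> 5 == 0b110:
        return 2
    if lead_byte >> 4 == 0b1110:
        return 3
    if lead_byte >> 3 == 0b11110:
        return 4
    raise ValueError(f"{lead_byte:08b} cannot start a character")

def split_utf8(data):
    """Chop UTF-8 encoded bytes into one chunk of bytes per character."""
    chunks = []
    position = 0
    while position < len(data):
        length = sequence_length(data[position])
        chunk = data[position:position + length]
        continuation = chunk[1:]
        # all bytes after the leading one must start with 10
        if len(chunk) < length or any(byte >> 6 != 0b10 for byte in continuation):
            raise ValueError(f"broken sequence at byte offset {position}")
        chunks.append(chunk)
        position += length
    return chunks

chunks = split_utf8("Básník 李白".encode("utf8"))
print([chunk.decode("utf8") for chunk in chunks])
```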
So much for a quick introduction to UTF-8 (= the encoding), but there’s much more to Unicode (= the character set). While UTF-8 defines only how integer numbers corresponding to codepoints are to be represented as 1’s and 0’s in a computer’s memory, Unicode specifies how those numbers are to be interpreted as characters, what their properties and mutual relationships are, what conversions (i.e. mappings between (sequences of) codepoints) they can undergo, etc.
Consider for instance the various ways diacritics are handled: “č” can be represented either as a single codepoint (LATIN SMALL LETTER C WITH CARON; all Unicode codepoints have cute names like this) or a sequence of two codepoints, the character “c” and a combining diacritic mark (COMBINING CARON). You can search for the codepoints corresponding to Unicode characters e.g. here and play with them in Python using the chr(0xXXXX) built-in function or with the special string escape sequence \uXXXX (where XXXX is the hexadecimal representation of the codepoint); both are ways to get the character corresponding to the given codepoint:
# "č" as LATIN SMALL LETTER C WITH CARON, codepoint 010d
print(chr(0x010d))
print("\u010d")
č
č
# "č" as a sequence of LATIN SMALL LETTER C, codepoint 0063, and
# COMBINING CARON, codepoint 030c
print(chr(0x0063) + chr(0x030c))
print("\u0063\u030c")
č
č
# of course, chr() also works with decimal numbers
chr(269)
'č'
This means you have to be careful when working with languages that use accents, because to a computer, the two possible representations are of course different strings, even though to you, they’re conceptually the same:
s1 = "\u010d"
s2 = "\u0063\u030c"
# s1 and s2 look the same to the naked eye...
print(s1, s2)
č č
# ... but they're not
s1 == s2
False
Watch out, they even have different lengths! This might come back to bite you if you’re trying to compute the length of a word in letters.
print("s1 is", len(s1), "character(s) long.")
print("s2 is", len(s2), "character(s) long.")
s1 is 1 character(s) long.
s2 is 2 character(s) long.
For this reason, even though we’ve been informally calling these Unicode entities “characters”, it is more accurate and less confusing to use the technical term “codepoints”.
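If what you’re after is the number of letters rather than the number of codepoints, one rough workaround is to skip combining marks while counting. The sketch below uses the standard `unicodedata.combining()` function, which returns 0 for codepoints that are not combining marks; the helper name `count_letters` is made up for this illustration, and the approach is only an approximation which won’t handle every writing system correctly.

```python
from unicodedata import combining

def count_letters(text):
    # count only the codepoints which are not combining marks
    return sum(1 for codepoint in text if combining(codepoint) == 0)

print(count_letters("\u010d"))          # 1
print(count_letters("\u0063\u030c"))    # 1
```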
Generally, most text out there will use the first, single-codepoint approach whenever possible, and pre-packaged linguistic corpora will try to be consistent about this (unless they come from the web, which always warrants being suspicious and defensive about your material). If you’re worried about inconsistencies in your data, you can perform a normalization:
from unicodedata import normalize
# NFC stands for Normal Form C; this normalization applies a canonical
# decomposition (into a multi-codepoint representation) followed by a
# canonical composition (into a single-codepoint representation)
s1 = normalize("NFC", s1)
s2 = normalize("NFC", s2)
s1 == s2
True
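Normalization also works in the other direction. As a small sketch (the variable names are chosen just for this example), NFD, or Normal Form D, stops after the canonical decomposition step, so it yields the multi-codepoint representation:

```python
from unicodedata import normalize

composed = "\u010d"          # LATIN SMALL LETTER C WITH CARON
decomposed = "\u0063\u030c"  # LATIN SMALL LETTER C + COMBINING CARON

# NFC composes, NFD decomposes
nfc = normalize("NFC", decomposed)
nfd = normalize("NFD", composed)

print(len(nfc), len(nfd))
print(nfc == composed, nfd == decomposed)
```

Which form you pick matters less than applying the same one consistently to all the strings you intend to compare.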
Let’s wrap things up by saying that Python itself uses Unicode
internally, but the encoding it defaults to when opening an external
file depends on the locale of the system (broadly speaking, the set of
region, language and character-encoding related settings of the
operating system). On most modern Linux and macOS systems, this will
probably be a UTF-8 locale and Python will therefore assume UTF-8 as
the encoding by default. Unfortunately, Windows is different. To be on
the safe side, whenever opening files in Python, you can specify the
encoding explicitly:
with open("unicode.ipynb", encoding="utf-8") as file:
    pass
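The same encoding argument applies when writing. Here's a minimal round-trip sketch; the file name example.txt and the sample sentence are made up for this illustration, and the file lives in a temporary directory so that nothing is left behind:

```python
import os
import tempfile

text = "Příliš žluťoučký kůň"

# the temporary directory is deleted at the end of the with block
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")
    # write and read back with the same, explicitly specified encoding
    with open(path, "w", encoding="utf-8") as file:
        file.write(text)
    with open(path, encoding="utf-8") as file:
        text_read = file.read()

print(text_read == text)
```

Because both calls to open() name the encoding, the result doesn't depend on the locale of the machine the code happens to run on.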
In fact, it’s always a good idea to specify the encoding explicitly,
using UTF-8 as a default if you don’t know, for at least two reasons
– it makes your code more:
portable – it will work the same across different operating systems which assume different default encodings;
and resistant to data corruption – UTF-8 is more restrictive than fixed-width encodings, in the sense that not all sequences of bytes are valid UTF-8.
That second point probably requires elaboration. For instance, if one
byte starts with the bits 11, then the following one must start with 10
(see above). If it starts with anything else, it’s an error. By contrast,
in a single-byte fixed-width encoding such as Latin-1, any sequence of
bytes is valid. Decoding will always succeed, but if you use the wrong
encoding, the result will be garbage, which you might not notice.
Therefore, it makes sense to default to UTF-8: if it works, then there’s
a good chance that the file actually was encoded in UTF-8 and you’ve read
the data in correctly; if it fails, you get an explicit error which
prompts you to investigate further.
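To make this concrete, here's a small sketch. It assumes a word which was saved in the Windows-1250 encoding (cp1250 in Python) and which we then try to read without knowing that; the word itself is just an example:

```python
# "čau" encoded as Windows-1250; "č" becomes the single byte 0xe8
data = "čau".encode("cp1250")

# 0xe8 starts with the bits 1110, so UTF-8 expects continuation bytes
# starting with 10 to follow, but the next byte is a plain ASCII "a"
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print("UTF-8 decoding failed:", err)

# Latin-1 assigns a character to every byte, so decoding succeeds silently
garbled = data.decode("latin-1")
print(garbled)
```

The UTF-8 attempt fails loudly, whereas Latin-1 quietly produces a different letter in place of "č", which is exactly the kind of corruption that's easy to overlook.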
Another good idea, when dealing with Unicode text from an unknown and unreliable source, is to look at the set of codepoints contained in it and eliminate or replace those that look suspicious. Here’s a function to help with that:
import unicodedata as ud
from collections import Counter
import pandas as pd
def inspect_codepoints(string):
    """Create a frequency distribution of the codepoints in a string.
    """
    char_frequencies = Counter(string)
    df = pd.DataFrame.from_records(
        (
            freq,
            char,
            f"U+{ord(char):04x}",
            # some codepoints (e.g. control characters like newline) have
            # no name; use a placeholder instead of raising a ValueError
            ud.name(char, "<unnamed>"),
            ud.category(char)
        )
        for char, freq in char_frequencies.most_common()
    )
    df.columns = ("freq", "char", "codepoint", "name", "category")
    return df
Depending on your font configuration, it may be very hard to spot the
two intruders in the sentence below. The frequency table shows the
string contains regular LATIN SMALL LETTER T and LATIN SMALL LETTER G,
but also their specialized but visually similar variants MATHEMATICAL
SANS-SERIF SMALL T and LATIN SMALL LETTER SCRIPT G. You might want to
replace such codepoints before doing further text processing…
inspect_codepoints("Intruders here, good 𝗍hinɡ I checked.")
 | freq | char | codepoint | name | category |
---|---|---|---|---|---|
0 | 5 | e | U+0065 | LATIN SMALL LETTER E | Ll |
1 | 5 |   | U+0020 | SPACE | Zs |
2 | 3 | r | U+0072 | LATIN SMALL LETTER R | Ll |
3 | 3 | d | U+0064 | LATIN SMALL LETTER D | Ll |
4 | 3 | h | U+0068 | LATIN SMALL LETTER H | Ll |
5 | 2 | I | U+0049 | LATIN CAPITAL LETTER I | Lu |
6 | 2 | n | U+006e | LATIN SMALL LETTER N | Ll |
7 | 2 | o | U+006f | LATIN SMALL LETTER O | Ll |
8 | 2 | c | U+0063 | LATIN SMALL LETTER C | Ll |
9 | 1 | t | U+0074 | LATIN SMALL LETTER T | Ll |
10 | 1 | u | U+0075 | LATIN SMALL LETTER U | Ll |
11 | 1 | s | U+0073 | LATIN SMALL LETTER S | Ll |
12 | 1 | , | U+002c | COMMA | Po |
13 | 1 | g | U+0067 | LATIN SMALL LETTER G | Ll |
14 | 1 | 𝗍 | U+1d5cd | MATHEMATICAL SANS-SERIF SMALL T | Ll |
15 | 1 | i | U+0069 | LATIN SMALL LETTER I | Ll |
16 | 1 | ɡ | U+0261 | LATIN SMALL LETTER SCRIPT G | Ll |
17 | 1 | k | U+006b | LATIN SMALL LETTER K | Ll |
18 | 1 | . | U+002e | FULL STOP | Po |
… because of course, for a computer, the word “thing” written with two different variants of “g” is really just two different words, which is probably not what you want:
"thing" == "thinɡ"
False
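How you replace such intruders depends on your data, but here's one possible sketch. It assumes that compatibility normalization (NFKC) is acceptable for your purposes, which is not always the case because it also rewrites things like ligatures and superscripts. NFKC maps MATHEMATICAL SANS-SERIF SMALL T to a plain “t”, but it leaves LATIN SMALL LETTER SCRIPT G alone, so that one is handled by a hand-made replacement table. Both the table LOOKALIKES and the helper clean are made up for this illustration:

```python
from unicodedata import normalize

# hand-made table for look-alikes which NFKC does not touch; the single
# entry (LATIN SMALL LETTER SCRIPT G -> g) is only an illustration
LOOKALIKES = str.maketrans({"\u0261": "g"})


def clean(string):
    """Apply NFKC normalization, then replace known look-alikes.

    Hypothetical helper, a sketch rather than a complete solution.
    """
    return normalize("NFKC", string).translate(LOOKALIKES)


# MATHEMATICAL SANS-SERIF SMALL T + "hin" + LATIN SMALL LETTER SCRIPT G
print(clean("\U0001d5cdhin\u0261") == "thing")
```

In practice, you'd extend the table with whatever suspicious codepoints turn up when you inspect your own data.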
So to sum up:
Unicode strives to be a universal character set. It contains a lot of characters, many of them very similar-looking yet different. Appearances can be deceptive: when in doubt, examine which codepoints you’re actually dealing with and/or normalize.
Unicode can be encoded using different encodings. Some are fixed-width (UTF-32, which we haven’t mentioned yet), some are almost fixed-width (UTF-16), some are variable-width (UTF-8).
UTF-8 has many desirable properties, so you should always use it when saving plain text files, and always assume it as a first try when opening files in an unknown encoding.
Internally, Python uses a custom representation of Unicode, which is none of the encodings we already mentioned.
The following functionality is useful for inspecting Unicode data in Python: the ord() and chr() built-in functions, the unicodedata standard library module, and the regex external package, which like the standard library re module implements regular expression support for Python, but unlike the latter, provides much more extensive Unicode support.