2. A tour of Python and NLTK

2.1. The most important piece of programming advice you’ll ever get™

Over the course of this book, you’ll be presented with a lot of information. Like, a lot. It makes no sense trying to memorize all of it on your first pass. No one expects you to. With programming as with anything else, practice makes perfect – as you keep using Python, the parts you use often enough will gradually enter your muscle memory. The parts that you don’t, you’ll have to look up as needed. I do that constantly when I write code that uses other people’s libraries that I don’t use every day.

Luckily, there’s this thing called the internet, where you can search for information using these things called search engines. They’re pretty smart nowadays, so you can basically ask them “how do I do X in Python” and they’ll probably come up with links to reasonable answers: until you reach an intermediate to advanced level, you can be pretty sure you’re not the first person on the internet to have that question. To help you make sense of the results you’re likely to get, here’s an overview of the domains that frequently appear among them:

  • https://stackoverflow.com/, and subdomains of https://stackexchange.com/: in the programming community, these are leading sites offering information in a user-generated question/answer format. If you have a question, it’s probably already been asked and answered there, maybe even multiple times. If it hasn’t, go ask it!

  • https://docs.python.org/: Python’s official documentation, including the many modules included in the standard library

  • subdomains of https://readthedocs.io/: documentation for additional packages, created by people all over the world; however, larger projects often host documentation on their own domain, e.g. https://pandas.pydata.org/

  • https://github.com: this is a site many people use to collaborate on developing software projects in the open (cf. open source). If you encounter weird behavior in a library, and that library’s community uses GitHub for development, it’s a good idea to check that library’s issue tracker to see whether that behavior might be a known bug, and if not, consider reporting it yourself.

2.2. How to use this chapter

Don’t try to memorize all the information in this chapter before moving on to the rest of the book. The purpose of this chapter is to familiarize you with basic words and concepts used throughout the book, so that when you run across them, you know where to look them up. That’s why they’re all in one place. There’s also quite a lot of them, and truly mastering them will get you quite far along your Python programming career, so that’s not the goal right now. The goal is to get acquainted, possibly skip ahead if you get bored, and definitely skip back for a refresher whenever things stop making sense.

2.3. Python notebooks: your fancy new calculator

Python notebooks are kind of like a calculator, just a lot more fancy than your typical calculator. But otherwise, they should feel pretty familiar – you type some stuff in, press a button, Python thinks for a while and spits back an answer.

The place where you type stuff is called a cell, and running a cell with the play button (▶) makes Python evaluate the code you typed in, i.e. your program. A very simple Python program could consist of just one number:

1
1

And Python’s answer is, unsurprisingly, the same as a calculator’s answer would be: the expression 1 evaluates to… Well, it evaluates to 1. Unlike in a calculator though, you can have multiple cells in a notebook, jump back and forth between them, edit and re-evaluate old ones, etc. Let’s create a new cell with the plus button (➕) and try some slightly more complicated expressions.

1 + 1
2

Cells can consist of multiple lines. Let’s demonstrate that with a comment. Comments are pieces of a program which are completely ignored by the computer; they’re included solely for the benefit of the humans reading the code. In Python, a comment starts with a hash mark, #; everything that follows it on the same line is a comment, which is often signaled by distinctive syntax highlighting.

# This is a comment.
# Python will completely ignore this line, the one above and the one below.
# They're here just for you, as the person who reads this code.
3 - 2
1

If you’d like to write more extensive prose commentary, consider using Markdown cells. In JupyterLab, use the notebook’s top toolbar to switch the cell type from Code to Markdown. Markdown is a lightweight markup language which allows you to add formatting such as italics and bold, hyperlinks and other features, but you don’t have to worry about that, you can mostly just write regular text and it’ll probably render just fine. If you’d like to learn more about Markdown features, you can always double-click on cells like this one in JupyterLab to see their source code, because they’re written in Markdown, or try this interactive tutorial. But let’s focus on Python right now.

The * operator is used for multiplication, / for division, // for floor division (division with the result rounded down to a whole number), % for modulo and ** for exponentiation.

2 * 3
6
4 / 3
1.3333333333333333
# floor division rounds the result down to the nearest whole number; for
# positive numbers, that means chopping off the part after the decimal point
4 // 3
1
2**3
8
# modulo wraps counting around a certain number, so you can use it e.g.
# for converting 24h clock hours to 12h clock hours
16 % 12
4
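
One detail worth checking for yourself is how // and % behave with negative numbers. The sketch below uses only built-in operators and the built-in int() function; the variable names are made up for illustration.

```python
# // rounds down, i.e. towards negative infinity
positive = 7 // 2      # 3
negative = -7 // 2     # -4, not -3

# int() is what actually chops off the fractional part (rounds towards zero)
chopped = int(-7 / 2)  # -3

# % is consistent with //: (a // b) * b + a % b gives back a
remainder = -7 % 2     # 1

print(positive, negative, chopped, remainder)
```

For positive numbers, rounding down and chopping off the fractional part give the same result, which is why the difference is easy to miss.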

A decimal point, ., is used to separate the whole number part from the fractional part of a number.

0.1 + 0.1
0.2

At this point, if you’re following along interactively in JupyterLab, you’re probably sick and tired of reaching for your mouse to press ▶ all the time. JupyterLab has many handy keyboard shortcuts which I’ll let you discover on your own at your leisure, but the following three are such a huge quality of life improvement that I feel compelled to point them out:

  • Alt+Enter: evaluate, insert new cell and switch to it

  • Shift+Enter: evaluate and switch to next (existing) cell

  • Ctrl+Enter: evaluate and stay at current cell

2.4. Text as strings of characters

But enough about numbers! We’re linguists, so let’s look at how we can represent text. The most basic way of representing text in Python is as one long string of characters. Strings can be created using quotes, both single and double work fine.

"Hello, world!"
'Hello, world!'
'Hello, world!'
'Hello, world!'
"It's me, world!"
"It's me, world!"

Though you can’t just put a single quote inside a single-quote-delimited string because, well, Python would think that’s where the string ends.

'It's me, world!'
  File "/tmp/ipykernel_56602/1132477001.py", line 1
    'It's me, world!'
        ^
SyntaxError: invalid syntax

What Python spits out in this case is an error traceback, telling you that something went wrong. Python usually gives extremely helpful error messages that help you diagnose what went wrong, where the error occurred and how you got there, so it’s well worth spending time reading them and learning to decipher them (it’s not that hard, it just takes practice). The one exception to this is when Python can’t even parse your program as valid Python code. In that case, the best it can do is tell you where it got stuck, raise a SyntaxError, and leave you to figure out what the problem is and how to fix it on your own. Which is what happened here.

The easiest workaround when you want your string to contain one kind of quote is to use the other kind of quote to delimit it, like we did in the second-to-last code cell. Sometimes however, you want both kinds of quotes. In that case, you can use the backslash character \ to escape the special, string-terminating meaning of a quote, and make it into a regular character which is part of the string.

"\"It's me, world!\" she said."
'"It\'s me, world!" she said.'

Escaping means canceling the default meaning of a character or character sequence and using an alternative one. In a double-quoted string, " normally means “end the string here”, but a preceding \ changes that to “just insert a double quote at this point in the string”. Escape sequences exist to help you put characters in your strings which you just can’t put there literally, or which are hard to type on your keyboard. For instance, you can’t just put a newline character inside a string:

"one line
another line"
  File "/tmp/ipykernel_56602/1697716864.py", line 1
    "one line
             ^
SyntaxError: EOL while scanning string literal

Grr, another SyntaxError. What you can do is use the \n escape sequence to represent that newline, without having an actual newline in your source code and triggering that SyntaxError.

"one line\nanother line"
'one line\nanother line'

How do you know that \n is really a newline if the string still just shows \n? When you evaluate a string, Python shows you its canonical representation which you could put back in your code. This means you get a good idea of what the string contains – the newline is otherwise a sort of phantom character, as is most whitespace, but you can actually see it here – but it’s not very pretty nor readable. If you want to get a rendered version of your string, as it would appear in a text file, you need to use the print() function.

print("one line\nanother line")
one line
another line

Ah, much better. At least for reading. For inspecting the contents of strings, the default behavior is in fact very useful.
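
If you want both views of a string outside a notebook, a minimal sketch could look like this. It relies on the built-in repr() function, which returns the canonical representation as a string of its own; the variable names are just for illustration.

```python
text = "one line\nanother line"

# repr() gives the canonical representation that the notebook shows
# when you evaluate a string on its own
canonical = repr(text)
print(canonical)

# print() on the string itself gives the rendered version
print(text)

# the newline counts as a single character, even though \n takes two to type
print(len(text))  # 21
```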

There are various other escape sequences you can use in Python strings, another handy and commonly encountered one is \t for the tab character:

print("one\ttwo\nthree\tfour")
one	two
three	four

And less commonly seen but pretty neat as well is \N{...}, for inserting characters based on their Unicode names:

"\N{see-no-evil monkey}"
'🙈'

If you want to type a longer piece of text into a Python string though, that newline escape thing can become really annoying – who wants to type several paragraphs as one long line interspersed with \ns? Fear not, triple-quoted strings to the rescue:

"""one line
another line"""
'one line\nanother line'
print('''one line
another line''')
one line
another line

Inside a triple-quoted or multiline string, you can put anything you like, including actual newlines… except of course a sequence of three quotes of the same type that you’re using to delimit the string. In that case, you need to escape at least some of them with that backslash again.
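
Here is a small sketch of both ways around that restriction: escaping the quotes, or simply switching to the other kind of triple quote. The example sentence is made up.

```python
# escape the quotes so that Python doesn't take them for the end of the string
escaped = """She typed \"\"\" and kept going."""

# or delimit the string with the other kind of triple quote
switched = '''She typed """ and kept going.'''

print(escaped)
print(escaped == switched)  # True
```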

All these various ways of creating strings are called string literals. A literal is a dedicated syntax for creating one type of data. In the previous section, we saw how to write down number literals in Python – the representation was so straightforward that we didn’t even think it needed a special term like ‘literal’ to describe it. Python has a few more core, built-in data structures which have dedicated literal syntax; we’ll encounter them below.

2.5. Objects and variables

Writing Python code consists of interacting with and manipulating various objects. Object is a generic term for anything you can inspect by putting it inside a code cell and evaluating that cell. So far, we’ve seen numbers, strings and one function (that’s right, functions are objects too).

print
<function print>

If you know you’ll be using an object repeatedly, it’s a good idea to store it in a variable using the assignment operator, =. That way, you don’t have to keep writing its literal over and over. The variable name is entirely up to you, though there are some rules you must abide by; for instance, you can’t use spaces, so these are often replaced with underscores _.

string = "one\ttwo\nthree\tfour"
string
'one\ttwo\nthree\tfour'

Anywhere you want to use that object, you can then refer to it using the name of the variable that points to it.

print(string)
one	two
three	four
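
As for the naming rules mentioned above, you can ask Python itself whether a name would be acceptable. This is a rough sketch, assuming that "acceptable" means a valid identifier which isn't a reserved keyword; it uses the .isidentifier() string method and the standard library's keyword module, and the candidate names are invented for illustration.

```python
import keyword

candidates = ["my_string", "my string", "2nd_string", "string2", "for"]

# a usable variable name is a valid identifier which isn't a reserved keyword
results = {}
for name in candidates:
    results[name] = name.isidentifier() and not keyword.iskeyword(name)

print(results)
```

Only my_string and string2 pass: the others contain a space, start with a digit, or are reserved by the language.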

Multiple names can refer to the same object.

another_name = string
print(another_name)
one	two
three	four

Whether two names refer to the same object or not can be checked using the is operator.

string is another_name
True
string is print
False
string2 = "one\ttwo\nthree\tfour"
# NOTE: outside the notebook environment (e.g. in a script), Python is
# often smart enough to notice that string and string2 contain the same
# characters; it then stores just one copy of the string to save memory
# and points both variables at it, in which case the expression below
# evaluates to True
string is string2
False

When two objects are not the same, they might still be equal, in that they look the same, they have the same contents. You can check for that using the equality operator, ==.

string == string2
True

A quick way to remember that using a real-world analogy: for twins, is would be false, since they’re two different people, but == would be true, since they look the same.
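
Since Python may quietly share identical strings, the difference is easier to reproduce reliably with lists (covered later in this chapter), where each literal creates a fresh object. A minimal sketch, with made-up variable names:

```python
first = [1, 2, 3]
second = [1, 2, 3]  # same contents, but a separate object
alias = first       # another name for the very same object

print(first == second)  # True: equal contents
print(first is second)  # False: two different objects
print(first is alias)   # True: one object, two names
```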

The == is also a good opportunity to check that numbers and strings really are two completely different things in Python:

42 == "42"
False

2.6. Attributes: objects on objects (on objects…)

Only rarely is an object an island entire of itself: most objects have other objects attached to them as attributes, and these in turn have attributes of their own, etc. We saw that the period character . is used as the decimal separator in number literals. The other, probably more important use of . in Python is for accessing those attributes.

Say for instance that we have a complex number, which consists of two parts, a real and an imaginary one. This is the literal syntax for complex numbers in Python:

c = 1 + 2j
c
(1+2j)

Let’s not worry about what complex numbers are good for right now, we’re interested in attributes. Complex numbers store their real and imaginary parts as .real and .imag attributes, respectively.

c.real
1.0
c.imag
2.0

2.6.1. Methods

Functions are regular objects and as such, they can also be attached to other objects as attributes, as little snippets of dynamic behavior which do something interesting with the parent object, instead of just storing static data. Functions attached to an object as attributes are more commonly referred to as that object’s methods, but they’re basically just functions. For instance, inspect the print() function we already saw above, and compare it with the .conjugate() method.

print
<function print>
c.conjugate
<function complex.conjugate>

Unlike regular data though, functions are generally not meant to be inspected, they’re meant to be called using function call syntax, i.e. by appending () to the function name. Calling a function triggers its behavior, it runs the piece of code that’s associated with it. For instance, the print() function prints objects to the screen, as we saw previously. The .conjugate() method computes the complex number’s conjugate.

c.conjugate()
(1-2j)
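
To see that a method really is just an object like any other, you can store it in a variable first and call it later. A small sketch; the name method is made up for illustration.

```python
c = 1 + 2j

# retrieving the attribute doesn't run anything yet...
method = c.conjugate

# ...the behavior is only triggered by the call syntax ()
print(method())              # (1-2j)

# the same function can also be reached via the complex type itself,
# in which case the number has to be passed in explicitly
print(complex.conjugate(c))  # (1-2j)
```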

Let’s drop complex numbers as the running example and come back to strings – they have a much greater variety of interesting methods we can explore. If you’re running this notebook interactively inside JupyterLab, a great feature which helps you do so is tab completion. If you type a variable name + . and hit the Tab key, a menu should come up with all the attributes available on the object. Try it!

# type string. and press Tab on your keyboard
['capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
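
Outside JupyterLab, where tab completion may not be available, a similar list can be obtained with the built-in dir() function. This is a sketch which filters out names starting with an underscore, on the assumption that those are not the ones you're after when exploring.

```python
string = "one\ttwo\nthree\tfour"

# dir() returns the names of all attributes as a sorted list of strings
public = [name for name in dir(string) if not name.startswith("_")]

print(len(public))
print(public[:5])  # ['capitalize', 'casefold', 'center', 'count', 'encode']
```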

You can see there’s quite a lot going on in there. There’s a whole lot of methods starting with the prefix is*, which are there to answer some questions you might have about the contents of your strings.

"cat".islower()
True
"DOG".isupper()
True
"Frank".istitle()
True
"42".isnumeric()
True

If you’d like to learn more about an object, you can use JupyterLab’s interactive help feature. Type a question mark ? before the name of the object you’re interested in, evaluate the cell, and detailed information about that object will be shown. For instance, if we’re interested in what the .isprintable() method does:

?string.isprintable
Signature: ()
Docstring:
Return True if the string is printable, False otherwise.

A string is printable if all of its characters are considered printable in
repr() or if it is empty.
Type:      builtin_function_or_method

The question mark is not regular Python syntax, it only works inside notebooks and is intended to make interactive exploration easier. Stock Python also has a built-in help() function which does something similar, but it displays less information. If you want even more details, you can insist by typing two question marks ?? instead of one:

??string.isprintable
Signature: ()
Docstring:
Return True if the string is printable, False otherwise.

A string is printable if all of its characters are considered printable in
repr() or if it is empty.
Type:      builtin_function_or_method

… though sometimes (as in this case), more detail might not be available and the help output will be the same as with one question mark. For convenience, the question mark(s) can also go behind the object you’re taking a peek at.

string.isprintable?
Signature: ()
Docstring:
Return True if the string is printable, False otherwise.

A string is printable if all of its characters are considered printable in
repr() or if it is empty.
Type:      builtin_function_or_method

And another way to trigger interactive help is by pressing Shift-Tab while your typing cursor is inside an object’s name. A floating window will pop up with the help contents inside. This is useful for quick checks, because you don’t even have to evaluate the cell.
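
The docstring shown in the help output is itself stored on the object as an attribute, so in plain Python (where ? is not available) you can get at it directly. A minimal sketch using the __doc__ attribute:

```python
# the text shown under "Docstring:" lives in the __doc__ attribute
doc = str.isprintable.__doc__
print(doc)

# methods reached through an instance carry the same docstring
print("some string".isprintable.__doc__ == doc)  # True
```

The exact wording of the docstring may differ slightly between Python versions.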

There are also several methods which allow you to create new strings with some of the characters changed.

# convert to upper case
"klein".upper()
'KLEIN'
# convert to lower case
"GROß".lower()
'groß'
# convert to lower case in an even more aggressive way, which is usually
# the safer option if you really want to make sure all case distinctions
# are ignored
"GROß".casefold()
'gross'
"pride and prejudice".capitalize()
'Pride and prejudice'
"pride and prejudice".title()
'Pride And Prejudice'
# remove leading and trailing whitespace
"""

                       floating in space

""".strip()
'floating in space'

Some of these methods even require additional arguments to do their work. For instance, if you want to .replace() part(s) of the string with something else, you need to tell Python what to replace with what. Arguments are written out between the parentheses doing the function call, and if you can’t remember what they are or what order they come in, that’s precisely what interactive help is there for!

"I love cats and categories.".replace("cat", "dog")
'I love dogs and dogegories.'
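
If that result is not quite what you had in mind, here are two possible ways out, sketched under the assumption that only the animal should change: the optional third argument of .replace() limits the number of replacements, and a longer search string is more specific.

```python
sentence = "I love cats and categories."

# the optional third argument caps the number of replacements
first_only = sentence.replace("cat", "dog", 1)
print(first_only)  # I love dogs and categories.

# alternatively, search for something more specific
plural = sentence.replace("cats", "dogs")
print(plural)      # I love dogs and categories.
```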

Note that none of these methods modify the original string, they just use it to derive what the new string should look like.

animal = "cat"
animal.upper()
'CAT'
animal
'cat'

This is because strings (and numbers) in Python are immutable – you can’t change them in place. You can create new strings and re-assign them to old variable names, but the old strings will always stay the way they were at the beginning.

# create a string and give it two names
name1 = "string"
name2 = name1
# create a new string which is an uppercase version of the original
# string, and re-assign it to the name2 variable
name2 = name2.upper()
name2
'STRING'
# but the old string is still there, undisturbed
name1
'string'

We’ll talk more about immutability in the next section about collections, because unlike strings and numbers, some Python objects can actually be modified in place.

Using tab completion and interactive help (? or Shift-Tab), I encourage you to familiarize yourself with the methods on string objects, or indeed on any new object type that you come across and intend to use, so that you know what’s available and get a picture of what you can use the object for. No need to go through them exhaustively, just skim the list, read the descriptions of some of the methods that sound particularly useful, leave those that sound confusing for later (or never).

2.7. Collections

Collections or containers are objects intended to contain other objects, so that you can conveniently manipulate them together. We saw above that almost all Python objects in some sense “contain” other objects via their attributes, but those attributes are somehow intimately tied to how any given object works. The whole point of collections is that they (mostly) don’t care what you put into them or what you take out; they’ll just happily keep an eye on it for you and allow you to juggle and re-arrange the items in clever and succinct ways.

To motivate the need for collections, imagine that you want to store the individual tokens in a sentence, “Let it be.”. Without collections, you would have to use separate variables:

string1 = "Let"
string2 = "it"
string3 = "be"
string4 = "."

This gets really tedious really quickly. Instead, you can use a Python list:

strings = ["Let", "it", "be", "."]

Much better!

2.7.1. Collection literals

We’ll start getting acquainted with Python’s builtin collections by learning about their literals, i.e. the special syntax used to create them. We’ll cover lists, tuples, dictionaries and sets. We’ve previously covered strings, which can also be seen as collections. The string "abc" is a collection of characters, or more accurately, a collection of three strings of length 1: "a", "b" and "c".

We’ve already met lists. In general, Python really doesn’t care what kinds of objects you store in your collections, it’s entirely up to you, so you can mix and match at will.

[1, "two", print]
[1, 'two', <function print>]

Lists are great for storing tokenized text:

["Help", "!", "I", "need", "somebody", ",", "help", "!"]
['Help', '!', 'I', 'need', 'somebody', ',', 'help', '!']

This is how you create an empty list:

[]
[]

Closely related to lists are tuples (we’ll discuss the differences below).

(1, "two", print)
(1, 'two', <function print>)

In many cases, the parentheses are actually optional. You’ll probably see me using tuples when I want to output multiple objects from a code cell, because Python only uses the last expression in the cell as its output value.

1
"two"
print
<function print>
1, "two", print
(1, 'two', <function print>)

Whenever you see commas without any parentheses around them, it’s a tuple. You’ll learn over time when it’s safe to omit the parentheses, but until then, it might be a good idea to play it safe and always use them. This means “take the number 1, the result of the comparison 2 < 3, and the number 4, and create a 3-tuple out of them”:

1, 2 < 3, 4
(1, True, 4)

Whereas this says, “create a 2-tuple (1, 2) and a 2-tuple (3, 4) and only then do a comparison of the resulting tuples”:

(1, 2) < (3, 4)
True

As you can see, the parentheses work in the same way as in math, as a precedence operator: they say “first create the tuples and then do the rest”, much as \((4 + 3) * 2\) says “first do the addition, then the multiplication”, overriding the default \(4 + 3 * 2\) which goes the other way round. One place they’re never optional though is when creating an empty tuple.

()
()
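
A consequence of commas being what makes a tuple is that a tuple with a single element needs a trailing comma; parentheses alone are just grouping. A short sketch, with illustrative variable names:

```python
just_a_number = (1)  # parentheses only group, so this is the number 1
one_tuple = (1,)     # the trailing comma makes it a 1-tuple
also_one_tuple = 1,  # works without parentheses too

print(type(just_a_number).__name__)  # int
print(type(one_tuple).__name__)      # tuple
print(one_tuple == also_one_tuple)   # True
```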

Moving on, we have dictionaries. Dictionaries are different in that their purpose is not to store only values, but key–value pairs. We say that they map keys to values, kind of like real-world dictionaries map words in one language to another.

{"cat": "chat", "dog": "chien"}
{'cat': 'chat', 'dog': 'chien'}

A constraint on dictionaries is that the keys must be unique. If you provide multiple values per key, only the last one will be retained.

{"odd": 1, "even": 2, "odd": 3, "even": 4, "odd": 5}
{'odd': 5, 'even': 4}

If you need multiple values per key, well… Just store a collection as the value instead!

{"odd": [1, 3, 5], "even": [2, 4]}
{'odd': [1, 3, 5], 'even': [2, 4]}

And this is how you create an empty dictionary:

{}
{}

It may not seem like it at first glance, but dictionaries are an extremely powerful and versatile data structure, and they’re the backbone upon which Python is built – they’re used everywhere.

Sets have a literal syntax which somewhat resembles that of dictionaries: it also uses curly braces {}, but no colons :, because sets again store just values, not key–value pairs. But they do require that their values be unique and throw away any duplicates, so there is a conceptual similarity with dictionaries which motivates that syntactic similarity.

{1, 2, 3, 1, 2, 3}
{1, 2, 3}

This deduplication behavior makes sets great for deriving vocabularies of unique words.

{"the", "cat", "sat", "on", "the", "mat"}
{'cat', 'mat', 'on', 'sat', 'the'}
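
In practice, the words usually come from a text rather than being typed out one by one. A minimal sketch, assuming that splitting on whitespace with the .split() string method is good enough as tokenization for this sentence:

```python
sentence = "the cat sat on the mat"

tokens = sentence.split()  # split on whitespace
vocabulary = set(tokens)   # duplicates are thrown away

print(len(tokens))         # 6
print(len(vocabulary))     # 5
print(sorted(vocabulary))  # ['cat', 'mat', 'on', 'sat', 'the']
```

The built-in sorted() function is used for printing because sets don't promise any particular order.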

Since {} is already taken to mean empty dictionary, empty set literals actually look like a function call:

set()
set()

This is a somewhat ugly inconsistency in a language that otherwise tries hard to be consistent, but oh well, what can you do.

2.7.2. len(): number of items in collection

The len() function works on all collections. It tells you how many elements a collection has.

len([1, 2, 3])
3
len("Norwegian Wood")
14

In the case of dictionaries, it counts the number of key–value pairs.

len({"cat": "chat", "dog": "chien"})
2

2.7.3. in: checking collection membership

The in operator also works on all collections. It tells you whether the collection contains a given element.

1 in [1, 2, 3]
True
"one" in {1, 2, 3}
False

For dictionaries, it tests against keys, not values.

"cat" in {"cat": "chat", "dog": "chien"}
True
"chat" in {"cat": "chat", "dog": "chien"}
False
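
Two related cases are worth trying out: testing against the values of a dictionary, and using in on strings, where it checks for substrings rather than single characters only. A small sketch:

```python
en2fr = {"cat": "chat", "dog": "chien"}

# to test against the values, ask for them explicitly
print("chat" in en2fr.values())  # True

# with strings, in checks whether the left string occurs inside the right one
print("cat" in "category")       # True
print("tac" in "category")       # False
```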

2.7.4. [...]: retrieving and modifying collection elements

Square brackets placed right after a collection retrieve one of its elements: by position for sequences such as lists (counting starts at 0), by key for dictionaries. Combined with the assignment operator =, the same syntax replaces the element.

lst = [1, 2, 3]
lst[0]
1
lst[0] = 100
lst
[100, 2, 3]

For sequences, i.e. collections which naturally preserve order – lists, tuples and strings – you can also extract slices.

string = "Can't buy me love"
string[13:]
'love'
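
A few more variations on the [...] syntax, as a sketch: negative positions count from the end, slices take a start and an end, dictionaries are indexed by key, and immutable collections such as tuples refuse assignment. The example data is made up.

```python
tokens = ["Let", "it", "be", "."]
print(tokens[-1])   # .
print(tokens[1:3])  # ['it', 'be'] (the end position is not included)

en2fr = {"cat": "chat", "dog": "chien"}
print(en2fr["cat"])       # chat
en2fr["bird"] = "oiseau"  # assigning to a new key adds a pair
print(len(en2fr))         # 3

# tuples can be read like lists, but not modified in place
frozen = ("Let", "it", "be")
try:
    frozen[0] = "Leave"
except TypeError as error:
    message = str(error)
print(message)  # 'tuple' object does not support item assignment
```

Strings behave like tuples in this respect: you can index and slice them, but not assign to their elements.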

2.7.5. del: removing collection elements

The del statement removes elements from collections; you point at the element to remove with the [...] operator.

lst = [1, 2, 3]
del lst[1]
lst
[1, 3]
dct = {"one": 1, "two": 2}
del dct["one"]
dct
{'two': 2}

It also works for variables.

# create a variable
num = 1
num
1
# poof! it's gone
del num
num
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/tmp/ipykernel_56602/56445515.py in <module>
      1 # poof! it's gone
      2 del num
----> 3 num

NameError: name 'num' is not defined

2.7.6. Converting between collections

If you want to convert between the different types of collections, you can mostly use built-in functions named after the target collection. For instance, if you want to turn a set into a list:

list({1, 2, 3, 1, 2, 3})
[1, 2, 3]

Or a list into a tuple:

tuple([1, 2, 3])
(1, 2, 3)

Or a tuple into a set:

set((1, 2, 3, 1))
{1, 2, 3}

With dictionaries, it’s slightly more complicated, as they don’t contain only values, but key–value pairs. When converting from a dictionary, you thus have to decide whether you want just the keys (the default, but you can also request it explicitly with the .keys() method), just the values, or key–value pairs – so-called .items().

en2fr = {"cat": "chat", "dog": "chien"}
list(en2fr)
['cat', 'dog']
list(en2fr.keys())
['cat', 'dog']
tuple(en2fr.values())
('chat', 'chien')
set(en2fr.items())
{('cat', 'chat'), ('dog', 'chien')}

Conversely, when converting to a dictionary, you have to provide a collection which can be interpreted as containing both keys and values, otherwise you can’t really build a dictionary out of it. One possible option is a list of 2-tuples.

dict([("a", "b"), ("c", "d")])
{'a': 'b', 'c': 'd'}

But it’s definitely not the only one. Try to understand and describe what’s going on in the next cell!

dict(["ab", "cd"])
{'a': 'b', 'c': 'd'}

The dict function also allows you to create a fresh dictionary in a way that may be slightly easier to type, with fewer curly braces and quotes, if your keys are strings which also happen to be valid identifiers (i.e., they could be used as variable names).

dict(cat="chat", dog="chien")
{'cat': 'chat', 'dog': 'chien'}

Strings are kind of the odd one out in this company because the str function doesn’t convert another collection to a string, at least not in the same sense the other functions we’ve seen work. It returns a string representation of the collection, intended to suggest how you could create such a collection using literal syntax.

str([1, 2, 3])
'[1, 2, 3]'
str(en2fr)
"{'cat': 'chat', 'dog': 'chien'}"

In the other direction, the other collection functions split strings at character boundaries.

list("abracadabra")
['a', 'b', 'r', 'a', 'c', 'a', 'd', 'a', 'b', 'r', 'a']
set("abracadabra")
{'a', 'b', 'c', 'd', 'r'}

If you want to split anywhere else, you’ll have to use the .split() method on strings. By default, it splits on whitespace, any amount and any kind of it.

"  foo\nbar  \n baz   qux  ".split()
['foo', 'bar', 'baz', 'qux']

But you can also tell it explicitly what string to use as a delimiter, and in that case, it’ll follow your orders to the letter.

"  foo\nbar  \n baz   qux  ".split("\n")
['  foo', 'bar  ', ' baz   qux  ']

Even creating empty strings if two designated delimiters immediately adjoin each other.

"  foo\nbar  \n baz   qux  ".split(" ")
['', '', 'foo\nbar', '', '\n', 'baz', '', '', 'qux', '', '']

The delimiter can consist of multiple characters.

"the cat sat on the mat".split("at")
['the c', ' s', ' on the m', '']
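
The opposite of .split() is the .join() method, which is called on the delimiter and glues a collection of strings back together. A minimal sketch of the round trip:

```python
original = "the cat sat on the mat"

# split on whitespace, then glue back together with single spaces
tokens = original.split()
print(" ".join(tokens) == original)  # True

# the same works with any delimiter, including a multi-character one
pieces = original.split("at")
print("at".join(pieces) == original)  # True

print("-".join(["a", "b", "c"]))  # a-b-c
```

Note that the round trip via whitespace splitting only gives back the original because the words were separated by single spaces to begin with.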

2.7.7. Combining collections
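
A minimal sketch of some common ways of combining collections, using only built-in operators and methods; the example data is made up, and this is by no means an exhaustive list.

```python
# sequences of the same type can be concatenated with +
tokens = ["Let", "it"] + ["be", "."]
triple = (1, 2) + (3,)
word = "rain" + "coats"

# sets can be combined with | (union) and & (intersection)
union = {1, 2} | {2, 3}
common = {1, 2} & {2, 3}

# dictionaries can be merged into a new one with ** unpacking...
en2fr = {"cat": "chat"}
more = {"dog": "chien"}
merged = {**en2fr, **more}

# ...or updated in place with the .update() method
en2fr.update(more)

print(tokens, triple, word, union, common, merged, sep="\n")
```

All of these create new objects, except for .update(), which modifies the dictionary it's called on.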

2.7.8. Further exploration

The character, specificities and possible use cases of each collection type are further revealed by the methods they expose. We’ll point them out as we encounter them in practice throughout the rest of the book, but if you’re curious, I encourage you to play around with the individual collections and explore their abilities via the previously described tab completion + interactive help approach.

2.8. Importing additional libraries

We’re about to dive into the magical world of conditionals and for-loops, but to make it more interesting, I thought we’d throw in some data and tools provided by the NLTK (which stands for Natural Language Toolkit) library. In order to do that however, we need to know how to import it, so bear with me for this short interlude.

In every Python session, some core functions and data types are available by default – everything we’ve seen so far is part of these so-called built-ins. In and of themselves, they’re already amazingly useful and allow you to do lots of stuff, but if everyone always had to start from these basic building blocks, programming would be repetitive and tedious. That’s why people build reusable pieces of code that can be imported into Python to extend its functionality. These are called libraries or packages or modules. Strictly speaking, these terms mean slightly different things, but informally, they can be used interchangeably.

Import syntax in Python is simple and intuitive; it has a few basic variations which we’ll presently go through.

import nltk

This imports the nltk module and creates an nltk variable which you can use to access the objects inside the module via attribute syntax. For instance, the word_tokenize() function splits text into words, or technically, tokens.

nltk.word_tokenize("Let it be.")
['Let', 'it', 'be', '.']

Notice how by default, Python tries to keep your objects and imported objects in separate namespaces, so that they don’t collide. If you happen to have previously defined a word_tokenize() function of your own, importing the nltk module won’t clobber it, because nltk’s word_tokenize() function is kept tucked away in the nltk namespace.

Of course, it may be the case that you actually have a previously created object named nltk that you don’t want to clobber. If so, then you can use renaming imports to pick the namespace yourself.

import nltk as ling

This imports the nltk module, but stores it in the variable / namespace ling instead of nltk.

ling.word_tokenize("Dig a pony.")
['Dig', 'a', 'pony', '.']

If you know you’re going to be using a specific object a lot and you don’t want to go to the trouble of typing the namespace prefix nltk. over and over again, then you can specifically request that it be added to your own namespace with the following syntax (notice that it’s customary to separate imports from regular code by at least one empty line):

from nltk import word_tokenize

word_tokenize("Two of us wearing raincoats.")
['Two', 'of', 'us', 'wearing', 'raincoats', '.']

And of course, you can combine this with renaming if necessary.

from nltk import word_tokenize as tokenize

tokenize("I, me, mine.")
['I', ',', 'me', ',', 'mine', '.']
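It may help to realize that all of these import forms give you the very same objects, just under different names. The sketch below checks this with the standard library’s math module, used here instead of nltk only so that it runs anywhere; the names m and square_root are arbitrary choices for illustration.

```python
import math
import math as m
from math import sqrt
from math import sqrt as square_root

# Both names refer to one and the same module object.
print(math is m)  # True

# All three names refer to one and the same function object.
print(math.sqrt is sqrt is square_root)  # True
```

So picking one form over another is a matter of readability and convenience, not of what ends up being loaded.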

By the way, if you ever accidentally overwrite a built-in function with another object, you can restore it by importing it from the builtins module.

# oops
len = 5
len("five")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_56602/474153094.py in <module>
      1 # oops
      2 len = 5
----> 3 len("five")

TypeError: 'int' object is not callable

# problem solved
from builtins import len

len("five")
4
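As a complementary sketch: overwriting a built-in name doesn’t destroy the original function, it just hides it behind your own variable. The original stays reachable as an attribute of the builtins module, which is why the fix above works.

```python
import builtins

# Shadow the built-in by accident, as in the example above.
len = 5

# The original function is still available via the builtins module.
print(builtins.len("five"))  # 4

# Pointing the name back at the original restores normal behavior.
len = builtins.len
print(len("five"))  # 4
```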

You can perform multiple imports per line by separating the different things to import with a comma.

import nltk, builtins
from builtins import len, set

If you want to import all objects defined in a module under the names given to them in that module, you can use star import syntax.

from builtins import *

At first, this might seem convenient and appealing. Those of you who know R might be intuitively drawn to this form because this is R’s default. Resist the urge. This variant makes it hard to track where objects came from, because a lot of names can hide under that *, and the problem only gets worse if you star-import from multiple modules, where later imports can silently overwrite earlier ones. In time, you’ll grow to appreciate Python’s more verbose but cleaner approach to namespaces, which makes it much easier to see at a glance where your variables came from.
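To get a feel for how many names can hide under that *, here is a rough sketch using two standard library modules, math and cmath. The star_names() helper is a made-up approximation of what a star import would bring in, not something Python provides.

```python
import cmath
import math

def star_names(module):
    """Approximate the set of names `from module import *` would import."""
    # A module can list its star-importable names in __all__ ...
    if hasattr(module, "__all__"):
        return set(module.__all__)
    # ... otherwise, every name not starting with an underscore is imported.
    return {name for name in vars(module) if not name.startswith("_")}

# Names defined by both modules: star-importing both would make one
# version silently overwrite the other.
shared = star_names(math) & star_names(cmath)
print(len(star_names(math)), "names in math")
print(sorted(shared))

# With regular imports, it's always obvious which version you're calling.
print(math.sqrt(4))   # 2.0
print(cmath.sqrt(4))  # (2+0j)
```

Whichever of the two star imports ran last would decide what a bare sqrt means, and nothing in your code would tell the reader which one that was.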

Finally, the Python standard library contains many useful modules (the saying goes that Python comes with batteries included), but for many tasks, you’ll likely want to install additional packages. This can be done in a variety of ways, including a GUI manager if you’re using the Anaconda Python distribution, but the official Python package manager that should always be available is a command-line tool called pip.

This chapter is not about learning to use the command line, so just a quick crash course on pip. First, you need to figure out the name of the package you need. That can be done by searching the internet for keywords related to the functionality you want + Python, or by directly searching the Python Package Index.

When you have the name, you need to run the pip install command at the command line. Conveniently, you can do so directly from JupyterLab by prefixing it with ! (this is another one of those special JupyterLab features which isn’t actually part of Python itself). For instance, to install the nltk library, you would run the following command:

!pip install nltk
Looking in indexes: https://pypi.org/simple, https://packagecloud.io/akopytov/sysbench/pypi/simple
Requirement already satisfied: nltk in /home/david/repos/v4py.github.io/.venv/lib/python3.9/site-packages (3.6.4)
Requirement already satisfied: click in /home/david/repos/v4py.github.io/.venv/lib/python3.9/site-packages (from nltk) (7.1.2)
Requirement already satisfied: tqdm in /home/david/repos/v4py.github.io/.venv/lib/python3.9/site-packages (from nltk) (4.62.3)
Requirement already satisfied: joblib in /home/david/repos/v4py.github.io/.venv/lib/python3.9/site-packages (from nltk) (1.1.0)
Requirement already satisfied: regex in /home/david/repos/v4py.github.io/.venv/lib/python3.9/site-packages (from nltk) (2021.9.30)

If you get some sort of permission denied error, it’s because pip is trying to write into a system-wide library for all users which you don’t have write access to. In that case, try running pip install --user nltk instead.
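If you’re not sure whether a package is already available, you can also check from within Python before reaching for pip. The following is just a sketch: is_installed() is a hypothetical helper name, and it checks the name you’d use in an import statement, which isn’t always the same as the package’s name on the Python Package Index.

```python
import importlib.util
import sys

def is_installed(module_name):
    """Return True if a top-level module of that name can be found for import."""
    return importlib.util.find_spec(module_name) is not None

# The Python interpreter this session runs on; packages must be
# installed into its environment to be importable here.
print(sys.executable)

print(is_installed("json"))  # part of the standard library: True
print(is_installed("nltk"))  # depends on your environment
```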

That’s enough about libraries right now. On to control flow!