2. A tour of Python and NLTK
2.1. The most important piece of programming advice you’ll ever get™
Over the course of this book, you’ll be presented with a lot of information. Like, a lot. It makes no sense trying to memorize all of it on your first pass. No one expects you to. With programming as with anything else, practice makes perfect – as you keep using Python, the parts you use often enough will gradually enter your muscle memory. The parts that you don’t, you’ll have to look up as needed. I do that constantly when I write code that uses other people’s libraries that I don’t use every day.
Luckily, there’s this thing called the internet, where you can search for information using these things called search engines. They’re pretty smart nowadays, so you can basically ask them “how do I do X in Python” and they’ll probably come up with links to reasonable answers. Until you reach an intermediate to advanced level, you can be pretty sure you’re not the first person on the internet to have that question. To help you make sense of the results you’re likely to get, here’s an overview of the domains that frequently appear in them:
https://stackoverflow.com/, and subdomains of https://stackexchange.com/: in the programming community, these are leading sites offering information in a user-generated question/answer format. If you have a question, it’s probably already been asked and answered there, maybe even multiple times. If it hasn’t, go ask it!
https://docs.python.org/: Python’s official documentation, including the many modules included in the standard library
subdomains of https://readthedocs.io/: documentation for additional packages, created by people all over the world; however, larger projects often host documentation on their own domain, e.g. https://pandas.pydata.org/
https://github.com: this is a site many people use to collaborate on developing software projects in the open (cf. open source). If you encounter weird behavior in a library, and that library’s community uses GitHub for development, it’s a good idea to check that library’s issue tracker to see whether that behavior might be a known bug, and if not, consider reporting it yourself.
2.2. How to use this chapter
Don’t try to memorize all the information in this chapter before moving on to the rest of the book. The purpose of this chapter is to familiarize you with basic words and concepts used throughout the book, so that when you run across them, you know where to look them up. That’s why they’re all in one place. There are also quite a lot of them, and truly mastering them will get you quite far along your Python programming career, so mastery is not the goal right now. The goal is to get acquainted, possibly skip ahead if you get bored, and definitely skip back for a refresher whenever things stop making sense.
2.3. Python notebooks: your fancy new calculator
Python notebooks are kind of like a calculator, just a lot more fancy than your typical calculator. But otherwise, they should feel pretty familiar – you type some stuff in, press a button, Python thinks for a while and spits back an answer.
The place where you type stuff is called a cell, and running a cell with the play button (▶) makes Python evaluate the code you typed in, i.e. your program. A very simple Python program could consist of just one number:
1
1
And Python’s answer is, unsurprisingly, the same as a calculator’s answer would be: the expression 1 evaluates to… Well, it evaluates to 1. Unlike in a calculator though, you can have multiple cells in a notebook, jump back and forth between them, edit and re-evaluate old ones, etc. Let’s create a new cell with the plus button (➕) and try some slightly more complicated expressions.
1 + 1
2
Cells can consist of multiple lines. Let’s demonstrate that with a comment. Comments are pieces of programs which are completely ignored by the computer; they’re included solely for the benefit of the humans reading the code. In Python, a comment starts with a hash mark, #; everything that follows it on the same line is a comment, which is often signaled by distinctive syntax highlighting.
# This is a comment.
# Python will completely ignore this line, the one above and the one below.
# They're here just for you, as the person who reads this code.
3 - 2
1
If you’d like to write more extensive prose commentary, consider using Markdown cells. In JupyterLab, use the notebook’s top toolbar to switch the cell type from Code to Markdown. Markdown is a lightweight markup language which allows you to add formatting such as italics and bold, hyperlinks and other features, but you don’t have to worry about that, you can mostly just write regular text and it’ll probably render just fine. If you’d like to learn more about Markdown features, you can always double-click on cells like this one in JupyterLab to see their source code, because they’re written in Markdown, or try this interactive tutorial. But let’s focus on Python right now.
The * operator is used for multiplication, / for division, // for floor division (loosely, truncating division), % for modulo and ** for exponentiation.
2 * 3
6
4 / 3
1.3333333333333333
# "truncating" division chops off the part of the number after the
# decimal point; strictly speaking, // rounds down, which only makes a
# difference for negative numbers
4 // 3
1
2**3
8
# modulo wraps counting around a certain number, so you can use it e.g.
# for converting 24h clock hours to 12h clock hours
16 % 12
4
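One caveat about // is easier to see than to describe, so here's a small sketch: it always rounds down (towards negative infinity), which means that for negative numbers it doesn't simply chop off the decimal part. The built-in divmod() function, which isn't used elsewhere in this chapter, returns the results of // and % in one go.

```python
# For positive numbers, floor division looks like chopping off the decimals...
print(7 // 2)    # 3

# ...but for negative numbers, it rounds down, away from zero.
print(-7 // 2)   # -4

# The result of modulo takes the sign of the right-hand operand.
print(-7 % 2)    # 1

# divmod() returns the // and % results together, as a pair.
print(divmod(16, 12))   # (1, 4)
```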
A decimal point, ., is used to separate the whole number part from the fractional part of a number.
0.1 + 0.1
0.2
At this point, if you’re following along interactively in JupyterLab, you’re probably sick and tired of reaching for your mouse to press ▶ all the time. JupyterLab has many handy keyboard shortcuts which I’ll let you discover on your own at your leisure, but the following three are such a huge quality of life improvement that I feel compelled to point them out:
Alt+Enter: evaluate, insert new cell and switch to it
Shift+Enter: evaluate and switch to next (existing) cell
Ctrl+Enter: evaluate and stay at current cell
2.4. Text as strings of characters
But enough about numbers! We’re linguists, so let’s look at how we can represent text. The most basic way of representing text in Python is as one long string of characters. Strings can be created using quotes; both single and double work fine.
"Hello, world!"
'Hello, world!'
'Hello, world!'
'Hello, world!'
"It's me, world!"
"It's me, world!"
Though you can’t just put a single quote inside a single-quote-delimited string because, well, Python would think that’s where the string ends.
'It's me, world!'
File "/tmp/ipykernel_56602/1132477001.py", line 1
'It's me, world!'
^
SyntaxError: invalid syntax
What Python spits out in this case is an error traceback, telling you that something went wrong. Python usually gives extremely helpful error messages that help you diagnose what went wrong, where the error occurred and how you got there, so it’s well worth spending time reading them and learning to decipher them (it’s not that hard, it just takes practice). The one exception to this is when Python can’t even parse your program as valid Python code. In that case, the best it can do is tell you where it got stuck, raise a SyntaxError, and leave you to figure out what the problem is and how to fix it on your own. Which is what happened here.
The easiest workaround when you want your string to contain one kind of quote is to use the other kind of quote to delimit it, like we did in the second-to-last code cell. Sometimes however, you want both kinds of quotes. In that case, you can use the backslash character \ to escape the special, string-terminating meaning of a quote, and make it into a regular character which is part of the string.
"\"It's me, world!\" she said."
'"It\'s me, world!" she said.'
Escaping means canceling the default meaning of a character or character sequence and using an alternative one. In a double-quoted string, " normally means “end the string here”, but a preceding \ changes that to “just insert a double quote at this point in the string”. Escape sequences exist to help you put characters in your strings which you just can’t put there literally, or which are hard to type on your keyboard. For instance, you can’t just put a newline character inside a string:
"one line
another line"
File "/tmp/ipykernel_56602/1697716864.py", line 1
"one line
^
SyntaxError: EOL while scanning string literal
Grr, another SyntaxError. What you can do is use the \n escape sequence to represent that newline, without having an actual newline in your source code and triggering that SyntaxError.
"one line\nanother line"
'one line\nanother line'
How do you know that \n is really a newline if the string still just shows \n? When you evaluate a string, Python shows you its canonical representation which you could put back in your code. This means you get a good idea of what the string contains (the newline is otherwise a sort of phantom character, as is most whitespace, but you can actually see it here), but it’s not very pretty or readable. If you want to get a rendered version of your string, as it would appear in a text file, you need to use the print() function.
print("one line\nanother line")
one line
another line
Ah, much better. At least for reading. For inspecting the contents of strings, the default behavior is in fact very useful.
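The canonical representation is also available on demand through the built-in repr() function, which comes in handy when you want to spot invisible characters in the middle of printed output. A minimal sketch (the variable name text is just for illustration):

```python
text = "one line\nanother line \t"

# print() renders the string: the newline breaks the line, and the
# trailing space and tab are invisible.
print(text)

# repr() returns the canonical representation, escape sequences and all,
# so the trailing whitespace becomes visible.
print(repr(text))
```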
There are various other escape sequences you can use in Python strings; another handy and commonly encountered one is \t for the tab character:
print("one\ttwo\nthree\tfour")
one two
three four
And less commonly seen but pretty neat as well is \N{...}, for inserting characters based on their Unicode names:
"\N{see-no-evil monkey}"
'🙈'
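If you're wondering where to find those Unicode names, one option is the unicodedata module from Python's standard library (importing modules is covered later in this chapter). This is just a sketch of the round trip from character to name and back:

```python
import unicodedata

# Look up the official Unicode name of a character...
name = unicodedata.name("ß")
print(name)

# ...and use that name in a \N{...} escape to get the character back.
print("\N{LATIN SMALL LETTER SHARP S}")
```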
If you want to type a longer piece of text into a Python string though, that newline escape thing can become really annoying – who wants to type several paragraphs as one long line interspersed with \ns? Fear not, triple-quoted strings to the rescue:
"""one line
another line"""
'one line\nanother line'
print('''one line
another line''')
one line
another line
Inside a triple-quoted or multiline string, you can put anything you like, including actual newlines… except of course a sequence of three quotes of the same type that you’re using to delimit the string. In that case, you need to escape at least some of them with that backslash again.
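Here's a small sketch of both points: quotes of either kind can appear freely inside a triple-quoted string, and only a run of three matching quotes needs escaping. The variable names quote and tricky are made up for this example.

```python
# Single and double quotes are both fine inside a '''-delimited string.
quote = '''"It's me, world!" she said.'''
print(quote)

# Three double quotes in a row would end a """-delimited string early,
# so they're escaped with backslashes here (escaping just one of them
# would be enough).
tricky = """Triple quotes look like this: \"\"\" """
print(tricky)
```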
All these various ways of creating strings are called string literals. A literal is a dedicated syntax for creating one type of data. In the previous section, we saw how to write down number literals in Python – the representation was so straightforward that we didn’t even think it needed a special term like ‘literal’ to describe it. Python has a few more core, built-in data structures which have dedicated literal syntax; we’ll encounter them below.
2.5. Objects and variables
Writing Python code consists of interacting with and manipulating various objects. Object is a generic term for anything you can inspect by putting it inside a code cell and evaluating that cell. So far, we’ve seen numbers, strings and one function (that’s right, functions are objects too).
print
<function print>
If you know you’ll be using an object repeatedly, it’s a good idea to store it in a variable using the assignment operator, =. That way, you don’t have to keep writing its literal over and over. The variable name is entirely up to you, though there are some rules you must abide by; for instance, you can’t use spaces, so these are often replaced with underscores _.
string = "one\ttwo\nthree\tfour"
string
'one\ttwo\nthree\tfour'
Anywhere you want to use that object, you can then refer to it using the name of the variable that points to it.
print(string)
one two
three four
Multiple names can refer to the same object.
another_name = string
print(another_name)
one two
three four
Whether two names refer to the same object or not can be checked using the is operator.
string is another_name
True
string is print
False
string2 = "one\ttwo\nthree\tfour"
# NOTE: when not in the notebook environment, Python will usually be
# smart enough to figure out that string and string2 contain the same
# characters, it will store just one copy of the string to save
# memory and point both variables at it; in that case, the expression
# below will evaluate to True
string is string2
False
When two objects are not the same, they might still be equal, in that they look the same, they have the same contents. You can check for that using the equality operator, ==.
string == string2
True
A quick way to remember that using a real-world analogy: for twins, is would be false, since they’re two different people, but == would be true, since they look the same.
The == is also a good opportunity to check that numbers and strings really are two completely different things in Python:
42 == "42"
False
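Because Python may or may not share identical strings behind the scenes, the difference between is and == is easier to observe reliably with lists, which are introduced in the section on collections: two separately written list literals are always two distinct objects. A minimal sketch of the twins analogy, with made-up variable names:

```python
# Two separately created lists with the same contents: equal, but not identical.
twin1 = ["Let", "it", "be"]
twin2 = ["Let", "it", "be"]
print(twin1 == twin2)   # True
print(twin1 is twin2)   # False

# A second name for the same object: both equal and identical.
nickname = twin1
print(nickname == twin1)   # True
print(nickname is twin1)   # True
```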
2.6. Attributes: objects on objects (on objects…)
Only rarely is an object an island entire of itself; most objects have other objects attached to them as attributes, and these in turn have attributes of their own, etc. We saw that the period character . is used as the decimal separator in number literals. The other, probably more important use of . in Python is for accessing those attributes.
Say for instance that we have a complex number, which consists of two parts, a real and an imaginary one. This is the literal syntax for complex numbers in Python:
c = 1 + 2j
c
(1+2j)
Let’s not worry about what complex numbers are good for right now, we’re interested in attributes. Complex numbers store their real and imaginary parts as .real and .imag attributes, respectively.
c.real
1.0
c.imag
2.0
2.6.1. Methods
Functions are regular objects and as such, they can also be attached to other objects as attributes, as little snippets of dynamic behavior which do something interesting with the parent object, instead of just storing static data. Functions attached to an object as attributes are more commonly referred to as that object’s methods, but they’re basically just functions. For instance, inspect the print() function we already saw above, and compare it with the .conjugate() method.
print
<function print>
c.conjugate
<function complex.conjugate>
Unlike regular data though, functions are generally not meant to be inspected, they’re meant to be called using function call syntax, i.e. by appending () to the function name. Calling a function triggers its behavior, it runs the piece of code that’s associated with it. For instance, the print() function prints objects to the screen, as we saw previously. The .conjugate() method computes the complex number’s conjugate.
c.conjugate()
(1-2j)
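To drive home the point that a method is just an object until you call it, here's a small sketch. It stores the method under a name of its own (conj, picked for this example) and uses the built-in callable() function to tell callable attributes from plain data.

```python
c = 1 + 2j

# Without parentheses, we just get hold of the method object...
conj = c.conjugate

# ...and calling it later still operates on the number it came from.
print(conj())            # (1-2j)

# callable() tells methods apart from plain data attributes.
print(callable(conj))    # True
print(callable(c.real))  # False
```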
Let’s drop complex numbers as the running example and come back to strings – they have a much greater variety of interesting methods we can explore. If you’re running this notebook interactively inside JupyterLab, a great feature which helps you do so is tab completion. If you type a variable name + . and hit the Tab key, a menu should come up with all the attributes available on the object. Try it!
# type string. and press Tab on your keyboard
['capitalize',
'casefold',
'center',
'count',
'encode',
'endswith',
'expandtabs',
'find',
'format',
'format_map',
'index',
'isalnum',
'isalpha',
'isascii',
'isdecimal',
'isdigit',
'isidentifier',
'islower',
'isnumeric',
'isprintable',
'isspace',
'istitle',
'isupper',
'join',
'ljust',
'lower',
'lstrip',
'maketrans',
'partition',
'removeprefix',
'removesuffix',
'replace',
'rfind',
'rindex',
'rjust',
'rpartition',
'rsplit',
'rstrip',
'split',
'splitlines',
'startswith',
'strip',
'swapcase',
'title',
'translate',
'upper',
'zfill']
You can see there’s quite a lot going on in there. There’s a whole lot of methods starting with the prefix is*, which are there to answer some questions you might have about the contents of your strings.
"cat".islower()
True
"DOG".isupper()
True
"Frank".istitle()
True
"42".isnumeric()
True
If you’d like to learn more about an object, you can use JupyterLab’s interactive help feature. Type a question mark ? before the name of the object you’re interested in, evaluate the cell, and detailed information about that object will be shown. For instance, if we’re interested in what the .isprintable() method does:
?string.isprintable
Signature: ()
Docstring:
Return True if the string is printable, False otherwise.
A string is printable if all of its characters are considered printable in
repr() or if it is empty.
Type: builtin_function_or_method
The question mark is not regular Python syntax, it only works inside notebooks and is intended to make interactive exploration easier. Stock Python also has a built-in help() function which does something similar, but it displays less information. If you want even more details, you can insist by typing two question marks ?? instead of one:
??string.isprintable
Signature: ()
Docstring:
Return True if the string is printable, False otherwise.
A string is printable if all of its characters are considered printable in
repr() or if it is empty.
Type: builtin_function_or_method
… though sometimes (as in this case), more detail might not be available and the help output will be the same as with one question mark. For convenience, the question mark(s) can also go behind the object you’re taking a peek at.
string.isprintable?
Signature: ()
Docstring:
Return True if the string is printable, False otherwise.
A string is printable if all of its characters are considered printable in
repr() or if it is empty.
Type: builtin_function_or_method
And another way to trigger interactive help is by pressing Shift-Tab while your typing cursor is inside an object’s name. A floating window will pop up with the help contents inside. This is useful for quick checks, because you don’t even have to evaluate the cell.
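Outside of notebooks, a rough equivalent of this kind of exploration can be pieced together from plain Python. The sketch below uses the built-in dir() function to list attribute names and reads the documentation text from the method's __doc__ attribute, which is the text help() is built around. The names string and public_names are just for illustration, and the [... for ... if ...] construct is a filtering shorthand not covered in this chapter so far.

```python
string = "one\ttwo\nthree\tfour"

# dir() lists the names of all attributes on an object; here we keep
# only those which don't start with an underscore.
public_names = [name for name in dir(string) if not name.startswith("_")]
print(public_names[:5])

# The documentation text of a method is stored on the method itself;
# help(string.isprintable) would display it in a nicer format.
print(string.isprintable.__doc__)
```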
There are also several methods which allow you to create new strings with some of the characters changed.
# convert to upper case
"klein".upper()
'KLEIN'
# convert to lower case
"GROß".lower()
'groß'
# convert to lower case in an even more aggressive way, which is usually
# the safer option if you really want to make sure all case distinctions
# are ignored
"GROß".casefold()
'gross'
"pride and prejudice".capitalize()
'Pride and prejudice'
"pride and prejudice".title()
'Pride And Prejudice'
# remove leading and trailing whitespace
"""
floating in space
""".strip()
'floating in space'
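A typical use for .casefold() is comparing strings while ignoring case. This short sketch shows why it's the safer choice compared to .lower() for that purpose, using the two spellings of the German word from the cells above:

```python
# .lower() leaves ß alone, so these two spellings don't match...
print("GROSS".lower() == "groß".lower())         # False

# ...whereas .casefold() maps ß to ss, so they do.
print("GROSS".casefold() == "groß".casefold())   # True
```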
Some of these methods even require additional arguments to do their work. For instance, if you want to .replace() part(s) of the string with something else, you need to tell Python what to replace with what. Arguments are written out between the parentheses doing the function call, and if you can’t remember what they are or what order they come in, that’s precisely what interactive help is there for!
"I love cats and categories.".replace("cat", "dog")
'I love dogs and dogegories.'
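Interactive help for .replace() would also reveal an optional third argument, which limits how many occurrences get replaced. A quick sketch of how that can rescue the sentence above (the variable name sentence is made up for the example):

```python
sentence = "I love cats and categories."

# Only the first occurrence of "cat" is replaced.
print(sentence.replace("cat", "dog", 1))   # I love dogs and categories.
```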
Note that none of these methods modify the original string, they just use it to derive what the new string should look like.
animal = "cat"
animal.upper()
'CAT'
animal
'cat'
This is because strings (and numbers) in Python are immutable – you can’t change them in place. You can create new strings and re-assign them to old variable names, but the old strings will always stay the way they were at the beginning.
# create a string and give it two names
name1 = "string"
name2 = name1
# create a new string which is an uppercase version of the original
# string, and re-assign it to the name2 variable
name2 = name2.upper()
name2
'STRING'
# but the old string is still there, undisturbed
name1
'string'
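To see immutability in action, the sketch below tries to overwrite the first character of a string in place. It borrows the [...] indexing and slicing syntax described in the section on collections, and it uses a try/except block, which this chapter hasn't introduced, purely to catch the resulting error so that the rest of the example can run.

```python
word = "string"

try:
    # Attempt to change the first character in place.
    word[0] = "S"
except TypeError as err:
    print("Not allowed:", err)

# What works instead: build a new string and re-assign the name.
word = "S" + word[1:]
print(word)   # String
```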
We’ll talk more about immutability in the next section about collections, because unlike strings and numbers, some Python objects can actually be modified in place.
Using tab completion and interactive help (? or Shift-Tab), I encourage you to familiarize yourself with the methods on string objects, or indeed on any new object type that you come across and intend to use, so that you know what’s available and get a picture of what you can use the object for. No need to go through them exhaustively, just skim the list, read the descriptions of some of the methods that sound particularly useful, leave those that sound confusing for later (or never).
2.7. Collections
Collections or containers are objects intended to contain other objects, so that you can conveniently manipulate them together. We saw above that almost all Python objects in some sense “contain” other objects via their attributes, but those attributes are somehow intimately tied to how any given object works. The whole point of collections is that they (mostly) don’t care what you put into them or what you take out; they’ll just happily keep an eye on it for you and allow you to juggle and re-arrange the items in clever and succinct ways.
To motivate the need for collections, imagine that you want to store the individual tokens in a sentence, “Let it be.”. Without collections, you would have to use separate variables:
string1 = "Let"
string2 = "it"
string3 = "be"
string4 = "."
This gets really tedious really quickly. Instead, you can use a Python list:
strings = ["Let", "it", "be", "."]
Much better!
2.7.1. Collection literals
We’ll start getting acquainted with Python’s builtin collections by learning about their literals, i.e. the special syntax used to create them. We’ll cover lists, tuples, dictionaries and sets. We’ve previously covered strings, which can also be seen as collections. The string "abc" is a collection of characters, or more accurately, a collection of three strings of length 1: "a", "b" and "c".
We’ve already met lists. In general, Python really doesn’t care what kinds of objects you store in your collections, it’s entirely up to you, so you can mix and match at will.
[1, "two", print]
[1, 'two', <function print>]
Lists are great for storing tokenized text:
["Help", "!", "I", "need", "somebody", ",", "help", "!"]
['Help', '!', 'I', 'need', 'somebody', ',', 'help', '!']
This is how you create an empty list:
[]
[]
Closely related to lists are tuples (we’ll discuss the differences below).
(1, "two", print)
(1, 'two', <function print>)
In many cases, the parentheses are actually optional. You’ll probably see me using tuples when I want to output multiple objects from a code cell, because Python only uses the last expression in the cell as its output value.
1
"two"
print
<function print>
1, "two", print
(1, 'two', <function print>)
Whenever you see commas without any parentheses around them, it’s a tuple. You’ll learn over time when it’s safe to omit them, but until then, it might be a good idea to play it safe and always use them. This means “take the number 1, the result of the comparison 2 < 3, and the number 4, and create a 3-tuple out of them”:
1, 2 < 3, 4
(1, True, 4)
Whereas this says, “create a 2-tuple (1, 2) and a 2-tuple (3, 4) and only then do a comparison of the resulting tuples”:
(1, 2) < (3, 4)
True
As you can see, the parentheses work in the same way as in math, as a precedence operator: they say “first create the tuples and then do the rest”, much as \((4 + 3) * 2\) says “first do the addition, then the multiplication”, overriding the default \(4 + 3 * 2\) which goes the other way round. One place they’re never optional though is when creating an empty tuple.
()
()
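One more corner case follows from the fact that it's the commas that make a tuple, not the parentheses: a tuple with a single element needs a trailing comma. A minimal sketch, with variable names invented for the occasion:

```python
not_a_tuple = (1)     # just the number 1 wrapped in parentheses
one_tuple = (1,)      # the trailing comma makes it a tuple
also_one_tuple = 1,   # and the parentheses are optional here too

print(type(not_a_tuple))             # <class 'int'>
print(type(one_tuple))               # <class 'tuple'>
print(one_tuple == also_one_tuple)   # True
```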
Moving on, we have dictionaries. Dictionaries are different in that their purpose is not to store only values, but key–value pairs. We say that they map keys to values, kind of like real-world dictionaries map words in one language to another.
{"cat": "chat", "dog": "chien"}
{'cat': 'chat', 'dog': 'chien'}
A constraint on dictionaries is that the keys must be unique. If you provide multiple values per key, only the last one will be retained.
{"odd": 1, "even": 2, "odd": 3, "even": 4, "odd": 5}
{'odd': 5, 'even': 4}
If you need multiple values per key, well… Just store a collection as the value instead!
{"odd": [1, 3, 5], "even": [2, 4]}
{'odd': [1, 3, 5], 'even': [2, 4]}
And this is how you create an empty dictionary:
{}
{}
It may not seem like it at first glance, but dictionaries are an extremely powerful and versatile data structure, and they’re the backbone upon which Python is built – they’re used everywhere.
Sets have a literal syntax which somewhat resembles that of dictionaries: it also uses curly braces {}, but no colons :, because sets again store just values, not key–value pairs. But they do require that their values be unique and throw away any duplicates, so there is a conceptual similarity with dictionaries which motivates that syntactic similarity.
{1, 2, 3, 1, 2, 3}
{1, 2, 3}
This deduplication behavior makes sets great for deriving vocabularies of unique words.
{"the", "cat", "sat", "on", "the", "mat"}
{'cat', 'mat', 'on', 'sat', 'the'}
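In practice, you'd rarely type the words in by hand. As a rough sketch of where this is headed, the cell below derives the vocabulary from a piece of text, using the .split() method and the set() and len() functions which are described later in this section, plus the built-in sorted() function to get the words in alphabetical order. The variable names are just for illustration.

```python
text = "the cat sat on the mat"

tokens = text.split()      # split on whitespace into a list of tokens
vocabulary = set(tokens)   # duplicates are thrown away

print(len(tokens))          # 6
print(len(vocabulary))      # 5
print(sorted(vocabulary))   # ['cat', 'mat', 'on', 'sat', 'the']
```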
Since {} is already taken to mean empty dictionary, empty set literals actually look like a function call:
set()
set()
This is a somewhat ugly inconsistency in a language that otherwise tries hard to be consistent, but oh well, what can you do.
2.7.2. len(): number of items in collection
The len() function works on all collections. It tells you how many elements a collection has.
len([1, 2, 3])
3
len("Norwegian Wood")
14
In the case of dictionaries, it counts the number of key–value pairs.
len({"cat": "chat", "dog": "chien"})
2
2.7.3. in: checking collection membership
The in operator also works on all collections. It tells you whether the collection contains a given element.
1 in [1, 2, 3]
True
"one" in {1, 2, 3}
False
For dictionaries, it tests against keys, not values.
"cat" in {"cat": "chat", "dog": "chien"}
True
"chat" in {"cat": "chat", "dog": "chien"}
False
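Two related cases are worth a quick sketch. If you do want to test against the values of a dictionary, you can ask for them explicitly with the .values() method, which comes up again in the section on converting between collections. And with strings, in is a little more generous than with other collections: it checks for whole substrings, not just single characters.

```python
en2fr = {"cat": "chat", "dog": "chien"}

# Dictionaries test against keys by default, values on request.
print("chat" in en2fr)            # False
print("chat" in en2fr.values())   # True

# For strings, in checks whether one string occurs inside another.
print("cat" in "category")   # True
print("dog" in "category")   # False
```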
2.7.4. [...]: retrieving and modifying collection elements
lst = [1, 2, 3]
lst[0]
1
lst[0] = 100
lst
[100, 2, 3]
For sequences, i.e. collections which naturally preserve order – lists, tuples and strings – you can also extract slices.
string = "Can't buy me love"
string[13:]
'love'
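The cells above are terse, so here's a slightly more detailed sketch of what the [...] syntax can do. Positions are counted from 0, negative positions count from the end, and a slice start:stop includes the start position but excludes the stop position. Dictionaries use keys instead of positions. The example data is reused from elsewhere in this chapter, except for the "bird" entry, which is made up.

```python
string = "Can't buy me love"

print(string[0])     # C     (first character)
print(string[-1])    # e     (last character)
print(string[:5])    # Can't (everything before position 5)
print(string[6:9])   # buy   (positions 6, 7 and 8)
print(string[-4:])   # love  (the last four characters)

# Dictionaries are indexed by key rather than by position.
en2fr = {"cat": "chat", "dog": "chien"}
print(en2fr["cat"])         # chat
en2fr["bird"] = "oiseau"    # assigning to a new key adds a key-value pair
print(len(en2fr))           # 3
```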
2.7.5. del: removing collection elements
The del statement can remove elements from collections that you point at with the [...] syntax.
lst = [1, 2, 3]
del lst[1]
lst
[1, 3]
dct = {"one": 1, "two": 2}
del dct["one"]
dct
{'two': 2}
It also works for variables.
# create a variable
num = 1
num
1
# poof! it's gone
del num
num
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
/tmp/ipykernel_56602/56445515.py in <module>
1 # poof! it's gone
2 del num
----> 3 num
NameError: name 'num' is not defined
2.7.6. Converting between collections
If you want to convert between the different types of collections, you can mostly use built-in functions named after the target collection. For instance, if you want to turn a set into a list:
list({1, 2, 3, 1, 2, 3})
[1, 2, 3]
Or a list into a tuple:
tuple([1, 2, 3])
(1, 2, 3)
Or a tuple into a set:
set((1, 2, 3, 1))
{1, 2, 3}
With dictionaries, it’s slightly more complicated, as they don’t contain only values, but key–value pairs. When converting from a dictionary, you thus have to decide whether you want just the keys (the default, but you can also request it explicitly with the .keys() method), just the values (with the .values() method), or key–value pairs, so-called items (with the .items() method).
en2fr = {"cat": "chat", "dog": "chien"}
list(en2fr)
['cat', 'dog']
list(en2fr.keys())
['cat', 'dog']
tuple(en2fr.values())
('chat', 'chien')
set(en2fr.items())
{('cat', 'chat'), ('dog', 'chien')}
Conversely, when converting to a dictionary, you have to provide a collection which can be interpreted as containing both keys and values, otherwise you can’t really build a dictionary out of it. One possible option is a list of 2-tuples.
dict([("a", "b"), ("c", "d")])
{'a': 'b', 'c': 'd'}
But it’s definitely not the only one. Try to understand and describe what’s going on in the next cell!
dict(["ab", "cd"])
{'a': 'b', 'c': 'd'}
The dict function also allows you to create a fresh dictionary in a way that may be slightly easier to type, with fewer curly braces and quotes, if your keys are strings which also happen to be valid identifiers (i.e., they could be used as variable names).
dict(cat="chat", dog="chien")
{'cat': 'chat', 'dog': 'chien'}
Strings are kind of the odd one out in this company because the str function doesn’t convert another collection to a string, at least not in the same sense the other functions we’ve seen work. It returns a string representation of the collection, intended to suggest how you could create such a collection using literal syntax.
str([1, 2, 3])
'[1, 2, 3]'
str(en2fr)
"{'cat': 'chat', 'dog': 'chien'}"
In the other direction, the other collection functions split strings at character boundaries.
list("abracadabra")
['a', 'b', 'r', 'a', 'c', 'a', 'd', 'a', 'b', 'r', 'a']
set("abracadabra")
{'a', 'b', 'c', 'd', 'r'}
If you want to split anywhere else, you’ll have to use the .split() method on strings. By default, it splits on whitespace, any amount and any kind of it.
"  foo\nbar  \n baz   qux  ".split()
['foo', 'bar', 'baz', 'qux']
But you can also tell it explicitly what string to use as a delimiter, and in that case, it’ll follow your orders to the letter.
"  foo\nbar  \n baz   qux  ".split("\n")
['  foo', 'bar  ', ' baz   qux  ']
It will even create empty strings if two delimiters immediately adjoin each other.
"  foo\nbar  \n baz   qux  ".split(" ")
['', '', 'foo\nbar', '', '\n', 'baz', '', '', 'qux', '', '']
The delimiter can consist of multiple characters.
"the cat sat on the mat".split("at")
['the c', ' s', ' on the m', '']
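Two more things you might want when splitting, sketched under the assumption that you've looked them up via interactive help: .split() accepts an optional second argument which caps the number of splits, and the .join() method from the list of string methods above goes in the opposite direction, gluing a list of strings back together with the string it's called on as the delimiter.

```python
line = "the cat sat on the mat"

# Split at most twice; the rest of the string is left in one piece.
print(line.split(" ", 2))   # ['the', 'cat', 'sat on the mat']

# .join() is called on the delimiter and takes the pieces as its argument.
tokens = line.split()
print("_".join(tokens))     # the_cat_sat_on_the_mat
```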
2.7.7. Combining collections
2.7.8. Further exploration
The character, specificities and possible use cases of each collection type are further revealed by the methods they expose. We’ll point them out as we encounter them in practice throughout the rest of the book, but if you’re curious, I encourage you to play around with the individual collections and explore their abilities via the previously described tab completion + interactive help approach.
2.8. Importing additional libraries
We’re about to dive into the magical world of conditionals and for-loops, but to make it more interesting, I thought we’d throw in some data and tools provided by the NLTK (which stands for Natural Language Toolkit) library. In order to do that however, we need to know how to import it, so bear with me for this short interlude.
In every Python session, some core functions and data types are available by default – everything we’ve seen so far is part of these so-called built-ins. In and of themselves, they’re already amazingly useful and allow you to do lots of stuff, but if everyone always had to start from these basic building blocks, programming would be repetitive and tedious. That’s why people build reusable pieces of code that can be imported into Python to extend its functionality. These are called libraries or packages or modules. Strictly speaking, each of these terms means slightly different things, but informally, they can be used interchangeably.
Import syntax in Python is simple and intuitive; it has a few basic variations which we’ll presently go through.
import nltk
This imports the nltk module and creates an nltk variable which you can use to access the objects inside the module via attribute syntax. For instance, the word_tokenize() function splits text into words, or technically, tokens.
nltk.word_tokenize("Let it be.")
['Let', 'it', 'be', '.']
Notice how by default, Python tries to keep your objects and imported objects in separate namespaces, so that they don’t collide. If you happen to have previously defined a word_tokenize() function of your own, importing the nltk module won’t clobber it, because nltk’s word_tokenize() function is kept tucked away in the nltk namespace.
Of course, it may be the case that you actually have a previously created object named nltk that you don’t want to clobber. If so, then you can use renaming imports to pick the namespace yourself.
import nltk as ling
This imports the nltk module, but stores it in the variable / namespace ling instead of nltk.
ling.word_tokenize("Dig a pony.")
['Dig', 'a', 'pony', '.']
If you know you’re going to be using a specific object a lot and you don’t want to go to the trouble of typing the namespace prefix nltk. over and over again, then you can specifically request that it be added to your own namespace with the following syntax (notice that it’s customary to separate imports from regular code by at least one empty line):
from nltk import word_tokenize

word_tokenize("Two of us wearing raincoats.")
['Two', 'of', 'us', 'wearing', 'raincoats', '.']
And of course, you can combine this with renaming if necessary.
from nltk import word_tokenize as tokenize
tokenize("I, me, mine.")
['I', ',', 'me', ',', 'mine', '.']
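It may help to realize that all of these import variants give you access to the very same objects, just under different names. The following sketch checks this with the is operator. It uses the standard library’s math module instead of nltk so that it runs anywhere, and the short names m and root are arbitrary choices made for the sake of illustration.

```python
import math
import math as m
from math import sqrt
from math import sqrt as root

print(math is m)          # True: two names for one and the same module
print(math.sqrt is sqrt)  # True: the function itself is not copied
print(sqrt is root)       # True: renaming only changes the label
print(root(16))           # 4.0
```

In other words, an import never duplicates anything. It only decides which names in your namespace point at the module or at the objects inside it.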
By the way, if you ever accidentally overwrite a built-in function with another object, you can restore it by importing it from the builtins module.
# oops
len = 5
len("five")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_56602/474153094.py in <module>
1 # oops
2 len = 5
----> 3 len("five")
TypeError: 'int' object is not callable
# problem solved
from builtins import len
len("five")
4
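Why does this work? Assigning to len doesn’t destroy the built-in function, it merely creates a variable of your own which hides it. The original stays safely tucked away in the builtins module the whole time. A small sketch to illustrate, under the assumption that you’d rather reach the original through attribute syntax than through a from import:

```python
import builtins

len = 5  # oops: our own variable now hides the built-in function

# The original is still there, we just have to ask for it by its full name.
print(builtins.len("five"))  # 4

len = builtins.len  # point our name back at the built-in function
print(len("five"))  # 4
```

This has the same effect as the from builtins import len line above, it just spells out the steps one by one.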
You can perform multiple imports per line by separating the different things to import with a comma.
import nltk, builtins
from builtins import len, set
If you want to import all objects defined in a module under the names given to them in that module, you can use star import syntax.
from builtins import *
At first, this might seem convenient and appealing. Those of you who know R might be intuitively drawn to this form because this is R’s default. Resist the urge. This variant makes it hard to track where objects came from, because a lot of names can hide under that *, and it only gets worse if you do this with multiple modules. In time, you’ll grow to appreciate Python’s more verbose but cleaner approach to namespaces, which makes it much easier to see at a glance where your variables came from.
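To make the danger concrete, here’s a small sketch using two standard library modules, math and cmath, which were picked for this illustration because both happen to define a function called sqrt(). Whichever star import comes last silently wins, and nothing in the code tells you so.

```python
from math import *   # brings in sqrt, floor, pi and many others
from cmath import *  # also defines sqrt, pi etc., silently replacing them

import math
import cmath

print(sqrt is cmath.sqrt)   # True: the later star import won
print(sqrt is math.sqrt)    # False: the earlier sqrt is no longer reachable as sqrt
print(floor is math.floor)  # True: cmath has no floor, so this one survived
print(sqrt(4))              # (2+0j), a complex number, maybe not what you expected
```

With explicit imports, the clash would at least be visible, because the name sqrt would appear twice in your import lines.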
Finally, the Python standard library contains many useful modules (the saying goes that Python comes with batteries included), but for many tasks, you’ll likely want to install additional packages. This can be done in a variety of ways, including a GUI manager if you’re using the Anaconda Python distribution, but the official Python package manager that should always be available is a command-line tool called pip.
This chapter is not about learning to use the command line, so just a quick crash course on pip. First, you need to figure out the name of the package you need. That can be done by searching the internet for keywords related to the functionality you want + Python, or by directly searching the Python Package Index.
When you have the name, you need to run the pip install command at the command line. Conveniently, you can do so directly from JupyterLab by prefixing it with ! (this is another one of those special JupyterLab features which isn’t actually part of Python itself). For instance, to install the nltk library, you would run the following command:
!pip install nltk
Looking in indexes: https://pypi.org/simple, https://packagecloud.io/akopytov/sysbench/pypi/simple
Requirement already satisfied: nltk in /home/david/repos/v4py.github.io/.venv/lib/python3.9/site-packages (3.6.4)
Requirement already satisfied: click in /home/david/repos/v4py.github.io/.venv/lib/python3.9/site-packages (from nltk) (7.1.2)
Requirement already satisfied: tqdm in /home/david/repos/v4py.github.io/.venv/lib/python3.9/site-packages (from nltk) (4.62.3)
Requirement already satisfied: joblib in /home/david/repos/v4py.github.io/.venv/lib/python3.9/site-packages (from nltk) (1.1.0)
Requirement already satisfied: regex in /home/david/repos/v4py.github.io/.venv/lib/python3.9/site-packages (from nltk) (2021.9.30)
If you get some sort of permission denied error, it’s because pip is trying to write into a system-wide library for all users which you don’t have write access to. In that case, try running pip install --user nltk instead.
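If you’re not sure whether a library is already available, you can ask Python itself before reaching for pip. Here’s one possible sketch using the standard library’s importlib.util.find_spec() function. The is_installed() helper and the deliberately silly package name at the end are made up for this example.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

print(is_installed("json"))  # part of the standard library, so always True
print(is_installed("nltk"))  # depends on whether you've installed it
print(is_installed("surely_not_a_real_package_xyz"))  # False
```

One caveat: this checks the name you use in an import statement, which isn’t always the name you give to pip install. For instance, the package you install as scikit-learn is imported as sklearn.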
That’s enough about libraries right now. On to control flow!