4. Getting your data into Python
4.1. Overview
This chapter is about interacting with data on your computer’s disk – mostly loading it into Python, but also writing it back, and storing it for quick reloading later.
4.2. Plain text files
A plain text file is a file where every single bit contributes towards representing plain text content. What fits into the concept of ‘plain text’ is actually surprisingly fuzzy – it depends on which characters are available in the character set you’re using, and character sets are just arbitrary human conventions. Historically, there were computer systems that didn’t make case distinctions, so upper vs. lower case was not something you could express using just plain text.
Nowadays, Unicode is making inroads into text properties which were previously the domain of rich text, assigning dedicated codepoints to 𝘪𝘵𝘢𝘭𝘪𝘤𝘴 or ˢᵘᵖᵉʳˢᶜʳᶦᵖᵗ, which allow you, among other things, to work around the lack of rich text capabilities in Twitter or Facebook posts (although that’s obviously not what they were originally intended for). But the further you go along the axis of content vs. appearance, the more you encounter properties which are unlikely to ever make it into plain text, like the ability to specify which particular font the text should be displayed with.
from unicodedata import name
for char in "𝘪𝘵𝘢𝘭𝘪𝘤𝘴ˢᵘᵖᵉʳˢᶜʳᶦᵖᵗ":
    print(char, name(char), sep="\t")
𝘪 MATHEMATICAL SANS-SERIF ITALIC SMALL I
𝘵 MATHEMATICAL SANS-SERIF ITALIC SMALL T
𝘢 MATHEMATICAL SANS-SERIF ITALIC SMALL A
𝘭 MATHEMATICAL SANS-SERIF ITALIC SMALL L
𝘪 MATHEMATICAL SANS-SERIF ITALIC SMALL I
𝘤 MATHEMATICAL SANS-SERIF ITALIC SMALL C
𝘴 MATHEMATICAL SANS-SERIF ITALIC SMALL S
ˢ MODIFIER LETTER SMALL S
ᵘ MODIFIER LETTER SMALL U
ᵖ MODIFIER LETTER SMALL P
ᵉ MODIFIER LETTER SMALL E
ʳ MODIFIER LETTER SMALL R
ˢ MODIFIER LETTER SMALL S
ᶜ MODIFIER LETTER SMALL C
ʳ MODIFIER LETTER SMALL R
ᶦ MODIFIER LETTER SMALL CAPITAL I
ᵖ MODIFIER LETTER SMALL P
ᵗ MODIFIER LETTER SMALL T
Plain text files can be opened with the built-in open() function. As we’ve seen in our discussion of Unicode, your safest bet with a plain text file in an unknown encoding is to start by trying to open it as UTF-8 – not because that will always work, but precisely because it won’t: if it’s not actually UTF-8, you’re likely to get an error, which will tell you there’s something fishy and prevent you from corrupting your data. Also, UTF-8 is becoming more and more prevalent, so chances are good that the file actually is UTF-8, in which case you’re golden. The text.txt file below, though, isn’t.
with open("data/text.txt", encoding="utf-8") as file:
    print(file.read())
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
/tmp/ipykernel_11227/3031899450.py in <module>
1 with open("data/text.txt", encoding="utf-8") as file:
----> 2 print(file.read())
~/.local/pyenv/versions/3.9.6/lib/python3.9/codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 1-2: invalid continuation byte
OK, that didn’t work. Now you can start thinking about alternatives. Do you know anything about the language the text in the file is supposed to be in? If it’s a Western European language, then the encoding might be latin1 or cp1252; if it’s Central European, then perhaps latin2 or cp1250 (beyond those two regions, I’m afraid I can’t give any tips, you’ll have to rely on web search engines).
Let’s say we know text.txt is in Czech, a Central European language. We might want to try cp1250, because that’s still (ugh) the default encoding for text files created under the Czech version of Windows, and that’s where the file might have come from.
with open("data/text.txt", encoding="cp1250") as file:
    print(file.read())
Běľela Magda kaňonem, sráľela banány rádiem.
Helenka líbala na kolínko robustního cestáře France.
Pan čáp ztratil čepičku, měla barvu barvičku...?
The tricky thing here is that everything went seemingly fine, no error occurred. That’s because cp1250 is an 8-bit fixed-width encoding, and almost any sequence of bytes can be interpreted as a cp1250-encoded file – even if it was originally intended to be something else. If a Czech speaker tries to read this output, some words will be off. It may be hard to spot them, because many 8-bit encodings share a lot of the byte–character mappings, at least in the ASCII range [0; 128), but to a smaller extent even beyond. This makes it easier to still read at least parts of the file if you’ve guessed the encoding wrong, but it also makes it harder to realize that you have.
The 1990s were the age of many different 8-bit encodings for different groups of languages. It sucked. Be glad you live in the age of UTF-8.
Finally, if we try latin2, that Czech person helpfully standing behind you can tell you that now, the result looks alright. If you compare with the previous attempt, you may see that there are indeed minute differences.
with open("data/text.txt", encoding="latin2") as file:
    print(file.read())
Běžela Magda kaňonem, srážela banány rádiem.
Helenka líbala na kolínko robustního cestáře France.
Pan čáp ztratil čepičku, měla barvu barvičku...?
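The two guesses above disagree on only a handful of byte values. Here is a minimal sketch for listing them, assuming Python’s built-in cp1250 and latin2 codecs; the variable names are purely illustrative. It decodes every byte above the ASCII range under both encodings and keeps the ones where the results differ.

```python
# Compare how cp1250 and latin2 interpret each single byte above ASCII.
differences = {}
for value in range(128, 256):
    raw = bytes([value])
    # errors="replace" guards against byte values a codec leaves undefined
    as_cp1250 = raw.decode("cp1250", errors="replace")
    as_latin2 = raw.decode("latin2", errors="replace")
    if as_cp1250 != as_latin2:
        differences[value] = (as_cp1250, as_latin2)

# The byte 0xBE is what turned "Běžela" into "Běľela" in the cp1250 attempt.
print(hex(0xBE), differences[0xBE])
print(b"B\xec\xbeela".decode("cp1250"), b"B\xec\xbeela".decode("latin2"))
```

Bytes that are missing from the resulting dictionary, such as the one for ě, mean the same thing in both encodings, which is why most of the text looked fine even with the wrong guess.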
Sometimes, you may encounter a plain text file that you know should be in UTF-8, but it has become corrupted for some reason – maybe it was spliced together with another file in a different encoding, or maybe some evil spirit flipped a few bits here and there. In that case, the contents won’t look right in any one encoding, but you might still want to at least have a glimpse at what it contains. To make Python soldier through in spite of encountering invalid byte sequences (i.e. encoding errors), you can specify an error handler. A preview of a few of the more common options is given below; for a full list, refer to the “Error Handlers” section of the codecs module documentation.
with open("data/text.txt", encoding="utf-8", errors="replace") as file:
    print(file.read())
B�ela Magda ka�onem, sr�ela ban�ny r�diem.
Helenka l�bala na kol�nko robustn�ho cest��e France.
Pan ��p ztratil �epi�ku, m�la barvu barvi�ku...?
with open("data/text.txt", encoding="utf-8", errors="ignore") as file:
    print(file.read())
Bela Magda kaonem, srela banny rdiem.
Helenka lbala na kolnko robustnho ceste France.
Pan p ztratil epiku, mla barvu barviku...?
with open("data/text.txt", encoding="utf-8", errors="backslashreplace") as file:
    print(file.read())
B\xec\xbeela Magda ka\xf2onem, sr\xe1\xbeela ban\xe1ny r\xe1diem.
Helenka l\xedbala na kol\xednko robustn\xedho cest\xe1\xf8e France.
Pan \xe8\xe1p ztratil \xe8epi\xe8ku, m\xecla barvu barvi\xe8ku...?
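The same handlers can be tried without a file, since bytes.decode() accepts the same errors argument as open(). The sketch below rebuilds the first word of the file as latin2 bytes (an assumption based on the outputs above) and decodes it as UTF-8 under each handler, including the default one, "strict".

```python
# Bytes for "Běžela" in latin2; not valid UTF-8.
raw = "Běžela".encode("latin2")

for handler in ("replace", "ignore", "backslashreplace"):
    print(handler, raw.decode("utf-8", errors=handler), sep="\t")

# The default handler, "strict", raises instead of papering over the problem.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("strict", err.reason, sep="\t")
```

Only "backslashreplace" keeps enough information to recover the original bytes later; "replace" and "ignore" lose it for good.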
Incidentally, this allows us to take a peek at what a rich text file looks like under the hood. If you open text.doc in a word processor, you’ll see that it has the same textual content as text.txt. But opening it as plain text, we see that the file definitely contains some more stuff besides that, and most of it can’t be interpreted as text – there’s a lot of those question mark replacement characters, and even more null bytes (\x00, i.e. a byte consisting of all 0’s). Some of it is metadata, e.g. who wrote the text and when, or possibly encoding information, so that the word processor doesn’t have to take blind guesses at which encoding the text is stored in like we just had to.
with open("data/text.doc", encoding="utf-8", errors="replace") as file:
    txt = file.read()
txt[:10]
'��\x11\u0871\x1a�\x00\x00\x00\x00'
# this is probably author metadata?
index = txt.find("Lukeš")
txt[index-2:index+15]
'\x00\x00Lukeš, David\x00\x00\x00'
# and this is probably the part which corresponds to "Magda kaňonem" in
# our plain text file
index = txt.find("M")
txt[index:index+25]
'M\x00a\x00g\x00d\x00a\x00 \x00k\x00a\x00H\x01o\x00n\x00e\x00m'
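The alternating null bytes, and the H\x01 standing where ň should be, suggest that the word processor stored this stretch of text as little-endian UTF-16. That is an inference from the output above, not something taken from the file format’s specification. A short sketch of why the reading is plausible:

```python
# Encode the phrase as little-endian UTF-16, then misread the bytes one by one.
phrase = "Magda kaňonem"
raw = phrase.encode("utf-16-le")

# latin1 maps every byte to the code point with the same number, so it makes
# the raw byte values visible as characters.
misread = raw.decode("latin1")
print(repr(misread))

# Decoding with the matching encoding recovers the original text.
print(raw.decode("utf-16-le"))
```

The letter ň is code point U+0148; written low byte first, that is 0x48 (which happens to be ASCII H) followed by 0x01.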
4.3. Manipulating tabular data
Data often comes in tabular format – Excel files, CSV files and the like. The easiest and most convenient way to load this type of data into Python and manipulate it is using the pandas library. While manual modifications of individual cells will probably always be more ergonomic in a spreadsheet editor like Excel, any kind of mass data manipulation is what pandas … excels at, if you’ll pardon the pun. It does so in a clean and efficient way, mostly without even breaking a sweat, and best of all, you can always retrace your steps and check whether you’ve made a mistake at some point because you’re writing them down as Python commands instead of clicking around in a graphical user interface. Not to mention that this makes it trivial to apply the same series of processing steps to similarly shaped data, once you’ve figured them out.
4.3.1. The pandas library
Let’s fire up pandas and take a look at how it can help you to slice and dice tables in Python. Or should I say DataFrames, because that’s what pandas calls them, acknowledging inspiration from R’s trademark data structure.
import pandas as pd
df = pd.read_excel("data/concordance_corpus.xlsx")
type(df)
pandas.core.frame.DataFrame
df
 | newadvent.org | by Theodor Mommsen , undertook its monumental publication , the | Corpus/Corpus/NP | Inscriptionum Latinarum " , it sent a flattering letter to |
---|---|---|---|---|
0 | newadvent.org | Mommsen . The latter 's numerous collaborators... | Corpus/Corpus/NP | " , among them Edwin Bormann , the noted autho... |
1 | newadvent.org | ) , concerning the preparatory work for the ab... | Corpus/Corpus/NP | Inscriptionum " , which appeared in the monthl... |
2 | freerepublic.com | , or arrest the MD assembly , or suspend habeus | corpus/corpus/NN | , or invade sovereign states . He did n't... |
3 | hinduism.co.za | corporeal being in the fullness of time , assu... | corpus/corpus/NN | . It arises and perishes in due order . And |
4 | lg-legal.com | relation to the takeover offer announced today... | Corpora/corpus/NNS | valuing Archipelago at £ 340 m. 27 Sep 2013 MORE |
... | ... | ... | ... | ... |
848 | blogs.ulster.ac.uk | University . For example , Corpus Christi Coll... | Corpus/Corpus/NP | Irish Missal , 12 th century ( MS 282 ) |
849 | ginnysaustin.com | set by a youth cutting my teeth on tacos in | Corpus/Corpus/NP | Christi ( RIP Elva ’s ) . Now , I |
850 | patriotsquestion911.com | of the Geneva Conventions , and the repeal of ... | corpus/corpus/NN | ( a fundamental point of law that has been with |
851 | biblicalarchaeology.org | Keel recently pointed out , even in the highly... | Corpus/Corpus/NP | of West Semitic Stamp Seals published by Nahma... |
852 | rrojasdatabank.info | developing ones , were interpreted through the... | corpus/corpus/NN | of knowledge recognized as Keynesian economics... |
853 rows × 4 columns
pd.read_excel(
    "data/concordance_corpus.xlsx",
    header=None
)
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | newadvent.org | by Theodor Mommsen , undertook its monumental ... | Corpus/Corpus/NP | Inscriptionum Latinarum " , it sent a flatteri... |
1 | newadvent.org | Mommsen . The latter 's numerous collaborators... | Corpus/Corpus/NP | " , among them Edwin Bormann , the noted autho... |
2 | newadvent.org | ) , concerning the preparatory work for the ab... | Corpus/Corpus/NP | Inscriptionum " , which appeared in the monthl... |
3 | freerepublic.com | , or arrest the MD assembly , or suspend habeus | corpus/corpus/NN | , or invade sovereign states . He did n't... |
4 | hinduism.co.za | corporeal being in the fullness of time , assu... | corpus/corpus/NN | . It arises and perishes in due order . And |
... | ... | ... | ... | ... |
849 | blogs.ulster.ac.uk | University . For example , Corpus Christi Coll... | Corpus/Corpus/NP | Irish Missal , 12 th century ( MS 282 ) |
850 | ginnysaustin.com | set by a youth cutting my teeth on tacos in | Corpus/Corpus/NP | Christi ( RIP Elva ’s ) . Now , I |
851 | patriotsquestion911.com | of the Geneva Conventions , and the repeal of ... | corpus/corpus/NN | ( a fundamental point of law that has been with |
852 | biblicalarchaeology.org | Keel recently pointed out , even in the highly... | Corpus/Corpus/NP | of West Semitic Stamp Seals published by Nahma... |
853 | rrojasdatabank.info | developing ones , were interpreted through the... | corpus/corpus/NN | of knowledge recognized as Keynesian economics... |
854 rows × 4 columns
pd.read_excel?
df = pd.read_excel(
    "data/concordance_corpus.xlsx",
    header=None,
    names=["domain", "left", "kwic", "right"]
)
df
 | domain | left | kwic | right |
---|---|---|---|---|
0 | newadvent.org | by Theodor Mommsen , undertook its monumental ... | Corpus/Corpus/NP | Inscriptionum Latinarum " , it sent a flatteri... |
1 | newadvent.org | Mommsen . The latter 's numerous collaborators... | Corpus/Corpus/NP | " , among them Edwin Bormann , the noted autho... |
2 | newadvent.org | ) , concerning the preparatory work for the ab... | Corpus/Corpus/NP | Inscriptionum " , which appeared in the monthl... |
3 | freerepublic.com | , or arrest the MD assembly , or suspend habeus | corpus/corpus/NN | , or invade sovereign states . He did n't... |
4 | hinduism.co.za | corporeal being in the fullness of time , assu... | corpus/corpus/NN | . It arises and perishes in due order . And |
... | ... | ... | ... | ... |
849 | blogs.ulster.ac.uk | University . For example , Corpus Christi Coll... | Corpus/Corpus/NP | Irish Missal , 12 th century ( MS 282 ) |
850 | ginnysaustin.com | set by a youth cutting my teeth on tacos in | Corpus/Corpus/NP | Christi ( RIP Elva ’s ) . Now , I |
851 | patriotsquestion911.com | of the Geneva Conventions , and the repeal of ... | corpus/corpus/NN | ( a fundamental point of law that has been with |
852 | biblicalarchaeology.org | Keel recently pointed out , even in the highly... | Corpus/Corpus/NP | of West Semitic Stamp Seals published by Nahma... |
853 | rrojasdatabank.info | developing ones , were interpreted through the... | corpus/corpus/NN | of knowledge recognized as Keynesian economics... |
854 rows × 4 columns
df["domain"]
0 newadvent.org
1 newadvent.org
2 newadvent.org
3 freerepublic.com
4 hinduism.co.za
...
849 blogs.ulster.ac.uk
850 ginnysaustin.com
851 patriotsquestion911.com
852 biblicalarchaeology.org
853 rrojasdatabank.info
Name: domain, Length: 854, dtype: object
len(set(df["domain"]))
234
df[["domain", "kwic"]]
 | domain | kwic |
---|---|---|
0 | newadvent.org | Corpus/Corpus/NP |
1 | newadvent.org | Corpus/Corpus/NP |
2 | newadvent.org | Corpus/Corpus/NP |
3 | freerepublic.com | corpus/corpus/NN |
4 | hinduism.co.za | corpus/corpus/NN |
... | ... | ... |
849 | blogs.ulster.ac.uk | Corpus/Corpus/NP |
850 | ginnysaustin.com | Corpus/Corpus/NP |
851 | patriotsquestion911.com | corpus/corpus/NN |
852 | biblicalarchaeology.org | Corpus/Corpus/NP |
853 | rrojasdatabank.info | corpus/corpus/NN |
854 rows × 2 columns
df.loc[1:3, "domain":"kwic"]
 | domain | left | kwic |
---|---|---|---|
1 | newadvent.org | Mommsen . The latter 's numerous collaborators... | Corpus/Corpus/NP |
2 | newadvent.org | ) , concerning the preparatory work for the ab... | Corpus/Corpus/NP |
3 | freerepublic.com | , or arrest the MD assembly , or suspend habeus | corpus/corpus/NN |
df.loc[1:3, ["domain", "kwic"]]
 | domain | kwic |
---|---|---|
1 | newadvent.org | Corpus/Corpus/NP |
2 | newadvent.org | Corpus/Corpus/NP |
3 | freerepublic.com | corpus/corpus/NN |
df.kwic
0 Corpus/Corpus/NP
1 Corpus/Corpus/NP
2 Corpus/Corpus/NP
3 corpus/corpus/NN
4 corpus/corpus/NN
...
849 Corpus/Corpus/NP
850 Corpus/Corpus/NP
851 corpus/corpus/NN
852 Corpus/Corpus/NP
853 corpus/corpus/NN
Name: kwic, Length: 854, dtype: object
df.domain == "newadvent.org"
0 True
1 True
2 True
3 False
4 False
...
849 False
850 False
851 False
852 False
853 False
Name: domain, Length: 854, dtype: bool
df.loc[1:3]
 | domain | left | kwic | right |
---|---|---|---|---|
1 | newadvent.org | Mommsen . The latter 's numerous collaborators... | Corpus/Corpus/NP | " , among them Edwin Bormann , the noted autho... |
2 | newadvent.org | ) , concerning the preparatory work for the ab... | Corpus/Corpus/NP | Inscriptionum " , which appeared in the monthl... |
3 | freerepublic.com | , or arrest the MD assembly , or suspend habeus | corpus/corpus/NN | , or invade sovereign states . He did n't... |
df.loc[df.domain == "newadvent.org"]
 | domain | left | kwic | right |
---|---|---|---|---|
0 | newadvent.org | by Theodor Mommsen , undertook its monumental ... | Corpus/Corpus/NP | Inscriptionum Latinarum " , it sent a flatteri... |
1 | newadvent.org | Mommsen . The latter 's numerous collaborators... | Corpus/Corpus/NP | " , among them Edwin Bormann , the noted autho... |
2 | newadvent.org | ) , concerning the preparatory work for the ab... | Corpus/Corpus/NP | Inscriptionum " , which appeared in the monthl... |
155 | newadvent.org | made their way into the earlier editions of the " | Corpus/Corpus/NP | Juris Civilis " , the " Corpus Juris Canonici " |
156 | newadvent.org | of the " Corpus Juris Civilis " , the " | Corpus/Corpus/NP | Juris Canonici " , and the large collections o... |
308 | newadvent.org | deceased : e.g. QUI LEGIS , ORA PRO EO ( | Corpus/Corpus/NP | Inscript . Lat . , X , n. 3312 ) |
df.loc
<pandas.core.indexing._LocIndexer at 0x7f08f08b1d10>
df.query("domain == 'newadvent.org'")
 | domain | left | kwic | right |
---|---|---|---|---|
0 | newadvent.org | by Theodor Mommsen , undertook its monumental ... | Corpus/Corpus/NP | Inscriptionum Latinarum " , it sent a flatteri... |
1 | newadvent.org | Mommsen . The latter 's numerous collaborators... | Corpus/Corpus/NP | " , among them Edwin Bormann , the noted autho... |
2 | newadvent.org | ) , concerning the preparatory work for the ab... | Corpus/Corpus/NP | Inscriptionum " , which appeared in the monthl... |
155 | newadvent.org | made their way into the earlier editions of the " | Corpus/Corpus/NP | Juris Civilis " , the " Corpus Juris Canonici " |
156 | newadvent.org | of the " Corpus Juris Civilis " , the " | Corpus/Corpus/NP | Juris Canonici " , and the large collections o... |
308 | newadvent.org | deceased : e.g. QUI LEGIS , ORA PRO EO ( | Corpus/Corpus/NP | Inscript . Lat . , X , n. 3312 ) |
df.domain.str.endswith(".org")
0 True
1 True
2 True
3 False
4 False
...
849 False
850 False
851 False
852 True
853 False
Name: domain, Length: 854, dtype: bool
df.loc[df.domain.str.endswith(".org")]
 | domain | left | kwic | right |
---|---|---|---|---|
0 | newadvent.org | by Theodor Mommsen , undertook its monumental ... | Corpus/Corpus/NP | Inscriptionum Latinarum " , it sent a flatteri... |
1 | newadvent.org | Mommsen . The latter 's numerous collaborators... | Corpus/Corpus/NP | " , among them Edwin Bormann , the noted autho... |
2 | newadvent.org | ) , concerning the preparatory work for the ab... | Corpus/Corpus/NP | Inscriptionum " , which appeared in the monthl... |
23 | thefullwiki.org | the fundamental rule from which can be deduced... | corpus/corpus/NN | of libertarian theory . [ 13 ] ” W. D. |
25 | schools-wikipedia.org | of the thousands of extant inscriptions are pu... | Corpus/Corpus/NP | of Indus Seals and Inscriptions ( 1987 , 1991 , |
... | ... | ... | ... | ... |
835 | obdurodon.org | goal is to provide scholars with a large but m... | corpus/corpus/NN | of data for comparative study of calendar trad... |
838 | realizingrights.org | adopt the principles and rights outlined in th... | corpus/corpus/NN | as a set of public health ethics then we are |
843 | unispal.un.org | , venerated by Christians , Jews and Moslems , a | corpus/corpus/NN | separatum , which should be under internationa... |
844 | home.igc.org | heads of state Blocking the small arms treaty ... | corpus/corpus/NN | to prisoners on Guantanamo and other secret pr... |
852 | biblicalarchaeology.org | Keel recently pointed out , even in the highly... | Corpus/Corpus/NP | of West Semitic Stamp Seals published by Nahma... |
161 rows × 4 columns
df.kwic.str.split("/")
0 [Corpus, Corpus, NP]
1 [Corpus, Corpus, NP]
2 [Corpus, Corpus, NP]
3 [corpus, corpus, NN]
4 [corpus, corpus, NN]
...
849 [Corpus, Corpus, NP]
850 [Corpus, Corpus, NP]
851 [corpus, corpus, NN]
852 [Corpus, Corpus, NP]
853 [corpus, corpus, NN]
Name: kwic, Length: 854, dtype: object
for row in df.kwic.str.split("/"):
    print(row)
    break
['Corpus', 'Corpus', 'NP']
df.kwic.str.split("/", expand=True)
 | 0 | 1 | 2 |
---|---|---|---|
0 | Corpus | Corpus | NP |
1 | Corpus | Corpus | NP |
2 | Corpus | Corpus | NP |
3 | corpus | corpus | NN |
4 | corpus | corpus | NN |
... | ... | ... | ... |
849 | Corpus | Corpus | NP |
850 | Corpus | Corpus | NP |
851 | corpus | corpus | NN |
852 | Corpus | Corpus | NP |
853 | corpus | corpus | NN |
854 rows × 3 columns
df[["kwic", "domain"]]
 | kwic | domain |
---|---|---|
0 | Corpus/Corpus/NP | newadvent.org |
1 | Corpus/Corpus/NP | newadvent.org |
2 | Corpus/Corpus/NP | newadvent.org |
3 | corpus/corpus/NN | freerepublic.com |
4 | corpus/corpus/NN | hinduism.co.za |
... | ... | ... |
849 | Corpus/Corpus/NP | blogs.ulster.ac.uk |
850 | Corpus/Corpus/NP | ginnysaustin.com |
851 | corpus/corpus/NN | patriotsquestion911.com |
852 | Corpus/Corpus/NP | biblicalarchaeology.org |
853 | corpus/corpus/NN | rrojasdatabank.info |
854 rows × 2 columns
df[["word", "lemma", "tag"]]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/tmp/ipykernel_11227/2526799359.py in <module>
----> 1 df[["word", "lemma", "tag"]]
~/repos/v4py.github.io/.venv/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
3462 if is_iterator(key):
3463 key = list(key)
-> 3464 indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
3465
3466 # take() does not accept boolean indexers
~/repos/v4py.github.io/.venv/lib/python3.9/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
1312 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1313
-> 1314 self._validate_read_indexer(keyarr, indexer, axis)
1315
1316 if needs_i8_conversion(ax.dtype) or isinstance(
~/repos/v4py.github.io/.venv/lib/python3.9/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
1372 if use_interval_msg:
1373 key = list(key)
-> 1374 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
1375
1376 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
KeyError: "None of [Index(['word', 'lemma', 'tag'], dtype='object')] are in the [columns]"
df[["word", "lemma", "tag"]] = df.kwic.str.split("/", expand=True)
df
 | domain | left | kwic | right | word | lemma | tag |
---|---|---|---|---|---|---|---|
0 | newadvent.org | by Theodor Mommsen , undertook its monumental ... | Corpus/Corpus/NP | Inscriptionum Latinarum " , it sent a flatteri... | Corpus | Corpus | NP |
1 | newadvent.org | Mommsen . The latter 's numerous collaborators... | Corpus/Corpus/NP | " , among them Edwin Bormann , the noted autho... | Corpus | Corpus | NP |
2 | newadvent.org | ) , concerning the preparatory work for the ab... | Corpus/Corpus/NP | Inscriptionum " , which appeared in the monthl... | Corpus | Corpus | NP |
3 | freerepublic.com | , or arrest the MD assembly , or suspend habeus | corpus/corpus/NN | , or invade sovereign states . He did n't... | corpus | corpus | NN |
4 | hinduism.co.za | corporeal being in the fullness of time , assu... | corpus/corpus/NN | . It arises and perishes in due order . And | corpus | corpus | NN |
... | ... | ... | ... | ... | ... | ... | ... |
849 | blogs.ulster.ac.uk | University . For example , Corpus Christi Coll... | Corpus/Corpus/NP | Irish Missal , 12 th century ( MS 282 ) | Corpus | Corpus | NP |
850 | ginnysaustin.com | set by a youth cutting my teeth on tacos in | Corpus/Corpus/NP | Christi ( RIP Elva ’s ) . Now , I | Corpus | Corpus | NP |
851 | patriotsquestion911.com | of the Geneva Conventions , and the repeal of ... | corpus/corpus/NN | ( a fundamental point of law that has been with | corpus | corpus | NN |
852 | biblicalarchaeology.org | Keel recently pointed out , even in the highly... | Corpus/Corpus/NP | of West Semitic Stamp Seals published by Nahma... | Corpus | Corpus | NP |
853 | rrojasdatabank.info | developing ones , were interpreted through the... | corpus/corpus/NN | of knowledge recognized as Keynesian economics... | corpus | corpus | NN |
854 rows × 7 columns
df2 = df[["domain", "left", "word", "lemma", "tag", "right"]]
df2
 | domain | left | word | lemma | tag | right |
---|---|---|---|---|---|---|
0 | newadvent.org | by Theodor Mommsen , undertook its monumental ... | Corpus | Corpus | NP | Inscriptionum Latinarum " , it sent a flatteri... |
1 | newadvent.org | Mommsen . The latter 's numerous collaborators... | Corpus | Corpus | NP | " , among them Edwin Bormann , the noted autho... |
2 | newadvent.org | ) , concerning the preparatory work for the ab... | Corpus | Corpus | NP | Inscriptionum " , which appeared in the monthl... |
3 | freerepublic.com | , or arrest the MD assembly , or suspend habeus | corpus | corpus | NN | , or invade sovereign states . He did n't... |
4 | hinduism.co.za | corporeal being in the fullness of time , assu... | corpus | corpus | NN | . It arises and perishes in due order . And |
... | ... | ... | ... | ... | ... | ... |
849 | blogs.ulster.ac.uk | University . For example , Corpus Christi Coll... | Corpus | Corpus | NP | Irish Missal , 12 th century ( MS 282 ) |
850 | ginnysaustin.com | set by a youth cutting my teeth on tacos in | Corpus | Corpus | NP | Christi ( RIP Elva ’s ) . Now , I |
851 | patriotsquestion911.com | of the Geneva Conventions , and the repeal of ... | corpus | corpus | NN | ( a fundamental point of law that has been with |
852 | biblicalarchaeology.org | Keel recently pointed out , even in the highly... | Corpus | Corpus | NP | of West Semitic Stamp Seals published by Nahma... |
853 | rrojasdatabank.info | developing ones , were interpreted through the... | corpus | corpus | NN | of knowledge recognized as Keynesian economics... |
854 rows × 6 columns
df.plot?
df["domain"]
0 newadvent.org
1 newadvent.org
2 newadvent.org
3 freerepublic.com
4 hinduism.co.za
...
849 blogs.ulster.ac.uk
850 ginnysaustin.com
851 patriotsquestion911.com
852 biblicalarchaeology.org
853 rrojasdatabank.info
Name: domain, Length: 854, dtype: object
df["domain"].value_counts()
ucrel.lancs.ac.uk 123
nltk.googlecode.com 90
quinndombrowski.com 65
medicolegal.tripod.com 28
cass.lancs.ac.uk 21
...
publiusonline.com 1
brainethics.wordpress.com 1
epluribusmedia.org 1
news.art.fsu.edu 1
rrojasdatabank.info 1
Name: domain, Length: 234, dtype: int64
df["domain"].value_counts().plot(kind="bar")
<AxesSubplot:>
df["domain"].value_counts().head().plot(kind="bar")
<AxesSubplot:>
pd.read_csv("data/frequencies_intensifiers.csv")
 | 1;"completely different";"672";"" |
---|---|
0 | 2;"entirely different";"386";"" |
1 | 3;"entirely new";"334";"" |
2 | 4;"totally different";"282";"" |
3 | 5;"completely new";"261";"" |
4 | 6;"completely free";"147";"" |
... | ... |
2844 | 2846;"entirely relative";"1";"" |
2845 | 2847;"completely brown";"1";"" |
2846 | 2848;"completely literate";"1";"" |
2847 | 2849;"totally boneheaded";"1";"" |
2848 | 2850;"totally housebound";"1";"" |
2849 rows × 1 columns
pd.read_csv?
pd.read_csv(
    "data/frequencies_intensifiers.csv",
    sep=";",
    header=None,
    names=["rank", "collocation", "freq", "empty"]
)
 | rank | collocation | freq | empty |
---|---|---|---|---|
0 | 1 | completely different | 672 | NaN |
1 | 2 | entirely different | 386 | NaN |
2 | 3 | entirely new | 334 | NaN |
3 | 4 | totally different | 282 | NaN |
4 | 5 | completely new | 261 | NaN |
... | ... | ... | ... | ... |
2845 | 2846 | entirely relative | 1 | NaN |
2846 | 2847 | completely brown | 1 | NaN |
2847 | 2848 | completely literate | 1 | NaN |
2848 | 2849 | totally boneheaded | 1 | NaN |
2849 | 2850 | totally housebound | 1 | NaN |
2850 rows × 4 columns
If you want to go in the other direction and store DataFrames on disk as Excel spreadsheets, CSV files or many other formats, check out the methods starting with to_* on DataFrame objects (remember you can use Tab in JupyterLab to bring up a completion menu if you start typing just to_).
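As one instance of the to_* family, here is a hedged sketch of a round trip through DataFrame.to_csv() and pd.read_csv(). The tiny table is a hand-typed stand-in for the first rows of the intensifier frequency list, and an in-memory io.StringIO buffer takes the place of a real file path so the example leaves nothing behind on disk.

```python
import io

import pandas as pd

# A hand-typed stand-in for the first two rows of the frequency list.
df = pd.DataFrame({
    "rank": [1, 2],
    "collocation": ["completely different", "entirely different"],
    "freq": [672, 386],
})

# Write semicolon-separated text; index=False leaves out the row labels.
buffer = io.StringIO()
df.to_csv(buffer, sep=";", index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back needs the same separator.
reloaded = pd.read_csv(io.StringIO(csv_text), sep=";")
print(reloaded)
```

Passing a path such as "data/out.csv" (a hypothetical filename) instead of the buffer would write the same text to disk.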
pandas is an impressively featureful library and we’ve barely scratched the surface of what you can do with it. It has also only fairly recently hit 1.0 status, which means a lot of polish has been applied to its website and documentation. Previously, the documentation, though extensive and complete, was somewhat hard to navigate; this has gotten much better. For more information, I suggest reviewing the library’s Getting started, which contains a list of practical tasks you might want to use pandas for along with recipes telling you how to achieve that.
4.3.2. The csv module in the standard library
The Python standard library also comes with a csv module. This is useful when you don’t have the option to install pandas, or when you don’t really need to work with the CSV file as a table, you just need to pull out some values and put them in a dictionary for instance. In that case, pandas may be an unnecessarily heavy dependency (as a Swiss Army knife for data manipulation, it’s pretty hefty), not to mention that loading the entire table into memory at once might be wasteful, especially if it’s large and you just want one or two columns.
Let’s first take a peek at the contents of a CSV file. As mentioned, it’s basically just a plain text file. This particular CSV file contains a frequency distribution of intensifier + adjective combinations.
with open("data/frequencies_intensifiers.csv", encoding="utf-8") as file:
    for line in file:
        print(line)
        break
"1";"completely different";"672";""
import csv
with open("data/frequencies_intensifiers.csv", encoding="utf-8") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
        break
['1;"completely different";"672";""']
row
['1;"completely different";"672";""']
len(row)
1
csv.reader?
with open("data/frequencies_intensifiers.csv", encoding="utf-8") as file:
    reader = csv.reader(file, delimiter=";")
    for row in reader:
        print(row)
        break
['1', 'completely different', '672', '']
len(row)
4
int("4")
4
float("4.5")
4.5
row[1]
'completely different'
row[1].split()
['completely', 'different']
adv, adj = row[1].split()
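The pieces explored above can be combined into the use case mentioned at the start of this section: pulling a couple of values out of each row into a dictionary, without loading the table as a whole. This is a minimal sketch; the io.StringIO object is a stand-in for an open file, and its three rows are copied from the frequency list shown earlier.

```python
import csv
import io

# Stand-in for an open file; the rows mimic the layout of
# frequencies_intensifiers.csv (rank; collocation; frequency; empty).
sample = io.StringIO(
    '1;"completely different";"672";""\n'
    '2;"entirely different";"386";""\n'
    '3;"entirely new";"334";""\n'
)

freq_by_collocation = {}
# The reader yields one row at a time, so the whole table is never in memory.
for row in csv.reader(sample, delimiter=";"):
    collocation, freq = row[1], row[2]
    # Every field arrives as a string; convert the count explicitly.
    freq_by_collocation[collocation] = int(freq)

print(freq_by_collocation)
```

Unlike pandas, the csv module does no type inference, which is why the int() call is needed.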
Let’s divide up the adjectives into sets based on which intensifiers they co-occur with.
completely = set()
totally = set()
entirely = set()
utterly = set()
with open("data/frequencies_intensifiers.csv", encoding="utf-8") as file:
    reader = csv.reader(file, delimiter=";")
    for row in reader:
        adv, adj = row[1].split()
        if adv == "completely":
            completely.add(adj)
        elif adv == "totally":
            totally.add(adj)
        elif adv == "entirely":
            entirely.add(adj)
        elif adv == "utterly":
            utterly.add(adj)
        else:
            print("unexpected adverb:", adv)
By using set operations, we can now figure out which intensifiers tend (not) to co-occur with which adjectives.
not_utterly = completely | totally | entirely
# or: not_utterly = completely.union(totally).union(entirely)
utterly - not_utterly
# or: utterly.difference(not_utterly)
{'appalling',
'arresting',
'bigoted',
'blasphemous',
'blasphomous',
'cack-handed',
'childish',
'clever',
'conditional',
'contemptible',
'crippling',
'damnable',
'defective',
'degrading',
'depraved',
'despicable',
'detestable',
'devoted',
'digestible',
'disappointing',
'disconsolate',
'disgraceful',
'dismal',
'disposable',
'distasteful',
'distinguished',
'disturbing',
'downcast',
'dreadful',
'earthy',
'effeminate',
'endless',
'energetic',
'enraged',
'exquisite',
'extraordinary',
'fatuous',
'fluid',
'forgettable',
'fragile',
'geeky',
'graceful',
'gracious',
'guileless',
'heartbreaking',
'hopeful',
'hysterical',
'impassioned',
'important',
'impoverished',
'indistinguishable',
'indivisible',
'infrequent',
'lawless',
'materialistic',
'minuscule',
'non-essential',
'nugatory',
'obscene',
'obtuse',
'one-of-a-kind',
'orthodox',
'paltry',
'partisan',
'passé',
'pathological',
'perverse',
'phenomenal',
'praiseworthy',
'prepared',
'preposterous',
'profound',
'reductionistic',
'remarkable',
'remiss',
'riveting',
'ruthless',
'self-destructive',
'shambolic',
'shameful',
'similar',
'simple',
'simplistic',
'situational',
'smug',
'spectacular',
'splendid',
'squalid',
'stifling',
'stirring',
'stubborn',
'stunning',
'subdued',
'therapeutic',
'thoughtless',
'totalitarian',
'un-australian',
'uncared',
'uncollectable',
'undesirable',
'unexceptional',
'unfathomable',
'ungodly',
'uninteresting',
'unprovable',
'unreconcilable',
'unscientific',
'unscrupulous',
'unspooky',
'unsuspecting',
'unwinnable',
'vain',
'vastated',
'vulgar'}
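To see what the set operators do without wading through the full output, here is a small sketch with toy sets. The adjectives in them are made up for illustration and are not taken from the data file.

```python
# Toy sets, invented for illustration only.
completely_toy = {"different", "wrong", "new"}
totally_toy = {"different", "wrong", "awesome"}
utterly_toy = {"wrong", "ridiculous"}

# Union: adjectives seen with at least one of the two intensifiers.
either = completely_toy | totally_toy

# Intersection: adjectives seen with both intensifiers.
both = completely_toy & totally_toy

# Difference: adjectives seen with "utterly" but with neither of the others.
only_utterly = utterly_toy - either

# Sets are unordered, so sort them to get a stable printout.
print(sorted(either))
print(sorted(both))
print(sorted(only_utterly))
```

The method spellings (.union(), .intersection(), .difference()) do the same thing as the operators.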
4.4. Storing objects on disk and reloading them¶
Some values take a long time to compute, so you don’t want to have to compute them again and again each time you close and reopen JupyterLab. Instead, you’d like to compute them once, store them somewhere, and reload them (almost) instantaneously whenever you need.
4.4.1. The %store magic function¶
The %store magic function can store individual variables; it’s perhaps the simplest option, but you don’t really control where the object gets stored.
a = 2
%store a
Stored 'a' (int)
a = 3
a
3
Reload the stored value of the a variable:
%store -r a
a
2
For more information, consult %store’s docstring.
?%store
4.4.2. The json standard library module¶
The standard library json module can also be used for this purpose.
import json
JSON serialization actually results in plain text, which is nice and mostly human readable, if it’s pretty-printed. It looks close to how the same data structure is written down in Python (can you spot the differences?).
person = {
"name": "John Doe",
"age": 31,
"interests": ["Python", "linguistics"],
"single": False,
"pet": None,
}
print(json.dumps(person, indent=2))
{
"name": "John Doe",
"age": 31,
"interests": [
"Python",
"linguistics"
],
"single": false,
"pet": null
}
When writing to disk, you get to pick where the object is stored, at the expense of having to type more than with %store. The "w" argument sets the mode of the open file to write (the "r" mode for reading is the default, so we didn’t need to set it explicitly before when reading files).
with open("person.json", "w") as file:
    json.dump(person, file, indent=2)
%cat person.json
{
"name": "John Doe",
"age": 31,
"interests": [
"Python",
"linguistics"
],
"single": false,
"pet": null
}
with open("person.json") as file:
    data = json.load(file)
data
{'name': 'John Doe',
'age': 31,
'interests': ['Python', 'linguistics'],
'single': False,
'pet': None}
# cleanup
%rm person.json
If you want to store multiple objects like this, just put them in a dictionary and json.dump the whole thing.
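As a rough sketch of what that might look like: the key names, the bundle.json filename and the use of a temporary directory below are all choices made for this example, not anything prescribed by json.

```python
import json
import os
import tempfile

person = {"name": "John Doe", "age": 31}
interests = ["Python", "linguistics"]

# Bundle several objects under keys of our own choosing.
bundle = {"person": person, "interests": interests}

# Write into a temporary directory so the example cleans up after itself.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bundle.json")
    with open(path, "w", encoding="utf-8") as file:
        json.dump(bundle, file, indent=2)
    with open(path, encoding="utf-8") as file:
        restored = json.load(file)

# Unpack the individual objects again.
person_again = restored["person"]
interests_again = restored["interests"]
print(person_again == person, interests_again == interests)
```

The keys then act as the names under which you retrieve each object after loading.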
JSON was created as an interchange format, which comes with both advantages and drawbacks. The advantage is that it can be easily loaded into different languages and tools: almost every programming language now has an easily accessible JSON library. The main drawback is that it only works for storing a limited range of types: dicts, lists, strings, numbers, Boolean values (True and False) and None. As an interchange format, it makes sense that it has to stick to the lowest common denominator of what’s available in some form in almost every programming language, otherwise there couldn’t be much interchange.
Some additional types can be stored as JSON, but only by being converted
to one of the above – e.g. if you store a tuple in JSON and load it
back, it will become a list.
json.loads(json.dumps((1, 2, 3)))
[1, 2, 3]
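Other types cannot be stored at all unless you convert them yourself. The sketch below shows two such cases with the standard json module: a set, which is rejected outright, and a dict with a non-string key, which is silently changed. The sorted-list workaround is just one possible approach, and the two adjectives are picked arbitrarily for the example.

```python
import json

utterly = {"dismal", "appalling"}

# Sets are not among the supported types, so json.dumps() raises TypeError.
try:
    json.dumps(utterly)
    set_error = None
except TypeError as err:
    set_error = err
print("set could not be serialized:", set_error is not None)

# One workaround: convert to a sorted list before dumping,
# and back to a set after loading.
restored = set(json.loads(json.dumps(sorted(utterly))))
print(restored == utterly)

# Non-string dictionary keys are converted to strings along the way.
keys_changed = json.loads(json.dumps({1: "one"}))
print(keys_changed)
```

So after a round trip through JSON, always check that you got back the types you expect, not just the values.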
4.4.3. The pickle standard library module¶
Pickling objects works in a very similar way to dumping them as JSON; just make sure to open the file for writing in binary mode ("wb"):
import pickle
with open("person.pickle", "wb") as file:
    pickle.dump(person, file)
… and for reading as well ("rb"):
with open("person.pickle", "rb") as file:
    data = pickle.load(file)
data
{'name': 'John Doe',
'age': 31,
'interests': ['Python', 'linguistics'],
'single': False,
'pet': None}
This is because pickling doesn’t use a plain text format, but a custom binary format.
%cat person.pickle
��W}�(�name��John Doe��age�K� interests�]�(�Python��linguistics�e�single���pet�Nu.
# cleanup
%rm person.pickle
The advantage of pickle is that unlike JSON, it can faithfully preserve a much wider spectrum of Python objects (most things you’re likely to need in normal practice). This flexibility is partially achieved by allowing arbitrary code to run during unpickling, based on what’s stored in the pickle, which is a security flaw – someone could in theory send you a maliciously crafted pickle which deletes your home directory upon unpickling. So only unpickle data from sources you trust.
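To illustrate the "faithfully preserve" part, here is a small sketch that round-trips a few objects JSON would either alter or reject. It uses pickle.dumps() and pickle.loads(), which work on bytes in memory rather than on a file; the dictionary contents are made up for the example.

```python
import pickle

original = {
    "triple": (1, 2, 3),               # a tuple, which JSON would turn into a list
    "adjectives": {"dismal", "vain"},  # a set, which JSON cannot store at all
    "number": 2 + 3j,                  # a complex number
}

# dumps() returns a bytes object; loads() rebuilds the Python objects from it.
restored = pickle.loads(pickle.dumps(original))

print(restored == original)
print(type(restored["triple"]).__name__, type(restored["adjectives"]).__name__)
```

Both the values and their types survive the round trip.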
Another disadvantage is that the format is specific to Python, and it can even change between versions of the language: there are several versions of the pickle protocol, as it gets improved over time. This means that if you want to share pickled objects across Python versions, you need to be careful about which protocol version you use in order to retain backwards compatibility.
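The protocol version can be requested explicitly via the protocol argument. In the sketch below, protocol 2 is picked purely for illustration; which version you actually need depends on the oldest Python you want to support.

```python
import pickle

data = {"name": "John Doe", "age": 31}

# These two constants depend on the Python version you are running.
print("default protocol:", pickle.DEFAULT_PROTOCOL)
print("highest protocol:", pickle.HIGHEST_PROTOCOL)

# Ask for an older protocol explicitly, so that older Pythons can read the result.
# Protocol 2 is an arbitrary choice for this example.
old_style = pickle.dumps(data, protocol=2)
new_style = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# Loading needs no protocol argument: the version is detected from the data.
print(pickle.loads(old_style) == pickle.loads(new_style) == data)
```

pickle.dump() accepts the same protocol argument when writing to a file.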
Like with JSON, if you want to pickle multiple objects, you still have to store them separately, or put them all in a dict manually and store the dict.
4.4.4. The dill library¶
dill is pickle on steroids. For any less proficient English speakers reading and/or those without a background in English literature, the name is a pun on dill pickle, or maybe even A Dill Pickle.
Its biggest advantage is that it can pickle entire sessions (kind of like R does, if you’re familiar with R), so you don’t have to specify objects one by one. Before we demonstrate this though, we’ll need to get rid of our CSV reader object from before, because it turns out that’s one of the objects which can’t even be dill-pickled. (You don’t have to remember this by heart, I certainly don’t – Python will complain loudly if you try to pickle something that can’t be pickled.)
import dill
del reader
dill.dump_session("session.pickle")
dill.load_session("session.pickle")
# cleanup
%rm session.pickle
Apart from being able to pickle entire sessions, dill also extends pickling support to more types of objects. You can just switch your code to using dill.dump() / dill.load() instead of pickle.dump() / pickle.load() and you get this extended support for free.
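One example of such an object is an anonymous (lambda) function. The sketch below first shows plain pickle refusing it, then tries the same with dill. Since dill may not be installed, the import is guarded and the dill part is simply skipped if it is missing; the double function is made up for the example.

```python
import pickle

double = lambda x: x * 2  # an anonymous function, which plain pickle refuses

# Plain pickle stores functions by name, and a lambda has no usable name.
try:
    pickle.dumps(double)
    pickle_ok = True
except (pickle.PicklingError, AttributeError, TypeError):
    pickle_ok = False
print("pickle can handle it:", pickle_ok)

# dill is a third-party package, so guard the import.
try:
    import dill
except ImportError:
    dill = None

if dill is not None:
    # dill.dumps() / dill.loads() mirror the pickle functions of the same name.
    restored = dill.loads(dill.dumps(double))
    print("dill round trip:", restored(21))
```

The same security caveat applies as with pickle: only load dill pickles from sources you trust.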
The disadvantages are the same as for pickle, plus it’s not bundled with Python, so it’s another dependency you have to install separately.