4. Getting your data into Python

4.1. Overview

This chapter is about interacting with data on your computer’s disk – mostly loading it into Python, but also writing it back, and storing it for quick reloading later.

4.2. Plain text files

A plain text file is a file where every single bit contributes towards representing plain text content. What fits into the concept of ‘plain text’ is actually surprisingly fuzzy – it depends on which characters are available in the character set you’re using, and character sets are just arbitrary human conventions. Historically, there were computer systems that didn’t make case distinctions, so upper vs. lower case was not something you could express using just plain text.

Nowadays, Unicode is making inroads into text properties which were previously the domain of rich text, assigning dedicated codepoints to 𝘪𝘵𝘢𝘭𝘪𝘤𝘴 or ˢᵘᵖᵉʳˢᶜʳᶦᵖᵗ, which allow you, among other things, to work around the lack of rich text capabilities in Twitter or Facebook posts (although that’s obviously not what they were originally intended for). But the further you go along the axis of content vs. appearance, the more you encounter properties which are unlikely to ever make it into plain text, like the ability to specify which particular font the text should be displayed with.

from unicodedata import name

for char in "𝘪𝘵𝘢𝘭𝘪𝘤𝘴ˢᵘᵖᵉʳˢᶜʳᶦᵖᵗ":
    print(char, name(char), sep="\t")
𝘪	MATHEMATICAL SANS-SERIF ITALIC SMALL I
𝘵	MATHEMATICAL SANS-SERIF ITALIC SMALL T
𝘢	MATHEMATICAL SANS-SERIF ITALIC SMALL A
𝘭	MATHEMATICAL SANS-SERIF ITALIC SMALL L
𝘪	MATHEMATICAL SANS-SERIF ITALIC SMALL I
𝘤	MATHEMATICAL SANS-SERIF ITALIC SMALL C
𝘴	MATHEMATICAL SANS-SERIF ITALIC SMALL S
ˢ	MODIFIER LETTER SMALL S
ᵘ	MODIFIER LETTER SMALL U
ᵖ	MODIFIER LETTER SMALL P
ᵉ	MODIFIER LETTER SMALL E
ʳ	MODIFIER LETTER SMALL R
ˢ	MODIFIER LETTER SMALL S
ᶜ	MODIFIER LETTER SMALL C
ʳ	MODIFIER LETTER SMALL R
ᶦ	MODIFIER LETTER SMALL CAPITAL I
ᵖ	MODIFIER LETTER SMALL P
ᵗ	MODIFIER LETTER SMALL T

Plain text files can be opened with the built-in open() function. As we’ve seen in our discussion of Unicode, your safest bet with a plain text file in an unknown encoding is to start by trying to open it as UTF-8 – not because that will always work, but precisely because it won’t: if the file isn’t actually UTF-8, you’re likely to get an error, which tells you something’s fishy and prevents you from corrupting your data. Also, UTF-8 is becoming more and more prevalent, so chances are good the file actually is UTF-8, in which case you’re golden. The text.txt file below, though, isn’t.

with open("data/text.txt", encoding="utf-8") as file:
    print(file.read())
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
/tmp/ipykernel_11227/3031899450.py in <module>
      1 with open("data/text.txt", encoding="utf-8") as file:
----> 2     print(file.read())

~/.local/pyenv/versions/3.9.6/lib/python3.9/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 1-2: invalid continuation byte

OK, that didn’t work. Now you can start thinking about alternatives. Do you know anything about the language the text in the file is supposed to be in? If it’s a Western European language, then the encoding might be latin1 or cp1252; if it’s Central European, then perhaps latin2 or cp1250 (beyond those two regions, I’m afraid I can’t give any tips, you’ll have to rely on web search engines).
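If you have several plausible candidates, the trial-and-error process can be wrapped in a small helper function. This is just a sketch, with a made-up candidate list; crucially, an encoding ‘succeeding’ only means the bytes were decodable, not that the result is the intended text, so a human still needs to eyeball the output:

```python
def guess_encoding(path, candidates=("utf-8", "cp1250", "latin2")):
    """Return the first candidate encoding that decodes the file without error,
    along with the decoded text."""
    for enc in candidates:
        try:
            with open(path, encoding=enc) as file:
                return enc, file.read()
        except UnicodeDecodeError:
            # these bytes can't be this encoding, try the next one
            continue
    raise ValueError("none of the candidate encodings worked")
```

Putting UTF-8 first in the candidate list implements the advice above: it’s the pickiest option, so if it succeeds, it’s also the most likely to be right.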

Let’s say we know text.txt is in Czech, a Central European language. We might want to try cp1250, because that’s still (ugh) the default encoding for text files created under the Czech version of Windows, and that’s where the file might have come from.

with open("data/text.txt", encoding="cp1250") as file:
    print(file.read())
Běľela Magda kaňonem, sráľela banány rádiem.
Helenka líbala na kolínko robustního cestáře France.
Pan čáp ztratil čepičku, měla barvu barvičku...?

The tricky thing here is that everything seemingly went fine, no error occurred. That’s because cp1250 is an 8-bit fixed-width encoding, and virtually any sequence of bytes can be interpreted as cp1250-encoded text – even if it was originally intended as something else. If a Czech speaker tries to read this output, some words will be off. They may be hard to spot, because many 8-bit encodings share a lot of their byte–character mappings, fully within the ASCII range [0; 128), and to a lesser extent beyond it. This makes it easier to still read at least parts of the file if you’ve guessed the encoding wrong, but it also makes it harder to realize that you have.
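This mojibake effect is easy to reproduce in a couple of lines: encode a Czech string as latin2, then decode the same bytes as cp1250. The ASCII characters survive untouched, and so do many of the accented ones, but ž comes out as ľ:

```python
czech = "Běžela"                # ž is stored as byte 0xBE in latin2
data = czech.encode("latin2")   # the bytes a latin2 file would contain
print(data.decode("latin2"))    # Běžela – decoded with the right encoding
print(data.decode("cp1250"))    # Běľela – byte 0xBE means ľ in cp1250
```

Both decodes succeed without errors; only a reader who knows the language can tell which one is correct.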

The 1990s were the age of many different 8-bit encodings for different groups of languages. It sucked. Be glad you live in the age of UTF-8.

Finally, if we try latin2, that Czech person helpfully standing behind you can tell you that now, the result looks alright. If you compare with the previous attempt, you may see that there are indeed minute differences.

with open("data/text.txt", encoding="latin2") as file:
    print(file.read())
Běžela Magda kaňonem, srážela banány rádiem.
Helenka líbala na kolínko robustního cestáře France.
Pan čáp ztratil čepičku, měla barvu barvičku...?

Sometimes, you may encounter a plain text file that you know should be in UTF-8, but it has become corrupted for some reason – maybe it was spliced together with another file in a different encoding, or maybe some evil spirit flipped a few bits here and there. In that case, the contents won’t look right in any one encoding, but you might still want to at least have a glimpse at what they contain. To make Python soldier through in spite of encountering invalid byte sequences (i.e. encoding errors), you can specify an error handler. A preview of a few of the more common options is given below; for a full list, refer to the Error Handlers section of the codecs module documentation.

with open("data/text.txt", encoding="utf-8", errors="replace") as file:
    print(file.read())
B�ela Magda ka�onem, sr�ela ban�ny r�diem.
Helenka l�bala na kol�nko robustn�ho cest��e France.
Pan ��p ztratil �epi�ku, m�la barvu barvi�ku...?
with open("data/text.txt", encoding="utf-8", errors="ignore") as file:
    print(file.read())
Bela Magda kaonem, srela banny rdiem.
Helenka lbala na kolnko robustnho ceste France.
Pan p ztratil epiku, mla barvu barviku...?
with open("data/text.txt", encoding="utf-8", errors="backslashreplace") as file:
    print(file.read())
B\xec\xbeela Magda ka\xf2onem, sr\xe1\xbeela ban\xe1ny r\xe1diem.
Helenka l\xedbala na kol\xednko robustn\xedho cest\xe1\xf8e France.
Pan \xe8\xe1p ztratil \xe8epi\xe8ku, m\xecla barvu barvi\xe8ku...?

Incidentally, this allows us to take a peek at what a rich text file looks like under the hood. If you open text.doc in a word processor, you’ll see that it has the same textual content as text.txt. But opening it as plain text, we see that the file definitely contains some more stuff besides that, most of which can’t be interpreted as text – there are a lot of those replacement characters (�), and even more null bytes (\x00, i.e. bytes consisting of all zeros). Some of it is metadata, e.g. who wrote the text and when, or possibly encoding information, so that the word processor doesn’t have to take blind guesses at which encoding the text is stored in, like we just had to.

with open("data/text.doc", encoding="utf-8", errors="replace") as file:
    txt = file.read()
txt[:10]
'��\x11\u0871\x1a�\x00\x00\x00\x00'
# this is probably author metadata?
index = txt.find("Lukeš")
txt[index-2:index+15]
'\x00\x00Lukeš, David\x00\x00\x00'
# and this is probably the part which corresponds to "Magda kaňonem" in
# our plain text file
index = txt.find("M")
txt[index:index+25]
'M\x00a\x00g\x00d\x00a\x00 \x00k\x00a\x00H\x01o\x00n\x00e\x00m'

4.3. Manipulating tabular data

Data often comes in tabular format – Excel files, CSV files and the like. The easiest and most convenient way to load this type of data into Python and manipulate it is using the pandas library. While manual modifications of individual cells will probably always be more ergonomic in a spreadsheet editor like Excel, any kind of mass data manipulation is what pandas… excels at, if you’ll pardon the pun. It does so in a clean and efficient way, mostly without even breaking a sweat, and best of all, you can always retrace your steps and check whether you’ve made a mistake at some point because you’re writing them down as Python commands instead of clicking around in a graphical user interface. Not to mention that this makes it trivial to apply the same series of processing steps to similarly shaped data, once you’ve figured them out.

4.3.1. The pandas library

Let’s fire up pandas and take a look at how it can help you to slice and dice tables in Python. Or should I say DataFrames, because that’s what pandas calls them, acknowledging inspiration from R’s trademark data structure.

import pandas as pd
df = pd.read_excel("data/concordance_corpus.xlsx")
type(df)
pandas.core.frame.DataFrame
df
newadvent.org by Theodor Mommsen , undertook its monumental publication , the Corpus/Corpus/NP Inscriptionum Latinarum " , it sent a flattering letter to
0 newadvent.org Mommsen . The latter 's numerous collaborators... Corpus/Corpus/NP " , among them Edwin Bormann , the noted autho...
1 newadvent.org ) , concerning the preparatory work for the ab... Corpus/Corpus/NP Inscriptionum " , which appeared in the monthl...
2 freerepublic.com , or arrest the MD assembly , or suspend habeus corpus/corpus/NN , or invade sovereign states . ​He​ ​did​ ​n't...
3 hinduism.co.za corporeal being in the fullness of time , assu... corpus/corpus/NN . It arises and perishes in due order . And
4 lg-legal.com relation to the takeover offer announced today... Corpora/corpus/NNS valuing Archipelago at £ 340 m. 27 Sep 2013 MORE
... ... ... ... ...
848 blogs.ulster.ac.uk University . For example , Corpus Christi Coll... Corpus/Corpus/NP Irish Missal , 12 th century ( MS 282 )
849 ginnysaustin.com set by a youth cutting my teeth on tacos in Corpus/Corpus/NP Christi ( RIP Elva ’s ) . Now , I
850 patriotsquestion911.com of the Geneva Conventions , and the repeal of ... corpus/corpus/NN ( a fundamental point of law that has been with
851 biblicalarchaeology.org Keel recently pointed out , even in the highly... Corpus/Corpus/NP of West Semitic Stamp Seals published by Nahma...
852 rrojasdatabank.info developing ones , were interpreted through the... corpus/corpus/NN of knowledge recognized as Keynesian economics...

853 rows × 4 columns

pd.read_excel(
    "data/concordance_corpus.xlsx",
    header=None
)
0 1 2 3
0 newadvent.org by Theodor Mommsen , undertook its monumental ... Corpus/Corpus/NP Inscriptionum Latinarum " , it sent a flatteri...
1 newadvent.org Mommsen . The latter 's numerous collaborators... Corpus/Corpus/NP " , among them Edwin Bormann , the noted autho...
2 newadvent.org ) , concerning the preparatory work for the ab... Corpus/Corpus/NP Inscriptionum " , which appeared in the monthl...
3 freerepublic.com , or arrest the MD assembly , or suspend habeus corpus/corpus/NN , or invade sovereign states . ​He​ ​did​ ​n't...
4 hinduism.co.za corporeal being in the fullness of time , assu... corpus/corpus/NN . It arises and perishes in due order . And
... ... ... ... ...
849 blogs.ulster.ac.uk University . For example , Corpus Christi Coll... Corpus/Corpus/NP Irish Missal , 12 th century ( MS 282 )
850 ginnysaustin.com set by a youth cutting my teeth on tacos in Corpus/Corpus/NP Christi ( RIP Elva ’s ) . Now , I
851 patriotsquestion911.com of the Geneva Conventions , and the repeal of ... corpus/corpus/NN ( a fundamental point of law that has been with
852 biblicalarchaeology.org Keel recently pointed out , even in the highly... Corpus/Corpus/NP of West Semitic Stamp Seals published by Nahma...
853 rrojasdatabank.info developing ones , were interpreted through the... corpus/corpus/NN of knowledge recognized as Keynesian economics...

854 rows × 4 columns

pd.read_excel?
df = pd.read_excel(
    "data/concordance_corpus.xlsx",
    header=None,
    names=["domain", "left", "kwic", "right"]
)
df
domain left kwic right
0 newadvent.org by Theodor Mommsen , undertook its monumental ... Corpus/Corpus/NP Inscriptionum Latinarum " , it sent a flatteri...
1 newadvent.org Mommsen . The latter 's numerous collaborators... Corpus/Corpus/NP " , among them Edwin Bormann , the noted autho...
2 newadvent.org ) , concerning the preparatory work for the ab... Corpus/Corpus/NP Inscriptionum " , which appeared in the monthl...
3 freerepublic.com , or arrest the MD assembly , or suspend habeus corpus/corpus/NN , or invade sovereign states . ​He​ ​did​ ​n't...
4 hinduism.co.za corporeal being in the fullness of time , assu... corpus/corpus/NN . It arises and perishes in due order . And
... ... ... ... ...
849 blogs.ulster.ac.uk University . For example , Corpus Christi Coll... Corpus/Corpus/NP Irish Missal , 12 th century ( MS 282 )
850 ginnysaustin.com set by a youth cutting my teeth on tacos in Corpus/Corpus/NP Christi ( RIP Elva ’s ) . Now , I
851 patriotsquestion911.com of the Geneva Conventions , and the repeal of ... corpus/corpus/NN ( a fundamental point of law that has been with
852 biblicalarchaeology.org Keel recently pointed out , even in the highly... Corpus/Corpus/NP of West Semitic Stamp Seals published by Nahma...
853 rrojasdatabank.info developing ones , were interpreted through the... corpus/corpus/NN of knowledge recognized as Keynesian economics...

854 rows × 4 columns

df["domain"]
0                newadvent.org
1                newadvent.org
2                newadvent.org
3             freerepublic.com
4               hinduism.co.za
                ...           
849         blogs.ulster.ac.uk
850           ginnysaustin.com
851    patriotsquestion911.com
852    biblicalarchaeology.org
853        rrojasdatabank.info
Name: domain, Length: 854, dtype: object
len(set(df["domain"]))
234
df[["domain", "kwic"]]
domain kwic
0 newadvent.org Corpus/Corpus/NP
1 newadvent.org Corpus/Corpus/NP
2 newadvent.org Corpus/Corpus/NP
3 freerepublic.com corpus/corpus/NN
4 hinduism.co.za corpus/corpus/NN
... ... ...
849 blogs.ulster.ac.uk Corpus/Corpus/NP
850 ginnysaustin.com Corpus/Corpus/NP
851 patriotsquestion911.com corpus/corpus/NN
852 biblicalarchaeology.org Corpus/Corpus/NP
853 rrojasdatabank.info corpus/corpus/NN

854 rows × 2 columns

df.loc[1:3, "domain":"kwic"]
domain left kwic
1 newadvent.org Mommsen . The latter 's numerous collaborators... Corpus/Corpus/NP
2 newadvent.org ) , concerning the preparatory work for the ab... Corpus/Corpus/NP
3 freerepublic.com , or arrest the MD assembly , or suspend habeus corpus/corpus/NN
df.loc[1:3, ["domain", "kwic"]]
domain kwic
1 newadvent.org Corpus/Corpus/NP
2 newadvent.org Corpus/Corpus/NP
3 freerepublic.com corpus/corpus/NN
df.kwic
0      Corpus/Corpus/NP
1      Corpus/Corpus/NP
2      Corpus/Corpus/NP
3      corpus/corpus/NN
4      corpus/corpus/NN
             ...       
849    Corpus/Corpus/NP
850    Corpus/Corpus/NP
851    corpus/corpus/NN
852    Corpus/Corpus/NP
853    corpus/corpus/NN
Name: kwic, Length: 854, dtype: object
df.domain == "newadvent.org"
0       True
1       True
2       True
3      False
4      False
       ...  
849    False
850    False
851    False
852    False
853    False
Name: domain, Length: 854, dtype: bool
df.loc[1:3]
domain left kwic right
1 newadvent.org Mommsen . The latter 's numerous collaborators... Corpus/Corpus/NP " , among them Edwin Bormann , the noted autho...
2 newadvent.org ) , concerning the preparatory work for the ab... Corpus/Corpus/NP Inscriptionum " , which appeared in the monthl...
3 freerepublic.com , or arrest the MD assembly , or suspend habeus corpus/corpus/NN , or invade sovereign states . ​He​ ​did​ ​n't...
df.loc[df.domain == "newadvent.org"]
domain left kwic right
0 newadvent.org by Theodor Mommsen , undertook its monumental ... Corpus/Corpus/NP Inscriptionum Latinarum " , it sent a flatteri...
1 newadvent.org Mommsen . The latter 's numerous collaborators... Corpus/Corpus/NP " , among them Edwin Bormann , the noted autho...
2 newadvent.org ) , concerning the preparatory work for the ab... Corpus/Corpus/NP Inscriptionum " , which appeared in the monthl...
155 newadvent.org made their way into the earlier editions of the " Corpus/Corpus/NP Juris Civilis " , the " Corpus Juris Canonici "
156 newadvent.org of the " Corpus Juris Civilis " , the " Corpus/Corpus/NP Juris Canonici " , and the large collections o...
308 newadvent.org deceased : e.g. QUI LEGIS , ORA PRO EO ( Corpus/Corpus/NP Inscript . Lat . , X , n. 3312 )
df.loc
<pandas.core.indexing._LocIndexer at 0x7f08f08b1d10>
df.query("domain == 'newadvent.org'")
domain left kwic right
0 newadvent.org by Theodor Mommsen , undertook its monumental ... Corpus/Corpus/NP Inscriptionum Latinarum " , it sent a flatteri...
1 newadvent.org Mommsen . The latter 's numerous collaborators... Corpus/Corpus/NP " , among them Edwin Bormann , the noted autho...
2 newadvent.org ) , concerning the preparatory work for the ab... Corpus/Corpus/NP Inscriptionum " , which appeared in the monthl...
155 newadvent.org made their way into the earlier editions of the " Corpus/Corpus/NP Juris Civilis " , the " Corpus Juris Canonici "
156 newadvent.org of the " Corpus Juris Civilis " , the " Corpus/Corpus/NP Juris Canonici " , and the large collections o...
308 newadvent.org deceased : e.g. QUI LEGIS , ORA PRO EO ( Corpus/Corpus/NP Inscript . Lat . , X , n. 3312 )
df.domain.str.endswith(".org")
0       True
1       True
2       True
3      False
4      False
       ...  
849    False
850    False
851    False
852     True
853    False
Name: domain, Length: 854, dtype: bool
df.loc[df.domain.str.endswith(".org")]
domain left kwic right
0 newadvent.org by Theodor Mommsen , undertook its monumental ... Corpus/Corpus/NP Inscriptionum Latinarum " , it sent a flatteri...
1 newadvent.org Mommsen . The latter 's numerous collaborators... Corpus/Corpus/NP " , among them Edwin Bormann , the noted autho...
2 newadvent.org ) , concerning the preparatory work for the ab... Corpus/Corpus/NP Inscriptionum " , which appeared in the monthl...
23 thefullwiki.org the fundamental rule from which can be deduced... corpus/corpus/NN of libertarian theory . [ 13 ] ” W. D.
25 schools-wikipedia.org of the thousands of extant inscriptions are pu... Corpus/Corpus/NP of Indus Seals and Inscriptions ( 1987 , 1991 ,
... ... ... ... ...
835 obdurodon.org goal is to provide scholars with a large but m... corpus/corpus/NN of data for comparative study of calendar trad...
838 realizingrights.org adopt the principles and rights outlined in th... corpus/corpus/NN as a set of public health ethics then we are
843 unispal.un.org , venerated by Christians , Jews and Moslems , a corpus/corpus/NN separatum , which should be under internationa...
844 home.igc.org heads of state Blocking the small arms treaty ... corpus/corpus/NN to prisoners on Guantanamo and other secret pr...
852 biblicalarchaeology.org Keel recently pointed out , even in the highly... Corpus/Corpus/NP of West Semitic Stamp Seals published by Nahma...

161 rows × 4 columns

df.kwic.str.split("/")
0      [Corpus, Corpus, NP]
1      [Corpus, Corpus, NP]
2      [Corpus, Corpus, NP]
3      [corpus, corpus, NN]
4      [corpus, corpus, NN]
               ...         
849    [Corpus, Corpus, NP]
850    [Corpus, Corpus, NP]
851    [corpus, corpus, NN]
852    [Corpus, Corpus, NP]
853    [corpus, corpus, NN]
Name: kwic, Length: 854, dtype: object
for row in df.kwic.str.split("/"):
    print(row)
    break
['Corpus', 'Corpus', 'NP']
df.kwic.str.split("/", expand=True)
0 1 2
0 Corpus Corpus NP
1 Corpus Corpus NP
2 Corpus Corpus NP
3 corpus corpus NN
4 corpus corpus NN
... ... ... ...
849 Corpus Corpus NP
850 Corpus Corpus NP
851 corpus corpus NN
852 Corpus Corpus NP
853 corpus corpus NN

854 rows × 3 columns

df[["kwic", "domain"]]
kwic domain
0 Corpus/Corpus/NP newadvent.org
1 Corpus/Corpus/NP newadvent.org
2 Corpus/Corpus/NP newadvent.org
3 corpus/corpus/NN freerepublic.com
4 corpus/corpus/NN hinduism.co.za
... ... ...
849 Corpus/Corpus/NP blogs.ulster.ac.uk
850 Corpus/Corpus/NP ginnysaustin.com
851 corpus/corpus/NN patriotsquestion911.com
852 Corpus/Corpus/NP biblicalarchaeology.org
853 corpus/corpus/NN rrojasdatabank.info

854 rows × 2 columns

df[["word", "lemma", "tag"]]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_11227/2526799359.py in <module>
----> 1 df[["word", "lemma", "tag"]]

~/repos/v4py.github.io/.venv/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3462             if is_iterator(key):
   3463                 key = list(key)
-> 3464             indexer = self.loc._get_listlike_indexer(key, axis=1)[1]
   3465 
   3466         # take() does not accept boolean indexers

~/repos/v4py.github.io/.venv/lib/python3.9/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis)
   1312             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1313 
-> 1314         self._validate_read_indexer(keyarr, indexer, axis)
   1315 
   1316         if needs_i8_conversion(ax.dtype) or isinstance(

~/repos/v4py.github.io/.venv/lib/python3.9/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis)
   1372                 if use_interval_msg:
   1373                     key = list(key)
-> 1374                 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1375 
   1376             not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())

KeyError: "None of [Index(['word', 'lemma', 'tag'], dtype='object')] are in the [columns]"
df[["word", "lemma", "tag"]] = df.kwic.str.split("/", expand=True)
df
domain left kwic right word lemma tag
0 newadvent.org by Theodor Mommsen , undertook its monumental ... Corpus/Corpus/NP Inscriptionum Latinarum " , it sent a flatteri... Corpus Corpus NP
1 newadvent.org Mommsen . The latter 's numerous collaborators... Corpus/Corpus/NP " , among them Edwin Bormann , the noted autho... Corpus Corpus NP
2 newadvent.org ) , concerning the preparatory work for the ab... Corpus/Corpus/NP Inscriptionum " , which appeared in the monthl... Corpus Corpus NP
3 freerepublic.com , or arrest the MD assembly , or suspend habeus corpus/corpus/NN , or invade sovereign states . ​He​ ​did​ ​n't... corpus corpus NN
4 hinduism.co.za corporeal being in the fullness of time , assu... corpus/corpus/NN . It arises and perishes in due order . And corpus corpus NN
... ... ... ... ... ... ... ...
849 blogs.ulster.ac.uk University . For example , Corpus Christi Coll... Corpus/Corpus/NP Irish Missal , 12 th century ( MS 282 ) Corpus Corpus NP
850 ginnysaustin.com set by a youth cutting my teeth on tacos in Corpus/Corpus/NP Christi ( RIP Elva ’s ) . Now , I Corpus Corpus NP
851 patriotsquestion911.com of the Geneva Conventions , and the repeal of ... corpus/corpus/NN ( a fundamental point of law that has been with corpus corpus NN
852 biblicalarchaeology.org Keel recently pointed out , even in the highly... Corpus/Corpus/NP of West Semitic Stamp Seals published by Nahma... Corpus Corpus NP
853 rrojasdatabank.info developing ones , were interpreted through the... corpus/corpus/NN of knowledge recognized as Keynesian economics... corpus corpus NN

854 rows × 7 columns

df2 = df[["domain", "left", "word", "lemma", "tag", "right"]]
df2
domain left word lemma tag right
0 newadvent.org by Theodor Mommsen , undertook its monumental ... Corpus Corpus NP Inscriptionum Latinarum " , it sent a flatteri...
1 newadvent.org Mommsen . The latter 's numerous collaborators... Corpus Corpus NP " , among them Edwin Bormann , the noted autho...
2 newadvent.org ) , concerning the preparatory work for the ab... Corpus Corpus NP Inscriptionum " , which appeared in the monthl...
3 freerepublic.com , or arrest the MD assembly , or suspend habeus corpus corpus NN , or invade sovereign states . ​He​ ​did​ ​n't...
4 hinduism.co.za corporeal being in the fullness of time , assu... corpus corpus NN . It arises and perishes in due order . And
... ... ... ... ... ... ...
849 blogs.ulster.ac.uk University . For example , Corpus Christi Coll... Corpus Corpus NP Irish Missal , 12 th century ( MS 282 )
850 ginnysaustin.com set by a youth cutting my teeth on tacos in Corpus Corpus NP Christi ( RIP Elva ’s ) . Now , I
851 patriotsquestion911.com of the Geneva Conventions , and the repeal of ... corpus corpus NN ( a fundamental point of law that has been with
852 biblicalarchaeology.org Keel recently pointed out , even in the highly... Corpus Corpus NP of West Semitic Stamp Seals published by Nahma...
853 rrojasdatabank.info developing ones , were interpreted through the... corpus corpus NN of knowledge recognized as Keynesian economics...

854 rows × 6 columns

df.plot?
df["domain"]
0                newadvent.org
1                newadvent.org
2                newadvent.org
3             freerepublic.com
4               hinduism.co.za
                ...           
849         blogs.ulster.ac.uk
850           ginnysaustin.com
851    patriotsquestion911.com
852    biblicalarchaeology.org
853        rrojasdatabank.info
Name: domain, Length: 854, dtype: object
df["domain"].value_counts()
ucrel.lancs.ac.uk            123
nltk.googlecode.com           90
quinndombrowski.com           65
medicolegal.tripod.com        28
cass.lancs.ac.uk              21
                            ... 
publiusonline.com              1
brainethics.wordpress.com      1
epluribusmedia.org             1
news.art.fsu.edu               1
rrojasdatabank.info            1
Name: domain, Length: 234, dtype: int64
df["domain"].value_counts().plot(kind="bar")
<AxesSubplot:>
_images/data_50_1.png
df["domain"].value_counts().head().plot(kind="bar")
<AxesSubplot:>
_images/data_51_1.png
pd.read_csv("data/frequencies_intensifiers.csv")
1;"completely different";"672";""
0 2;"entirely different";"386";""
1 3;"entirely new";"334";""
2 4;"totally different";"282";""
3 5;"completely new";"261";""
4 6;"completely free";"147";""
... ...
2844 2846;"entirely relative";"1";""
2845 2847;"completely brown";"1";""
2846 2848;"completely literate";"1";""
2847 2849;"totally boneheaded";"1";""
2848 2850;"totally housebound";"1";""

2849 rows × 1 columns

pd.read_csv?
pd.read_csv(
    "data/frequencies_intensifiers.csv",
    sep=";",
    header=None,
    names=["rank", "collocation", "freq", "empty"]
)
rank collocation freq empty
0 1 completely different 672 NaN
1 2 entirely different 386 NaN
2 3 entirely new 334 NaN
3 4 totally different 282 NaN
4 5 completely new 261 NaN
... ... ... ... ...
2845 2846 entirely relative 1 NaN
2846 2847 completely brown 1 NaN
2847 2848 completely literate 1 NaN
2848 2849 totally boneheaded 1 NaN
2849 2850 totally housebound 1 NaN

2850 rows × 4 columns

If you want to go in the other direction and store DataFrames on disk as Excel spreadsheets, CSV files or many other formats, check out the methods starting with to_* on DataFrame objects (remember you can use Tab in JupyterLab to bring up a completion menu if you start typing just to_).
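As a minimal sketch (with a made-up table and file name), a round trip through CSV might look like this:

```python
import pandas as pd

df = pd.DataFrame({"word": ["corpus", "corpora"], "freq": [672, 386]})
# index=False leaves out the numeric row index, which we'd otherwise
# get back as an extra unnamed column when re-reading the file
df.to_csv("wordlist.csv", index=False)
df2 = pd.read_csv("wordlist.csv")
```

The to_excel() method works analogously, though it may require an additional engine package (such as openpyxl) to be installed.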

pandas is an impressively featureful library and we’ve barely scratched the surface of what you can do with it. It has also fairly recently hit 1.0 status, which brought a lot of polish to its website and documentation. Previously, the documentation, though extensive and complete, was somewhat hard to navigate; this has gotten much better. For more information, I suggest reviewing the library’s Getting started section, which contains a list of practical tasks you might want to use pandas for, along with recipes telling you how to achieve them.

4.3.2. The csv module in the standard library

The Python standard library also comes with a csv module. This is useful when you don’t have the option to install pandas, or when you don’t really need to work with the CSV file as a table – you just need to pull out some values and put them in a dictionary, for instance. In that case, pandas may be an unnecessarily heavy dependency (as a Swiss Army knife for data manipulation, it’s pretty hefty), not to mention that loading the entire table into memory at once might be wasteful, especially if it’s large and you just want one or two columns.
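For instance, loading a two-column CSV into a dictionary only takes a few lines. This sketch uses an in-memory file with made-up word–frequency data so that it’s self-contained:

```python
import csv
import io

# stand-in for a file on disk: a hypothetical two-column CSV (word;frequency)
data = io.StringIO("corpus;672\ncorpora;386\n")
# csv.reader yields one list of strings per row; we convert the
# frequency column to int as we build the dictionary
freqs = {word: int(freq) for word, freq in csv.reader(data, delimiter=";")}
```

Only the two columns we care about ever get stored, row by row, no matter how long the file is.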

Let’s first take a peek at the contents of a CSV file. As mentioned, it’s basically just a plain text file. This particular CSV file contains a frequency distribution of intensifier + adverb combinations.

with open("data/frequencies_intensifiers.csv", encoding="utf-8") as file:
    for line in file:
        print(line)
        break
"1";"completely  different";"672";""
import csv
with open("data/frequencies_intensifiers.csv", encoding="utf-8") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
        break
['1;"completely  different";"672";""']
row
['1;"completely  different";"672";""']
len(row)
1
csv.reader?
with open("data/frequencies_intensifiers.csv", encoding="utf-8") as file:
    reader = csv.reader(file, delimiter=";")
    for row in reader:
        print(row)
        break
['1', 'completely  different', '672', '']
len(row)
4
int("4")
4
float("4.5")
4.5
row[1]
'completely  different'
row[1].split()
['completely', 'different']
adv, adj = row[1].split()

Let’s divide up the adjectives into sets based on which intensifiers they co-occur with.

completely = set()
totally = set()
entirely = set()
utterly = set()

with open("data/frequencies_intensifiers.csv", encoding="utf-8") as file:
    reader = csv.reader(file, delimiter=";")
    for row in reader:
        adv, adj = row[1].split()
        if adv == "completely":
            completely.add(adj)
        elif adv == "totally":
            totally.add(adj)
        elif adv == "entirely":
            entirely.add(adj)
        elif adv == "utterly":
            utterly.add(adj)
        else:
            print("unexpected adverb:", adv)

By using set operations, we can now figure out which intensifiers tend (not) to co-occur with which adjectives.

not_utterly = completely | totally | entirely
# or: not_utterly = completely.union(totally).union(entirely)
utterly - not_utterly
# or: utterly.difference(not_utterly)
{'appalling',
 'arresting',
 'bigoted',
 'blasphemous',
 'blasphomous',
 'cack-handed',
 'childish',
 'clever',
 'conditional',
 'contemptible',
 'crippling',
 'damnable',
 'defective',
 'degrading',
 'depraved',
 'despicable',
 'detestable',
 'devoted',
 'digestible',
 'disappointing',
 'disconsolate',
 'disgraceful',
 'dismal',
 'disposable',
 'distasteful',
 'distinguished',
 'disturbing',
 'downcast',
 'dreadful',
 'earthy',
 'effeminate',
 'endless',
 'energetic',
 'enraged',
 'exquisite',
 'extraordinary',
 'fatuous',
 'fluid',
 'forgettable',
 'fragile',
 'geeky',
 'graceful',
 'gracious',
 'guileless',
 'heartbreaking',
 'hopeful',
 'hysterical',
 'impassioned',
 'important',
 'impoverished',
 'indistinguishable',
 'indivisible',
 'infrequent',
 'lawless',
 'materialistic',
 'minuscule',
 'non-essential',
 'nugatory',
 'obscene',
 'obtuse',
 'one-of-a-kind',
 'orthodox',
 'paltry',
 'partisan',
 'passé',
 'pathological',
 'perverse',
 'phenomenal',
 'praiseworthy',
 'prepared',
 'preposterous',
 'profound',
 'reductionistic',
 'remarkable',
 'remiss',
 'riveting',
 'ruthless',
 'self-destructive',
 'shambolic',
 'shameful',
 'similar',
 'simple',
 'simplistic',
 'situational',
 'smug',
 'spectacular',
 'splendid',
 'squalid',
 'stifling',
 'stirring',
 'stubborn',
 'stunning',
 'subdued',
 'therapeutic',
 'thoughtless',
 'totalitarian',
 'un-australian',
 'uncared',
 'uncollectable',
 'undesirable',
 'unexceptional',
 'unfathomable',
 'ungodly',
 'uninteresting',
 'unprovable',
 'unreconcilable',
 'unscientific',
 'unscrupulous',
 'unspooky',
 'unsuspecting',
 'unwinnable',
 'vain',
 'vastated',
 'vulgar'}

4.4. Storing objects on disk and reloading them

Some values take a long time to compute, so you don’t want to have to compute them again and again each time you close and reopen JupyterLab. Instead, you’d like to compute them once, store them somewhere, and reload them (almost) instantaneously whenever you need.

4.4.1. The %store magic function

The %store magic function can store individual variables; it’s perhaps the simplest option, but you don’t really control where the object gets stored.

a = 2
%store a
Stored 'a' (int)
a = 3
a
3

Reload the stored value of the a variable:

%store -r a
a
2

For more information, consult %store’s docstring.

?%store

4.4.2. The json standard library module

The standard library json module can also be used for this purpose.

import json

JSON serialization actually results in plain text, which is nice and mostly human readable, if it’s pretty-printed. It looks close to how the same data structure is written down in Python (can you spot the differences?).

person = {
    "name": "John Doe",
    "age": 31,
    "interests": ["Python", "linguistics"],
    "single": False,
    "pet": None,
}
print(json.dumps(person, indent=2))
{
  "name": "John Doe",
  "age": 31,
  "interests": [
    "Python",
    "linguistics"
  ],
  "single": false,
  "pet": null
}

When writing to disk, you get to pick where the object is stored, at the expense of having to type more than with %store. The "w" argument sets the mode of the open file to write (the "r" mode for reading is the default, so we didn’t need to set it explicitly before when reading files).

with open("person.json", "w") as file:
    json.dump(person, file, indent=2)
%cat person.json
{
  "name": "John Doe",
  "age": 31,
  "interests": [
    "Python",
    "linguistics"
  ],
  "single": false,
  "pet": null
}
with open("person.json") as file:
    data = json.load(file)
data
{'name': 'John Doe',
 'age': 31,
 'interests': ['Python', 'linguistics'],
 'single': False,
 'pet': None}
# cleanup
%rm person.json

If you want to store multiple objects like this, just put them in a dictionary and json.dump the whole thing.
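For instance – a minimal sketch, with made-up variable names – bundling two results into one dict and round-tripping it through a file:

```python
import json
import os

# Hypothetical example: two computed results we want to keep together.
results = {
    "counts": {"cat": 3, "dog": 5},
    "labels": ["noun", "verb"],
}
with open("results.json", "w") as file:
    json.dump(results, file, indent=2)

# Reloading gives us the whole bundle back in one go.
with open("results.json") as file:
    reloaded = json.load(file)
print(reloaded["counts"]["dog"])

# cleanup
os.remove("results.json")
```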

JSON was created as an interchange format, which comes with both advantages and drawbacks. The advantage is that it can easily be loaded into different languages and tools – almost every programming language now has an easily accessible JSON library. The main drawback is that it only works for storing a limited range of types: dicts, lists, strings, numbers, Boolean values (True and False) and None. As an interchange format, it makes sense that it sticks to the lowest common denominator of what’s available in some form in almost every programming language – otherwise, there couldn’t be much interchange. Some additional types can be stored as JSON, but only by being converted to one of the above – e.g. if you store a tuple as JSON and load it back, it will come out as a list.

json.loads(json.dumps((1, 2, 3)))
[1, 2, 3]
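
Types outside that repertoire aren’t silently converted, though – trying to dump a set, for example, fails with a TypeError:

```python
import json

try:
    json.dumps({1, 2, 3})  # sets have no JSON equivalent
except TypeError as error:
    print("TypeError:", error)
```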

4.4.3. The pickle standard library module

Pickling objects works very similarly to dumping them as JSON; just make sure to open the file for writing in binary mode ("wb"):

import pickle

with open("person.pickle", "wb") as file:
    pickle.dump(person, file)

… and for reading as well ("rb"):

with open("person.pickle", "rb") as file:
    data = pickle.load(file)
data
{'name': 'John Doe',
 'age': 31,
 'interests': ['Python', 'linguistics'],
 'single': False,
 'pet': None}

This is because pickling doesn’t use a plain text format, but a custom binary format.

%cat person.pickle
��W}�(�name��John Doe��age�K�	interests�]�(�Python��linguistics�e�single���pet�Nu.
# cleanup
%rm person.pickle

The advantage of pickle is that unlike JSON, it can faithfully preserve a much wider spectrum of Python objects (most things you’re likely to need in normal practice). This flexibility is partially achieved by allowing arbitrary code to run during unpickling, based on what’s stored in the pickle, which is a security risk – someone could in theory send you a maliciously crafted pickle which deletes your home directory upon unpickling. So only unpickle data from sources you trust.

Another disadvantage is that the format is specific to Python, and it can even change between versions of the language: there are several versions of the pickle protocol, as it gets improved over time. This means that if you want to share pickled objects across Python versions, you need to be careful about which protocol version you use in order to retain backwards compatibility.
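
As a minimal sketch: pickle exposes the default protocol and the highest protocol your Python understands as module constants, and dump() / dumps() accept a protocol argument if you need to target an older Python:

```python
import pickle

# What this Python defaults to vs. the newest protocol it understands:
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

# Explicitly request protocol 2, an old protocol that even very old
# Pythons can read, at the cost of a less compact/efficient encoding:
blob = pickle.dumps({"a": 1}, protocol=2)
print(pickle.loads(blob))
```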

Like with JSON, if you want to pickle multiple objects, you still have to store them separately, or put them all in a dict manually and store the dict.

4.4.4. The dill library

dill is pickle on steroids. For less proficient English speakers and/or those without a background in English literature, the name is a pun on dill pickle, or maybe even A Dill Pickle. Its biggest advantage is that it can pickle entire sessions (kind of like R does, if you’re familiar with R), so you don’t have to specify objects one by one. Before we demonstrate this though, we’ll need to get rid of our CSV reader object from before, because it turns out that’s one of the objects which can’t even be dill-pickled. (You don’t have to remember this by heart, I certainly don’t – Python will complain loudly if you try to pickle something that can’t be pickled.)

import dill

del reader
dill.dump_session("session.pickle")
dill.load_session("session.pickle")
# cleanup
%rm session.pickle

Apart from the advantages of being able to pickle entire sessions, dill also extends pickling support to more types of objects. You can just switch your code to using dill.dump() / dill.load() instead of pickle.dump() / pickle.load() and you get this extended support for free.
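
For instance – assuming you have dill installed – a lambda is one such object: plain pickle refuses it, while dill round-trips it without complaint:

```python
import pickle
import dill

double = lambda x: 2 * x

# Plain pickle can't serialize a lambda...
try:
    pickle.dumps(double)
    print("pickle succeeded")
except Exception as error:
    print("pickle failed:", type(error).__name__)

# ...but dill can:
restored = dill.loads(dill.dumps(double))
print(restored(21))
```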

The disadvantages are the same as for pickle, plus it’s not bundled with Python, so it’s another dependency you have to install separately.