7. Parting words

7.1. Where to go from here

Now that you’ve made it to the end, perhaps you’re wondering where to go from here. Programming is perhaps the easiest skill to learn using only the internet, so there are many options. There’s also always room for improvement and for learning new things; it’s a lifelong journey. In lieu of parting words, here are some tips on where to get started, at least in the world of Python.

Thank you for reading this far, and if you have any suggestions for improvement, additions, or just spotted a few typos, please report them on GitHub!

7.2. Books

As a general introduction to Python programming with a focus on linguistic applications, I’ve already recommended Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit by Steven Bird, Ewan Klein and Edward Loper, and I’m going to recommend it again. It’s a great resource, all the more useful since it’s freely available online. Rather than just providing recipes for the latest and greatest fancy NLP tools, treating them as black boxes, it focuses on understanding algorithms and concepts and on improving your programming skills. This means that it often spends time on less cutting-edge methods, which are, however, conceptually simpler and thus have better teaching value. Depending on your immediate needs, this may be a strength or a weakness, but in the long run, I would argue that every programming linguist should spend some time honing their programming skills instead of always blindly following how-to-style recipes, because even copy-pasting black-box code can go seriously wrong if you don’t have a larger picture of what’s going on.

On a perhaps more practical note, I can definitely recommend Jake VanderPlas’s Python Data Science Handbook, another great resource which is also freely available online. This book is not concerned with NLP per se but rather with data analysis, i.e. with what happens after you’ve processed your text data and want to do some statistical modeling or machine learning with it. This has traditionally been the domain of R, especially among linguists, but R is a very idiosyncratic language which encourages the copy-paste, black-box approach: while it sometimes provides pleasant and easy-to-use abstractions (especially in the tidyverse third-party packages), building them yourself or wiring them together can be challenging because the underlying language is not really well designed; edge cases and surprising behaviors abound. Python is much easier to wrap your head around, perhaps because it has always been intended as a more general-purpose programming language, but by the same token, it can sometimes be hard to know which libraries and techniques to use when getting started with data analysis in Python. The Python Data Science Handbook is there to help you with that.

Finally, if you just want a fast-paced overview of Python syntax and features, another great free resource, also by Jake VanderPlas, is A Whirlwind Tour of Python. It’s pretty condensed and expects the reader to be reasonably familiar with programming concepts and terminology, but in exchange, it offers a practical-minded and well-organized reference which you can use to quickly refresh your knowledge of specific areas of Python programming.

7.3. Videos

Unlike conferences in linguistics, programming conferences are often recorded, and professionally produced videos (including presentation slides) are subsequently made available, most often for free. For tutorials and workshops, the materials often remain available long after the conference via sites like GitHub, so you can follow along at your leisure.

Since conferences are popular, there’s potentially a lot of material to watch, and not all of it is great. Some people are better programmers than speakers or teachers, some are good at neither, but there are so many conferences that they can accommodate all of them. Below is a collection of videos that I either consider rare gems of the Python conference circuit, or that are particularly relevant to the subject of analyzing language data, or both.

If you end up searching on your own, I can recommend almost anything by Raymond Hettinger, Ned Batchelder or David Beazley. Their contributions are consistently informative, well-prepared and entertaining at the same time.

7.3.1. Improving your Python chops


7.3.2. Data analysis and NLP


7.3.3. War stories


7.3.4. Nuts and bolts (advanced)

And finally, here are a few more advanced talks which I heartily recommend watching after you’ve spent a little more time with Python.


7.4. Libraries

Finally, let’s take a look at some library recommendations. In Python, there are often multiple libraries available to help you do the same thing, or that at least partially overlap in the domain they’re trying to cover. For a newcomer who has never heard of any of them, it can sometimes be hard to decide which one to use. The purpose of this (admittedly non-exhaustive and biased) list is to familiarize you with some of the more popular and well-designed ones, so that when they come up in your searches, you can lean towards them as a first choice.

Some of these have already come up over the course of the book, some haven’t, and some you probably won’t need until you’ve programmed in Python for a while, so don’t feel like you immediately need to start using every single one of them.

As a reminder, additional libraries are typically installed using the pip command line tool, and command line programs can be run from within JupyterLab by prefixing them with a !. For instance, to install the package pyrsistent:

!pip install pyrsistent
Looking in indexes: https://pypi.org/simple, https://packagecloud.io/akopytov/sysbench/pypi/simple
Requirement already satisfied: pyrsistent in /home/david/repos/v4py.github.io/.venv/lib/python3.9/site-packages (0.18.0)

Or !pip install --user pyrsistent if that fails with some kind of permission error. If you’ve installed Python on your own computer using the Anaconda Distribution, then you can also use the Anaconda Navigator GUI or the conda command line tool.

7.4.1. NLP

  • nltk: the Natural Language Toolkit offers great resources both for learning about NLP and doing it in practice

  • spacy focuses more on applied NLP and offers convenient black-box-type APIs for a variety of practical tasks

  • ufal.morphodita and ufal.udpipe for morphological tagging and syntactic parsing. These are automatically generated wrappers for the C++ libraries morphodita and udpipe, and as such, they might be somewhat hard to use for beginners, though examples are provided. For a more convenient API added on top of these more low-level libraries, see the corpy library.

  • regex is a regular expression library with enhanced Unicode support compared to the standard library re module. The API is the same though, so use re’s documentation to learn how to use regex, and only consult regex’s documentation to learn about its additional features (see the short sketch after this list).

  • language data sometimes comes in XML format. Though Python does have facilities for processing XML in the standard library, the lxml package offers more functionality and robustness. The docs are somewhat old-fashioned, but the tutorial parts are well-written. What can sometimes be a somewhat painful experience is searching through the API reference for specific functions or methods – even after years of intermittent use, it still feels a bit like a maze to me.
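
To give you a taste of regex, here’s a minimal sketch matching words via the Unicode property \p{L} (“any letter”), which the standard library re module doesn’t support; the sample sentence is made up for illustration:

import regex

# \p{L}+ matches runs of Unicode letters, accented ones included,
# without having to spell out character ranges by hand.
words = regex.findall(r"\p{L}+", "Karel IV. založil Univerzitu Karlovu.")
print(words)
['Karel', 'IV', 'založil', 'Univerzitu', 'Karlovu']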

7.4.2. Fetching data from the web

  • requests for fetching individual web pages and interacting with REST APIs (see the short sketch after this list)

  • requests_html for fetching individual web pages and slicing and dicing their HTML content

  • scrapy is a flexible, configurable web crawler, i.e. a tool which can help you fetch large amounts of data from the web without having to specify each page manually as with requests

  • spiderling, a web crawler optimized for creating language corpora. Clever and sophisticated, but the documentation is on the lighter side, so getting it up and running might require some effort.
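
As a minimal sketch of what fetching a page with requests looks like (the URL below is just a placeholder; substitute the page you’re actually after):

import requests

# Fetch a page and raise an exception on HTTP errors (4xx/5xx),
# so that failures don't go unnoticed.
response = requests.get("https://example.com")
response.raise_for_status()
print(response.headers["Content-Type"])
print(response.text[:100])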

7.4.3. Data analysis, machine learning and statistics

  • pandas as the workhorse library for manipulating tabular data, including some basic analyses and visualizations (see the short sketch after this list)

  • scikit-learn for training and applying machine learning models through a beautiful, unified API, and also learning about machine learning

  • statsmodels for conducting statistical tests and statistical data exploration. This package is still evolving and you’re of course still much more likely to find an obscure statistical procedure implemented in R than here, but it shows great promise.

  • numpy and scipy: the foundational libraries most of the Python scientific computing ecosystem is built on. Whenever large quantities of numbers are manipulated in Python, it tends to be done using numpy objects for efficiency, so that’s where to look if you encounter such operations and find them confusing. scipy adds numerical routines in various domains on top of that for convenience, e.g. statistical functions in scipy.stats. The documentation of these packages can be hard to navigate; a useful starting resource, if you have time for a deep dive, is https://scipy-lectures.org/.
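
To give an idea of the basic pandas workflow, here’s a tiny sketch with invented frequency data (nothing here is real corpus data; it just shows the shape of the API):

import pandas as pd

# A toy table: word frequencies in two imaginary corpora.
df = pd.DataFrame({
    "word": ["kočka", "pes", "kočka", "pes"],
    "corpus": ["A", "A", "B", "B"],
    "freq": [10, 7, 3, 12],
})
# Total frequency per word across corpora, highest first.
print(df.groupby("word")["freq"].sum().sort_values(ascending=False))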

7.4.4. Data visualization

  • altair is a newer library which hopefully anticipates the future of data visualization in Python. It tries to provide an intuitive declarative API where you just tell Python what data you want to visualize and which visual cues to use (points, lines, colors…), and Python figures out how to make the plot informative and aesthetically pleasing.

  • matplotlib, Python’s traditional plotting library, battle-tested and versatile (lots of different output formats) but fairly low-level – you often have to tweak plots manually (see the short sketch after this list)

  • seaborn builds on top of matplotlib by offering more appealing default visual styles and easy high-level functions for commonly used plot types
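
And a minimal matplotlib sketch, with made-up numbers, to show what the low-level API looks like in practice:

import matplotlib.pyplot as plt

# Invented data: how many word tokens of each length occur in a text.
lengths = [1, 2, 3, 4, 5, 6]
counts = [10, 25, 30, 15, 8, 2]

fig, ax = plt.subplots()
ax.bar(lengths, counts)
ax.set_xlabel("word length (characters)")
ax.set_ylabel("number of tokens")
plt.show()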

7.4.5. Miscellaneous and advanced

  • pendulum for easier handling of dates and times than with the standard library datetime module. (Dates and times across various locales and timezones are actually really tricky to get right; if you ever need to do so, I strongly advise you to use a library like this one to do the heavy lifting for you.)

  • if you use the terminal, then rich can help you generate rich terminal output, including colors, tables, progress bars, and more

  • larger projects often have a battery of tests which are run automatically to make sure that changes don’t break the code. The easiest way to add tests to your code is via the standard library doctest module (see the sketch at the end of this list), but if you find yourself needing a more featureful solution, I would suggest either pytest, the incumbent go-to solution in this space, or wardpy, a challenger which appeared relatively recently but shows promise.

  • poetry: larger projects also need to keep track of which (versions of) other packages they depend on, so that you can easily recreate the environment they need for running correctly when moving between computers. The traditional and somewhat barebones way of achieving this is via pip and a requirements.txt file. poetry is a much more modern and convenient tool which will gently nudge you to adopt current best practices in this area.

  • trio for asynchronous programming. If you don’t know what that is and you don’t care, that’s perfectly fine. If you’re at least a tiny bit curious, the trio docs will do a great job at teaching you – seriously, they’re probably the best technical documentation I’ve ever read – and also show you how much thought and care goes into designing a polished library in a non-trivial domain.

  • asks is a library inspired by requests which provides an asynchronous API using trio; this can make your program run faster if you’re trying to fetch a lot of resources from many different servers.
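
Coming back to testing for a moment: as a minimal doctest sketch, here’s a toy (and deliberately naive) function whose docstring doubles as a test suite; running the module as a script checks that the interactive examples still produce the documented output:

def plural(word):
    """Return a naive English plural, for demonstration only.

    >>> plural("cat")
    'cats'
    >>> plural("dog")
    'dogs'
    """
    return word + "s"

if __name__ == "__main__":
    # Collect and run all >>> examples in this module's docstrings.
    import doctest
    doctest.testmod()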