{ "cells": [ { "cell_type": "markdown", "id": "2b28d1d2", "metadata": {}, "source": [ "# Parting words\n", "\n", "## Where to go from here\n", "\n", "Now that you've made it to the end, you may be wondering where to go\n", "from here. Programming is perhaps the easiest skill to learn using only\n", "the internet, so there are many options. There's also always room for\n", "improvement and learning new things; it's a lifelong journey. In lieu of\n", "parting words, here are some tips on where to get started, at least in\n", "the world of Python.\n", "\n", "Thank you for reading so far, and if you have any suggestions for\n", "improvements or additions, or have just spotted a few typos, please [report them\n", "on GitHub](https://github.com/v4py/v4py.github.io/issues)!\n", "\n", "## Books\n", "\n", "As a general introduction to Python programming which focuses on\n", "linguistic applications, I've already recommended [*Natural Language\n", "Processing with Python: Analyzing Text with the Natural Language\n", "Toolkit*](https://www.nltk.org/book/) by Steven Bird, Ewan Klein and\n", "Edward Loper, and I'm going to recommend it again. It's a great\n", "resource, all the more useful since it's freely available online. It\n", "doesn't just provide recipes on how to use the latest and greatest fancy\n", "stuff in\n", "[NLP](https://en.wikipedia.org/wiki/Natural_language_processing),\n", "treating the tools as black boxes, but rather focuses on understanding\n", "algorithms and concepts and improving your programming skills. 
This\n", "means that it often spends time on less cutting-edge methods, which are,\n", "however, conceptually simpler and thus have better teaching value.\n", "Depending on what your immediate needs are, this may be a strength or a\n", "weakness, but in the long run, I would argue that every programming\n", "linguist should spend some time honing their programming skills instead\n", "of always blindly following how-to style recipes, because even\n", "copy-pasting black box code can go seriously wrong if you don't have a\n", "larger picture of what's going on.\n", "\n", "On perhaps a more practical note, I can definitely recommend Jake\n", "VanderPlas's [*Python Data Science\n", "Handbook*](https://jakevdp.github.io/PythonDataScienceHandbook/),\n", "another great resource which is also freely available online. This book is\n", "not concerned with NLP per se but rather with data analysis, i.e. with\n", "what happens after you've processed your text data and want to do some\n", "statistical modeling or machine learning with it. This has traditionally\n", "been the domain of [R](https://www.r-project.org/), especially among\n", "linguists, but R is a very idiosyncratic language which encourages the\n", "copy-paste, black box approach: while it sometimes provides pleasant and\n", "easy-to-use abstractions (especially in the\n", "[tidyverse](https://www.tidyverse.org/) third-party packages), building\n", "them yourself or wiring them together can be challenging because the\n", "underlying language is not really well-designed, and edge cases and\n", "surprising behaviors abound. Python is much easier to wrap your head\n", "around, perhaps because it has always been intended as a more\n", "general-purpose programming language, but by the same token, it can\n", "sometimes be hard to know which libraries and techniques to use when\n", "getting started with data analysis in Python. 
The *Python Data Science\n", "Handbook* is there to help you with that.\n", "\n", "Finally, if you just want a fast-paced overview of Python syntax and\n", "features, another great free resource, also by Jake VanderPlas, is [*A\n", "Whirlwind Tour of\n", "Python*](https://jakevdp.github.io/WhirlwindTourOfPython/). It's pretty\n", "condensed and expects the reader to be reasonably familiar with\n", "programming concepts and terminology, but in exchange, it offers a\n", "practical-minded and well-organized reference resource which you can use\n", "to quickly refresh your knowledge of specific areas of Python\n", "programming.\n", "\n", "## Videos\n", "\n", "Unlike conferences in linguistics, programming conferences are often\n", "recorded, and professionally produced videos (including presentation\n", "slides) are subsequently made available, most often for free. For\n", "tutorials and workshops, the materials often remain available long after\n", "the conference via sites like GitHub, so you can follow along at your\n", "leisure.\n", "\n", "Since conferences are popular, there's potentially *a lot* of watching\n", "material, not all of it great. Some people are better programmers than\n", "speakers or teachers, some are good at neither, but there are so many\n", "conferences that they can accommodate all of them. Below is a collection\n", "of videos that I either consider rare gems of the Python conference\n", "circuit, or that are particularly relevant to the subject of analyzing\n", "language data, or both.\n", "\n", "If you end up searching yourself, I can recommend almost anything by\n", "Raymond Hettinger, Ned Batchelder or David Beazley. 
Their\n", "contributions are consistently extremely informative, well-prepared and\n", "entertaining at the same time.\n", "\n", "### Improving your Python chops\n", "\n", "\n", "\n", "\n", "\n", "```{margin}\n", "A great tour of Python's built-in functionality, i.e. stuff that's\n", "always available, without having to load any libraries, and tips and\n", "tricks on how to use it. A great way to top off your Python initiation\n", "and graduate to a proficient beginner.\n", "```\n", "\n", "\n", "\n", "\n", "\n", "```{margin}\n", "This is perhaps the most useful intermediate Python talk ever. It'll\n", "clear up any misconceptions about how variables work in Python that you\n", "might have accumulated on your programming journey so far, and enable\n", "you to work on more complicated and larger pieces of code with more\n", "confidence.\n", "```\n", "\n", "\n", "```{div} full-width\n", "
\n", "```\n", "\n", "### Data analysis and NLP\n", "\n", "\n", "\n", "\n", "\n", "```{margin}\n", "If you're interested in using Python for data analysis, I can recommend\n", "anything by Jake VanderPlas (who wrote the *Python Data Science\n", "Handbook* mentioned earlier). This is an introductory talk which\n", "provides basic orientation in the Python data analysis landscape -- what\n", "tools exist and when to use which. As a keynote, it's somewhat longer\n", "and also provides a bit of historical background on Python, with a bias\n", "for data science applications of course.\n", "```\n", "\n", "\n", "\n", "\n", "\n", "```{margin}\n", "A bit unsure how statistics works, or what it's even good for? This\n", "particular talk may be titled *Statistics for Hackers*, but in reality\n", "it's geared towards anyone with a keen mind who's interested in\n", "statistics but hasn't had extensive formal training in math, which means\n", "they sometimes struggle with a formula-heavy approach. Which often\n", "applies to linguists, including myself. This may also be a good place to\n", "point out that ['hacker'](https://en.wikipedia.org/wiki/Hacker) doesn't\n", "always (and certainly not here) refer to someone who breaks into other\n", "people's computers.\n", "```\n", "\n", "\n", "\n", "\n", "\n", "```{margin}\n", "Visualization is currently a rapidly evolving landscape in Python, and\n", "this talk is about new developments based on the *grammar of graphics*\n", "and *declarative visualization* approach, which was popularized by the R\n", "package [ggplot2](https://ggplot2.tidyverse.org/). 
The first part\n", "teaches you how to think about visualization in general, while the\n", "second introduces the [Altair](https://altair-viz.github.io/) library.\n", "```\n", "\n", "\n", "\n", "\n", "\n", "```{margin}\n", "By contrast, this older talk gives an overview of the Python\n", "visualization landscape including the more traditional and established\n", "Python visualization tools, which many people continue using and which\n", "aren't going away anytime soon. Their advantage is that they're mature,\n", "stable and widely known, so it can be much easier to get help on how to\n", "use them from random people on the internet. A good accompanying\n", "resource for this talk is [Part\n", "4](https://jakevdp.github.io/PythonDataScienceHandbook/04.00-introduction-to-matplotlib.html)\n", "of the *Python Data Science Handbook*.\n", "```\n", "\n", "\n", "\n", "\n", "\n", "```{margin}\n", "An NLP tutorial whose most valuable aspect is that it offers an extended\n", "worked example of data analysis, from collecting raw data to\n", "communicating insights. It gives a very good idea of what this entire\n", "process typically looks like and what pitfalls to look out for.\n", "```\n", "\n", "\n", "```{div} full-width\n", "
\n", "```\n", "\n", "### War stories\n", "\n", "\n", "\n", "\n", "\n", "```{margin}\n", "A real-life story on how Python helped David Beazley to make sense of\n", "large amounts of unknown data in order to prepare an expert testimony in\n", "a legal case. More on the entertaining side than the educational, but it\n", "*will* teach you that Python is a Swiss-army knife for slicing and\n", "dicing data. The *Mission: Impossible* of programming conference talks,\n", "with Python starring as agent Ethan Hunt!\n", "```\n", "\n", "\n", "```{div} full-width\n", "
\n", "```\n", "\n", "### Nuts and bolts (advanced)\n", "\n", "And finally, here are a few more advanced talks which I heartily\n", "recommend watching after you've spent a little more time with Python.\n", "\n", "\n", "\n", "\n", "\n", "```{margin}\n", "Dictionaries are the bread and butter of the Python programmer, and\n", "they're also at the core of how many of the features in the language\n", "work. As such, their implementation has evolved over the years to\n", "incorporate increasingly clever tricks. Learn what they are from the\n", "proverbial horse's mouth, Python core developer Raymond Hettinger, who's\n", "also one of the most consistently entertaining conference speakers I've\n", "seen!\n", "```\n", "\n", "\n", "\n", "\n", "\n", "```{margin}\n", "Another talk by Raymond Hettinger, this time about making the computer\n", "do multiple things at the same time. Spoiler: it's hard to get this\n", "right, and you should probably think twice whether you really need to do\n", "so before you start tinkering with it.\n", "```\n", "\n", "\n", "\n", "\n", "\n", "```{margin}\n", "On a similar topic as the previous one, a talk on how computers can\n", "*pretend* they're doing several things at the same time by quickly\n", "switching between tasks, and on designing a Python library which makes\n", "such programming fairly intuitive and less error-prone. If you've heard\n", "the buzzwords `async` and `await`, they feature prominently.\n", "```\n", "\n", "\n", "```{div} full-width\n", "
\n", "```\n", "\n", "## Libraries\n", "\n", "Finally, let's take a look at some library recommendations. In Python,\n", "there are often multiple libraries available to help you do the same\n", "thing, or that at least partially overlap in the domain they're trying\n", "to cover. For a newcomer, it may sometimes be hard to decide which to\n", "use when they've never heard about any of them. The purpose of this\n", "(admittedly non-exhaustive and biased) list is to familiarize you\n", "with some of the more popular and well-designed ones, so that when they\n", "come up in your searches, you can lean towards them as a first choice.\n", "\n", "Some of these have already come up over the course of the book, some\n", "haven't, and some you probably won't need until you've programmed in\n", "Python for a while, so don't feel like you immediately need to start\n", "using every single one of them.\n", "\n", "As a reminder, additional libraries are typically installed using the\n", "`pip` command line tool, and command line programs can be run from\n", "within JupyterLab by prefixing them with a `!`. For instance, to install\n", "the package [`pyrsistent`](https://pyrsistent.readthedocs.io/):" ] }, { "cell_type": "code", "execution_count": 1, "id": "87f761b2", "metadata": { "tags": [ "output_scroll" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Looking in indexes: https://pypi.org/simple, https://packagecloud.io/akopytov/sysbench/pypi/simple\r\n", "Requirement already satisfied: pyrsistent in /home/david/repos/v4py.github.io/.venv/lib/python3.9/site-packages (0.18.0)\r\n" ] } ], "source": [ "!pip install pyrsistent" ] }, { "cell_type": "markdown", "id": "2d556171", "metadata": {}, "source": [ "Or `!pip install --user pyrsistent` if that fails with some kind of\n", "permission error. 
If you've installed Python on your own computer using\n", "the [Anaconda Distribution](https://www.anaconda.com/distribution/),\n", "then you can also use the [Anaconda\n", "Navigator](https://docs.anaconda.com/anaconda/navigator/) GUI or the\n", "[`conda`](https://docs.conda.io/) command line tool.\n", "\n", "### NLP\n", "\n", "- [`nltk`](https://nltk.org/): the Natural Language Toolkit offers great\n", " resources both for learning about NLP and doing it in practice\n", "- [`spacy`](https://spacy.io/) focuses more on applied NLP and offers\n", " convenient black-box-type APIs for a variety of practical tasks\n", "- [`ufal.morphodita`](https://pypi.org/project/ufal.morphodita/) and\n", " [`ufal.udpipe`](https://pypi.org/project/ufal.udpipe/) for\n", " morphological tagging and syntactic parsing. These are automatically\n", " generated wrappers for the C++ libraries `morphodita` and `udpipe` and\n", " as such, they might be somewhat hard to use for beginners, though\n", " examples are provided. For a more convenient API added on top of these\n", " more low-level libraries, see the\n", " [`corpy`](https://corpy.readthedocs.io/) library.\n", "- [`regex`](https://pypi.org/project/regex/) is a regular expression\n", " library with enhanced Unicode support compared to the standard library\n", " `re` module. The API is the same though, so use [`re`'s\n", " documentation](https://docs.python.org/3/library/re.html) to learn how\n", " to use `regex`, and only consult `regex`'s documentation to learn\n", " about additional features.\n", "- language data sometimes comes in\n", " [XML](https://en.wikipedia.org/wiki/XML) format. Though Python does\n", " have facilities for [processing XML in the standard\n", " library](https://docs.python.org/3.8/library/xml.etree.elementtree.html#tutorial),\n", " the [`lxml`](https://lxml.de/) package offers more functionality and\n", " robustness. The docs are somewhat old-fashioned but the tutorial parts\n", " (e.g. 
[here](https://lxml.de/tutorial.html)) are well-written. What\n", " *can* sometimes be a somewhat painful experience is searching through\n", " the [API reference](https://lxml.de/api/index.html) for specific\n", " functions or methods -- even after years of intermittent use, it still\n", " feels a bit like a maze to me.\n", "\n", "### Fetching data from the web\n", "\n", "- [`requests`](https://requests.readthedocs.io/) for fetching individual\n", " web pages and interacting with [REST APIs](rest)\n", "- [`requests_html`](https://requests.readthedocs.io/projects/requests-html/)\n", " for fetching individual web pages and slicing and dicing their HTML\n", " content\n", "- [`scrapy`](https://scrapy.org/) is a flexible, configurable [web\n", " crawler](https://en.wikipedia.org/wiki/Web_crawler), i.e. a tool which\n", " can help you fetch large amounts of data from the web without having\n", " to specify each page manually as with `requests`\n", "- [`spiderling`](http://corpus.tools/wiki/SpiderLing), a web crawler\n", " optimized for creating language corpora. Clever and sophisticated, but\n", " the documentation is on the lighter side, so getting it up and running\n", " might require some effort.\n", "\n", "### Data analysis, machine learning and statistics\n", "\n", "- [`pandas`](https://pandas.pydata.org/) as the workhorse library for\n", " manipulating tabular data, including some basic analyses and\n", " visualizations\n", "- [`scikit-learn`](https://scikit-learn.org/) for training and applying\n", " machine learning models through a beautiful, unified API, and also\n", " [learning about machine learning](https://youtu.be/HC0J_SPm9co)\n", "- [`statsmodels`](https://www.statsmodels.org/) for conducting\n", " statistical tests and statistical data exploration. 
This package is\n", " still evolving and you're of course still much more likely to find an\n", " obscure statistical procedure implemented in R than here, but it shows\n", " great promise.\n", "- [`numpy`](https://numpy.org/) and\n", " [`scipy`](https://docs.scipy.org/doc/scipy/reference/tutorial/index.html):\n", " the foundational libraries most of the Python scientific computing\n", " ecosystem is built on. Whenever large quantities of numbers are\n", " manipulated in Python, it tends to be done using `numpy` objects for\n", " efficiency, so that's where to look if you encounter such operations\n", " and find them confusing. `scipy` adds numerical routines in various\n", " domains on top of that for convenience, e.g. statistical functions in\n", " [`scipy.stats`](https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html).\n", " The documentation of these packages can be hard to navigate; a useful\n", " starting resource if you have time for a deep dive is\n", " .\n", "\n", "### Data visualization\n", "\n", "- [`altair`](https://altair-viz.github.io/) is a newer library which\n", " hopefully anticipates the future of data visualization in Python. 
It\n", " tries to provide an intuitive declarative API where you just tell\n", " Python what data you want to visualize, using which visual cues\n", " (points, lines, colors...), and Python figures out how to make the\n", " plot informative and aesthetically pleasing.\n", "- [`matplotlib`](https://matplotlib.org/), Python's traditional plotting\n", " library, battle-tested and versatile (lots of different output\n", " formats) but fairly low-level -- you often have to tweak plots\n", " manually\n", "- [`seaborn`](https://seaborn.pydata.org/) builds on top of `matplotlib`\n", " by offering more appealing default visual styles and easy high-level\n", " functions for commonly used plot types\n", "\n", "### Miscellaneous and advanced\n", "\n", "- [`pendulum`](https://pendulum.eustace.io/) for easier handling of\n", " dates and times than with the standard library\n", " [`datetime`](https://docs.python.org/3/library/datetime.html) module.\n", " (Dates and times across various locales and timezones are actually\n", " really tricky to get right; if you ever need to do so, I strongly\n", " advise you to use a library like this one to do the heavy lifting for\n", " you.)\n", "- if you use the terminal, then [`rich`](https://rich.readthedocs.io/)\n", " can help you generate rich terminal output, including colors, tables,\n", " progress bars, and more\n", "- larger projects often have a battery of\n", " [tests](https://en.wikipedia.org/wiki/Software_testing) which are run\n", " automatically to make sure that changes don't break the code. 
The\n", " easiest way to add tests to your code is via the standard library\n", " [`doctest`](https://docs.python.org/3/library/doctest.html) module,\n", " but if you find yourself needing a more featureful solution, I would\n", " suggest either [`pytest`](https://docs.pytest.org/), the incumbent\n", " go-to solution in this space, or [`wardpy`](https://wardpy.com/), a\n", " challenger which appeared relatively recently but shows promise.\n", "- [`poetry`](https://python-poetry.org/): larger projects also need to\n", " keep track of which (versions of) other packages they depend on, so\n", " that you can easily recreate the environment they need for running\n", " correctly when moving between computers. The traditional and somewhat\n", " barebones way of achieving this is via [`pip` and a `requirements.txt`\n", " file](https://pip.pypa.io/en/stable/user_guide/#requirements-files).\n", " `poetry` is a much more modern and convenient tool which will gently\n", " nudge you to adopt current best practices in this area.\n", "- [`trio`](https://trio.readthedocs.io/) for asynchronous programming.\n", " If you don't know what that is and you don't care, that's perfectly\n", " fine. 
If you're at least a tiny bit curious, the `trio` docs will do a\n", " great job at teaching you -- seriously, they're probably the best\n", " technical documentation I've ever read -- and also show you how much\n", " thought and care goes into designing a polished library in a\n", " non-trivial domain.\n", "- [`asks`](https://asks.readthedocs.io/) is a library which is inspired\n", " by `requests` but provides an asynchronous API using `trio`, which can\n", " make your program run faster if you're trying to fetch a lot of\n", " resources from various different servers\n", "\n", "" ] } ], "metadata": { "jupytext": { "formats": "md:myst", "text_representation": { "extension": ".md", "format_name": "myst", "format_version": 0.12, "jupytext_version": "1.6.0" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" }, "source_map": [ 13, 368, 372 ] }, "nbformat": 4, "nbformat_minor": 5 }