Pandas Workout

For the last couple of months I’ve been working through Reuven Lerner’s Pandas Workout book.

Summary

The book covers the basics of pandas in a way that I felt was generally easy to understand and absorb, particularly the first couple of chapters on Series and DataFrames. The latter half of the book moves a bit more quickly than I would have liked. Each exercise is a discussion of the topic, a set of worked steps and questions, and three or so “beyond the exercise” questions to solve without the help of the book. There is a GitHub repo of notebooks containing all of the book code and the “beyond” solutions.

There are two “projects” built into the book, one at the halfway point and one at the end. I like the idea of them, but they were too heavily structured to be satisfying, and I think they would have benefited from less specific questions and a sample solution.

For a book on data analysis, it is inconsistent about discussing the approaches and results in the exercises. Some of the results are well explained, but for many only a technical solution is given, and since all of the data sets are real, there’s considerable nuance in understanding and interpreting the results.

Overall I’m pretty happy with the book. My understanding of the “pandas way” has increased considerably, and there were some good data sets provided (with some exceptions). Juggling family obligations, Christmas/New Year etc, it took me a couple of months to work through it all, which I felt wasn’t too bad.

I bought this book through Apple Books, which I don’t recommend as it prevents you from copying and pasting any content. I try not to copy/paste code, since the whole point is to build up muscle memory and get the dumb syntax mistakes out of my system, but there are a number of long column names in the data sets that I would have loved to copy and paste before VSCode took over and let me autocomplete.

I don’t think I would buy a paper copy of a book like this, since there are lots of links to documentation and examples, but having an eBook platform that was more permissive would have been better. Chalk another one up to the hidden cost of easy purchasing.

Thoughts and Gripes

Sample Data

The good

There were quite a few interesting data sets that needed some interpretation, cleaning, and generally had a pretty realistic level of mess, extra data, etc to deal with. I think the standouts amongst the data were:

NYC Parking violations
NYC Taxi trips
US SAT data (Original data may come from here but there’s a lot there to sift through)

These were messy, had lots of extra data to look at, and many different angles to investigate.

The not so good

The final project data on US universities was not terribly compelling to me, even as an educator, and I found it aggregated away to the point of being quite shallow.

The Titanic data set was, in my opinion, a terrible choice. The fields were barely explained in the book, and looking at other places that the data has been used like in various Kaggle projects, no one seemed to have any good descriptions of most of the data set at all. After some digging around I found this which gave a decent account of the data. I found it really surprising that this data seemed to be so popular without anyone trying to explain what was actually in it! This also led to some pretty broad substitution of missing data that, to me, felt pretty transformative rather than interpretive.

Depth

As I said in the summary, the first couple of chapters were really good, providing what I felt was a solid basis for understanding much of the rest of the book. In terms of overall topics, the book has a good progression from the basics of pandas through cleaning data, analysis, visualisation, and performance. Some are glossed over more than others, but there’s enough to get an idea of what pandas does, and where you might want to reach for other tools like Seaborn.

The depth wasn’t really sufficient in the places where reasoning and interpretation were as important as technical understanding. I think there were several examples throughout the book where cleaning decisions were made without a good discussion of the implications they had for interpreting the results.
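To make that concrete, here is a minimal sketch of the kind of effect I mean. The ages below are made up for illustration and are not taken from any of the book’s data sets, and filling with the mean is just one common choice, not necessarily what the book does in any given exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with some missing values (not from the book's data).
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, np.nan])

# Option 1: drop the missing values entirely.
dropped = ages.dropna()

# Option 2: fill the missing values with the mean of the known ones.
filled = ages.fillna(ages.mean())

# Compare the summary statistics side by side.
summary = pd.DataFrame({
    "dropped": dropped.describe(),
    "mean_filled": filled.describe(),
})
print(summary.loc[["count", "mean", "std"]])
```

The mean is the same either way, but the filled version reports more observations and a noticeably smaller standard deviation, so the data looks less spread out than the known values actually are. That is the sort of consequence I would have liked the book to talk through more often.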

Gripes

To be clear, I quite liked the book and don’t regret buying it or the time invested in going through all of the exercises, but there were some annoyances along the way I could have done without:

Quite a few Beyond the Exercise problems had discrepancies between what the book said and what the code in the repo actually did. Some of this might come from reconciling edits across various versions of the book, but in the cases I noticed, the question, the code, and the results were all in the notebook in the repo.
There were a number of questions where the wording was ambiguous enough that I had to look at either the worked example in the book or the solution in the repo to understand what was being asked. This was frustrating since I wanted to test my understanding of the techniques.
Some of the Beyond the Exercise questions involved techniques not covered in the book, without any links to relevant documentation. The pandas docs aren’t too bad, but they are fairly terse and technical, and if you don’t know what you’re looking for they don’t help much. LLMs like Copilot are more help since they do better at interpreting intent, but they suffer from all the usual LLM issues, like offering up code for older versions, consistently offering the same wrong answers, or providing example data with the code that doesn’t demonstrate the differences being described.
