Reminds me of how thinking in frequencies rather than computing probabilities is easier and can avoid errors (e.g. a positive result from a 99% accurate test does not mean a 99% likelihood of having the disease when the disease has a 1/10,000 prevalence in the population).
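A back-of-the-envelope version in Python, with an assumed 1% false-positive rate for illustration:

```python
# Assumed numbers: 1/10,000 prevalence, 99% sensitivity,
# and a (hypothetical) 1% false-positive rate.
population = 1_000_000
sick = population // 10_000           # 100 people actually have the disease
healthy = population - sick

true_positives = 0.99 * sick          # ~99 sick people test positive
false_positives = 0.01 * healthy      # ~9,999 healthy people test positive

# Chance you're actually sick, given a positive test
ppv = true_positives / (true_positives + false_positives)
print(f"{ppv:.1%}")  # ~1.0%, nowhere near 99%
```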
In short, I think it's hard to strike an appropriate balance between these, but this seems to be a good intro-level book.
Everything it does can be done reasonably well with list comprehensions and objects that support type annotations and runtime type checking (if needed).
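A minimal sketch of the alternative, with a hypothetical Trade record standing in for a row:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trade:  # hypothetical record type; annotations enable static checking
    symbol: str
    price: float
    qty: int

trades = [Trade("AAPL", 190.0, 10), Trade("MSFT", 410.0, 5)]

# "Add a column" -> derive a value per record.
notionals = [t.price * t.qty for t in trades]

# "Filter rows" -> a comprehension with a predicate.
large = [t for t in trades if t.price * t.qty > 1_500]
```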
Pandas code is untestable, unreadable, hard to refactor and impossible to reuse.
Trillions of dollars are wasted every year by people having to rewrite pandas code.
Pandas insists you never use a for loop, so I feel guilty if I ever need a throwaway variable on the way to creating a new column. Sometimes methods are attached to objects, other times they aren't. And if you need to use a function that isn't vectorised, you've got to reach for df.apply anyway, and remember to change the 'axis' too (see the toy example below). Plotting is another thing I can't get my head around. Am I supposed to use Pandas' helpers like df.plot() all the time? Or ditch them and use low-level matplotlib directly? What is idiomatic? I can't find answers to much of this, even with ChatGPT. Worse, I can't seem to build a mental model of what Pandas expects me to do in a given situation.
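A toy illustration of the apply/axis gotcha (names made up):

```python
import pandas as pd

df = pd.DataFrame({"first": ["Ada", "Alan"], "last": ["Lovelace", "Turing"]})

# A non-vectorised function: apply needs axis=1 so each *row* is passed in.
# With the default axis=0 each *column* is passed instead, and this raises.
def label(row: pd.Series) -> str:
    return f"{row['last']}, {row['first']}"

df["label"] = df.apply(label, axis=1)
```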
Pandas has disabused me of the notion that Python syntax is self-explanatory, executable pseudocode. I find it terrible to look at. Matlab was infinitely more enjoyable.
Regarding your plotting question: use seaborn when you can, but you’ll still need to know matplotlib.
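Something like this is the usual pattern (penguins is a sample dataset that ships with seaborn):

```python
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset("penguins")

# seaborn for the high-level plot...
ax = sns.scatterplot(data=df, x="flipper_length_mm", y="body_mass_g")

# ...then drop down to matplotlib for anything seaborn doesn't expose.
ax.set_title("Penguin mass vs. flipper length")
plt.tight_layout()
plt.show()
```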
The thousand-plus data integrity tests I've written in pandas tell a different story...
That said, the polars/narwhals-style API is better than the pandas API for sure. More readable and composable, simpler (no index), and a bit less weird overall.
I see this take somewhat often, and usually with a similar lack of nuance. How do you come to it? In other cases where I've seen it, it's from people who haven't worked in any context where performance or scientific-computing ecosystem interoperability matters, which misses a massive part of the picture. I've struggled to get through to them before. Genuine question.
But the code base I work on has thousands and THOUSANDS of lines of Pandas churning through big data, and I can't remember the last time it led to a bug or error in production.
We use pandas + static schema wrapper + type checker, so you'll have to get exotic to break things.
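For a sketch of the idea, pandera is one library in this space (our actual wrapper differs, and the columns here are hypothetical):

```python
import pandas as pd
import pandera as pa

# Hypothetical schema: column dtypes plus value-level checks.
schema = pa.DataFrameSchema({
    "user_id": pa.Column(int, pa.Check.ge(0)),
    "amount": pa.Column(float, pa.Check.gt(0)),
})

df = pd.DataFrame({"user_id": [1, 2], "amount": [9.99, 4.50]})
validated = schema.validate(df)  # raises a SchemaError on bad data
```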
> Copyright 2016
What other framework would you replace it with?
No, Polars or Spark is not a good answer; those are optimized for data-engineering performance, not for a holistic approach to data science.
Today all serious DS work will ultimately become data engineering work anyway. The time when DS can just fiddle around in notebooks all day has passed.
If you are dealing with huge data sets, you are probably using Spark or something like Dask already where jobs can run in the cloud. If you need speed and efficiency on your local machine, you use NumPy outright. And if you really, really need speed, you rewrite it in C/C++.
Polars is trying to solve an issue that just doesn't exist for the vast majority of users.
This is pretty laughable. Yes, there are some very DS-specific tools that make good use of Pandas, but `to_pandas` in Polars trivially solves this. The fact that Pandas always feels like injecting some weird DSL into existing Python code bases is one of the major reasons I really don't like it.
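For example (toy frame; the pandas-only consumer is whatever library you're stuck with):

```python
import polars as pl

df = pl.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})

# Do the heavy lifting in Polars, then hand off at the boundary
# to any tool that only accepts pandas DataFrames.
pdf = df.to_pandas()
```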
> If you are dealing with huge data sets, you are probably using Spark or something like Dask already where jobs can run in the cloud. If you need speed and efficiency on your local machine, you use NumPy outright. And if you really, really need speed, you rewrite it in C/C++.
Have you used Polars at all? Or, for that matter, written significant Pandas outside of a notebook? The number one benefit of Polars, imho, is that it works using Expressions, which let you trivially compose and reuse fundamental logic when working with data, in a way that plays well with other Python code. This addresses the biggest problem with Pandas, which is that it does not abstract well.
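A minimal sketch of what composing with Expressions looks like (column names made up):

```python
import polars as pl

# Expressions are plain Python values: define them once, reuse them anywhere.
revenue = (pl.col("price") * pl.col("qty")).alias("revenue")
big_orders = pl.col("qty") > 100

df = pl.DataFrame({"price": [9.5, 3.0], "qty": [200, 10]})
result = df.filter(big_orders).with_columns(revenue)
```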
Not to mention that Pandas is a really poor dataframe experience outside of its original use case, which was financial time series. The entire multi-index experience is awful, and I know that either you are calling 'reset_index' multiple times in your Pandas logic or you have bugs.
What? Speed and better nested data support (arrays/JSON) alone are extremely useful to every data scientist.
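For instance, nested list columns are first-class in Polars (toy data; `.list.len()` assumes a reasonably recent Polars version):

```python
import polars as pl

df = pl.DataFrame({"user": ["a", "b"], "tags": [["x", "y"], ["z"]]})

# Operate on the nested values directly, no object-dtype workarounds.
df = df.with_columns(pl.col("tags").list.len().alias("n_tags"))
```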
My productivity skyrocketed after switching from pandas to polars.
Oh yeah? Well in my ivory tower the work stops being serious once it becomes engineering, how do you like that elitism?!
But the last startup I was at tried to take a similar approach to research, was unable to ship a functioning product, and will likely disappear a year from now. FAIR has been largely disbanded in favor of the way more shipping-centric MSL, and the people I know at DeepMind are increasingly finding themselves under pressure to actually produce things.
Since you've been hanging out in an ivory tower, you might be unaware that during the peak DS frenzy (2016-2019) there were companies where data scientists were allowed to live entirely in notebooks and it was someone else's problem to ship their notebooks. Today, if you have that expectation, you won't last long at most companies, if you can even find a job in the first place.
On top of that, I know quite a few people at the major LLM teams and, based on my conversations, all of them are doing pretty serious data engineering work to get things shipped even if they were hired for their modeling expertise. It's honestly hard to even run serious experiments at the scale of modern-day LLMs without being pretty proficient at data-engineering-related tasks.
Can you expand on why Polars isn't optimised for a holistic approach to data science?
It is a curse, I know. I would also choose a better interface. Performance is meh to me; I use SQL if I want to do something at scale that involves row/column data.
Since Pandas lacks Polars' concept of an Expression, it's actually quite challenging to programmatically interact with non-trivial Pandas queries. In Polars the query logic can be entirely independent of the data frame while still referencing specific columns of the data frame. This makes Polars data frames work much more naturally with typical programming abstractions.
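Concretely, something like this (hypothetical columns), where the query logic never touches a frame until the last line:

```python
import polars as pl

# Build query logic programmatically, with no data frame in sight.
metrics = {name: pl.col(name).mean().alias(f"{name}_mean")
           for name in ("price", "qty")}

def summarise(df: pl.DataFrame) -> pl.DataFrame:
    # The same expressions apply to any frame that has those columns.
    return df.select(list(metrics.values()))
```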
Pandas multi-index is a bad idea in nearly all contexts other than its original use case, financial time series (and I'll admit, if you're working with purely financial time series, Pandas feels much better). Sufficiently large Pandas code bases are littered with seemingly arbitrary uses of 'reset_index', there are many situations where multi-index will create bugs, and, most importantly, I've never seen any non-financial scenario where anyone used multi-index to their advantage.
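A toy example of how the MultiIndex sneaks in, and why reset_index gets sprinkled everywhere:

```python
import pandas as pd

df = pd.DataFrame({"region": ["EU", "EU", "US"],
                   "product": ["a", "b", "a"],
                   "sales": [1, 2, 3]})

# Grouping on two keys silently returns a MultiIndexed frame...
grouped = df.groupby(["region", "product"]).sum()

# ...so downstream code calls reset_index() to get plain columns back.
flat = grouped.reset_index()
```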
Finally, Pandas is slow, which is honestly the lowest priority for me personally, but using Polars is so refreshing.
What other data frames have you used? Having used R's native data frames extensively (the way they make use of indexing is so much nicer) in addition to Polars, both are drastically preferable to Pandas. My experience is that most people use Pandas because it has been the only data frame implementation in Python. But personally, I'd rather just not use data frames at all if I'm forced to use Pandas. Could you expand on what you like about Pandas over the other data frame models you've worked with?
I’m actually quite partial to R myself, and I used to use it extensively back when quick analysis was more valuable to my career. Things have probably progressed, but I dropped it in favor of Python because Python can integrate into production systems, whereas R was (and maybe still is) geared towards writing reports. One of the best things to happen recently in data science is the plotnine library, bringing the grammar of graphics to Python, imho.
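A taste of the grammar-of-graphics style in plotnine (toy data):

```python
import pandas as pd
from plotnine import ggplot, aes, geom_point, labs

df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 6]})

# Layered ggplot2-style composition, straight from R's grammar of graphics.
plot = ggplot(df, aes(x="x", y="y")) + geom_point() + labs(title="plotnine example")
plot.save("plotnine_example.png")  # or just evaluate `plot` in a notebook
```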
The fact is that today, if you want career opportunities as a data scientist, you need to be fluent in python.
Finally, as someone who wrote a lot of R pre-tidyverse, I've seen the entire ecosystem radically change over my career.