Description
There have been some discussions about this before, but as far as I know no plan or decision has been made.
My understanding is (feel free to disagree):
1. If we were starting pandas today, we would only use Arrow as storage for DataFrame columns / Series
2. After all the great work that has been done on building new pandas types based on PyArrow, we are not considering other Arrow implementations
3. Based on 1 and 2, we are moving towards pandas based on PyArrow, and the main question is what the transition path is
@jbrockmendel commented this, and I think many others share this point of view, based on past interactions:
> There's a path to making it feasible to use PyArrow types by default, but that path probably takes multiple major release cycles. Doing it for 3.0 would be, frankly, insane.
It would be interesting to know exactly why. I guess it's mainly for two reasons:
- Finishing the PyArrow types and making operations with them as reliable and fast as with the original pandas types
- Giving users time to adapt
I don't know the exact state of the PyArrow types, and how often users will face problems if using them instead of the original ones. From my perception, there aren't any major efforts to make them better at this point. So, I'm unsure if the situation in that regard will be very different if we make the PyArrow types the default ones tomorrow, or if we make them the default ones in two years.
My understanding is that the only person who is paid consistently to work on pandas is Matt, and he's doing an amazing job at keeping the project going, reviewing most of the PRs, keeping the CI in good shape... But I don't think he, or anyone else, is able to put hours into developing new things the way it used to be. For reference, see the GitHub chart of pandas activity (commits) since pandas 2.0.
So, in my opinion, the existing problems with PyArrow will only start to be addressed significantly whenever the PyArrow types become the default ones.
So, in my opinion our two main options are:
- We move forward with the PyArrow transition. pandas 3.0 will surely not be the best pandas version ever if we start using PyArrow types, but pandas 3.1 will be much better, and pandas 3.2 may be as good as pandas 2 in reliability and speed, while much closer to what we would like pandas to be. Of course, not all users are ready for pandas 3.0 with Arrow types. They can surely pin to `pandas=2` until pandas 3 is more mature and they have made the required changes to their code. We can surely add a flag `pandas.options.mode.use_arrow = False` that reverts the new default to the old status quo (see the sketch after this list), so users can actually move to pandas 3.0 but stay with the old types until we (pandas) and they are ready to adopt the new default types. The transition from Python 2 to 3 (which is surely an example of what not to do) took more than 10 years; I don't think in our case we need as much. And if there is interest (aka money) we can also support the pandas 2 series for as long as needed.
- The other option is to continue with the transition to the new nullable types, which my understanding is we implemented because PyArrow didn't exist at the time. Continue to put our limited resources into them, making users adapt their code to a new temporary status quo rather than the final one we envision, staying in this transition period and delaying the move to PyArrow by, I assume, around 6 years (Brock mentioned "multiple major release cycles", so I assume something like 3, at a rate of one major release every 2 years).
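For illustration, this is roughly what the escape hatch proposed in the first option could look like from user code. Note that `use_arrow` is just the option name suggested above; it does not exist in any released pandas, so this is a sketch of the proposal, not working code against pandas 2.x:

```python
import pandas as pd

# Hypothetical option proposed above -- not implemented in any pandas release.
# With the flag off, pandas 3.x would keep constructing NumPy-backed columns:
pd.options.mode.use_arrow = False

df = pd.DataFrame({"a": [1, 2, None]})
print(df.dtypes)  # expected under the proposal: float64 (legacy), not int64[pyarrow]
```

Users preferring to pin instead would simply put `pandas>=2,<3` in their requirements until they are ready to migrate.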
It would be great to know other people's thoughts and ideal plans, and see what makes more sense. But to me personally, based on the above, moving to PyArrow in pandas 3 doesn't sound more insane than moving in pandas 6.
Activity
jbrockmendel commented on Jun 9, 2025
Moving users to pyarrow types by default in 3.0 would be insane because #32265 has not been addressed. Getting that out of the way was the point of PDEP-16, which you and Simon are oddly hostile toward.
mroeschke commented on Jun 9, 2025
Since I helped draft PDEP-10, I would like a world where the Arrow type system with NA semantics would be the only pandas type system.
Secondarily, I would like a world where pandas has the Arrow type system with NA semantics and the legacy (NaN) Numpy type system which is completely independent from the Arrow type system (e.g. a user cannot mix the two in any way)
I agree with Brock that null semantics (NA vs NaN) must inevitably be discussed when adopting a new type system.
I've also generally been concerned about the growing complexity of PDEP-14 with the configurability of "storage" and NA semantics (like having to define a comparison hierarchy #60639). While I understand that we've been very cautious about compatibility with existing types, I don't think this is maintainable or clearer for users in the long run.
My ideal roadplan would be (a sketch of the proposed option follows this list):
1. Implement `set_option("type_system", "legacy" | "pyarrow")`, which configures the "default" type system as either NaN semantics with NumPy or NA semantics with Arrow (with the default being `"legacy"`)
2. Deprecate `"legacy"` as the "default" type system
3. Enforce `"pyarrow"` as the new "default" type system
jbrockmendel commented on Jun 10, 2025
So with this deprecation enforced, NaN in a constructor, setitem, or csv would be treated as distinct from pd.NA? If so, I’m on board but I expect that to be a painful deprecation.
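To make the question concrete, here is a sketch of the behavior change being discussed (the semantics in the comments are assumptions; exact current behavior varies across pandas versions, which is part of what PDEP-16 is settling):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, pd.NA], dtype="float64[pyarrow]")

# Today: NaN arriving through a constructor is generally folded into missing,
# so s.isna() would be [False, True, True].
# With the deprecation enforced: NaN would be kept as a real float value,
# distinct from pd.NA, so s.isna() would be [False, False, True].
print(s.isna())
```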
datapythonista commented on Jun 10, 2025
To be clear, I'm not hostile towards PDEP-16 at all. I think it's important for pandas to have clear and simple missing value handling, and while incomplete, I think the PDEP and the discussions have been very useful and insightful. And I really appreciate that work.
I just don't see PDEP-16 as a blocker for moving to pyarrow, even less so if implemented at the same time. Also, I wouldn't spend time on our own nullable dtypes; I would implement PDEP-16 only for the PyArrow types.
I couldn't agree more on the points Matt mentions for pandas 3.x. Personally I would change the default earlier. Sounds like pandas 4.x is mostly about showing a warning to users until they manually change the default, which I personally wouldn't do. But that's a minor point, I like the general idea.
mroeschke commented on Jun 10, 2025
Correct. Yeah, I am optimistic that most of the deprecation would hopefully go into `ArrowExtensionArray`, but of course there are probably a lot of one-off places that need addressing.

There are a lot of references to type systems in that PDEP that I think would warrant some re-imagining given which type systems are favored. As mentioned before, (unfortunately) I think type systems and missing value semantics need to be discussed together.
datapythonista commented on Jun 10, 2025
I created a separate issue #61620 for the option mentioned in the description and in Matt's roadmap, since I think that's somewhat independent, and not blocked by PDEP-15, by this issue, or by anything else that I know of.
I fully agree with this. But I'm not sure I fully understand why PDEP-16 must be a blocker for defaulting to PyArrow types.
For users already using PyArrow, they'll have to follow the deprecation transition if they are using `NaN` as missing. To me, this can be started in 3.0, 4.0 or whenever, and probably the earlier the better, so no more code is written with the undesired behavior.

For users not yet using PyArrow, I do understand that it's better to force the move when the PyArrow dtypes behave as we think they should behave. But I'm not convinced this should be a blocker, even less so if the deprecation of the special treatment of `NaN` is also implemented in 3.0, or in the version where the PyArrow types become the default. Maybe I'm wrong, but if you are getting data from parquet, csv... you are not immediately affected by these missing value semantics problems. You need to create data manually (rare for most professional use cases imho), or you need to be explicitly setting `NaN` in existing data (also rare in my personal experience). Am I missing something that makes this point important enough to be a blocker for moving to PyArrow and for stopping investment in types that we plan to deprecate? If you have an example of commonly used code that is problematic for what I'm proposing, that could surely convince me and help identify the optimal transition path.
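A small sketch of the two cases described above. The exact NaN handling noted in the comments is an assumption, since that is precisely what is under discussion:

```python
import io
import numpy as np
import pandas as pd

# Case 1: data coming from I/O. Missing fields arrive as proper nulls (pd.NA)
# either way, so these users are not immediately affected:
df = pd.read_csv(io.StringIO("a,b\n1.5,2\n,4"), dtype_backend="pyarrow")
print(df["a"].isna())  # [False, True]

# Case 2: explicitly setting NaN into existing data -- the rarer case where
# the NaN-vs-NA semantics actually matter:
df.loc[0, "a"] = np.nan  # is this missing (NA) or a real NaN? That is the
                         # open question this thread is about.
```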
simonjayhawkins commented on Jun 10, 2025
@mroeschke a couple of questions:
could the ArrowExtensionArray be described as a true extension array, i.e. using the extension array interface for 3rd party EAs?
does the ArrowExtensionArray rely on any Cython code for the implementation to work?
simonjayhawkins commented on Jun 10, 2025
I have always understood the ArrowExtensionArray and ArrowDtype to be experimental (no PDEP and no roadmap item), for the purpose of evaluating PyArrow types in pandas, to potentially eventually use them as a backend for the pandas nullable dtypes.
So I can sort of understand why the ArrowDtypes are no longer pure and have allowed pandas semantics to creep into the API.
As experimental dtypes why do they need any deprecation at all? Where do we promote these types as pandas recommended types?
simonjayhawkins commented on Jun 10, 2025
Just to be clear about my previous statement: there is a roadmap item, but I've never interpreted it to cover adopting the current ArrowDtype system throughout pandas.
simonjayhawkins commented on Jun 10, 2025
@mroeschke given the current level of funding and interest for the development of the pandas nullable dtypes, and the proportion of core devs that now appear to favor embracing the ArrowDtype instead, I fear that this may be, at this time, the most pragmatic approach to keeping pandas development moving forward, as it does seem to have slowed of late.

I'm not necessarily comfortable deprecating so much prior effort, but then the same could have been said about Panel many years ago, and I'm not sure anyone misses it today. If the community wants nullable dtypes by default, they may be less interested in the implementation details, or even to some extent the performance.

If the situation changes and there is more funding and contributions in the future (and we have released a Windows 8 in the meantime), then we could perhaps bring back the pandas nullable types.
WillAyd commented on Jun 10, 2025
The goals of this and PDEP-13 were pretty aligned: prefer PyArrow to build out our default type system where applicable, and fill in the gaps using whatever we have as a fallback. That PDEP conversation stalled; not sure if it's worth reviving or if this issue is going to tackle a smaller subset of the problem, but in any case I definitely support this.
mroeschke commented on Jun 10, 2025
@datapythonista I just would like some agreement that defaulting to PyArrow types also means adopting PDEP-16's proposal of (only) NA semantics for these types when making the change, for a consistent story, but I suppose they don't need to be done at the same time.
@simonjayhawkins yes, it purely uses `ExtensionArray` and `ExtensionDtype` to implement functionality.

It does not, and ideally it won't. When interacting with other parts of Cython in pandas, e.g. `groupby`, we've created hooks to convert to numpy first.

This is a valid point; technically it shouldn't require deprecation.
While our docs don't necessarily state a recommended type, anecdotally it has felt like in the past year or two there's been quite a number of conference talks, blog posts, and books that have "celebrated" the newer Arrow types in pandas. Although attention != usage, it may still warrant some care if changing behavior IMO.
An alternative to completely deprecating and removing the pandas NumPy nullable types is to spin them off into their own repository & package and treat them like any other 3rd party `ExtensionArray` library, for users that still want to use them.
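For reference, a minimal sketch of what such a spun-off package would involve: pandas' public extension interface is enough to register a dtype and array that plug into Series/DataFrame. The names `MaskedIntDtype`/`MaskedIntArray` are made up for illustration; this is not the actual masked-array implementation, just a toy showing the 3rd-party mechanism:

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import (
    ExtensionArray,
    ExtensionDtype,
    register_extension_dtype,
    take,
)


@register_extension_dtype
class MaskedIntDtype(ExtensionDtype):
    # Hypothetical dtype for illustration only.
    name = "masked_int"
    type = int
    na_value = pd.NA

    @classmethod
    def construct_array_type(cls):
        return MaskedIntArray


class MaskedIntArray(ExtensionArray):
    """Toy array storing values in an object ndarray, with pd.NA as missing."""

    def __init__(self, values):
        self._data = np.asarray(values, dtype=object)

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        return cls(scalars)

    @classmethod
    def _from_factorized(cls, values, original):
        return cls(values)

    def __getitem__(self, item):
        result = self._data[item]
        if isinstance(item, (int, np.integer)):
            return result
        return type(self)(result)

    def __len__(self):
        return len(self._data)

    @property
    def dtype(self):
        return MaskedIntDtype()

    @property
    def nbytes(self):
        return self._data.nbytes

    def isna(self):
        return np.array([x is pd.NA or x is None for x in self._data], dtype=bool)

    def take(self, indices, *, allow_fill=False, fill_value=None):
        if allow_fill and fill_value is None:
            fill_value = self.dtype.na_value
        return type(self)(
            take(self._data, indices, allow_fill=allow_fill, fill_value=fill_value)
        )

    def copy(self):
        return type(self)(self._data.copy())

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([arr._data for arr in to_concat]))


# Usage: once registered, the dtype works by name like any other:
s = pd.Series([1, 2, None], dtype="masked_int")
print(s.isna())
```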