
Conda-Analytics packages incompatible with latest versions of Pandas and Numpy
Closed, Resolved (Public)

Description

While working on releasing a new version of Wmfdata-Python for T345482, I noticed some dependency problems that will occur if a user upgrades to the latest versions of Pandas and Numpy:

  • Pandas 2.2 increased the minimum required version of Pyarrow from 7.0.0 to 10.0.1, while Conda-Analytics has Pyarrow fixed at 9.0.0. Pyarrow isn't a required dependency of Pandas (although it will become one in Pandas 3.0), but if it is installed, having 9.0.0 causes errors when trying to use Pandas with Parquet files.
  • Pandas 2.0 removed support for casting to the unitless datetime64 dtype, which PySpark tries to do when collecting a datetime field to a Pandas dataframe (causing TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.). PySpark fixes this in either 4.0 (according to the bug tracker) or 3.5 (according to a StackOverflow answer).
  • Numpy 1.24 removed np.bool, which PySpark 3.1.2 still accesses when reading a boolean field via Spark (see the minimal sketch after this list).
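
The NumPy change in particular is easy to reproduce without Spark. A minimal sketch (plain Python, nothing Wikimedia-specific) of what PySpark 3.1.2 runs into on NumPy 1.24 or later:

import numpy as np

try:
    # NumPy 1.24 removed the long-deprecated np.bool alias, so this attribute
    # lookup fails; PySpark 3.1.2 still performs it when handling boolean fields.
    np.bool
except AttributeError as err:
    print(err)  # e.g. "module 'numpy' has no attribute 'bool'. ..."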

One way to handle this would be to pin Pandas and Numpy below these versions, but note that actually pinning versions requires fixing T356231.
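
As an illustration only (this snippet is hypothetical and not part of Wmfdata-Python or Conda-Analytics), the thresholds above could be checked from inside an environment roughly like this, assuming the pins would be pandas < 2.0 and numpy < 1.24:

from importlib.metadata import version
from packaging.version import Version

# Hypothetical check against the thresholds described above; the exact pins
# are an assumption based on this task, not settled values.
thresholds = {
    "pandas": Version("2.0"),   # unitless datetime64 casting removed here
    "numpy": Version("1.24"),   # np.bool alias removed here
}

for name, limit in thresholds.items():
    installed = Version(version(name))
    status = "ok" if installed < limit else "expect breakage with PySpark 3.1.2"
    print(f"{name} {installed}: {status}")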

Event Timeline

Gehel triaged this task as Medium priority. Feb 9 2024, 1:28 PM
Gehel moved this task from Incoming to 2024.02.12 - 2024.03.03 on the Data-Platform-SRE board.
Mayakp.wiki subscribed.

I got the error below when I upgraded to Pandas 2.2.2:

File ~/.conda/envs/2024-02-05T22.35.00_mayakpwiki/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py:732, in DatetimeArray.astype(self, dtype, copy)
    720     raise TypeError(
    721         "Cannot use .astype to convert from timezone-aware dtype to "
    722         "timezone-naive dtype. Use obj.tz_localize(None) or "
    723         "obj.tz_convert('UTC').tz_localize(None) instead."
    724     )
    726 elif (
    727     self.tz is None
    728     and lib.is_np_dtype(dtype, "M")
    729     and dtype != self.dtype
    730     and is_unitless(dtype)
    731 ):
--> 732     raise TypeError(
    733         "Casting to unit-less dtype 'datetime64' is not supported. "
    734         "Pass e.g. 'datetime64[ns]' instead."
    735     )
    737 elif isinstance(dtype, PeriodDtype):
    738     return self.to_period(freq=dtype.freq)

TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.

The solution was to downgrade to Pandas 1.5.3.
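
For reference, the error message itself points at the other possible workaround, casting with an explicit unit. A pandas-only sketch of the difference (in the case above the unit-less cast happens inside PySpark while collecting to a Pandas dataframe, so it isn't something user code can easily change, and downgrading was the practical fix):

import pandas as pd

s = pd.Series(pd.to_datetime(["2024-02-05", "2024-02-06"]))

try:
    s.astype("datetime64")  # unit-less dtype: TypeError on Pandas >= 2.0
except TypeError as err:
    print(err)

# Passing an explicit unit, as the error message suggests, works on both 1.5.x and 2.x.
print(s.astype("datetime64[ns]").dtype)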

Mentioned in SAL (#wikimedia-analytics) [2024-06-26T11:12:10Z] <stevemunene> deploy conda-analytics v 0.0.32 to analytics hadoop worker hosts T356231 T356230

Mentioned in SAL (#wikimedia-analytics) [2024-06-26T11:22:15Z] <stevemunene> deploy conda-analytics v 0.0.32 to analytics hadoop coordinator hosts T356231 T356230

Mentioned in SAL (#wikimedia-analytics) [2024-06-26T11:47:41Z] <stevemunene> deploy conda-analytics v 0.0.32 to analytics airflow hosts T356231 T356230

nshahquinn-wmf lowered the priority of this task from Medium to Low.

I'd like to keep this open, mainly for documentation, since it's still true that we can't use the latest versions of Pandas and Numpy because of the package versions in Conda-Analytics.


If we keep it open, would it be possible to reword the description to reflect the current situation and the desired outcome, even if that is a long-term goal?
My concern is that, now that significant work has been done to enable pinning across clones and to pin the numpy and pandas packages as requested, the next steps in terms of SRE involvement are unclear.

You mention version dependencies between:

  • pandas 2.2 and pyarrow < 10.0.1
  • pandas 2.0 and pyspark < 3.5
  • numpy 1.24 and pyspark = 3.1.2

We already have a ticket to upgrade the production version of Spark (T338057: Upgrade Spark to a version with long-term Iceberg support, and with fixes to support Dumps 2.0), although we have only gone as far as version 3.4.1 in that ticket at the moment.

Perhaps there would be more value in creating specific tickets for upgrading certain packages; then we could more accurately set out the blockers in terms of upgrading the dependencies.

  • Upgrading pandas in conda-analytics to 2.0 or later would depend on upgrading pyspark to 3.5 or later
  • Upgrading pandas in conda-analytics to 2.2 or later would depend on upgrading pyarrow to 10.0.1 or later
  • Upgrading numpy in conda-analytics to 1.24 or later would depend on a fix for the np.bool usage in pyspark 3.1.2, most likely a pyspark upgrade

I feel that this approach would be more likely to allow us to prioritise the software upgrades effectively and keep a steady stream of updates happening to conda-analytics.
You could keep this ticket open for your own team's benefit as a parent ticket and a tracking mechanism for these various upgrades, if that helps.

In addition to updating the comments on the pins, I ended up reformatting conda-environment.yml for readability and copyediting the readme. I hope that's welcome!

Marking this as done on the SRE board in favour of the recently created tasks on upgrading numpy, pyarrow, pandas and pyspark.

Marking this as resolved in favour of the created tasks on upgrading numpy, pyarrow, pandas and pyspark.