
Conda-Analytics packages incompatible with latest versions of Pandas and Numpy
Closed, Resolved (Public)

Description

While working on releasing a new version of Wmfdata-Python for T345482, I noticed some dependency problems that will occur if a user upgrades to the latest versions of Pandas and Numpy:

  • Pandas 2.2 increased the minimum required version of Pyarrow from 7.0.0 to 10.0.1, while Conda-Analytics has Pyarrow fixed at 9.0.0. Pyarrow isn't a required dependency of Pandas (although it will become one in Pandas 3.0), but if it is installed, having 9.0.0 causes errors when trying to use Pandas with Parquet files.
  • Pandas 2.0 removed support for casting to the unitless datetime64 dtype, which PySpark tries to do when collecting a datetime field to a Pandas dataframe (causing TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.). PySpark fixes this in either 4.0 (according to the bug tracker) or 3.5 (according to a StackOverflow answer).
  • Numpy 1.24 removed np.bool, which PySpark 3.1.2 still accesses when reading a boolean field via Spark (see the minimal sketch after this list).
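
The NumPy change in particular is easy to reproduce without Spark. A minimal sketch (plain Python, nothing Wikimedia-specific) of what PySpark 3.1.2 runs into on NumPy 1.24 or later:

import numpy as np

try:
    # NumPy 1.24 removed the long-deprecated np.bool alias, so this attribute
    # lookup fails; PySpark 3.1.2 still performs it when handling boolean fields.
    np.bool
except AttributeError as err:
    print(err)  # e.g. "module 'numpy' has no attribute 'bool'. ..."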

One way to handle this would be to pin Pandas and Numpy below these versions, but note that actually pinning versions requires fixing T356231.
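
As an illustration only (this snippet is hypothetical and not part of Wmfdata-Python or Conda-Analytics), the thresholds above could be checked from inside an environment roughly like this, assuming the pins would be pandas < 2.0 and numpy < 1.24:

from importlib.metadata import version
from packaging.version import Version

# Hypothetical check against the thresholds described above; the exact pins
# are an assumption based on this task, not settled values.
thresholds = {
    "pandas": Version("2.0"),   # unitless datetime64 casting removed here
    "numpy": Version("1.24"),   # np.bool alias removed here
}

for name, limit in thresholds.items():
    installed = Version(version(name))
    status = "ok" if installed < limit else "expect breakage with PySpark 3.1.2"
    print(f"{name} {installed}: {status}")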

Event Timeline

Gehel triaged this task as Medium priority. Feb 9 2024, 1:28 PM
Gehel moved this task from Incoming to 2024.02.12 - 2024.03.03 on the Data-Platform-SRE board.
Mayakp.wiki subscribed.

I got the error below when I upgraded to Pandas 2.2.2:

File ~/.conda/envs/2024-02-05T22.35.00_mayakpwiki/lib/python3.10/site-packages/pandas/core/arrays/datetimes.py:732, in DatetimeArray.astype(self, dtype, copy)
    720     raise TypeError(
    721         "Cannot use .astype to convert from timezone-aware dtype to "
    722         "timezone-naive dtype. Use obj.tz_localize(None) or "
    723         "obj.tz_convert('UTC').tz_localize(None) instead."
    724     )
    726 elif (
    727     self.tz is None
    728     and lib.is_np_dtype(dtype, "M")
    729     and dtype != self.dtype
    730     and is_unitless(dtype)
    731 ):
--> 732     raise TypeError(
    733         "Casting to unit-less dtype 'datetime64' is not supported. "
    734         "Pass e.g. 'datetime64[ns]' instead."
    735     )
    737 elif isinstance(dtype, PeriodDtype):
    738     return self.to_period(freq=dtype.freq)

TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.

The solution was to downgrade to Pandas 1.5.3.
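
For reference, the error message itself points at the other possible workaround, casting with an explicit unit. A pandas-only sketch of the difference (in the case above the unit-less cast happens inside PySpark while collecting to a Pandas dataframe, so it isn't something user code can easily change, and downgrading was the practical fix):

import pandas as pd

s = pd.Series(pd.to_datetime(["2024-02-05", "2024-02-06"]))

try:
    s.astype("datetime64")  # unit-less dtype: TypeError on Pandas >= 2.0
except TypeError as err:
    print(err)

# Passing an explicit unit, as the error message suggests, works on both 1.5.x and 2.x.
print(s.astype("datetime64[ns]").dtype)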

Mentioned in SAL (#wikimedia-analytics) [2024-06-26T11:12:10Z] <stevemunene> deploy conda-analytics v 0.0.32 to analytics hadoop worker hosts T356231 T356230

Mentioned in SAL (#wikimedia-analytics) [2024-06-26T11:22:15Z] <stevemunene> deploy conda-analytics v 0.0.32 to analytics hadoop coordinator hosts T356231 T356230

Mentioned in SAL (#wikimedia-analytics) [2024-06-26T11:47:41Z] <stevemunene> deploy conda-analytics v 0.0.32 to analytics airflow hosts T356231 T356230

nshahquinn-wmf lowered the priority of this task from Medium to Low.

I'd like to keep this open, mainly for documentation, since it's still true that we can't use the latest versions of Pandas and Numpy because of the package versions in Conda-Analytics.


If we keep it open, would it be possible to reword the description to reflect the current situation and the desired outcome, even if that is a long-term goal?
My concern is that, now that significant work has been done to enable pinning across clones and to pin the numpy and pandas packages as requested, the next steps in terms of SRE involvement are unclear.

You mention version dependencies between:

  • pandas 2.2 and pyarrow < 10.0.1
  • pandas 2.0 and pyspark < 3.5
  • numpy 1.24 and pyspark = 3.1.2

We already have a ticket to upgrade the production version of Spark (T338057: Upgrade Spark to a version with long-term Iceberg support, and with fixes to support Dumps 2.0), although we have only gone as far as version 3.4.1 in that ticket at the moment.

Perhaps there would be more value in creating specific tickets for upgrading certain packages; then we could more accurately set out the blockers in terms of upgrading the dependencies.

  • Upgrading pandas in conda-analytics to 2.0 or later would depend on upgrading pyspark to 3.5 or later
  • Upgrading pandas in conda-analytics to 2.2 or later would depend on upgrading pyarrow to 10.0.1 or later
  • Upgrading numpy in conda-analytics to 1.24 or later would depend on a fix for the np.bool usage in pyspark 3.1.2, most likely a pyspark upgrade

I feel that this approach would be more likely to allow us to prioritise the software upgrades effectively and keep a steady stream of updates happening to conda-analytics.
You could keep this ticket open for your own team's benefit as a parent ticket and a tracking mechanism for these various upgrades, if that helps.

In addition to updating the comments on the pins, I ended up reformatting conda-environment.yml for readability and copyediting the readme. I hope that's welcome!

Marking this as done on the SRE board in favour of the recently created tasks on upgrading numpy, pyarrow, pandas and pyspark.

Marking this as resolved in favour of the created tasks on upgrading numpy, pyarrow, pandas and pyspark.