While working on releasing a new version of Wmfdata-Python for T345482, I noticed several dependency problems that will arise if a user upgrades to the latest versions of Pandas and Numpy:
- Pandas 2.2 increased the minimum required version of Pyarrow from 7.0.0 to 10.0.1, but Conda-Analytics pins Pyarrow at 9.0.0. Pyarrow isn't a required dependency of Pandas (although it is slated to become one in Pandas 3.0), but if it is installed, having 9.0.0 causes errors when trying to use Pandas with Parquet files (reproduced in the first sketch below).
- Pandas 2.0 removed support for casting to the unitless `datetime64` dtype, which PySpark attempts when collecting a datetime field to a Pandas dataframe, causing `TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.` PySpark fixes this in either 4.0 (according to the bug tracker) or 3.5 (according to this StackOverflow answer); see the second sketch below.
- Numpy 1.24 removed the `np.bool` alias, which PySpark 3.1.2 accesses when reading a boolean field (third sketch below).
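To make the Pyarrow conflict concrete, here is a minimal sketch (the file name is illustrative, and the exact error wording depends on the Pandas build):

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2, 3]})

# With Pandas 2.2 and Pyarrow 9.0.0 installed, this raises an ImportError
# along the lines of:
#   Pandas requires version '10.0.1' or newer of 'pyarrow'
df.to_parquet("example.parquet")
```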
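A minimal sketch of the datetime collection failure, assuming a local Spark session (the `ts` column name is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Under Pandas >= 2.0 with an affected PySpark version, collecting a
# timestamp column to Pandas fails with:
#   TypeError: Casting to unit-less dtype 'datetime64' is not supported.
spark.sql("SELECT current_timestamp() AS ts").toPandas()
```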
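And a similar sketch for the `np.bool` removal (again assuming a local Spark session; the `flag` column name is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Under Numpy >= 1.24 with PySpark 3.1.2, collecting a boolean column
# fails because PySpark's type-mapping code still references np.bool:
#   AttributeError: module 'numpy' has no attribute 'bool'
spark.sql("SELECT true AS flag").toPandas()
```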
One way to handle this would be to pin Pandas below 2.0 and Numpy below 1.24, but note that actually pinning versions requires fixing T356231.
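For illustration only, the pins could look something like this in a setuptools-style `setup.py`; this is a hypothetical sketch, not Wmfdata-Python's actual packaging configuration:

```python
# Hypothetical setup.py excerpt; Wmfdata-Python's real dependency list
# may differ, and (per T356231) these pins can't actually take effect yet.
from setuptools import setup

setup(
    name="wmfdata",
    install_requires=[
        "pandas<2.0",  # avoids both the unitless-datetime64 removal (2.0) and the Pyarrow minimum bump (2.2)
        "numpy<1.24",  # keeps the np.bool alias that PySpark 3.1.2 references
    ],
)
```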