Wmfdata should connect to Presto using the analytics-presto CNAME
Closed, ResolvedPublic

Assigned To

Authored By

	nshahquinn-wmf
	Sep 1 2023, 10:39 PM

Description

Currently, Wmfdata connects directly to an-coord1001.eqiad.wmnet. However, we should be connecting instead through the analytics-presto.eqiad.wmnet CNAME (T273642), so that Wmfdata will adapt seamlessly if the coordinator role is switched to a different server.

I tried just switching to the new host name, but that failed with CertificateError: hostname 'analytics-presto.eqiad.wmnet' doesn't match 'an-coord1001.eqiad.wmnet'.

The relevant code is in wmfdata-python/wmfdata/presto.py.
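
The CertificateError above comes from TLS hostname verification: the name the client connects to must appear in the server certificate, and a DNS CNAME does not change which names the certificate lists. Below is a minimal sketch of that check, assuming a certificate dictionary shaped like the one returned by Python's ssl.SSLSocket.getpeercert(). The function name hostname_matches and the second certificate are hypothetical and only illustrate the idea; this is not the code in wmfdata/presto.py.

```python
def hostname_matches(cert: dict, hostname: str) -> bool:
    """Return True if `hostname` matches a DNS entry in the certificate's subjectAltName."""
    hostname = hostname.lower().rstrip(".")
    for kind, value in cert.get("subjectAltName", ()):
        if kind != "DNS":
            continue
        pattern = value.lower().rstrip(".")
        if pattern == hostname:
            return True
        # Minimal wildcard support: "*.example.org" matches one extra left-most label.
        if pattern.startswith("*.") and "." in hostname:
            if hostname.split(".", 1)[1] == pattern[2:]:
                return True
    return False


# Certificate that only names the coordinator itself, as in the error above.
old_cert = {"subjectAltName": (("DNS", "an-coord1001.eqiad.wmnet"),)}
# Hypothetical certificate that also lists the CNAME as a subject alternative name.
new_cert = {
    "subjectAltName": (
        ("DNS", "an-coord1001.eqiad.wmnet"),
        ("DNS", "analytics-presto.eqiad.wmnet"),
    )
}

print(hostname_matches(old_cert, "analytics-presto.eqiad.wmnet"))
print(hostname_matches(new_cert, "analytics-presto.eqiad.wmnet"))
```

Under these assumptions, connecting through the CNAME can only verify once the certificate presented by the coordinator includes the CNAME itself.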

Details

	Subject	Repo	Branch	Lines +/-
	Switch presto from Puppet to PKI certificates	operations/puppet	production	+6 -0


Related Objects

Status	Assigned	Task
Resolved	BTullis	T345482 Wmfdata should connect to Presto using the analytics-presto CNAME
Resolved	BTullis	T336045 Bring an-coord100[3-4] into service
Resolved	brouberol	T353774 Decom an-coord100[1-2]
Duplicate	None	T332572 Refresh hadoop coordinators an-coord100[1-2] with an-coord[3-4]

Event Timeline

nshahquinn-wmf created this task. Sep 1 2023, 10:39 PM

Restricted Application added projects: Data-Engineering, Product-Analytics. Sep 1 2023, 10:39 PM

Restricted Application added a subscriber: Aklapper.

@BTullis do you have any idea how to make the CNAME work here?

This will lead to unexpected breakage and need an immediate patch at some point, when the coordinator role is switched to a different server.

Change 709713 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update Presto TLS configuration in production

https://gerrit.wikimedia.org/r/709713

gerritbot added a project: Patch-For-Review. Sep 4 2023, 11:45 AM

Hi @nshahquinn-wmf - I believe that I do know how we can make this work and I think we have a patch ready to go.

However, we will need to make a corresponding patch to wmfdata-python, because we will also need to update the ca_bundle that is in use here, and we will need to coordinate the deployment, to make sure that it doesn't interrupt people's use of wmfdata-python.

The change is already in place on the test cluster, but we haven't promoted it to production yet. Do we have a build of wmfdata-python that connects to the test cluster?
If so, we could test it with the DNS CNAME of analytics-test-presto, which is an alias for an-test-coord1001.eqiad.wmnet
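
The reason the ca_bundle matters: the client only trusts a server certificate if it chains to a CA in the bundle it was given, so a certificate issued by a different CA needs a bundle that contains that CA. Here is a rough sketch of building such a verifying client context with Python's standard ssl module. The helper name make_tls_context and the bundle path are assumptions made for illustration, not what wmfdata-python actually does.

```python
import os
import ssl


def make_tls_context(ca_bundle=None):
    """Build a client TLS context that verifies the server certificate and hostname."""
    context = ssl.create_default_context()
    if ca_bundle and os.path.exists(ca_bundle):
        # Trust the CAs in this bundle in addition to the system defaults.
        context.load_verify_locations(cafile=ca_bundle)
    return context


# Hypothetical bundle path; substitute whatever bundle the client is configured to use.
context = make_tls_context("/etc/ssl/certs/ca-certificates.crt")
print(context.check_hostname, context.verify_mode == ssl.CERT_REQUIRED)
```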

BTullis added a project: Data-Platform-SRE. Sep 5 2023, 2:41 PM

BTullis moved this task from Incoming to Ready for Work on the Data-Platform-SRE board.

mpopov moved this task from Triage to Tracking on the Product-Analytics board. Sep 6 2023, 3:48 PM

Gehel assigned this task to BTullis. Nov 15 2023, 9:54 AM

Gehel moved this task from Ready for Work to In Progress on the Data-Platform-SRE board.

@nshahquinn-wmf - I have made a patch to wmfdata-python here: https://github.com/wikimedia/wmfdata-python/pull/47 and requested your review

I wasn't sure whether I should make the changes related to a new release (metadata.py and CHANGELOG.md) in this PR, or if you wanted that release process to be separate.
Anyway, this new version can be released at any time. It's safe to do so.

Once we have upgraded wmfdata-python everywhere, then we can deploy https://gerrit.wikimedia.org/r/709713 which will update the presto certificates to use the PKI.
Then we can do another PR to wmfdata-python to change the URL that is in use.

BTullis moved this task from In Progress to Needs Review on the Data-Platform-SRE board. Nov 28 2023, 2:51 PM

Gehel edited projects, added Data-Platform-SRE (2023.12.01 - 2023.12.31); removed Data-Platform-SRE, Patch-For-Review, Product-Analytics, Data-Engineering, Wmfdata-Python. Dec 6 2023, 9:26 AM

Gehel moved this task from Backlog to Needs Review on the Data-Platform-SRE (2023.12.01 - 2023.12.31) board. Dec 6 2023, 9:26 AM

btullis opened https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/39

Update the version of wmfdata-python used in conda-analytics

btullis merged https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/39

Update the version of wmfdata-python used in conda-analytics

Ugh! conda-analytics version 0.0.26 is failing to run conda-analytics-clone mycoolenv in the test environment.

Creating new cloned conda env mycoolenv...
Source:      /opt/conda-analytics
Destination: /home/btullis/.conda/envs/mycoolenv
The following packages cannot be cloned out of the root environment:
 - conda-forge/linux-64::conda-23.7.4-py310hff52083_0
 - conda-forge/noarch::conda-libmamba-solver-23.7.0-pyhd8ed1ab_0
Packages: 215
Files: 958

Downloading and Extracting Packages


Downloading and Extracting Packages

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate mycoolenv
#
# To deactivate an active environment, use
#
#     $ conda deactivate

Collecting package metadata (current_repodata.json): done
Solving environment: | 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/linux-64::arrow-cpp==8.0.0=py310h3098874_0
  - defaults/noarch::pyspark==3.1.2=pyhd3eb1b0_0
  - conda-forge/linux-64::abseil-cpp==20211102.0=h93e1e8c_3
  - conda-forge/linux-64::grpc-cpp==1.46.3=h0b91f02_1
  - defaults/linux-64::pyarrow==8.0.0=py310h468efa6_0
  - conda-forge/linux-64::libabseil==20211102.0=cxx17_h48a1fff_3
unsuccessful initial attempt using frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: / 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/linux-64::arrow-cpp==8.0.0=py310h3098874_0
  - defaults/noarch::pyspark==3.1.2=pyhd3eb1b0_0
  - conda-forge/linux-64::abseil-cpp==20211102.0=h93e1e8c_3
  - conda-forge/linux-64::grpc-cpp==1.46.3=h0b91f02_1
  - defaults/linux-64::pyarrow==8.0.0=py310h468efa6_0
  - conda-forge/linux-64::libabseil==20211102.0=cxx17_h48a1fff_3
\ 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed                                                                                                                                                                                                             

UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (boltons):

  - conda-libmamba-solver=23.7.0 -> boltons[version='>=23.0.0']
  - conda-libmamba-solver=23.7.0 -> conda[version='>=23.5.0'] -> conda-package-handling[version='>=1.3.0|>=2.2.0']
  - conda-libmamba-solver=23.7.0 -> conda[version='>=23.5.0'] -> jsonpatch[version='>=1.32']
  - conda-libmamba-solver=23.7.0 -> conda[version='>=23.5.0'] -> packaging[version='>=23.0']
  - conda-libmamba-solver=23.7.0 -> conda[version='>=23.5.0'] -> pluggy[version='>=1.0.0']
  - conda-libmamba-solver=23.7.0 -> conda[version='>=23.5.0'] -> pycosat[version='>=0.6.3']
  - conda-libmamba-solver=23.7.0 -> conda[version='>=23.5.0'] -> pyopenssl[version='>=16.2.0']
  - conda-libmamba-solver=23.7.0 -> conda[version='>=23.5.0'] -> python_abi=3.10[build=*_cp310]
  - conda-libmamba-solver=23.7.0 -> conda[version='>=23.5.0'] -> requests[version='>=2.27.0,<3']
  - conda-libmamba-solver=23.7.0 -> conda[version='>=23.5.0'] -> ruamel.yaml[version='>=0.11.14,<0.18']
  - conda-libmamba-solver=23.7.0 -> conda[version='>=23.5.0'] -> setuptools[version='>=60.0.0']
  - conda-libmamba-solver=23.7.0 -> conda[version='>=23.5.0'] -> toolz[version='>=0.8.1']
  - conda-libmamba-solver=23.7.0 -> conda[version='>=23.5.0'] -> tqdm[version='>=4']
  - conda-libmamba-solver=23.7.0 -> libmambapy[version='>=1.4.1'] -> fmt[version='>=10.1.1,<11.0a0|>=9.1.0,<10.0a0']
  - conda-libmamba-solver=23.7.0 -> libmambapy[version='>=1.4.1'] -> libgcc-ng[version='>=11.2.0|>=12']
  - conda-libmamba-solver=23.7.0 -> libmambapy[version='>=1.4.1'] -> libmamba[version='1.5.3|1.5.4',build='had39da4_0|haf1ee3a_0']
  - conda-libmamba-solver=23.7.0 -> libmambapy[version='>=1.4.1'] -> libstdcxx-ng[version='>=11.2.0|>=12']
  - conda-libmamba-solver=23.7.0 -> libmambapy[version='>=1.4.1'] -> openssl[version='>=3.0.10,<4.0a0|>=3.0.11,<4.0a0|>=3.2.0,<4.0a0|>=3.1.4,<4.0a0|>=3.0.7,<4.0a0']
  - conda-libmamba-solver=23.7.0 -> libmambapy[version='>=1.4.1'] -> pybind11-abi==4
  - conda-libmamba-solver=23.7.0 -> libmambapy[version='>=1.4.1'] -> yaml-cpp[version='>=0.8.0,<0.9.0a0']
  - conda-libmamba-solver=23.7.0 -> python[version='>=3.8'] -> bzip2[version='>=1.0.8,<2.0a0']
  - conda-libmamba-solver=23.7.0 -> python[version='>=3.8'] -> ld_impl_linux-64[version='>=2.35.1|>=2.36.1']
  - conda-libmamba-solver=23.7.0 -> python[version='>=3.8'] -> libffi[version='>=3.4,<3.5|>=3.4,<4.0a0']
  - conda-libmamba-solver=23.7.0 -> python[version='>=3.8'] -> libnsl[version='>=2.0.0,<2.1.0a0|>=2.0.1,<2.1.0a0']
  - conda-libmamba-solver=23.7.0 -> python[version='>=3.8'] -> libsqlite[version='>=3.40.0,<4.0a0|>=3.43.2,<4.0a0']
  - conda-libmamba-solver=23.7.0 -> python[version='>=3.8'] -> libuuid[version='>=1.41.5,<2.0a0|>=2.32.1,<3.0a0|>=2.38.1,<3.0a0']
  - conda-libmamba-solver=23.7.0 -> python[version='>=3.8'] -> libzlib[version='>=1.2.13,<1.3.0a0']
  - conda-libmamba-solver=23.7.0 -> python[version='>=3.8'] -> ncurses[version='>=6.3,<7.0a0|>=6.4,<7.0a0']
  - conda-libmamba-solver=23.7.0 -> python[version='>=3.8'] -> pip
  - conda-libmamba-solver=23.7.0 -> python[version='>=3.8'] -> readline[version='>=8.0,<9.0a0|>=8.1.2,<9.0a0|>=8.2,<9.0a0']
  - conda-libmamba-solver=23.7.0 -> python[version='>=3.8'] -> tk[version='>=8.6.12,<8.7.0a0|>=8.6.13,<8.7.0a0']
  - conda-libmamba-solver=23.7.0 -> python[version='>=3.8'] -> tzdata
  - conda-libmamba-solver=23.7.0 -> python[version='>=3.8'] -> xz[version='>=5.2.6,<6.0a0|>=5.4.2,<6.0a0']
  - conda-libmamba-solver=23.7.0 -> python[version='>=3.8'] -> zlib[version='>=1.2.13,<1.3.0a0']
  - conda=23.7.4 -> boltons[version='>=23.0.0']
  - conda=23.7.4 -> conda-package-handling[version='>=1.3.0'] -> conda-package-streaming[version='>=0.9.0']
  - conda=23.7.4 -> conda-package-handling[version='>=1.3.0'] -> zstandard[version='>=0.15']
  - conda=23.7.4 -> jsonpatch[version='>=1.32'] -> jsonpointer[version='>=1.9']
  - conda=23.7.4 -> packaging[version='>=23.0']
  - conda=23.7.4 -> pluggy[version='>=1.0.0']
  - conda=23.7.4 -> pycosat[version='>=0.6.3'] -> libgcc-ng[version='>=11.2.0|>=12']
  - conda=23.7.4 -> pyopenssl[version='>=16.2.0'] -> cryptography[version='>=38.0.0,<42,!=40.0.0,!=40.0.1|>=41.0.5,<42']
  - conda=23.7.4 -> python[version='>=3.10,<3.11.0a0'] -> bzip2[version='>=1.0.8,<2.0a0']
  - conda=23.7.4 -> python[version='>=3.10,<3.11.0a0'] -> ld_impl_linux-64[version='>=2.35.1|>=2.36.1']
  - conda=23.7.4 -> python[version='>=3.10,<3.11.0a0'] -> libffi[version='>=3.4,<3.5|>=3.4,<4.0a0']
  - conda=23.7.4 -> python[version='>=3.10,<3.11.0a0'] -> libnsl[version='>=2.0.0,<2.1.0a0|>=2.0.1,<2.1.0a0']
  - conda=23.7.4 -> python[version='>=3.10,<3.11.0a0'] -> libsqlite[version='>=3.40.0,<4.0a0|>=3.43.2,<4.0a0']
  - conda=23.7.4 -> python[version='>=3.10,<3.11.0a0'] -> libuuid[version='>=1.41.5,<2.0a0|>=2.32.1,<3.0a0|>=2.38.1,<3.0a0']
  - conda=23.7.4 -> python[version='>=3.10,<3.11.0a0'] -> libzlib[version='>=1.2.13,<1.3.0a0']
  - conda=23.7.4 -> python[version='>=3.10,<3.11.0a0'] -> ncurses[version='>=6.3,<7.0a0|>=6.4,<7.0a0']
  - conda=23.7.4 -> python[version='>=3.10,<3.11.0a0'] -> openssl[version='>=3.0.10,<4.0a0|>=3.0.7,<4.0a0|>=3.1.4,<4.0a0']
  - conda=23.7.4 -> python[version='>=3.10,<3.11.0a0'] -> pip
  - conda=23.7.4 -> python[version='>=3.10,<3.11.0a0'] -> readline[version='>=8.0,<9.0a0|>=8.1.2,<9.0a0|>=8.2,<9.0a0']
  - conda=23.7.4 -> python[version='>=3.10,<3.11.0a0'] -> tk[version='>=8.6.12,<8.7.0a0|>=8.6.13,<8.7.0a0']
  - conda=23.7.4 -> python[version='>=3.10,<3.11.0a0'] -> tzdata
  - conda=23.7.4 -> python[version='>=3.10,<3.11.0a0'] -> xz[version='>=5.2.6,<6.0a0|>=5.4.2,<6.0a0']
  - conda=23.7.4 -> python[version='>=3.10,<3.11.0a0'] -> zlib[version='>=1.2.13,<1.3.0a0']
  - conda=23.7.4 -> python_abi=3.10[build=*_cp310]
  - conda=23.7.4 -> requests[version='>=2.27.0,<3'] -> certifi[version='>=2017.4.17']
  - conda=23.7.4 -> requests[version='>=2.27.0,<3'] -> charset-normalizer[version='>=2,<4']
  - conda=23.7.4 -> requests[version='>=2.27.0,<3'] -> idna[version='>=2.5,<4']
  - conda=23.7.4 -> requests[version='>=2.27.0,<3'] -> urllib3[version='>=1.21.1,<2|>=1.21.1,<3']
  - conda=23.7.4 -> ruamel.yaml[version='>=0.11.14,<0.18'] -> ruamel.yaml.clib[version='>=0.1.2|>=0.2.6']
  - conda=23.7.4 -> ruamel.yaml[version='>=0.11.14,<0.18'] -> setuptools
  - conda=23.7.4 -> setuptools[version='>=60.0.0']
  - conda=23.7.4 -> toolz[version='>=0.8.1']
  - conda=23.7.4 -> tqdm[version='>=4'] -> colorama

btullis opened https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/40

Downgrade some libmamba related packages

btullis merged https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/40

Fix a problem with cloning conda-analytics

Thanks for working on this while I was on leave, @BTullis!

If I'm understanding correctly, after gerrit 709713 is deployed, Wmfdata < 2.2 will no longer be able to use Presto. You may already be planning this, but even after the new version of conda-analytics is deployed, we should work on notifying users and giving them a deadline to upgrade to Wmfdata 2.2 (I think 2-3 weeks would be sufficient). I think it's safe to assume most people are not very prompt with updating, so this way, after changing the Presto certificate, we will just have a few people who suddenly can't query Presto, rather than a lot 😁
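
To make the "Wmfdata < 2.2" cut-off concrete, here is a small sketch of how a user or a notification script might decide whether an installed version needs upgrading. The helper names parse_version and needs_upgrade are hypothetical, and the parser only handles plain dotted integers such as 2.2.0.

```python
def parse_version(text, width=3):
    """Turn '2.2.0' into (2, 2, 0); pads short versions such as '2.2' with zeros."""
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def needs_upgrade(installed, minimum="2.2.0"):
    """Return True if the installed version is older than the minimum."""
    return parse_version(installed) < parse_version(minimum)


print(needs_upgrade("2.1.0"))
print(needs_upgrade("2.2.0"))
```

The installed version string could be obtained with importlib.metadata.version("wmfdata"), assuming that is the distribution name the package is installed under.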

@nshahquinn-wmf - welcome back!

If I'm understanding correctly, after gerrit 709713 is deployed, Wmfdata < 2.2 will no longer be able to use Presto.

That is absolutely correct.

This roll-out is going to have to happen across multiple stages and I'm trying my best to avoid any breaking changes for people, if possible.

So yes, the next phase is to deploy version 0.0.27 of conda-analytics, which will install wmfdata-python 2.2 by default, for any newly created conda environments. At this point I think that we should make the announcement to users that they should upgrade wmfdata when convenient. As you suggest, we will probably give users 2-3 weeks to do this, before making the change to the presto configuration.

I have already begun working on the next phase, which I had been planning to implement as a big-bang switch of the discovery URI and associated kerberos principal.
As such, I created this PR, which would support both the old and new hostnames.

However, I am starting to think that there is likely a better approach if I use disaggregated presto coordinators, rather than a single coordinator as we have been up until now.
We have an-coord100[3-4] ready to be put into service, so I think that perhaps the best solution would be if I can get all three of an-coord100[1,3,4] working as coordinators at the same time. Then I can make changes a little more iteratively and should be able to avoid any big-bang switch-over operations for users.
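
One generic way for a client to "support both the old and new hostnames" during a transition is to try them in order and use the first that works. The sketch below illustrates that pattern only; the PR mentioned above is not reproduced here, so this should not be read as its implementation. connect_with_fallback and fake_connect are hypothetical names, and fake_connect merely simulates a failing and a working host.

```python
def connect_with_fallback(hosts, connect):
    """Try each host in order and return (host, connection) for the first that works."""
    errors = {}
    for host in hosts:
        try:
            return host, connect(host)
        except Exception as exc:  # a real client would catch narrower exceptions
            errors[host] = exc
    raise ConnectionError(f"all hosts failed: {errors}")


# Hypothetical stand-in for the real connection call, for demonstration only.
def fake_connect(host):
    if host == "analytics-presto.eqiad.wmnet":
        raise OSError("certificate does not match hostname")
    return f"connection to {host}"


HOSTS = ["analytics-presto.eqiad.wmnet", "an-coord1001.eqiad.wmnet"]
host, conn = connect_with_fallback(HOSTS, fake_connect)
print(host)
```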

Mentioned in SAL (#wikimedia-analytics) [2023-12-18T10:54:04Z] <btullis> deploy conda-analytics v 0.0.27 to the hadoop-test-analytics cluster for T345482

I have verified that conda-analytics version 0.0.27 seems fine with jupyter. It can create a new conda environment with wmfdata-python.

In T345482#9412275, @BTullis wrote:

That is absolutely correct.

This roll-out is going to have to happen across multiple stages and I'm trying my best to avoid any breaking changes for people, if possible.

Nice, sounds like you already have everything well under control! Please let me know if you need any help from me.

I am pushing out version 0.0.27 of conda-analytics to production now, with:

btullis@cumin1001:~$ sudo debdeploy deploy -u 2023-12-19-conda-analytics.yaml -Q 'C:conda_analytics'
Rolling out conda-analytics:
Library update, several services might need to be restarted

All hosts where conda-analytics is deployed now have version 0.0.27 installed.

I have sent out a mail requesting that users upgrade wmfdata to version 2.2.0.
The proposed timescale for us to switch the certificates on the presto cluster is mid-January 2024.

Gehel edited projects, added Data-Platform-SRE (2024.01.01 - 2024.01.21); removed Data-Platform-SRE (2023.12.01 - 2023.12.31). Dec 19 2023, 4:36 PM

BTullis moved this task from Backlog to Blocked / Waiting on the Data-Platform-SRE (2024.01.01 - 2024.01.21) board. Jan 2 2024, 2:23 PM

Change 709713 merged by Btullis:

[operations/puppet@production] Switch presto from Puppet to PKI certificates

https://gerrit.wikimedia.org/r/709713

I've now deployed the change so that Presto is using PKI certificates and each node has a keytab containing two principals.
For example:

root@an-coord1004:/etc/presto/ssl# klist -k /etc/security/keytabs/presto/presto.keytab 
Keytab name: FILE:/etc/security/keytabs/presto/presto.keytab
KVNO Principal
---- --------------------------------------------------------------------------
   1 presto/an-coord1004.eqiad.wmnet@WIKIMEDIA
   1 presto/analytics-presto.eqiad.wmnet@WIKIMEDIA
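
A check like the one in the listing above could be scripted across nodes. The following is a minimal sketch that parses the text printed by klist -k and reports the principals it contains. The function name keytab_principals is hypothetical, and the sample text is copied from the listing above.

```python
def keytab_principals(klist_output):
    """Extract principal names from the text printed by `klist -k`."""
    principals = []
    in_table = False
    for line in klist_output.splitlines():
        if line.startswith("----"):
            in_table = True  # entries follow the dashed separator
            continue
        if in_table and line.strip():
            # Each entry is "<KVNO> <principal>".
            _kvno, principal = line.split(None, 1)
            principals.append(principal.strip())
    return principals


SAMPLE = """\
Keytab name: FILE:/etc/security/keytabs/presto/presto.keytab
KVNO Principal
---- --------------------------------------------------------------------------
   1 presto/an-coord1004.eqiad.wmnet@WIKIMEDIA
   1 presto/analytics-presto.eqiad.wmnet@WIKIMEDIA
"""

print("presto/analytics-presto.eqiad.wmnet@WIKIMEDIA" in keytab_principals(SAMPLE))
```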

However, I've tried manually configuring an-coord1004 to use this principal presto/analytics-presto.eqiad.wmnet@WIKIMEDIA when serving HTTPS, but it's proving a bit tricky.

The documentation says a couple of contradictory things:

One example mentions: http.server.authentication.krb5.service-hostname and another mentions: http.server.authentication.krb5.principal-hostname

The description for http.server.authentication.krb5.principal-hostname says...

The Kerberos hostname for the Presto coordinator. Must match the Kerberos principal. This parameter is optional. If included, Presto will use this value in the host part of the Kerberos principal instead of the machine’s hostname.

However, whichever one of these I try, I get the same result.

WARN        main        Bootstrap        UNUSED PROPERTIES
WARN        main        Bootstrap        http.server.authentication.krb5.principal-hostname
WARN        main        Bootstrap
ERROR        main        com.facebook.presto.server.PrestoServer        Unable to create injector, see the following errors:
1) Configuration property 'http.server.authentication.krb5.principal-hostname' was not used
   at com.facebook.airlift.bootstrap.Bootstrap.lambda$initialize$2(Bootstrap.java:244)

WARN        main        Bootstrap        UNUSED PROPERTIES
WARN        main        Bootstrap        http.server.authentication.krb5.service-hostname
WARN        main        Bootstrap
ERROR        main        com.facebook.presto.server.PrestoServer        Unable to create injector, see the following errors:
1) Configuration property 'http.server.authentication.krb5.service-hostname' was not used
   at com.facebook.airlift.bootstrap.Bootstrap.lambda$initialize$2(Bootstrap.java:244)

When I look through the debug information, I can see that there is a property named: http.authentication.krb5.principal-hostname (so without the .server. part), but when I try setting this to:

http.authentication.krb5.principal-hostname=analytics-presto.eqiad.wmnet

...then it throws an error attempting to announce itself to the discovery listener:

ERROR        Announcer-3        com.facebook.airlift.discovery.client.Announcer        Service announcement failed after 14.88ms. Next request will happen within 1000.00ms
WARN        http-client-node-manager-60        com.facebook.presto.metadata.HttpRemoteNodeState        Error fetching node state from https://an-coord1004.eqiad.wmnet:8281/v1/info/state returned status 401: Authentication failed for token: <snip>

I'm currently checking the contents of /etc/presto/ssl/server.p12 to see if it contains everything we need.

Oh, this is going to be harder to achieve than I first thought.

I have gone back to this page: http://prestodb.io/blog/2022/04/15/disggregated-coordinator/ which is about how to run multiple Presto coordinators.
Upon closer inspection, in order to be able to run multiple coordinators, which would facilitate switching between them, we have to run a new component called a Resource Manager.

Without this component, we cannot have a single view of the cluster that is shared between multiple coordinators.
Therefore, if we have two presto coordinators running, one of them would always have all of the workers registered, while the other one would have none of the workers registered.

It seems to me that there are three options.

1. Do not use analytics-presto.eqiad.wmnet at all. This means that we would continue to use a hostname in wmfdata/presto.py and moving the coordinator is a breaking change for all wmfdata users. However, with the exception of T336045: Bring an-coord100[3-4] into service this happens very infrequently.
2. Use analytics-presto.eqiad.wmnet via DNS, but accept that moving the CNAME from one host to another will require a full presto cluster restart. This will mean that failover is manual, running presto queries will be terminated and it could take minutes for the cluster to stabilise.
3. Change the configuration for the Presto cluster to use disaggregated coordinators, using two Resource Managers in addition to the two coordinators.

If introducing Resource Managers into the mix, we have options as to whether they run on VMs, or as Kubernetes based services. In fact, we could even make the coordinators run under Kubernetes as well.

There are some useful videos here to help us decide:

BTullis moved this task from Blocked / Waiting to In Progress on the Data-Platform-SRE (2024.01.01 - 2024.01.21) board. Jan 16 2024, 3:40 PM

Gehel edited projects, added Data-Platform-SRE (2024.01.22 - 2024.02.11); removed Data-Platform-SRE (2024.01.01 - 2024.01.21). Jan 22 2024, 1:42 PM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.01.22 - 2024.02.11) board. Jan 22 2024, 1:43 PM

BTullis mentioned this in T355886: Write a cookbook to check the age of all Java processes associated with the Hadoop clusters. Jan 25 2024, 1:21 PM

I've been testing out various approaches on this task and I have received great help from @brouberol for which I am very grateful.
As a result, we believe that we have a solution to implement option 2 above (T345482#9462664)

https://github.com/wikimedia/wmfdata-python/pull/50

I have assigned @xcollazo and @nshahquinn-wmf for review.

BTullis mentioned this in T336045: Bring an-coord100[3-4] into service. Jan 26 2024, 4:36 PM

@BTullis My pleasure!

BTullis moved this task from Needs Review to Blocked / Waiting on the Data-Platform-SRE (2024.01.22 - 2024.02.11) board. Jan 30 2024, 10:24 AM

That patch to wmfdata-python has been merged to main now, so the next step is to release a new version as per these instructions.

Would either @xcollazo or @nshahquinn-wmf like to carry out that step, please?

After that, I will:

1. Prepare a new version of conda-analytics containing the new version.
2. Deploy it to the analytics clients.
3. Announce the new version and request that users upgrade wmfdata in their conda environments.
4. Give them some time to do so.

Once those steps are done, I will be able to move the DNS CNAME of analytics-presto.eqiad.wmnet from an-coord1001 to either an-coord1003 or an-coord1004.
It will probably take a rolling presto cluster restart to make it work, but I think that it should work at that point.
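
Before and after moving the CNAME, it can be useful to confirm which host the alias currently points to. The sketch below uses Python's socket.gethostbyname_ex, which returns a (hostname, aliaslist, ipaddrlist) tuple. The wrapper canonical_host and the fake_resolver stand-in are hypothetical, and the example address is a documentation-range placeholder rather than a real one.

```python
import socket


def canonical_host(name, resolver=socket.gethostbyname_ex):
    """Return the primary hostname that `name` resolves to."""
    canonical, _aliases, _addresses = resolver(name)
    return canonical


# Hypothetical resolver standing in for DNS, so the example runs anywhere.
def fake_resolver(name):
    return ("an-coord1003.eqiad.wmnet", [name], ["192.0.2.10"])


print(canonical_host("analytics-presto.eqiad.wmnet", resolver=fake_resolver))
```

Calling canonical_host("analytics-presto.eqiad.wmnet") without the resolver argument would query real DNS, which only works from a network where that name resolves.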

BTullis raised the priority of this task from Medium to High. Jan 30 2024, 10:36 AM

BTullis added a subtask: T336045: Bring an-coord100[3-4] into service.

@BTullis I started working on the release process but realized we should remove the Urllib3 version pin now. I've put that up in PR 51. If you can review and merge that, I can do the release.

You'll also want to remove the version spec from the conda-analytics when you prepare the new version.

In T345482#9499926, @nshahquinn-wmf wrote:

@BTullis I started working on the release process but realized we should remove the Urllib3 version pin now. I've put that up in PR 51. If you can review and merge that, I can do the release.

You'll also want to remove the version spec from the conda-analytics when you prepare the new version.

+1 and merged to unblock you.

nshahquinn-wmf mentioned this in T356230: Conda-Analytics packages incompatible with latest versions of Pandas and Numpy. Jan 31 2024, 1:53 AM

Okay, I've released Wmfdata 2.3.0.

@BTullis while I was testing the new version, I noticed some dependency problems that occur if users upgrade their environments to more recent versions of Numpy or Pandas, as I did (T356230). This isn't actually caused by the code changes in the new version, which is why I just downgraded and finished the releases.

Writing that task spurred me to finally write up something I've been meaning to for a while: nothing in conda-analytics is actually pinned, so it's easy to break your environment by upgrading something (which can easily happen automatically when installing a new package) (T356231).

These should be relatively easy fixes and it would be great if you could take them on when you create the new version of Conda-Analytics, but I totally understand if you need to leave it for later.
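
For context on what pinning could look like: conda reads a file named pinned in an environment's conda-meta directory, where each line is a package spec that the solver must respect. The sketch below renders exact-version lines for such a file. The helper pinned_lines is hypothetical, the two versions are taken from the solver output earlier in this task purely as an illustration, and this is not the example file from T356231.

```python
def pinned_lines(packages):
    """Render {name: version} as exact-version lines for a conda `pinned` file."""
    return [f"{name} =={version}" for name, version in sorted(packages.items())]


# Illustrative selection only; a real file would cover whatever the maintainers choose.
packages = {"pyspark": "3.1.2", "pyarrow": "8.0.0"}
print("\n".join(pinned_lines(packages)))
```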

BTullis moved this task from Blocked / Waiting to In Progress on the Data-Platform-SRE (2024.01.22 - 2024.02.11) board. Jan 31 2024, 2:47 PM

btullis opened https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/41

Bump wmfdata to version 2.3.0 and add dependency

I have created this merge request to include the new version of wmfdata-python in conda-analytics.

In T345482#9500788, @nshahquinn-wmf wrote:

@BTullis while I was testing the new version, I noticed some dependency problems that occur if users upgrade their environments to more recent versions of Numpy or Pandas, as I did (T356230). This isn't actually caused by the code changes in the new version, which is why I just downgraded and finished the releases.

Writing that task spurred me to finally write up something I've been meaning to for a while: nothing in conda-analytics is actually pinned, so it's easy to break your environment by upgrading something (which can easily happen automatically when installing a new package) (T356231).

These should be relatively easy fixes and it would be great if you could take them on when you create the new version of Conda-Analytics, but I totally understand if you need to leave it for later.

Thank you so much @nshahquinn-wmf for that investigation and write-up. Yes, I believe that it would be good to work on pinning the packages. I had started to do a little bit of work on doing so here: T343823#9198960 but didn't take it any further forward. Your example pinned file in T356231: Package versions in Conda-Analytics are not pinned looks very suitable.

However, I think we should take on that work separately, if you don't mind. I appreciate that it should be a relatively simple change, but it will also require some careful thinking about how to keep that file updated, how to test upgrades, and so on. There is also this ticket, which we are considering T321512: Install jupyterhub separately from conda-analytics as an approach and a number of other tickets around the way that we build and manage conda packages.

Therefore, I would rather keep this change to conda-analytics as simple as possible, just so that we can complete T336045: Bring an-coord100[3-4] into service and then T353774: Decom an-coord100[1-2].

Looking at the longer term, I'd like us to be able to implement JupyterHub on Kubernetes instead of users running it on individual stat servers. Doing so would help us to move away from these brittle conda environments altogether, but that's still some way off.

In T345482#9505663, @BTullis wrote:

However, I think we should take on that work separately, if you don't mind. I appreciate that it should be a relatively simple change, but it will also require some careful thinking about how to keep that file updated, how to test upgrades, and so on. There is also this ticket, which we are considering T321512: Install jupyterhub separately from conda-analytics as an approach and a number of other tickets around the way that we build and manage conda packages.

Therefore, I would rather keep this change to conda-analytics as simple as possible, just so that we can complete T336045: Bring an-coord100[3-4] into service and then T353774: Decom an-coord100[1-2].

That totally makes sense! Things that look "relatively simple" from the outside often turn out to be much more complicated when you consider all the implications.

Thank you for the detailed response! It's good to know you were already aware of the issue and have some ideas about how to address it.

I'd like us to be able to implement JupyterHub on Kubernetes instead of users running it on individual stat servers. Doing so would help us to move away from these brittle conda environments altogether, but that's still some way off.

Yeah!!!

btullis merged https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/merge_requests/41

Bump wmfdata to version 2.3.0 and add dependency

I have tested the new version 0.0.28 of conda-analytics, which includes wmfdata 2.3.0, on an-test-client and the presto.run() function still works as expected.

I'll plan to push out version 0.0.28 of conda-analytics on Monday, then request that users upgrade either their conda environments, or their wmfdata versions.

Mentioned in SAL (#wikimedia-analytics) [2024-02-05T14:07:29Z] <btullis> deploying conda-analytics version 0.0.28 to hadoop-all for T345482

I have pushed out the new version of conda-analytics containing the new wmfdata-python, plus I have announced the update to users, so we should start to see all users move to the use of the analytics-presto.eqiad.wmnet CNAME as of now.

Once we have allowed a couple of weeks to pass, we can then think about moving the presto coordinator from an-coord1001 to an-coord1003, or an-coord1004 as part of T336045: Bring an-coord100[3-4] into service.
Either host should work, but it will probably need a full presto cluster restart to make any change to the location of the coordinator.

I'll mark this ticket as done, but leave it open whilst we await any feedback and/or problem reports.

BTullis closed this task as Resolved. Feb 9 2024, 3:01 PM

BTullis closed subtask T336045: Bring an-coord100[3-4] into service as Resolved. Feb 26 2024, 11:24 AM

