
Ensure WDQS stack works on Bullseye
Closed, ResolvedPublic5 Estimated Story Points

Description

Per parent ticket, we must migrate away from Buster by September 2023. Creating this ticket to:

  • Test operation of the current stack on a Bullseye host
  • If necessary, update Puppet and other parts of the stack to ensure the WCQS/WDQS stack works on newer versions of Debian.

Event Timeline

There are a very large number of changes, so older changes are hidden.

Mentioned in SAL (#wikimedia-operations) [2023-04-20T19:16:40Z] <inflatador> bking@cumin1001 depool wdqs2012.codfw.wmnet for data xfer T331300

Mentioned in SAL (#wikimedia-operations) [2023-04-20T21:18:27Z] <inflatador> bking@cumin1001 depool wdqs2009 for data xfer T331300

Mentioned in SAL (#wikimedia-operations) [2023-04-20T21:22:44Z] <inflatador> bking@cumin1001 repool wdqs2012 T331300

Icinga downtime and Alertmanager silence (ID=09a1e24c-01d3-42a5-8179-085a01f32aae) set by bking@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2009.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=53c1ca57-f03f-405d-8e63-67add663e004) set by bking@cumin1001 for 2 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2006.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=45b4cbdf-ff0a-48b7-9921-640f9106f396) set by bking@cumin1001 for 2 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2012.codfw.wmnet

We're examining wdqs2022, where we have completed the transfer of /srv/wdqs/, yet Blazegraph is not starting.

/usr/lib/libjvmquake.so is failing to load, which appears to be preventing wdqs-blazegraph from starting:

Apr 26 21:13:44 wdqs2022 wdqs-blazegraph[1710982]: (jvmquake) using options: threshold=[300s],runtime_weight=[5:1],action=[JVM OOM]
Apr 26 21:13:44 wdqs2022 wdqs-blazegraph[1710982]: agent library failed to init: /usr/lib/libjvmquake.so

Here's the package info for wdqs2022 (bullseye) compared to wdqs2004 (buster):

ryankemper@wdqs2004:~$ java -version
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (build 1.8.0_362-8u362-ga-4~deb10u1-b09)
OpenJDK 64-Bit Server VM (build 25.362-b09, mixed mode)

ryankemper@wdqs2004:~$ dpkg -l | grep jvmquake
ii  jvmquake                             1.0.1-1+deb10u1              amd64        A JVMTI agent that kills your JVM when things go sideways
ryankemper@wdqs2022:~$ java -version
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (build 1.8.0_362-8u362-ga-4~deb11u1-b09)
OpenJDK 64-Bit Server VM (build 25.362-b09, mixed mode)

ryankemper@wdqs2022:~$ dpkg -l | grep jvmquake
ii  jvmquake                             1.0.1-1+deb11u1                amd64        A JVMTI agent that kills your JVM when things go sideways
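A small helper (a sketch, not part of the stack) makes the version suffixes above easier to compare across hosts; the +deb10u1 vs +deb11u1 difference is the tell:

```shell
# Extract a package's version from `dpkg -l` output; the +deb10u1 vs
# +deb11u1 suffix shows which Debian release the package was built for.
pkg_version() {
  awk -v pkg="$1" '$1 == "ii" && $2 == pkg { print $3 }'
}
# On a real host: dpkg -l | pkg_version jvmquake
```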
bking added a subscriber: dcausse.

Here's what @dcausse and I did at today's pairing session:

  • Realized that the jvmquake package for Bullseye is built against Java 11, whereas we need Java 8
  • Removed the Java-11 jvmquake packages from the main bullseye-wikimedia Debian repo and copied the Java-8 jvmquake packages from main wikimedia-buster to the main wikimedia-bullseye repo.
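For the record, the repo-side steps above look roughly like this, assuming the apt repo is managed with reprepro (distribution and component names are illustrative, not verified against the real repo host); repo_cmds is a hypothetical helper that just prints the commands to run:

```shell
# Sketch only: print the reprepro commands for removing the Java-11 build
# of jvmquake from bullseye and copying the Java-8 build over from buster.
# Distribution/component names are assumptions, not verified.
repo_cmds() {
  echo 'reprepro -C main remove bullseye-wikimedia jvmquake'
  echo 'reprepro -C main copy bullseye-wikimedia buster-wikimedia jvmquake'
}
repo_cmds
```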

Our next attempt to start wdqs-blazegraph.service on wdqs2022 met with a new error:
Error: Could not find or load main class org.eclipse.jetty.runner.Runner

We believe that this is caused by an incomplete scap deploy, so fixing the deploy is our next step.

Gehel set the point value for this task to 5. May 1 2023, 3:26 PM

Icinga downtime and Alertmanager silence (ID=51e1d4cd-32ce-4dfa-ae82-91dd8c3e940b) set by bking@cumin1001 for 12 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2022.codfw.wmnet

Change 914381 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: add wdqs2022 to conftool

https://gerrit.wikimedia.org/r/914381

Change 914381 merged by Bking:

[operations/puppet@production] wdqs: add wdqs2022 to conftool

https://gerrit.wikimedia.org/r/914381

Change 914384 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: Add wdqs2022 as scap target

https://gerrit.wikimedia.org/r/914384

Change 914384 merged by Bking:

[operations/puppet@production] wdqs: Add wdqs2022 as scap target

https://gerrit.wikimedia.org/r/914384

Icinga downtime and Alertmanager silence (ID=9386abfe-15eb-44c9-befd-5bcc42b22df7) set by bking@cumin1001 for 14 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2022.codfw.wmnet

We were unable to deploy because the git-fat package was not available for Bullseye. We copied it from the Buster repo using these instructions.

After installing git-fat and re-running Puppet, we were able to get the wdqs services to start cleanly. Our test queries also passed.

At this point, I believe we've confirmed that the WDQS stack runs on Bullseye. But we do want to leave this open until @dcausse returns next week so he can validate as well.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS bullseye

Change 918597 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] [WIP]wdqs: Activate wdqs2021

https://gerrit.wikimedia.org/r/918597

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2021 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster executed with errors:

  • wdqs2021 (FAIL)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster completed:

  • wdqs2021 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202305102152_bking_3857978_wdqs2021.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 918597 merged by Ryan Kemper:

[operations/puppet@production] wdqs: Activate wdqs2021

https://gerrit.wikimedia.org/r/918597

We've noticed that on the bullseye hosts, the blazegraph prometheus exporters are in a restart loop, most likely because the newer Python version breaks the current implementation of our exporter script.

Python 3 version differs between OS versions:

(buster)

ryankemper@wdqs1010:~$ python3 -V
Python 3.7.3

versus
(bullseye)

ryankemper@wdqs2022:~$ python3 -V
Python 3.9.2
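Since the exporter breakage appears version-dependent, a small guard (a sketch, not existing code) could make the mismatch explicit before the exporter does real work:

```shell
# Print the running interpreter's minor version (3.7 on buster, 3.9 on
# bullseye); an exporter script could branch or bail based on this value.
py_minor() { python3 -c 'import sys; print(sys.version_info.minor)'; }
py_minor
```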
Gehel renamed this task from Ensure WCQS/WDQS stack works on Bullseye and later to Ensure WCQS/WDQS stack works on Bullseye. May 11 2023, 6:59 PM

Icinga downtime and Alertmanager silence (ID=2ba9c32a-8bbc-4b94-9ca0-1fbeeaee45e7) set by bking@cumin1001 for 14 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2021.codfw.wmnet

We noticed errors deploying the latest wdqs version to Bullseye:

  File "/var/lib/scap/scap/lib/python3.9/site-packages/scap/runcmd.py", line 91, in gitcmd
    return _runcmd(["git", subcommand] + list(args), **kwargs)
  File "/var/lib/scap/scap/lib/python3.9/site-packages/scap/runcmd.py", line 78, in _runcmd
    raise FailedCommand(argv, p.returncode, stdout, stderr)
scap.runcmd.FailedCommand: Command 'git fat init' failed with exit code 1; stdout:

19:08:57 [wdqs2022.codfw.wmnet] deploy-local failed: <FailedCommand> {'exitcode': 1, 'stdout': '', 'stderr': "git: 'fat' is not a git command. See 'git --help'.\n\nThe most similar commands are\n\tfetch\n\tmktag\n\tstage\n\tstash\n\ttag\n\tvar\n"}

As previously mentioned, git-fat was missing from wdqs2022. To fix this issue, we reinstalled it, then ran git fat init and git fat pull from the WDQS deploy directory on wdqs2022. git-fat uses Python 2, and Puppet is configured to remove Python-2-related packages, so we'll have to find a way to work around this.
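The manual recovery above can be sketched as follows (the deploy path is the one from this task; has_git_fat is a hypothetical helper, not existing tooling):

```shell
# Check that the git-fat binary is on PATH before attempting the deploy.
has_git_fat() { command -v git-fat >/dev/null 2>&1 && echo yes || echo no; }
has_git_fat
# If "yes", re-initialize fat objects in the deploy tree:
#   cd /srv/deployment/wdqs/wdqs && git fat init && git fat pull
```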

Note that the Search Platform team has already been asked to replace git-fat with git-lfs. I'm not sure how quickly that will happen, so we probably want to update our Puppet code to allow Python 2 on our wdqs Bullseye hosts.

Change 920365 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] query_service: Permit python2 on bullseye

https://gerrit.wikimedia.org/r/920365

Icinga downtime and Alertmanager silence (ID=edd7680d-1c73-4a29-8687-f2061fd84b57) set by bking@cumin1001 for 12 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2012.codfw.wmnet

Change 920365 merged by Ryan Kemper:

[operations/puppet@production] query_service: Permit python2 on bullseye

https://gerrit.wikimedia.org/r/920365

Icinga downtime and Alertmanager silence (ID=389b7357-bed5-4b2f-8790-8d67f9ff7609) set by bking@cumin1001 for 20 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2021.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=6de74cbd-41f5-48dd-9b28-0f5e924b361a) set by bking@cumin1001 for 20 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2012.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=e631aacc-d9c4-4fd4-a12f-2ac8dc01ccf2) set by bking@cumin1001 for 20 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs1016.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=e90a333a-c22a-4294-bcc7-6a7665a57f08) set by bking@cumin1001 for 20 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs1016.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=60cdd970-8441-43c9-b77b-84c7443210ea) set by bking@cumin1001 for 20 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2012.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=daffcf81-a026-4a44-bbd6-5cb2dda2365c) set by bking@cumin1001 for 20 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2012.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=72d89466-5f1c-4a18-8dd7-27b6fb931b75) set by bking@cumin1001 for 20 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2021.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=f835fbe1-5ecd-477d-9755-a6556d5b9287) set by bking@cumin1001 for 20 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2021.codfw.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2021 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS bullseye

Icinga downtime and Alertmanager silence (ID=f4cd972b-d9a2-4ac2-866f-e100023e5f8d) set by bking@cumin1001 for 20 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2022.codfw.wmnet

We're still having problems with our first Bullseye host, wdqs2022. After successfully* transferring the wdqs data via the data-transfer cookbook, the wdqs-categories and wdqs-blazegraph services start, but wdqs-updater.service fails. The unit file calls a bash script with arguments as follows:

/bin/bash /srv/deployment/wdqs/wdqs/runStreamingUpdater.sh -n wdq -- --brokers kafka-main2001.codfw.wmnet:9092,kafka-main2002.codfw.wmnet:9092,kafka-main2003.codfw.wmnet:9092,kafka-main2004.codfw.wmnet:9092,kafka-main2005.codfw.wmnet:9092 --consumerGroup wdqs2022 --topic codfw.rdf-streaming-updater.mutation --batchSize 250

I ran this command manually and captured a stacktrace here. Running the test.sh script from the rdf repo also returns a 503, suggesting that the data transferred via the cookbook might be corrupt.

*"successfully," as in, "the cookbook ran without errors"

Looking at P49427, it seems that there is an issue with the logging configuration. This should not prevent the system from starting, but might affect logging.

02:05:41,152 |-ERROR in ch.qos.logback.core.model.processor.ImplicitModelHandler - Could not create component [filter] of type [org.wikidata.query.rdf.common.log.PerLoggerThrottler] java.lang.ClassNotFoundException: org.wikidata.query.rdf.common.log.PerLoggerThrottler

The message above indicates that some of the logging configuration references classes that should be available in the main Blazegraph binaries, but not in the updater (we need some additional filtering of logs in Blazegraph, as it sometimes gets too verbose, but that filtering isn't needed in the updater).

Looking at the Blazegraph logs (/var/log/wdqs/wdqs-blazegraph.log), there seems to be an issue with file permissions (or file existence?) on /srv/wdqs/wikidata.jnl:

01:33:14.910 [main] WARN  o.eclipse.jetty.webapp.WebAppContext - Failed startup of context o.e.j.w.WebAppContext@5d908d47{Bigdata,/bigdata,file:///tmp/jetty-localhost-9999-blazegraph-service-0.3.124.war-_bigdata-any-5007834296220894734.dir/webapp/,UNAVAILABLE}{file:///srv/deployment/wdqs/wdqs-cache/revs/41174d50f967ef9bf3e3d956059c075790561ed0/blazegraph-service-0.3.124.war} 
java.io.FileNotFoundException: /srv/wdqs/wikidata.jnl (Permission denied)
        at java.io.RandomAccessFile.open0(Native Method)
Wrapped by: java.lang.RuntimeException: file=/srv/wdqs/wikidata.jnl
        at com.bigdata.journal.FileMetadata.<init>(FileMetadata.java:1144)
Wrapped by: java.lang.RuntimeException: java.lang.RuntimeException: file=/srv/wdqs/wikidata.jnl
        at com.bigdata.rdf.sail.webapp.BigdataRDFServletContextListener.openIndexManager(BigdataRDFServletContextListener.java:816)
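A quick diagnosis sketch for this symptom (perms_of is a hypothetical helper; compare its output against the User=/Group= settings in the wdqs-blazegraph unit file):

```shell
# Report owner:group and mode of the Blazegraph journal; prints "missing"
# if the file does not exist at all.
perms_of() { stat -c '%U:%G %a' "$1" 2>/dev/null || echo missing; }
perms_of /srv/wdqs/wikidata.jnl
```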

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS bullseye executed with errors:

  • wdqs2021 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306140150_bking_1429183_wdqs2021.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

Icinga downtime and Alertmanager silence (ID=583ec76e-3af3-444b-8dcc-25a81545f149) set by bking@cumin1001 for 20 days, 0:00:00 on 1 host(s) and their services with reason: attempting WDQS stack on bullseye

wdqs2021.codfw.wmnet

Leaving some notes before I step out for the weekend and forget everything. The Bullseye hosts are still not coming up without manual intervention beyond the data-transfer.

  • Puppet is not installing git-fat (required for deployment), but it's not removing it after a manual install, either. My best guess at this point is that when Puppet chokes on the prometheus exporters, it doesn't finish installing its packages. A weak theory, but the best one I have at the moment.
  • Scap deploys targeting a single host (tested with wdqs2020) succeed, but the service can't start. Manually invoking /bin/bash /srv/deployment/wdqs/wdqs/runBlazegraph.sh -f /etc/wdqs/RWStore.categories.properties shows Could not find or load main class org.eclipse.jetty.runner.Runner, and the /srv/deployment/wdqs directory is too small. Manually deleting the entire contents of /srv/deployment/wdqs/ and re-deploying via scap seems to fix the issue.
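The "deploy directory is too small" check above can be sketched like this (dir_kb is a hypothetical helper; the path is the one from this task):

```shell
# Print a directory tree's size in KiB (0 if the path is missing); a
# suspiciously small value suggests a truncated scap deploy.
dir_kb() { { du -sk "$1" 2>/dev/null || echo "0 -"; } | awk '{ print $1 + 0; exit }'; }
dir_kb /srv/deployment/wdqs
# If too small: clear the contents of /srv/deployment/wdqs/ and re-deploy via scap
```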

The manual steps (see above) need to be documented on wiki before we close this task.

bking renamed this task from Ensure WCQS/WDQS stack works on Bullseye to Ensure WDQS stack works on Bullseye. Jul 25 2023, 9:02 PM

Documentation has been updated, but there's an important piece missing: we never verified WCQS. Creating a separate ticket for that issue.

Mentioned in SAL (#wikimedia-operations) [2023-08-08T19:28:02Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wcqs[1001-1003].eqiad.wmnet with reason: T331300

Mentioned in SAL (#wikimedia-operations) [2023-08-08T19:28:18Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wcqs[1001-1003].eqiad.wmnet with reason: T331300

Change 947928 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] query_service: install git-fat

https://gerrit.wikimedia.org/r/947928

Change 947928 merged by Bking:

[operations/puppet@production] query_service: install git-fat

https://gerrit.wikimedia.org/r/947928

Change 947930 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] wdqs.data-transfer: ensure data_loaded file is created

https://gerrit.wikimedia.org/r/947930

Per pairing discussion with Ryan, we believe this work is complete. The actual migration work continues in T343124.