
Examine/refactor WDQS startup scripts
Open, Low, Public

Description

While investigating WDQS categories-related service failures in T342060, we noticed that the categories update scripts use a lot of indirection, which complicates troubleshooting.
Creating this ticket to:

  • Document the "what, why, and how" of the current setup
  • Identify and take opportunities to make the scripts simpler and more reliable
  • Ensure the scripts and their related unit files can be started by Puppet. A look at Icinga shows that load-dcatap-weekly.service is failing on all newly provisioned hosts, and running Puppet does not fix this.
  • Avoid making changes that will break non-WMF installations of WDQS.

Event Timeline

bking renamed this task from Examine/refactor WDQS categories update scripts to Examine/refactor WDQS startup scripts. Aug 15 2023, 9:42 PM

This issue came up again when working on T343856. Broadening the scope of this ticket to include all WDQS startup scripts, not just categories.

Change 949503 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] [WIP] Start Blazegraph from systemd unit, without runBlazegraph.sh

https://gerrit.wikimedia.org/r/949503

Change 949503 merged by Bking:

[operations/puppet@production] Start Blazegraph from systemd unit, without runBlazegraph.sh

https://gerrit.wikimedia.org/r/949503

Mentioned in SAL (#wikimedia-operations) [2023-08-17T14:21:43Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs2007.codfw.wmnet with reason: canary for T342361

Mentioned in SAL (#wikimedia-operations) [2023-08-17T14:21:56Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs2007.codfw.wmnet with reason: canary for T342361

Change 950027 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] query_service: fix glob expansion in blazegraph systemd unit

https://gerrit.wikimedia.org/r/950027

Change 950027 merged by Gehel:

[operations/puppet@production] query_service: fix glob expansion in blazegraph systemd unit

https://gerrit.wikimedia.org/r/950027

Some observations from the last two patches, tested on wdqs2007 before reverting due to issues:

  • Discrepancy between the locations of the wdqs-blazegraph unit files:
      wdqs2007 (bullseye, wdqs-public, running patch): /etc/systemd/system/wdqs-blazegraph.service
      wdqs2008 (bullseye, wdqs-internal, not running patch): /lib/systemd/system/wdqs-blazegraph.service
  • Weirdness with %p - %t:

(from /lib/systemd/system/wdqs-categories.service on wdqs2007)

ExecStart=/usr/bin/sh -c '/usr/bin/java -server \
    -XX:+UseG1GC \
    -Xmx8g \
    -Xloggc:/var/log/wdqs/wdqs-categories_jvm_gc.%p-%t.log \
...

versus

(from /etc/systemd/system/wdqs-blazegraph.service on wdqs2007)

ExecStart=/usr/bin/sh -c '/usr/bin/java -server \
    -XX:+UseG1GC \
    -Xmx31g \
    -Xloggc:/var/log/wdqs/wdqs-blazegraph_jvm_gc.test.log \
...

^^ Above changes were one-offs I made to troubleshoot. Sorry for the confusion!
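For anyone comparing the two ExecStart excerpts above: in a unit file, systemd itself expands %p (the unit's prefix name) and %t (the runtime directory root, /run for system units), and %% is the escape for a literal percent sign. The JVM separately treats %p and %t in the -Xloggc file name as pid and timestamp. The sketch below is a toy re-implementation of just those three specifiers, to show what each spelling would hand to the JVM. The function name expand_specifiers is hypothetical and this is not how systemd is implemented.

```sh
#!/bin/sh
# Toy model of systemd specifier expansion for %%, %p and %t only.
# expand_specifiers is a hypothetical helper, not a systemd tool.
#   $1: the string as written in the unit file
#   $2: the unit prefix (unit name without the ".service" suffix)
expand_specifiers() {
    printf '%s\n' "$1" | awk -v prefix="$2" '{
        out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "%" && i < n) {
                nxt = substr($0, i + 1, 1)
                if (nxt == "%")      { out = out "%";    i++ }  # escaped percent
                else if (nxt == "p") { out = out prefix; i++ }  # unit prefix name
                else if (nxt == "t") { out = out "/run"; i++ }  # runtime dir (system units)
                else                 { out = out c }
            } else {
                out = out c
            }
        }
        print out
    }'
}

# Unescaped, as in the wdqs-categories excerpt: systemd consumes %p and %t.
unescaped=$(expand_specifiers 'wdqs-categories_jvm_gc.%p-%t.log' wdqs-categories)

# Escaped with %%: the JVM receives %p-%t and can apply its own expansion.
escaped=$(expand_specifiers 'wdqs-blazegraph_jvm_gc.%%p-%%t.log' wdqs-blazegraph)

echo "unescaped: $unescaped"
echo "escaped:   $escaped"
```

Under these assumptions, only the %% spelling leaves %p-%t intact for the JVM's pid/timestamp substitution.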

Change 950136 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] Start Blazegraph from systemd unit, without runBlazegraph.sh

https://gerrit.wikimedia.org/r/950136

Change 950136 merged by Bking:

[operations/puppet@production] Start Blazegraph from systemd unit, without runBlazegraph.sh

https://gerrit.wikimedia.org/r/950136

Mentioned in SAL (#wikimedia-operations) [2023-08-31T20:16:24Z] <inflatador> 'bking@wdqs1004 depool wdqs1004 to test script changes T342361'

Unfortunately, the patch had to be rolled back. The error we received was:

Aug 31 20:18:34 wdqs1004 wdqs-blazegraph[1142014]: Error: Could not find or load main class org.eclipse.jetty.runner.Runner

Change 955832 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] Start Blazegraph from systemd unit, without runBlazegraph.sh

https://gerrit.wikimedia.org/r/955832

Change 955832 merged by Gehel:

[operations/puppet@production] Start Blazegraph from systemd unit, without runBlazegraph.sh

https://gerrit.wikimedia.org/r/955832

Mentioned in SAL (#wikimedia-operations) [2023-09-07T18:49:58Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1010.eqiad.wmnet with reason: T342361

Mentioned in SAL (#wikimedia-operations) [2023-09-07T18:50:23Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1010.eqiad.wmnet with reason: T342361

The generated unit file contains jar/war paths with globs. We expected these to be expanded by sh -c '...', but that does not seem to be happening.

[Unit]
Description=Query Service - Blazegraph - wdqs-blazegraph

[Service]
Type=simple
WorkingDirectory=/srv/deployment/wdqs/wdqs
ExecStart=/usr/bin/sh -c '/usr/bin/java -server \
    -XX:+UseG1GC \
    -Xmx31g \
    -XX:+UnlockExperimentalVMOptions \
    -XX:+UseNUMA \
    -XX:G1NewSizePercent=20 \
    -XX:+ParallelRefProcEnabled \
    -Xloggc:/var/log/wdqs/wdqs-blazegraph_jvm_gc.%%p-%%t.log \
    -XX:+PrintGCDetails \
    -XX:+PrintGCDateStamps \
    -XX:+PrintGCTimeStamps \
    -XX:+PrintAdaptiveSizePolicy \
    -XX:+PrintReferenceGC \
    -XX:+PrintGCCause \
    -XX:+PrintGCApplicationStoppedTime \
    -XX:+PrintTenuringDistribution \
    -XX:+UseGCLogFileRotation \
    -XX:NumberOfGCLogFiles=10 \
    -XX:GCLogFileSize=20M \
    -Dlogback.configurationFile=/etc/wdqs/logback-wdqs-blazegraph.xml \
    -Dhttp.proxyHost=webproxy.eqiad.wmnet \
    -Dhttp.proxyPort=8080 \
    -XX:+ExitOnOutOfMemoryError \
    -DwikibaseSomeValueMode=skolem \
    -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=9102:/etc/wdqs/wdqs-blazegraph-prometheus-jmx.yaml \
    -Dwdqs.jwt-identity-filter.jwt-identity-cookie-name=wcqsSession \
    -Dwdqs.jwt-identity-filter.jwt-identity-claim=username \
    -DwikibaseServiceWhitelist=/etc/wdqs/allowlist-wdqs-blazegraph.txt \
    -Dcom.bigdata.rdf.sail.webapp.ConfigParams.propertyFile=/etc/wdqs/RWStore.wikidata.properties \
    -Dorg.eclipse.jetty.server.Request.maxFormContentSize=200000000 \
    -Dcom.bigdata.rdf.sparql.ast.QueryHints.analytic=true \
    -Dcom.bigdata.rdf.sparql.ast.QueryHints.analyticMaxMemoryPerQuery=1073741824 \
    -DASTOptimizerClass=org.wikidata.query.rdf.blazegraph.WikibaseOptimizers \
    -Dorg.wikidata.query.rdf.blazegraph.inline.literal.WKTSerializer.noGlobe=2 \
    -Dcom.bigdata.rdf.sail.webapp.client.RemoteRepository.maxRequestURLLength=7168 \
    -Dcom.bigdata.rdf.sail.sparql.PrefixDeclProcessor.additionalDeclsFile=/srv/deployment/wdqs/wdqs/prefixes.conf \
    -Dorg.wikidata.query.rdf.blazegraph.mwapi.MWApiServiceFactory.config=/srv/deployment/wdqs/wdqs/mwservices.json \
    -Dcom.bigdata.rdf.sail.webapp.client.HttpClientConfigurator=org.wikidata.query.rdf.blazegraph.ProxiedHttpConnectionFactory \
    -DblazegraphDefaultNamespace=wdq \
    -Dhttp.userAgent="Wikidata Query Service (test); https://query.wikidata.org/" \
    -Dorg.eclipse.jetty.annotations.AnnotationParser.LEVEL=OFF \
    -cp /srv/deployment/wdqs/wdqs/jetty-runner*.jar:/srv/deployment/wdqs/wdqs/lib/logging/* \
    org.eclipse.jetty.runner.Runner \
    --host localhost \
    --port 9999 \
    --path /bigdata \
    /srv/deployment/wdqs/wdqs/blazegraph-service-*.war'

User=blazegraph
StandardOutput=journal+console
Restart=always
SyslogIdentifier=%N

TasksMax=10000

PrivateDevices=yes
ProtectSystem=full
ProtectHome=yes
NoNewPrivileges=yes
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6

ReadOnlyDirectories=/
# data storage
ReadWriteDirectories=/srv/wdqs
# logs
ReadWriteDirectories=/var/log/wdqs
# already protected by PrivateTmp
ReadWriteDirectories=/tmp /var/tmp

[Install]
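One possible explanation for the glob behaviour, not confirmed in this task: the shell only expands a glob against a whole word. The -cp argument is a single colon-joined word, so the shell looks for a path matching the entire string, finds nothing, and passes it through unchanged. The JVM then only understands a trailing /* as a classpath wildcard, not a pattern like jetty-runner*.jar, which would be consistent with the "Could not find or load main class" error. The sketch below reproduces this in a scratch directory. The directory and jar names are made up for the demo.

```sh
#!/bin/sh
# Sketch only: a scratch directory stands in for /srv/deployment/wdqs/wdqs,
# and the jar names below are invented for the demo.
demo=$(mktemp -d)
mkdir -p "$demo/lib/logging"
touch "$demo/jetty-runner-1.0.jar" "$demo/lib/logging/logback.jar"

# A glob that is a word on its own is expanded by the inner shell.
alone=$(sh -c "echo $demo/jetty-runner*.jar")

# Joined with ':' the classpath is ONE word, hence ONE pattern. No path
# matches the whole string, so the shell leaves it untouched.
joined=$(sh -c "echo $demo/jetty-runner*.jar:$demo/lib/logging/*")

# Possible workaround: resolve the jar first, then build the classpath.
# The trailing '/*' is kept literal for the JVM, which expands that form.
# If several jars match, $1 is only the first one.
set -- "$demo"/jetty-runner*.jar
classpath="$1:$demo/lib/logging/*"

echo "alone:     $alone"
echo "joined:    $joined"
echo "classpath: $classpath"

rm -rf "$demo"
```

If this is the cause, the standalone .war argument at the end of ExecStart would still be expanded by sh -c, and only the -cp value would need to be resolved before java is invoked.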

Mentioned in SAL (#wikimedia-operations) [2023-09-11T09:24:20Z] <gehel@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: T342361 - testing blazegraph startup script refactor

Mentioned in SAL (#wikimedia-operations) [2023-09-11T09:24:33Z] <gehel@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: T342361 - testing blazegraph startup script refactor

Change 956432 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] Start Blazegraph from systemd unit, without runBlazegraph.sh

https://gerrit.wikimedia.org/r/956432

Gehel triaged this task as Low priority. Nov 3 2023, 10:27 AM