User Details
- User Since
- May 1 2020, 10:28 PM (49 w, 17 h)
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- RKemper (WMF) [ Global Accounts ]
Thu, Mar 25
wdqs1009, wdqs1010, and wdqs2008 are done, so we need to data-transfer to the remaining instances.
Wed, Mar 24
Mon, Mar 22
Fri, Mar 19
I forgot to try running a curl command from inside the analytics network *before* deploying the new cert, so I don't have a good before/after comparison, but curling relforge from within the analytics network hangs indefinitely, which is a good sign (it should reject the cert and return immediately if it's still broken).
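For future checks, one way to make the hang-versus-reject distinction explicit is to bound the request with a timeout and look at curl's exit code. This is a minimal sketch that relies only on the exit codes documented in the curl man page (7, 28, 35, 60). The function name and the `PORT` placeholder in the usage comment are illustrative assumptions, not part of the actual setup:

```shell
# Map a curl exit code to a short label (codes as documented in curl(1)).
classify_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    7)  echo "connect-failed" ;;        # could not connect to host
    28) echo "timeout" ;;               # --max-time reached (the "hang" case)
    35) echo "tls-handshake-failed" ;;  # SSL connect error
    60) echo "cert-rejected" ;;         # peer certificate could not be verified
    *)  echo "other($1)" ;;
  esac
}

# Usage (not executed here; PORT is a placeholder):
#   curl --silent --output /dev/null --max-time 10 https://relforge.svc.eqiad.wmnet:PORT/
#   classify_curl_exit "$?"
```

With a bounded timeout, a run before and after a cert deploy would yield two labels that can be compared directly.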
After creating a new manifest and running the cert gen command, we need to copy the newly generated secret key in decrypted form to another location in the /srv/private repo. Then we chown all the new files to make sure they're owned by gitpuppet (it's possible there's a git commit hook that does this for me but I didn't see one so I have just been playing it safe). Finally, we need to copy over the pubkey to the operations/puppet repo.
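The follow-up steps described above can be sketched as a dry run that only prints the commands. Everything here is an assumption for illustration: the function name is hypothetical, and all four paths are placeholders supplied by the caller, since the actual source and destination locations are not spelled out in this note. Only the `gitpuppet` owner comes from the description itself.

```shell
# Dry run: print the follow-up commands instead of executing them.
# Arguments (all placeholders):
#   $1 decrypted secret key produced by the cert gen step
#   $2 second location for that key inside the private repo
#   $3 generated public key
#   $4 destination for the public key in the operations/puppet checkout
plan_cert_followup() {
  key_src="$1"; key_dst="$2"; pub_src="$3"; pub_dst="$4"
  printf 'cp -- %s %s\n' "$key_src" "$key_dst"
  printf 'chown gitpuppet -- %s %s\n' "$key_src" "$key_dst"
  printf 'cp -- %s %s\n' "$pub_src" "$pub_dst"
}
```

Printing rather than executing keeps the sketch safe to run anywhere; the output can be reviewed and then run by hand with the appropriate privileges.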
ryankemper@puppetmaster1001:/srv/private$ sudo cergen -c 'relforge.*' --generate --base-path /srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d
2021-03-19 02:55:31,498 INFO cergen Generating certificates ['relforge.svc.eqiad.wmnet'] with force=False
2021-03-19 02:55:31,498 INFO Certificate(relforge.svc.eqiad.wmnet) Generating all files, force=False...
2021-03-19 02:55:31,500 INFO Certificate(relforge.svc.eqiad.wmnet) Generating certificate file
/usr/lib/python3/dist-packages/urllib3/connection.py:362: SubjectAltNameWarning: Certificate for puppetmaster1001.eqiad.wmnet has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python3/dist-packages/urllib3/connection.py:362: SubjectAltNameWarning: Certificate for puppetmaster1001.eqiad.wmnet has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python3/dist-packages/urllib3/connection.py:362: SubjectAltNameWarning: Certificate for puppetmaster1001.eqiad.wmnet has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
2021-03-19 02:55:33,004 INFO Certificate(relforge.svc.eqiad.wmnet) Generating CA certificate file
2021-03-19 02:55:33,005 INFO Certificate(relforge.svc.eqiad.wmnet) Generating PKCS12 keystore file
2021-03-19 02:55:33,285 INFO Certificate(relforge.svc.eqiad.wmnet) Generating Java keystore file
2021-03-19 02:55:34,365 INFO Certificate(relforge.svc.eqiad.wmnet) Importing PuppetCA(puppetmaster1001.eqiad.wmnet_8140) cert into Java keystore
2021-03-19 02:55:35,406 INFO Certificate(relforge.svc.eqiad.wmnet) Generating Java truststore file with CA certificate PuppetCA(puppetmaster1001.eqiad.wmnet_8140)
New cergen-based manifest (`modules/secret/secrets/certificates/certificate.manifests.d/relforge.certs.yaml`) to generate `relforge.svc.eqiad.wmnet`:
Thu, Mar 18
Sat, Mar 13
DNS change logs
ryankemper@authdns1001:~$ sudo authdns-update
Updating authdns1001.wikimedia.org (self)...
Pulling the current revision from https://gerrit.wikimedia.org/r/operations/dns.git
Reviewing 85d9b49dc2ff0f8e3657f6f2cd91ce3df79bd1cf...
The issues with envoy were resolved by running sudo /usr/local/sbin/build-envoy-config -c /etc/envoy to properly build /etc/envoy/envoy.yaml. That should have been done by puppet already, triggered upon a sudo systemctl restart envoyproxy.service, but it didn't - perhaps a race condition. See https://gerrit.wikimedia.org/g/operations/puppet/+/b7dacbca9fae42b32bb91fd485a3f2c70ff903b3/modules/envoyproxy/manifests/init.pp#81 and https://gerrit.wikimedia.org/g/operations/puppet/+/b7dacbca9fae42b32bb91fd485a3f2c70ff903b3/modules/envoyproxy/manifests/conf.pp#30 for the puppet code that normally does it automatically.
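A guard like the following could catch the same situation before restarting the service. It is only a sketch under a loud assumption: that the broken state shows up as a missing or empty `/etc/envoy/envoy.yaml`. The note above does not say what the file actually looked like, so a stale but non-empty config would not be detected. The function name is hypothetical; the build command in the usage comment is the one quoted above.

```shell
# Return success (0) when the given config file is missing or empty.
# ASSUMPTION: "not built" means missing/empty; a stale file is not detected.
envoy_config_needs_build() {
  [ ! -s "$1" ]
}

# Usage (not executed here):
#   if envoy_config_needs_build /etc/envoy/envoy.yaml; then
#     sudo /usr/local/sbin/build-envoy-config -c /etc/envoy
#   fi
```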
Fri, Mar 12
I missed a step yesterday: I'd updated /srv/private as well as the public labs/private repo but missed the step for updating operations/puppet with the new pubkey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/671267
Current status for when I pick this back up:
Ah, so poking around the certificate.manifests.d directory, I see certs that don't necessarily follow the discovery.wmnet pattern. To me that implies Option 2 should be working, so I might be missing something. Here's an example that doesn't use discovery:
ryankemper@puppetmaster1001:/srv/private$ cat modules/secret/secrets/certificates/certificate.manifests.d/analytics_http_ui.certs.yaml
yarn.wikimedia.org:
  authority: puppet_ca
  expiry: null
  alt_names: ["yarn.wikimedia.org", "hue.wikimedia.org", "hue-next.wikimedia.org", "superset.wikimedia.org", "pivot.wikimedia.org", "turnilo.wikimedia.org", "stats.wikimedia.org", "analytics.wikimedia.org", "piwik.wikimedia.org", "datasets.wikimedia.org"]
  key:
    password: REDACTED
    algorithm: ec
Option 2 fails to even generate the cert. All the cergen documentation is written for a certificate like query-preview.discovery.wmnet and not wdqs1009.eqiad.wmnet or query-preview.wikidata.org. So I do think this just isn't what cergen is built to do.
Hit a big blocker with the current proposed approach of using wdqs1009.eqiad.wmnet as the cert name:
Mar 10 2021
Finished rolling back to the previous iteration of wdqs.discovery.wmnet cert since we're now going to create a net-new cert wdqs1009.eqiad.wmnet for wdqs-test
ryankemper@puppetmaster1001:/srv/private$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
Mar 8 2021
Mar 5 2021
Here's how the ats mapping looks after deploy of the backend.yaml changes:
Mar 4 2021
Posting logs of our IRC convo from ~1 month ago for context when I tag people for review:
Mar 3 2021
PLUGIN BUILD & UPLOAD STEPS PERFORMED:
# Starting from plugins repo
# (1) Build locally and scp over to build host
./debian/rules prepare_build
cd ..
ssh 'deneb.codfw.wmnet' 'sudo rm -rfv ~/plugins'
Note: The initial build/upload was broken due to operator error, so we built/uploaded the new 6.5.4-6, which has been confirmed to work.
Mar 2 2021
@TJones Thanks, I'll tap in David or Zbyszko to see if they can find the error.
Feb 26 2021
Side note: Just noticed I named the tmux session elastic1065. Fortunately as can be seen above we're reimaging the proper host, elastic2045 :P
Note: Puppet is still disabled on wdqs2008 while the reload runs. It occurred to me that I'm not sure if puppet actually needs to be disabled during data reloads or if that's just a precaution we've historically taken - any insight here @Gehel?
@Cmjohnson The data reload is complete on wdqs1009, so the host can now have its firmware upgraded and be rebooted at its convenience. Note this is an internal wdqs test host, so there is no public-facing service for us to worry about.
Feb 25 2021
Downtimed wdqs2008 until 2021-03-04 21:56:59
I started doing restarts in eqiad, but hit a show-stopper: any node with the new plugin version had its elasticsearch systemd units stuck in a failure state that persisted across restarts. The most suspicious log-line by far is java.nio.file.AccessDeniedException: /var/run/elasticsearch:
Feb 24 2021
Now that the new debian package is built & uploaded, we can proceed to the actual roll-out (https://phabricator.wikimedia.org/T274204) when ready
The new debian package has been built and uploaded.
Commands used to unban elastic1063:
Closed because this is (somewhat) redundant with T267927; will track in that ticket
@Gehel Yup I can get elastic2045 re-imaged and unbanned once we get sda replaced
Feb 23 2021
Feb 19 2021
Feb 10 2021
From
ryankemper@relforge1004:~$ sudo systemctl status kibana.service
● kibana.service - Kibana
   Loaded: loaded (/etc/systemd/system/kibana.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2021-02-10 00:38:09 UTC; 2min 41s ago
  Process: 1040 ExecStart=/usr/share/kibana/bin/kibana -c /etc/kibana/kibana.yml (code=exited, status=64)
 Main PID: 1040 (code=exited, status=64)
Closing the loop on the above, it looks like newsfeed.enabled exists in Elasticsearch 7 but not in Elasticsearch 6.
Moving back to in-progress - I'd thought that all systemd units were working properly now, but kibana.service is still failing on relforge100[3,4].
Feb 9 2021
Updated netbox entries to mark the servers as active.
Above issue is resolved; our order of operations is a bit flawed and will result in puppet trying to install packages such as elasticsearch-oss before it can "see" the package (presumably because an apt-get update had not been run in time). Issues self-healed in ~30 minutes; I manually restarted the failing services once the state in puppet-land had resolved itself.
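Until the ordering is fixed, a small retry wrapper is one way to wait for a package to become visible instead of waiting for things to self-heal. This is a generic sketch, not what puppet does: the `retry` function is a hypothetical helper, and the attempt count and delay in the usage comment are arbitrary.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage (not executed here): wait until apt can see the package,
# then restart whatever failed.
#   retry 10 60 apt-cache show elasticsearch-oss
```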
Prometheus exporters are having trouble:
See https://sal.toolforge.org/log/A5m9h3cBgTbpqNOmqYik for timing of reboot. Also see https://phabricator.wikimedia.org/T274270 for related ticket that came out of this (reboot took super long)
Feb 8 2021
Still waiting for the latest dumps to be downloaded (few more hours), then need to reboot WDQS hosts as part of https://phabricator.wikimedia.org/T274213, then can do the actual data-reload
(See https://phabricator.wikimedia.org/T273097#6805355 for why this ticket has been closed)
- https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service#Updates
- https://commons.wikimedia.org/wiki/Commons:Village_pump#Unscheduled_maintenance%3A_Wikimedia_Commons_Query_Service
- WikiData mailing list (note: a community member pointed out that sending an e-mail to a Commons mailing list would have made more sense, which is a great point. We'll do that in the future)
WCQS is back in service; updating the notification channels right now and will comment back here after
Feb 5 2021
sudo cookbook sre.wdqs.data-reload wdqs1009.eqiad.wmnet --reuse-downloaded-dump --reload-data wikidata --skolemize --reason 'T267927: Reload wikidata jnl from fresh dumps' --task-id T267927 is failing with:
Feb 4 2021
TODO from IRC meeting with bblack/gehel: create a DNS entry (CNAME to dyna.wm.o), another set of entries in backend.yaml map, create another minisite (with the appropriate configuration)
Feb 3 2021
Notified WikiData mailing list and also posted here: https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service#Updates
wcqs-beta-01.eqiad.wmflabs is running low on disk space due to its blazegraph journal dataset size. In order to free up space we will need to take the service down, delete the journal and re-import from the latest dump. Service interruption will begin at Feb 4 18:30 UTC and continue until the data reload is complete.
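To spot this earlier next time, the usage percentage can be pulled out of `df -P` output and compared against a threshold. A minimal sketch: the function name, the mount point and the 90% threshold in the usage comment are all illustrative assumptions, since the actual journal location and alerting limits are not stated here.

```shell
# Read `df -P` output on stdin and print the use percentage of the
# first filesystem listed (column 5 of the POSIX output format), without "%".
use_percent_from_df() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage (not executed here; MOUNTPOINT and 90 are placeholders):
#   pct=$(df -P MOUNTPOINT | use_percent_from_df)
#   [ "$pct" -ge 90 ] && echo "disk usage at ${pct}%"
```

Parsing stdin rather than calling `df` inside the function keeps the helper easy to check against canned output.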
@akosiaris Is your concern with the idea of using a `flink` base image solution mainly just centered around the inefficiency/inconvenience of needing SRE to merge any flink version upgrades? Since we have an embedded SRE on search (me) and to a lesser extent Guillaume, I think it wouldn't be too much of a problem. In general having our dependencies managed by a docker image will make it easier for us to be explicit about what version we're using, and it seems like the default docker-y way of doing things. Is there a technical reason why a base image might not be a good idea?
We'll want to reload these this Friday, because the latest dumps should be available Thursday evening.
Feb 1 2021
Jan 28 2021
@Jclark-ctr In addition to Erik's point above about dmidecode being installed, we just deployed a patch to install edac-util on all Elasticsearch systems (this includes logstash*, cloudelastic* btw). So edac-util is now available for use
Jan 27 2021
Barring any further issues cropping up, this is done.
Jan 26 2021
Since resolving this monitoring issue is one of our highest priorities, here's a handoff for Tues Jan 26 so that Europe can make headway:
Finished generating new cert. Here's a (password-redacted) log of the changes made:
Jan 23 2021
I've downtimed the WDQS sparql alerts until next week.