- remove unused secrets from kubernetes.yaml on private puppet
I've created another 24-hour silence for this alert (UUID 59b5ca30-1aeb-4d06-b083-7023a373ccb3).
Mon, Nov 27
The probe is getting a 500 error, which is spawning Phab tickets for the serviceops-collab team (see T352084). As such, I've set a 24-hour suppression in Alertmanager (UUID fc02d897-8a64-4ebb-a362-77a765a7f155). Will revisit once I'm back at work tomorrow.
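For reference, silences like this can also be created from the CLI with amtool instead of the Alertmanager UI. A minimal sketch, assuming amtool is installed and pointed at the right Alertmanager; the matcher label below is a placeholder, not the real alert name:

    # create a 24h silence for the failing probe (matcher is a placeholder)
    amtool silence add \
      --alertmanager.url=http://localhost:9093 \
      --author="$USER" \
      --duration=24h \
      --comment='T352084: probe returning 500' \
      alertname=ProbeDown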
Looks like the check targets are rendered at /srv/prometheus/ops/targets/probes-custom_puppet-http.yaml on the prom hosts.
After merging the above patch, the target config for the LDF endpoint looks like this:
    - labels:
        address: 10.64.132.7
        family: ip4
        module: http_query_wikidata_org_ldf_ip4
      targets:
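For anyone else tracing this: a file_sd target list like the above is typically consumed by a blackbox-exporter-style scrape job, roughly as sketched below. This is the generic upstream wiring, not our actual config; the job name and exporter address are assumptions.

    scrape_configs:
      - job_name: probes/custom
        metrics_path: /probe
        file_sd_configs:
          - files:
              - /srv/prometheus/ops/targets/probes-custom_puppet-http.yaml
        relabel_configs:
          # hand the probe module and target to the exporter as URL params
          - source_labels: [module]
            target_label: __param_module
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          # every probe is actually scraped via the blackbox exporter (assumed address)
          - target_label: __address__
            replacement: 127.0.0.1:9115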
Sorry for the alert spam! We've fixed the Puppet failure, so I'll close this one out. Have a great rest of your week!
Just wanted to add that Envoy is deployed for Swift frontends per today's SRE meeting.
Looks like the data reload for lexemes completed. @dcausse, are you able to check the data from the reload and make sure it's usable? Let me know if I can help.
Mon, Nov 20
Let's leave wdqs1021 out for now, as we need it for performance testing in T351662
Thanks @Addshore, this is a wealth of great info!
Reopening per today's IRC conversation. We really need this process to be faster, so we'll try enabling the performance governor and see what happens.
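For the record, flipping the governor is straightforward on hosts exposing the standard Linux cpufreq interface. A sketch, assuming cpupower is installed:

    # check the current governor on each core
    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

    # switch all cores to the performance governor
    sudo cpupower frequency-set -g performance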
Update: The Wikidata dump finished on wdqs1022 (Wikidata dump loaded in 25 days, 13:32:17.263762).
Fri, Nov 17
Per today's IRC discussion in the security channel, @CDanis mentioned detuning or removing the LVS alerts for internal hosts, so I'll set this one to blocked for the moment. Chris and/or Brett, let us know what your teams decide.
@jbond Sorry for the confusion, I associated the reimage with the wrong ticket. The output of the last reimage is here. Puppet was disabled because the hosts were previously set to their production role, but due to the PKI errors we put them back to insetup. I should have paid more attention... it seems the reimage never actually wiped the disks, whereas I had assumed it failed on later steps.
Thu, Nov 16
Confirmed working, closing...
Wed, Nov 15
These VMs have been fully deleted/decommissioned. Closing...
Not sure what happened, but the cloudelastic1008-1010 hosts are up after a reimage. I had to manually powercycle the DRAC, log in to its console, and force-enable/run Puppet.
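Roughly what the recovery looked like, for future reference. A sketch only: the mgmt hostname is a placeholder, and it assumes Dell's racadm tooling for the DRAC reset:

    # reset the DRAC out of band (credentials handled separately)
    racadm -r cloudelastic1008.mgmt.eqiad.wmnet -u root racreset

    # then, from the host's console, re-enable and run the agent
    sudo puppet agent --enable
    sudo puppet agent --test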
Apologies for the reimage spam, it's from an unrelated operation.
Reopening as cloudelastic1008-1010 don't appear to have reimaged properly, and we may need them for T350826.
Tue, Nov 14
team-sre/probes.yaml in the alerts repo looks like a good place to start.
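For anyone unfamiliar with that repo: the team-scoped files hold standard Prometheus alerting rules keyed off probe_success. A generic sketch of the shape only; the alert name, threshold, and labels here are assumptions, not the file's real contents:

    groups:
      - name: probes
        rules:
          - alert: ProbeDown
            expr: probe_success{job=~"probes/.*"} == 0
            for: 5m
            labels:
              team: sre
              severity: critical
            annotations:
              summary: 'Probe {{ $labels.module }} failing on {{ $labels.instance }}'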
Confirmed, we do not need to take further action. The decommissioned hosts have been removed automatically, and other aspects of the dashboard are still working.
This is done... closing out the ticket.
Duplicate of T351123... closing.
Mon, Nov 13
The command should be sudo cookbook sre.hosts.decommission search-loader1001.eqiad.wmnet,search-loader2001.codfw.wmnet -t T351123. I'm at the end of my day, so I'll run it tomorrow.
Per the last deploy message above, it looks like mjolnir is running successfully under Bullseye and Python 3.10. The next step is to decom the older, buster-based VMs. That work is tracked in T351123.
Both apps (commons and wikidata) are stable in staging-eqiad now:
Another progress report: We are ~79% (869/1104) done on the leading host (wdqs1022).
Thu, Nov 9
Wed, Nov 8
Upon further review, I'm declining this as invalid. We do need to track resource usage, but that shouldn't be limited to MWAPI. Other resource-related issues will turn up as we roll out in staging (T347075) and test a backfill (T350826), so we can add new subtasks under those tickets as needed.
My rough notes around this subject are here. I'm still learning Flink and Kafka, so I'll need some help creating the backfill test.
Per conversation with @Gehel, we might need to do a scream test by shutting off the old instances and seeing who complains. Will look into this next week.