Generated new cergen certs for wdqs.discovery.wmnet that include wdqs1016 in the alt_names instead of wdqs1005. Followed the steps below:
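As a sanity check after regeneration, the SAN list on the new cert can be inspected with openssl. The sketch below is illustrative only: it generates a throwaway self-signed cert with the expected SANs rather than using the actual cergen output, and the wdqs1016.eqiad.wmnet FQDN and /tmp paths are assumptions.

```shell
# Generate a throwaway self-signed cert carrying the expected SANs
# (illustrative only; the real cert comes from cergen).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/wdqs-demo.key -out /tmp/wdqs-demo.crt \
  -subj '/CN=wdqs.discovery.wmnet' \
  -addext 'subjectAltName=DNS:wdqs.discovery.wmnet,DNS:wdqs1016.eqiad.wmnet' \
  2>/dev/null

# Confirm the new host appears (and the old one does not) in the SAN list
openssl x509 -in /tmp/wdqs-demo.crt -noout -ext subjectAltName
```

The same `openssl x509 -noout -ext subjectAltName` invocation works against the real cert file once it is deployed.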
Aug 31 2023
Aug 30 2023
Aug 29 2023
Aug 28 2023
Aug 17 2023
Some observations from last two patches, tested on wdqs2007 before reverting due to issues:
Aug 16 2023
Built wmf-elasticsearch-search-plugins_7.10.2-9 and wmf-elasticsearch-search-plugins_7.10.2-9~bullseye (https://apt.wikimedia.org/wikimedia/pool/thirdparty/elastic710/w/wmf-elasticsearch-search-plugins/); installed on all elastic* hosts (incl. relforge* and cloudelastic*). Rolling restarts not completed yet. relforge* can be restarted at any time, but elastic* and cloudelastic* must wait till after an ongoing reindex of all wikis has completed.
Will be blocked/waiting for a few days while a reindex of all wikis completes, in order to apply the newest settings.
Aug 15 2023
Aug 14 2023
Patch was merged here: https://gerrit.wikimedia.org/r/947928
Aug 8 2023
Looks like we lost track of this a bit. @bking and I can work on this this week.
Aug 7 2023
Just some investigation we did to understand where the metrics come from: probe_ssl_earliest_cert_expiry comes from the blackbox exporter (see random docs). That metric is used by the alerting repo here: https://github.com/wikimedia/operations-alerts/blob/4ecc222e95710395a6f9a7039e487186d2264323/team-sre/probes.yaml#L55
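For reference, an alert on that metric typically looks something like the rule below. This is an illustrative sketch, not the actual rule in the operations-alerts repo; the alert name, 30-day threshold, `for` duration, and labels are all made up.

```yaml
groups:
  - name: ssl_expiry
    rules:
      - alert: CertAlmostExpired
        # probe_ssl_earliest_cert_expiry is a unix timestamp exported by the
        # blackbox exporter; fire when the earliest-expiring cert in the chain
        # has less than 30 days left.
        expr: probe_ssl_earliest_cert_expiry - time() < 30 * 86400
        for: 3h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate for {{ $labels.instance }} expires in under 30 days"
```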
Aug 3 2023
Checked like so:
In T338159#9002663, @EBernhardson wrote: It looks like we added only the link, could we add a paragraph about how to use this dashboard as well?
Aug 1 2023
Jul 28 2023
Jul 25 2023
Decom cookbook finished, and dc-ops ticket created (see ticket desc AC section for ticket #)
Jul 24 2023
Jul 21 2023
wdqs202[1-2] have been brought into service. With the merging of https://gerrit.wikimedia.org/r/c/operations/puppet/+/940272, all hosts are now in service and have alerting enabled.
With the new hosts in service, we can now begin decom'ing these hosts at our convenience.
Jul 20 2023
All of these hosts except wdqs202[1-2] are in service. Those last two hosts will be brought in service after a final data xfer (ongoing).
Jul 18 2023
In T342162#9025774, @thcipriani wrote: I think I have the context to understand this.
It looks like /srv/deployment/wdqs/wdqs-cache/revs/$CURRENT_DEPLOY_COMMIT_HASH/.git/config-files/etc/query_service is symlinked at /etc/query_service/ldf-config.json; is that true?
I see this line in log output:
/etc/query_service/ldf-config.json is already linked to current rev (use --force to override)
Which is exactly what you're describing. It comes from the check in scap deploy here: https://gitlab.wikimedia.org/repos/releng/scap/-/blob/master/scap/deploy.py#L212
The assumption is that if a file is already symlinked, there's no need to regenerate the file. But it sounds like it's a bad assumption in this case, is that true?
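The check essentially reduces to "is the target already a symlink resolving to the current rev's copy of the file". A minimal sketch of that logic in shell, using hypothetical /tmp paths rather than the real /srv/deployment layout:

```shell
# Hypothetical stand-ins for the deployed rev and the /etc symlink target
rev="/tmp/demo-rev/config-files/etc/query_service"
link="/tmp/demo-etc/ldf-config.json"

mkdir -p "$rev" /tmp/demo-etc
printf '{}' > "$rev/ldf-config.json"
ln -sf "$rev/ldf-config.json" "$link"

# The skip condition: the link already resolves to the current rev's file,
# so the config file is not regenerated.
if [ "$(readlink -f "$link")" = "$(readlink -f "$rev/ldf-config.json")" ]; then
  echo "already linked to current rev"
fi
```

Under that condition a stale file behind an up-to-date symlink would never be refreshed, which matches the bad-assumption concern above.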
First draft of this ticket is up. There are a couple of things that aren't perfect:
Jul 17 2023
Jul 13 2023
Jun 29 2023
Merged patch (had wrong ticket in commit message): https://gerrit.wikimedia.org/r/c/operations/puppet/+/934403
Jun 27 2023
This should be done, but I haven't yet run a validation command to sanity-check that the correct version is in place.
Jun 26 2023
May 30 2023
Should be deployed as of today.
May 22 2023
Thanks for the patience on this! This is getting deployed today.
The documentation aspect of this ticket is already done. Basically two things are left to do to close this ticket out:
May 18 2023
Relforge SAL entry: https://phabricator.wikimedia.org/T274204#8862474
May 17 2023
We've built the new package 7.10.2-5. Haven't yet done a restart of hosts.
May 15 2023
May 11 2023
We've noticed that on the bullseye hosts, the blazegraph prometheus exporters are in a restart loop, ultimately [likely] due to differing python versions breaking the current implementation of our exporter script.
May 9 2023
@Gehel With the recording rule removed in https://gerrit.wikimedia.org/r/912382, there shouldn't be any performance issues since we're not recording anything. The latest query settings in https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/917938 and previous patches are sufficient for acceptable performance on the query, i.e. we don't get timeouts when viewing the graph.
With https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/917938, we now have the grizzly dashboard where we want it. That was the last blocker for closing out this ticket, so this should be all done.
Forgot to link patch but here's the (hopefully final) grizzly patch to get this where we want it: https://gerrit.wikimedia.org/r/c/operations/grafana-grizzly/+/917938
May 4 2023
Apr 26 2023
We're examining wdqs2022, where we have completed the transfer of /srv/wdqs/ yet blazegraph is not starting.
Apr 19 2023
Apr 17 2023
Apr 13 2023
Apr 12 2023
In T333656#8773505, @Dzahn wrote: hi @RKemper, was wondering if you can bring this one up in your team meeting or so (no rush, but would be nice to have): https://gerrit.wikimedia.org/r/c/operations/dns/+/905754 cheers, Daniel
Apr 11 2023
- Things we looked at
Apr 3 2023
Mar 14 2023
In T324335#8683259, @Gehel wrote: After investigation, configuring log4j to talk directly to syslog is adding too much complexity related to the Java Security Manager. We will keep logstash to do log forwarding for now.
Rerouted a shard like so:
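The actual command wasn't captured in this export. A manual shard move via the Elasticsearch cluster reroute API generally looks like the sketch below; the index, shard number, and node names are placeholders, and the curl line is commented out since it needs access to the live cluster, so only the payload is validated locally.

```shell
# Hypothetical reroute payload; index/shard/node names are placeholders.
body='{"commands":[{"move":{"index":"enwiki_content","shard":0,"from_node":"elastic1001","to_node":"elastic1002"}}]}'

# Sanity-check the JSON before sending it anywhere
echo "$body" | python3 -m json.tool > /dev/null && echo "payload ok"

# Apply against the cluster (requires cluster access):
# curl -s -XPOST 'https://localhost:9200/_cluster/reroute' \
#   -H 'Content-Type: application/json' -d "$body"
```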
Mar 13 2023
Mar 9 2023
We've zeroed out the cluster (transient|persistent).indices.recovery.max_bytes_per_sec settings for eqiad & codfw:
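Assuming "zeroed out" means the explicit overrides were removed (the usual way is setting them to null, which makes the cluster fall back to its default recovery throttle rather than literally setting the rate to zero), the request shape is roughly as below. The endpoint is a placeholder and the curl line is commented out since it needs cluster access; only the payload is validated locally.

```shell
# Clear the recovery-rate override in both transient and persistent settings;
# null removes the setting rather than setting the rate to zero.
body='{"transient":{"indices.recovery.max_bytes_per_sec":null},"persistent":{"indices.recovery.max_bytes_per_sec":null}}'

echo "$body" | python3 -m json.tool > /dev/null && echo "payload ok"

# curl -s -XPUT 'https://localhost:9200/_cluster/settings' \
#   -H 'Content-Type: application/json' -d "$body"
```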
Mar 8 2023
Mar 7 2023
Mar 6 2023
Mar 2 2023
Decom ticket for dc-ops: T331074