Dec 20 2023

RKemper updated the task description for T351671: Service implementation for wdqs10[17-21].
Dec 20 2023, 10:05 PM · Data-Platform-SRE (2023.12.01 - 2023.12.31)
RKemper updated the task description for T352878: Troubleshoot recurring systemd unit failures and availability issues for wdqs1022-24.
Dec 20 2023, 6:03 PM · Data-Platform-SRE (2024.01.01 - 2024.01.21)

Dec 19 2023

RKemper added a comment to T351671: Service implementation for wdqs10[17-21].

Current status

New hosts have been added in puppet. Their weights have been set in pybal (more specifically, etcd via conftool), and they're currently marked inactive while we do data transfers. About to kick off batch #1, and will do batch #2 once the first batch has finished:

Dec 19 2023, 10:52 PM · Data-Platform-SRE (2023.12.01 - 2023.12.31)
RKemper added a comment to T351650: Expose 3 new dedicated WDQS endpoints.

Current status:

Dec 19 2023, 4:43 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Dec 15 2023

RKemper updated the task description for T351671: Service implementation for wdqs10[17-21].
Dec 15 2023, 10:37 PM · Data-Platform-SRE (2023.12.01 - 2023.12.31)

Dec 14 2023

RKemper added a parent task for T353482: decommission wdqs10[09-10].eqiad.wmnet: T351671: Service implementation for wdqs10[17-21].
Dec 14 2023, 8:45 PM · SRE, ops-eqiad, decommission-hardware
RKemper added a subtask for T351671: Service implementation for wdqs10[17-21]: T353482: decommission wdqs10[09-10].eqiad.wmnet.
Dec 14 2023, 8:45 PM · Data-Platform-SRE (2023.12.01 - 2023.12.31)
RKemper created T353482: decommission wdqs10[09-10].eqiad.wmnet.
Dec 14 2023, 8:45 PM · SRE, ops-eqiad, decommission-hardware
RKemper updated the task description for T351671: Service implementation for wdqs10[17-21].
Dec 14 2023, 8:42 PM · Data-Platform-SRE (2023.12.01 - 2023.12.31)

Dec 13 2023

RKemper updated the task description for T345449: Rolling operation cookbook: Detect and remove failed index aliases.
Dec 13 2023, 10:50 PM · Data-Platform-SRE
RKemper added a comment to T349777: Q2:rack/setup/install elastic110[3-7].

Been seeing some weirdness on elastic1107 (internal search team alerts for PuppetZeroResources and the like) so we'll see if a fresh reimage smooths things over

Dec 13 2023, 10:38 PM · SRE, ops-eqiad, Data-Platform-SRE, DC-Ops

Dec 8 2023

RKemper updated the task description for T351671: Service implementation for wdqs10[17-21].
Dec 8 2023, 8:04 PM · Data-Platform-SRE (2023.12.01 - 2023.12.31)
RKemper renamed T351671: Service implementation for wdqs10[17-21] from Service implementation for wdqs1017-1020 to Service implementation for wdqs10[17-21].
Dec 8 2023, 7:46 PM · Data-Platform-SRE (2023.12.01 - 2023.12.31)
RKemper updated the task description for T342749: Q1:rack/setup/install wdqs102[0-4].
Dec 8 2023, 7:41 PM · SRE, Data-Platform-SRE, ops-eqiad, DC-Ops
RKemper renamed T332314: Service implementation for wdqs20[13-22] from Configure new WDQS servers in codfw (wdqs20[13-22]) to Service implementation for wdqs20[13-22].
Dec 8 2023, 7:22 PM · Patch-For-Review, Discovery-Search (Current work), Data-Platform-SRE, Wikidata, Wikidata-Query-Service

Dec 7 2023

RKemper added a comment to T350106: Implement a spark job that converts a RDF triples table into a RDF file format.

Here are some extra notes with some of the commands we ran/used: P54284

Dec 7 2023, 8:14 PM · Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
RKemper created P54284 WDQS Graph Split Manual Data Load Notes.
Dec 7 2023, 8:12 PM · Discovery-Search (Current work)

Dec 5 2023

RKemper added a comment to T351650: Expose 3 new dedicated WDQS endpoints.

Alright, I had an initial meeting with Traffic team (Brandon & Valentin).

Dec 5 2023, 6:20 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Nov 22 2023

RKemper added a comment to T349340: CirrusSearch: make p95 alerts more granular.

Here's a graph of some of the alerts firing over time. We do still see the new alerts fire occasionally, but less frequently (and with more granularity) than with the previous alerting strategy, which alerted on generic p95 request time rather than breaking it down per query type.
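To illustrate the difference between a generic p95 and a per-query-type p95, here is a minimal sketch. It is not the actual alert definition from the task: the query type names (`full_text`, `comp_suggest`), the thresholds, and the sample latencies are all made up for illustration, and the nearest-rank percentile is just one common way to compute a p95.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # ceil(0.95 * n) using integer arithmetic to avoid float rounding
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p95_by_query_type(samples):
    """Group (query_type, latency_ms) samples and compute p95 per type."""
    grouped = defaultdict(list)
    for query_type, latency_ms in samples:
        grouped[query_type].append(latency_ms)
    return {qt: p95(vals) for qt, vals in grouped.items()}


def firing_alerts(samples, thresholds_ms):
    """Return the query types whose p95 exceeds their (hypothetical) threshold."""
    per_type = p95_by_query_type(samples)
    return sorted(
        qt for qt, value in per_type.items()
        if qt in thresholds_ms and value > thresholds_ms[qt]
    )


# Hypothetical data: one query type is slow, the other is fast.
samples = (
    [("full_text", 100)] * 18
    + [("full_text", 900)] * 2
    + [("comp_suggest", 20)] * 20
)
thresholds_ms = {"full_text": 500, "comp_suggest": 50}

generic_p95 = p95([latency for _, latency in samples])
per_type_p95 = p95_by_query_type(samples)
alerts = firing_alerts(samples, thresholds_ms)
```

In this made-up data the generic p95 stays low because the fast query type dilutes the slow one, while the per-type breakdown flags only the query type that actually regressed.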

Nov 22 2023, 5:57 PM · Data-Platform-SRE, Discovery-Search (Current work)

Nov 20 2023

RKemper updated the task description for T351650: Expose 3 new dedicated WDQS endpoints.
Nov 20 2023, 4:35 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Discovery-Search (Current work), Wikidata-Query-Service, Wikidata
RKemper created T351650: Expose 3 new dedicated WDQS endpoints.
Nov 20 2023, 3:56 PM · Data-Platform-SRE (2024.01.22 - 2024.02.11), Discovery-Search (Current work), Wikidata-Query-Service, Wikidata

Nov 15 2023

RKemper updated the task description for T351354: Service implementation for cloudelastic1007-1010.
Nov 15 2023, 10:15 PM · Data-Platform-SRE (2024.02.12 - 2024.03.03)
RKemper updated the task description for T351354: Service implementation for cloudelastic1007-1010.
Nov 15 2023, 10:10 PM · Data-Platform-SRE (2024.02.12 - 2024.03.03)

Nov 14 2023

RKemper claimed T338009: Create dashboards for Search SLOs.
Nov 14 2023, 7:22 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Discovery-Search (Current work)

Nov 13 2023

RKemper updated subscribers of T338009: Create dashboards for Search SLOs.

Talked to @EBernhardson last week, and one thing we were uncertain about is whether it makes sense to set SLOs on metrics such as MediaSearch latency p95. Because that metric represents the actual time to render the view, not just the Elasticsearch backend response time, it's possible for the SLO to be missed without anything being wrong with Search services in particular. After talking with @Gehel, one thing we discussed is that some of these SLOs should be set in terms of what user experience we feel is acceptable; that way we'll have a metric/objective we can point to if, say, some change elsewhere in the stack leads to slowdowns. It's somewhat analogous to the role unit tests play in refactoring: they let you make changes while validating that the changes didn't break anything.

Nov 13 2023, 7:25 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Discovery-Search (Current work)

Nov 2 2023

RKemper added a comment to T338009: Create dashboards for Search SLOs.

Balthazar and I met last week. We took a look at the temporary dashboard and outlined some alerting threshold values for the various SLIs:

Nov 2 2023, 6:23 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Discovery-Search (Current work)

Nov 1 2023

RKemper updated the task description for T349011: Improve data-reload cookbook based on graph split needs.
Nov 1 2023, 9:58 PM · Data-Platform-SRE

Oct 26 2023

RKemper updated the task description for T349772: Create dashboards/alerts for new Cirrus Streaming Updater.
Oct 26 2023, 6:49 PM · Data-Platform-SRE
RKemper updated the task description for T349772: Create dashboards/alerts for new Cirrus Streaming Updater.
Oct 26 2023, 6:48 PM · Data-Platform-SRE

Oct 23 2023

RKemper moved T328330: Create SLI / SLO on Search update lag from Ready for Dev -- SRE/Ops to In Progress on the Discovery-Search (Current work) board.
Oct 23 2023, 3:34 PM · Data-Platform-SRE, Discovery-Search (Current work)
RKemper moved T349340: CirrusSearch: make p95 alerts more granular from Incoming to Blocked/Waiting on the Discovery-Search (Current work) board.

Alert changes have been merged, so sticking this in blocked/waiting for now. We should check back in a week and make sure the alerts work as intended.

Oct 23 2023, 3:22 PM · Data-Platform-SRE, Discovery-Search (Current work)

Oct 19 2023

RKemper updated the task description for T349340: CirrusSearch: make p95 alerts more granular.
Oct 19 2023, 7:47 PM · Data-Platform-SRE, Discovery-Search (Current work)
RKemper created T349340: CirrusSearch: make p95 alerts more granular.
Oct 19 2023, 7:09 PM · Data-Platform-SRE, Discovery-Search (Current work)

Oct 12 2023

RKemper updated the language for P52919 wdqs graph split dumps issue from autodetect to bash.
Oct 12 2023, 7:28 PM · Discovery-Search
RKemper created P52919 wdqs graph split dumps issue.
Oct 12 2023, 7:28 PM · Discovery-Search

Oct 11 2023

RKemper added a comment to T348418: Reboot apifeatureusage* hosts.

Created T348696 for the Illegal Reflective Access warning

Oct 11 2023, 9:36 PM · Data-Platform-SRE
RKemper updated the task description for T348696: Address illegal reflective access on apifeatureusage*.
Oct 11 2023, 9:35 PM · Data-Platform-SRE
RKemper created T348696: Address illegal reflective access on apifeatureusage*.
Oct 11 2023, 9:34 PM · Data-Platform-SRE
RKemper added a comment to T348418: Reboot apifeatureusage* hosts.

Reboot complete. Systemd units are happy.

Oct 11 2023, 9:31 PM · Data-Platform-SRE
RKemper updated the task description for T348418: Reboot apifeatureusage* hosts.
Oct 11 2023, 9:27 PM · Data-Platform-SRE
RKemper updated the task description for T348418: Reboot apifeatureusage* hosts.
Oct 11 2023, 6:35 PM · Data-Platform-SRE
RKemper updated the task description for T346920: VisualEditor's Add a link should suggest a redirect with exact case match.
Oct 11 2023, 4:18 PM · Verified, MW-1.42-notes (1.42.0-wmf.3; 2023-10-31), Discovery-Search (Current work), Editing-team (Kanban Board), VisualEditor

Oct 8 2023

RKemper removed a project from T348418: Reboot apifeatureusage* hosts: ApiFeatureUsage.
Oct 8 2023, 10:51 PM · Data-Platform-SRE
RKemper created T348418: Reboot apifeatureusage* hosts.
Oct 8 2023, 10:46 PM · Data-Platform-SRE

Oct 6 2023

RKemper updated the task description for T348350: Set requests (not limits) for cirrus-streaming-updater in k8s.
Oct 6 2023, 7:51 PM · Data-Platform-SRE (2024.02.12 - 2024.03.03)
RKemper created T348350: Set requests (not limits) for cirrus-streaming-updater in k8s.
Oct 6 2023, 7:49 PM · Data-Platform-SRE (2024.02.12 - 2024.03.03)

Oct 2 2023

RKemper renamed T326302: Misconfigured proxies on data-engineering hosts from Missconfigured proxies on data-engineering hosts to Misconfigured proxies on data-engineering hosts.
Oct 2 2023, 6:47 PM · Data-Platform-SRE, Data-Engineering
RKemper updated the task description for T347034: RESTBase /v1/related endpoint should call the MW action API with a GET not a POST.
Oct 2 2023, 4:09 PM · API Platform, RESTBase Sunsetting, Essential-Work, Wikifeeds, Sustainability (Incident Followup), Discovery-Search
RKemper renamed T346885: BUG Partially persistent authentication on WCQS after revoking permissions from BUG Partially persisten authentication on WCQS after revoking permissions to BUG Partially persistent authentication on WCQS after revoking permissions.
Oct 2 2023, 3:42 PM · Wikidata, Fiwiki-Wikidata-Commons, Wikidata-Query-Service, StructuredDataOnCommons

Sep 28 2023

RKemper moved T347624: Refactor sre.wdqs.data-transfer to use new spicerack class api from Incoming to In Progress on the Data-Platform-SRE board.
Sep 28 2023, 7:04 PM · Patch-For-Review, Data-Platform-SRE (2024.03.04 - 2024.03.24)
RKemper created T347624: Refactor sre.wdqs.data-transfer to use new spicerack class api.
Sep 28 2023, 7:04 PM · Patch-For-Review, Data-Platform-SRE (2024.03.04 - 2024.03.24)

Sep 27 2023

RKemper updated the task description for T340153: decommission frauth2001.frack.eqiad.wmnet.
Sep 27 2023, 5:12 PM · SRE, ops-codfw, decommission-hardware
RKemper moved T339347: qlever dblp endpoint for wikidata federated query nomination from Ready for Dev -- SRE/Ops to Needs Reporting on the Discovery-Search (Current work) board.
Sep 27 2023, 5:06 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Wikidata-Query-Service, Wikidata
RKemper moved T339347: qlever dblp endpoint for wikidata federated query nomination from Blocked / Waiting to Done on the Data-Platform-SRE board.
Sep 27 2023, 5:06 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Wikidata-Query-Service, Wikidata
RKemper moved T345475: Service implementation for wdqs202[3-5].codfw.wmnet from In Progress to Done on the Data-Platform-SRE board.

These hosts are in service as of yesterday.

Sep 27 2023, 5:05 PM · Data-Platform-SRE
RKemper updated the task description for T345475: Service implementation for wdqs202[3-5].codfw.wmnet.
Sep 27 2023, 5:04 PM · Data-Platform-SRE

Sep 26 2023

RKemper updated the task description for T344198: Decommission wdqs100[3-5].
Sep 26 2023, 7:00 PM · Data-Platform-SRE
bking awarded T347341: Only alert for high latency if there is enough data to make a sensible average an Orange Medal token.
Sep 26 2023, 1:18 PM · Discovery-Search

Sep 25 2023

RKemper closed T347341: Only alert for high latency if there is enough data to make a sensible average as Resolved.

We've also applied the fix to codfw and made the thresholds equal between both approaches since they're back to using the same metric again.

Sep 25 2023, 10:46 PM · Discovery-Search
RKemper closed T347341: Only alert for high latency if there is enough data to make a sensible average, a subtask of T346945: 2023-09-20 Elasticsearch unavailable incident, as Resolved.
Sep 25 2023, 10:46 PM · Discovery-Search (Current work), Wikimedia-Incident, SRE-OnFire, Data-Platform-SRE
RKemper added a comment to T347341: Only alert for high latency if there is enough data to make a sensible average.

Okay, @EBernhardson, @bking and I think we've found an approach that actually works:

Sep 25 2023, 10:35 PM · Discovery-Search
RKemper updated the task description for T347338: Investigate performance differences between elastic2037-2054 and 2055-2086.
Sep 25 2023, 7:14 PM · Data-Platform-SRE, Discovery-Search
RKemper added a comment to T347284: Restore service for https://query.wikidata.org/bigdata/ldf.

Just to explain what happened here: we switched the LDF endpoint from 1003 to 1016, which at the time was a public host. However, for separate reasons, we later realized that 1016 was supposed to be an internal host and reimaged it as such, without remembering that it was the new LDF host.

Sep 25 2023, 7:04 PM · Data-Platform-SRE, Wikidata, Wikidata-Query-Service

Sep 19 2023

RKemper updated the task description for T346315: Improve the flink-app chart to provide more useful defaults.
Sep 19 2023, 7:17 PM · Patch-For-Review, Discovery-Search (Current work), serviceops, Event-Platform, Data-Engineering

Sep 18 2023

RKemper closed T343124: Migrate WDQS and WCQS servers to Debian Bullseye, a subtask of T323921: [Epic] Migrate all Search Platform servers to Debian Bullseye, as Resolved.
Sep 18 2023, 10:21 PM · Data-Platform-SRE, Epic
RKemper closed T343124: Migrate WDQS and WCQS servers to Debian Bullseye as Resolved.
Sep 18 2023, 10:21 PM · Data-Platform-SRE
RKemper updated the task description for T344198: Decommission wdqs100[3-5].
Sep 18 2023, 9:50 PM · Data-Platform-SRE
RKemper moved T344198: Decommission wdqs100[3-5] from In Progress to Blocked / Waiting on the Data-Platform-SRE board.

Running decom cookbook for wdqs100[3,4]. Dc-ops ticket up here: https://phabricator.wikimedia.org/T346699

Sep 18 2023, 9:21 PM · Data-Platform-SRE
RKemper created T346699: decommission wdqs100[3,4].eqiad.wmnet.
Sep 18 2023, 9:20 PM · SRE, ops-eqiad, decommission-hardware
RKemper moved T314890: Service implementation for wdqs101[4,5,6] from In Progress to Done on the Data-Platform-SRE board.
Sep 18 2023, 9:11 PM · Data-Platform-SRE
RKemper updated the task description for T314890: Service implementation for wdqs101[4,5,6].
Sep 18 2023, 9:11 PM · Data-Platform-SRE

Sep 14 2023

RKemper added a comment to T340793: Implement depool (source only) and keep-downtime options on data-transfer cookbook.

Removed subtask because I think the scap ticket is not directly related to this one.

Sep 14 2023, 9:55 PM · Data-Platform-SRE
RKemper removed a parent task for T342162: "scap deploy"'s config-deploy should check for broken symlinks: T340793: Implement depool (source only) and keep-downtime options on data-transfer cookbook.
Sep 14 2023, 9:55 PM · Data-Platform-SRE, Release-Engineering-Team, Scap
RKemper removed a subtask for T340793: Implement depool (source only) and keep-downtime options on data-transfer cookbook: T342162: "scap deploy"'s config-deploy should check for broken symlinks.
Sep 14 2023, 9:55 PM · Data-Platform-SRE

Sep 13 2023

RKemper moved T343820: Retune enwiki_content shard settings from In Progress to Needs Reporting on the Discovery-Search (Current work) board.
Sep 13 2023, 10:31 PM · Discovery-Search (Current work), Data-Platform-SRE
RKemper updated the task description for T343820: Retune enwiki_content shard settings.
Sep 13 2023, 10:30 PM · Discovery-Search (Current work), Data-Platform-SRE
RKemper moved T343820: Retune enwiki_content shard settings from To be Deployed to Done on the Data-Platform-SRE board.

The reindex, even though we had to terminate it before it finished, had already gotten to enwiki_content. So this is done. Here's what the value on both eqiad and codfw looks like:

Sep 13 2023, 9:54 PM · Discovery-Search (Current work), Data-Platform-SRE
RKemper created P52490 Investigating WDQS SLO update lag dashboard discrepancies.
Sep 13 2023, 7:16 AM · Wikidata-Query-Service, Discovery-Search

Sep 11 2023

RKemper claimed T345475: Service implementation for wdqs202[3-5].codfw.wmnet.
Sep 11 2023, 6:46 PM · Data-Platform-SRE

Sep 8 2023

RKemper updated the task description for T314890: Service implementation for wdqs101[4,5,6].
Sep 8 2023, 4:24 AM · Data-Platform-SRE
RKemper added a comment to T314890: Service implementation for wdqs101[4,5,6].

Host was reimaged but patch wasn't yet merged. Merging patch and rolling the re-image again.

Sep 8 2023, 4:23 AM · Data-Platform-SRE
RKemper updated the task description for T314890: Service implementation for wdqs101[4,5,6].
Sep 8 2023, 4:20 AM · Data-Platform-SRE

Sep 7 2023

RKemper updated the task description for T345475: Service implementation for wdqs202[3-5].codfw.wmnet.
Sep 7 2023, 3:12 AM · Data-Platform-SRE

Sep 6 2023

RKemper updated the task description for T344198: Decommission wdqs100[3-5].
Sep 6 2023, 9:20 PM · Data-Platform-SRE
RKemper updated the task description for T314890: Service implementation for wdqs101[4,5,6].
Sep 6 2023, 9:17 PM · Data-Platform-SRE
RKemper reopened T314890: Service implementation for wdqs101[4,5,6], a subtask of T307138: Q4:(Need By: TBD) rack/setup/install wdqs101[4,5,6], as Open.
Sep 6 2023, 9:10 PM · Discovery-Search (Current work), SRE, ops-eqiad, DC-Ops
RKemper reopened T314890: Service implementation for wdqs101[4,5,6] as "Open".

Oops, meant this to be in progress. Changing status to Open from previous state of Resolved.

Sep 6 2023, 9:10 PM · Data-Platform-SRE
RKemper moved T314890: Service implementation for wdqs101[4,5,6] from In Progress to Blocked/Waiting on the Discovery-Search (Current work) board.

We made a slight mistake: one of these hosts needed to be in wdqs-internal since wdqs1003 (one of the hosts replaced by these new hosts) is.

Sep 6 2023, 9:09 PM · Data-Platform-SRE
RKemper moved T340793: Implement depool (source only) and keep-downtime options on data-transfer cookbook from In Progress to Done on the Data-Platform-SRE board.
Sep 6 2023, 8:55 PM · Data-Platform-SRE
RKemper updated the task description for T340793: Implement depool (source only) and keep-downtime options on data-transfer cookbook.
Sep 6 2023, 8:55 PM · Data-Platform-SRE
RKemper moved T337296: Allow federated queries with the NLG endpoint (data.nlg.gr) from Needs Review to Blocked / Waiting on the Data-Platform-SRE board.
Sep 6 2023, 8:54 PM · Data-Platform-SRE, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Sep 2 2023

RKemper renamed T345475: Service implementation for wdqs202[3-5].codfw.wmnet from Service implementation for wdqs20[23-25].codfw.wmnet to Service implementation for wdqs202[3-5].codfw.wmnet.
Sep 2 2023, 5:12 AM · Data-Platform-SRE

Sep 1 2023

RKemper created T345475: Service implementation for wdqs202[3-5].codfw.wmnet.
Sep 1 2023, 9:14 PM · Data-Platform-SRE

Aug 31 2023

RKemper updated the task description for T345391: decommission wdqs1005.
Aug 31 2023, 8:18 PM · SRE, ops-eqiad, decommission-hardware
RKemper renamed T345391: decommission wdqs1005 from decommission wdqs100[3-5] to decommission wdqs1005.
Aug 31 2023, 8:18 PM · SRE, ops-eqiad, decommission-hardware
RKemper moved T344198: Decommission wdqs100[3-5] from In Progress to Blocked / Waiting on the Data-Platform-SRE board.

Made the decom ticket for wdqs1005 and ran the cookbook. We will decom 1003/1004 later. Moving this to blocked/waiting for now.

Aug 31 2023, 8:15 PM · Data-Platform-SRE
RKemper updated the task description for T344198: Decommission wdqs100[3-5].
Aug 31 2023, 8:14 PM · Data-Platform-SRE
RKemper added a parent task for T345391: decommission wdqs1005: T344198: Decommission wdqs100[3-5].
Aug 31 2023, 8:14 PM · SRE, ops-eqiad, decommission-hardware
RKemper added a subtask for T344198: Decommission wdqs100[3-5]: T345391: decommission wdqs1005.
Aug 31 2023, 8:14 PM · Data-Platform-SRE
RKemper created T345391: decommission wdqs1005.
Aug 31 2023, 8:14 PM · SRE, ops-eqiad, decommission-hardware