RKemper (Ryan Kemper)
User

User Details

User Since
May 1 2020, 10:28 PM (135 w, 1 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
RKemper (WMF) [ Global Accounts ]

Recent Activity

Mon, Nov 28

RKemper created T323935: Migrate query service data-transfer/reload cookbooks to new spicerack class API.
Mon, Nov 28, 5:19 PM · Discovery-Search, Wikidata

Wed, Nov 23

RKemper added a comment to T313751: Create WDQS uptime SLO.

A few comments on the current dashboard:

  • a very quick look at Turnilo: the graphs look different enough that I'd like to know why the discrepancies exist
  • as discussed, we should define the service as "working" not only when returning HTTP/200, but also when requests are throttled (429) or banned (403)
  • we probably need to dig a bit more into other response codes and the dips we see in the graph to understand what they are and if they are problematic (and thus refine our definition of a "working" service)
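The proposed definition of "working" can be sketched as a simple ratio over response-code counts. This is only an illustration: the function name and the sample counts are made up, and the set of "working" codes reflects the suggestion in the bullets above, not a finalized SLI definition.

```python
# Assumption: codes treated as "the service is working", per the comment above.
# 200 = success, 429 = throttled, 403 = banned.
WORKING_STATUSES = {200, 429, 403}


def availability(status_counts):
    """Return the fraction of requests answered with a 'working' status.

    status_counts maps an HTTP status code to the number of responses seen.
    Returns None when there was no traffic, since the ratio is undefined.
    """
    total = sum(status_counts.values())
    if total == 0:
        return None
    working = sum(
        count for code, count in status_counts.items() if code in WORKING_STATUSES
    )
    return working / total


# Made-up sample counts, for illustration only.
sample = {200: 9500, 429: 300, 403: 100, 500: 60, 503: 40}
print(f"{availability(sample):.4f}")
```

Other response codes (and the dips in the graph) would still need to be classified one way or the other before a ratio like this is meaningful.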
Wed, Nov 23, 7:20 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
RKemper moved T323066: Understand meaning of trafficserver wdqs request data vs turnilo webrequest data from Ready for Dev -- SRE/Ops to Needs Reporting on the Discovery-Search (Current work) board.
Wed, Nov 23, 6:48 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
RKemper added a comment to T323066: Understand meaning of trafficserver wdqs request data vs turnilo webrequest data.

Comparing grafana to turnilo, the graphs seem more or less aligned in shape.

Wed, Nov 23, 6:36 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
RKemper moved T323612: Increase small cluster heap memory from Incoming to In Progress on the Discovery-Search (Current work) board.

Cloudelastic had a few hosts that didn't have the latest heap size applied; most likely the rolling restart was done before sufficient time had passed for puppet to auto-run on each instance.

Wed, Nov 23, 5:46 PM · Discovery-Search (Current work)
RKemper moved T316236: Reload WCQS from dumps from Needs review to Needs Reporting on the Discovery-Search (Current work) board.

Moved to Needs Reporting. Usually we leave tickets open, move them to Needs Reporting, and gehel closes them as resolved after reviewing them, but I'll leave this task resolved for now to avoid re-opening it.

Wed, Nov 23, 2:03 AM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
RKemper updated the task description for T323620: 2022-11-22 WDQS high load incident.
Wed, Nov 23, 12:27 AM · Discovery-Search (Current work)
RKemper updated the task description for T323620: 2022-11-22 WDQS high load incident.
Wed, Nov 23, 12:26 AM · Discovery-Search (Current work)
RKemper updated the task description for T323066: Understand meaning of trafficserver wdqs request data vs turnilo webrequest data.
Wed, Nov 23, 12:16 AM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Tue, Nov 22

RKemper updated the task description for T323066: Understand meaning of trafficserver wdqs request data vs turnilo webrequest data.
Tue, Nov 22, 11:47 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Thu, Nov 17

RKemper moved T320482: Degraded RAID on elastic2052 from In Progress to Needs Reporting on the Discovery-Search (Current work) board.

Reimaging now

Thu, Nov 17, 7:54 PM · Discovery-Search (Current work), SRE, ops-codfw

Mon, Nov 14

RKemper created T323066: Understand meaning of trafficserver wdqs request data vs turnilo webrequest data.
Mon, Nov 14, 7:52 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
RKemper created T323064: Create WDQS Uptime SLO && WDQS/WCQS update lag SLO dashboards in Grizzly.
Mon, Nov 14, 7:50 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
RKemper moved T313842: Decommission elastic2049.codfw.wmnet from Needs review to Needs Reporting on the Discovery-Search (Current work) board.
Mon, Nov 14, 7:31 PM · DC-Ops, Discovery-Search (Current work), decommission-hardware
RKemper reopened T313842: Decommission elastic2049.codfw.wmnet as "In Progress".
Mon, Nov 14, 7:31 PM · DC-Ops, Discovery-Search (Current work), decommission-hardware

Thu, Nov 10

RKemper renamed T319020: Reset to upstream java GC options and remove redundant JVM options from Reset to default java GC options to Reset to upstream java GC options and remove redundant JVM options.
Thu, Nov 10, 9:07 PM · Patch-For-Review, Discovery-Search (Current work)
RKemper renamed T319020: Reset to upstream java GC options and remove redundant JVM options from Reset to default java JC options to Reset to default java GC options.
Thu, Nov 10, 9:04 PM · Patch-For-Review, Discovery-Search (Current work)
RKemper updated the task description for T319020: Reset to upstream java GC options and remove redundant JVM options.
Thu, Nov 10, 9:00 PM · Patch-For-Review, Discovery-Search (Current work)
RKemper renamed T319020: Reset to upstream java GC options and remove redundant JVM options from Elasticsearch (omega cluster) failed with OOME on elastic1096 to Reset to default java JC options.
Thu, Nov 10, 9:00 PM · Patch-For-Review, Discovery-Search (Current work)
RKemper added a comment to T320482: Degraded RAID on elastic2052.

@Papaul Yup, per jbond's comment above we're still seeing the RAID issue. Could we try either rebuilding the RAID with the current disk, or swapping in a new one and rebuilding? (I suspect the latter is necessary, but I'm not totally sure.)

Thu, Nov 10, 1:43 AM · Discovery-Search (Current work), SRE, ops-codfw

Mon, Nov 7

RKemper moved T317816: Enable 10G networking in cirrus elastic clusters from Ready for Dev -- SRE/Ops to Blocked/Waiting on the Discovery-Search (Current work) board.
Mon, Nov 7, 4:09 PM · Discovery-Search (Current work)
RKemper changed the status of T317816: Enable 10G networking in cirrus elastic clusters from Open to In Progress.
Mon, Nov 7, 4:09 PM · Discovery-Search (Current work)
RKemper moved T322082: Q1:rerack elastic10[53-67] from Incoming to Blocked/Waiting on the Discovery-Search (Current work) board.
Mon, Nov 7, 4:08 PM · SRE, Discovery-Search (Current work), ops-eqiad, DC-Ops

Nov 3 2022

RKemper updated the task description for T322377: Use DNS name instead of IP in PyBal alerts.
Nov 3 2022, 9:38 PM · SRE, Observability-Alerting, Traffic
RKemper updated subscribers of T321605: Make WCQS/WDQS data transfer cookbook more reliable.
Nov 3 2022, 9:25 PM · Discovery-Search (Current work)
RKemper updated the task description for T321605: Make WCQS/WDQS data transfer cookbook more reliable.
Nov 3 2022, 9:24 PM · Discovery-Search (Current work)
RKemper moved T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service from Ready for Dev -- SRE/Ops to Needs Reporting on the Discovery-Search (Current work) board.
Nov 3 2022, 8:57 PM · Discovery-Search (Current work)
RKemper moved T322358: Address Puppet change errors in Relforge from Incoming to Needs Reporting on the Discovery-Search (Current work) board.
Nov 3 2022, 8:39 PM · Discovery-Search (Current work)
RKemper added a comment to T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service.

We tried https://gerrit.wikimedia.org/r/851711 out on wdqs1009 (making exporter Requires= and After= the blazegraph instance); didn't behave how we expected. For example we thought restarting blazegraph would restart the exporter, but that wasn't the case.

Nov 3 2022, 7:17 PM · Discovery-Search (Current work)

Nov 1 2022

RKemper renamed T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service from Data-transfer.py cookbook: stop prometheus-blazegraph-exporter service to Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service.
Nov 1 2022, 7:35 PM · Discovery-Search (Current work)
RKemper updated Other Assignee for T322082: Q1:rerack elastic10[53-67], added: Jclark-ctr.
Nov 1 2022, 7:20 PM · SRE, Discovery-Search (Current work), ops-eqiad, DC-Ops

Oct 31 2022

RKemper updated the task description for T322082: Q1:rerack elastic10[53-67].
Oct 31 2022, 9:52 PM · SRE, Discovery-Search (Current work), ops-eqiad, DC-Ops
RKemper updated the task description for T322082: Q1:rerack elastic10[53-67].
Oct 31 2022, 9:52 PM · SRE, Discovery-Search (Current work), ops-eqiad, DC-Ops
RKemper added a project to T322082: Q1:rerack elastic10[53-67]: Discovery-Search (Current work).
Oct 31 2022, 9:37 PM · SRE, Discovery-Search (Current work), ops-eqiad, DC-Ops
RKemper updated the task description for T322082: Q1:rerack elastic10[53-67].
Oct 31 2022, 9:33 PM · SRE, Discovery-Search (Current work), ops-eqiad, DC-Ops
RKemper updated the task description for T317816: Enable 10G networking in cirrus elastic clusters.
Oct 31 2022, 9:32 PM · Discovery-Search (Current work)
RKemper added a subtask for T317816: Enable 10G networking in cirrus elastic clusters: T322082: Q1:rerack elastic10[53-67].
Oct 31 2022, 9:31 PM · Discovery-Search (Current work)
RKemper added a parent task for T322082: Q1:rerack elastic10[53-67]: T317816: Enable 10G networking in cirrus elastic clusters.
Oct 31 2022, 9:31 PM · SRE, Discovery-Search (Current work), ops-eqiad, DC-Ops
RKemper renamed T322082: Q1:rerack elastic10[53-67] from Q1:rack/setup/install elastic10[53-67] to Q1:rerack elastic10[53-67].
Oct 31 2022, 9:31 PM · SRE, Discovery-Search (Current work), ops-eqiad, DC-Ops
RKemper created T322082: Q1:rerack elastic10[53-67].
Oct 31 2022, 9:29 PM · SRE, Discovery-Search (Current work), ops-eqiad, DC-Ops
RKemper added a comment to T321605: Make WCQS/WDQS data transfer cookbook more reliable.

Further context for the rsync approach:

Oct 31 2022, 5:26 PM · Discovery-Search (Current work)
RKemper set the point value for T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service to 2.
Oct 31 2022, 4:59 PM · Discovery-Search (Current work)
RKemper updated the task description for T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service.
Oct 31 2022, 4:58 PM · Discovery-Search (Current work)
RKemper added a comment to T321605: Make WCQS/WDQS data transfer cookbook more reliable.

Notes from today's SRE meeting where I asked if anyone had any thoughts:

Oct 31 2022, 4:49 PM · Discovery-Search (Current work)
RKemper renamed T321855: Using DISTINCT with VALUES returns more results than expected from Using DISTINCT with VALUES returns more results that expected to Using DISTINCT with VALUES returns more results than expected.
Oct 31 2022, 4:47 PM · Wikidata, Wikidata-Query-Service
RKemper removed a subtask for T316236: Reload WCQS from dumps: T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service.
Oct 31 2022, 4:46 PM · Patch-For-Review, Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
RKemper added a subtask for T321605: Make WCQS/WDQS data transfer cookbook more reliable: T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service.
Oct 31 2022, 4:46 PM · Discovery-Search (Current work)
RKemper edited parent tasks for T322037: Add blazegraph as systemd dependency of prometheus-blazegraph-exporter service, added: T321605: Make WCQS/WDQS data transfer cookbook more reliable; removed: T316236: Reload WCQS from dumps.
Oct 31 2022, 4:46 PM · Discovery-Search (Current work)
RKemper moved T300943: Service implementation for elastic20[61-86].codfw.wmnet from In Progress to Needs Reporting on the Discovery-Search (Current work) board.

Decom is done: T321243

Oct 31 2022, 4:19 PM · Patch-For-Review, Discovery-Search (Current work)
RKemper updated the task description for T300943: Service implementation for elastic20[61-86].codfw.wmnet.
Oct 31 2022, 4:18 PM · Patch-For-Review, Discovery-Search (Current work)
RKemper updated the task description for T317816: Enable 10G networking in cirrus elastic clusters.
Oct 31 2022, 4:16 PM · Discovery-Search (Current work)

Oct 25 2022

RKemper added a comment to T320482: Degraded RAID on elastic2052.

@bking this host is out of warranty. If it is a critical host, you will have to let us know and request the purchase of a disk. Another option is to check whether any of the decommissioned nodes has a similar disk that we can use.
Thanks.

Oct 25 2022, 7:29 PM · Discovery-Search (Current work), SRE, ops-codfw

Oct 24 2022

RKemper set the point value for T320482: Degraded RAID on elastic2052 to 2.
Oct 24 2022, 3:29 PM · Discovery-Search (Current work), SRE, ops-codfw

Oct 19 2022

RKemper updated the task description for T321237: hw troubleshooting: flapping mgmt console for wdqs2005.mgmt.codfw.wmnet.
Oct 19 2022, 8:00 PM · SRE, ops-codfw, DC-Ops
RKemper updated the task description for T321237: hw troubleshooting: flapping mgmt console for wdqs2005.mgmt.codfw.wmnet.
Oct 19 2022, 7:46 PM · SRE, ops-codfw, DC-Ops
RKemper created T321237: hw troubleshooting: flapping mgmt console for wdqs2005.mgmt.codfw.wmnet.
Oct 19 2022, 7:45 PM · SRE, ops-codfw, DC-Ops

Oct 14 2022

RKemper added a comment to T300943: Service implementation for elastic20[61-86].codfw.wmnet.

Final step is to open dcops ticket, and then this can be moved to needs reporting.

Oct 14 2022, 2:12 AM · Patch-For-Review, Discovery-Search (Current work)
RKemper updated the task description for T300943: Service implementation for elastic20[61-86].codfw.wmnet.
Oct 14 2022, 12:25 AM · Patch-For-Review, Discovery-Search (Current work)

Oct 13 2022

RKemper moved T313431: Increase Elastic master-eligible nodes from 3 to 5 from In Progress to Needs Reporting on the Discovery-Search (Current work) board.
Oct 13 2022, 7:33 PM · Patch-For-Review, Discovery-Search (Current work)
RKemper added a comment to T313751: Create WDQS uptime SLO.

With respect to recording nginx request responses:

Oct 13 2022, 6:13 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
RKemper added a comment to T313751: Create WDQS uptime SLO.

The current approach we're trying to work towards is recording the nginx response codes for requests. That will give us insight into the number of failures we're seeing.

Oct 13 2022, 6:10 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Oct 4 2022

RKemper added a comment to T313751: Create WDQS uptime SLO.

With respect to the SLO itself, our goal is an SLO that captures the promise we make about service availability: namely, that WDQS is available on a best-effort basis. In practice, this means that if an issue arises outside "business hours", it's acceptable to wait until "business hours" to resolve it. For example, in the most extreme case, if the service were to have an outage on a Friday night, we wouldn't page anyone to work the night or the weekend, but come Monday we'd focus our efforts on restoring availability as soon as possible. This specific scenario (a multi-day full outage) would of course be quite rare: on the order of a few times a year at most, and generally much less often.
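To get a rough feel for what a best-effort promise implies numerically, here is a small worked calculation. The 90-day window and the Friday 18:00 to Monday 09:00 outage length are assumptions picked for illustration; they are not figures from the task.

```python
from datetime import timedelta


def availability_after_outage(window, outage):
    """Availability over `window` if the service was fully down for `outage`."""
    return 1 - outage / window


# Assumption: a 90-day reporting window.
quarter = timedelta(days=90)
# Assumption: outage starts Friday 18:00 and is resolved Monday 09:00 (63 hours).
weekend_outage = timedelta(hours=63)

print(f"{availability_after_outage(quarter, weekend_outage):.2%}")  # 97.08%
```

Under these assumptions, a single weekend-long outage in a quarter already rules out any target above roughly 97%, which is worth keeping in mind when choosing the SLO value.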

Oct 4 2022, 8:08 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
RKemper added a comment to T313751: Create WDQS uptime SLO.

Gehel and I met with bblack today.

Oct 4 2022, 9:23 AM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Oct 3 2022

RKemper moved T316728: decommission elastic10[48-52].eqiad.wmnet from Incoming to Needs Reporting on the Discovery-Search (Current work) board.
Oct 3 2022, 3:46 PM · Patch-For-Review, Discovery-Search (Current work)
RKemper added a comment to T319020: Reset to upstream java GC options and remove redundant JVM options.

Somewhat related: https://gerrit.wikimedia.org/r/c/operations/puppet/+/830240

Oct 3 2022, 3:39 PM · Patch-For-Review, Discovery-Search (Current work)

Sep 22 2022

RKemper moved T302530: Add Elastic cluster info to host MOTD from Ready for Dev -- SWE to Needs Reporting on the Discovery-Search (Current work) board.

It works!

Sep 22 2022, 8:44 PM · Discovery-Search (Current work)

Sep 21 2022

RKemper added a comment to T318270: Avoid overloading individual Elastic nodes with popular shards.

Per avg(rate(elasticsearch_indices_search_query_total[5m])) by (index), it seems like enwiki_content is pretty much the main source of query load. Also, looking at 4 hosts (2 of the worst offenders in terms of latency and 2 with average latency), we noticed that the 2 worst offenders had 2 enwiki_content shards each, while the 2 average-latency hosts had only one.
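The manual check described above (counting how many shards of a hot index land on each host) could be sketched like this. The host names and placements are invented for illustration; in practice the placement list would come from the cluster's shard allocation data.

```python
from collections import defaultdict


def hosts_with_multiple_shards(placements, index):
    """Return {host: shard_count} for hosts holding more than one shard of `index`.

    placements is an iterable of (host, index_name) pairs, one pair per shard.
    """
    counts = defaultdict(int)
    for host, index_name in placements:
        if index_name == index:
            counts[host] += 1
    return {host: count for host, count in counts.items() if count > 1}


# Hypothetical placements: host names are placeholders, not real hosts.
placements = [
    ("host-a", "enwiki_content"),
    ("host-a", "enwiki_content"),
    ("host-b", "enwiki_content"),
    ("host-b", "commonswiki_file"),
    ("host-c", "enwiki_content"),
    ("host-c", "enwiki_content"),
    ("host-d", "enwiki_content"),
]

print(hosts_with_multiple_shards(placements, "enwiki_content"))
```

With the sample data, only `host-a` and `host-c` are flagged, mirroring the pattern seen on the two high-latency hosts.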

Sep 21 2022, 10:30 PM · Discovery-Search (Current work), Patch-For-Review
RKemper updated the task description for T318270: Avoid overloading individual Elastic nodes with popular shards.
Sep 21 2022, 8:38 PM · Discovery-Search (Current work), Patch-For-Review
RKemper added a comment to T318270: Avoid overloading individual Elastic nodes with popular shards.

I'm curious about what we've seen that indicates that

Elastic likes to pack a lot of the larger index shards (such as commonswiki) onto a single host

Sep 21 2022, 8:13 PM · Discovery-Search (Current work), Patch-For-Review

Sep 14 2022

RKemper created T317816: Enable 10G networking in cirrus elastic clusters.
Sep 14 2022, 9:39 PM · Discovery-Search (Current work)

Sep 13 2022

RKemper added a comment to T313751: Create WDQS uptime SLO.
Intro (some context for traffic team)

Search team is working on creating an SLI to measure uptime of WDQS. We want our SLI to map to the actual user experience as closely as possible, so to that end we're trying to come up with a way to hit WDQS endpoints externally or semi-externally. Ideally the solution would be agnostic to the pool/depool state of the underlying hosts (translation: if a host is depooled, the request won't ultimately route to it).

Sep 13 2022, 8:44 PM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service
RKemper moved T317686: Upgrade eqiad cluster to Elasticsearch 7.10.2 from Incoming to Needs review on the Discovery-Search (Current work) board.
Sep 13 2022, 5:47 PM · Patch-For-Review, Discovery-Search (Current work), CirrusSearch
RKemper changed the status of T317686: Upgrade eqiad cluster to Elasticsearch 7.10.2, a subtask of T308676: Elasticsearch 7.10.2 rollout plan, from Open to In Progress.
Sep 13 2022, 5:43 PM · MW-1.39-notes (1.39.0-wmf.28; 2022-09-05), Patch-For-Review, Discovery-Search (Current work), CirrusSearch
RKemper changed the status of T317686: Upgrade eqiad cluster to Elasticsearch 7.10.2 from Open to In Progress.
Sep 13 2022, 5:43 PM · Patch-For-Review, Discovery-Search (Current work), CirrusSearch
RKemper created T317686: Upgrade eqiad cluster to Elasticsearch 7.10.2.
Sep 13 2022, 5:43 PM · Patch-For-Review, Discovery-Search (Current work), CirrusSearch

Sep 12 2022

RKemper created P34533 Query Service Deploy.
Sep 12 2022, 5:50 PM · Discovery-Search

Sep 9 2022

RKemper added a comment to T313431: Increase Elastic master-eligible nodes from 3 to 5.
ryankemper@mwmaint1002:~/elastic$ cat psi_codfw_masters.lst
elastic2054.codfw.wmnet:9700
elastic2076.codfw.wmnet:9700
elastic2080.codfw.wmnet:9700
ryankemper@mwmaint1002:~/elastic$ python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9643/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst
seeds=['elastic2025.codfw.wmnet:9300', 'elastic2031.codfw.wmnet:9300', 'elastic2042.codfw.wmnet:9300']
to_ret={'chi': {'seeds': ['elastic2025.codfw.wmnet:9300', 'elastic2031.codfw.wmnet:9300', 'elastic2042.codfw.wmnet:9300']}}
seeds=['elastic2054.codfw.wmnet:9700', 'elastic2076.codfw.wmnet:9700', 'elastic2080.codfw.wmnet:9700']
to_ret={'psi': {'seeds': ['elastic2054.codfw.wmnet:9700', 'elastic2076.codfw.wmnet:9700', 'elastic2080.codfw.wmnet:9700']}}
seeds=['elastic2042.codfw.wmnet:9500', 'elastic2047.codfw.wmnet:9500', 'elastic2052.codfw.wmnet:9500']
to_ret={'omega': {'seeds': ['elastic2042.codfw.wmnet:9500', 'elastic2047.codfw.wmnet:9500', 'elastic2052.codfw.wmnet:9500']}}

Set new seeds for psi to resolve elasticsearch settings alert.

(We'll need to re-do this step w/ the new values when we expand up to 5 masters today)
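For readers unfamiliar with this step, the following is a minimal sketch of how seed lists like the ones above could be turned into a cluster-settings request body. It is not the actual push_cross_cluster_conf.py script, whose source is not shown here; the helper names are hypothetical, and the body shape assumes Elasticsearch's `cluster.remote.<alias>.seeds` persistent setting.

```python
import json


def parse_seed_list(text):
    """Return the non-empty, stripped lines of a .lst file's contents."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_remote_settings(seeds_by_cluster):
    """Build a persistent cluster-settings body that sets remote-cluster seeds."""
    remote = {name: {"seeds": seeds} for name, seeds in seeds_by_cluster.items()}
    return {"persistent": {"cluster": {"remote": remote}}}


# Contents of psi_codfw_masters.lst as shown in the comment above.
psi_lst = """elastic2054.codfw.wmnet:9700
elastic2076.codfw.wmnet:9700
elastic2080.codfw.wmnet:9700
"""

body = build_remote_settings({"psi": parse_seed_list(psi_lst)})
print(json.dumps(body, indent=2))
```

A body like this would be sent to the `_cluster/settings` endpoint; when the master-eligible set grows to 5, only the `.lst` contents would need to change.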

Sep 9 2022, 9:13 PM · Patch-For-Review, Discovery-Search (Current work)
RKemper updated the task description for T317442: Sanity-check indices before promotion.
Sep 9 2022, 8:18 PM · Discovery-Search
RKemper changed the status of T317374: Search now caps total results count at 10k because of elasticsearch 7 upgrade from Open to In Progress.
Sep 9 2022, 7:13 PM · Discovery-Search (Current work), CirrusSearch
RKemper changed the status of T317374: Search now caps total results count at 10k because of elasticsearch 7 upgrade, a subtask of T308676: Elasticsearch 7.10.2 rollout plan, from Open to In Progress.
Sep 9 2022, 7:13 PM · MW-1.39-notes (1.39.0-wmf.28; 2022-09-05), Patch-For-Review, Discovery-Search (Current work), CirrusSearch

Sep 7 2022

RKemper updated the title for P34126 es7 remote cluster client roles from untitled to es7 remote cluster client roles.
Sep 7 2022, 4:51 PM · Discovery-Search
RKemper added a comment to P34126 es7 remote cluster client roles.

https://github.com/elastic/elasticsearch/issues/62445 && https://github.com/elastic/elasticsearch/pull/62730/files

Sep 7 2022, 4:51 PM · Discovery-Search
RKemper added a project to P34126 es7 remote cluster client roles: Discovery-Search.
Sep 7 2022, 4:50 PM · Discovery-Search
RKemper added a comment to T313431: Increase Elastic master-eligible nodes from 3 to 5.
ryankemper@mwmaint1002:~/elastic$ cat psi_codfw_masters.lst
elastic2054.codfw.wmnet:9700
elastic2076.codfw.wmnet:9700
elastic2080.codfw.wmnet:9700
ryankemper@mwmaint1002:~/elastic$ python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9643/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst
seeds=['elastic2025.codfw.wmnet:9300', 'elastic2031.codfw.wmnet:9300', 'elastic2042.codfw.wmnet:9300']
to_ret={'chi': {'seeds': ['elastic2025.codfw.wmnet:9300', 'elastic2031.codfw.wmnet:9300', 'elastic2042.codfw.wmnet:9300']}}
seeds=['elastic2054.codfw.wmnet:9700', 'elastic2076.codfw.wmnet:9700', 'elastic2080.codfw.wmnet:9700']
to_ret={'psi': {'seeds': ['elastic2054.codfw.wmnet:9700', 'elastic2076.codfw.wmnet:9700', 'elastic2080.codfw.wmnet:9700']}}
seeds=['elastic2042.codfw.wmnet:9500', 'elastic2047.codfw.wmnet:9500', 'elastic2052.codfw.wmnet:9500']
to_ret={'omega': {'seeds': ['elastic2042.codfw.wmnet:9500', 'elastic2047.codfw.wmnet:9500', 'elastic2052.codfw.wmnet:9500']}}
Sep 7 2022, 4:28 PM · Patch-For-Review, Discovery-Search (Current work)

Sep 6 2022

RKemper updated the task description for T316922: Add pfischer to #wmf-nda on Phab and to #wmf on LDAP.
Sep 6 2022, 5:51 PM · SRE, LDAP-Access-Requests, Discovery-Search (Current work), WMF-NDA-Requests
RKemper updated the task description for T316090: Production Shell access for Peter.
Sep 6 2022, 5:50 PM · Data-Engineering-Planning, Patch-For-Review, SRE, SRE-Access-Requests, Discovery-Search (Current work)
RKemper added a comment to T316922: Add pfischer to #wmf-nda on Phab and to #wmf on LDAP.

@pfischer Per the above, you might need to log into Wikitech directly. Want to give that a try when you get a chance and then we'll see if you appear in WikiTech users afterwards?

Sep 6 2022, 5:45 PM · SRE, LDAP-Access-Requests, Discovery-Search (Current work), WMF-NDA-Requests
RKemper added a comment to T316922: Add pfischer to #wmf-nda on Phab and to #wmf on LDAP.

Update: Unfortunately, even though @pfischer has created the Wikitech dev account, nothing has changed since @MatthewVernon's comment above:

sudo check_user [redacted for spam prevention]
WikiTech Users:
	no user found with [redacted for spam prevention]

Gsuit User:
	Primary Email:	[redacted for spam prevention]
	Aliases:
	title:		No title found.
	manager:	No manager found.
	agreedToTerms:	True

I'm not sure if it takes a while to sync this information. I will reach out to IT in Slack to see if they have any ideas on the Gsuite side of things.

Sep 6 2022, 5:44 PM · SRE, LDAP-Access-Requests, Discovery-Search (Current work), WMF-NDA-Requests

Aug 31 2022

RKemper added a comment to T286388: Global search fails with HTTP 500.

The upstream change is that elastic was upgraded from 6.8 to 7.10.2 on cloudelastic, with production services migrating this week and the next. I had taken a quick look over global-search and it seemed to be avoiding some of the biggest changes (no more mapping types), but I seem to have missed these other details.

For the first few errors, against .ltrstore and .tasks, it looks like elastic has become more strict in how it handles missing fields. In 6.x elasticsearch always allowed you to query unknown fields and it would resolve as equivalent to a match_none query, since there was nothing to match against. It seems elastic 7.x is now being strict and issuing errors.
For this I suspect we would need to change the index glob patterns to use *_content, *_general, *_file to target the wiki-specific indices.
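As a quick illustration of what those narrower patterns would and would not select, here is a small sketch using Python's glob matching. The index names in the example list are samples (the two dot-prefixed names come from the errors mentioned above); Elasticsearch does its own wildcard resolution server-side, so this only mimics the effect.

```python
from fnmatch import fnmatchcase

# Patterns suggested in the comment above.
WIKI_INDEX_PATTERNS = ("*_content", "*_general", "*_file")


def is_wiki_index(name):
    """Return True if the index name matches one of the wiki-specific patterns."""
    return any(fnmatchcase(name, pattern) for pattern in WIKI_INDEX_PATTERNS)


# Sample index names, for illustration only.
indices = [".ltrstore", ".tasks", "enwiki_content", "enwiki_general", "commonswiki_file"]
print([name for name in indices if is_wiki_index(name)])
```

The internal `.ltrstore` and `.tasks` indices fall outside all three patterns, so queries against fields they lack would no longer reach them.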

The invalid regex exception is unexpected: I'm not aware of any changes we had to make on the cirrus side to get our integration tests passing for regex searches. We run our plain searches against the source field using the query_string query, though, as opposed to a match query as seen here. That might be harder to do here, as it requires having some code that escapes all the things users might do in query_string context. I'll have to do some experimenting to find what is appropriate here.
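To make the escaping concern concrete, here is a rough sketch of what such a helper might look like. It is an assumption-laden illustration, not code from cirrus or global-search: the function name is hypothetical, and the character list follows the reserved characters documented for the Elasticsearch query_string query, where `<` and `>` cannot be escaped and have to be removed.

```python
import re

# Reserved characters in query_string syntax that can be backslash-escaped.
# '&' and '|' are escaped individually, which also covers '&&' and '||'.
_RESERVED = re.compile(r'([+\-=&|!(){}\[\]^"~*?:\\/])')


def escape_query_string(text):
    """Escape user input so query_string treats it as literal text.

    '<' and '>' cannot be escaped in query_string syntax, so they are dropped.
    """
    text = text.replace("<", "").replace(">", "")
    return _RESERVED.sub(r"\\\1", text)


print(escape_query_string("foo(bar):baz/qux"))
```

Whether blanket escaping is appropriate depends on which query_string features users should keep (wildcards, phrases, and so on), so a real helper would likely escape more selectively.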

Aug 31 2022, 5:18 PM · Tool-global-search
RKemper created T316729: decommission elastic2035.codfw.wmnet.
Aug 31 2022, 12:25 AM · SRE, Patch-For-Review, ops-codfw, decommission-hardware
RKemper added a comment to T309810: Service implementation for elastic1[084-102].eqiad.wmnet.

Remembered I still need to create dc-ops decom ticket for the 5 eqiad elastic refresh hosts

Aug 31 2022, 12:21 AM · Patch-For-Review, Discovery-Search (Current work)
RKemper created T316728: decommission elastic10[48-52].eqiad.wmnet.
Aug 31 2022, 12:20 AM · Patch-For-Review, Discovery-Search (Current work)

Aug 30 2022

RKemper created T316719: Upgrade codfw cluster to Elasticsearch 7.10.2.
Aug 30 2022, 8:40 PM · Patch-For-Review, Discovery-Search (Current work), CirrusSearch
RKemper moved T315124: Add OpenDataSweden to the SPARQL whitelist from Needs review to Needs Reporting on the Discovery-Search (Current work) board.

This has been deployed.

Aug 30 2022, 5:03 AM · Discovery-Search (Current work), Wikibase.cloud, Wikidata, Wikidata-Query-Service

Aug 26 2022

RKemper added a comment to T313751: Create WDQS uptime SLO.

Some quick pros/cons of two possible approaches to getting the SLI metrics. Approach #1 is to run a query or set of queries per DC at a certain frequency; approach #2 is to run a query on each host at a certain frequency.

Aug 26 2022, 7:56 AM · Discovery-Search (Current work), Wikidata, Wikidata-Query-Service

Aug 23 2022

RKemper added a comment to T315604: Upgrade relforge cluster to 7.10.2.
[2022-08-23T19:46:22,669][WARN ][o.e.c.c.ClusterFormationFailureHelper] [relforge1003-relforge-eqiad-small-alpha] master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered [{relforge1003-relforge-eqiad-small-alpha}{rrFzJB13TIa9ElcCQ8v6NQ}{vKrW187kSR-4Rsy7wCg5Qg}{10.64.5.37}{10.64.5.37:9500}{dimr}{hostname=relforge1003, rack=A2, fqdn=relforge1003.eqiad.wmnet, row=A}]; discovery will continue using [] from hosts providers and [{relforge1003-relforge-eqiad-small-alpha}{rrFzJB13TIa9ElcCQ8v6NQ}{vKrW187kSR-4Rsy7wCg5Qg}{10.64.5.37}{10.64.5.37:9500}{dimr}{hostname=relforge1003, rack=A2, fqdn=relforge1003.eqiad.wmnet, row=A}] from last-known cluster state; node term 0, last-accepted version 15 in term 0
[2022-08-23T19:46:31,615][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [relforge1003-relforge-eqiad-small-alpha] no known master node, scheduling a retry
Aug 23 2022, 7:53 PM · Discovery-Search (Current work), CirrusSearch
RKemper added a comment to T309810: Service implementation for elastic1[084-102].eqiad.wmnet.

Remembered I still need to create dc-ops decom ticket for the 5 eqiad elastic refresh hosts

Aug 23 2022, 7:47 PM · Patch-For-Review, Discovery-Search (Current work)
RKemper added a comment to T300943: Service implementation for elastic20[61-86].codfw.wmnet.
Aug 23 2022, 7:25 PM · Patch-For-Review, Discovery-Search (Current work)

Aug 22 2022

RKemper reassigned T315604: Upgrade relforge cluster to 7.10.2 from RKemper to bking.
Aug 22 2022, 3:36 PM · Discovery-Search (Current work), CirrusSearch
RKemper claimed T315604: Upgrade relforge cluster to 7.10.2.
Aug 22 2022, 3:35 PM · Discovery-Search (Current work), CirrusSearch