Many search suggestions missing when connecting to eqiad, but not when connecting to codfw
Closed, Resolved · Public · BUG REPORT

Authored By

	matmarex
	Thu, Apr 25, 6:58 PM

Description

See bug report here: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Search_suggestions_seem_to_be_missing_obvious_choices

For a simple example, try this query: https://en.wikipedia.org/w/api.php?action=opensearch&format=json&formatversion=2&search=sonia+sotomay&namespace=0&limit=10

Expected response on codfw (mwdebug2001): ["sonia sotomay",["Sonia Sotomayor","Sonia Sotomayor Supreme Court nomination","Sonia Sotomayor Learning Academies","Sonia M. Sotomayor High School"],["","","",""],["https://en.wikipedia.org/wiki/Sonia_Sotomayor","https://en.wikipedia.org/wiki/Sonia_Sotomayor_Supreme_Court_nomination","https://en.wikipedia.org/wiki/Sonia_Sotomayor_Learning_Academies","https://en.wikipedia.org/wiki/Sonia_M._Sotomayor_High_School"]]

Actual response on eqiad (mwdebug1001): ["sonia sotomay",[],[],[]]
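A minimal sketch for comparing the two responses above, assuming the standard `action=opensearch` layout of `[query, [titles], [descriptions], [urls]]`. The helper names (`opensearch_titles`, `missing_titles`) are hypothetical, and the sample payloads are abbreviated from the responses quoted above (URLs omitted).

```python
import json


def opensearch_titles(raw: str) -> list[str]:
    """Return the suggested titles from an action=opensearch JSON response."""
    payload = json.loads(raw)
    # Layout: [query, [titles], [descriptions], [urls]]
    return list(payload[1])


def missing_titles(expected_raw: str, actual_raw: str) -> list[str]:
    """Titles present in the expected response but absent from the actual one."""
    actual = set(opensearch_titles(actual_raw))
    return [t for t in opensearch_titles(expected_raw) if t not in actual]


# Abbreviated from the codfw response above (URL list left empty for brevity).
CODFW = json.dumps([
    "sonia sotomay",
    [
        "Sonia Sotomayor",
        "Sonia Sotomayor Supreme Court nomination",
        "Sonia Sotomayor Learning Academies",
        "Sonia M. Sotomayor High School",
    ],
    ["", "", "", ""],
    [],
])
# The eqiad response exactly as quoted above.
EQIAD = '["sonia sotomay",[],[],[]]'

if __name__ == "__main__":
    for title in missing_titles(CODFW, EQIAD):
        print("missing on eqiad:", title)
```

Run against the two samples, this reports all four codfw titles as missing on eqiad.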

Related Objects

Mentioned In: T363694: Post incident tasks: Search missing results/unavailable for some eqiad users
T335974: ‘Remember selection’ option / Vector-2022 have search results that do not start with user input
T363521: Completion suggester can promote a bad build
Mentioned Here: E6: Phabricator Upgrade
T334230: Bring Juniper switches in eqiad racks E5-7 and F5-7 online and ready for servers
T363521: Completion suggester can promote a bad build

Event Timeline

matmarex created this task. Thu, Apr 25, 6:58 PM

Restricted Application added a subscriber: Aklapper. Thu, Apr 25, 6:58 PM

Nikerabbit subscribed. Thu, Apr 25, 6:59 PM

matmarex added a project: SRE. Thu, Apr 25, 7:01 PM

Hmm, I can confirm this is happening. The completion index is rebuilt from scratch every day in each datacenter. Usually the two are the same, but somehow the eqiad index is about half the size of the codfw index (6.7g vs 14.5g). Autocomplete is fairly high traffic, so we should probably shift the autocomplete traffic to codfw until this is fixed, which probably requires a rebuild and a couple of hours.
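The size comparison in this comment can be turned into a rough divergence check. This is only a sketch: the function names and the 0.9 threshold are assumptions, and it only understands the single-letter size suffixes used above (`m`, `g`, `t`).

```python
def parse_size_gb(text: str) -> float:
    """Parse a size such as '6.7g' into gigabytes (single-letter suffix only)."""
    units = {"m": 1 / 1024, "g": 1.0, "t": 1024.0}
    text = text.strip().lower()
    return float(text[:-1]) * units[text[-1]]


def size_ratio(size_a: str, size_b: str) -> float:
    """Ratio of the smaller index size to the larger one, in (0, 1]."""
    a, b = parse_size_gb(size_a), parse_size_gb(size_b)
    return min(a, b) / max(a, b)


def indices_diverge(size_a: str, size_b: str, min_ratio: float = 0.9) -> bool:
    """True when the two index sizes differ by more than the allowed ratio."""
    return size_ratio(size_a, size_b) < min_ratio


if __name__ == "__main__":
    # Sizes quoted in the comment above: eqiad vs codfw.
    print(round(size_ratio("6.7g", "14.5g"), 2), indices_diverge("6.7g", "14.5g"))
```

With the sizes reported here, the ratio is roughly 0.46, well under any threshold that would tolerate normal day-to-day drift between datacenters.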

Change #1024478 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Shift autocomplete traffic to codfw

https://gerrit.wikimedia.org/r/1024478

gerritbot added a project: Patch-For-Review. Thu, Apr 25, 7:31 PM

Gehel added a project: Discovery-Search (Current work). Thu, Apr 25, 7:32 PM

Gehel moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.

Mentioned in SAL (#wikimedia-operations) [2024-04-25T19:33:12Z] <ebernhardson> T363516 started manual rebuild of enwiki titlesuggest indices in eqiad

Decided against shuffling traffic; the rebuild is almost complete already for enwiki. I can see in the logs where the enwiki eqiad build jumped from 44% to complete, but no reason why. Nothing in logstash for that period either. I've created T363521 to put something in place to prevent this in the future.
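One possible shape for such a safeguard is to refuse promotion when the new index holds far fewer documents than the build expected to write. This is a hypothetical sketch, not the change made under T363521; `BuildResult`, `safe_to_promote`, and the 0.95 threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class BuildResult:
    """Hypothetical summary of one completion-index build."""
    expected_docs: int  # documents the build intended to index
    indexed_docs: int   # documents actually present in the new index


def safe_to_promote(build: BuildResult, min_fraction: float = 0.95) -> bool:
    """Allow promotion only if the build indexed most of what it expected."""
    if build.expected_docs <= 0:
        # Nothing to compare against, so refuse rather than guess.
        return False
    return build.indexed_docs / build.expected_docs >= min_fraction


if __name__ == "__main__":
    # A build that stopped at 44% but was reported as complete.
    print(safe_to_promote(BuildResult(expected_docs=1000, indexed_docs=440)))
```

A build that stopped at 44% would be rejected by this check, leaving the previous day's index in place.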

Change #1024478 abandoned by Ebernhardson:

[operations/mediawiki-config@master] cirrus: Shift autocomplete traffic to codfw

Reason:

rebuild only took 45 minutes, decided not to shuffle while it was in progress

https://gerrit.wikimedia.org/r/1024478

There's someone reporting that they're still not seeing the expected results for some queries, although I can't reproduce: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#c-2804:F14:8092:9F01:3468:323E:5807:DBA8-20240425214000-Matma_Rex-20240425212500

Never mind, they just said it's fixed :)

Change #1024478 restored by DCausse:

[operations/mediawiki-config@master] cirrus: Shift autocomplete traffic to codfw

https://gerrit.wikimedia.org/r/1024478

This is still happening, raising to UBN

Change #1024478 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Shift autocomplete traffic to codfw

https://gerrit.wikimedia.org/r/1024478

Mentioned in SAL (#wikimedia-operations) [2024-04-26T09:36:24Z] <dcausse@deploy1002> Started scap: Backport for [[gerrit:1024478|cirrus: Shift autocomplete traffic to codfw (T363516)]]

Mentioned in SAL (#wikimedia-operations) [2024-04-26T09:41:30Z] <dcausse@deploy1002> dcausse and ebernhardson: Backport for [[gerrit:1024478|cirrus: Shift autocomplete traffic to codfw (T363516)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-04-26T09:54:21Z] <dcausse@deploy1002> Finished scap: Backport for [[gerrit:1024478|cirrus: Shift autocomplete traffic to codfw (T363516)]] (duration: 17m 57s)

completion traffic is now served from codfw which has proper indices, lowering prio

BTullis subscribed. Fri, Apr 26, 10:08 AM

Thibaut120094 subscribed. Fri, Apr 26, 10:30 AM

akosiaris subscribed. Fri, Apr 26, 1:01 PM

BCornwall removed a project: SRE. Fri, Apr 26, 5:20 PM

bking subscribed. Fri, Apr 26, 5:26 PM

Apologies folks, it seems this was all due to an oversight on my part when we brought the new eqiad racks E5, E6, E7, F5, F6 and F7 live earlier this year (see T334230).

What we missed was to create the puppet patches to add new vlan sub-interfaces on our LVS load-balancers to connect them directly to the new subnets in those racks. This is needed as we are using IPVS in MAC-rewrite mode, which means it needs to be on the same Ethernet segment as any realserver it has to load-balance traffic to (it works by re-writing the MAC header only, allowing it to get the packet to the backend without changing the destination IP address).

The problem should be resolved now after merging these patches:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1024776
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1024782

The specific problem here was reachability from lvs1019, which was announcing the VIP for search.svc.eqiad.wmnet (10.2.2.30), to elastic1105 (rack E5) and elastic1107 (rack F5). For example, without the direct layer-2 adjacency to the hosts on private1-e5-eqiad, the lvs was trying to use its default route to reach the new machines:

cmooney@lvs1019:~$ ip route get fibmatch 10.64.152.2 
default via 10.64.32.1 dev eno1np0 onlink

Now of course when we tried to test this connectivity with pings or manual curls, everything worked fine. Traffic would route out via the default route and the network would get it to the host in the new rack. But IPVS was unable to load-balance traffic to the new hosts as it needs the direct layer-2 adjacency and to be able to ARP for their IPs.
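Because pings and curls succeed either way, a check on the route itself is more telling. A minimal sketch, assuming the one-line `ip route get fibmatch` output format shown above; the function name `uses_default_route` is hypothetical.

```python
def uses_default_route(fibmatch_output: str) -> bool:
    """True if the matched FIB entry is the default route.

    A default-route match means the destination is only reachable through a
    gateway, so there is no layer-2 adjacency for IPVS MAC rewriting.
    """
    lines = fibmatch_output.strip().splitlines()
    if not lines:
        raise ValueError("empty fibmatch output")
    return lines[0].split()[0] == "default"


if __name__ == "__main__":
    # Output quoted above from lvs1019 before the fix.
    print(uses_default_route("default via 10.64.32.1 dev eno1np0 onlink"))
```

A connected route (a line beginning with a prefix such as `10.64.152.0/24 dev ...`) makes the function return `False`.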

After the change the lvs hosts are now directly connected to the vlans in the new racks, so they can send traffic directly to the elastic servers at layer-2:

cmooney@lvs1019:~$ ip route get fibmatch 10.64.152.2 
10.64.152.0/24 dev vlan1047 proto kernel scope link src 10.64.152.19

cmooney@lvs1019:~$ ip neigh show 10.64.152.2 
10.64.152.2 dev vlan1047 lladdr 14:23:f2:c2:96:e0 REACHABLE
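The adjacency requirement can also be expressed as a subnet membership test: a realserver is usable by IPVS in MAC-rewrite mode only if its IP falls inside a subnet the load balancer is directly attached to. A minimal sketch using Python's `ipaddress` module; `is_l2_adjacent` is a hypothetical helper, and the `10.64.32.0/22` subnet for the primary interface is an assumption (only the gateway `10.64.32.1` appears above).

```python
import ipaddress


def is_l2_adjacent(realserver_ip: str, connected_subnets: list[str]) -> bool:
    """True if the realserver IP lies in a directly connected subnet."""
    addr = ipaddress.ip_address(realserver_ip)
    return any(addr in ipaddress.ip_network(net) for net in connected_subnets)


# ASSUMPTION: prefix length of the primary interface's subnet is a guess.
SUBNETS_BEFORE = ["10.64.32.0/22"]
# After the fix, vlan1047 adds the subnet shown in the route output above.
SUBNETS_AFTER = SUBNETS_BEFORE + ["10.64.152.0/24"]

if __name__ == "__main__":
    print(is_l2_adjacent("10.64.152.2", SUBNETS_BEFORE))
    print(is_l2_adjacent("10.64.152.2", SUBNETS_AFTER))
```

Running such a check against every realserver in a pool whenever new racks or subnets come online would flag this class of gap before traffic is affected.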

Decided to delay bringing traffic back to eqiad until Monday. To be confident in the daily indices we would probably want to rebuild them all, but that takes many hours and it would finish only a few hours before I'm heading out for the weekend. Didn't seem like a great time to bring traffic back. The daily rebuilds will run; we can look at them on Monday and bring traffic back if everything is back to normal.

EBernhardson mentioned this in T363521: Completion suggester can promote a bad build. Fri, Apr 26, 7:04 PM

Jack_who_built_the_house mentioned this in T335974: ‘Remember selection’ option / Vector-2022 have search results that do not start with user input. Fri, Apr 26, 9:28 PM

Change #1025176 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/mediawiki-config@master] Revert "cirrus: Shift autocomplete traffic to codfw"

https://gerrit.wikimedia.org/r/1025176

Given that this has now happened a second time, something needs to be done to ensure it doesn’t happen again in another year.

bking mentioned this in T363694: Post incident tasks: Search missing results/unavailable for some eqiad users. Mon, Apr 29, 1:23 PM

Change #1025176 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "cirrus: Shift autocomplete traffic to codfw"

https://gerrit.wikimedia.org/r/1025176

Mentioned in SAL (#wikimedia-operations) [2024-04-29T13:27:10Z] <dcausse@deploy1002> Started scap: Backport for [[gerrit:1025176|Revert "cirrus: Shift autocomplete traffic to codfw" (T363516)]]

Mentioned in SAL (#wikimedia-operations) [2024-04-29T13:29:39Z] <dcausse@deploy1002> dcausse: Backport for [[gerrit:1025176|Revert "cirrus: Shift autocomplete traffic to codfw" (T363516)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-04-29T13:43:13Z] <dcausse@deploy1002> Finished scap: Backport for [[gerrit:1025176|Revert "cirrus: Shift autocomplete traffic to codfw" (T363516)]] (duration: 16m 02s)

Krinkle subscribed. Mon, Apr 29, 2:28 PM

The general issue here, missing search suggestions, is resolved, and the temporary mitigations put in place have been rolled back. I'm calling this issue done. One of the root causes, network connectivity, has been resolved. The other root cause, promoting a bad index, is tracked in T363521. Some changes have already been put in place to make this code more resilient to network failures, but more might still be done.

Gehel closed this task as Resolved. Fri, May 3, 9:21 AM

Gehel claimed this task.
