Many search suggestions missing when connecting to eqiad, but not when connecting to codfw
Closed, Resolved | Public | Bug Report

Description

See bug report here: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Search_suggestions_seem_to_be_missing_obvious_choices

For a simple example, try this query: https://en.wikipedia.org/w/api.php?action=opensearch&format=json&formatversion=2&search=sonia+sotomay&namespace=0&limit=10

Expected response on codfw (mwdebug2001): ["sonia sotomay",["Sonia Sotomayor","Sonia Sotomayor Supreme Court nomination","Sonia Sotomayor Learning Academies","Sonia M. Sotomayor High School"],["","","",""],["https://en.wikipedia.org/wiki/Sonia_Sotomayor","https://en.wikipedia.org/wiki/Sonia_Sotomayor_Supreme_Court_nomination","https://en.wikipedia.org/wiki/Sonia_Sotomayor_Learning_Academies","https://en.wikipedia.org/wiki/Sonia_M._Sotomayor_High_School"]]

Actual response on eqiad (mwdebug1001): ["sonia sotomay",[],[],[]]
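
The difference between the two responses can be checked mechanically. The following is a minimal sketch, not part of MediaWiki: it parses the opensearch payloads quoted above (the codfw URL array is trimmed for brevity) and extracts the list of suggested titles. The helper name `suggested_titles` is an assumption made for illustration.

```python
import json
from typing import List


def suggested_titles(opensearch_body: str) -> List[str]:
    """Return the titles from an opensearch JSON response.

    The response is a 4-element array: [query, titles, descriptions, urls].
    """
    payload = json.loads(opensearch_body)
    return payload[1]


# Payloads from the task description; the codfw URL list is trimmed here.
codfw_body = (
    '["sonia sotomay",["Sonia Sotomayor",'
    '"Sonia Sotomayor Supreme Court nomination",'
    '"Sonia Sotomayor Learning Academies",'
    '"Sonia M. Sotomayor High School"],["","","",""],[]]'
)
eqiad_body = '["sonia sotomay",[],[],[]]'

codfw_titles = suggested_titles(codfw_body)
eqiad_titles = suggested_titles(eqiad_body)

# Titles that one datacenter returns and the other does not.
missing_in_eqiad = [t for t in codfw_titles if t not in eqiad_titles]
print(f"codfw: {len(codfw_titles)} titles, eqiad: {len(eqiad_titles)} titles")
print("missing in eqiad:", missing_in_eqiad)
```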

Event Timeline

Hmm, I can confirm this is happening. The completion index is rebuilt every day in each datacenter. Usually the two are the same, but somehow the eqiad index is about half the size of the codfw index (6.7g vs 14.5g). Autocomplete is fairly high traffic, so we should probably shift the autocomplete traffic to codfw until this can be fixed, which probably requires a rebuild and a couple of hours.

Change #1024478 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Shift autocomplete traffic to codfw

https://gerrit.wikimedia.org/r/1024478

Mentioned in SAL (#wikimedia-operations) [2024-04-25T19:33:12Z] <ebernhardson> T363516 started manual rebuild of enwiki titlesuggest indices in eqiad

Decided against shuffling traffic; the rebuild is almost complete already for enwiki. I can see in the logs where the enwiki eqiad build jumped from 44% to complete, but no reason why. Nothing in logstash for that period either. I've created T363521 to put something in place to prevent this in the future.

Change #1024478 abandoned by Ebernhardson:

[operations/mediawiki-config@master] cirrus: Shift autocomplete traffic to codfw

Reason:

rebuild only took 45 minutes, decided not to shuffle while it was in progress

https://gerrit.wikimedia.org/r/1024478

Never mind, they just said it's fixed :)

Change #1024478 restored by DCausse:

[operations/mediawiki-config@master] cirrus: Shift autocomplete traffic to codfw

https://gerrit.wikimedia.org/r/1024478

dcausse triaged this task as Unbreak Now! priority. Fri, Apr 26, 9:13 AM
dcausse added a project: CirrusSearch.
dcausse subscribed.

This is still happening, raising to UBN

Change #1024478 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Shift autocomplete traffic to codfw

https://gerrit.wikimedia.org/r/1024478

Mentioned in SAL (#wikimedia-operations) [2024-04-26T09:36:24Z] <dcausse@deploy1002> Started scap: Backport for [[gerrit:1024478|cirrus: Shift autocomplete traffic to codfw (T363516)]]

Mentioned in SAL (#wikimedia-operations) [2024-04-26T09:41:30Z] <dcausse@deploy1002> dcausse and ebernhardson: Backport for [[gerrit:1024478|cirrus: Shift autocomplete traffic to codfw (T363516)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-04-26T09:54:21Z] <dcausse@deploy1002> Finished scap: Backport for [[gerrit:1024478|cirrus: Shift autocomplete traffic to codfw (T363516)]] (duration: 17m 57s)

dcausse lowered the priority of this task from Unbreak Now! to Medium. Fri, Apr 26, 10:00 AM

Completion traffic is now served from codfw, which has proper indices; lowering priority.

Apologies folks, it seems this was all due to an oversight on my part when we brought the new eqiad racks E5, E6, E7, F5, F6 and F7 live earlier this year (see T334230).

What we missed was creating the puppet patches that add new vlan sub-interfaces on our LVS load-balancers to connect them directly to the new subnets in those racks. This is needed because we use IPVS in MAC-rewrite mode, which means the load-balancer needs to be on the same Ethernet segment as any realserver it load-balances traffic to (it works by rewriting the MAC header only, which gets the packet to the backend without changing the destination IP address).
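
To make the MAC-rewrite behaviour concrete, here is a toy model of the forwarding step. It is only an illustration of the idea described above, not IPVS code: the `Frame` type, the function name, and the load-balancer MAC are hypothetical, while the VIP and the realserver MAC are the ones quoted elsewhere in this task.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Frame:
    dst_mac: str   # Ethernet destination (layer 2)
    dst_ip: str    # IP destination (layer 3)
    payload: bytes


def forward_mac_rewrite(frame: Frame, realserver_mac: str) -> Frame:
    """Forward a frame to a realserver by rewriting only the destination MAC.

    The IP header is left untouched, so the realserver still sees the VIP
    as the destination. This only works if the realserver's MAC is
    reachable on a directly attached Ethernet segment.
    """
    return replace(frame, dst_mac=realserver_mac)


# Hypothetical load-balancer MAC; VIP and realserver MAC are from this task.
incoming = Frame(dst_mac="02:00:00:00:00:01", dst_ip="10.2.2.30", payload=b"GET /")
forwarded = forward_mac_rewrite(incoming, "14:23:f2:c2:96:e0")

print(forwarded.dst_mac)  # rewritten to the realserver
print(forwarded.dst_ip)   # still the VIP
```

Because the destination IP never changes, the packet cannot be routed through a gateway to the backend; the load-balancer has to resolve the backend's MAC itself, which is why the missing vlan sub-interfaces mattered.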

The problem should be resolved now after merging these patches:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1024776
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1024782

The specific problem here was reachability from lvs1019, which was announcing the VIP for search.svc.eqiad.wmnet (10.2.2.30), to elastic1105 (rack E5) and elastic1107 (rack F5). For example, without the direct layer-2 adjacency to the hosts on private1-e5-eqiad, the lvs was trying to use its default route to reach the new machines:

cmooney@lvs1019:~$ ip route get fibmatch 10.64.152.2 
default via 10.64.32.1 dev eno1np0 onlink

Now of course when we tried to test this connectivity with pings or manual curls, everything worked fine. Traffic would route out via the default route and the network would get it to the host in the new rack. But IPVS was unable to load-balance traffic to the new hosts as it needs the direct layer-2 adjacency and to be able to ARP for their IPs.

After the change the lvs hosts are now directly connected to the vlans in the new racks, so they can send traffic directly to the elastic servers at layer-2:

cmooney@lvs1019:~$ ip route get fibmatch 10.64.152.2 
10.64.152.0/24 dev vlan1047 proto kernel scope link src 10.64.152.19
cmooney@lvs1019:~$ ip neigh show 10.64.152.2 
10.64.152.2 dev vlan1047 lladdr 14:23:f2:c2:96:e0 REACHABLE
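
The before/after difference in the route lookups above comes down to whether the target address falls inside a subnet attached to one of the load-balancer's interfaces. A rough sketch of that check with Python's `ipaddress` module follows. The function name is hypothetical, and the `/22` prefix on `eno1np0` is an assumption (the source only shows the gateway 10.64.32.1); the `vlan1047` subnet is taken from the output above.

```python
import ipaddress
from typing import Dict, Optional


def on_link_interface(target: str, interfaces: Dict[str, str]) -> Optional[str]:
    """Return the interface whose attached subnet contains target, or None.

    None means the host is only reachable via a routed path (for example
    the default route), so there is no direct layer-2 adjacency.
    """
    addr = ipaddress.ip_address(target)
    for name, cidr in interfaces.items():
        if addr in ipaddress.ip_network(cidr, strict=False):
            return name
    return None


# Assumed prefix length for eno1np0; not stated in the task.
before = {"eno1np0": "10.64.32.0/22"}
# After the puppet change the new vlan sub-interface is present.
after = {**before, "vlan1047": "10.64.152.0/24"}

print(on_link_interface("10.64.152.2", before))
print(on_link_interface("10.64.152.2", after))
```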

Decided to delay bringing traffic back to eqiad until Monday. To be confident in the daily indices we would probably want to rebuild them all, but that takes many hours and would finish only a few hours before I head out for the weekend. It didn't seem like a great time to bring traffic back. The daily rebuilds will run; we can look at them on Monday and bring traffic back if everything is back to normal.

Change #1025176 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/mediawiki-config@master] Revert "cirrus: Shift autocomplete traffic to codfw"

https://gerrit.wikimedia.org/r/1025176

Given that this has now happened a second time, something needs to be done to ensure it doesn't happen again in another year.

Change #1025176 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "cirrus: Shift autocomplete traffic to codfw"

https://gerrit.wikimedia.org/r/1025176

Mentioned in SAL (#wikimedia-operations) [2024-04-29T13:27:10Z] <dcausse@deploy1002> Started scap: Backport for [[gerrit:1025176|Revert "cirrus: Shift autocomplete traffic to codfw" (T363516)]]

Mentioned in SAL (#wikimedia-operations) [2024-04-29T13:29:39Z] <dcausse@deploy1002> dcausse: Backport for [[gerrit:1025176|Revert "cirrus: Shift autocomplete traffic to codfw" (T363516)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-04-29T13:43:13Z] <dcausse@deploy1002> Finished scap: Backport for [[gerrit:1025176|Revert "cirrus: Shift autocomplete traffic to codfw" (T363516)]] (duration: 16m 02s)

The general issue here, missing search suggestions, is resolved and the temporary mitigations put in place have been rolled back. I'm calling this issue done. One of the root causes, network connectivity, has been resolved. The other root cause, promoting a bad index, is tracked in T363521. Some changes have already been put in place to make this code more resilient to network failures, but more might still be done.
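
As an illustration of the kind of guard against promoting a bad index, the sketch below refuses promotion when a freshly built index is much smaller than a reference index (for example the previous build, or the same index in the other datacenter). This is a hypothetical check, not the CirrusSearch implementation tracked in T363521; the function name and the 0.8 threshold are assumptions. The sizes are the ones reported earlier in this task.

```python
def safe_to_promote(new_size: float, reference_size: float,
                    min_ratio: float = 0.8) -> bool:
    """Return True if the new index is large enough relative to a reference.

    min_ratio is an assumed threshold: a new index smaller than this
    fraction of the reference is treated as suspect and not promoted.
    """
    if reference_size <= 0:
        # Nothing to compare against (e.g. first build), so allow promotion.
        return True
    return new_size / reference_size >= min_ratio


# Sizes in GB reported in this task: eqiad 6.7 vs codfw 14.5.
print(safe_to_promote(6.7, 14.5))
print(safe_to_promote(14.5, 14.5))
```

With these numbers the eqiad index is under half the size of the reference, so a check of this shape would have blocked the promotion.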

Gehel claimed this task.