
Completion suggester can promote a bad build
Open, Needs Triage, Public, 8 Estimated Story Points

Description

On Apr 25th the build log for the enwiki completion suggester skips from 44% to done. It had been progressing at a rate of 2 or 3% per minute, so there was plenty left to do, but instead, for some reason, it decided it was done and promoted the index. Logstash contains nothing of interest; searching for host:mwmaint1002 AND wiki:enwiki over the failure time period returns no logs. systemd timer output is captured here: P61225

In general, what we probably need is a sanity check before promote-and-delete that the index to be promoted contains everything we expect it to contain.
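As a rough sketch (none of this is existing code; the index names, the 95% threshold and the jq dependency are all assumptions), such a check could compare the document count of the freshly built index against the one it is about to replace, and abort the promotion if the new index is suspiciously small:

ES=https://search.svc.eqiad.wmnet:9243
OLD=enwiki_titlesuggest_1713916800   # hypothetical: index currently behind the alias
NEW=enwiki_titlesuggest_1714003200   # hypothetical: freshly built index
old_count=$(curl -s "$ES/$OLD/_count" | jq .count)
new_count=$(curl -s "$ES/$NEW/_count" | jq .count)
# Allow some drift, since the wiki keeps changing during the build, but refuse
# to promote an index that is missing a large fraction of the documents.
if [ "$new_count" -lt $(( old_count * 95 / 100 )) ]; then
  echo "refusing to promote $NEW: $new_count docs vs $old_count in $OLD" >&2
  exit 1
fi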

It would also be nice to recognize whatever failure caused this in the first place, and to implement alerting for it.

Event Timeline

Wrote a terrible bash script to compare titlesuggest doc counts between the two clusters. This suggests the problem isn't limited to enwiki:

for dc in eqiad codfw; do
  # column 3 of _cat/indices is the index name, column 7 is docs.count
  curl -s https://search.svc.$dc.wmnet:9243/_cat/indices | \
    awk '/titlesuggest/ {print $3, $7}' | \
    sed 's/_titlesuggest_.* / /' \
    > $dc.clean   # strip the _titlesuggest_<timestamp> suffix, keep "<dbname> <count>"
done

awk -F ' ' '
  # first file (eqiad): remember each wiki'"'"'s doc count
  FNR == NR { a[$1] = $2; next }
  # second file (codfw): report wikis where codfw has notably more docs than eqiad
  $1 in a {
    diff = $2 - a[$1];
    if (diff > 1000) {
      print $1, diff
    }
  }
' eqiad.clean codfw.clean
svwiki 2490420
eswiki 11596
arwiki 4529
ukwiki 321075
arzwiki 3026
sawiktionary 3000
ruwiki 2834452
ttwiki 190248
idwiki 3892
itwiktionary 127617
mgwiktionary 6005
enwiktionary 3025
ltwiktionary 3008
uzwiki 187459
plwiki 1440184
frwiki 4079
viwiki 781017
itwiki 3786

One thing we do have in logstash, although not specifically from the script running in eqiad, is a surprising (to me) number of general network errors talking to the elasticsearch cluster. Looking at the Host overview dashboard for mwmaint1002 for today, we can see that there were intermittent network errors from 03:00 until 06:50. Our completion indices build ran from 02:30 to 06:45. Looking at the last 7 days, there are consistently network errors during this time period. I'm assuming we are causing those, but we could try running the build at a different time of day.

From our side, it looks like our error handling is incomplete; Cirrus should still do the right thing when network requests fail. But I'm also curious what is causing the failures. It's not obviously an increase in errors: various dashboards show that some failures here and there are normal. But it feels like an increase if completion building is regularly running into them now, not just on one wiki but on multiple wikis on the same day.
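As a generic illustration only (the with_retries helper is made up, not part of CirrusSearch, and the real fix would live in the maintenance script itself), "doing the right thing" here mostly means retrying transient network failures with backoff instead of treating the first failure as the end of the data:

# retry a flaky command a few times with exponential backoff before giving up
with_retries() {
  local attempt
  for attempt in 1 2 3 4 5; do
    "$@" && return 0                  # success: stop retrying
    echo "attempt $attempt failed: $*" >&2
    sleep $(( 2 ** attempt ))         # back off before the next try
  done
  return 1                            # surface the error instead of pretending we are done
}
# example: with_retries curl -sf https://search.svc.eqiad.wmnet:9243/_cat/indices >/dev/null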

The errors "delayed connect error: 113" seem to have started on apr 24 21:30 right after deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023937.
The errors affect both mw@wikikube and mwmaint1002 https://logstash.wikimedia.org/goto/5ac680b477389129ffb5ddf33fa09940
I think we should switch completion traffic to codfw while we work on a more resilient version of this maint script and also understand why we get these errors.

Tagging @serviceops for help with the connectivity issue and this new "delayed connect error: 113" error.

Looking at the Host overview dashboard for mwmaint1002 for today, we can see that there were intermittent network errors from 03:00 until 06:50.

If this is the dashboard referred to, this is probably a red herring. The scale is in millipackets/s and the spikes are barely in the low tens, not enough to impact anything, especially TCP connections, which will retry packets anyway.

[attachment image.png: screenshot of the referenced dashboard]

This is pretty concerning. What we also see is "unknown: Status code 503; upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 113", which usually means that the destination is unreachable, aka EHOSTUNREACH.
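For reference, errno 113 on Linux is indeed EHOSTUNREACH ("No route to host"); a quick way to confirm the mapping, assuming python3 is available on the host:

python3 -c 'import errno, os; print(errno.EHOSTUNREACH, os.strerror(errno.EHOSTUNREACH))'
# prints: 113 No route to host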

This SAL entry, https://sal.toolforge.org/log/TXIJEo8BGiVuUzOdIZbf, lines up perfectly with the beginning of the errors, so I just reverted it.

Moving to serviceops-radar, this seems to be more related to the elasticsearch infra than wikikube or appservers.

Cluster and listener envoy snippets, specific to search at P61253

The root cause of the network issue has been tracked down in T363516#9748908: a layer-2 issue with LVS and new racks. With that fixed, this error should be triggered less frequently, but we should still apply some resiliency updates to the related code.

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1024698 switches from using a scroll to a search_after approach, which should be more robust by handling retries and errors properly.
The question is whether we should do more by adding further checks. Unfortunately, not all wikis build a new index and promote it: to optimize cluster operations, most wikis recycle the same index, so we don't get a chance to run such sanity checks prior to promoting.
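A minimal sketch of the scroll-to-search_after idea (this is not the content of the patch; the page_id sort field, source index name, batch size and jq dependency are assumptions): because each request carries its own cursor, namely the sort values of the last hit, a failed request can simply be retried, whereas a lost or expired scroll forces the whole dump to start over.

ES=https://search.svc.eqiad.wmnet:9243
INDEX=enwiki_content                       # hypothetical source index
after=0
while :; do
  body=$(printf '{"size":1000,"sort":[{"page_id":"asc"}],"search_after":[%s],"query":{"match_all":{}}}' "$after")
  # transient failures are retried instead of silently ending the dump early
  resp=$(curl -s --retry 5 --retry-delay 2 -H 'Content-Type: application/json' \
              -d "$body" "$ES/$INDEX/_search") || continue
  hits=$(jq '.hits.hits | length' <<<"$resp")
  [ "$hits" -eq 0 ] && break                       # no documents left
  jq -c '.hits.hits[]._source' <<<"$resp"          # hand each doc to the suggester builder
  after=$(jq '.hits.hits[-1].sort[0]' <<<"$resp")  # cursor = sort value of the last hit
done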


Indeed a strict sanity check is a bit hard, but maybe we could do some approximation?

  • Count the number of docs we ingested to build the index and require it to be close enough to the expected number of docs. We can't require a perfect match, because the index changes while we are reading it. "Close enough" is pretty weak: the check can fail without a real failure, or pass when it should have failed.
  • Sample n random pages and check that they can be found in autocomplete (a rough sketch follows this list). We can filter pages to ensure they were not created after the build started. Unfortunately we can't know whether a doc was in elasticsearch during the rebuild, only that it was in the sql database. We hope that most of the time the index is correct, but it is still a bit squishy. Maybe require pages to be a day old and hope the index is correct (this will fail when it samples an unrenderable page).
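A very rough sketch of the sampling idea, using only public MediaWiki APIs (the wiki endpoint, sample size, prefix length and jq dependency are all assumptions, and the age/creation-date filter discussed above is omitted):

WIKI=https://en.wikipedia.org/w/api.php   # hypothetical target wiki
misses=0
titles=$(curl -s "$WIKI?action=query&list=random&rnnamespace=0&rnlimit=20&format=json" \
         | jq -r '.query.random[].title')
while IFS= read -r title; do
  prefix=${title:0:10}   # autocomplete on the first few characters of the title
  found=$(curl -s -G "$WIKI" --data-urlencode "search=$prefix" \
               --data 'action=opensearch&limit=50&format=json' \
          | jq --arg t "$title" '.[1] | index($t) != null')
  [ "$found" = "true" ] || misses=$((misses + 1))
done <<<"$titles"
echo "$misses of 20 sampled titles were not found in autocomplete"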

I'm not sure how to be strict here, and squishy guarantees have a tendency to cause extra work due to false readings. May have to ponder some more.

Gehel set the point value for this task to 8. Mon, Jun 17, 3:46 PM