
Completion suggester can promote a bad build
Open, Needs Triage, Public, 8 Estimated Story Points

Description

On Apr 25th the build log for the enwiki completion suggester skips from 44% to done. It had been progressing at a rate of 2 or 3% per minute, so there was plenty left to do, but instead, for some reason, it decided it was done and promoted the index. Logstash contains nothing of interest; searching for host:mwmaint1002 AND wiki:enwiki over the failure time period returns no logs. systemd timer output is captured here: P61225

In general, what we probably need is a sanity check before promote-and-delete that the index to be promoted contains everything we expect it to contain.
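As a rough sketch (none of this is existing code; the index names, the 95% threshold and the jq dependency are all assumptions), such a check could compare the document count of the freshly built index against the one it is about to replace, and abort the promotion if the new index is suspiciously small:

ES=https://search.svc.eqiad.wmnet:9243
OLD=enwiki_titlesuggest_1713916800   # hypothetical: index currently behind the alias
NEW=enwiki_titlesuggest_1714003200   # hypothetical: freshly built index
old_count=$(curl -s "$ES/$OLD/_count" | jq .count)
new_count=$(curl -s "$ES/$NEW/_count" | jq .count)
# Allow some drift, since the wiki keeps changing during the build, but refuse
# to promote an index that is missing a large fraction of the documents.
if [ "$new_count" -lt $(( old_count * 95 / 100 )) ]; then
  echo "refusing to promote $NEW: $new_count docs vs $old_count in $OLD" >&2
  exit 1
fi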

It would also be nice to recognize whatever failure caused this in the first place, and to implement alerting for it.

Event Timeline

Wrote a terrible bash script to compare titlesuggest doc counts between the two clusters. This suggests the problem isn't limited to enwiki:

for dc in eqiad codfw; do
  # column 3 of _cat/indices is the index name, column 7 is docs.count
  curl -s https://search.svc.$dc.wmnet:9243/_cat/indices | \
    awk '/titlesuggest/ {print $3, $7}' | \
    sed 's/_titlesuggest_.* / /' \
    > $dc.clean   # strip the _titlesuggest_<timestamp> suffix, keep "<dbname> <count>"
done

awk -F ' ' '
  # first file (eqiad): remember each wiki'"'"'s doc count
  FNR == NR { a[$1] = $2; next }
  # second file (codfw): report wikis where codfw has notably more docs than eqiad
  $1 in a {
    diff = $2 - a[$1];
    if (diff > 1000) {
      print $1, diff
    }
  }
' eqiad.clean codfw.clean
svwiki 2490420
eswiki 11596
arwiki 4529
ukwiki 321075
arzwiki 3026
sawiktionary 3000
ruwiki 2834452
ttwiki 190248
idwiki 3892
itwiktionary 127617
mgwiktionary 6005
enwiktionary 3025
ltwiktionary 3008
uzwiki 187459
plwiki 1440184
frwiki 4079
viwiki 781017
itwiki 3786

One thing we do have in logstash, although not specifically from the script running in eqiad, is a surprising (to me) number of general network errors talking to the elasticsearch cluster. Looking at the Host overview dashboard for mwmaint1002 for today, we can see that there were intermittent network errors from 03:00 until 06:50. Our completion indices build ran from 02:30 to 06:45. Looking at the last 7 days, there are consistently network errors during this time period. I'm assuming we are causing those, but we could try running the build at a different time of day.

From our side, it looks like our error handling is incomplete; Cirrus should still do the right thing when network requests fail. But I'm also curious what is causing the failures. It's not obviously an increase in errors: various dashboards show that some failures here and there are normal. But it feels like an increase if completion building is regularly running into them now, not just on one wiki but on multiple wikis on the same day.
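As a generic illustration only (the with_retries helper is made up, not part of CirrusSearch, and the real fix would live in the maintenance script itself), "doing the right thing" here mostly means retrying transient network failures with backoff instead of treating the first failure as the end of the data:

# retry a flaky command a few times with exponential backoff before giving up
with_retries() {
  local attempt
  for attempt in 1 2 3 4 5; do
    "$@" && return 0                  # success: stop retrying
    echo "attempt $attempt failed: $*" >&2
    sleep $(( 2 ** attempt ))         # back off before the next try
  done
  return 1                            # surface the error instead of pretending we are done
}
# example: with_retries curl -sf https://search.svc.eqiad.wmnet:9243/_cat/indices >/dev/null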

The errors "delayed connect error: 113" seem to have started on apr 24 21:30 right after deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023937.
The errors affect both mw@wikikube and mwmaint1002 https://logstash.wikimedia.org/goto/5ac680b477389129ffb5ddf33fa09940
I think we should switch completion traffic to codfw while we work on a more resilient version of this maint script and also understand why we get these errors.

Tagging @serviceops for help with the connectivity issue and this new "delayed connect error: 113" error.

Looking at the Host overview dashboard for mwmaint1002 for today, we can see that there were intermittent network errors from 03:00 until 06:50.

If this is the dashboard referred to, this is probably a red herring. The scale is in millipackets/s and the spikes are barely in the low tens, not enough to impact anything, especially TCP connections, which will retry packets anyway.

[attachment image.png: screenshot of the referenced dashboard]

This is pretty concerning. What we also see is "unknown: Status code 503; upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 113", which usually means that the destination is unreachable, aka EHOSTUNREACH.
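For reference, errno 113 on Linux is indeed EHOSTUNREACH ("No route to host"); a quick way to confirm the mapping, assuming python3 is available on the host:

python3 -c 'import errno, os; print(errno.EHOSTUNREACH, os.strerror(errno.EHOSTUNREACH))'
# prints: 113 No route to host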

This SAL entry, https://sal.toolforge.org/log/TXIJEo8BGiVuUzOdIZbf, lines up perfectly with the beginning of the errors, so I just reverted it.

Moving to serviceops-radar, this seems to be more related to the elasticsearch infra than wikikube or appservers.

Cluster and listener envoy snippets, specific to search at P61253

The root cause of the network issue has been tracked down in T363516#9748908: a layer-2 issue with LVS and new racks. With that fixed, this error should be triggered less frequently, but we should still apply some resiliency updates to the related code.

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1024698 switches from using a scroll to a search_after approach, which should be more robust by handling retries and errors properly.
The question is whether we should do more by adding further checks. Unfortunately, not all wikis build a new index and promote it: to optimize cluster operations, most wikis recycle the same index, so we don't get a chance to run such sanity checks prior to promoting.
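A minimal sketch of the scroll-to-search_after idea (this is not the content of the patch; the page_id sort field, source index name, batch size and jq dependency are assumptions): because each request carries its own cursor, namely the sort values of the last hit, a failed request can simply be retried, whereas a lost or expired scroll forces the whole dump to start over.

ES=https://search.svc.eqiad.wmnet:9243
INDEX=enwiki_content                       # hypothetical source index
after=0
while :; do
  body=$(printf '{"size":1000,"sort":[{"page_id":"asc"}],"search_after":[%s],"query":{"match_all":{}}}' "$after")
  # transient failures are retried instead of silently ending the dump early
  resp=$(curl -s --retry 5 --retry-delay 2 -H 'Content-Type: application/json' \
              -d "$body" "$ES/$INDEX/_search") || continue
  hits=$(jq '.hits.hits | length' <<<"$resp")
  [ "$hits" -eq 0 ] && break                       # no documents left
  jq -c '.hits.hits[]._source' <<<"$resp"          # hand each doc to the suggester builder
  after=$(jq '.hits.hits[-1].sort[0]' <<<"$resp")  # cursor = sort value of the last hit
done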


Indeed a strict sanity check is a bit hard, but maybe we could do some approximation?

  • Count the number of docs we ingested to build the index and require it to be close enough to the expected number of docs. We can't require a perfect match, because the index changes while we are reading it. "Close enough" is pretty weak: the check can fail without a real failure, or pass when it should have failed.
  • Sample n random pages and check that they can be found in autocomplete (a rough sketch follows this list). We can filter pages to ensure they were not created after the build started. Unfortunately we can't know whether a doc was in elasticsearch during the rebuild, only that it was in the sql database. We hope that most of the time the index is correct, but it is still a bit squishy. Maybe require pages to be a day old and hope the index is correct (this will fail when it samples an unrenderable page).
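A very rough sketch of the sampling idea, using only public MediaWiki APIs (the wiki endpoint, sample size, prefix length and jq dependency are all assumptions, and the age/creation-date filter discussed above is omitted):

WIKI=https://en.wikipedia.org/w/api.php   # hypothetical target wiki
misses=0
titles=$(curl -s "$WIKI?action=query&list=random&rnnamespace=0&rnlimit=20&format=json" \
         | jq -r '.query.random[].title')
while IFS= read -r title; do
  prefix=${title:0:10}   # autocomplete on the first few characters of the title
  found=$(curl -s -G "$WIKI" --data-urlencode "search=$prefix" \
               --data 'action=opensearch&limit=50&format=json' \
          | jq --arg t "$title" '.[1] | index($t) != null')
  [ "$found" = "true" ] || misses=$((misses + 1))
done <<<"$titles"
echo "$misses of 20 sampled titles were not found in autocomplete"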

I'm not sure how to be strict here, and squishy guarantees have a tendency to cause extra work due to false readings. May have to ponder some more.

Gehel set the point value for this task to 8. Mon, Jun 17, 3:46 PM