
General collection of encountered Search issues
Closed, Resolved, Public

Description

When reporting an issue in a comment under this ticket, please include:

  • The name or URL of your Wikibase
  • The date and time when you encountered the issue
  • Browser details
  • A description of the error you encountered
  • What you expected to happen, the normal behaviour

Event Timeline

Initially reported from Telegram:

18-02-2023, 09:23 AM CET Eduards Skvireckis
"Hey everyone, we will be performing maintenance on ElasticSearch. As part of th..."

Hey,
Is there any news regarding this matter? Right now search doesn't work for newly created items, and I can't reconcile against these items (the reconciliation process finds no match).


19-02-2023, 13:08 CET Tom R.
"Hey, Is there any news regarding this matter? Right now search doesn't work for ..."

I'm also wondering about an update? Is there anything we could do? In any case, thanks for working on this.

Not sure if related, but I'll record my findings here anyway:

Last Friday, 17th of February, in the late evening, there seems to have been an ElasticSearch outage that made a lot of search requests fail and also caused certain item updates to be dropped entirely. The number of errors seems to have peaked at around 10 PM CET (query):

[Attached image: image.png]

Most errors read like:

[warning] [CirrusSearch] Search backend error during fetching elasticsearch version after 6: unknown: Couldn't connect to host, Elasticsearch down?

or

[warning] [CirrusSearch] Search backend error during wikibase_prefix search for 'REDACTED_SEARCH_TERM' after 5: unknown: Couldn't connect to host, Elasticsearch down?

There are also lines like this:

[warning] [CirrusSearchChangeFailed] Dropping failing ElasticaWrite job for DataSender::sendData in cluster default after repeated failure

which means that some indices are now probably out of sync with their sources and need to be rebuilt (unless that already happens automatically and I don't know about it yet).
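To get a rough idea of how many writes were dropped compared to how many searches merely failed, the three kinds of log lines quoted above could be tallied with something like the sketch below. The regular expressions are derived only from those three sample lines, and the category labels (`version_check`, `search`, `dropped_write`) are made up for this example, so treat it as a starting point rather than a complete parser.

```python
import re
from collections import Counter

# Patterns based solely on the sample log lines in this comment.
# The labels are hypothetical names chosen for this sketch.
PATTERNS = [
    ("dropped_write", re.compile(
        r"\[CirrusSearchChangeFailed\] Dropping failing ElasticaWrite job")),
    ("version_check", re.compile(
        r"\[CirrusSearch\] Search backend error during fetching elasticsearch version")),
    ("search", re.compile(
        r"\[CirrusSearch\] Search backend error during \w+ search")),
]


def classify(line):
    """Return the label of the first matching pattern, or None."""
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return None


def summarize(lines):
    """Count how many lines fall into each known category."""
    counts = Counter()
    for line in lines:
        label = classify(line)
        if label is not None:
            counts[label] += 1
    return counts
```

Only the `dropped_write` count hints at index entries that may need rebuilding; the other two categories are requests that failed at the time but left no stale data behind.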

Why did this happen?

There are no suspicious signs in the ElasticSearch service's (currently limited) resource metrics. It seems, however, that GKE upgraded nodes at that time, potentially throwing the ElasticSearch cluster into a non-functioning state during its lengthy recovery process (see T328740).

[Attached image: image.png]

Followups

  • Is there something going on in ElasticSearch that we just cannot see yet?
  • Can we make GKE perform node upgrades in a way that does not break ElasticSearch?
  • Is there manual work left to be done to get wikis and their indices back in sync?

Wikibase instance:
Riga literata

Time:
From 16.02.2023 (with newly created items)

Browser:
Google Chrome Version 110.0.5481.96 (Official Build) (64-bit)

Problem:
Search and reconciliation didn't work with newly created items (that were created on 16.02.2023). Since yesterday (20.02.2023) I have noticed some changes: slowly, some of these items are appearing in search results and can be reconciled, but there are still many items that cannot be found.

In terms of the issues for Riga literata it looks like there are a fair few jobs in the queue for this wiki. Right now >1000.

kubectl exec -it deployment/mediawiki-137-fp-app-web -- bash -c "WBS_DOMAIN='riga-literata.wikibase.cloud' php w/maintenance/showJobs.php" 
1124

As you browse the wiki, these jobs will be done (2 per request). However, we probably want to set up an automated process to eat through these jobs for situations where much more writing than reading is happening on the wiki. See: https://www.mediawiki.org/wiki/Manual:Job_queue
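For a sense of scale, here is a back-of-the-envelope calculation of how many page requests it would take to work off a backlog at 2 jobs per request. It assumes no new jobs are queued in the meantime, which is optimistic; the function name is just for illustration.

```python
import math


def requests_to_drain(pending_jobs, jobs_per_request=2):
    """Requests needed to empty the queue if nothing new is enqueued."""
    if jobs_per_request <= 0:
        raise ValueError("jobs_per_request must be positive")
    return math.ceil(pending_jobs / jobs_per_request)


# The 1124 jobs reported by showJobs.php above:
print(requests_to_drain(1124))  # 562
```

So even this comparatively small backlog needs several hundred page views to clear through normal browsing alone.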

I've started a job to resolve this backlog for this specific wikibase:

WBS_DOMAIN="riga-literata.wikibase.cloud" ./runAllMWJobs.sh
job.batch/run-all-mw-jobs-4f7qf created

Edit: Job queue is now emptied for this wikibase

Hey Tarrow, thanks for your work on this! Could it be that the queue for https://osloddt.wikibase.cloud/wiki/Main_Page is also overloaded? I started a couple of Special:Nuke things, which do not seem to work, and search in the search box is not updating at all...
Thanks for checking,
Tom

@Ruettet I checked your wiki and it looks like there are currently 549880 pending jobs in the queue. I manually started a job that will process these items in the next few hours.

Wow, thanks for checking. And I understand that the search index update is buried somewhere among all these pending jobs, right? I'll let you know later how things are!
Merci,
Tom

@Ruettet Extrapolating from the speed at which jobs are currently being worked on, I would expect your wiki to be finished at around midnight (Berlin time) tonight (27.02.2023).
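For anyone wanting to repeat this kind of estimate, a linear extrapolation from two queue-size readings is enough. The sketch below assumes the processing rate stays constant and that no new jobs arrive; the timestamps and the second reading in the example are hypothetical, not the actual measurements.

```python
from datetime import datetime, timedelta


def estimate_finish(t0, pending0, t1, pending1):
    """Linearly extrapolate when the queue reaches zero.

    Returns None if the queue did not shrink between the two readings.
    """
    elapsed = (t1 - t0).total_seconds()
    processed = pending0 - pending1
    if elapsed <= 0 or processed <= 0:
        return None
    rate = processed / elapsed  # jobs per second
    return t1 + timedelta(seconds=pending1 / rate)


# Hypothetical readings: only the first job count comes from this ticket.
first = datetime(2023, 2, 27, 12, 0)
second = datetime(2023, 2, 27, 13, 0)
print(estimate_finish(first, 549880, second, 500000))
```

Taking more than two readings and re-running the estimate is a cheap way to notice when the rate changes.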

I have the impression that things are clearing up on the side of my wiki, with up to date search results, so this seems to have helped!
Thanks,
Tom

$wgJobRunRate is set to 2 due to a previous instance of this issue (the reference is now seemingly lost due to a repo change), but as noted there, that is probably not a sufficient approach to handle mass imports in a timely manner: the expectation is that data is available immediately, and an unlisted wikibase may not be accessed "regularly" at all (not sure whether API requests count, and even if they do, the queue may never catch up if n+1 jobs are queued per request).
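The "may never catch up" point can be illustrated with a toy model: every request first enqueues some jobs and then runs at most `job_run_rate` of them. This is a deliberate simplification of how MediaWiki schedules jobs, and the function and parameter names are hypothetical.

```python
def simulate_queue(pending, requests, enqueued_per_request, job_run_rate=2):
    """Return the queue length after each request in a simplified model."""
    history = []
    for _ in range(requests):
        pending += enqueued_per_request        # jobs created by this request
        pending -= min(pending, job_run_rate)  # jobs run at the end of it
        history.append(pending)
    return history


print(simulate_queue(10, 5, enqueued_per_request=0))  # [8, 6, 4, 2, 0]
print(simulate_queue(10, 5, enqueued_per_request=2))  # [10, 10, 10, 10, 10]
print(simulate_queue(10, 5, enqueued_per_request=3))  # [11, 12, 13, 14, 15]
```

In this model the backlog only shrinks while requests enqueue fewer jobs than the run rate, which is why a write-heavy import needs a job runner that is independent of web traffic.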

@Fring and @Tarrow: OK, I have the impression that the search box does update now, but that updates take a few days to catch up (I am doing a huge import). That is not a problem for now, and I have lots of understanding as this is beta, but indeed, the expectation in an alpha version would be that the search box is relatively quick in catching up with new entries. Imagine the case where I want to add a statement to an entity and do not find the "object entity" that I want to relate to the "subject entity": I would then have to create the object of the triple, and the expectation would be that it is immediately available as an object in the search box for making the triple.

Was redirected from Wikibase.Cloud / WBStack Telegram to report search issues since platform update from MediaWiki 1.37 to 1.38. Currently my instance (https://lgbtdb.wikibase.cloud/) is getting "An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later." for all searches since the update. As well, it is not possible to create new statements, as no entities appear in that search feature either (i.e. typing any property results in a "No match was found" error). Is there any way I can help with diagnosing or addressing the issue? I can also provide more information or feedback if needed. Thank you all so much!

Hi!

We think that it might be unrelated to the MediaWiki update, but we can indeed see that ElasticSearch is taking *forever* to restart after an unrelated node update happened around 20:00 UTC last night. We hope this will be recovered soon (i.e. <24hrs). In the meantime you can probably add statements by directly entering the Property ID (e.g. P1) when creating a new statement.

Thanks for the report though :)

@Tarrow No worries! Thank you so much for helping work on this! Unfortunately I will note that directly referencing Property ID is also not working at the present time.

Evelien_WMDE claimed this task.

Resolving this due to having enough inventory/metrics to guide the way for further development on ElasticSearch adventures