Increased delay in indexing of new Items on Wikidata
Closed, Resolved · Public · 3 Estimated Story Points · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Create a new Item
  • Wait until it shows up in search results

What happens?:

  • The Item shows up after 5 to 6 minutes according to some reports, which is slower than it has been in the past.

What should have happened instead?:

  • The new Item should be indexed more quickly. This is especially important for the common workflow of creating Items and then searching for them to use in a new statement.

Other information (browser name/version, screenshots, etc.):

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel triaged this task as High priority. May 27 2024, 12:39 PM
Gehel moved this task from needs triage to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.

The additional delay comes from a new implementation of the search update pipeline, which has a deduplication window. In principle, search is an asynchronous system and should not be used in a synchronous workflow. Our informal and not yet documented SLO is that updates should be visible within 10 minutes.

It might be time to review how we use search in the edit workflow and find alternatives.

This is hugely disrupting my workflow. What could the alternatives be, do an API call on (X) new items created by the user?

I'll let @Lydia_Pintscher give more context on the future plans and workarounds. We will be working on a prioritization of Wikidata updates in the search pipeline, but it will take some time. In the meantime, it should be possible to refer to items by their QID.

We've discussed how to go about this problem some more. Here is the option we have come up with so far to address the problem of using the asynchronous service for synchronous work: We could have some form of local storage for the user that contains the entities they recently created. Those would be used for suggestions when making new statements.
Additional thoughts:

  • One interesting step in this direction is allowing users to create a new Item directly from the statement editing UI. @Celenduin has created a user script to explore what this part could look like; the details are in T107693.
  • People have said that it would be interesting to use this local storage not just for the entities the user recently created but for any they have recently used. This would make some edits easier where you repeatedly use the same entity while editing many entities.
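
As a rough illustration of the local-storage idea above, here is a minimal in-memory sketch in Python. It is not code from Wikibase or from the user script: the names `Entity`, `RecentEntities`, `remember`, and `suggest` are hypothetical, the capacity of 50 is an arbitrary assumption, and a real version would live in the browser. It keeps a bounded, most-recent-first list of entities the user created or used and matches typed text against labels and QIDs without going through search.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entity:
    qid: str    # entity ID, e.g. "Q123" (placeholder values in the usage below)
    label: str  # label in the user's language


class RecentEntities:
    """Hypothetical bounded store of entities the user recently created or used."""

    def __init__(self, capacity: int = 50) -> None:
        self._capacity = capacity
        self._items: "OrderedDict[str, Entity]" = OrderedDict()

    def remember(self, entity: Entity) -> None:
        # Re-inserting an entity moves it to the most-recent position.
        self._items.pop(entity.qid, None)
        self._items[entity.qid] = entity
        # Evict the oldest entries once the store is over capacity.
        while len(self._items) > self._capacity:
            self._items.popitem(last=False)

    def suggest(self, typed: str, limit: int = 5) -> List[Entity]:
        """Return recent entities whose label starts with the typed text
        or whose QID equals it, most recent first."""
        needle = typed.strip().casefold()
        if not needle:
            return []
        matches = [
            e for e in reversed(self._items.values())
            if e.label.casefold().startswith(needle) or e.qid.casefold() == needle
        ]
        return matches[:limit]
```

A statement editor could call `remember()` right after an Item is created or selected and merge the result of `suggest()` with the regular search suggestions, so a new Item is selectable before the index catches up. The QIDs and labels used with this sketch are placeholders, not real Item data.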

A few options we might consider:

  • The 5 minute delay is part of an intentional deduplicate/merge step. We could add a way for Wikidata to bypass this.
  • We could consider removing deduplication altogether. It currently removes ~15% of incoming events, although we expect the merge rate to increase (not sure by how much) if/when we start ingesting more async updates that are triggered by events (ML predictions, etc.). Each event translates into multiple calls to mw-api-int, so removing deduplication would also put more load on related systems.
  • We could consider reducing the window length. This would probably have to be global and apply to all wikis. It's not impossible to vary it, but the additional complexity of managing multiple windowing operators is likely not worth the trouble.

There are probably a few more options, but these are the first few that came to mind. We intend to discuss this some more at our Wednesday meeting.
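
To make the trade-off between these options concrete, here is a simplified, self-contained Python sketch of a deduplication window with a per-wiki bypass. It is an illustration only and not the pipeline's actual implementation: `UpdateEvent`, `DedupWindow`, `offer`, and `flush` are hypothetical names, and merging is reduced to "keep the last event seen per page". The 300-second window in the usage note mirrors the 5 minute delay mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class UpdateEvent:
    wiki: str         # wiki database name, e.g. "wikidatawiki"
    page_id: int
    rev_id: int
    timestamp: float  # event time in seconds


class DedupWindow:
    """Hypothetical sketch: buffer events per (wiki, page) and emit one per window."""

    def __init__(self, window_seconds: float, bypass_wikis: Set[str]) -> None:
        self._window = window_seconds
        self._bypass = bypass_wikis
        # key -> (time the window opened, last event seen for that key)
        self._pending: Dict[Tuple[str, int], Tuple[float, UpdateEvent]] = {}

    def offer(self, event: UpdateEvent) -> List[UpdateEvent]:
        """Accept an event and return the events that can be indexed right away."""
        if event.wiki in self._bypass:
            # Bypassed wikis skip the window: no added delay, no deduplication.
            return [event]
        key = (event.wiki, event.page_id)
        opened_at, _ = self._pending.get(key, (event.timestamp, event))
        # Later events for the same page replace earlier ones in the window.
        self._pending[key] = (opened_at, event)
        return []

    def flush(self, now: float) -> List[UpdateEvent]:
        """Emit the surviving event of every window open for at least window_seconds."""
        ready = [
            key for key, (opened_at, _) in self._pending.items()
            if now - opened_at >= self._window
        ]
        return [self._pending.pop(key)[1] for key in ready]
```

With `DedupWindow(300.0, {"wikidatawiki"})`, a Wikidata event is returned by `offer()` immediately, while two edits to the same page on another wiki are held and collapse into one event at the next `flush()` after the window closes. That shows the cost of the bypass: every bypassed event goes downstream unmerged.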

Gehel set the point value for this task to 3. Jul 15 2024, 3:27 PM

Option chosen: remove the delay / deduplication just for Wikidata

  • The 5 minute delay is part of an intentional deduplicate/merge step, we could add a way for wikidata to bypass this.
pfischer changed the task status from Open to In Progress. Jul 16 2024, 8:23 AM
pfischer claimed this task.

Change #1054850 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: let wikidatawiki bypass optimization (deduplication)

https://gerrit.wikimedia.org/r/1054850

Change #1054850 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: let wikidatawiki bypass optimization (deduplication)

https://gerrit.wikimedia.org/r/1054850

Change #1055209 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/deployment-charts@master] Search update pipeline: let wikidatawiki bypass optimization (deduplication)

https://gerrit.wikimedia.org/r/1055209

Change #1055209 merged by jenkins-bot:

[operations/deployment-charts@master] Search update pipeline: let wikidatawiki bypass optimization (deduplication)

https://gerrit.wikimedia.org/r/1055209

Reading through CirrusSearch docs for an unrelated issue, I came across the following section which made me think of this task:

docs/page_lifecycle.txt
=== New page is created ===
* PageContentInsertComplete hook fires in the web request process. If
  $wgCirrusSearchInstantIndexNew is enabled a minimal document is immediatly
  indexed (primarily to populate autocomplete on Wikidata).

This is from 2018 and sounds like $wgCirrusSearchInstantIndexNew was intended to prevent exactly this issue. But codesearch suggests that it is no longer configured in production and no longer referenced in code?

That was removed in the patch "Remove instant index on page creation" back in 2019. It seems that was done as a simple revert, so it didn't catch the documentation that was added later. Perhaps something has gone wrong with the SQL integration of wbsearchentities?