New Wikidata items appear in search with a delay
Closed, Resolved · Public

Description

There are complaints from the users that newly created Wikidata items appear with a delay in prefix search (and, consequently, all autocompletion boxes). The delay seems to be 10-15 seconds.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Dec 16 2017, 9:05 AM

This is annoying because for heavy editors a common workflow is the following:

  • go to item
  • try to add statement that links to another item
  • notice that other item does not exist yet
  • create the other item
  • go back to first item and make statement with newly created item

If the new item does not show up in the item selector relatively quickly that is pretty annoying for them.

jhsoby renamed this task from New Wikidata appear in search with a delay to New Wikidata items appear in search with a delay. Dec 18 2017, 1:01 AM
jhsoby added a subscriber: jhsoby.
Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. Dec 18 2017, 2:57 PM

If a large majority of such use cases involve searching for the entity id (QXXX) of the newly created item, we can perform an additional db match to compensate for the lag of the search index.
It's what we do for normal wikis: a db match is run in addition to the query sent to the search index.
If users search for the label or aliases of the newly created item, then this solution is pointless.

debt triaged this task as High priority. Dec 19 2017, 6:28 PM
debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.
debt added a subscriber: debt.

For item creation on Wikidata, we probably want the delay to be as small as possible. Should look at how many items are created on a daily basis, to see how much of a load on the servers this might turn into (if we force a refresh). Should also look at if a human created the item or if a bot did -- maybe make the human creation items be a forced refresh but not the bot? Or...maybe set the flag to check for new items to check every 5 seconds.

The first graph on https://grafana.wikimedia.org/dashboard/db/wikidata-datamodel?refresh=30m&orgId=1 shows the number of new items created over time.
For this particular problem bots could indeed be taken out. They make up the biggest part of new page creations on Wikidata (see https://stats.wikimedia.org/v2/#/wikidata.org/contributing/new-pages if you split by editor type).

So item creation rate is about 85k per day, or very close to one per second. Bots seem to dominate that though, so for real users it will be lower. Also, some of those are probably tools like QuickStatements for which it also could be fine to have the regular delay - maybe only force sync for those that come from browser pages?

Inspecting RC page, I see about 2-3 new items every second now, from non-bot accounts (it varies, but that's a common case), some of them have script tags but most don't have any special tags or markers.

EBernhardson added a comment. Edited Dec 19 2017, 11:34 PM

It seems there are a couple options here, my thoughts:

Reduce the default refresh interval

By default we use a 30 second refresh interval for all wikis. This means that 30 seconds worth of updates get bundled together into a single update. Updates are not searchable until they have been refreshed. While it may not be particularly important on an individual wiki level, across 9k shards in the cluster this saves us considerable IO. We could potentially reduce the refresh interval only for wikidata to 5 seconds, or maybe even 1 second (the elasticsearch default). Trying to estimate the effect this has on the cluster is difficult, but my gut feeling is that a 5 second refresh for only the 21 wikidatawiki_content shards would probably go unnoticed.

It looks like our data collection around refresh is busted; graphite has intermittent data for some reason. Our indexing rate is fairly constant through the day though, so I pulled some numbers directly for 2 minutes worth of activity (at ~23:20 UTC) and saw 19 refreshes/second (1157/minute) across the full eqiad cluster. Worst case on moving wikidatawiki_content with 21 shards from a 30s to a 5s interval would be from the current 0.7/sec (42/minute) to 4.2/sec (252/minute), or an 18% increase cluster-wide. Likely not every shard is refreshed on every opportunity.
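The worst-case estimate above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in the comment; the function and variable names are made up for illustration, and it assumes every shard refreshes once per interval (the upper bound).

```python
# Worst-case refresh-rate estimate, using only figures quoted in the comment.
CLUSTER_REFRESHES_PER_MIN = 1157  # measured across eqiad over ~2 minutes
SHARDS = 21                       # wikidatawiki_content shards


def worst_case_refreshes_per_sec(shards: int, interval_s: float) -> float:
    """Upper bound: every shard refreshes exactly once per interval."""
    return shards / interval_s


cluster_per_sec = CLUSTER_REFRESHES_PER_MIN / 60
current = worst_case_refreshes_per_sec(SHARDS, 30)
proposed = worst_case_refreshes_per_sec(SHARDS, 5)
# Extra refreshes relative to the measured cluster-wide rate.
increase = (proposed - current) / cluster_per_sec

print(f"current:  {current:.1f}/sec ({current * 60:.0f}/min)")
print(f"proposed: {proposed:.1f}/sec ({proposed * 60:.0f}/min)")
print(f"cluster-wide increase: {increase:.0%}")
```

Plugging in a 1 second interval instead gives 21 refreshes/sec for this index alone, which is roughly the size of the whole cluster's measured rate.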

Force refreshes from the cirrus codebase

Rather than increasing the default refresh rate, we could explicitly issue refreshes in the limited cases where we know it's important. Conceptually (I may have simply not thought about it enough) de-bouncing these to keep from issuing 100 refreshes in the same second seems non-trivial. We could certainly throttle the actions, but ensuring a refresh still happens after the throttle time runs out might not be so easy.
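To make the de-bouncing concern concrete, here is a minimal sketch of a throttle that still guarantees a trailing call. It is written in Python for illustration and is not CirrusSearch code; `TrailingThrottle` and its methods are hypothetical names. It also assumes a single long-lived process holding the timer state. Edits that arrive in separate web requests would presumably need that state shared somewhere, which is the hard part the comment points at.

```python
import threading
import time
from typing import Callable, List, Optional


class TrailingThrottle:
    """Coalesce bursts of requests into at most one call per `interval` seconds.

    The first request in a quiet period schedules the action `interval`
    seconds later; requests arriving while a call is pending are absorbed
    by that pending call, so every request is followed by a call.
    """

    def __init__(self, action: Callable[[], None], interval: float) -> None:
        self._action = action
        self._interval = interval
        self._lock = threading.Lock()
        self._timer: Optional[threading.Timer] = None

    def request(self) -> None:
        with self._lock:
            if self._timer is not None:
                return  # a call is already scheduled and will cover this request
            self._timer = threading.Timer(self._interval, self._fire)
            self._timer.daemon = True
            self._timer.start()

    def _fire(self) -> None:
        # Clear the pending marker first, so a request that arrives while the
        # action is running schedules a fresh call instead of being lost.
        with self._lock:
            self._timer = None
        self._action()


# Demo: 100 back-to-back requests collapse into a single (stand-in) refresh.
calls: List[float] = []
throttle = TrailingThrottle(lambda: calls.append(time.monotonic()), interval=0.2)
for _ in range(100):
    throttle.request()
time.sleep(0.5)
print(len(calls))
```

The trade-off is added latency: the refresh lands up to `interval` seconds after the first write in a burst rather than immediately.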

Best option?

In general I think I'm in favor of the less complicated, and likely more robust, solution of adjusting the refresh interval for the wikidatawiki_content index down to 5s. Based on the current rates I think this will be reasonable. We can test on the codfw cluster first, which receives all the same updates as eqiad. If 5s isn't fast enough I would have to rethink the forced refreshes, as the worst case of a 1s refresh would double the refresh rate across the cluster. That might also be acceptable but is a little harder to guesstimate.

I agree that we should try to lower the refresh interval to 5s and see whether it works.

Same for me, I'd be in favor of trying to increase the refresh rate on wikidatawiki_content.

Mentioned in SAL (#wikimedia-operations) [2017-12-20T20:31:14Z] <ebernhardson> T183053 update elasticsearch settings for wikidatawiki_content on codfw to use: index.refresh_interval=5s
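For reference, a setting like this can be changed through Elasticsearch's update index settings API (`PUT /<index>/_settings`). The log entry does not say how the change was actually applied, so the following is only a sketch: the helper name and the `http://localhost:9200` endpoint are placeholders, and the request is built but deliberately not sent.

```python
import json
import urllib.request


def build_refresh_interval_request(
    base_url: str, index: str, interval: str
) -> urllib.request.Request:
    """Build (but do not send) a PUT <index>/_settings request."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholder endpoint; the real cluster address is not given in this task.
req = build_refresh_interval_request(
    "http://localhost:9200", "wikidatawiki_content", "5s"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# urllib.request.urlopen(req) would send it; omitted so the sketch has no side effects.
```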

Took some measurements of refresh rate averaged over 5 minutes pre- and post-deployment. Overall it's perhaps a 15% increase in refresh/minute across the cluster. Disk IO graphs don't show anything particularly interesting. There will certainly be more merge volume as well, but elasticsearch should be able to bundle up the merges enough that these tiny merges are irrelevant compared to the major merges that happen on many-GB segments.

refresh interval    cluster refresh/min    index refresh/min
30s                 1280                   90
5s                  1514                   263

Change 399466 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/mediawiki-config@master] Lower refresh interval for Wikidata to 5s

https://gerrit.wikimedia.org/r/399466

Looks to me that 5s is working fine. I'll add a config patch.

Mentioned in SAL (#wikimedia-operations) [2018-01-02T18:37:19Z] <ebernhardson> T183053 update index.refresh_interval for wikidatawiki_{content,general} on eqiad to 5s

EBernhardson added a comment. Edited Jan 2 2018, 7:11 PM

We may still need to look into the special-case of newly created pages being indexed from the web request, rather than being punted into the job queue. cirrusSearchLinksUpdatePrioritized, which performs the actual generation of a document and write to elasticsearch, looks to have a p99 that regularly varies from 30 to 60 seconds. This is on top of however long it takes for the refreshLinksPrioritized job which is another 20s - 2 minutes for p99. For some fraction of requests, even when the queue is healthy, there will be a couple minutes between the edit being performed and the two necessary jobs making it through the job queue and turned into a write in elasticsearch.

We could maybe just send basic info to ES synchronously when saving a new article (not sure if it's a good idea, just putting it out there) and then let the jobs update it with full data. On the minus side, we'll get one extra document write which is then immediately overwritten. On the plus side, at least the Qid and initial label are available near-instantly.

EBernhardson added a comment. Edited Jan 2 2018, 11:33 PM

Sending the basic info semi-synchronously (from DeferredUpdates, which will run in the same process as the edit but after closing the connection to the user so as not to make save timing worse) should be ok. Actually generating a "basic" set instead of the full thing might be more difficult than necessary though; I would be tempted to add a call to Updater::updateFromTitle(...) and let it do the full thing. Since article creates should be relatively rare (compared to total edit rate), I don't think the extra computation expense outweighs the maintenance cost of keeping an extra bit of code to generate partial updates, including getting the labels from wikidata, without also calculating the rest of it.

I'm not sure exactly what hook that should be attached to though. It looks like we can perhaps hook EditPage::attemptSave:after when $status->value == EditPage::AS_SUCCESS_NEW_ARTICLE, although we might need to play with it to see if it does as expected.
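The ordering described in this comment (save the page, respond to the user, then index within the same process) can be modelled with a toy example. This is Python for illustration only, not MediaWiki code: `DeferredQueue`, `handle_request`, and the title `Q123` are hypothetical stand-ins, and the "index" step just records an event.

```python
from typing import Callable, List


class DeferredQueue:
    """Hypothetical stand-in for a queue of post-response callbacks."""

    def __init__(self) -> None:
        self._pending: List[Callable[[], None]] = []

    def add(self, update: Callable[[], None]) -> None:
        self._pending.append(update)

    def run_all(self) -> None:
        # Swap the list out first so updates queued while running
        # are kept for a later run instead of being dropped.
        pending, self._pending = self._pending, []
        for update in pending:
            update()


def handle_request(title: str, is_new_page: bool, events: List[str]) -> None:
    deferred = DeferredQueue()
    events.append(f"saved {title}")
    if is_new_page:
        # Only page creations get the immediate index write; ordinary
        # edits would still go through the job queue as before.
        deferred.add(lambda: events.append(f"indexed {title}"))
    events.append("response sent")  # user-visible save time ends here
    deferred.run_all()              # same process, after the response


events: List[str] = []
handle_request("Q123", is_new_page=True, events=events)
print(events)
```

The point of the pattern is that the indexing cost is paid after the user has their response, so save timing is unaffected while the new page still reaches the index without waiting on the job queue.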

Change 399466 merged by jenkins-bot:
[operations/mediawiki-config@master] Lower ElasticSearch index refresh interval for Wikidata to 5s

https://gerrit.wikimedia.org/r/399466

Smalyshev moved this task from Backlog to Next on the User-Smalyshev board. Feb 13 2018, 8:17 PM
Smalyshev moved this task from Next to Doing on the User-Smalyshev board. Feb 22 2018, 10:04 PM

Change 413492 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/CirrusSearch@master] Allow some wikis to instantly index newly created articles

https://gerrit.wikimedia.org/r/413492

Smalyshev moved this task from Doing to In review on the User-Smalyshev board. Feb 23 2018, 9:30 PM

Change 413899 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/mediawiki-config@master] Add configuration for CirrusSearch to instantly index new Wikidata items

https://gerrit.wikimedia.org/r/413899

Change 413492 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Allow some wikis to instantly index newly created articles

https://gerrit.wikimedia.org/r/413492

Change 413899 merged by jenkins-bot:
[operations/mediawiki-config@master] Add configuration for CirrusSearch to instantly index new Wikidata items

https://gerrit.wikimedia.org/r/413899

Mentioned in SAL (#wikimedia-operations) [2018-03-08T00:28:12Z] <thcipriani@tin> Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:413899|Add configuration for CirrusSearch to instantly index new Wikidata items]] T183053 (duration: 01m 15s)

Smalyshev closed this task as Resolved. Mar 8 2018, 12:33 AM

The new items should now appear faster.