|operations/mediawiki-config : master||Add configuration for CirrusSearch to instantly index new Wikidata items|
|mediawiki/extensions/CirrusSearch : master||Allow some wikis to instantly index newly created articles|
|operations/mediawiki-config : master||Lower ElasticSearch index refresh interval for Wikidata to 5s|
This is annoying because, for heavy editors, a common workflow is the following:
- go to item
- try to add statement that links to another item
- notice that other item does not exist yet
- create the other item
- go back to first item and make statement with newly created item
If the new item does not show up in the item selector relatively quickly that is pretty annoying for them.
If a large majority of such use cases involve searching for the entity id (QXXX) of the newly created item, we can perform an additional db match to compensate for the lag of the search index.
It's what we do for normal wikis: a db match is run in addition to the query sent to the search index.
If users search for the label or aliases of the newly created item, then this solution is pointless.
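A minimal sketch of the "db match in addition to the search index" idea, under stated assumptions: `search_index` and `db_lookup` are hypothetical callables standing in for the real search backend and the primary database, and the entity-id pattern is a simplification. It is not how CirrusSearch or Wikibase implement this.

```python
import re

# Simplified pattern for an item id such as Q42 (assumption for this sketch).
ENTITY_ID = re.compile(r"^[Qq][1-9]\d*$")


def search_entities(query, search_index, db_lookup):
    """Merge (possibly stale) index hits with an exact-id database match.

    search_index: hypothetical callable(query) -> list of entity ids
    db_lookup:    hypothetical callable(entity_id) -> True if the id exists
    """
    results = list(search_index(query))
    term = query.strip()
    if ENTITY_ID.match(term):
        entity_id = term.upper()
        # The index may not have caught up yet, so ask the database directly.
        if entity_id not in results and db_lookup(entity_id):
            results.insert(0, entity_id)
    return results


# Toy data: Q999999 was just created and is not in the index yet.
indexed = {"Q42": "example item"}
database = {"Q42", "Q999999"}


def index_search(query):
    q = query.strip().lower()
    return [eid for eid, label in indexed.items() if q in (eid.lower(), label)]


by_id = search_entities("Q999999", index_search, database.__contains__)
by_label = search_entities("brand new label", index_search, database.__contains__)
```

Here `by_id` finds the new item through the database, while `by_label` stays empty, which is exactly the limitation described above: the fallback only helps when the user types the id.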
For item creation on Wikidata, we probably want the delay to be as small as possible. We should look at how many items are created on a daily basis, to see how much load this might put on the servers (if we force a refresh). We should also look at whether a human or a bot created the item: maybe force a refresh for human-created items but not for bot-created ones? Or maybe set the flag so that new items are checked for every 5 seconds.
The first graph on https://grafana.wikimedia.org/dashboard/db/wikidata-datamodel?refresh=30m&orgId=1 shows the number of new items created over time.
For this particular problem, bots could indeed be excluded. They make up the biggest part of new page creations on Wikidata (see https://stats.wikimedia.org/v2/#/wikidata.org/contributing/new-pages if you split by editor type).
So item creation rate is about 85k per day, or very close to one per second. Bots seem to dominate that though, so for real users it will be lower. Also, some of those are probably tools like QuickStatements for which it also could be fine to have the regular delay - maybe only force sync for those that come from browser pages?
Inspecting RC page, I see about 2-3 new items every second now, from non-bot accounts (it varies, but that's a common case), some of them have script tags but most don't have any special tags or markers.
It seems there are a couple options here, my thoughts:
Reduce the default refresh interval
By default we use a 30 second refresh interval for all wikis. This means that 30 seconds' worth of updates get bundled together into a single update, and updates are not searchable until they have been refreshed. While this may not be particularly important at the level of an individual wiki, across 9k shards in the cluster it saves us considerable IO. We could potentially reduce the refresh interval for wikidata only, to 5 seconds or maybe even 1 second (the elasticsearch default). Estimating the effect on the cluster is difficult, but my gut feeling is that a 5 second refresh interval on only the 21 wikidatawiki_content shards would probably go unnoticed.
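For reference, a hedged sketch of what changing the interval looks like at the elasticsearch level, using the index settings API (`PUT /<index>/_settings` with `index.refresh_interval`). The base URL is a placeholder and the request is only built, not sent. In production this setting would presumably be managed through the CirrusSearch configuration rather than by hand.

```python
import json
from urllib.request import Request


def refresh_interval_request(base_url, index, interval):
    """Build (but do not send) a request that sets index.refresh_interval."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# "http://localhost:9200" is a placeholder host, not the real cluster address.
req = refresh_interval_request("http://localhost:9200", "wikidatawiki_content", "5s")
```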
It looks like our data collection around refresh is busted; graphite has intermittent data for some reason. Our indexing rate is fairly constant through the day though, so I pulled some numbers directly for 2 minutes' worth of activity (at ~23:20 UTC) and saw 19 refreshes/second (1157/minute) across the full eqiad cluster. The worst case for moving wikidatawiki_content, with 21 shards, from a 30s to a 5s refresh interval would be going from the current 0.7/sec (42/minute) to 4.2/sec (252/minute), or an 18% increase in cluster-wide refreshes. Likely not every shard is refreshed on every opportunity, so the real increase should be lower.
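The worst-case estimate can be reproduced with a few lines of arithmetic. This sketch assumes every shard refreshes on every interval, and uses only the figures quoted above (21 shards, 1157 cluster refreshes per minute).

```python
def refreshes_per_second(shards, interval_seconds):
    """Worst case: every shard refreshes once per interval."""
    return shards / interval_seconds


def relative_increase(shards, old_interval, new_interval, cluster_rate):
    """Extra refreshes per second as a fraction of the current cluster rate."""
    extra = (refreshes_per_second(shards, new_interval)
             - refreshes_per_second(shards, old_interval))
    return extra / cluster_rate


cluster_rate = 1157 / 60  # measured cluster-wide refreshes per second (~19.3)
shards = 21               # wikidatawiki_content shards

increase_5s = relative_increase(shards, 30, 5, cluster_rate)
increase_1s = relative_increase(shards, 30, 1, cluster_rate)

print(f"30s -> 5s: {increase_5s:.0%} more refreshes")
print(f"30s -> 1s: {increase_1s:.0%} more refreshes")
```

The 5s case comes out at roughly 18%, and the 1s case at a bit over 100%, meaning the cluster-wide refresh rate would about double.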
Force refreshes from the cirrus codebase
Rather than increasing the default refresh rate, we could explicitly issue refreshes in the limited cases where we know it's important. Conceptually (I may simply not have thought about it enough), de-bouncing these to keep from issuing 100 refreshes in the same second seems non-trivial. We could certainly throttle the refreshes, but ensuring that one still happens after the throttle time runs out might not be so easy.
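To make the difficulty concrete, here is a small single-process sketch of a throttle with a trailing refresh. The class and method names are hypothetical, and time is passed in explicitly to keep the example deterministic. The awkward part is `tick`: something has to call it after the throttle window closes, and in a setup where each web request is a separate short-lived process there is no obvious place for that to live.

```python
class RefreshCoalescer:
    """Throttle refreshes to one per min_interval, remembering skipped ones."""

    def __init__(self, min_interval, refresh):
        self.min_interval = min_interval
        self.refresh = refresh      # callable that issues the actual refresh
        self.last_refresh = None    # time of the most recent refresh
        self.pending = False        # a request arrived inside the window

    def request(self, now):
        """Called whenever a write wants to become searchable quickly."""
        if self.last_refresh is None or now - self.last_refresh >= self.min_interval:
            self._fire(now)
        else:
            self.pending = True     # throttled; needs a later tick()

    def tick(self, now):
        """Must be called periodically so throttled requests are not lost."""
        if self.pending and now - self.last_refresh >= self.min_interval:
            self._fire(now)

    def _fire(self, now):
        self.pending = False
        self.last_refresh = now
        self.refresh()


calls = []
coalescer = RefreshCoalescer(1.0, lambda: calls.append("refresh"))

# 100 requests within the same second result in a single immediate refresh.
for i in range(100):
    coalescer.request(now=10.0 + i * 0.001)
after_burst = len(calls)

coalescer.tick(now=10.5)  # still inside the throttle window, nothing happens
coalescer.tick(now=11.0)  # window elapsed, the trailing refresh is issued
after_ticks = len(calls)
```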
In general I think I'm in favor of the less complicated, and likely more robust, solution of adjusting the refresh interval for the wikidatawiki_content index down to 5s. Based on the current rates I think this will be reasonable. We can test on the codfw cluster first, which receives all the same updates as eqiad. If 5s isn't fast enough I would have to rethink the forced refreshes, as the worst case of a 1s refresh interval would double the refresh rate across the cluster. That might also be acceptable but is a little harder to guesstimate.
Took some measurements of refresh rate averaged over 5 minutes pre- and post-deployment. Overall it's perhaps a 15% increase in refreshes/minute across the cluster. Disk IO graphs don't show anything particularly interesting. There will certainly be more merge volume as well, but elasticsearch should be able to bundle up the merges enough that these tiny merges are irrelevant compared to the major merges that happen on many-GB segments.
|refresh interval||cluster refresh/min||index refresh/min|
We may still need to look into the special case of newly created pages being indexed from the web request, rather than being punted into the job queue. cirrusSearchLinksUpdatePrioritized, which performs the actual generation of a document and the write to elasticsearch, looks to have a p99 that regularly varies from 30 to 60 seconds. This is on top of however long it takes for the refreshLinksPrioritized job, which is another 20s - 2 minutes for p99. For some fraction of requests, even when the queue is healthy, there will be a couple of minutes between the edit being performed and the two necessary jobs making it through the job queue and being turned into a write in elasticsearch.
We could maybe just send basic info to ES synchronously when saving a new article (not sure if it's a good idea, just putting it out there) and then let the jobs update it with full data. On the minus side, we'll get one extra document write which is then immediately overwritten. On the plus side, at least the Qid and initial label are available near-instantly.
Sending the basic info semi-synchronously (from DeferredUpdates, which will run in the same process as the edit but after closing the connection to the user, so as not to make save timing worse) should be ok. Actually generating a "basic" set instead of the full thing might be more difficult than necessary though; I would be tempted to add a call to Updater::updateFromTitle(...) and let it do the full thing. Since article creations should be relatively rare (compared to the total edit rate), I don't think the extra computation expense outweighs the maintenance cost of keeping an extra bit of code to generate partial updates, including getting the labels from wikidata, without also calculating the rest of it.
I'm not sure exactly what hook that should be attached to though. It looks like we can perhaps hook EditPage::attemptSave:after when $status->value == EditPage::AS_SUCCESS_NEW_ARTICLE, although we might need to play with it to see if it behaves as expected.
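The deferred-update idea can be illustrated with a language-neutral sketch in Python. Everything here is hypothetical (`DeferredQueue`, `handle_page_create`, the `events` log); it only mimics the ordering described above, where the index write runs in the same process as the save but after the response has gone out. It does not represent the MediaWiki DeferredUpdates API.

```python
class DeferredQueue:
    """Collects callbacks to run after the response has been sent."""

    def __init__(self):
        self._callbacks = []

    def add(self, callback):
        self._callbacks.append(callback)

    def run_all(self):
        while self._callbacks:
            callback = self._callbacks.pop(0)
            callback()


def handle_page_create(title, search_index, events):
    """Simulated web request that creates a new page."""
    deferred = DeferredQueue()

    events.append("saved")  # the page is written to the database

    def index_new_page():
        # Stand-in for building the full document and writing it to the index.
        search_index.add(title)
        events.append("indexed")

    # Only newly created pages take this path; other edits use the job queue.
    deferred.add(index_new_page)

    events.append("response sent")  # the user is no longer waiting
    deferred.run_all()              # same process, after the response


events = []
search_index = set()
handle_page_create("Q999999", search_index, events)
```

After the call, `events` records the save, then the response, then the index write, so save timing as seen by the user is unaffected.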