
Slow indexing of Lexemes for wbsearchentities
Closed, Resolved | Public | BUG REPORT

Description

Steps to Reproduce:

Create a new lexeme.

Attempt to use the new lexeme somewhere else, e.g., via P5238.

Actual Results:

The new lexeme does not show up immediately in the drop-down menu, which is generated from an API wbsearchentities response. After some minutes the new lexeme apparently does show up.

Expected Results:

The new lexeme is indexed within seconds and is available in the drop-down menu.

Event Timeline

Fnielsen renamed this task from Slow indexing for to Slow indexing for wbsearchentities. Dec 10 2019, 12:34 PM

This seems not to be confined to lexemes but also the case for Q-items.

Could you specify which search string you are using?
wbsearchentities should be using the mysql database when searching by entity ID, where the lag should be relatively small.
On the other hand, the search index will take some time to update (job queue lag + elasticsearch refresh interval), so searches based on labels/aliases may not react immediately after an entity is added.
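To tell the two paths apart when reproducing, one could issue the same wbsearchentities request once with the entity ID and once with a label. The following is a minimal sketch that only builds the request URLs. The helper name `build_search_url` and the example label are hypothetical, and the endpoint is assumed to be the public Wikidata API.

```python
from urllib.parse import urlencode

# Assumed endpoint: the public Wikidata action API.
API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def build_search_url(search: str, entity_type: str = "lexeme",
                     language: str = "en", limit: int = 7) -> str:
    """Build a wbsearchentities URL (hypothetical helper, no request is sent)."""
    params = {
        "action": "wbsearchentities",
        "search": search,        # an entity ID such as "L270066", or a label
        "language": language,
        "type": entity_type,     # e.g. "item", "lexeme", "form"
        "limit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


# Search by ID (expected to be served quickly) versus by label
# (depends on the search index having caught up).
id_url = build_search_url("L270066")
label_url = build_search_url("example lemma")  # placeholder label
```

Comparing how soon each URL starts returning the new entity gives a rough idea of which lookup path is lagging.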

Fnielsen claimed this task.

Both the entity IDs and the label searches were slow. The search is now fast again, so I can make it "Resolved" (or something else?). I suppose it could have been a temporary lag in elasticsearch.

@Fnielsen thanks for letting us know. If search by entity ID is slow again, please re-open this issue with a link to the entity you created so that we can correlate it with the metrics we monitor.
For label search, we are currently experiencing recurrent lag on the job queue that could make it rather bad (several minutes per T224425).

The indexing of lexemes is now slow again. For instance, during creation of L270066, the form L270066-F3 is not available for use.

Gehel subscribed.

@dcausse could you have another look into this? Looking at job queue stats, the rate of the various Cirrus jobs seems stable, but I'm not familiar with all the details.

I think we've talked once or twice before about a time-to-update metric that could identify these issues. We have a document hinting process that informs the DataSender how to ship things. We could add an additional hint with a timestamp on documents created by a new revision and, when it is provided, report the difference between "now" and that timestamp. This would give us critical information, such as when a regression occurred.
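The idea could be sketched roughly as follows. This is only an illustration in Python, not CirrusSearch code: the hint name `revision_timestamp` and the function `time_to_update` are hypothetical, and the real hinting mechanism may look quite different.

```python
from typing import Optional

# Hypothetical hint name; the real document hints may use another key.
REVISION_TIMESTAMP_HINT = "revision_timestamp"


def time_to_update(hints: dict, now: float) -> Optional[float]:
    """Return seconds between the revision timestamp hint and `now`.

    Returns None when the hint is absent, i.e. the document was not
    created by a new revision and nothing should be reported.
    """
    created = hints.get(REVISION_TIMESTAMP_HINT)
    if created is None:
        return None
    # Clamp at zero so clock skew cannot yield a negative lag.
    return max(0.0, now - created)


# A document hinted at t=100s and shipped at t=130s took 30s to update.
lag = time_to_update({REVISION_TIMESTAMP_HINT: 100.0}, now=130.0)
```

Reporting this value at the moment the document is shipped would make a regression show up as a jump in the metric.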

The related task (T224425), where the job queue would stop processing our jobs for some number of minutes and then start again later, looks to have been resolved last week. I'm optimistic that the resolution of that task means this is also fixed, but I have yet to identify any hard data that could show it was bad before and that it's measurably better today.

The job execution rates do look good this week and I'm not seeing any pauses, but unfortunately the historical graphite data doesn't have enough precision to compare against two weeks ago.[1] We could likely see some of the pauses at 5-minute resolution, although the rate might not drop as clearly to 0. I'm certain we won't see much of these pauses at 15-minute resolution.

[1] From profile::graphite::base:

Retain aggregated data at a one-minute resolution for one week; at
five-minute resolution for two weeks; at 15-minute resolution for
one month; one-hour resolution for one year, and 1d for five years.
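To illustrate why a short pause fades at coarser resolutions, here is a small sketch with made-up numbers. It assumes the aggregation averages the one-minute points. The rate of 100 jobs per minute and the 4-minute pause are invented for the example.

```python
def downsample(series, factor):
    """Average consecutive groups of `factor` points (assumed aggregation)."""
    return [
        sum(series[i:i + factor]) / factor
        for i in range(0, len(series) - factor + 1, factor)
    ]


# 30 minutes of a made-up job rate with a 4-minute pause at minutes 12-15.
per_minute = [100] * 30
per_minute[12:16] = [0, 0, 0, 0]

five_minute = downsample(per_minute, 5)
fifteen_minute = downsample(per_minute, 15)

# The lowest point at each resolution shows how visible the pause remains.
lowest = (min(per_minute), min(five_minute), min(fifteen_minute))
```

With these numbers the lowest point is 0 at one-minute resolution, 40 at five-minute resolution, and 80 at 15-minute resolution, so the pause no longer reads as a full stop once the data is aggregated.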

Lydia_Pintscher renamed this task from Slow indexing for wbsearchentities to Slow indexing of Lexemes for wbsearchentities. Jul 16 2020, 12:14 PM
Lydia_Pintscher added a subscriber: Nikki.

Now that lexicographical data is becoming more and more popular we're seeing more requests about this. Would be <3 if we can solve it as it's quite frustrating for the editors.

The same doesn't happen with items: they can be used immediately.

Items used to have the same problem. Looking back through the code history, it looks like we added an 'instant index new' option to CirrusSearch, but that still wasn't sufficient to fix the problem. Some other workaround was put in place: the related cirrus commit says "The wikidata results are now augmented by the sql database, meaning instant indexing no longer necessary there". Can the sql augmenting be enabled for lexemes as well? I'm not sure where/how that is done.

Linking to senses also seems to work immediately - I was able to create a statement linking to one of the lexeme's senses even though I can't link to the lexeme itself yet.

Sense lookups are not supported by WikibaseLexemeCirrusSearch, so I suppose that they use the wb_term mysql table. I thought that we combined mysql+elastic lookups so that mysql can respond when an ID is searched, but looking at WikibaseLexemeCirrusSearch, this was not implemented there. It should take only a couple of lines to add such support.
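As a rough illustration of combining the two lookups (a Python sketch, not the actual extension code), the search term is checked against an ID pattern first. If it matches, the database is asked directly and its answer is placed ahead of whatever the search index returns. The function names, the callables, and the ID pattern are all assumptions made for the example.

```python
import re
from typing import Callable, List, Optional

# Assumed ID shapes: lexemes (L123), forms (L123-F1), senses (L123-S1).
ENTITY_ID = re.compile(r"^L\d+(-[FS]\d+)?$")


def combined_search(
    term: str,
    db_lookup: Callable[[str], Optional[str]],
    elastic_search: Callable[[str], List[str]],
) -> List[str]:
    """Merge an exact-ID database hit with search index results."""
    results: List[str] = []
    normalized = term.strip().upper()
    if ENTITY_ID.match(normalized):
        hit = db_lookup(normalized)  # low lag: reads the primary store
        if hit is not None:
            results.append(hit)
    for entity_id in elastic_search(term):  # may lag behind new edits
        if entity_id not in results:        # avoid duplicating the DB hit
            results.append(entity_id)
    return results


# Stand-ins for the two backends: the DB knows the new lexeme,
# while the search index has not caught up yet.
known_ids = {"L270066"}
lookup = lambda entity_id: entity_id if entity_id in known_ids else None
stale_index = lambda term: []

found = combined_search("L270066", lookup, stale_index)
```

In this sketch the freshly created lexeme is found by ID even though the index returns nothing, which matches the behavior described for items.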

Change 615404 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] Combine DB lookups with elastic

https://gerrit.wikimedia.org/r/615404

dcausse triaged this task as Medium priority. Jul 22 2020, 7:19 AM
dcausse moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.

Change 615404 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexemeCirrusSearch@master] Combine DB lookups with elastic

https://gerrit.wikimedia.org/r/615404

I can confirm that the issue is no longer present: one can enter the L-identifier and there is no longer a delay in what the popup displays.