
mwgrep and "insource:" search is missing lots of pages in its index
Closed, Resolved, Public

Description

Several times today I have failed to find pages with mwgrep and/or with on-wiki insource: searches, even though the regular search index does find them.

For example, the following were not found by mwgrep, nor by on-wiki insource: searches for the cited pieces of text:

https://pl.wikipedia.org/w/index.php?search=insource%3ARTRC&profile=advanced&ns8=1 matches MediaWiki:Gadgets-definition and MediaWiki:Gadget-RTRC, but not MediaWiki:Gadget-RTRC.js.

The same search without insource: does return all three.

It seems there is a large gap in the source index for JavaScript (and maybe also CSS) content type pages.

Event Timeline

The example of pl.wikipedia.org seems to have resolved itself. Presumably a (partial?) reindex has occurred since then for at least that wiki.

Keeping this open for now in case this is a general issue; I've seen it more recently as well. It seems recently updated pages are indexed fine, but for insource:, most older pages on wikis were apparently never re-indexed to support it.

debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.
debt subscribed.

We'll take a look to see if it truly has been resolved.

I think there may be multiple issues here:

  • Index not up to date: this could explain why you see results now; a sanitize process is now running continuously on all wikis and has probably fixed the issues on pl.wikipedia.org.
  • insource (regex with insource://) and mwgrep use max_inspect:10000 to stop processing pages once more than 10,000 have been inspected. We should switch to a timeout-based limit; this would give us a chance to warn the user that the results shown are partial (see the sketch after this comment).

EDIT: reading more carefully, if insource:RTRC did not find all pages but a plain RTRC search did, then it's neither of the two problems I mentioned above; maybe it's related to https://gerrit.wikimedia.org/r/#/c/261323/ ?

I know that @EBernhardson worked on similar issues, but I can't remember all the details, nor whether a reindex was scheduled after this patch was merged.
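
To make the second bullet above concrete, here is a minimal Python sketch (all names and numbers illustrative, not CirrusSearch code) of the difference between a count-based cutoff, which stops silently, and a time-based cutoff, which can report that the results are partial:

```python
# Illustrative sketch only: contrasts a max_inspect-style cutoff with a
# timeout-based cutoff that can flag partial results. Names are assumptions.
import time

MAX_INSPECT = 10000   # documents inspected before giving up (current behaviour)
TIME_BUDGET = 2.0     # seconds allowed for the scan (proposed behaviour)

def scan_with_count_limit(docs, matches_regex):
    """Stops silently after MAX_INSPECT documents; the caller cannot tell
    whether the result set is complete."""
    hits = []
    for i, doc in enumerate(docs):
        if i >= MAX_INSPECT:
            break
        if matches_regex(doc):
            hits.append(doc)
    return hits

def scan_with_time_limit(docs, matches_regex):
    """Stops when the time budget is exhausted and returns a flag so the
    user can be warned that the results are partial."""
    hits, partial = [], False
    deadline = time.monotonic() + TIME_BUDGET
    for doc in docs:
        if time.monotonic() > deadline:
            partial = True
            break
        if matches_regex(doc):
            hits.append(doc)
    return hits, partial
```

The second variant is what a timeout-based limit would enable: the search can still stop early under load, but the caller at least learns that it did.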

debt triaged this task as Medium priority. Aug 11 2016, 5:13 PM

Let's chat next week with @EBernhardson about prior details.

debt added a subscriber: dcausse.

Hi @Krinkle - can you take a look and let us know what you think: whether the fixes we've done so far are enough, or whether there is more to do before we can close this out?

As a reminder, we only process the first 10,000 pages.

Thanks!

As a reminder, we only process the first 10,000 pages.

What does this mean? Most wikis have more than 10,000 pages. Does this mean that if 1 page on the whole wiki contains a word, I would wrongly assume no page contains this word if I get 0 results?

Or do you mean it only processes the first 10,000 matches for a search? (Meaning, if I get 10,000 results and change them all, I should re-run the search afterward before assuming the word no longer appears.)

I need to know that if I search a wiki's MediaWiki namespace for a deprecated JavaScript method using insource: and get 0 results, or get 5 results and fix those, I can confidently assume the method no longer appears on that wiki.


Regarding confirming whether this bug is fixed: I can't really know, as I usually don't know in advance which pages I'm trying to find. I only ran into this by accident, when I assumed a wiki no longer had any matches for a CSS class or JavaScript method and later found one myself while browsing around.

I need to know that if I search a wiki's MediaWiki namespace for a deprecated JavaScript method using insource: and get 0 results, or get 5 results and fix those, I can confidently assume the method no longer appears on that wiki.

Indeed, T134157 has been created to address this problem by using a timeout-based approach and adding a warning when Elasticsearch was unable to scan all the pages.
The 10,000 limit is misleading, but in short: the regex search uses a two-pass technique. The first pass is an approximation made using an inverted index of trigrams; the second pass actually loads the text and runs the regex.
The 10,000 limit applies to the second pass.
This means that if the extracted trigrams cannot filter the candidate set down to fewer than 10,000 docs, it's very likely that you'll get partial results.
Please see my comment on T106685 for more details on this trigram extraction.
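
A rough Python illustration of that two-pass idea (an assumption-laden sketch, not the actual wikimedia-extra plugin code): trigrams are extracted from the search string and used as a cheap prefilter, and the real regex is run only on the surviving candidates. If the trigrams are too common to narrow the set below the 10,000 limit, the expensive second pass gets cut off and results are partial.

```python
# Sketch of the two-pass regex search idea; not the real implementation.
import re

def trigrams(s):
    """All overlapping 3-character substrings of a string, lowercased."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def two_pass_search(pages, pattern_literal):
    grams = trigrams(pattern_literal)
    # Pass 1: cheap approximation. A page survives only if it contains every
    # trigram of the literal. (In the real system this is an inverted-index
    # lookup, not a scan over page text.)
    candidates = [p for p in pages if grams <= trigrams(p["source_text"])]
    # Pass 2: load the text and run the actual regex on the candidates only.
    # This is the pass the 10,000-document limit applies to.
    rx = re.compile(re.escape(pattern_literal), re.IGNORECASE)
    return [p["title"] for p in candidates if rx.search(p["source_text"])]
```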

Change 306933 had a related patch set uploaded (by EBernhardson):
Saneitizer: Check if revisions are up to date

https://gerrit.wikimedia.org/r/306933

The linked patch will help with the pages mentioned above that needed a null edit to be reindexed: it will check all pages every two weeks to ensure the latest version is in the index. Ideally we keep chipping away at the bugs that cause things to fail indexing in the first place, but this is a reasonable fallback for fixing problems.
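
A minimal sketch of the idea, using hypothetical helper names rather than the real Saneitizer API: compare the latest revision of each page with the revision recorded in the search index, and queue a reindex whenever they differ or the page is missing from the index.

```python
# Hypothetical sketch of the revision-freshness check; names are assumptions.
def check_batch(page_ids, get_db_latest_rev, get_indexed_rev, queue_reindex):
    """Walk a batch of page ids and re-queue any page whose indexed revision
    is missing or stale. Run continuously, this visits every page on the
    wiki roughly once every two weeks."""
    for page_id in page_ids:
        db_rev = get_db_latest_rev(page_id)     # latest revision in the database
        indexed_rev = get_indexed_rev(page_id)  # revision recorded in the search index
        if indexed_rev is None or indexed_rev < db_rev:
            queue_reindex(page_id)
```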

Change 307652 had a related patch set uploaded (by EBernhardson):
Report partial result from mwgrep

https://gerrit.wikimedia.org/r/307652

As a starting point, the above patch adjusts mwgrep to limit itself strictly by execution time rather than by the number of inspected documents. The number-of-inspected-documents limit was already a bit odd anyway, because it's the number inspected per shard, and mwgrep queries ~9k shards. We can consider applying the same update to insource queries, but I would like to monitor the stats from rolling out the mwgrep change first, to get a feel for what kind of resource impact we can expect.

The patch additionally adds some warning text when an early exit happens, so the user at least knows they received partial results.
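
For illustration only — this is not the mwgrep script itself, and the endpoint, index name, field names, and placeholder query below are assumptions — the same approach can be expressed with a standard Elasticsearch request timeout plus a check of the timed_out and _shards fields in the response to detect and report partial results:

```python
from elasticsearch import Elasticsearch  # elasticsearch-py client

# Placeholder endpoint and index pattern; the real cluster/index names differ.
es = Elasticsearch(["https://search.example.org:9243"])

body = {
    # Time budget for the whole request, replacing the max_inspect count limit.
    "timeout": "20s",
    "query": {
        # Placeholder query: mwgrep actually builds a source-regex filter here.
        "match_phrase": {"source_text": "mw.util.wikiGetlink"}
    },
    "_source": ["namespace", "title"],
    "size": 100,
}

resp = es.search(index="*_general", body=body)

# If any shard timed out or failed, the hit list below is incomplete.
shards = resp["_shards"]
if resp.get("timed_out") or shards.get("failed", 0) > 0:
    print("warning: some shards exited early; results may be partial")

for hit in resp["hits"]["hits"]:
    src = hit["_source"]
    print(f'{src.get("namespace")}:{src.get("title")}')
```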

Change 306933 merged by jenkins-bot:
Saneitizer: Check if revisions are up to date

https://gerrit.wikimedia.org/r/306933

Change 307652 merged by Gehel:
Report partial result from mwgrep

https://gerrit.wikimedia.org/r/307652

Will evaluate moving the timeout / max-documents changes applied to mwgrep over to insource queries in T134157.