
mwgrep and "insource:" search is missing lots of pages in its index
Closed, Resolved, Public

Description

Several times today I have failed to find pages with mwgrep and/or with on-wiki insource: searches, even though the regular search index does find them.

For example, the following were not found by mwgrep, nor by on-wiki insource: searches for the cited pieces of text:

https://pl.wikipedia.org/w/index.php?search=insource%3ARTRC&profile=advanced&ns8=1 matches MediaWiki:Gadgets-definition and MediaWiki:Gadget-RTRC, but not MediaWiki:Gadget-RTRC.js.

The same search without insource: does return all three.

It seems there is a large gap in the source index for JavaScript (and maybe also CSS) content type pages.

Event Timeline

The example of pl.wikipedia.org seems to have resolved itself. Presumably a (partial?) reindex has occurred since then for at least that wiki.

Keeping this open for now in case this is a general issue; I've seen it more recently as well. It seems recently updated pages are indexed fine, but for insource:, most older pages on wikis were apparently never re-indexed to support it.

debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.
debt subscribed.

We'll take a look to see if it truly has been resolved.

I think there may be multiple issues here:

  • Index not up to date: this could explain why you see results now; a sanitize process is now running continuously on all wikis and has probably fixed the issues on pl.wikipedia.org.
  • insource (regex with insource://) and mwgrep use max_inspect:10000 to stop processing pages once more than 10,000 have been inspected. We should switch to a timeout-based limit; this would give us a chance to warn the user that the results shown are partial (see the sketch after this comment).

EDIT: reading more carefully, if insource:RTRC did not find all pages but a plain RTRC search did, then it's neither of the two problems I mentioned above; maybe it's related to https://gerrit.wikimedia.org/r/#/c/261323/ ?

I know that @EBernhardson worked on similar issues, but I can't remember all the details, nor whether a reindex was scheduled after this patch was merged.
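
To make the second bullet above concrete, here is a minimal Python sketch (all names and numbers illustrative, not CirrusSearch code) of the difference between a count-based cutoff, which stops silently, and a time-based cutoff, which can report that the results are partial:

```python
# Illustrative sketch only: contrasts a max_inspect-style cutoff with a
# timeout-based cutoff that can flag partial results. Names are assumptions.
import time

MAX_INSPECT = 10000   # documents inspected before giving up (current behaviour)
TIME_BUDGET = 2.0     # seconds allowed for the scan (proposed behaviour)

def scan_with_count_limit(docs, matches_regex):
    """Stops silently after MAX_INSPECT documents; the caller cannot tell
    whether the result set is complete."""
    hits = []
    for i, doc in enumerate(docs):
        if i >= MAX_INSPECT:
            break
        if matches_regex(doc):
            hits.append(doc)
    return hits

def scan_with_time_limit(docs, matches_regex):
    """Stops when the time budget is exhausted and returns a flag so the
    user can be warned that the results are partial."""
    hits, partial = [], False
    deadline = time.monotonic() + TIME_BUDGET
    for doc in docs:
        if time.monotonic() > deadline:
            partial = True
            break
        if matches_regex(doc):
            hits.append(doc)
    return hits, partial
```

The second variant is what a timeout-based limit would enable: the search can still stop early under load, but the caller at least learns that it did.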

debt triaged this task as Medium priority. Aug 11 2016, 5:13 PM

Let's chat next week with @EBernhardson about prior details.

debt added a subscriber: dcausse.

Hi @Krinkle - can you take a look and let us know what you think: whether the fixes we've done so far are enough, or whether there is more to do before we can close this out?

As a reminder, we only process the first 10,000 pages.

Thanks!

As a reminder, we only process the first 10,000 pages.

What does this mean? Most wikis have more than 10,000 pages. Does this mean that if 1 page on the whole wiki contains a word, I would wrongly assume no page contains this word if I get 0 results?

Or do you mean it only processes the first 10,000 matches for a search? (Meaning, if I get 10,000 results and change them all, I should re-run the search afterward before assuming the word no longer appears.)

I need to know that if I search a wiki's MediaWiki namespace for a deprecated JavaScript method using insource: and get 0 results, or get 5 results and fix those, I can confidently assume the method no longer appears on that wiki.


Regarding confirming whether this bug is fixed: I can't really know, as I usually don't know in advance which pages I'm trying to find. I only ran into this by accident, when I assumed a wiki no longer had any matches for a CSS class or JavaScript method and later found one myself while browsing around.

I need to know that if I search a wiki's MediaWiki namespace for a deprecated JavaScript method using insource: and get 0 results, or get 5 results and fix those, I can confidently assume the method no longer appears on that wiki.

Indeed, T134157 has been created to address this problem by using a timeout-based approach and adding a warning when Elasticsearch was unable to scan all the pages.
The 10,000 limit is misleading, but in short: the regex search uses a two-pass technique. The first pass is an approximation made using an inverted index of trigrams; the second pass actually loads the text and runs the regex.
The 10,000 limit applies to the second pass.
This means that if the extracted trigrams cannot filter the candidate set down to fewer than 10,000 docs, it's very likely that you'll get partial results.
Please see my comment on T106685 for more details on this trigram extraction.
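
A rough Python illustration of that two-pass idea (an assumption-laden sketch, not the actual wikimedia-extra plugin code): trigrams are extracted from the search string and used as a cheap prefilter, and the real regex is run only on the surviving candidates. If the trigrams are too common to narrow the set below the 10,000 limit, the expensive second pass gets cut off and results are partial.

```python
# Sketch of the two-pass regex search idea; not the real implementation.
import re

def trigrams(s):
    """All overlapping 3-character substrings of a string, lowercased."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def two_pass_search(pages, pattern_literal):
    grams = trigrams(pattern_literal)
    # Pass 1: cheap approximation. A page survives only if it contains every
    # trigram of the literal. (In the real system this is an inverted-index
    # lookup, not a scan over page text.)
    candidates = [p for p in pages if grams <= trigrams(p["source_text"])]
    # Pass 2: load the text and run the actual regex on the candidates only.
    # This is the pass the 10,000-document limit applies to.
    rx = re.compile(re.escape(pattern_literal), re.IGNORECASE)
    return [p["title"] for p in candidates if rx.search(p["source_text"])]
```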

Change 306933 had a related patch set uploaded (by EBernhardson):
Saneitizer: Check if revisions are up to date

https://gerrit.wikimedia.org/r/306933

The linked patch will help with the pages mentioned above that needed a null edit to be reindexed: it will check all pages every two weeks to ensure the latest version is in the index. Ideally we keep chipping away at the bugs that cause things to fail indexing in the first place, but this is a reasonable fallback for fixing problems.
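
A minimal sketch of the idea, using hypothetical helper names rather than the real Saneitizer API: compare the latest revision of each page with the revision recorded in the search index, and queue a reindex whenever they differ or the page is missing from the index.

```python
# Hypothetical sketch of the revision-freshness check; names are assumptions.
def check_batch(page_ids, get_db_latest_rev, get_indexed_rev, queue_reindex):
    """Walk a batch of page ids and re-queue any page whose indexed revision
    is missing or stale. Run continuously, this visits every page on the
    wiki roughly once every two weeks."""
    for page_id in page_ids:
        db_rev = get_db_latest_rev(page_id)     # latest revision in the database
        indexed_rev = get_indexed_rev(page_id)  # revision recorded in the search index
        if indexed_rev is None or indexed_rev < db_rev:
            queue_reindex(page_id)
```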

Change 307652 had a related patch set uploaded (by EBernhardson):
Report partial result from mwgrep

https://gerrit.wikimedia.org/r/307652

As a starting point, the above patch adjusts mwgrep to limit itself strictly by execution time rather than by the number of inspected documents. The number-of-inspected-documents limit was already a bit odd anyway, because it's the number inspected per shard, and mwgrep queries ~9k shards. We can consider applying the same update to insource queries, but I would like to monitor the stats from rolling out the mwgrep change first, to get a feel for what kind of resource impact we can expect.

The patch additionally adds some warning text when an early exit happens, so the user at least knows they received partial results.
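
For illustration only — this is not the mwgrep script itself, and the endpoint, index name, field names, and placeholder query below are assumptions — the same approach can be expressed with a standard Elasticsearch request timeout plus a check of the timed_out and _shards fields in the response to detect and report partial results:

```python
from elasticsearch import Elasticsearch  # elasticsearch-py client

# Placeholder endpoint and index pattern; the real cluster/index names differ.
es = Elasticsearch(["https://search.example.org:9243"])

body = {
    # Time budget for the whole request, replacing the max_inspect count limit.
    "timeout": "20s",
    "query": {
        # Placeholder query: mwgrep actually builds a source-regex filter here.
        "match_phrase": {"source_text": "mw.util.wikiGetlink"}
    },
    "_source": ["namespace", "title"],
    "size": 100,
}

resp = es.search(index="*_general", body=body)

# If any shard timed out or failed, the hit list below is incomplete.
shards = resp["_shards"]
if resp.get("timed_out") or shards.get("failed", 0) > 0:
    print("warning: some shards exited early; results may be partial")

for hit in resp["hits"]["hits"]:
    src = hit["_source"]
    print(f'{src.get("namespace")}:{src.get("title")}')
```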

Change 306933 merged by jenkins-bot:
Saneitizer: Check if revisions are up to date

https://gerrit.wikimedia.org/r/306933

Change 307652 merged by Gehel:
Report partial result from mwgrep

https://gerrit.wikimedia.org/r/307652

Will evaluate moving the timeout / max-documents changes applied to mwgrep over to insource queries in T134157.