Placeholder text prompt for a caption shouldn't be inserted into the search index
Closed, Resolved · Public

Description

Searching Commons for "Add a one-line explanation of what this file represents" yields a lot of matches that have a caption, but not an English caption. These matches are unexpected and should not occur.

https://commons.wikimedia.org/w/index.php?search=%22Add+a+one-line+explanation+of+what+this+file+represents%22&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%22namespaces%22%3A%5B6%5D%7D&ns6=1

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 12 2019, 9:51 PM
Jdforrester-WMF renamed this task from "Wikimedia Search finds Structured Data help text unexpectedly (Add a one-line explanation of what this file represents)" to "Placeholder text prompt for a caption shouldn't be inserted into the search index". · Jan 12 2019, 9:55 PM
Restricted Application added a project: Discovery-Search. · Jan 12 2019, 9:55 PM

To be excluded they need to match one of these selectors, or we need to update the selector list: https://github.com/wikimedia/mediawiki/blob/master/includes/content/WikiTextStructure.php#L29
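
To illustrate the idea of selector-based exclusion, here is a minimal Python sketch. It is not the WikiTextStructure code (which is PHP); the class name `example-placeholder` and the helper names are hypothetical and chosen only for this example. It drops the text of any element carrying an excluded class before the text would be handed to an indexer.

```python
from html.parser import HTMLParser

# Hypothetical class name, for illustration only. The real selector
# list lives in WikiTextStructure.php (linked above).
EXCLUDED_CLASSES = {"example-placeholder"}

# Void elements never get a closing tag, so they must not affect depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IndexableText(HTMLParser):
    """Collects text, skipping everything inside excluded elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside an excluded element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.skip_depth or classes & EXCLUDED_CLASSES:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def indexable_text(html: str) -> str:
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

With this sketch, a placeholder prompt wrapped in an excluded element would not reach the index, while a real caption next to it would.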

EBjune triaged this task as Normal priority. · Jan 17 2019, 6:08 PM

Hmm, this feels like it's an artefact of our work-around to T190066.

Change 485077 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[mediawiki/core@master] WikiTextStructure: Add an exclusion from WikibaseMediaInfo

https://gerrit.wikimedia.org/r/485077

Once the patch is deployed, there are a few options for removing the already-indexed content:

  • As the pages get edited, they will be reindexed and the extra strings will be removed.
  • There is an automated process that reindexes 1/8th of the pages every week. After 8 weeks, all pages will have had their indexed content regenerated without anyone doing anything.
  • If we can come up with a list of page IDs, it can be given to the cirrus reindexing script, which will also do the trick. Unfortunately, we are not currently able to collect this list from the Elasticsearch cluster (T213994).
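
The second option's arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the start date used below is an assumption, not a date taken from this task.

```python
from datetime import date, timedelta

# One eighth of the pages is reindexed per week, so a full pass takes 8 weeks.
WEEKS_PER_CYCLE = 8


def fraction_reindexed(weeks_elapsed: int) -> float:
    """Share of pages reindexed after the given number of weeks (capped at 1)."""
    return min(max(weeks_elapsed, 0), WEEKS_PER_CYCLE) / WEEKS_PER_CYCLE


def cycle_end(cycle_start: date) -> date:
    """Date by which every page has been reindexed once."""
    return cycle_start + timedelta(weeks=WEEKS_PER_CYCLE)


# Hypothetical start date, for illustration only.
print(cycle_end(date(2019, 1, 23)))  # 2019-03-20
print(fraction_reindexed(4))         # 0.5
```

This assumes the weekly slices are disjoint and the process runs without interruption; if a run is skipped or broken, the end date moves out accordingly.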

Will get deployed next week. Search index may take a little time to update.

I think it's probably sufficient to wait.

Change 485077 merged by jenkins-bot:
[mediawiki/core@master] WikiTextStructure: Add an exclusion from WikibaseMediaInfo

https://gerrit.wikimedia.org/r/485077

Update as of 2019-02-06: Numbers of results returned are trending downwards, but aren't at 0 yet; the full cycle of eight weeks will be over on 2019-03-20, so will re-check around then.

Noted. I've set a calendar reminder ;)

greg added a project: Multimedia. · Mar 7 2019, 10:59 PM
Ramsey-WMF moved this task from Untriaged to Tracking on the Multimedia board. · Mar 8 2019, 2:53 AM

As of today we've still got about 9800 files remaining with this problem. Do we need to do anything else?

Well, the first one I found actually is that text – the user must have copy-pasted it in as the value: https://commons.wikimedia.org/w/index.php?title=File:Haemocytometer_side-ml.svg&diff=341229381&oldid=341229378&diffmode=source

The process that reindexes all pages was initially broken; it was fixed in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CirrusSearch/+/488967/, which was merged on Feb 11, giving a rough deploy date of Feb 14. As long as the number continues declining (9870 results as of today, March 26th), we should simply keep waiting. A full loop over the large wikis takes 8 to 10 weeks.

Taking a quick guess with insource:"Add a one-line explanation of what this file represents", it seems only two pages actually have the text copied into their content.

There were fewer than 10k results, which is the most I can pull in a single search, so I pulled them and gave the IDs to the reindexer script. It seems we still have two pages that have the value copied into them, and two pages that somehow aren't fixed. I can look into it if it's important.
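
For anyone wanting to repeat the collection step, a rough Python sketch using the public MediaWiki search API (`action=query&list=search`) might look like the following. It is not the script that was actually used here; the helper names and the User-Agent string are made up for illustration, and how the IDs are handed to the reindexer is not covered.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"
PHRASE = '"Add a one-line explanation of what this file represents"'


def search_params(offset: int = 0, limit: int = 50) -> dict:
    """Query parameters for one page of File-namespace search results."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": PHRASE,
        "srnamespace": "6",  # File namespace
        "srlimit": str(limit),
        "sroffset": str(offset),
        "format": "json",
    }


def page_ids(response: dict) -> list:
    """Extract page IDs from a decoded search API response."""
    return [hit["pageid"] for hit in response.get("query", {}).get("search", [])]


def fetch_page(offset: int = 0, limit: int = 50) -> dict:
    """Fetch one page of results (performs a network request)."""
    url = API_URL + "?" + urllib.parse.urlencode(search_params(offset, limit))
    # Hypothetical User-Agent value; replace with your own contact details.
    request = urllib.request.Request(
        url, headers={"User-Agent": "placeholder-check-sketch/0.1"})
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Looping `fetch_page` while the response contains a `continue` entry, and passing its `sroffset` back in, walks through the result set page by page.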

This now looks entirely fixed. Thank you, Erik!

Ramsey-WMF closed this task as Resolved. · Apr 3 2019, 2:31 AM

Tested and fixed as of Erik's last update. Thanks!