Placeholder text prompt for a caption shouldn't be inserted into the search index
Closed, ResolvedPublic

Description

Searching Commons for "Add a one-line explanation of what this file represents" yields a lot of matches that have a caption, but not an English caption. These matches are unexpected and should not occur.

https://commons.wikimedia.org/w/index.php?search=%22Add+a+one-line+explanation+of+what+this+file+represents%22&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%22namespaces%22%3A%5B6%5D%7D&ns6=1

Details

	Subject	Repo	Branch	Lines +/-
	WikiTextStructure: Add an exclusion from WikibaseMediaInfo	mediawiki/core	master	+3 -1


Related Objects

Mentioned Here: T213994: A Commons search user should be able to search for only captions
T190066: Expose all slots to the search interface

Event Timeline

Herzi.Pinki created this task. Jan 12 2019, 9:51 PM

Restricted Application added a subscriber: Aklapper. Jan 12 2019, 9:51 PM

Jdforrester-WMF renamed this task from "Wikimedia Search finds Structured Data help text unexpectedly (Add a one-line explanation of what this file represents)" to "Placeholder text prompt for a caption shouldn't be inserted into the search index". Jan 12 2019, 9:55 PM

Jdforrester-WMF edited projects, added Structured Data Engineering, CirrusSearch; removed Search-Platform-Programs.

Restricted Application added a project: Discovery-Search. Jan 12 2019, 9:55 PM

Good spot.

To be excluded they need to match one of these selectors, or we need to update the selector list: https://github.com/wikimedia/mediawiki/blob/master/includes/content/WikiTextStructure.php#L29
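As a rough illustration of what selector-based exclusion does, here is a minimal Python sketch that collects indexable text from rendered HTML while skipping any element carrying an excluded class. This is not the code in WikiTextStructure.php; the class name `placeholder-prompt` and the helper names are hypothetical, chosen only for the example.

```python
from html.parser import HTMLParser

# Hypothetical class name, for illustration only; the real selector
# list lives in WikiTextStructure.php.
EXCLUDED_CLASSES = {"placeholder-prompt"}

# Elements without a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IndexableText(HTMLParser):
    """Collect text, skipping every element whose class is excluded."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside an excluded element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        # Once inside an excluded element, track nesting of all children.
        if self.skip_depth or classes & EXCLUDED_CLASSES:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def indexable_text(html):
    """Return the text that would be sent to the search index."""
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

With this approach the placeholder prompt never reaches the index as long as its wrapper element matches one of the excluded classes, which is why the alternatives are to make the element match an existing selector or to extend the list.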

EBjune triaged this task as Medium priority. Jan 17 2019, 6:08 PM

Thanks!

Hmm, this feels like it's an artefact of our work-around to T190066.

Change 485077 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[mediawiki/core@master] WikiTextStructure: Add an exclusion from WikibaseMediaInfo

https://gerrit.wikimedia.org/r/485077

gerritbot added a project: Patch-For-Review. Jan 17 2019, 6:32 PM

Jdforrester-WMF moved this task from To Do to Doing on the Structured Data Engineering board. Jan 17 2019, 6:32 PM

Once the patch is deployed, there are a few options for removing the already indexed content:

As the pages get edited, they will be reindexed and the extra strings will be removed.
There is an automated process that reindexes 1/8th of the pages every week. After 8 weeks, all pages will have had their indexed content regenerated without anyone doing anything.
If we can come up with a list of page IDs, it can be given to the Cirrus reindexing script, which will also do the trick. Unfortunately, we are currently not able to collect this list from the Elasticsearch cluster (T213994).
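The second option can be put into numbers with a small Python sketch. It assumes an idealised rolling reindex in which each page is visited exactly once per cycle and ignores pages that are refreshed earlier by edits; the function name is hypothetical.

```python
from fractions import Fraction


def stale_fraction(weeks_elapsed, cycle_weeks=8):
    """Fraction of pages still carrying old index content after N weeks.

    Assumes 1/cycle_weeks of all pages is reindexed each week and that
    no page is visited twice within a cycle.
    """
    done = min(weeks_elapsed, cycle_weeks)
    return Fraction(cycle_weeks - done, cycle_weeks)
```

Under these assumptions, `stale_fraction(2)` is 3/4 and `stale_fraction(8)` is 0, so the number of unwanted matches should shrink roughly linearly until the cycle completes.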

Will get deployed next week. Search index may take a little time to update.

In T213638#4889956, @EBernhardson wrote:

Once the patch is deployed, there are a few options for removing the already indexed content:

As the pages get edited, they will be reindexed and the extra strings will be removed.

There is an automated process that reindexes 1/8th of the pages every week. After 8 weeks, all pages will have had their indexed content regenerated without anyone doing anything.

If we can come up with a list of page IDs, it can be given to the Cirrus reindexing script, which will also do the trick. Unfortunately, we are currently not able to collect this list from the Elasticsearch cluster (T213994).

I think it's probably sufficient to wait.

Change 485077 merged by jenkins-bot:
[mediawiki/core@master] WikiTextStructure: Add an exclusion from WikibaseMediaInfo

https://gerrit.wikimedia.org/r/485077

EBjune moved this task from needs triage to watching / waiting on the Discovery-Search board. Jan 24 2019, 6:11 PM

Update as of 2019-02-06: Numbers of results returned are trending downwards, but aren't at 0 yet; the full cycle of eight weeks will be over on 2019-03-20, so will re-check around then.
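The re-check date follows from simple date arithmetic. The sketch below works backwards from the stated end date to the implied start of the eight-week cycle; the start date is derived here, not stated in the thread, and the names are hypothetical.

```python
from datetime import date, timedelta

CYCLE = timedelta(weeks=8)

cycle_end = date(2019, 3, 20)    # end date given in the update
cycle_start = cycle_end - CYCLE  # implied start, not stated in the thread


def recheck_date(start, cycle=CYCLE):
    """Earliest date by which one full reindex cycle has completed."""
    return start + cycle
```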

In T213638#4932263, @Jdforrester-WMF wrote:

Update as of 2019-02-06: Numbers of results returned are trending downwards, but aren't at 0 yet; the full cycle of eight weeks will be over on 2019-03-20, so will re-check around then.

Noted. I've set a calendar reminder ;)

Ramsey-WMF moved this task from Verify on Production to Monitoring on the Structured Data Engineering board. Feb 6 2019, 9:32 PM

greg added a project: Multimedia. Mar 7 2019, 10:59 PM

Ramsey-WMF moved this task from Untriaged to Tracking on the Multimedia board. Mar 8 2019, 2:53 AM

As of today we've still got about 9800 files remaining with this problem. Do we need to do anything else?

Well, the first one I found actually is that text – the user must have copy-pasted it in as the value: https://commons.wikimedia.org/w/index.php?title=File:Haemocytometer_side-ml.svg&diff=341229381&oldid=341229378&diffmode=source

The process that reindexes all pages was initially broken. It was fixed in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/CirrusSearch/+/488967/, which was merged on Feb 11, giving a rough deploy date of Feb 14. As long as the number continues declining (9870 results as of today, March 26th), we should simply keep waiting. A full loop over the large wikis takes 8 to 10 weeks.

In T213638#5059697, @Jdforrester-WMF wrote:

Well, the first one I found actually is that text – the user must have copy-pasted it in as the value: https://commons.wikimedia.org/w/index.php?title=File:Haemocytometer_side-ml.svg&diff=341229381&oldid=341229378&diffmode=source

Taking a quick guess with insource:"Add a one-line explanation of what this file represents" it seems only two pages managed to copy the actual content.

There were fewer than 10k results, which is the number I can pull in a single search, so I pulled them and gave the IDs to the reindexer script. It seems we still have two pages that have the value copied into them, and two pages that somehow aren't fixed. I can look into it if it's important.
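For readers who want to reproduce the "pull the results and collect the IDs" step, here is a minimal Python sketch using the public MediaWiki search API (`action=query&list=search`). It is not the method actually used in this task, which the thread does not describe. The function names, the User-Agent string, and the batch size are assumptions, and how the IDs are then handed to the reindexer is left open because the script's arguments are not given here.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"
PHRASE = '"Add a one-line explanation of what this file represents"'


def fetch_json(params):
    """Perform one GET request against the API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "placeholder-caption-check/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def collect_page_ids(fetch=fetch_json, query=PHRASE, namespace=6, batch=50):
    """Follow the API's continuation tokens and gather matching page IDs."""
    params = {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": query,
        "srnamespace": namespace,  # 6 is the File namespace
        "srlimit": batch,
    }
    page_ids = []
    while True:
        reply = fetch(params)
        page_ids.extend(hit["pageid"] for hit in reply["query"]["search"])
        if "continue" not in reply:
            return page_ids
        # Merge the continuation values into the next request.
        params = {**params, **reply["continue"]}
```

The `fetch` parameter is injectable so the paging logic can be exercised without network access. Search paging is capped at roughly 10,000 results, which matches the limit mentioned above, so this only yields a complete list once the result count has dropped below that cap.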

This now looks entirely fixed. Thank you, Erik!

Tested and fixed as of Erik's last update. Thanks!
