
Text content of wiki page in search index can merge words making them unfindable.
Closed, Resolved · Public

Description

If a redlink is the last entry in a ''paragraph'' on German Wiktionary, the text of the following template is concatenated onto the redlink's text, and the combination is stored as a single word in the full-text search index. As a consequence, the entry is not found by a full-text search for the redlink's text. For example, entering "Transportmedium" in the search field does not find the entry "Träger". Entering "TransportmediumUnterbegriffe" or "BadeanzugträgerBeispiele" does find it, as do "insource:/Transportmedium/" and "insource:/Badeanzugträger/".
See also: https://de.wiktionary.org/wiki/Wiktionary:Fragen_zum_Wiktionary#Wiktionary-Suche

Event Timeline

Restricted Application added a subscriber: Aklapper. · May 23 2018, 4:22 PM
Restricted Application added projects: Discovery, Discovery-Search. · May 25 2018, 5:44 PM
EBjune triaged this task as Medium priority. · May 31 2018, 5:12 PM

This looks like a problem in the step that converts the wikitext parser's html output into plain text. Will need to look a bit closer.

Good search: https://de.wiktionary.org/w/index.php?search=insource%3A%2FBadeanzugtr%C3%A4ger%2F&fulltext=1
Bad search: https://de.wiktionary.org/w/index.php?search=Badeanzugtr%C3%A4ger&fulltext=1

Example page: https://de.wiktionary.org/wiki/Tr%C3%A4ger?action=cirrusdump
Content of the 'text' field:

BadeanzugträgerBeispiele

Wikitext in that area:

:[6] [[Bildträger]], [[Datenträger]], [[Instrumententräger]], [[Objektträger]], [[Querträger]],  [[Siebträger]], [[Schriftträger]], [[Tonträger]], [[Überweisungsträger]], [[Unterträger]]
:[1, 2] [[Badeanzugträger]]

{{Beispiele}}
:[1] Morgens unterhalb des Mount Everest: „Ruf doch mal den ''Träger!''“
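
To illustrate why the plain search misses the page while the insource: regex search finds it, here is a minimal Python sketch. It assumes a simplified word tokenizer and is not the analyzer the search backend actually uses; `indexed_text`, `tokens`, `word_search` and `regex_search` are hypothetical names, and the sample string is only an illustrative fragment built around the merged word shown above.

```python
import re

# Illustrative fragment of the indexed 'text' field. Only the merged word
# "BadeanzugträgerBeispiele" is taken from the cirrusdump output above;
# the surrounding words are an assumption for the sake of the example.
indexed_text = "[1, 2] BadeanzugträgerBeispiele [1] Morgens unterhalb des Mount Everest"


def tokens(text):
    """Simplified tokenizer (assumption): lowercase runs of word characters."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def word_search(query, text):
    """Whole-word lookup: the query must equal one of the tokens."""
    return query.lower() in tokens(text)


def regex_search(pattern, text):
    """Substring-style regex lookup, loosely analogous to insource:/.../."""
    return re.search(pattern, text) is not None


found_by_word = word_search("Badeanzugträger", indexed_text)
found_by_merged_word = word_search("BadeanzugträgerBeispiele", indexed_text)
found_by_regex = regex_search("Badeanzugträger", indexed_text)
```

Under these assumptions only the merged word and the regex match, which mirrors the good and bad searches linked above.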
EBernhardson renamed this task from Index for Special:Search is broken to Text content of wiki page in search index can merge words making them unfindable.. · May 31 2018, 5:16 PM
EBjune moved this task from needs triage to Up Next on the Discovery-Search board. · May 31 2018, 5:17 PM
Cirdan added a subscriber: Cirdan. · Jun 2 2018, 7:31 AM
Vvjjkkii renamed this task from Text content of wiki page in search index can merge words making them unfindable. to ufcaaaaaaa. · Jul 1 2018, 1:08 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from ufcaaaaaaa to Text content of wiki page in search index can merge words making them unfindable.. · Jul 2 2018, 1:55 PM
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.
EBernhardson added a comment. · Edited · Sep 10 2018, 11:12 PM

The explanation boils down to this:

<dl>
  <dd>foo</dd>
  <dd>bar</dd>
</dl> 
<p>baz</p>

When passed through HtmlFormatter::filterContent along with Sanitizer::stripAllTags, as is done to extract the content for the search index, the above returns foobarbaz, with no delimiters. I'm not sure what a real solution to this problem is, but a hack already exists that adds spaces before <br> tags. My suggestion would be to expand it to a few other tags that imply whitespace between the content of that tag and the preceding content.

Adding <dd> and <p> to that list should do the trick, although I have been unable to reproduce this locally and had to use mwrepl on mwdebug1001. I will try to figure out how to apply the change on mwdebug to verify it before sending a patch to Gerrit.
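
The idea can be sketched in Python as follows. This is only an illustration of the approach described above, not the MediaWiki code or the merged patch: `strip_all_tags`, `add_space_before_tags`, `extract_text` and `SPACE_BEFORE_TAGS` are hypothetical names, the tag stripper is a naive regex, and the sample markup is assumed to arrive without whitespace between tags.

```python
import re
from html import unescape

# Hypothetical list of tags that imply whitespace before their content.
SPACE_BEFORE_TAGS = ("br", "dd", "p")

# The example markup, assumed here to contain no whitespace between tags.
SAMPLE = "<dl><dd>foo</dd><dd>bar</dd></dl><p>baz</p>"


def strip_all_tags(html):
    """Naive stand-in for a tag stripper: removes tags, inserts nothing."""
    return unescape(re.sub(r"<[^>]*>", "", html))


def add_space_before_tags(html, tags=SPACE_BEFORE_TAGS):
    """Insert a space in front of each opening tag named in `tags`."""
    pattern = re.compile(r"<(?:%s)\b" % "|".join(tags), re.IGNORECASE)
    return pattern.sub(lambda m: " " + m.group(0), html)


def extract_text(html, preserve_whitespace):
    """Reduce HTML to plain text, optionally keeping word boundaries."""
    if preserve_whitespace:
        html = add_space_before_tags(html)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(strip_all_tags(html).split())


merged = extract_text(SAMPLE, preserve_whitespace=False)     # "foobarbaz"
separated = extract_text(SAMPLE, preserve_whitespace=True)   # "foo bar baz"
```

Inserting the space before stripping keeps the change local to a single preprocessing step, in the same spirit as the existing <br> hack.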

Change 459657 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/core@master] Preserve whitespace in search index text content

https://gerrit.wikimedia.org/r/459657

Change 459657 merged by jenkins-bot:
[mediawiki/core@master] Preserve whitespace in search index text content

https://gerrit.wikimedia.org/r/459657

debt closed this task as Resolved. · Oct 5 2018, 3:59 PM