[Epic] Index Wikidata labels and descriptions as separate fields in ElasticSearch
Closed, Resolved · Public

Description

This task is for adding labels & descriptions into ElasticSearch and enabling prefix search to use them. Current progress plan:

  • Implement the code for creating & indexing label fields
  • Reindex testwiki and check that index looks sane
  • Reindex wikidata and check that index looks sane (T162292)
  • Add code that allows wbsearchentities to use search engine depending on query flag
  • Set up a test page comparing the two searches and make an announcement on the list to gather user feedback.
  • Collect feedback and bikeshed about search profiles, weights and result ranking, hopefully arriving at a workable weights profile. (T172467)
  • Set up the config above in production and enable CirrusSearch on wbsearchentities by default (T175741)
  • Discuss & resolve question of how to display entity & title search together
  • Refactor the code further so that opensearch and other code using completionSearch() can use the same code as wbsearchentities, in service of the results of the discussion above.
  • Implement the GUI part of the two items above.
  • Figure out how to properly index descriptions (T176903)
  • Make code to allow fulltext search to use entity search when appropriate (T178851).
  • Enable Cirrus searching for Special:Search

More detailed plan: https://www.wikidata.org/wiki/User:Smalyshev_(WMF)/Wikidata_search

Event Timeline

Change 336911 had a related patch set uploaded (by Smalyshev):
[WIP] Add indexing for descriptions

https://gerrit.wikimedia.org/r/336911

Change 334194 had a related patch set uploaded (by Smalyshev):
[mediawiki/extensions/Wikibase] Create mappings for Wikibase labels

https://gerrit.wikimedia.org/r/334194

Change 334019 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Generalize field definitions for Items and Properties

https://gerrit.wikimedia.org/r/334019

Change 334194 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Create mappings for Wikibase labels

https://gerrit.wikimedia.org/r/334194

Change 335728 had a related patch set uploaded (by Smalyshev):
[mediawiki/extensions/Wikibase@master] Add prefix search for labels

https://gerrit.wikimedia.org/r/335728

Issues found while testing on http://elastic-wikidata.wmflabs.org:

  1. A minor issue: when an alias is matched, it is correctly shown in addition to the preferred label. However, if a label in a fallback language is matched (e.g. a search in de-ch matches the en label), that matched label is not shown in addition to the preferred (perhaps de) label. Example: on http://elastic-wikidata.wmflabs.org/index.php/ElasticRepo:Main_Page?uselang=de-ch, enter "H." into the quick search box. This will suggest Q5628592, because the English label is "H. T. Kung", but it does not show that "H. T. Kung" was matched.
  2. Another issue Katie found when playing with elastic-wikidata.wmflabs.org: We'll need a way to boost exact matches vs. partial matches, or a way to boost label matches vs alias matches. Ideally, we would be able to tweak both. Example: "Poppy" brings up George H.W. Bush as the first match.

Regarding issue (2): if I read EntitySearchHelper correctly, the old system does not boost labels vs. aliases, it treats them exactly the same. It does however strongly prefer exact matches over partial (prefix) matches.
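
To make the two knobs from issue (2) concrete, here is a minimal sketch in Python of a scorer that weights exact vs. prefix matches and label vs. alias matches independently. The weights, the `Term` type, and the `score_match` and `rank` functions are all hypothetical and chosen for illustration; they are not the values or code used by Wikibase or CirrusSearch, where ranking happens inside the Elasticsearch query.

```python
from dataclasses import dataclass

# Hypothetical weights for illustration only; not Wikibase/CirrusSearch values.
EXACT_WEIGHT = 2.0   # query equals the whole term
PREFIX_WEIGHT = 1.0  # query is only a prefix of the term
LABEL_WEIGHT = 1.0
ALIAS_WEIGHT = 0.5


@dataclass
class Term:
    text: str
    is_label: bool  # True for a label, False for an alias


def score_match(query: str, term: Term) -> float:
    """Return a score for `term` against `query`, or 0.0 if it does not match."""
    q = query.casefold()
    t = term.text.casefold()
    if t == q:
        kind = EXACT_WEIGHT
    elif t.startswith(q):
        kind = PREFIX_WEIGHT
    else:
        return 0.0
    field = LABEL_WEIGHT if term.is_label else ALIAS_WEIGHT
    return kind * field


def rank(query, entities):
    """Rank entity IDs by their best-scoring term, dropping non-matches."""
    scored = []
    for entity_id, terms in entities.items():
        best = max((score_match(query, t) for t in terms), default=0.0)
        if best > 0.0:
            scored.append((best, entity_id))
    # Highest score first; ties are broken by ID so the output is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [entity_id for _, entity_id in scored]


# Placeholder entity IDs, not real Wikidata items.
entities = {
    "A": [Term("George H. W. Bush", True), Term("Poppy", False)],
    "B": [Term("Poppy", True)],
    "C": [Term("Poppy seed", True)],
}
ranking = rank("Poppy", entities)
```

With these made-up weights, the entity whose label is exactly "Poppy" ranks above the one that only has "Poppy" as an alias. Setting all four weights equal would make the three entities tie, so the order would fall back to the ID tie-break.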

Change 335728 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add prefix search for labels

https://gerrit.wikimedia.org/r/335728

Change 336911 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add descriptions to ES index

https://gerrit.wikimedia.org/r/336911

I checked the problem in 1. and it looks like the issue is that the match, at least as reported by Elastic, is against labels.de.prefix: "Hsiang-Tsung Kung". This is what ES actually matched: it prefers the de fallback with an inexact prefix to the en fallback with an exact one. We could play with the weights to make it prefer the more exact match, either by making the discount for using a fallback smaller or by making the preference for an exact match bigger. We would probably need help from @dcausse on this, e.g. this query:

http://elastic-wikidata.wmflabs.org/api.php?action=wbsearchentities&search=H.%20T.&format=json&language=de-ch&uselang=de-ch&type=item&cirrusDumpQuery=true

produces this result:

http://elastic-wikidata.wmflabs.org/api.php?action=wbsearchentities&search=H.%20T.&format=json&language=de-ch&uselang=de-ch&type=item&cirrusDumpResult=true

and I'm not sure why exactly.
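
The trade-off described above (a smaller fallback discount vs. a bigger exact-match preference) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `fallback_score` function, the multiplicative scoring model, the numeric weights, and the fallback chain de-ch → de → en. It is not how CirrusSearch actually computes scores.

```python
def fallback_score(is_exact, fallback_position, exact_boost, fallback_discount):
    """Score a label match somewhere in a language fallback chain.

    fallback_position is 0 for the requested language, 1 for the first
    fallback, and so on. Each fallback step multiplies the score by
    fallback_discount; an exact match multiplies it by exact_boost.
    Hypothetical model, for illustration only.
    """
    base = exact_boost if is_exact else 1.0
    return base * (fallback_discount ** fallback_position)


def winner(exact_boost, fallback_discount):
    """Compare an inexact de match (position 1) with an exact en match (position 2)."""
    de = fallback_score(False, 1, exact_boost, fallback_discount)
    en = fallback_score(True, 2, exact_boost, fallback_discount)
    return "en" if en > de else "de"


# Steep discount, modest boost: the inexact de match wins (0.5 vs. 0.375).
steep = winner(exact_boost=1.5, fallback_discount=0.5)
# Same discount, bigger exact-match preference: en wins (0.75 vs. 0.5).
bigger_boost = winner(exact_boost=3.0, fallback_discount=0.5)
# Same boost, smaller discount per fallback step: en also wins.
smaller_discount = winner(exact_boost=1.5, fallback_discount=0.8)
```

Either adjustment flips the result in this toy model, which is why both are candidates for tuning.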

The query seems to use a nonexistent field, labels_all.near_match; changing it to labels_all.near_match_folded will fix the highlighting issue.

Thanks for the fix @Smalyshev!

@Ladsgroup can you update the elastic test site so it uses the latest master branch, so we can test this? Thanks!

@daniel: Done. I also imported around 500 events (chemical elements are too heavy to import)

Change 366788 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/mediawiki-config@master] Enable Cirrus search of wbsearchentities when using useCirrus=1

https://gerrit.wikimedia.org/r/366788

Change 366788 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable Cirrus search of wbsearchentities when using useCirrus=1

https://gerrit.wikimedia.org/r/366788

Mentioned in SAL (#wikimedia-operations) [2017-07-24T18:16:40Z] <reedy@tin> Synchronized wmf-config/Wikibase.php: T125500 (duration: 00m 43s)

@Lea_Lacroix_WMDE @Lydia_Pintscher: this can now be tested live using the (undocumented) useCirrus parameter. Compare:

This doesn't work with the API sandbox, because useCirrus isn't an official parameter of the API module. Perhaps that should be fixed.

Anyway, we can now run a few queries, put the results on wiki pages, and discuss them with the community. Making such pages from the API responses is a "fun" job. Any volunteers?

I plan to make a page that lets you search with both and compare the results. We did something like that with the completion suggester, and I think it shouldn't be too hard to modify it.

I've made a page for search comparison: http://elastic-wikidata.wmflabs.org/wb.html
(despite it being hosted on elastic-wikidata, the data comes from www.wikidata.org).

Note that descriptions for the Elastic search are currently broken due to T162292; as soon as that is done, descriptions should be fine.
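
For anyone who wants to compare the two result lists outside the comparison page, a minimal sketch of the diffing step might look like the following. The `compare_rankings` function and the placeholder IDs are hypothetical; the sketch assumes you have already extracted the ranked entity IDs from each API response.

```python
def compare_rankings(old_ids, new_ids):
    """Summarize how two ranked lists of entity IDs differ.

    Returns the IDs found in only one list and, for IDs present in both,
    their (old, new) zero-based positions when these differ.
    """
    old_rank = {eid: i for i, eid in enumerate(old_ids)}
    new_rank = {eid: i for i, eid in enumerate(new_ids)}
    return {
        "only_old": [e for e in old_ids if e not in new_rank],
        "only_new": [e for e in new_ids if e not in old_rank],
        "moved": {
            e: (old_rank[e], new_rank[e])
            for e in old_ids
            if e in new_rank and old_rank[e] != new_rank[e]
        },
    }


# Placeholder IDs, not real search results.
diff = compare_rankings(["A", "B", "C"], ["B", "A", "D"])
```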

Thanks! I've run some first tests and I think improvements are still needed for abbreviations and disambiguation pages.

I'm keeping track of my experiences on https://www.wikidata.org/wiki/User:Sjoerddebruin/Cirrus now. It seems that not all exact matches are always shown.

debt renamed this task from "Index Wikidata labels and descriptions as separate fields in ElasticSearch" to "[Epic] Index Wikidata labels and descriptions as separate fields in ElasticSearch". Aug 1 2017, 5:19 PM
debt added a project: Epic.

I've moved the "combined search" part of this ticket to T190454, and I think the rest of it is done, so I'll resolve it soon unless there are objections.

I am resolving this, as the "combined search" part now has its own task.

Change 287161 abandoned by Aude:
Introduce FieldDefinitions for search

https://gerrit.wikimedia.org/r/287161