
"somevalue" SDOC statements not visible in search index
Open, Medium, Public

Description

In these edits https://commons.wikimedia.org/w/index.php?title=File%3AChiesa_di_San_Francesco_-_Trevi_21.jpg&type=revision&diff=371019235&oldid=366733414 I added some structured information to a file. The result is visible at https://commons.wikimedia.org/w/api.php?action=wbgetentities&ids=M82275323 . For the creator I used "somevalue", and that doesn't show up in the search index; see https://commons.wikimedia.org/w/index.php?title=File:Chiesa_di_San_Francesco_-_Trevi_21.jpg&action=cirrusdump :

statement_keywords	
0	"P6216=Q50423863"
1	"P275=Q18199165"

When I look at https://commons.wikimedia.org/w/index.php?title=File:Betsey_Johnson_dress_other_cardigan.jpg&action=cirrusdump I see that qualifiers are supported:

statement_keywords	
0	"P180=Q467"
1	"P180=Q467[P3828=Q200539]"
2	"P180=Q467[P3828=Q877140]"
3	"P180=Q467[P3828=Q37501]"
4	"P180=Q11442"

For the example file I would expect something like

statement_keywords	
0	"P170=somevalue"
1	"P170=somevalue[P3831=Q33231]"
2	"P170=somevalue[P3831=Q33231]"
3	"P170=somevalue[P2093=Diego Baglieri]"
(etc.)

So please modify the search to also index these. You probably want to tackle novalue while you're at it. I'm not sure how to distinguish a literal string value such as "somevalue" from the somevalue and novalue keywords.
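To make the request concrete, here is a minimal Python sketch of how keywords in the format shown above could be derived from statement JSON as returned by wbgetentities. It is not the WikibaseCirrusSearch implementation: the function names `snak_token` and `statement_keywords` are hypothetical, and the sketch assumes only item and string values need handling.

```python
def snak_token(snak):
    """Return the value part of a keyword for one snak, or None if unsupported."""
    snaktype = snak["snaktype"]
    if snaktype in ("somevalue", "novalue"):
        # Assumption: emit the snak type itself as the value token.
        return snaktype
    datavalue = snak["datavalue"]
    if datavalue["type"] == "wikibase-entityid":
        return datavalue["value"]["id"]
    if datavalue["type"] == "string":
        return datavalue["value"]
    return None  # other datatypes are ignored in this sketch


def statement_keywords(statements):
    """Build 'P=V' and 'P=V[QP=QV]' keywords, without duplicates, in input order."""
    keywords = []
    for prop, stmts in statements.items():
        for stmt in stmts:
            main = snak_token(stmt["mainsnak"])
            if main is None:
                continue
            base = f"{prop}={main}"
            candidates = [base]
            for qprop, qsnaks in stmt.get("qualifiers", {}).items():
                for qsnak in qsnaks:
                    qval = snak_token(qsnak)
                    if qval is not None:
                        candidates.append(f"{base}[{qprop}={qval}]")
            for keyword in candidates:
                if keyword not in keywords:
                    keywords.append(keyword)
    return keywords


# Hand-written example modelled on the creator statement described above.
example = {
    "P170": [{
        "mainsnak": {"snaktype": "somevalue", "property": "P170"},
        "qualifiers": {
            "P3831": [{"snaktype": "value", "property": "P3831",
                       "datavalue": {"type": "wikibase-entityid",
                                     "value": {"entity-type": "item", "id": "Q33231"}}}],
            "P2093": [{"snaktype": "value", "property": "P2093",
                       "datavalue": {"type": "string", "value": "Diego Baglieri"}}],
        },
    }]
}

for keyword in statement_keywords(example):
    print(keyword)
```

Note that in this sketch a string-typed property whose literal value is "somevalue" would produce the same token as the keyword, which is exactly the ambiguity raised above; a real implementation would need some way to tell them apart.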

I see that Wikidata has the same problem (for example https://www.wikidata.org/w/index.php?title=Q29569412&action=cirrusdump ), but on Wikidata it's less pressing because we mainly use SPARQL.

Event Timeline

Aklapper renamed this task from "somevalue" SDOC statements not visisble in search index to "somevalue" SDOC statements not visible in search index. Oct 19 2019, 1:18 PM

This is an intentional limitation. There is a configuration sent to the builder:

  • searchIndexProperties: List of property IDs to index
  • searchIndexTypes: List of property types to index. Properties of these types will be indexed regardless of $propertyIds
  • searchIndexPropertiesExclude: List of property IDs to exclude

At a high level, we cannot simply index all of the Wikidata statements with qualifiers into Elasticsearch. Each unique token is treated as a word and must be held in multiple in-memory data structures across the cluster.

It looks like the current values for commonswiki are:

searchIndexProperties: P180 (depicts)
searchIndexTypes: string, external-id, wikibase-item, wikibase-property, wikibase-lexeme, wikibase-form, wikibase-sense
searchIndexPropertiesExclude: nothing

Can we adjust these to cover most use cases while cutting off the long tail of unique values?
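As an illustration of how the three settings interact, here is a small Python sketch of the decision for a single property. It is not the actual builder code: the function name `should_index` is hypothetical, and letting the exclude list take precedence over the other two settings is an assumption.

```python
def should_index(property_id, property_type,
                 index_properties, index_types, exclude_properties):
    """Decide whether statements for a property are indexed.

    Assumption: the exclude list wins over the other two settings.
    """
    if property_id in exclude_properties:
        return False
    # A property is indexed if it is listed explicitly, or if its type
    # is listed (regardless of the explicit property list).
    return property_id in index_properties or property_type in index_types


# Values quoted above for commonswiki.
INDEX_PROPERTIES = {"P180"}
INDEX_TYPES = {
    "string", "external-id", "wikibase-item", "wikibase-property",
    "wikibase-lexeme", "wikibase-form", "wikibase-sense",
}
EXCLUDE_PROPERTIES = set()

# P170 (creator) is assumed here to be an item-typed property.
print(should_index("P170", "wikibase-item",
                   INDEX_PROPERTIES, INDEX_TYPES, EXCLUDE_PROPERTIES))
```

Under these assumptions, cutting off the long tail would mean either narrowing the type list or adding noisy properties to the exclude list.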

The focus is Structured Data on Commons; Wikidata is for bonus points. Can you please add a link to the configuration for both Wikidata and Commons? I wouldn't have a clue where to find that. Thanks.

These are the properties for commonswiki (SDoC). They are set globally in SearchSettingsForWikibase.php[1]. Two of those vary per-wiki, and are set from InitializeSettings.php[2].

[1] https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/96648448e9aa3ae270d0cb311bbaa343c5b0b2d0/wmf-config/SearchSettingsForWikibase.php#63
[2] https://codesearch.wmflabs.org/operations/?q=wmgWikibaseSearchIndexProperties&i=nope&files=&repos=

@dcausse can you take a look at this one? thanks!

dcausse moved this task from needs triage to Wikibase Search on the Discovery-Search board.

Indexing somevalue and novalue will require a change in WikibaseCirrusSearch, as they are explicitly excluded no matter what is configured.
Once the extension is fixed, we could update the config to add somevalue and novalue to the repo setting searchIndexTypes.
I also share the concern Erik had (try to avoid adding plenty of unique values), but I don't see an obvious way to evaluate the impact of this kind of change beforehand.
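One very rough proxy, sketched below in Python, is to count how many distinct keywords a sample of documents would contribute and how many of those occur only once. This does not measure memory use on the cluster; the function name `token_stats` and the sample data are hypothetical, and a representative sample of statement_keywords lists would have to be obtained separately.

```python
from collections import Counter


def token_stats(keyword_lists):
    """Count distinct keywords and keywords seen in only one document."""
    # Count each keyword once per document (document frequency).
    counts = Counter(kw for kws in keyword_lists for kw in set(kws))
    singletons = sum(1 for c in counts.values() if c == 1)
    return {"distinct": len(counts), "singletons": singletons}


# Hypothetical sample: one statement_keywords list per document.
sample = [
    ["P180=Q467", "P180=Q467[P3828=Q200539]"],
    ["P180=Q467", "P180=Q11442"],
]

print(token_stats(sample))
```

A high share of singletons in such a sample would hint at a long tail of unique values, though it says nothing definite about the actual cost.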

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and T295729).

Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.