
Investigate CirrusSearch as a Suggested Edits suggestions backend
Closed, Resolved, Public

Description

The Android app's Suggested Edits feature requires an API that returns Commons image files (as used on Wikipedias) meeting either of the following criteria:

  1. the image lacks a Wikibase caption in a specified language;
  2. the image has a Wikibase caption in language A but not language B.

The WikibaseCirrusSearch extension provides for searching the content of file captions (labels) while filtering by language. Searching for files based on the existence or nonexistence of labels in specified languages seems like a short step (in the easier direction) from there.

Questions

  1. Can WikibaseCirrusSearch easily be updated to support the above queries?
  2. Does CirrusSearch support returning random results rather than according to some particular ordering?
  3. Does CirrusSearch currently support searching for Commons files by usage on other wikis, or could it feasibly be updated to do so?

Tagging @dcausse @EBernhardson @Smalyshev as the experts here.

If this is viable, the existing description suggestions API should probably be moved over to use it as well.


Event Timeline

  1. Can WikibaseCirrusSearch easily be updated to support the above queries?

a. the image lacks a Wikibase caption in a specified language;

This is doable with a bool must_not and an exists query. Basically this is supported by the elasticsearch backend and our existing indexing pipeline, but the specific query would have to be written. Example:

{
    "bool": {
        "must_not": [
            {"exists": {"field": "labels.en.plain"}}
        ],
        "filter":[
            {"match": {"namespace": 6}}
        ]
    }
}

b. the image has a Wikibase caption in language A but not language B.

Basically the same as above; the only difference is using multiple exists queries and putting each on the correct side of the bool.
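
Sketching that out (a hypothetical query following the pattern above, assuming the same labels.<lang>.plain field layout; here "has an English caption but no German one"):

{
    "bool": {
        "must": [
            {"exists": {"field": "labels.en.plain"}}
        ],
        "must_not": [
            {"exists": {"field": "labels.de.plain"}}
        ],
        "filter": [
            {"match": {"namespace": 6}}
        ]
    }
}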

  2. Does CirrusSearch support returning random results rather than according to some particular ordering?
  • The most efficient way to get a "random" ordering is the index order. This is very efficient because you get the first N results found, and we stop looking after finding N results. No ordering is applied. This top N will change depending on new/updated document indexing, background segment merging, which shard answers the queries, etc. This is already available as the just_match sort order.
  • I would have to evaluate the performance, but it is possible to sort by random numbers generated per query. Generating millions of random numbers and sorting by them is likely not that efficient, but we can evaluate it.
  • There may be some middle ground, but if either of the above work we should probably stick to the simplest methods.
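
For context, the per-query random option above maps onto elasticsearch's standard random_score function (this is a generic sketch of that mechanism, not the specific CirrusSearch implementation):

{
    "function_score": {
        "query": {"match_all": {}},
        "random_score": {
            "seed": 12345,
            "field": "_seq_no"
        },
        "boost_mode": "replace"
    }
}

Supplying a seed (and a field to mix it with, required in recent elasticsearch versions) makes the ordering reproducible for a given request while still varying between seeds.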
  3. Does CirrusSearch currently support searching for Commons files by usage on other wikis, or could it feasibly be updated to do so?

We currently track in the commonswiki media index which wikis have local files with the same name, but we don't know anything about which wikis or articles are using the images. The same type of functionality used to track the duplicate file names could plausibly be implemented, but would probably take a bit of work to ensure it works correctly. Once deployed it would take an additional 8-10 weeks to populate.

Thanks very much @EBernhardson for the info. This looks promising! (And sorry for the delayed response; I created this task immediately before going on vacation :) ).

As a sanity check, inlabel/incaption must not be set up on commonswiki yet, right? Otherwise I would expect https://commons.wikimedia.org/w/api.php?action=query&format=json&list=search&srsearch=inlabel:pluto@en&srnamespace=6 to return at least https://commons.wikimedia.org/wiki/File:Pluto-01_Stern_03_Pluto_Color_TXT.jpg as a result, rather than nothing. (Similar searches are working as expected on Commons Beta.)

Change 506835 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/WikibaseCirrusSearch@master] WIP: Add TermExistsFeature

https://gerrit.wikimedia.org/r/506835

I took a first stab at implementing this. Unfortunately, an error is thrown when I attempt to test any CirrusSearch functionality in vagrant:

For example, when requesting http://wikidata.wiki.local.wmftest.net:8080/w/api.php?action=query&formatversion=2&list=search&srsearch=inlabel:foo@en:

{
    "error": {
        "code": "cirrussearch-backend-error",
        "info": "We could not complete your search due to a temporary problem. Please try again later.",
        "docref": "See http://commons.wiki.local.wmftest.net:8080/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes."
    }
}

There seems to be some problem with my indices:

vagrant@cleanvagrant:/vagrant/mediawiki$ mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --wiki=wikidatawiki
indexing namespaces...
	Indexing namespaces...done
content index...
	Fetching Elasticsearch version...6.5.4...ok
	Scanning available plugins...
		analysis-hebrew, analysis-icu, analysis-nori, analysis-smartcn, analysis-stconvert
		analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, extra-analysis-esperanto
		extra-analysis-serbian, extra-analysis-slovak, ltr
	Picking analyzer...english
	Inferring index identifier...wikidatawiki_content_first
	Creating index...ok
		Validating number of shards...ok
		Validating replica range...ok
		Validating shard allocation settings...done
		Validating max shards per node...ok
	Validating analyzers...ok
	Validating mappings...
		Validating mapping...different...failed!
Couldn't update existing mappings. You may need to reindex.
Here is elasticsearch's error message: mapper_parsing_exception: analyzer [aa_plain] not found for field [plain]

(The result is the same regardless of the wiki.)

Weird, I just tried to do updateSearchIndexConfig on my vagrant and it worked fine. Which roles do you have enabled?

On the VM with the errors, I have wikibasecirrussearch, uls, and mediainfo applied.

I just set up a clean VM with only wikibasecirrussearch applied, and I seem to have got it working, but only partially; a search for 'foo' yields the expected result, but 'inlabel:foo' does not.

I tested this out today and initially ran into some errors, but they turned out to be related to moving code between the Wikibase and WikibaseCirrusSearch extensions. After updating both to the master branch I'm able to both create and update wikidata indices correctly in vagrant. Could you double check that you have the latest master for both?

@EBernhardson Thanks for checking it out. I seem to have gotten the errors resolved, but now I'm seeing the same behavior on both of my Vagrant copies: given a wikidatawiki with an entity labeled "Foo," search=foo returns the entity, but search=inlabel:foo does not.

Is there some additional setup step I might have missed?

There are essentially two debug calls we use to determine what is happening and where the problem lies:

cirrusDumpQuery: Any full text query can have &cirrusDumpQuery appended to the URL; instead of performing a search, it will dump the raw JSON request that we would send to elasticsearch. For example, https://commons.wikimedia.org/w/index.php?search=inlabel%3Afoo&cirrusDumpQuery

The important part of this response is in the middle and looks like the following (apologies for the formatting; my in-browser JSON pretty printer does bad copy/paste). Having this query against labels_all.plain injected means the keyword was appropriately transformed into an elasticsearch query.

{
 "multi_match": {
  "query": "foo",
  "operator": "and",
  "fields": [
   "labels_all.plain"
  ]
 }
}

action=cirrusDump: Any wiki page can use action=cirrusDump to get the backend representation of that page, as stored in elasticsearch. Open up a page that should have been a result of the query and run that action. For example: https://commons.wikimedia.org/wiki/File:Kate_Foo_Kune.jpg?action=cirrusdump . The important part of this is the labels section; for this page it is:

"labels": {
 "en": [
  "Kate Foo Kune"
 ]
}

Whichever of these two has the wrong (or missing) information will indicate where things went wrong. Or maybe both are missing :)

Both were missing; and then I configured $wgWBCSUseCirrus to true, and lo and behold, both appeared! But for some reason, the inlabel: search is still not returning the expected result.
cirrusdump for item Q2 (Foo): P8461
cirrusDumpQuery for inlabel:foo: P8460

I configured $wgWBCSUseCirrus to true, and lo and behold

Yes, it's false by default now. Maybe we should change it to true now that migration is done?

Both were missing; and then I configured $wgWBCSUseCirrus to true, and lo and behold, both appeared! But for some reason, the inlabel: search is still not returning the expected result.
cirrusdump for item Q2 (Foo): P8461
cirrusDumpQuery for inlabel:foo: P8460

My best guess, since both of these are right, is that the index was created before $wgWBCSUseCirrus was set to true. In that case the updateSearchIndexConfig.php script needs to be re-run for wikidatawiki. This should recreate the index with the custom WBCS mappings, which will allow the query to work correctly.

This can be verified by looking at (incredibly verbose output ahead) the results of the cirrus-mapping-dump API call before and after recreating the index. If the mapping was created without WBCS enabled, it won't contain all of the appropriate label fields. Essentially, elasticsearch will accept any property we send in an update request, but it won't actually do anything with it unless it is also in the mapping, which describes how to handle the field. Example: https://www.wikidata.org/w/api.php?action=cirrus-mapping-dump

I configured $wgWBCSUseCirrus to true, and lo and behold

Yes, it's false by default now. Maybe we should change it to true now that migration is done?

Almost certainly; it made sense for the transition, but now it only defaults the extension to an unexpected state.

I got my indices rebuilt, and it now works! \o/ Thanks again for the assistance, @EBernhardson!

For the record, I ran into a surmountable issue during rebuilding. The build process for both the content and general indices would always appear to fail at this step:

vagrant@cleanvagrant:/vagrant/mediawiki$ mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --wiki=wikidatawiki --reindexAndRemoveOk --indexIdentifier=now
indexing namespaces...
	Indexing namespaces...done
content index...
	Fetching Elasticsearch version...6.5.4...ok
	Scanning available plugins...
		analysis-hebrew, analysis-icu, analysis-nori, analysis-smartcn, analysis-stconvert
		analysis-stempel, analysis-ukrainian, experimental-highlighter, extra, extra-analysis-esperanto
		extra-analysis-serbian, extra-analysis-slovak, ltr
	Picking analyzer...english
	Setting index identifier...wikidatawiki_content_1556720726
	Creating index...ok
		Validating number of shards...
Unexpected Elasticsearch failure.
Elasticsearch failed in an unexpected way. This is always a bug in CirrusSearch.
Error type: Elastica\Exception\ResponseException
Message: index_not_found_exception: no such index
Trace:
#0 /vagrant/mediawiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Request.php(193): Elastica\Transport\Http->exec(Object(Elastica\Request), Array)
#1 /vagrant/mediawiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Client.php(688): Elastica\Request->send()
#2 /vagrant/mediawiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Index.php(559): Elastica\Client->request('wikidatawiki_co...', 'GET', Array, Array)
#3 /vagrant/mediawiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Index/Settings.php(383): Elastica\Index->request('wikidatawiki_co...', 'GET', Array)
#4 /vagrant/mediawiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Index/Settings.php(74): Elastica\Index\Settings->request()
#5 /vagrant/mediawiki/extensions/CirrusSearch/includes/Maintenance/Validators/NumberOfShardsValidator.php(38): Elastica\Index\Settings->get()
#6 /vagrant/mediawiki/extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php(371): CirrusSearch\Maintenance\Validators\NumberOfShardsValidator->validate()
#7 /vagrant/mediawiki/extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php(317): CirrusSearch\Maintenance\UpdateOneSearchIndexConfig->validateIndexSettings()
#8 /vagrant/mediawiki/extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php(266): CirrusSearch\Maintenance\UpdateOneSearchIndexConfig->validateIndex()
#9 /vagrant/mediawiki/extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php(61): CirrusSearch\Maintenance\UpdateOneSearchIndexConfig->execute()
#10 /vagrant/mediawiki/maintenance/doMaintenance.php(96): CirrusSearch\Maintenance\UpdateSearchIndexConfig->execute()
#11 /vagrant/mediawiki/extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php(70): require_once('/vagrant/mediaw...')
#12 /var/www/w/MWScript.php(98): require_once('/vagrant/mediaw...')
#13 {main}

Then I would have to manually delete the old index with, e.g., curl -XDELETE localhost:9200/wikidatawiki_content_first. Despite the error, the new indices appear to work fine.

Change 507597 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/extensions/WikibaseCirrusSearch@master] Default UseCirrus to true

https://gerrit.wikimedia.org/r/507597

As a sanity check, inlabel/incaption must not be set up on commonswiki yet, right? Otherwise I would expect https://commons.wikimedia.org/w/api.php?action=query&format=json&list=search&srsearch=inlabel:pluto@en&srnamespace=6 to return at least https://commons.wikimedia.org/wiki/File:Pluto-01_Stern_03_Pluto_Color_TXT.jpg as a result, rather than nothing. (Similar searches are working as expected on Commons Beta.)

Not sure what was up here, but for posterity, the search is now working as expected (n.b., specifying srnamespace=6 is critical).

Change 506835 merged by jenkins-bot:
[mediawiki/extensions/WikibaseCirrusSearch@master] Add HasDataForLang feature

https://gerrit.wikimedia.org/r/506835

For Wikidata items, would it be possible to index sitelinks in the same way as descriptions and labels? (I suppose anything's possible, but what I'm really asking is whether it's been considered and whether there's any specific reason not to do it.) It probably seems kind of weird coming from the perspective of searching for specific text within a field, but being able to add the existence/nonexistence of a sitelink for a given lang as a search criterion would be quite useful for Suggested Edits.

I think I could work out how to add support on the PHP side just looking at what's done for labels and descriptions, but not sure if there's other stuff that would need to happen that I don't know about just from reading over the code.

It could be added, but probably not in the same form as labels/descriptions. It would probably be a single keyword field containing an array of linked sites, which can then be filtered on like a typical text field. I'm not familiar with how sitelinks are handled, but as long as they are an explicit property of the page and can only change as part of a page edit, everything else should "just work". When adding new properties to CirrusSearch, it will take 8-10 weeks to populate the production indices before features using them can be shipped.
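
As a sketch of what that could look like (the field name sitelinks is hypothetical; the real mapping would be settled during implementation), the field would be mapped as a keyword:

{
    "properties": {
        "sitelinks": {"type": "keyword"}
    }
}

A document would then carry something like "sitelinks": ["enwiki", "kowiki", "zhwiki"], and filtering for entities linked from a given wiki reduces to a term filter:

{
    "term": {"sitelinks": "zhwiki"}
}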

Should it be sitelinks as "which wiki is linked to this entity" or sitelinks as "which title on which wiki is linked to this entity"? If we're talking about titles, then do we need only keyword match or full language supporting match? Do we want the full title, with namespaces or just the name part?

Should it be sitelinks as "which wiki is linked to this entity" or sitelinks as "which title on which wiki is linked to this entity"?

Just being able to search by linked wiki would probably be sufficient for the current use case, though being able to get the actual title linked in the result would be best.

If we're talking about titles, then do we need only keyword match or full language supporting match?

I am not sure I quite understand the distinction, but what I have in mind is a pretty basic keyword search, like hassitelink:zhwiki.

Do we want the full title, with namespaces or just the name part?

Being able to narrow the query by (numeric) namespace would be great, but looking at the storage format for sitelinks (taking Imjingang Station (Q54345) as an example):

mholloway-shell@mwmaint1002:~$ PHP=php7.2 mwscript shell.php --wiki=wikidatawiki
Psy Shell v0.9.9 (PHP 7.2.8-1+0~20180725124257.2+stretch~1.gbp571e56 — cli) by Justin Hileman
>>> sudo MediaWiki\MediaWikiServices::getInstance()->getRevisionStore()->getRevisionByTitle( Title::newFromText( 'Q54345' ) )->getContent( 'main' )->getNativeData()['sitelinks'];
=> [
     "kowiki" => [
       "site" => "kowiki",
       "title" => "임진강역",
       "badges" => [],
     ],
     "enwiki" => [
       "site" => "enwiki",
       "title" => "Imjingang Station",
       "badges" => [],
     ],
     "jawiki" => [
       "site" => "jawiki",
       "title" => "臨津江駅",
       "badges" => [],
     ],
     "zhwiki" => [
       "site" => "zhwiki",
       "title" => "临津江站",
       "badges" => [],
     ],
     "commonswiki" => [
       "site" => "commonswiki",
       "title" => "Category:Imjingang Station",
       "badges" => [],
     ],
   ]

Unfortunately, it doesn't appear that numeric namespaces are currently stored. As things stand, I think we'll want the full title string, including any namespace prefix.

what I have in mind here is a pretty basic keyword search, like hassitelink:zhwiki

For this, a list of wiki names that have sitelinks (as keywords) is enough.

Being able to narrow the query by (numeric) namespace would be great

This would be much harder, as Wikidata doesn't know which numeric value each namespace has on the target wiki.

As things stand, I think we'll want the full title string including any namespace prefix.

Then it probably makes sense to store it as a language-specific string; however, it would then be complicated to assign each title to a specific wiki, I think (since Elastic doesn't really do structured data).

Maybe it would be easier to have a blacklist for P31 (instance of) values like "Wikimedia category" or "Wikimedia template".

I'm going to call this investigation resolved. WBCS is working very well for our purposes, and only needs now to be incorporated into the existing WikimediaEditorTasks suggestions API module (e.g., https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaEditorTasks/+/510255/). Possible new search features can be discussed on dedicated tickets.

  • I would have to evaluate the performance, but it is possible to sort by random numbers generated per query. Generating millions of random numbers and sorting by them is likely not that efficient, but we can evaluate it.

Hey @EBernhardson, I didn't ask to pursue this option last quarter because of time constraints and uncertainty about whether it would be needed, but it turns out it might have been useful, and it's coming up again in the context of T229315. Do you think you'd have time to evaluate this?

Change 526531 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Implement a random sort order

https://gerrit.wikimedia.org/r/526531

In a quick test over a few dozen queries, it looks like this is faster than our existing ranking. On reflection that makes sense: before, we had to look up various term statistics and perform rescore queries, but here we only generate a random number for each visited document. Uploaded a patch that adds the random rescore functionality.

Change 526531 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Implement a random sort order

https://gerrit.wikimedia.org/r/526531

Change 507597 abandoned by Mholloway:
Default UseCirrus to true

https://gerrit.wikimedia.org/r/507597