Add new elasticsearch field to index the number of outgoing links
Closed, Resolved | Public | 3 Estimated Story Points

Description

See parent task for details.

AC:

  • new field configuration is added to analysis chain config chains
  • reindexing all wikis is not part of this task

Event Timeline

Gehel set the point value for this task to 3. (Sep 12 2022, 3:59 PM)

Change 831988 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add token_count subfield to outgoing_link

https://gerrit.wikimedia.org/r/831988

Change 831988 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add token_count subfield to outgoing_link

https://gerrit.wikimedia.org/r/831988

Change 833031 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@wmf/1.40.0-wmf.1] Add token_count subfield to outgoing_link

https://gerrit.wikimedia.org/r/833031

Change 833031 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@wmf/1.40.0-wmf.1] Add token_count subfield to outgoing_link

https://gerrit.wikimedia.org/r/833031

Mentioned in SAL (#wikimedia-operations) [2022-09-19T20:59:10Z] <ebernhardson@deploy1002> Synchronized php-1.40.0-wmf.1/extensions/CirrusSearch/includes/Maintenance/MappingConfigBuilder.php: Backport: [[gerrit:833031|Add token_count subfield to outgoing_link (T317546)]] (duration: 03m 51s)

So this is a bit silly, but here is how someone can use the count in a query:

{
    "_source": [
        "title", "outgoing_link"
    ],
    "track_total_hits": true,
    "size": 1,
    "stored_fields": [],
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "doc['outgoing_link.token_count'].length"
            }
        }
    }
}

In particular, note that we take .length of the token_count, rather than .value as might be expected. Due to Elasticsearch oddities, when we index ["a", "b", "c"] and apply the token_count to it, what gets stored in doc_values is [1,1,1]. So we are basically ignoring the count and instead taking the length of the doc_values, since we know that all contained values are 1. A script could sum the doc_values, but as mentioned that's pointless here since they will always be 1.
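
For anyone who wants to convince themselves of the .length vs. sum point without a cluster, here is a rough Python sketch. It is only a toy model of the behaviour described above, not Elasticsearch code: token_count_doc_values and build_link_count_query are hypothetical helper names made up for illustration, and the model assumes each indexed link value contributes exactly one token.

```python
import json


def token_count_doc_values(values):
    """Toy model (assumption): each indexed value is a single token,
    so the token_count subfield stores one 1 per value."""
    return [1 for _ in values]


def build_link_count_query(size=1):
    """Build the same request body as the JSON query above.

    The script takes the length of the doc_values rather than .value.
    """
    return {
        "_source": ["title", "outgoing_link"],
        "track_total_hits": True,
        "size": size,
        "stored_fields": [],
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "doc['outgoing_link.token_count'].length"
                },
            }
        },
    }


if __name__ == "__main__":
    links = ["a", "b", "c"]
    counts = token_count_doc_values(links)
    # Length and sum agree because every stored value is 1.
    print(counts, len(counts), sum(counts))
    print(json.dumps(build_link_count_query(), indent=4))
```

Since every stored value is 1 in this model, len(counts) and sum(counts) are always equal, which is why the query script only needs .length.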

Gehel added a subscriber: EBernhardson.

@Tgr: could you confirm that this works as expected for your use case? Thanks!

After discussion with @Tgr, I'll close this for now. Full validation needs a reindex, which is tracked on T147505. Feel free to re-open if there is an issue after the reindex.

Etonkovidova added a subscriber: Etonkovidova.

Based on the above comment, closing as Resolved.

@Tgr reindexing is completed, this field is now available on all indices in all clusters (prod and beta)

Thank you @EBernhardson!