
[plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms
Closed, Resolved | Public | 5 Estimated Story Points

Description

Now that mjolnir is able to run queries and collect features from 7.10.2, we are running into some new issues, exemplified by the following features collected from a single document:

'title_sum_raw_df': 0.0,
'title_min_raw_df': 3.4028235e+38,
'title_max_raw_df': 0.0,
'title_mean_raw_df': nan,
'title_stddev_raw_df': nan,

In this constructed example the query terms matched only the document body, not the title. In 6.8.23 this returned all 0's, but in 7.10.2 we are getting Float.MAX_VALUE for min and NaN for mean/stddev. Per the codebase, the 7.10.2 version should still be returning 0's for all of these values.
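The reported values look like what a sum/min/max/mean accumulator yields when it is finalized without ever seeing a matching term. A minimal Python sketch of that failure mode follows. It is not the plugin's code (the plugin is Java): the names `naive_stats` and `guarded_stats` are hypothetical, the use of a population standard deviation is an assumption, and the empty-input branch only emulates Java float semantics, where 0.0f / 0 yields NaN instead of raising.

```python
import math

# Java's Float.MAX_VALUE, the value observed in title_min_raw_df.
FLOAT_MAX = 3.4028235e+38


def naive_stats(values):
    """Aggregate term statistics with no guard for the empty case (illustrative only)."""
    total, lo, hi = 0.0, FLOAT_MAX, 0.0
    for v in values:
        total += v
        lo = min(lo, v)
        hi = max(hi, v)
    n = len(values)
    # Emulate Java float division: dividing by a zero count gives NaN, not an exception.
    mean = total / n if n else math.nan
    var = sum((v - mean) ** 2 for v in values) / n if n else math.nan
    stddev = math.sqrt(var) if n else math.nan
    return {"sum": total, "min": lo, "max": hi, "mean": mean, "stddev": stddev}


def guarded_stats(values):
    """Return all zeros when no term matched, mirroring the 6.8.23 behaviour described above."""
    if not values:
        return {"sum": 0.0, "min": 0.0, "max": 0.0, "mean": 0.0, "stddev": 0.0}
    return naive_stats(values)
```

Calling `naive_stats([])` reproduces the pattern in the feature dump (min stuck at the initial sentinel, mean and stddev NaN), while `guarded_stats([])` returns zeros throughout.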

This will need to be fixed in the LTR plugin; fixing it only on the mjolnir side would introduce a variation (plausibly minimal) between the features used at training time and the features used at evaluation time. This almost certainly also affects the currently deployed models.

Event Timeline

Plugin patch: https://github.com/ebernhardson/elasticsearch-learning-to-rank/commit/c9a59cb840f872d29263a02275cedae16ba43aa4
Based on upstream change (which mixed solutions to several related problems into one patch): https://github.com/o19s/elasticsearch-learning-to-rank/pull/380

Decided to keep our fix relatively narrow to the problem we are experiencing, although the general solutions provided in that pull request might be relevant to us.

Change 865178 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/software/elasticsearch/plugins@master] Update ltr plugin to 7.10.2-wmf1

https://gerrit.wikimedia.org/r/865178

Moving back to Ready for Dev -- SRE/Ops and clearing Assignee for the actual deploy.

dcausse renamed this task from Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms to [plugin deploy] Incorrect stats returning from 7.10.2 ltr plugin for non-matching terms. Dec 12 2022, 1:38 PM

Change 865178 merged by Bking:

[operations/software/elasticsearch/plugins@master] Update ltr plugin to 7.10.2-wmf1

https://gerrit.wikimedia.org/r/865178

Mentioned in SAL (#wikimedia-operations) [2023-01-09T20:24:52Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-09T20:25:12Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-09T20:35:51Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-09T20:36:30Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-09T20:44:13Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-09T20:44:16Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-09T20:44:38Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-09T20:52:11Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge plugin upgrade - bking@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-09T20:52:30Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade - bking@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-09T21:34:28Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade - bking@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-09T21:45:59Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: plugin upgrade - bking@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-09T22:32:41Z] <bking@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: plugin upgrade - bking@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-09T22:33:10Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: plugin upgrade - bking@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-10T00:48:17Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: plugin upgrade - bking@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-10T02:08:40Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-10T02:41:08Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-10T02:46:49Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-10T03:12:09Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-10T20:18:33Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247

Mentioned in SAL (#wikimedia-operations) [2023-01-10T22:42:49Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: plugin upgrade - ryankemper@cumin1001 - T324247

@EBernhardson Plugin upgrade (rolling restarts) on relforge/cloudelastic/codfw/eqiad is complete. Will defer to you for validating that the fix worked as intended.

I re-enabled the mjolnir dag within airflow, will see how it progresses.
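One possible way to spot-check the collected features after the upgrade is to scan them for the symptoms from the description. This is only a sketch: `suspicious_features` is a hypothetical helper, not part of mjolnir, and it assumes features arrive as a flat name-to-float mapping like the snippet in the description.

```python
import math

# Java's Float.MAX_VALUE, the sentinel that leaked into title_min_raw_df.
FLOAT_MAX = 3.4028235e+38


def suspicious_features(features):
    """Return the sorted names of features that are NaN, infinite, or at Float.MAX_VALUE."""
    bad = []
    for name, value in features.items():
        # The 0.999 factor tolerates float32/float64 rounding of the sentinel.
        if math.isnan(value) or math.isinf(value) or abs(value) >= FLOAT_MAX * 0.999:
            bad.append(name)
    return sorted(bad)


# The feature values reported in the description.
sample = {
    "title_sum_raw_df": 0.0,
    "title_min_raw_df": 3.4028235e+38,
    "title_max_raw_df": 0.0,
    "title_mean_raw_df": float("nan"),
    "title_stddev_raw_df": float("nan"),
}
print(suspicious_features(sample))
```

With the fixed plugin, the same check on a non-matching field would be expected to return an empty list, since all five values should be 0.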