
Investigate and cleanup broken weighted_tags in cirrus indices
Closed, Resolved · Public · 3 Estimated Story Points

Description

It appears that some weighted_tags have the __DELETE_GROUPING__ tag in the search index: P83504.

We should investigate why __DELETE_GROUPING__ ends up in the search index, and then clean this data out of the index.

AC:

  • find the cause and fix it
  • clean up the search indices in eqiad, codfw and cloudelastic

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Bump cirrus-extra to 1.3.20-wmf8 | repos/search-platform/opensearch-plugins-deb!13 | ebernhardson | work/ebernhardson/noop-null | master
Improve handling arround weighted_tags delete marker | repos/search-platform/cirrus-streaming-updater!198 | ebernhardson | work/ebernhardson/noop-hints | main

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel set the point value for this task to 3.

For recommendation.link/__DELETE_GROUPING__: that was the legacy, partially broken way GrowthExperiments used to delete weighted tags under some circumstances. The code that did this was removed in December 2024, in the context of T379522, by the change "Remove legacy way of clearing link recommendations + temp config". So all that remains should be cleaning these tags out of the search index.

Change #1197330 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Prune weighted_tags deleted marker during reindex

https://gerrit.wikimedia.org/r/1197330

Found one recent page with the problem: https://ce.wikipedia.org/w/index.php?curid=1348911&action=cirrusDump (page created on 2025-10-19T05:58:16Z)
The index doc has both:

  • recommendation.link/__DELETE_GROUPING__,
  • classification.ores.articletopic/__DELETE_GROUPING__

Corresponding intermediate update events are:

{
  "change_type": "REV_BASED_UPDATE",
  "cirrussearch_cluster_group": "chi",
  "cirrussearch_index_name": "cewiki_content",
  "cirrussearch_noop_hints": {
    "weighted_tags": "multilist"
  },
  "dt": "2025-10-19T05:58:16Z",
  "fields": {
    "weighted_tags": [
      "classification.prediction.articletopic/Culture.Sports|1000",
      "classification.prediction.articletopic/Culture.Biography.Biography*|1000",
      "classification.ores.articletopic/__DELETE_GROUPING__"
    ]
  },
  "meta": {
    "domain": "ce.wikipedia.org",
    "dt": "2025-10-19T06:03:48.453879Z",
    "id": "234d72e7-6f2c-4eb3-ac49-69383291a74a",
    "request_id": "6829faf6-2969-4257-a454-c50128debea4",
    "stream": "cirrussearch.update_pipeline.update.v1",
    "uri": "https://ce.wikipedia.org/wiki/%D0%90%D1%85%D1%82%D1%83%D1%80%D1%81%D0%BB%D0%BE_(%D1%84%D1%83%D1%82%D0%B1%D0%BE%D0%BB)"
  },
  "namespace_id": 0,
  "page_id": 1348911,
  "rev_id": 10912485,
  "wiki_id": "cewiki",
  "$schema": "/mediawiki/cirrussearch/update_pipeline/update/1.0.1"
}
{
  "change_type": "REV_BASED_UPDATE",
  "cirrussearch_cluster_group": "chi",
  "cirrussearch_index_name": "cewiki_content",
  "cirrussearch_noop_hints": {
    "weighted_tags": "multilist"
  },
  "dt": "2025-10-19T06:36:41Z",
  "fields": {
    "weighted_tags": [
      "classification.prediction.articletopic/Culture.Sports|1000",
      "classification.prediction.articletopic/Culture.Biography.Biography*|1000",
      "recommendation.link/__DELETE_GROUPING__",
      "classification.ores.articletopic/__DELETE_GROUPING__"
    ]
  },
  "meta": {
    "domain": "ce.wikipedia.org",
    "dt": "2025-10-19T06:42:14.504229Z",
    "id": "bb964185-30f4-4dd9-84c7-061d40e6a0a4",
    "request_id": "2130a7a9-e516-402b-b252-9e5694ec9a8f",
    "stream": "cirrussearch.update_pipeline.update.v1",
    "uri": "https://ce.wikipedia.org/wiki/%D0%90%D1%85%D1%82%D1%83%D1%80%D1%81%D0%BB%D0%BE_(%D1%84%D1%83%D1%82%D0%B1%D0%BE%D0%BB)"
  },
  "namespace_id": 0,
  "page_id": 1348911,
  "rev_id": 10912487,
  "wiki_id": "cewiki",
  "$schema": "/mediawiki/cirrussearch/update_pipeline/update/1.0.1"
}
{
  "change_type": "PAGE_RERENDER",
  "cirrussearch_cluster_group": "chi",
  "cirrussearch_index_name": "cewiki_content",
  "dt": "2025-10-19T13:14:03Z",
  "meta": {
    "domain": "ce.wikipedia.org",
    "dt": "2025-10-19T13:19:35.813586Z",
    "id": "c9b7a58d-0eb6-43c7-a7ca-00d891055590",
    "request_id": "b1f44e0e-1fbe-4af7-99a7-c7013d68c6c2",
    "stream": "cirrussearch.update_pipeline.update.v1",
    "uri": "https://ce.wikipedia.org/wiki/%D0%90%D1%85%D1%82%D1%83%D1%80%D1%81%D0%BB%D0%BE_(%D1%84%D1%83%D1%82%D0%B1%D0%BE%D0%BB)"
  },
  "namespace_id": 0,
  "page_id": 1348911,
  "wiki_id": "cewiki",
  "$schema": "/mediawiki/cirrussearch/update_pipeline/update/1.0.1"
}
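The events above can be checked mechanically for delete markers. The following is a minimal sketch, assuming each event has been parsed into a Python dict with the same field names as the dumps above; the helper name delete_marker_prefixes and the trimmed sample events are illustrative only, not part of the update pipeline.

```python
DELETE_MARKER = "__DELETE_GROUPING__"


def delete_marker_prefixes(event):
    """Return the tag prefixes whose delete marker appears in an update event.

    Hypothetical helper: `event` is a dict shaped like the update events above.
    Events without a `fields.weighted_tags` list (e.g. PAGE_RERENDER) yield [].
    """
    tags = (event.get("fields") or {}).get("weighted_tags") or []
    prefixes = []
    for tag in tags:
        prefix, sep, name = tag.partition("/")
        # Ignore an optional "|weight" suffix when comparing the tag name.
        if sep and name.split("|", 1)[0] == DELETE_MARKER:
            prefixes.append(prefix)
    return prefixes


# Trimmed copies of the three events above (only the fields used here).
events = [
    {"change_type": "REV_BASED_UPDATE", "rev_id": 10912485, "fields": {"weighted_tags": [
        "classification.prediction.articletopic/Culture.Sports|1000",
        "classification.prediction.articletopic/Culture.Biography.Biography*|1000",
        "classification.ores.articletopic/__DELETE_GROUPING__",
    ]}},
    {"change_type": "REV_BASED_UPDATE", "rev_id": 10912487, "fields": {"weighted_tags": [
        "classification.prediction.articletopic/Culture.Sports|1000",
        "classification.prediction.articletopic/Culture.Biography.Biography*|1000",
        "recommendation.link/__DELETE_GROUPING__",
        "classification.ores.articletopic/__DELETE_GROUPING__",
    ]}},
    {"change_type": "PAGE_RERENDER"},
]

for event in events:
    print(event["change_type"], event.get("rev_id"), delete_marker_prefixes(event))
```

Run against the trimmed events, this reports one marker prefix for the first update, two for the second, and none for the rerender, which matches the two markers seen in the index doc.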

Change #1197330 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Prune weighted_tags deleted marker during reindex

https://gerrit.wikimedia.org/r/1197330

Deployed the additional logging yesterday. This morning I reviewed the WARN logs emitted by the jobmanagers in codfw since the restart, and none of the new log messages have been emitted. Based on the total volume of incorrectly indexed pages, it seems likely that whatever is causing these issues has happened within the last 12 hours, so the cause is probably not in the parts of SUP where logging was added.

I think the problem is in the extra plugin. I could reproduce it when weighted_tags is currently null in OpenSearch, with the following bulk sequence:

{"index": {"_index": "my_database_content", "_id": "10000"}}
{}
{"update": {"_index": "my_database_content", "_id": "10000"}}
{"script":{"source":"super_detect_noop","lang":"super_detect_noop","params":{"handlers":{"weighted_tags":"multilist","version":"documentVersion"},"source":{"version":1,"weighted_tags":["mytag/__DELETE_GROUPING__","myothertag/somedata|2"]}}},"upsert":{"version":1}}
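For anyone who wants to replay the sequence with different tags, the bulk body can be generated rather than hand-edited. This is a minimal sketch that only builds the newline-delimited JSON payload and does not send it anywhere; the function name bulk_body is hypothetical, and the index name, id and script parameters are copied from the reproduction above.

```python
import json


def bulk_body(index, doc_id, weighted_tags):
    """Build the NDJSON body for the reproduction above (hypothetical helper).

    The first pair indexes an empty document, so weighted_tags is absent (null);
    the second pair applies a super_detect_noop update with the multilist handler.
    """
    actions = [
        {"index": {"_index": index, "_id": doc_id}},
        {},
        {"update": {"_index": index, "_id": doc_id}},
        {
            "script": {
                "source": "super_detect_noop",
                "lang": "super_detect_noop",
                "params": {
                    "handlers": {"weighted_tags": "multilist", "version": "documentVersion"},
                    "source": {"version": 1, "weighted_tags": weighted_tags},
                },
            },
            "upsert": {"version": 1},
        },
    ]
    # The bulk API expects one JSON object per line, with a trailing newline.
    return "".join(json.dumps(action, separators=(",", ":")) + "\n" for action in actions)


body = bulk_body(
    "my_database_content",
    "10000",
    ["mytag/__DELETE_GROUPING__", "myothertag/somedata|2"],
)
print(body, end="")
```

The resulting string can be posted to the _bulk endpoint of a test cluster that has the extra plugin installed; which cluster and client to use is left open here.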

Change #1198569 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[search/extra@master] multilist: Improve support for null source value

https://gerrit.wikimedia.org/r/1198569

Thanks! With the reproduction, it was pretty easy to work up a fix.

Change #1198569 merged by jenkins-bot:

[search/extra@master] multilist: Improve support for null source value

https://gerrit.wikimedia.org/r/1198569
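As a rough illustration of the behaviour the cleanup is aiming for, here is a toy model of a group-wise tag merge in which a delete marker clears its group and is never stored, even when the existing value is null. This is a sketch under stated assumptions, not the plugin's code: the function merge_multilist and the grouping-by-prefix semantics are assumptions made for the example, and the real handler lives in search/extra.

```python
DELETE_MARKER = "__DELETE_GROUPING__"


def group_of(tag):
    """Prefix before the first '/', e.g. 'mytag' for 'mytag/somedata|2'."""
    return tag.split("/", 1)[0]


def is_delete_marker(tag):
    return tag.endswith("/" + DELETE_MARKER)


def merge_multilist(existing, incoming):
    """Toy group-wise merge (assumed semantics, not the plugin implementation).

    Every group mentioned in `incoming` replaces that group in `existing`;
    a delete marker removes the group and is itself dropped from the result.
    """
    existing = existing or []  # treat a null/missing field as an empty list
    touched = {group_of(tag) for tag in incoming}
    kept = [tag for tag in existing if group_of(tag) not in touched]
    added = [tag for tag in incoming if not is_delete_marker(tag)]
    return kept + added


# The reproduction case: nothing stored yet, update carries a delete marker.
print(merge_multilist(None, ["mytag/__DELETE_GROUPING__", "myothertag/somedata|2"]))
```

Under this model the null case leaves only myothertag/somedata|2 in the stored list, which is the outcome the broken documents in the index did not get.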