
New CirrusSearch dumps are not properly formatted
Closed, Resolved · Public · 5 Estimated Story Points

Description

The new CirrusSearch dumps are not properly formatted.

The redirect array looks like:

"redirect": [
  [
    0,
    "Area code 256"
  ],
  [
    0,
    "Area code 938"
  ]
],

But should look like:

"redirect": [
  {
    "namespace": 0,
    "title": "Area code 256"
  },
  {
    "namespace": 0,
    "title": "Area code 938"
  }
],
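
For anyone who needs to consume an already published malformed dump, the mapping between the two shapes above can be sketched as follows. This is only a consumer-side workaround, not the fix that was shipped; the helper name `repair_redirects` is hypothetical, and it assumes each broken entry is exactly a `[namespace, title]` pair in that order, as in the example above.

```python
from typing import Any


def repair_redirects(redirects: list[Any]) -> list[dict[str, Any]]:
    """Turn positional [namespace, title] pairs into redirect objects.

    Assumption: the positional order is (namespace, title), inferred
    from the example in this task. Entries that are already objects
    are passed through unchanged.
    """
    repaired = []
    for entry in redirects:
        if isinstance(entry, dict):
            # Already in the expected shape.
            repaired.append(entry)
        elif isinstance(entry, (list, tuple)) and len(entry) == 2:
            namespace, title = entry
            repaired.append({"namespace": namespace, "title": title})
        else:
            # Refuse to guess at shapes not seen in the task description.
            raise ValueError(f"unexpected redirect entry: {entry!r}")
    return repaired


broken = [[0, "Area code 256"], [0, "Area code 938"]]
fixed = repair_redirects(broken)
```

Raising on unknown shapes, rather than skipping them, keeps silent data loss out of any downstream import.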

This seems to affect other arrays of objects as well. For instance, the coordinates array looks like this:

"coordinates": [
  [
    {
      "lon": 8.816666666666666,
      "lat": 51.78333333333333
    },
    null,
    1000,
    "earth",
    null,
    true,
    null,
    null
  ]
],

Based on the schema at https://schema.wikimedia.org/repositories//primary/jsonschema/mediawiki/cirrussearch/update_pipeline/update/current.yaml
the potentially affected fields are:

  • redirect
  • coordinates
  • lexeme_forms
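
A quick way to check whether a given dump document shows the problem is to look for non-object items in these three fields. The sketch below assumes the document has already been parsed into a Python dict (for example with `json.loads`). The names `SUSPECT_FIELDS` and `malformed_fields` are hypothetical, not part of CirrusSearch.

```python
from typing import Any

# Fields named in this task as potentially affected.
SUSPECT_FIELDS = ("redirect", "coordinates", "lexeme_forms")


def malformed_fields(doc: dict[str, Any]) -> list[str]:
    """Return the suspect fields whose arrays contain non-object items.

    A field is flagged when it is a list and at least one item is not
    a dict (for example a positional list such as [0, "Area code 256"]).
    Missing or non-list fields are ignored.
    """
    bad = []
    for field in SUSPECT_FIELDS:
        value = doc.get(field)
        if not isinstance(value, list):
            continue
        if any(not isinstance(item, dict) for item in value):
            bad.append(field)
    return bad


sample = {
    "title": "Area code 256 and 938",
    "redirect": [[0, "Area code 256"], [0, "Area code 938"]],
    "coordinates": [],
}
flagged = malformed_fields(sample)
```

An empty array is treated as well formed, since it contains nothing that contradicts the schema.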

AC:

  • new cirrus dumps are properly formatted

Event Timeline

Restricted Application added a subscriber: Aklapper.
pfischer set the point value for this task to 5. · Dec 1 2025, 4:54 PM
EBernhardson subscribed.

The code fix itself ended up being pretty straightforward. We might use this opportunity to re-run the most recent dump and learn a bit more about how replacing an already published dump would work.

Patches shipped; the 20260104 dump was rerun and looks reasonable. I imported the simplewiki dump into a local instance and it loaded without issues.