
Look into logstash for errors on maps
Closed, Resolved | Public | 5 Estimated Story Points

Description

Let's look out for errors and error-like status codes in the logs for the maps servers.

Document what we find and see if it fits any existing task.

Outcomes

New logstash dashboard for debugging Kartotherian: https://logstash.wikimedia.org/goto/9e502017ee8d1c81b8ea94ebf5573ecd

Obvious bugs and cleanup work have been split into subtasks.

Out of roughly 1M log messages across all hosts in 1 week, these are the main categories and frequencies of noteworthy errors:

| Message | Task | Impact | Share |
| --- | --- | --- | --- |
| groupIds not available | T308223 | Missing all annotations | 83% (see T309188#8292087 for explanation) |
| Marker symbol '%s' is invalid | T145475 (and some are just invalid) | Missing map | 7% |
| Failed to parse color: "#function fill() { [native code] }" | T308560 | Missing map | 3% |
| Bad geojson - unknown type ExternalData | T308223 | Missing map | 1.5% |
| SPARQL query result contains non-unique ID | T308223 | No effect | 0.4% |
| ETIMEDOUT | | Missing map | 0.4% |
| XML document not well formed | | | 0.2% |
| "ids" or "query" parameter must be given | | Missing map | 0.1% |
| image created from bytes must be 2048 pixels or fewer on each side | | | 0.07% |
| ESOCKETTIMEDOUT | | | 0.05% |
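A breakdown like the one above can be approximated by bucketing exported log messages with a few regular expressions. The following is only a minimal sketch: the patterns, the bucket labels, and the idea of feeding it a plain list of message strings are assumptions for illustration, not the query behind the dashboard.

```python
import re
from collections import Counter

# Hypothetical buckets, loosely based on the message texts in the table.
PATTERNS = [
    ("groupIds not available", re.compile(r"groupIds not available")),
    ("invalid marker symbol", re.compile(r"Marker symbol '.*' is invalid")),
    ("bad color", re.compile(r"Failed to parse color")),
    ("timeout", re.compile(r"\bE(SOCKET)?TIMEDOUT\b")),
]


def categorize(message):
    """Return the label of the first matching bucket, or "other"."""
    for label, pattern in PATTERNS:
        if pattern.search(message):
            return label
    return "other"


def shares(messages):
    """Return each bucket's share of all messages, in percent."""
    counts = Counter(categorize(m) for m in messages)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.most_common()}
```

Matching on the first pattern that hits keeps the buckets mutually exclusive, so the shares always add up to 100%.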

A few errors also show that SPARQL queries and GeoJSON are sometimes invalid.

Looking at the "groupIds not available" errors specifically, in a one-week sample of roughly 900k such messages:

| Category | Task | Share |
| --- | --- | --- |
| Wikidata thumbnails missing title | T309695 | 48% |
| Transliterated text needs language parameter | T246314 | 27% |
| Other errors where revid is not present | T309773 | 17% |
| Errors although revid is present | T309702 | 7.3% |
| Wikivoyage template malfunctioning? Errors like "groupIds not available: Maske,Track,Aktivität,Anderes,Anreise,Ausgehen,Aussicht,Besiedelt,Fehler,Gebiet,Kaufen,Küche,Sehenswert,Unterkunft,aquamarinblau,cosmos,gold,hellgrün,orange,pflaumenblau,rot,silber,violett" | | 1.3% |

There should be no more requests without revid, and those with revid should almost always succeed, so these statistics are surprising. The Wikidata subtask is an easy win.

Event Timeline

Pulling from the doc above, we saw roughly 1.3M errors over the sampled week May 13-20, and of those 1.1M were the "groupIds not available" message.

Spot-checking the huge number of bad groupIds, I'm not coming up with a general theory yet. To take one example, the article for the Russian Embassy in Beijing on Chinese Wikipedia is requested roughly a dozen times per day, with a consistent but incorrect map hash. This map is coming from a template and pulling the Wikidata coordinate for the page item, but neither the coordinate nor the template seems to have changed within the last few months. Ultimately, the mapframe is generated by a Lua module which hasn't changed in years.

What's even more confounding is that the map image request is now being generated with revids for versioning, and this revid is for the latest revision. The issue can also be seen before versioned maps were deployed, with the same incorrect groupId being requested.

I'll check the referrer and agent for these requests, to be sure these are regular users visiting the page and not bots or coming from an off-wiki source. But the revids parameter is a strong indicator that MediaWiki is at fault.

Interesting lead to follow up on: the lang parameter to Kartotherian is always zh, but there seems to be a relationship between the bad map group ID and the request's accept-language being "zh-cn". Indeed, setting uselang on the page view https://zh.wikipedia.org/wiki/%E4%BF%84%E7%BE%85%E6%96%AF%E9%A7%90%E8%8F%AF%E5%A4%A7%E4%BD%BF%E9%A4%A8?uselang=zh-cn results in the broken map: https://maps.wikimedia.org/img/osm-intl,10,a,a,270x200.png?lang=zh&domain=zh.wikipedia.org&title=%E4%BF%84%E7%BE%85%E6%96%AF%E9%A7%90%E8%8F%AF%E5%A4%A7%E4%BD%BF%E9%A4%A8&revid=71713591&groups=_bc660fd4ba69fba1a1a083ce694bb530f37ad88f

Requesting the mapdata with a uselang=zh-cn parameter results in the bad group ID:
https://zh.wikipedia.org/w/api.php?action=query&prop=mapdata&revids=60731942&uselang=zh-cn
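To repeat this check for other pages, the request URL can be built per interface language and the resulting group IDs compared. This is a rough sketch: the helper names are hypothetical, the `format=json` parameter is added here and is not part of the URL above, and the comparison assumes the mapdata has already been decoded into a dict keyed by group ID.

```python
from urllib.parse import urlencode

API = "https://zh.wikipedia.org/w/api.php"


def mapdata_url(revid, uselang):
    """Build a mapdata query URL for one revision and interface language."""
    params = {
        "action": "query",
        "prop": "mapdata",
        "revids": revid,
        "uselang": uselang,
        "format": "json",  # added for machine-readable output
    }
    return API + "?" + urlencode(params)


def titles_by_group(mapdata):
    """Map each group ID to the set of feature titles it contains."""
    return {
        group: {f.get("properties", {}).get("title") for f in features}
        for group, features in mapdata.items()
    }


def diff_groups(a, b):
    """Return the group IDs that appear in only one of two mapdata dicts."""
    return sorted(set(a) ^ set(b))
```

An empty result from `diff_groups` would mean both interface languages produce the same groups; any entries point at output that varies by `uselang`.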

This suggests that the parser cache is being split by interface language within the mapframe. Here's the live data with uselang=en:

{
  "_dc1eec1fc2c94a5130dac46def8fec9060c3fdfb": [
    {
      "type": "ExternalData",
      "service": "geoshape",
      "url": "https://maps.wikimedia.org/geoshape?getgeojson=1&ids=Q4374043",
      "properties": {
        "stroke-width": 3,
        "stroke": "#FF0000",
        "title": "俄羅斯駐華大使館",
        "fill": "#606060"
      }
    },
    {
      "type": "ExternalData",
      "service": "geoline",
      "url": "https://maps.wikimedia.org/geoline?getgeojson=1&ids=Q4374043",
      "properties": {
        "stroke-width": 5,
        "stroke": "#FF0000",
        "title": "俄羅斯駐華大使館"
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "coordinates": [
          116.422897,
          39.944452
        ],
        "type": "Point"
      },
      "properties": {
        "title": "俄羅斯駐華大使館",
        "marker-color": "#5E74F3"
      }
    }
  ]
}

and with uselang=zh-cn:

{
  "_bc660fd4ba69fba1a1a083ce694bb530f37ad88f": [
    {
      "type": "ExternalData",
      "service": "geoshape",
      "url": "https://maps.wikimedia.org/geoshape?getgeojson=1&ids=Q4374043",
      "properties": {
        "stroke-width": 3,
        "stroke": "#FF0000",
        "title": "俄罗斯驻华大使馆",
        "fill": "#606060"
      }
    },
    {
      "type": "ExternalData",
      "service": "geoline",
      "url": "https://maps.wikimedia.org/geoline?getgeojson=1&ids=Q4374043",
      "properties": {
        "stroke-width": 5,
        "stroke": "#FF0000",
        "title": "俄罗斯驻华大使馆"
      }
    },
    {
      "type": "Feature",
      "geometry": {
        "coordinates": [
          116.422897,
          39.944452
        ],
        "type": "Point"
      },
      "properties": {
        "title": "俄罗斯驻华大使馆",
        "marker-color": "#5E74F3"
      }
    }
  ]
}

It's traditional vs. simplified Chinese, "華" vs. "华", so this is probably due to some exciting layer which intentionally splits by interface language preferences, regardless of the page's content language.
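Both group IDs are an underscore followed by 40 hex digits, the length of a SHA-1 digest. If the ID is derived by hashing the group's serialized GeoJSON (an assumption here; the actual serialization used by the extension is not shown in this task), then a single converted character in a title is enough to produce a different ID. The sketch below illustrates that effect only and is not expected to reproduce the real IDs.

```python
import hashlib
import json


def group_id(features):
    """Derive an ID from the content of a feature list.

    Assumption: deterministic JSON serialization hashed with SHA-1.
    The real serialization may differ, so the IDs will not match
    the ones in the API responses.
    """
    payload = json.dumps(
        features, ensure_ascii=False, sort_keys=True, separators=(",", ":")
    )
    return "_" + hashlib.sha1(payload.encode("utf-8")).hexdigest()


def feature(title):
    """Build the point feature from the mapdata with the given title."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [116.422897, 39.944452]},
        "properties": {"title": title, "marker-color": "#5E74F3"},
    }


traditional_id = group_id([feature("俄羅斯駐華大使館")])
simplified_id = group_id([feature("俄罗斯驻华大使馆")])
print(traditional_id != simplified_id)
```

Under this assumption, a page rendered with one language variant will reference a group ID that does not exist in the mapdata generated for another variant, which matches the "groupIds not available" symptom.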

awight updated the task description.
awight changed the point value for this task from 3 to 5.
awight moved this task from Doing to Tech Review on the WMDE-TechWish-Sprint-2022-05-25 board.

Note: geoshapes service errors need to have additional fields added to the log messages, so that we can connect them to a specific page.

I would close this. The investigation is finished as far as our scope goes.