Page MenuHomePhabricator

Strip old metadata from old Parsoid content <head>: mw:TimeUuid, user, comment
Open, MediumPublic

Description

As discussed in T125266 and implemented in https://gerrit.wikimedia.org/r/#/c/271930/2/lib/wt2html/DOMPostProcessor.js, Parsoid recently removed user information from their <head>. For old content (spec version 1.1.0), we should strip this information before returning it to the client to avoid exposing possibly-sensitive user information.

Additionally, we used to inject a mw:TimeUuid meta tag as a fall-back solution for clients not supplying If-Match headers to the html2wt end point. We should verify that all major clients like VE have moved to the If-Match header, and strip it from old content as well.

Related Objects

StatusSubtypeAssignedTask
OpenReleaseNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenFeatureNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenBUG REPORTNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
DuplicateNone
Resolvedmatmarex
Resolvedmatmarex
Resolved mobrovac
Resolved mobrovac
Resolved mobrovac
OpenNone
Resolvedcscott
ResolvedABreault-WMF
Resolvedcscott
Opencscott
Resolvedssastry
ResolvedJgiannelos
OpenJgiannelos
OpenJgiannelos
OpenJgiannelos
OpenJgiannelos
ResolvedJgiannelos

Event Timeline

Related PR by @Pchelolo to document If-Match & log remaining mw:TimeUuid users: https://github.com/wikimedia/restbase/pull/529

After the last deployment we're logging the clients who don't supply the If-Match header, but do supply the mw:TimeUuid meta property. There's around 5 requests like that per hour. Here's a sample log entry:

{
  "_index": "logstash-2016.03.02",
  "_type": "restbase",
  "_id": "AVM4Ry_qO3D718AOW20G",
  "_score": null,
  "_source": {
    "host": "restbase1008",
    "level": "WARNING",
    "version": "1.0",
    "@version": "1",
    "@timestamp": "2016-03-02T17:01:41.790Z",
    "source_host": "10.64.32.178",
    "pid": 289,
    "levelPath": "warn/parsoid/etag",
    "root_req.method": "post",
    "root_req.uri": "/en.wikipedia.org/v1/transform/html/to/wikitext/Luke_Skywalker/707928665",
    "root_req.headers.content-length": "377642",
    "root_req.headers.content-type": "application/x-www-form-urlencoded",
    "root_req.headers.user-agent": "wikimedia/multi-http-client v1.0",
    "root_req.headers.x-client-ip": "::ffff:10.64.48.34",
    "root_req.headers.x-request-id": "6fa06749-e098-11e5-9d0c-d4de92971dc1",
    "request_id": "6fa06749-e098-11e5-9d0c-d4de92971dc1",
    "type": "restbase",
    "tags": [
      "es",
      "gelf",
      "normalized_message_untrimmed"
    ],
    "message": "Client did not supply etag, fallback to mw:TimeUuid meta element",
    "normalized_message": "Client did not supply etag, fallback to mw:TimeUuid meta element",
    "gelf_level": "4"
  },
  "sort": [
    1456938101790
  ]
}

This particular log entry suggests that this edit was done by a bot, but more investigation is needed.

Flow doesn't use RESTBase, and I don't think the Parsoid change is applicable to the endpoint we use (we don't use the same kind of revisions as MW).

There's been 700 cases when the If-Match was not supplied over the last month and only 2 user agents:

  1. https://en.wikipedia.org/wiki/User:Jackmcbarn/editProtectedHelper - a user script. I will reach out to the maintainer
  2. And, surprisingly, still VE https://logstash.wikimedia.org/goto/4b751b3fbf5de1044bd4255c693fc724

Tagging as a good onboarding bug as once the subtask is resolved, it will be easy to fix in code and it provides a great glimpse into what a render is, how Parsoid, RESTBase and VE interacts and what constraints we need to maintain in order to make the 3 works together correctly.

There's been 700 cases when the If-Match was not supplied over the last month and only 2 user agents:

  1. https://en.wikipedia.org/wiki/User:Jackmcbarn/editProtectedHelper - a user script. I will reach out to the maintainer
  2. And, surprisingly, still VE https://logstash.wikimedia.org/goto/4b751b3fbf5de1044bd4255c693fc724

We are currently exposing "weak" etags (apparently because Varnish makes them weak). Per the HTTP spec, weak ETags never match an If-Match header. RESTbase ignores this and allows them to work, but MediaWiki's REST framework is more strict, causing all If-Match conditionals to fail with weak etags (T238849). This lead to If-Match handing to be disabled entirely for the endpoint in the parsoid extension, which breaks things the other way around. See T310710 for further discussion.

If we do not find a reliable way to get strong etags through varnish, we should probably go back to supplying them as part of the payload, be it embedded in HTML or in a JSON field.