Page MenuHomePhabricator

Strip old metadata from old Parsoid content <head>: mw:TimeUuid, user, comment
Open, MediumPublic

Description

As discussed in T125266 and implemented in https://gerrit.wikimedia.org/r/#/c/271930/2/lib/wt2html/DOMPostProcessor.js, Parsoid recently removed user information from their <head>. For old content (spec version 1.1.0), we should strip this information before returning it to the client to avoid exposing possibly-sensitive user information.

Additionally, we used to inject a mw:TimeUuid meta tag as a fall-back solution for clients not supplying If-Match headers to the html2wt end point. We should verify that all major clients like VE have moved to the If-Match header, and strip it from old content as well.

Event Timeline

Related PR by @Pchelolo to document If-Match & log remaining mw:TimeUuid users: https://github.com/wikimedia/restbase/pull/529

After the last deployment we're logging the clients who don't supply the If-Match header, but do supply the mw:TimeUuid meta property. There's around 5 requests like that per hour. Here's a sample log entry:

{
  "_index": "logstash-2016.03.02",
  "_type": "restbase",
  "_id": "AVM4Ry_qO3D718AOW20G",
  "_score": null,
  "_source": {
    "host": "restbase1008",
    "level": "WARNING",
    "version": "1.0",
    "@version": "1",
    "@timestamp": "2016-03-02T17:01:41.790Z",
    "source_host": "10.64.32.178",
    "pid": 289,
    "levelPath": "warn/parsoid/etag",
    "root_req.method": "post",
    "root_req.uri": "/en.wikipedia.org/v1/transform/html/to/wikitext/Luke_Skywalker/707928665",
    "root_req.headers.content-length": "377642",
    "root_req.headers.content-type": "application/x-www-form-urlencoded",
    "root_req.headers.user-agent": "wikimedia/multi-http-client v1.0",
    "root_req.headers.x-client-ip": "::ffff:10.64.48.34",
    "root_req.headers.x-request-id": "6fa06749-e098-11e5-9d0c-d4de92971dc1",
    "request_id": "6fa06749-e098-11e5-9d0c-d4de92971dc1",
    "type": "restbase",
    "tags": [
      "es",
      "gelf",
      "normalized_message_untrimmed"
    ],
    "message": "Client did not supply etag, fallback to mw:TimeUuid meta element",
    "normalized_message": "Client did not supply etag, fallback to mw:TimeUuid meta element",
    "gelf_level": "4"
  },
  "sort": [
    1456938101790
  ]
}

This particular log entry suggests that this edit was done by a bot, but more investigation is needed.

Flow doesn't use RESTBase, and I don't think the Parsoid change is applicable to the endpoint we use (we don't use the same kind of revisions as MW).

There's been 700 cases when the If-Match was not supplied over the last month and only 2 user agents:

  1. https://en.wikipedia.org/wiki/User:Jackmcbarn/editProtectedHelper - a user script. I will reach out to the maintainer
  2. And, surprisingly, still VE https://logstash.wikimedia.org/goto/4b751b3fbf5de1044bd4255c693fc724

Tagging as a good onboarding bug as once the subtask is resolved, it will be easy to fix in code and it provides a great glimpse into what a render is, how Parsoid, RESTBase and VE interacts and what constraints we need to maintain in order to make the 3 works together correctly.

There's been 700 cases when the If-Match was not supplied over the last month and only 2 user agents:

  1. https://en.wikipedia.org/wiki/User:Jackmcbarn/editProtectedHelper - a user script. I will reach out to the maintainer
  2. And, surprisingly, still VE https://logstash.wikimedia.org/goto/4b751b3fbf5de1044bd4255c693fc724

We are currently exposing "weak" etags (apparently because Varnish makes them weak). Per the HTTP spec, weak ETags never match an If-Match header. RESTbase ignores this and allows them to work, but MediaWiki's REST framework is more strict, causing all If-Match conditionals to fail with weak etags (T238849). This lead to If-Match handing to be disabled entirely for the endpoint in the parsoid extension, which breaks things the other way around. See T310710 for further discussion.

If we do not find a reliable way to get strong etags through varnish, we should probably go back to supplying them as part of the payload, be it embedded in HTML or in a JSON field.