
Improve resource-purge request_id and dt propagation
Open, Medium, Public

Description

Now that we are producing all purges into resource-purge topic, I would like to take a moment to improve propagation of request_id and dt fields.

We need to:

  1. Make sure we're propagating the original request_id through the whole chain. @Joe do you think it would be useful to have the reqId of the original request that the purge originated from?
  2. 'dt' field would contain the timestamp when the resource being purged has changed.
  3. 'root_event.dt' should be properly propagated via the whole chain of resource_change events so that we know the timestamp of original edit that resulted in the purge.

Event Timeline

I don't think we need the request_id to be preserved - purged is definitely not the place to do analysis of such data.

What we'd need is:

  • meta.dt to be set to the time at which the resource directly changed. So, in the case of restbase, the time at which the content was invalidated in restbase
  • root.dt to be set to the time at which the event that caused the purge has happened.

So, for instance, in the following scenario:

  • Template A gets edited at 9:00
  • Page B gets invalidated and re-parsed in restbase at 11:00

we expect to get "root.dt" to be 9:00 and "meta.dt" to be 11:00.
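
The scenario above could be sketched roughly as follows. This is only an illustration, not the actual change-propagation code: the helpers `make_root_event` and `derive_event`, the URIs, the dates, and the dict layout beyond `meta.dt` and `root_event.dt` (called "root.dt" in this comment) are all assumptions made for the example.

```python
from datetime import datetime, timezone


def make_root_event(uri, changed_at):
    """Build the first event in a chain; it is its own root (hypothetical helper)."""
    dt = changed_at.isoformat()
    return {"meta": {"uri": uri, "dt": dt}, "root_event": {"dt": dt}}


def derive_event(parent, uri, changed_at):
    """Build a follow-up event: meta.dt is new, root_event is copied unchanged."""
    return {
        "meta": {"uri": uri, "dt": changed_at.isoformat()},
        "root_event": dict(parent["root_event"]),
    }


# Template A is edited at 9:00; Page B is re-parsed at 11:00 (dates are illustrative).
edit = make_root_event("/wiki/Template:A", datetime(2020, 5, 8, 9, 0, tzinfo=timezone.utc))
purge = derive_event(edit, "/wiki/B", datetime(2020, 5, 8, 11, 0, tzinfo=timezone.utc))

print(purge["root_event"]["dt"])  # time of the original template edit
print(purge["meta"]["dt"])        # time Page B itself changed
```

However many intermediate events sit between the edit and the purge, each one would copy `root_event` from its parent and only set a fresh `meta.dt`.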

daniel triaged this task as Medium priority.May 8 2020, 10:08 AM

Change 595601 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/deployment-charts@master] Preserve datetime field of the purge events

https://gerrit.wikimedia.org/r/595601

'dt' field would contain the timestamp when the resource being purged has changed.

If this schema should apply to MW, then I think this needs to be defined better. Specifically to better explain what "resource" and "change" here mean. For example, if the purge message is to purge article "Foo" (internally as result of edit to template X used by template Y used by Foo). Then should it reflect the time of the edit to "X", or the time that the job queue updated the parser cache for article "Foo"?

Likewise, if a null-edit or purge action is submitted manually from the wiki for article "Foo", should it match the Last-Modified date of the HTTP resource and thus when it was last meaningfully edited, or when the purge happened?

It seems to me like the only thing useful for the purge handler is the timestamp of when the purge event was created by the producer. Anything about what Last-Modified will be or when some cascading chain started seems like things it either can't or shouldn't use, right? Having said that - what will this actually be used for?

Change 595601 merged by jenkins-bot:
[operations/deployment-charts@master] Preserve datetime field of the purge events

https://gerrit.wikimedia.org/r/595601

It seems to me like the only thing useful for the purge handler is the timestamp of when the purge event was created by the producer. Anything about what Last-Modified will be or when some cascading chain started seems like things it either can't or shouldn't use, right? Having said that - what will this actually be used for?

AFAIK @Joe uses the timestamps to reject purges that originated from actions that happened a very long time ago (more than the max_age limit). However, I'm now thinking that's incorrect both for RESTBase and for MW, given how the RESTBase cache / ParserCache is updated.

If 'root_event.dt' contains the timestamp of the template edit, and the job queue only got to updating a certain page a week later, and the purge in question is a result of that update, there is no guarantee the parser cache was purged any earlier, since that is done by the same job queue job. So even though the Varnish-level cache for the article might have expired many times already, it might have been refetched from a parser cache / restbase cache that still contained the old template.

Whether we care to purge by this time, or should just wait for the max age to expire, is a separate question, since so much time has passed already. But rejecting purges older than max age based on root_event cannot be justified by a guarantee that the new content has already been fetched from the application.
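
To make the concern concrete, here is a small sketch comparing the two filtering choices. The `MAX_AGE` value, the `is_stale` helper, and the dates are assumptions for illustration only; this is not how purged is actually implemented.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=1)  # assumed value, not the production setting


def is_stale(event_dt, now, max_age=MAX_AGE):
    """Return True if the timestamp is older than max_age relative to now."""
    return now - event_dt > max_age


# The template was edited a week before the job queue re-rendered the page.
root_dt = datetime(2020, 5, 1, 9, 0, tzinfo=timezone.utc)  # root_event.dt
meta_dt = datetime(2020, 5, 8, 9, 0, tzinfo=timezone.utc)  # meta.dt
now = meta_dt + timedelta(minutes=1)

drop_by_root = is_stale(root_dt, now)  # True: the purge would be rejected
drop_by_meta = is_stale(meta_dt, now)  # False: the purge would be processed
```

Filtering on `root_event.dt` would drop this purge even though the backing cache only received the new content a minute earlier, which is the case described above.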

FYI, meta.dt is used as both the Kafka message timestamp as well as the timestamp for hourly partitioning in Hive. In most cases it should be used to represent the event time, or the timestamp at which the event you are modeling happened. In the general case of resource_change events, I think you would want this to be the timestamp the resource was changed.

However, for the resource-purge stream, IIUC you are using resource-purge as more of an RPC call than an event resource change model? In that case, it makes sense to set meta.dt to the time the RPC event happened, i.e. the time that the resource-purge event is produced? Another field (root_event.dt?) that represents the original causing event's event time makes sense too.

(Sorry if I'm not fully following, just wanted to provide some context around meta.dt.)