
Better observability/visualization for MediaWiki jobs
Open, Low, Public

Description

Problem Statement:

While refactoring the way changes are dispatched from Wikidata to client wikis [0] from maintenance scripts to jobs, it became clear that the options for observing jobs aren't exactly great. That is a problem, because this refactoring touches a very vital part of Wikimedia infrastructure, and while the observability of the old dispatch mechanism was (and is) quite bad, we are still working mostly in the dark.

It would make future maintenance and debugging work much easier if it were simpler to observe which jobs are being created and what they are doing. That is, I want to be able to look and see what jobs are actually running and doing, instead of just reasoning from the code and crossing my fingers when I make changes.

Idea for improving things:

I would like to see some tree-like visualization starting from a (debug) request that shows:

  • What jobs were created in that request?
  • Which other jobs were triggered by those? (and so on, -> tree)
  • What were the (debug- or even trace-level?) logs of each job?
  • How long did each of these jobs take to complete and when did they run? (-> Timeline view?)
  • Maybe: What hooks were called in the context of each job?

[0]: See T48643, T272356, and Wikibase change dispatching scripts to jobs in general

Event Timeline

For seeing the tree of jobs triggered from a root job, there are lots of tools in the wild that could be used, like https://www.jaegertracing.io/
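To make the idea a bit more concrete, here is a minimal Python sketch using the OpenTelemetry SDK of how each job could record a span nested under its parent's span, which is the tree/timeline view a Jaeger UI would render. The job names, attributes, and the console exporter are stand-ins for illustration; a real setup would export to a Jaeger/OTLP backend and propagate trace context through the job queue.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter stands in for a real Jaeger/OTLP backend (assumption).
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = trace.get_tracer("mediawiki.jobqueue")

# Hypothetical job names: each child span nests under the root job's span,
# which is what Jaeger would draw as a tree with per-span timings.
with tracer.start_as_current_span("DispatchChangesJob") as root:
    root.set_attribute("job.request_id", "example-request-id")
    with tracer.start_as_current_span("EntityChangeNotificationJob"):
        with tracer.start_as_current_span("refreshLinks"):
            pass  # actual job work would happen here
```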

Adding Platform Engineering, serviceops, and Analytics, as this is related to all three teams. I'm aware that it will take time for this problem to be substantially improved, but it would be nice to start the process of getting better observability for jobs :)

Data Eng (analytics) is in the process of solving a similar problem using Airflow, but more for data pipeline jobs than for the MediaWiki job queue.

The older and more generic ticket about this problem is T237361: Discuss common needs in a job manager/scheduler.

Michael renamed this task from Better observability/visualization for jobs to Better observability/visualization for MediaWiki jobs. Sep 24 2021, 8:15 AM

Data Eng (analytics) is in the process of solving a similar problem using Airflow, but more for data pipeline jobs than for the MediaWiki job queue.

The older and more generic ticket about this problem is T237361: Discuss common needs in a job manager/scheduler.

Thank you for the pointers. 🙏
I've updated this task's title to make it easier to see what it is about and maybe reduce confusion :)

This task mixes quite a few things. I'll start by answering your questions to the best of my knowledge.

  • What jobs were created in that request?

I don't think we have that information at the moment, but it would just mean we'd need to attach the Request-Id header to the metadata of the job being created.

  • Which other jobs were triggered by those? (and so on, -> tree)

This information is already available as every job retains its *root job* information. I'm not sure that's preserved in logging though.

  • What were the (debug- or even trace-level?) logs of each job?

I'm not sure what you want to do here. We don't save logs at those levels for production requests. Would you like to have a way to propagate the logging settings from an X-Wikimedia-Debug request to all the jobs it generates? If that's the idea, then yes, you would need to modify our code to do so. But please clarify :)

  • How long did each of these jobs take to complete and when did they run? (-> Timeline view?)

This data is already available in our production logs (not sure about logstash), but we surely don't have an organized timeline view. If the problem is visualization, yes, you'd probably need tracing of jobs as @Ladsgroup suggested.

  • Maybe: What hooks were called in the context of each job?

I guess this is again just for debugging purposes - I'll let a MediaWiki developer answer this.

I don't think bringing up Airflow is really on-topic here, though I do agree that tracing the job trees and execution times via Jaeger would give us a huge amount of additional, easy-to-visualize insight into how a complex root job generates its job tree. Unfortunately, right now we don't have a Jaeger infrastructure to do so.

I also want to underline that the problem statement is slightly misleading: while it's true that the observability of jobs isn't great, and in particular the UI for it is unfortunate, it's still miles better than what the current system of Wikidata dispatching gives you :)

This task mixes quite a few things.

Yes, it is a wishlist of outcomes that I think might significantly improve the understanding that simple software developers (like me) have of what is going on with jobs. :)
It is intended to start, or contribute to, a larger conversation on the topic.

  • What were the (debug- or even trace-level?) logs of each job?

I'm not sure what you want to do here. We don't save logs at those levels for production requests. Would you like to have a way to propagate the logging settings from an X-Wikimedia-Debug request to all the jobs it generates? If that's the idea, then yes, you would need to modify our code to do so. But please clarify :)

Yes. My idea would be that only specific requests (for example with a certain debug header) would provide these logs, as I assume that storing this data for all jobs would incur prohibitive costs for little gain.

I also want to underline that the problem statement is slightly misleading: while it's true that the observability of jobs isn't great, and in particular the UI for it is unfortunate, it's still miles better than what the current system of Wikidata dispatching gives you :)

I'm sorry, my intention wasn't at all to cast any blame whatsoever. I'll see if I can rephrase the description to make that clear. Rather the opposite: The abysmal observability of the Wikidata dispatching system contributed to its continued existence over time, because developers are understandably hesitant to touch a system that is this opaque.

My goal here is for the MediaWiki job system to become more accessible, so that it is easier for us devs to follow what is going on and to change jobs with confidence.

[...] I do agree that tracing the job trees and execution times via Jaeger would give us a huge amount of additional, easy-to-visualize insight into how a complex root job generates its job tree. Unfortunately, right now we don't have a Jaeger infrastructure to do so.

I think where I want this conversation to go is: Would "a huge amount of additional, easy-to-visualize insight into how a complex root job generates its job tree" be worth having?

[...] I do agree that tracing the job trees and execution times via Jaeger would give us a huge amount of additional, easy-to-visualize insight into how a complex root job generates its job tree. Unfortunately, right now we don't have a Jaeger infrastructure to do so.

I think where I want this conversation to go is: Would "a huge amount of additional, easy-to-visualize insight into how a complex root job generates its job tree" be worth having?

I think the answer is a resounding yes. And we want to set up tracing anyway for other uses as well, so this is definitely a direction worth exploring. But it's not something that will be ready in a couple of weeks nor, probably, next quarter. This is an additional use case we should definitely consider.

  • What were the (debug- or even trace-level?) logs of each job?

I'm not sure what you want to do here. We don't save logs at those levels for production requests. Would you like to have a way to propagate the logging settings from an X-Wikimedia-Debug request to all the jobs it generates? If that's the idea, then yes, you would need to modify our code to do so. But please clarify :)

We thankfully already support this mechanism for requests, see https://wikitech.wikimedia.org/wiki/WikimediaDebug#Debug_logging for details.

I think it would mostly be a matter of having change-propagation forward the header when submitting a job, but as usual the devil is in the details.

Let me add some points here:

  • While I agree that jobs have better o11y than the current dispatching system, dispatching has been running in production for quite a long time and had issues which led to it being polished and tuned. We don't have that for the jobs we are going to deploy soon, so it's hard to see whether they are working well, not working well, buggy, etc.
  • I had an action item for the X-Wikimedia-Debug case (to implement it). Beyond just respecting the log levels, I would like to use xhgui on jobs as well; seeing a callgraph of a job would possibly be quite informative. I.e. I would like to make sure jobs triggered by a request to mwdebug get run on the same node (respecting xhgui, verbose, etc.).
  • I would like to have a way to run a job in the CLI, with parameters defined and such (not a job runner taking jobs from a producer), so I can see whether it works well or not, what the debug logs are, etc. That would have saved me a lot of time last week.
  • Having job parameters in Hadoop would be extremely useful; we keep all user requests (up to 90 days) in Hadoop. Having that in Hadoop would enable me to run analyses and come back with numbers like: Job Foo triggers Job Bar 36% of the time and Job Baz the rest, etc.

I think we should focus on solutions and low-hanging fruit for the deployment of the jobs that replace the dispatcher.

Having job parameters in Hadoop would be extremely useful; we keep all user requests (up to 90 days) in Hadoop. Having that in Hadoop would enable me to run analyses and come back with numbers like: Job Foo triggers Job Bar 36% of the time and Job Baz the rest, etc.

This is the first time I've heard of anyone wanting this! We used to import all of the mediawiki.job.* topics, but recently, during a migration, we decided to stop since no one needed them. Bringing them back into Hadoop would not be hard (file a ticket if you really want this, plz!)

That said, because the mediawiki.job.* topics are not well schema-ed, automatic Hive ingestion is difficult, so we never did it. You'd have to query the raw imported JSON. This should be easy enough with Spark though, since Spark can infer a schema for a directory of JSON by scanning over all the JSON data.
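For illustration, a rough pyspark sketch of that approach, assuming the raw mediawiki.job.* JSON were re-imported under some HDFS path; the path and the `type` field name below are assumptions, not the actual layout:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mw-job-events").getOrCreate()

# Hypothetical location of the raw imported JSON; the real layout would differ.
jobs = spark.read.json("hdfs:///wmf/data/raw/mediawiki_job/*/2021/09/*")
jobs.printSchema()  # schema inferred by scanning the JSON data

# First-pass breakdown of events per job type, assuming the envelope has a `type` field.
jobs.groupBy("type").count().orderBy("count", ascending=False).show(20, truncate=False)
```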

If I can do beeline in stat1005 and look at the data, I don't care about the rest.

There is a general problem that our analytics data gets underused, especially for technical purposes. Our data analysts and PMs heavily analyze the data, but for understanding our infra it doesn't get used at all. Most engineers don't know the full potential of such data. That's a discussion for later though :)

If I can do beeline in stat1005 and look at the data

This would be possible, but you'd have to either create a Hive table using a JSON SerDe with the schema for the specific job you are querying, or perhaps use a Hive JSON parsing function like get_json_object.

It'd probably be easier to query this using Spark SQL (in Python or Scala, whatever you prefer).
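As a sketch of the get_json_object route (the function is available in Spark SQL as well as Hive), assuming a raw import with one JSON string per line; the path, view name, and JSON paths below are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mw-job-json").getOrCreate()

# Hypothetical raw import: one JSON string per line in a text file.
raw = spark.read.text("hdfs:///wmf/data/raw/mediawiki_job/refreshLinks/*")
raw.createOrReplaceTempView("job_events_raw")

# get_json_object pulls individual fields out of the JSON string;
# the '$.type' and '$.meta.request_id' paths are assumptions about the envelope.
spark.sql("""
    SELECT get_json_object(value, '$.type')            AS job_type,
           get_json_object(value, '$.meta.request_id') AS request_id
    FROM job_events_raw
    LIMIT 10
""").show(truncate=False)
```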

But anyway we don't have that data in Hadoop atm!

our analytics data gets underused, especially for technical purposes

I agree!

@Ladsgroup @Michael let us know if having the MW job event data in Hadoop would be useful, how urgent, etc. and we can prioritize this appropriately.

@Ladsgroup @Michael let us know if having the MW job event data in Hadoop would be useful, how urgent, etc. and we can prioritize this appropriately.

We finished the new Change Dispatching rework hike that triggered the creation of this task. We encountered a few issues where this functionality would have been useful.

For example, we tried to figure out why the job duration of one of the newly created jobs peaked about every two hours, until @Lucas_Werkmeister_WMDE noticed that the pattern matched the job insertion rate of an unrelated job:

(attached screenshot: image.png, 388 KB)

I would expect such mysteries to come up every now and then, and so it would be useful to have this tool at our disposal. Though I'm not aware of any currently open mystery where using this tool would be an obvious next step.

  • I had an action item for the X-Wikimedia-Debug case (to implement it). Beyond just respecting the log levels, I would like to use xhgui on jobs as well; seeing a callgraph of a job would possibly be quite informative. I.e. I would like to make sure jobs triggered by a request to mwdebug get run on the same node (respecting xhgui, verbose, etc.).

This came up elsewhere and I made a task for it: T354042: Forward X-Wikimedia-Debug header to MediaWiki jobs