
Jobrunner monitoring still calls /rpc/runJobs.php
Closed, Resolved (Public)

Description

The puppet role for the jobrunner_tls includes service monitoring that calls /rpc/runJobs.php

runJobs.php is the old way of calling job runners within the WMF cluster, and after the move to the Kafka-based job queue it is no longer used. In fact, this monitoring call is the only thing that still calls that RPC endpoint, so if we remove it or replace it with something else, we will be able to remove that piece of code entirely.

Instead, if possible, the monitoring should try making a POST call to /rpc/RunSingleJob.php with a serialized NullJob - that will be much closer to what the real system does. I'm not sure if it's possible to make the monitoring role do a POST request though.

Event Timeline

Yes, it's possible to make the monitoring check do a POST request.

It uses the check_http nagios (icinga) plugin and that has a parameter for it.

https://www.monitoring-plugins.org/doc/man/check_http.html

Change 566374 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga/mediawiki: update jobrunner monitoring, add command to use a POST request

https://gerrit.wikimedia.org/r/566374

See the example change above. You would just have to replace "POST_DATA" with the actual data as a string.
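
For illustration, a check_http invocation that sends a POST body could be assembled roughly as in the sketch below. This is not the content of the Gerrit change: the helper name `build_check_http_command` and the host `jobrunner.example.org` are hypothetical, and "POST_DATA" stays a placeholder exactly as in the patch. The flags used (`-H` host, `-u` URI, `-P` POST data, `-S` TLS) are the ones described in the check_http man page linked above.

```python
import shlex
from typing import List


def build_check_http_command(host: str, uri: str, post_data: str,
                             use_tls: bool = True) -> List[str]:
    """Build an argv list for the check_http plugin that sends a POST body.

    Hypothetical helper for illustration only; the real check is defined
    in puppet, not in Python.
    """
    # -H: host name, -u: URI to request, -P: string to send as POST data
    args = ["check_http", "-H", host, "-u", uri, "-P", post_data]
    if use_tls:
        # -S: connect over TLS
        args.append("-S")
    return args


# Assumed host name; "POST_DATA" is the placeholder from the patch.
command = build_check_http_command(
    "jobrunner.example.org", "/rpc/RunSingleJob.php", "POST_DATA"
)
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids quoting problems once "POST_DATA" is replaced with a real serialized job.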

Let's hold off on this one until we finish T220127, because we will change how job execution is called in production.

As a part of the WMF job execution overhaul under T244826 we're planning to kill both /rpc endpoints in mw-config and replace the /rpc/RunSingleJob with an internal REST endpoint.

The REST endpoint will be protected by verifying an event signature: every event will be signed with the MW secret key. So, in order to have a service check of the job runner, we would need to somehow POST a correctly signed event.

Options are:

  1. Sign a sample event with the MW private key and put it statically into the patch above. The signature is a keyed hash of the serialized event, so we wouldn't expose the key. Downside: if the key is changed, the check will start failing.
  2. Get a key in puppet and add a script to sign an event. Downside: too complicated.
  3. Poke a special hole in our own armor: verify the request origin within the executor and allow monitoring requests to bypass signature verification. Downside: it's crazy ugly and insecure.
  4. Remove the special check for jobrunner. The current one doesn't really work anyway, since it doesn't call the same entry point that the actual code calls, and the two entry points share almost no code.

Am I missing some obvious solution? I think I would vote for either 2 or 4, with a big preference towards 2 if we can indeed get hold of the MW private key within puppet. The script to serialize a proper event should be fairly simple.
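
To make options 1 and 2 more concrete, here is a minimal sketch of signing an event with a keyed hash. It rests on assumptions not stated in this task: HMAC-SHA256 as the keyed hash, canonical JSON as the serialization, and a made-up event shape and key. The function names are hypothetical and the real MW signing scheme may differ.

```python
import hashlib
import hmac
import json


def serialize_event(event: dict) -> bytes:
    """Serialize an event deterministically (assumed: canonical JSON)."""
    return json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_event(event: dict, secret_key: bytes) -> str:
    """Return a hex keyed hash of the serialized event (assumed: HMAC-SHA256)."""
    return hmac.new(secret_key, serialize_event(event), hashlib.sha256).hexdigest()


def verify_event(event: dict, signature: str, secret_key: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    return hmac.compare_digest(sign_event(event, secret_key), signature)


# Placeholder key and event shape, for illustration only.
key = b"not-the-real-key"
event = {"type": "null", "params": {}}
signature = sign_event(event, key)

print(verify_event(event, signature, key))             # same key
print(verify_event(event, signature, b"rotated-key"))  # key was changed
```

The second call shows the downside of option 1: a statically stored signature stops verifying as soon as the key is rotated, even though the signature itself never reveals the key.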

Pchelolo added a subscriber: Joe.

Per conversation with @Joe regarding the future of this monitoring check, all we need to do is verify that MW is up and running and able to respond on jobrunner hosts. None of the complexity I described in the comment above is required. So I guess we can use the same HTTPS check that the appservers use.

Change 576301 had a related patch set uploaded (by Hnowlan; owner: Hnowlan):
[operations/puppet@production] jobrunner: add simple HTTP check

https://gerrit.wikimedia.org/r/576301

Change 575392 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[operations/puppet@production] Remove references to obsolete rpc/RunJobs.php endpoint

https://gerrit.wikimedia.org/r/575392

Who is this blocked on, and do they know that?

As a part of this we were thinking of redoing the whole job execution part and unifying the jobrunner and app server Apache configurations, and that got blocked because it's not a good idea to do something that risky during the derisking time :)

I guess we can just remove this monitoring call altogether for now, @hnowlan, and deal with adding a different monitoring check as a part of T246389.

Change 592631 had a related patch set uploaded (by Hnowlan; owner: Hnowlan):
[operations/puppet@production] mediawiki:jobrunner_tls: Remove runjobs monitoring

https://gerrit.wikimedia.org/r/592631

Change 566374 abandoned by Dzahn:
icinga/mediawiki: update jobrunner monitoring, add command to use a POST request

Reason:
https://phabricator.wikimedia.org/T243096#5917650

https://gerrit.wikimedia.org/r/566374

Change 592631 merged by Hnowlan:
[operations/puppet@production] mediawiki:jobrunner_tls: Remove runjobs monitoring

https://gerrit.wikimedia.org/r/592631

Check has been removed - other monitoring will be added as part of T246389

Change 575392 merged by Giuseppe Lavagetto:

[operations/puppet@production] mediawiki: Remove references to obsolete rpc/RunJobs.php endpoint

https://gerrit.wikimedia.org/r/575392

Change 805775 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[operations/mediawiki-config@master] rpc: Remove unused RunJobs.php

https://gerrit.wikimedia.org/r/805775

Change 805775 merged by jenkins-bot:

[operations/mediawiki-config@master] rpc: Remove unused RunJobs.php

https://gerrit.wikimedia.org/r/805775

Mentioned in SAL (#wikimedia-operations) [2022-06-21T13:28:38Z] <daniel@deploy1002> Synchronized rpc/: Config: [[gerrit:805775|rpc: Remove unused RunJobs.php (T175146 T243096)]] (duration: 03m 45s)

Change 576301 abandoned by Hnowlan:

[operations/puppet@production] jobrunner: add simple HTTP check

Reason:

mw-jobrunner replaces this

https://gerrit.wikimedia.org/r/576301