
GWToolset needs cancel functionality
Closed, Declined · Public

Description

See http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20150601.txt and https://lists.wikimedia.org/pipermail/glamtools/2015-May/000434.html

The other day, GWToolset overloaded Charles's server. There is no way to cancel a running import, short of people in operations manually deleting the uploaded metadata XML file, which is kind of crazy. We need some sort of easy cancel button, or at least something more sane.

Event Timeline

Bawolff raised the priority of this task from to Needs Triage.
Bawolff updated the task description.
Bawolff subscribed.
Restricted Application added a subscriber: Aklapper.

[04:03] <apergos> ok, that should have cleaned it up
[04:04] <apergos> there should be a script that, given the xml file name and the user account, can 1) find it, 2) find the redis job queue entrie(s), 3) delete the swift copy of the xml file, 4) delete all related baggage from redis
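The four steps apergos lists could be sketched roughly as below. This is only an illustration under loud assumptions: a plain dict stands in for the Swift file store, a plain list stands in for the Redis job queue, and `cancel_import` and the job fields `user` and `xml_name` are hypothetical names, not anything GWToolset or the job queue actually exposes.

```python
def cancel_import(file_store, job_queue, user, xml_name):
    """Drop one user's metadata file and every queued job that refers to it.

    file_store: dict mapping (user, xml_name) -> contents (stand-in for Swift).
    job_queue:  list of job dicts with "user" and "xml_name" keys
                (stand-in for the Redis job queue).
    Returns (file_was_found, number_of_jobs_removed).
    """
    def belongs(job):
        return job["user"] == user and job["xml_name"] == xml_name

    key = (user, xml_name)
    # 1) find the metadata file
    file_found = key in file_store
    # 2) find the job queue entries that refer to it
    matching = [job for job in job_queue if belongs(job)]
    # 3) delete the stored copy of the XML file
    file_store.pop(key, None)
    # 4) delete all related entries from the queue (in place)
    job_queue[:] = [job for job in job_queue if not belongs(job)]
    return file_found, len(matching)


if __name__ == "__main__":
    store = {("ExampleUser", "batch.xml"): "<records/>"}
    queue = [
        {"user": "ExampleUser", "xml_name": "batch.xml", "type": "metadata"},
        {"user": "OtherUser", "xml_name": "other.xml", "type": "upload"},
    ]
    print(cancel_import(store, queue, "ExampleUser", "batch.xml"))
    print(len(queue))
```

A one-shot purge like this only helps if nothing re-enqueues work for the same file while it runs.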

The import seems to be ongoing*, so what was done was not enough https://logstash.wikimedia.org/#/dashboard/temp/ATwrcvoFRCm7q-0IGwXU8A

(*And now finally it has stopped)

> The import seems to be ongoing, so what was done was not enough https://logstash.wikimedia.org/#/dashboard/temp/ATwrcvoFRCm7q-0IGwXU8A

I don't have access to Logstash, but looking at the publicly available log, it appears that the part that creates new metadata jobs has stopped (as of 10:30-ish, which matches when apergos killed things), and it's finishing out the existing upload jobs. So I suspect it will stop by itself in an hour or two.

An interesting point is that the uploads are now successful, at least since 3:17 UTC.

The metadata job automatically recreates itself when it's done, so deleting it from redis is not so easy. Also, a high-level extension directly interacting with a specific jobqueue implementation is poor architecture. So either the job queue should get the capability to cancel jobs based on title (the only identifying information the job queue is aware of, I think) and GWToolset should be changed to use deterministic titles, or there should be some sort of kill switch (such as a specific memcache key) that all metadata jobs check before executing.

@aaron, what do you think? Do you see value in having some generic job kill functionality?

I'd stick with the GWT-specific solution of some key to check (perhaps using memcached).

Would be nice if it could also notify the owner that it got killed - that's T69144.
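For illustration, the kill-switch key discussed above might look something like the following. It is a minimal sketch, not GWToolset's actual code: `Cache` is an in-memory stand-in for memcached, and `kill_key`, `MetadataJob`, and the key format are all hypothetical. The point it shows is that the check runs before any work is done, so a cancelled job also never recreates itself.

```python
class Cache:
    """In-memory stand-in for a shared cache such as memcached (assumption)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def kill_key(user, xml_name):
    # Hypothetical key format; any deterministic per-import key would do.
    return f"gwtoolset:cancel:{user}:{xml_name}"


class MetadataJob:
    """Processes one batch of records, then enqueues a job for the next batch."""

    def __init__(self, cache, queue, user, xml_name, total, offset=0, batch_size=2):
        self.cache = cache
        self.queue = queue
        self.user = user
        self.xml_name = xml_name
        self.total = total
        self.offset = offset
        self.batch_size = batch_size
        self.processed = []

    def run(self):
        # Check the kill switch first: a cancelled job does no work and,
        # crucially, does not enqueue its successor.
        if self.cache.get(kill_key(self.user, self.xml_name)):
            return False
        end = min(self.offset + self.batch_size, self.total)
        self.processed = list(range(self.offset, end))  # placeholder for real work
        if end < self.total:
            self.queue.append(MetadataJob(
                self.cache, self.queue, self.user, self.xml_name,
                self.total, end, self.batch_size))
        return True


if __name__ == "__main__":
    cache, queue = Cache(), []
    MetadataJob(cache, queue, "ExampleUser", "batch.xml", total=6).run()
    cache.set(kill_key("ExampleUser", "batch.xml"), True)  # someone cancels
    print(queue.pop(0).run(), len(queue))
```

Because the key is derived from the user and file name, whoever sets it does not need to know anything about the job queue backend.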

GWToolset falls under the Reading team now? (I'm not complaining; I'm happy it falls under any team. It just struck me as odd that a feature solely involving content creation, without consuming the content, falls under Reading. But again, I'm not complaining.)

One of the principles of the reorganization was that responsibilities are kept until they can be handed over to someone. The backend engineers from the old multimedia team are in Reading, and the new multimedia team has a frontend focus, so until someone else steps up, putting out fires (or in this case, preventing them) will fall to Reading Infrastructure. Apart from that, and code review, the Foundation does not support GWT as far as I know.

Tgr removed Tgr as the assignee of this task. May 2 2017, 5:08 PM

I never got around to doing this, but the problem didn't recur either, so it's probably not that important.