Page MenuHomePhabricator

SPIKE: Verify revision check jobs are logging errors and getting retried as desired [16HRS]
Closed, ResolvedPublicSpike

Description

References:

Now that we have T360698: Use jobs to fetch scores and then attempt reverts,
I played around with forcing errors by setting hardcoding the liftwing client to localhost in LiftWingClient.php#92 to check when retries happen and where errors get logged:

$url = 'http://127.0.0.1/service/lw/inference/v1/models/' . $this->model . ':predict';

The logger does not seem to work in the job runner. AutoModeratorFetchRevScoreJob.php#130

		$logger->info( 'Fetching scores for revisions checks.' );
		try {
			$score = $liftWingClient->get( $this->revId );
			$reverted = $revisionCheck->maybeRevert( $score );
		} catch ( RuntimeException $exception ) {
			$msg = $exception->getMessage();
			$logger->error( 'There was an error trying to obtain the score: ' . $msg );
			return false;
		}

does not produce useful log output:

$ docker compose logs --since 12h -tf mediawiki-jobrunner 2>&1 | grep AutoModerator
mediawiki-jobrunner-1  | 2024-04-26T15:56:28.576312592Z 2024-04-26 15:56:28 AutoModeratorFetchRevScoreJob Main_Page wikiPageId=6 revId=7017 originalRevId= user=172.18.0.1 tags=[] namespace=0 title=Main_Page requestId=bb3805661dea94dceef749ef (id=36,timestamp=20240426155627) STARTING
mediawiki-jobrunner-1  | 2024-04-26T15:56:28.720401492Z 2024-04-26 15:56:28 AutoModeratorFetchRevScoreJob Main_Page wikiPageId=6 revId=7017 originalRevId= user=172.18.0.1 tags=[] namespace=0 title=Main_Page requestId=bb3805661dea94dceef749ef (id=36,timestamp=20240426155627) t=143 error=

however setting

		try {
			$score = $liftWingClient->get( $this->revId );
			$reverted = $revisionCheck->maybeRevert( $score );
		} catch ( RuntimeException $exception ) {
			$this->setLastError( $exception->getMessage() );
			return false;
		}

resulted in useful jobrunner log output:

$ docker compose logs --since 12h -tf mediawiki-jobrunner 2>&1 | grep AutoModerator
mediawiki-jobrunner-1  | 2024-04-26T15:32:19.738118910Z 2024-04-26 15:32:19 AutoModeratorFetchRevScoreJob Main_Page wikiPageId=6 revId=7015 originalRevId= user=172.18.0.1 tags=[] namespace=0 title=Main_Page requestId=5b8f0bebda7732cc16f66290 (id=29,timestamp=20240426153218) STARTING
mediawiki-jobrunner-1  | 2024-04-26T15:32:20.075329412Z 2024-04-26 15:32:20 AutoModeratorFetchRevScoreJob Main_Page wikiPageId=6 revId=7015 originalRevId= user=172.18.0.1 tags=[] namespace=0 title=Main_Page requestId=5b8f0bebda7732cc16f66290 (id=29,timestamp=20240426153218) t=337 error=Failed to make LiftWing request to [http://127.0.0.1/service/lw/inference/v1/models/revertrisk-language-agnostic:predict], * Error fetching URL: Failed to connect to 127.0.0.1 port 80: Connection refused

as did dropping the try/catch altogether

$score = $liftWingClient->get( $this->revId );
$reverted = $revisionCheck->maybeRevert( $score );
$ docker compose logs --since 12h -tf mediawiki-jobrunner 2>&1 | grep AutoModerator
mediawiki-jobrunner-1  | 2024-04-26T15:36:08.724849372Z 2024-04-26 15:36:08 AutoModeratorFetchRevScoreJob Main_Page wikiPageId=6 revId=7016 originalRevId= user=172.18.0.1 tags=[] namespace=0 title=Main_Page requestId=74cad7b1632ac1b8171c96e0 (id=33,timestamp=20240426153607) t=143 error=RuntimeException: Failed to make LiftWing request to [http://127.0.0.1/service/lw/inference/v1/models/revertrisk-language-agnostic:predict], * Error fetching URL: Failed to connect to 127.0.0.1 port 80: Connection refused
mediawiki-jobrunner-1  | 2024-04-26T15:36:08.724921872Z * There was a problem during the HTTP request: 0 Error in /var/www/html/w/extensions/AutoModerator/src/LiftWingClient.php on line 127

Retries don't appear to be happening in my local environment, however they were earlier this week. I did do some table truncation I was simply checking the local sql queue via:

$ docker compose exec mediawiki bash -c "echo 'select job_id,job_attempts,job_title,substring(job_params,1,255) as job_params from job where job_cmd = \"AutoModeratorFetchRevScoreJob\" and date(job_timestamp) = curdate() order by job_id asc;' | php maintenance/run.php sql"
stdClass Object
(
    [job_id] => 29
    [job_attempts] => 1
    [job_title] => Main_Page
    [job_params] => a:8:{s:10:"wikiPageId";i:6;s:5:"revId";i:7015;s:13:"originalRevId";b:0;s:4:"user";O:19:"MediaWiki\User\User":21:{s:3:"mId";i:0;s:5:"mName";s:10:"172.18.0.1";s:8:"mActorId";i:9;s:9:"mRealName";s:0:"";s:6:"mEmail";s:0:"";s:8:"mTouched";s:1:"1";s:16:"*mQui
)
stdClass Object
(
    [job_id] => 33
    [job_attempts] => 1
    [job_title] => Main_Page
    [job_params] => a:8:{s:10:"wikiPageId";i:6;s:5:"revId";i:7016;s:13:"originalRevId";b:0;s:4:"user";O:19:"MediaWiki\User\User":21:{s:3:"mId";i:0;s:5:"mName";s:10:"172.18.0.1";s:8:"mActorId";i:9;s:9:"mRealName";s:0:"";s:6:"mEmail";s:0:"";s:8:"mTouched";s:1:"1";s:16:"*mQui
)
stdClass Object
(
    [job_id] => 36
    [job_attempts] => 1
    [job_title] => Main_Page
    [job_params] => a:8:{s:10:"wikiPageId";i:6;s:5:"revId";i:7017;s:13:"originalRevId";b:0;s:4:"user";O:19:"MediaWiki\User\User":21:{s:3:"mId";i:0;s:5:"mName";s:10:"172.18.0.1";s:8:"mActorId";i:9;s:9:"mRealName";s:0:"";s:6:"mEmail";s:0:"";s:8:"mTouched";s:1:"1";s:16:"*mQui
)

repushing jobs shows that there are no abandoned jobs

$ docker compose exec mediawiki bash -c "php maintenance/run.php manageJobs --action repush-abandoned --type AutoModeratorFetchRevScoreJob"
Last re-push time: 20240426163742; current time: 20240426164303
Re-pushed 0 job(s) [0 skipped].

I'll attach a patch with what I believe to be the correct return / error logging behavior

Event Timeline

Change #1024775 had a related patch set uploaded (by Jsn.sherman; author: Jsn.sherman):

[mediawiki/extensions/AutoModerator@master] Update error handling in FetchRevScore job

https://gerrit.wikimedia.org/r/1024775

In a branch based on the master branch (where I have made no changes to the job queue code), I did have abandoned jobs that I was able to re-push:

$ docker compose exec mediawiki bash -c "php maintenance/run.php manageJobs --action repush-abandoned --type AutoModeratorFetchRevScoreJob"
WARN[0000] /Users/susanacardenasmolinar/Documents/Wikimedia/code/docker-mediawiki/mediawiki/docker-compose.yml: `version` is obsolete 
Last re-push time: 19700101000001; current time: 20240426184234
Re-pushed 3 job(s) [0 skipped].

I also have jobs with multiple attempts:

$ docker compose exec mediawiki bash -c "echo 'select job_id,job_attempts,job_title,substring(job_params,1,255) as job_params from job where job_cmd = \"AutoModeratorFetchRevScoreJob\" and date(job_timestamp) = curdate() order by job_id asc;' | php maintenance/run.php sql"
WARN[0000] /Users/susanacardenasmolinar/Documents/Wikimedia/code/docker-mediawiki/mediawiki/docker-compose.yml: `version` is obsolete 
stdClass Object
(
    [job_id] => 1264
    [job_attempts] => 3
    [job_title] => Florida_Man
    [job_params] => a:8:{s:10:"wikiPageId";i:27;s:5:"revId";i:296;s:13:"originalRevId";b:0;s:4:"user";O:19:"MediaWiki\User\User":21:{s:3:"mId";i:0;s:5:"mName";s:12:"192.168.65.1";s:8:"mActorId";i:9;s:9:"mRealName";s:0:"";s:6:"mEmail";s:0:"";s:8:"mTouched";s:1:"1";s:16:"*mQ
)
stdClass Object
(
    [job_id] => 1269
    [job_attempts] => 3
    [job_title] => Florida_Man
    [job_params] => a:8:{s:10:"wikiPageId";i:27;s:5:"revId";i:297;s:13:"originalRevId";b:0;s:4:"user";O:19:"MediaWiki\User\User":21:{s:3:"mId";i:3;s:5:"mName";s:9:"User24601";s:8:"mActorId";i:4;s:9:"mRealName";s:0:"";s:6:"mEmail";s:0:"";s:8:"mTouched";s:14:"20240306021025";s
)
stdClass Object
(
    [job_id] => 1281
    [job_attempts] => 3
    [job_title] => Florida_Man
    [job_params] => a:8:{s:10:"wikiPageId";i:27;s:5:"revId";i:299;s:13:"originalRevId";b:0;s:4:"user";O:19:"MediaWiki\User\User":21:{s:3:"mId";i:3;s:5:"mName";s:9:"User24601";s:8:"mActorId";i:4;s:9:"mRealName";s:0:"";s:6:"mEmail";s:0:"";s:8:"mTouched";s:14:"20240306021025";s
)
stdClass Object
(
    [job_id] => 1389
    [job_attempts] => 1
    [job_title] => Florida_Man
    [job_params] => a:8:{s:10:"wikiPageId";i:27;s:5:"revId";i:296;s:13:"originalRevId";b:0;s:4:"user";O:19:"MediaWiki\User\User":21:{s:3:"mId";i:0;s:5:"mName";s:12:"192.168.65.1";s:8:"mActorId";i:9;s:9:"mRealName";s:0:"";s:6:"mEmail";s:0:"";s:8:"mTouched";s:1:"1";s:16:"*mQ
)
stdClass Object
(
    [job_id] => 1390
    [job_attempts] => 1
    [job_title] => Florida_Man
    [job_params] => a:8:{s:10:"wikiPageId";i:27;s:5:"revId";i:297;s:13:"originalRevId";b:0;s:4:"user";O:19:"MediaWiki\User\User":21:{s:3:"mId";i:3;s:5:"mName";s:9:"User24601";s:8:"mActorId";i:4;s:9:"mRealName";s:0:"";s:6:"mEmail";s:0:"";s:8:"mTouched";s:14:"20240306021025";s
)
stdClass Object
(
    [job_id] => 1391
    [job_attempts] => 1
    [job_title] => Florida_Man
    [job_params] => a:8:{s:10:"wikiPageId";i:27;s:5:"revId";i:299;s:13:"originalRevId";b:0;s:4:"user";O:19:"MediaWiki\User\User":21:{s:3:"mId";i:3;s:5:"mName";s:9:"User24601";s:8:"mActorId";i:4;s:9:"mRealName";s:0:"";s:6:"mEmail";s:0:"";s:8:"mTouched";s:14:"20240306021025";s
)

Change #1024775 merged by jenkins-bot:

[mediawiki/extensions/AutoModerator@master] Update error handling in FetchRevScore job

https://gerrit.wikimedia.org/r/1024775

We have a merged patch here - is this task resolved, or is there outstanding work?

In a branch based on the master branch (where I have made no changes to the job queue code), I did have abandoned jobs that I was able to re-push:

$ docker compose exec mediawiki bash -c "php maintenance/run.php manageJobs --action repush-abandoned --type AutoModeratorFetchRevScoreJob"
WARN[0000] /Users/susanacardenasmolinar/Documents/Wikimedia/code/docker-mediawiki/mediawiki/docker-compose.yml: `version` is obsolete 
Last re-push time: 19700101000001; current time: 20240426184234
Re-pushed 3 job(s) [0 skipped].

I also have jobs with multiple attempts:

$ docker compose exec mediawiki bash -c "echo 'select job_id,job_attempts,job_title,substring(job_params,1,255) as job_params from job where job_cmd = \"AutoModeratorFetchRevScoreJob\" and date(job_timestamp) = curdate() order by job_id asc;' | php maintenance/run.php sql"
WARN[0000] /Users/susanacardenasmolinar/Documents/Wikimedia/code/docker-mediawiki/mediawiki/docker-compose.yml: `version` is obsolete 
stdClass Object
(
    [job_id] => 1264
    [job_attempts] => 3
    [job_title] => Florida_Man
    [job_params] => a:8:{s:10:"wikiPageId";i:27;s:5:"revId";i:296;s:13:"originalRevId";b:0;s:4:"user";O:19:"MediaWiki\User\User":21:{s:3:"mId";i:0;s:5:"mName";s:12:"192.168.65.1";s:8:"mActorId";i:9;s:9:"mRealName";s:0:"";s:6:"mEmail";s:0:"";s:8:"mTouched";s:1:"1";s:16:"*mQ
)
stdClass Object
(
    [job_id] => 1269
    [job_attempts] => 3
    [job_title] => Florida_Man
    [job_params] => a:8:{s:10:"wikiPageId";i:27;s:5:"revId";i:297;s:13:"originalRevId";b:0;s:4:"user";O:19:"MediaWiki\User\User":21:{s:3:"mId";i:3;s:5:"mName";s:9:"User24601";s:8:"mActorId";i:4;s:9:"mRealName";s:0:"";s:6:"mEmail";s:0:"";s:8:"mTouched";s:14:"20240306021025";s
)
stdClass Object
(
    [job_id] => 1281
    [job_attempts] => 3
    [job_title] => Florida_Man
    [job_params] => a:8:{s:10:"wikiPageId";i:27;s:5:"revId";i:299;s:13:"originalRevId";b:0;s:4:"user";O:19:"MediaWiki\User\User":21:{s:3:"mId";i:3;s:5:"mName";s:9:"User24601";s:8:"mActorId";i:4;s:9:"mRealName";s:0:"";s:6:"mEmail";s:0:"";s:8:"mTouched";s:14:"20240306021025";s
)
stdClass Object
(
    [job_id] => 1389
    [job_attempts] => 1
    [job_title] => Florida_Man
    [job_params] => a:8:{s:10:"wikiPageId";i:27;s:5:"revId";i:296;s:13:"originalRevId";b:0;s:4:"user";O:19:"MediaWiki\User\User":21:{s:3:"mId";i:0;s:5:"mName";s:12:"192.168.65.1";s:8:"mActorId";i:9;s:9:"mRealName";s:0:"";s:6:"mEmail";s:0:"";s:8:"mTouched";s:1:"1";s:16:"*mQ
)
stdClass Object
(
    [job_id] => 1390
    [job_attempts] => 1
    [job_title] => Florida_Man
    [job_params] => a:8:{s:10:"wikiPageId";i:27;s:5:"revId";i:297;s:13:"originalRevId";b:0;s:4:"user";O:19:"MediaWiki\User\User":21:{s:3:"mId";i:3;s:5:"mName";s:9:"User24601";s:8:"mActorId";i:4;s:9:"mRealName";s:0:"";s:6:"mEmail";s:0:"";s:8:"mTouched";s:14:"20240306021025";s
)
stdClass Object
(
    [job_id] => 1391
    [job_attempts] => 1
    [job_title] => Florida_Man
    [job_params] => a:8:{s:10:"wikiPageId";i:27;s:5:"revId";i:299;s:13:"originalRevId";b:0;s:4:"user";O:19:"MediaWiki\User\User":21:{s:3:"mId";i:3;s:5:"mName";s:9:"User24601";s:8:"mActorId";i:4;s:9:"mRealName";s:0:"";s:6:"mEmail";s:0:"";s:8:"mTouched";s:14:"20240306021025";s
)

I notice that the only jobs with 3 attempts have the lowest job_ids. Could you add the job timestamp to the query to verify that you are getting retries on jobs created from the latest commit?

jsn.sherman renamed this task from Verify revision check jobs are logging errors and getting retried as desired to SPIKE: Verify revision check jobs are logging errors and getting retried as desired [16HRS].May 1 2024, 1:33 PM
jsn.sherman changed the task status from Open to In Progress.
jsn.sherman claimed this task.
jsn.sherman added a project: Spike.
Restricted Application changed the subtype of this task from "Task" to "Spike". · View Herald TranscriptMay 1 2024, 1:33 PM
jsn.sherman moved this task from Ready to Done on the Moderator-Tools-Team (Kanban) board.