Page MenuHomePhabricator

API errors in relation to UTRS
Closed, ResolvedPublic

Description

UTRS was affected by two different errors from December 12 to December 20th
assertbotfailed: 2022-12-13 12:15:05 to 2022-12-20 16:30:02
Invalid CSRF: 2022-12-12 10:23:48 to 2022-12-20 15:49:49

[2022-12-20 16:30:02] production.ERROR: Code: assertbotfailed
Message: You do not have the "bot" right, so the action could not be completed.
Result: {"error":{"code":"assertbotfailed","info":"You do not have the \"bot\" right, so the action could not be completed.","*":"See https:\/\/en.wikipedia.org\/w\/api.php for API usage. Subscribe t>

Message: You do not have the \"bot\" right, so the action could not be completed.
Result: {\"error\":{\"code\":\"assertbotfailed\",\"info\":\"You do not have the \\\"bot\\\" right, so the action could not be completed.\",\"*\":\"See https:\\/\\/en.wikipedia.org\\/w\\/api.php for A>

and

[2022-12-20 15:49:49] production.ERROR: Failed to send verification email for appeal #67039 {"exception":"[object] (RuntimeException(code: 0): Failed to send verification email for appeal #67039 at />
[previous exception] [object] (Mediawiki\\Api\\UsageException(code: 0): Code: badtoken
Message: Invalid CSRF token.
Result: {\"error\":{\"code\":\"badtoken\",\"info\":\"Invalid CSRF token.\",\"*\":\"See https:\\/\\/en.wikipedia.org\\/w\\/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list a>

Both issues have resolved themselves at this point without intervention of any sort on my end.
Logging for the record just in case it happens again or someone else has the same issue or knows what is going on.

Event Timeline

Started around when the incident that caused T325056: Can't log in or out or save edits - Invalid CSRF token did but don't know why it persisted after that was resolved

So this, unfortunately, has started again for the 2nd part (CSRF token).
First occurrence on: [2022-12-27 18:23:30]
and is occurring regularly again

So I did a deeper diver now that I have access to the error logs. The Email errors have been showing up at least since October, that's as far as the logs I have back go.
The AssertBot was only for the time period mentioned, so ignoring that until it returns.

So I looked into the database and compared the appeal number in the logs with all database lines in appeals. The only thing they have in common is that the verification key was on file, but all that means is the email in the error wasn't sent.
These errors occurred for both meta and enwiki, so it's not site based or cookie based,

I compiled a CLI script using the exact same code that is triggering an error in this commit:
https://github.com/UTRS2/utrs/commit/8c3b6776eacaea8327d8ac9f0774f351b644913e

I ran though several possibilities, but my testing was limited by who has an open appeal and a valid email address on file for the wiki as it does send an email to the user. No errors showed up through the CLI script.
This leads me to believe the issue is intermittent.
So I dug through the addwiki/mediawiki-api-base package that we use to send these emails of. We use version 2.8.0. There was an update to 3.0.0 that included the following change:
https://github.com/addwiki/mediawiki-api-base/commit/44fe415d22ac71f61b7c03f4ce73774ff3264bf4

I suspect that might be the cause of the bug. I am updating the software we use to take 3.0.0 for this package.
Tagging @Addshore as they may be able to shed light on the problem and confirm the solution.

@TheresNoTime @Urbanecm_WMF
Can I get a copy of what you can from logstash for 2023-01-17 07:49:36 on meta to try and figure this issue out. I'm continually getting Bad CSRF token, but can't diagnose it because I can't reproduce it anywhere other than the automated job that runs this.

Urbanecm edited subscribers, added: Urbanecm; removed: Urbanecm_WMF.EditedJan 18 2023, 6:23 AM

Can I get a copy of what you can from logstash for 2023-01-17 07:49:36 on meta to try and figure this issue out. I'm continually getting Bad CSRF token, but can't diagnose it because I can't reproduce it anywhere other than the automated job that runs this.

I don’t think there’d be useful information in Logstash at this point (for normal requests, it only includes errors]. Would it be possible for you/UTRS to send a couple of requests with the debug logging enabled (see docs; the header should be X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet; log). I know you said you can’t reproduce this, but if a few jobs can run with this header added, it would help.

The header enables verbose logging on Wikimedia’s end and it will provide Logstash with a lot more information that can be used to help you figure out what is happening.

Please only send a few requests with the debug logs enabled (don’t leave it enabled permanently) — it is very verbose, and just a few example failing requests should be enough to help with debugging this.

I should be able to modify the vendor files for addwiki's API without affecting the end user to enable the debug header. I will also see what I can insert in terms of variable debugging locally. This only happens a maximum of 3 times per day, if that, as it is intermittent - so it will only spam if I leave it for a few days. I will try to work on this in the next few days and get back this ticket when it's on, and then again after 3-6 failures.

Mentioned in SAL (#wikimedia-cloud) [2023-01-20T02:58:48Z] <AmandaNP> add WM Header debug manually into vendor files on server T325692

Should this need to be reverted in an emergency, please feel free to have someone go on utrs-production and navigate to /var/www/html/vendor/addwiki/mediawiki-api-base/src/Guzzle/ClientFactory.php and remove the header from line 53.

Otherwise @Urbanecm this has been implemented for the time being.

AmandaNP triaged this task as Low priority.

I have filled https://github.com/UTRS2/utrs/pull/609 which seems to be the sum of the issue of async vs sync
I'm going to pull the debug config especially since it didn't even seem to work.
I'll see if any more issues of this occur in the coming days.

It's been a few days, and I've only gotten 1 error from an LTA related case...but I've found a way to debug more, so I will follow up again in a few days.