
Cannot fetch Zero carriers/proxies JSON files from eqsin
Closed, ResolvedPublic

Description

Context: The cache nodes run a periodic script to fetch updated carriers/proxies JSON data from Zerowiki. The fetch is authenticated and works correctly at all of the existing datacenters, but authentication fails with Exception: API login phase2 gave result Failed, expected "Success" from the new eqsin (Singapore) cache nodes.

My best guess is that there's an IP-address whitelist configured somewhere (in a repo? on-wiki?) that allows this account to log in to Zerowiki for the fetching, and that the whitelist is missing the new private eqsin network ranges.

Event Timeline

BBlack triaged this task as High priority.Feb 23 2018, 4:36 PM
BBlack created this task.
Restricted Application added a subscriber: Aklapper.

Assuming it is a whitelist of the private networks containing prod caches, the new additions to the list for ipv6+ipv4 would be:

2001:df2:e500:101::/64
10.132.0.0/24
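
For reference, a quick way to sanity-check whether a given address falls inside those ranges is Python's ipaddress module (the sample addresses below are made up for illustration; this is not part of zerofetch.py):

import ipaddress

# New eqsin cache ranges listed above
eqsin_ranges = [
    ipaddress.ip_network("2001:df2:e500:101::/64"),
    ipaddress.ip_network("10.132.0.0/24"),
]

def in_ranges(addr, networks):
    """Return True if addr falls inside any of the given networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in networks)

print(in_ranges("10.132.0.10", eqsin_ranges))           # True (hypothetical eqsin cache IPv4)
print(in_ranges("2001:df2:e500:101::1", eqsin_ranges))  # True (hypothetical eqsin cache IPv6)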

I'll look more later (have to run off to an appt soon), but one thing I notice right off the bat is that zerofetch.py is using the deprecated action=login login flow, including relying on fields that were removed in 1.28.0-wmf.13 in mid-2016 (https://lists.wikimedia.org/pipermail/mediawiki-api-announce/2016-July/000114.html). I'm actually surprised this script didn't break everywhere long ago. Er, maybe it's possible the machines running the script in the other DCs just haven't had to reauthenticate in a very long time?

The error is coming from here: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/varnish/files/zerofetch.py;684dbccc37c6e06e8ac0f5c47586dfbb45ceb151$58

OTOH, giving credence to the IP whitelisting theory, there is a zero-script-ips user right present on ZeroWiki, but I can't find it documented anywhere, and can't find any code that checks for it. Maybe @dr0ptp4kt knows more about this.

> zerofetch.py is using the deprecated action=login login flow, including relying on fields that were removed in 1.28.0-wmf.13 in mid-2016 [...] maybe it's possible the machines running the script in the other DCs just haven't had to reauthenticate in a very long time?

They re-authenticate every time they run (~ every 15 minutes), and they're all still working. The script doesn't cache authentication details (e.g. session/token info) between runs. I looked into this a little bit, and while I do see there's a deprecation warning issued when doing this style of two-phase login, current MW core code does still support it. There's even a testcase at: https://github.com/wikimedia/mediawiki/blob/master/tests/phpunit/includes/api/ApiLoginTest.php#L70 which uses the same basic methodology as the zerofetch script.
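
Since that two-phase flow is central here, a minimal sketch of what it looks like with requests follows (an illustration of the legacy action=login protocol, not the script's actual code; the credentials are placeholders):

import requests

API = "https://zero.wikimedia.org/w/api.php"
CREDS = {"lgname": "zerofetcher", "lgpassword": "not-the-real-password"}  # placeholders

session = requests.Session()

# Phase 1: post the credentials; the API answers result=NeedToken plus a 'token' field.
phase1 = session.post(API, data=dict(CREDS, action="login", format="json")).json()["login"]
assert phase1["result"] == "NeedToken"

# Phase 2: repeat the post with lgtoken set; a working login answers result=Success.
phase2 = session.post(
    API, data=dict(CREDS, action="login", format="json", lgtoken=phase1["token"])
).json()["login"]
if phase2["result"] != "Success":
    raise Exception('API login phase2 gave result %s, expected "Success"' % phase2["result"])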

Change 413787 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[operations/puppet@production] Update zerofetch script to report login failure reason

https://gerrit.wikimedia.org/r/413787

> I looked into this a little bit, and while I do see there's a deprecation warning issued when doing this style of two-phase login, current MW core code does still support it. [...]

Interesting. I guess I should have said "reportedly removed." Submitted a patch to the zerofetch script just to get a little more info into the logs.
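
The actual change is in the Gerrit patch above; the general idea is simply to surface the optional "reason" field from the phase-2 response in the raised exception, roughly like this (a sketch, not the exact diff):

def check_phase2(login):
    """login is the parsed 'login' object from the phase-2 action=login response."""
    if login["result"] != "Success":
        reason = login.get("reason", "")
        raise Exception(
            'API login phase2 gave result %s with reason "%s", expected "Success"'
            % (login["result"], reason)
        )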

FWIW I seem to be able to log in to ZeroWiki from any IP with my user account, which has the same set of rights as the account I believe is in question here, and hit the zeroportal API and fetch this data.

Change 413787 merged by BBlack:
[operations/puppet@production] Update zerofetch script to report login failure reason

https://gerrit.wikimedia.org/r/413787

Merged your patch (thanks). New failure in eqsin is:

Exception: API login phase2 gave result Failed with reason "Incorrect username or password entered. Please try again.", expected "Success"

However, the username/password are identical to the ones used at the other sites (they come from a file with an identical checksum).
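
(Confirming that independently is just a matter of hashing the credentials file on an eqsin host and on a known-good host and comparing digests; a minimal sketch, with a hypothetical path:)

import hashlib

def sha256_of(path):
    """Hex SHA-256 digest of a file, for comparing the credentials file across hosts."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("/etc/zerofetcher/credentials"))  # hypothetical path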

I tested setting the HTTPS_PROXY environment variable for a manual script run from eqsin, so that the requests were proxied through a generic proxy server in eqiad, and the script then succeeded. That pretty much rules out every cause other than some form of network-address whitelisting/blacklisting on the zerowiki side.
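
A minimal sketch of that kind of test, assuming a reachable forward proxy in eqiad (the proxy URL and the query are placeholders; requests and urllib both honour HTTPS_PROXY from the environment):

import os
import requests

os.environ["HTTPS_PROXY"] = "http://proxy.eqiad.example:8080"  # placeholder proxy URL

# With the variable set, this request egresses from the eqiad proxy rather
# than from the eqsin host's own address.
r = requests.get("https://zero.wikimedia.org/w/api.php",
                 params={"action": "query", "meta": "userinfo", "format": "json"})
print(r.json())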

I've been chasing the various hints about grants/roles named like some variation of zero-script, zeroscript, zero-script-ips, zeroscriptips... looking through whatever special pages I can see on https://zero.wikimedia.org/wiki/Special:SpecialPages with my own account (which has some elevated privs there), and doing various code searches across the ZeroPortal extension, MW core, etc. I have yet to put the pieces together and figure out where such an IP address whitelist might be configured. I've also dug around in the places that generally define IP whitelists (puppet repos, mediawiki-config, etc.) and haven't found any case where eqsin seems to be missing...

It does seem possible that the wrongpassword reason text given above can be caused by things other than actual wrong passwords. One notable example is PreAuthentication hooks. Still, something seems missing...

+@Tgr for insights particularly around auth stuff.

Mentioned in SAL (#wikimedia-operations) [2018-02-24T01:13:10Z] <Reedy> added eqsin ipv6 range to botpasswords ip range restriction T188111

Reedy claimed this task.
Reedy subscribed.

@BBlack and I arrived at the same solution at about the same time.

The zerofetcher user had a BotPasswords entry whose IP range restriction didn't include eqsin's IPv6 range. That range has now been added:

mysql:wikiadmin@db1075 [zerowiki]> select bp_restrictions from bot_passwords where bp_app_id='zerofetcher';
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| bp_restrictions                                                                                                                                               |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"IPAddresses":["10.0.0.0/8","91.198.174.0/24","208.80.152.0/22","2620:0:860::/46","198.35.26.0/23","185.15.56.0/22","2a02:ec80::/32", "2001:df2:e500::/48"]} |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
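
For illustration, this is roughly how such a restriction behaves: the client address just has to fall inside one of the listed CIDR ranges. A sketch in Python (not MediaWiki's actual implementation):

import ipaddress
import json

# The bp_restrictions value shown above (after the eqsin /48 was added).
restrictions = json.loads(
    '{"IPAddresses":["10.0.0.0/8","91.198.174.0/24","208.80.152.0/22",'
    '"2620:0:860::/46","198.35.26.0/23","185.15.56.0/22",'
    '"2a02:ec80::/32","2001:df2:e500::/48"]}'
)

def ip_allowed(client_ip, restrictions):
    """Illustrative check: is client_ip inside any listed range?"""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(r) for r in restrictions["IPAddresses"])

print(ip_allowed("2001:df2:e500:101::1", restrictions))  # True: eqsin's /64 sits inside the new /48
print(ip_allowed("192.0.2.1", restrictions))             # False: outside all listed ranges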

> zerofetch.py is using the deprecated action=login login flow, including relying on fields that were removed in 1.28.0-wmf.13 in mid-2016 (https://lists.wikimedia.org/pipermail/mediawiki-api-announce/2016-July/000114.html).

As the email says,

> Note that the lgtoken *parameter* to action=login is not removed, nor is the 'token' response value included along with a NeedToken response.

It used to be that the user token (the secret value used in the long-lived variant of the session cookie; not related to CSRF tokens) was returned in an lgtoken field, to humor API clients which could not deal with Set-Cookie headers. This was indeed dropped when that email was sent.

> nor is the 'token' response value included along with a NeedToken response.

That's the part that threw me. A 'token' response value is still included with the NeedToken response.

Yeah, that was meant in the sense "nor is [the 'token' response value included along with a NeedToken response] removed", i.e. the bracketed phrase is the thing that is not being removed. I guess there are two opposite ways to parse that sentence.
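
To make the two readings concrete, this is roughly what the two login phases return today (values made up). The 'token' field in the NeedToken response and the lgtoken parameter sent back in phase 2 are still there; what was dropped in 2016 is the old user-token field that used to ride along with the phase-2 Success response for cookie-less clients:

# Phase 1 still answers with a token to send back as lgtoken:
phase1_response = {"login": {"result": "NeedToken", "token": "a1b2c3d4e5f60718"}}

# Phase 2 still answers Success, but without the old long-lived lgtoken/user-token field:
phase2_response = {"login": {"result": "Success", "lgusername": "zerofetcher"}}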