
[Timebox: 18hrs] Frequent 502 responses when submitting edits
Closed, Resolved · Public · BUG REPORT

Description

When submitting edits (manual edits or API calls using WBI), I frequently get a 502 response. When retrying, the edit normally gets through, although I have had to retry manual edits up to six times. WBI, when receiving a 502, sleeps for 60 seconds; after that, the edit always gets through. I am editing https://wikibase.inguma.eus and https://lexbib.elex.is, two of the wikibases migrated to wikibase.cloud in Batch A.

When I have two processes running on the same wikibase at the same time, the 502 seems to happen much more often, although I also get it quite frequently with a single bot running. When a (WBI-based) bot has been writing data for a while, parallel (manual) edits will get a 502.

I have also had WBI item write actions hang until I made a parallel edit; then the hanging process woke up again. I use WBI version 0.12.

When using QuickStatements, I have not experienced this problem.

Event Timeline


I was able to replicate this locally; I'm looking further into it.

Sharing my findings so far:

I am able to consistently reproduce this error if I run two instances of this example script in parallel: https://phabricator.wikimedia.org/P30583

So far I haven't been able to figure out more details, other than that it seems to happen regardless of whether an entity is being created or updated.

None of the mediawiki pods seem to log a relevant error message when this occurs, which confuses me.

The only error I can see is the following from the nginx logs. I tried increasing some nginx timeouts, but that didn't make any difference (it also doesn't look like anything is timing out at that point - there is only a 1s delta between the request and the error).

172.17.0.24 - - [28/Jun/2022:14:05:24 +0000] "POST /w/api.php HTTP/1.1" 200 175 "-" "DEBUG-T309070"
2022/06/28 14:05:25 [error] 31#0: *1729 upstream prematurely closed connection while reading response header from upstream, client: 172.17.0.24, server: _, request: "POST /w/api.php HTTP/1.1", upstream: "http://10.96.199.199:80/w/api.php", host: "debugdebugdebug.wbaas.localhost"

Disabling our custom hooks and/or (redis) cache didn't affect this problem for me:
https://github.com/wbstack/mediawiki/blob/main/dist/wbstack/src/Settings/Hooks.php
https://github.com/wbstack/mediawiki/blob/main/dist/wbstack/src/Settings/ProductionCache.php

Also, just when this error occurs, this warning can be seen in the mariadb logs: https://phabricator.wikimedia.org/T310597#8036294
(which may just be an indication of the connection lost from the MW API)

So I had a look at this yesterday using the excellent python script @Deniz_WMDE made (https://phabricator.wikimedia.org/P30583). I'd almost propose we create a repo for snippets like this for easy reuse against the platform; IMO we really need to start dog-fooding the thing we've built.

I was able to re-create the 502s locally. Around the time these things happen I was also seeing errors similar to T310121: Production warning: [RedisBagOStuff] Rejected set() for X due to snapshot lag (late regeneration). and T310597: Production warning: Aborted connection 38950 to db: 'mwdb_wbstack_X' user: 'mwu_X' host: '10.108.4.7' (Got an error reading communication packets). Seeing that these issues are still present on the platform, there is a chance these things are all connected; we are just seeing different errors pop up in different pods and still have no clear idea as to what is causing them.

From a previous look at profiling item saves (T299522: [INVESTIGATION 16hrs] Speed improvements for saving statements and creating new items) we noticed that the StopForumSpam extension causes a significant slow-down when saving, around 11 seconds when using the Special:NewItem page. This slowness could trip timeouts.

I used this as a starting point to investigate further and disabled both SpamBlacklist (which pulls external lists) and StopForumSpam (which also pulls a big external list and then runs many loops to calculate whether an IP is bad). This made the 502 error go away, and I was able to run the python script much faster, creating a bit more than one item per second. Re-enabling SpamBlacklist caused a minor slow-down again, but I wasn't seeing any 502s.

Since spam is a problem, we probably want this extension to run, so in an attempt to re-enable it I tried the following, without any success in stopping the 502s from occurring.

Use a local IP list: https://github.com/wbstack/mediawiki/pull/265/

As described in the extension's docs (https://www.mediawiki.org/wiki/Extension:StopForumSpam), it is possible to download one of these lists, put it in the image, and point the extension at it using the $wgSFSIPListLocation variable.
This helped with the slowness but did not resolve the 502 errors occurring after a while. I started off with the original contents of https://www.stopforumspam.com/downloads/listed_ip_30.zip, but as the 502s still occurred I trimmed the list down to just one entry, and the problem still persisted.
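
The mechanism is just a pointer in LocalSettings.php; a minimal sketch of what the PR above does (the exact file path is illustrative, the real one is whatever the Dockerfile bakes into the image):

// Sketch: read the StopForumSpam deny list from a file shipped in the image
// instead of downloading it from stopforumspam.com on demand.
$wgSFSIPListLocation = '/var/www/html/w/listed_ip_30.txt'; // illustrative path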

Increase php memory_limit

# Tweak other PHP.ini settings
RUN { \
		echo 'memory_limit = 512M'; \
	} > /usr/local/etc/php/conf.d/tweaks.ini

But this didn't help.

Increase maxmemory / resources of redis

I gave both the redis primary and replica loads of memory (R 500 / L 900) and increased maxmemory as suggested in T299522#7973128, but still got the 502s.

Increase resources on the mediawiki-webapi pod

webapi:
  requests:
    cpu: 500m
    memory: 350Mi
  limits:
    cpu: 1000m
    memory: 750Mi

But this didn't resolve the 502s.

None of these things seems to have helped, and I'm really no closer to understanding why the upstream (mediawiki) would close the connection. In the previously linked PR (https://github.com/wbstack/mediawiki/pull/265/) I've also added debug logs to see what the StopForumSpam extension is actually doing. See https://phabricator.wikimedia.org/P30707 for traces of a single API edit, where we apparently check whether the IP is blocked 4 times. The interesting part is that the failure always seems to happen when we enter the code run by https://github.com/wikimedia/ipset; however, there is no log output inside mediawiki to indicate what happened, it just stops. This makes me think it's either a configuration error, maybe related to the configuration changes in T309687: Registering on wikibase.cloud exposes internal IP rather than the registering user, or something in the php/apache configuration of the dockerfile.

	public function isIpDenyListed( $ip ) {
		if ( IPUtils::isIPAddress( $ip ) === null ) {
			return false;
		}

		return $this->getIpDenyListSet()->match( $ip );
	}

This commit, https://github.com/wikimedia/mediawiki-extensions-StopForumSpam/commit/ee89a5063c6c87a61b41c1db7295ea3cd9d92111, which we currently don't have deployed, looks to me like it would let anyone with sfsblock-bypass on their user circumvent this entire check, so they wouldn't hit this problem and would get much faster edit speeds.
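
Roughly, the shape of that change (simplified here for illustration; this is not the actual diff) is an early exit before the deny-list set is ever built or matched:

	// Illustration only: a user holding the sfsblock-bypass right never
	// reaches the expensive IPSet construction / match() call at all.
	if ( $user->isAllowed( 'sfsblock-bypass' ) ) {
		return false; // treated as not deny-listed
	}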

conny-kawohl_WMDE renamed this task from Frequent 502 responses when submitting edits to [Timebox: 18hrs] Frequent 502 responses when submitting edits. Jul 7 2022, 9:20 AM

I was also able to reproduce the 502s locally with just one copy of this updater.py script. It took around 5 minutes to see the failure. Nothing obviously appearing at all in the MW logs.

My suspicion is that we might be hitting the internal 30s timeout for PHP (https://www.php.net/manual/en/function.set-time-limit.php), and this is therefore killing things before a) we see a log and b) the connection to the DB is properly closed, which would explain the errors I do see in those logs, e.g.:
2022-07-18 9:51:57 875 [Warning] Aborted connection 875 to db: 'mwdb_de54a27940' user: 'mwu_ba21cf5817' host: '172.17.0.1' (Got an error reading communication packets)

Unassigned but will pick up again this afternoon. Time box remaining ~16.5 hrs

My suspicion is that we might be hitting the internal 30s timeout for php

Good thinking. I looked at this and tried bumping it to 240s, but I still saw the error within 1 minute. I also tried 10s just to force the error, and then you can see it in the output of the server response: Fatal error: Maximum execution time of 10 seconds exceeded

This is how I tried it out (mediawiki Dockerfile); the max_execution_time line is the one I added:

# Tweak other PHP.ini settings
RUN { \
		echo 'memory_limit = 256M'; \
		echo 'max_execution_time = 240'; \
	} > /usr/local/etc/php/conf.d/tweaks.ini

Time box remaining ~15 hrs

Attempted to adjust the HTTP timeouts by setting

$wgAsyncHTTPTimeout = 240;
$wgHTTPTimeout = 240;

in LocalSettings.php.

This then resulted in

{"error":{"code":"internal_api_error_DBQueryError","info":"[d43eb32f451c41bd203e100f] Exception caught: A database query error has occurred. This may indicate a bug in the software.","errorclass":"Wikimedia\\Rdbms\\DBQueryError","*":"Wikimedia\\Rdbms\\DBQueryError at /v
ar/www/html/w/includes/libs/rdbms/database/Database.php(1809)\nfrom /var/www/htm

coming from update.py, and I observed that the secondary sql pod had been killed due to a failed liveness check.

I could still repeat this today very clearly. After removing the readiness/liveness checks on the SQL pod I could happily run one update.py script for ages without it failing. I could also run a second update.py concurrently; that one would run for a while and then hit not a 502 but a 504 (seemingly hitting the 60s timeout in the platform nginx?).

Given that @toan noticed these problems seem to disappear when the SFS extension is unloaded, we looked at the docs: https://www.mediawiki.org/wiki/Extension:StopForumSpam. They make it clear that object caching must be enabled, and that seems to be something we don't have for web requests (redis is used for caching in maintenance scripts).
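
For illustration, giving web requests a main object cache would be something along these lines in LocalSettings.php (the cache backend and server address here are assumptions, not our actual production config):

// Sketch: give web requests a main object cache so SFS can cache its deny list.
$wgObjectCaches['redis'] = [
	'class' => 'RedisBagOStuff',
	'servers' => [ 'redis-master:6379' ], // placeholder address
];
$wgMainCacheType = 'redis';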

Timebox left: ~9 hrs, minus some time from Deniz

FWIW I think I managed to capture profile information while creating an item; maybe we can use this to analyze what is going on within the SFS extension: https://drive.google.com/file/d/1j_rtfSSsmcUMDriUkGLf_AQA-S61xxEE/view?usp=sharing


Another one, for which I tried loading the IP list from disk and enabling caching:

$wgSFSIPListLocation = '/var/www/html/w/listed_ip_30_ipv46.txt';
$wgMainCacheType = CACHE_DB;

https://drive.google.com/file/d/1aZmRSLcqUFHmg_d_5jP9VOD1G_JZWaZY/view?usp=sharing


And another one with SFS disabled: https://drive.google.com/file/d/1mLWiV_9DP3lWZfuzp6WULp_sVMbe0ojA/view?usp=sharing


edit: updated the links to the right request files

I spent some time yesterday capturing more profile information and looked into it a bit.

timebox left: ~3h

Timebox now empty; some last thoughts:

Stepping through the code and looking at the profiles we took, I noticed the following:

  • A large amount of time seems to be spent in this addCIDR method
  • This is used to build the IPSet object
  • The IPSet object doesn't actually appear to be cached by the DenyListManager
  • What is cached by the DenyListManager is a hex-encoded list of IPs
  • This is only cached if a cache is configured

I managed to get this hex list cached by regenerating the IP list manually using the maintenance script. However, I was unable to do this without substantially bumping the CPU limit on my machine (I moved it up to 15). With a lower CPU limit I saw [warning] [SQLBagOStuff] Lowered set() TTL for global:sfs-denylist-set due to snapshot lag (high regeneration time). When this warning is thrown, the TTL of the cached list is set to only one second.
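
Putting these observations together, the gap is that even when the hex list is cached, the IPSet built from it is reconstructed (addCIDR and all) on every request. A rough sketch of the kind of per-request memoization that would avoid this; the class and helper names are illustrative, not the actual extension code or the later patch:

use Wikimedia\IPSet;

class DenyListManagerSketch {
	/** @var IPSet|null built at most once per request */
	private $ipSet = null;

	public function getIpDenyListSet(): IPSet {
		if ( $this->ipSet === null ) {
			// The expensive step: constructing the IPSet walks every CIDR entry
			// (this is where the addCIDR time in the profiles goes).
			$this->ipSet = new IPSet( $this->getIpDenyList() );
		}
		return $this->ipSet;
	}

	/** Stand-in for "fetch the (possibly cached) deny list of IPs/CIDRs". */
	private function getIpDenyList(): array {
		return [ '192.0.2.1', '198.51.100.0/24' ]; // placeholder entries
	}
}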

Tarrow claimed this task.

Change 824415 had a related patch set uploaded (by Addshore; author: Addshore):

[mediawiki/extensions/StopForumSpam@master] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824415

Change 824415 merged by jenkins-bot:

[mediawiki/extensions/StopForumSpam@master] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824415

Change 824184 had a related patch set uploaded (by Reedy; author: Addshore):

[mediawiki/extensions/StopForumSpam@REL1_38] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824184

Change 824185 had a related patch set uploaded (by Reedy; author: Addshore):

[mediawiki/extensions/StopForumSpam@REL1_37] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824185

Change 824426 had a related patch set uploaded (by Reedy; author: Addshore):

[mediawiki/extensions/StopForumSpam@wmf/1.39.0-wmf.25] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824426

Change 824185 merged by jenkins-bot:

[mediawiki/extensions/StopForumSpam@REL1_37] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824185

Change 824426 merged by jenkins-bot:

[mediawiki/extensions/StopForumSpam@wmf/1.39.0-wmf.25] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824426

Change 824184 merged by jenkins-bot:

[mediawiki/extensions/StopForumSpam@REL1_38] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824184

I think this got accidentally closed as it was turned into a timeboxed task in the past, but the good news is that T315594 will hopefully resolve it :)

I think this got accidentally closed as it was turned into a timeboxed task in the past, but the good news is that T315594 will hopefully resolve it :)

and it's deployed now!

@DL2204 please let us know if you still experience these issues from now on; it should be fixed now.
Also, as a convenient side effect, item creation is a bit faster now :)

Yes, I am still experiencing this issue. WBI v12.0 gets 502 responses quite frequently, then waits for 60 seconds (the default in WBI), and then the edit gets through. By "quite frequently" I mean about once every 2-3 minutes, while the script is constantly editing entities (with no sleep time between edits apart from the few milliseconds the script needs to put together the content of the edit).

Overall speed of API responses has increased now, yes!

This comment was removed by DL2204.
DL2204 reopened this task as Open. Edited Dec 27 2022, 8:09 PM

This is still an issue, as seen in this error logging (python mwclient). The scripts I have running are here.

I'm seeing similar issues in work I'm doing at eew-edgi.wikibase.cloud. In my case, it's frequent 504 errors being thrown. I've been trying a number of things to better handle large build-out scenarios where I need to populate thousands of reference items such as U.S. states, counties, and cities. Parallel processing using something like Dask just seems to make matters worse. I continue to see the problems I reported in a previous issue (https://phabricator.wikimedia.org/T330796), where full availability of an item's functionality is sometimes severely delayed: I see an item in recent changes and can get to it by QID, but I can't turn the item up in UI search or SPARQL queries.

Reading through the thread here, it looks like there is a whole "gauntlet" being run for any API write action to get all the parts and pieces of the Wikibase stack interoperating. Something like the RaiseWikibase approach doesn't seem like it will work with wikibase.cloud, as I assume there is no direct database access. For this whole experiment to work, though, we're going to need some way to instantiate millions of items in a given WB instance. Ideally, we can start trusting more and more of those items to link directly into Wikidata, but we still need some form of federation that allows a specific WB context to decide what it's going to trust and leverage from some other instance. And we'll be adding thousands to millions of new items unique to a given WB instance that can hopefully be contributed to the commons at some point.

A somewhat crude approach might be a dropbox of some kind where we could queue up JSON documents for items being built, with a processor handling them at whatever rate the wikibase.cloud resources will support. Using message-queuing tech, this could support FIFO, polling, and other features to let us check whether something we're working on has been processed or is still stuck in the queue. An item sitting in the dropbox document store would not yet be functional in the WB instance, but we would at least know it has been dropped off and will be processed at some point.

@Tarrow: If this is not Wikibase Cloud, then which other active project tag is this task about?

Tarrow removed Tarrow as the assignee of this task. Aug 3 2023, 1:26 PM
Tarrow claimed this task.

We quickly reviewed this and we believe that this is no longer an issue. See:

image.png (345×812 px, 21 KB)
which shows very few 5xx errors in the last 30 days, apart from when we had a known incident

In any case, if it is still an issue, we wouldn't want such a generic, large task and would rather have a more focused, shorter investigation.