
[Timebox: 18hrs] Frequent 502 responses when submitting edits
Closed, Resolved · Public · BUG REPORT

Description

When submitting edits (manual edits or API calls using WBI), I frequently get a 502 response. When retrying, the edit normally gets through, although I have had to retry manual edits up to six times. WBI, when receiving a 502, sleeps for 60 seconds; after that, the edit always gets through. I am editing https://wikibase.inguma.eus and https://lexbib.elex.is, two of the wikibases migrated to wikibase.cloud in Batch A.

When I have two processes running on the same wikibase at the same time, the 502 seems to happen much more often, although I also get it quite frequently with a single bot running. When a (WBI-based) bot has been writing data for a while, parallel (manual) edits will get a 502.

I have also had WBI item write actions hang until I made a parallel edit; then the hanging process woke up again. I use WBI version 0.12.

When using QuickStatements, I have not experienced this problem.

Event Timeline


I was able to replicate this locally; I'm looking further into it.

Sharing my findings so far:

I am able to consistently reproduce this error if I run two instances of this example script in parallel: https://phabricator.wikimedia.org/P30583

So far I haven't been able to figure out more details, other than that it seems to happen regardless of whether an entity is being created or updated.

None of the mediawiki pods seem to log a relevant error message when this occurs, which confuses me.

The only error I can see is the following from the nginx logs. I tried increasing some nginx timeouts, but that didn't make any difference (it also doesn't look like anything is timing out at that point - there is only a 1s delta between the request and the error).

172.17.0.24 - - [28/Jun/2022:14:05:24 +0000] "POST /w/api.php HTTP/1.1" 200 175 "-" "DEBUG-T309070"
2022/06/28 14:05:25 [error] 31#0: *1729 upstream prematurely closed connection while reading response header from upstream, client: 172.17.0.24, server: _, request: "POST /w/api.php HTTP/1.1", upstream: "http://10.96.199.199:80/w/api.php", host: "debugdebugdebug.wbaas.localhost"

Disabling our custom hooks and/or (redis) cache didn't affect this problem for me:
https://github.com/wbstack/mediawiki/blob/main/dist/wbstack/src/Settings/Hooks.php
https://github.com/wbstack/mediawiki/blob/main/dist/wbstack/src/Settings/ProductionCache.php

Also, just when this error occurs, this warning can be seen in the mariadb logs: https://phabricator.wikimedia.org/T310597#8036294
(which may just be an indication of the connection lost from the MW API)

So I had a look at this yesterday using the excellent python script @Deniz_WMDE made (https://phabricator.wikimedia.org/P30583). I'd almost propose we create a repo for snippets like this for easy reuse against the platform; IMO we really need to start dog-fooding the thing we've built.

I was able to re-create the 502s locally. Around the time these things happen I was also seeing errors similar to T310121: Production warning: [RedisBagOStuff] Rejected set() for X due to snapshot lag (late regeneration). and T310597: Production warning: Aborted connection 38950 to db: 'mwdb_wbstack_X' user: 'mwu_X' host: '10.108.4.7' (Got an error reading communication packets). Seeing that these issues are still present on the platform, there is a chance these things are all connected; we are just seeing different errors pop up in different pods and still have no clear idea as to what is causing them.

From a previous look at profiling item saves (T299522: [INVESTIGATION 16hrs] Speed improvements for saving statements and creating new items) we noticed that the StopForumSpam extension causes a significant slow-down when saving, around 11 seconds when using the Special:NewItem page. This slowness could trip timeouts.

I used this as a starting point to investigate further and disabled both SpamBlacklist (which pulls external lists) and StopForumSpam (which also pulls a big external list and then runs many loops to calculate whether an IP is bad). This made the 502 error go away, and I was able to run the python script much faster, creating a bit more than one item per second. Re-enabling SpamBlacklist caused a minor slow-down again, but I wasn't seeing any 502s.

Since spam is a problem, we probably want this extension to run, so in an attempt to re-enable it I tried the following, without any success in stopping the 502s from occurring.

Use a local IP list: https://github.com/wbstack/mediawiki/pull/265/

As described in the extension's docs (https://www.mediawiki.org/wiki/Extension:StopForumSpam), it is possible to download one of these lists, put it in the image, and point the extension at it using the $wgSFSIPListLocation variable.
This helped with the slowness but did not resolve the 502 errors occurring after a while. I started off with the original contents of https://www.stopforumspam.com/downloads/listed_ip_30.zip, but as the 502s still occurred I trimmed the list down to just one entry, and the problem still persisted.
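
The mechanism is just a pointer in LocalSettings.php; a minimal sketch of what the PR above does (the exact file path is illustrative, the real one is whatever the Dockerfile bakes into the image):

// Sketch: read the StopForumSpam deny list from a file shipped in the image
// instead of downloading it from stopforumspam.com on demand.
$wgSFSIPListLocation = '/var/www/html/w/listed_ip_30.txt'; // illustrative path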

Increase php memory_limit

# Tweak other PHP.ini settings
RUN { \
		echo 'memory_limit = 512M'; \
	} > /usr/local/etc/php/conf.d/tweaks.ini

But this didn't help.

Increase maxmemory / resources of redis

I gave both the redis primary and replica loads of memory (R 500 / L 900) and increased maxmemory as suggested in T299522#7973128, but still got the 502s.

Increase resources on the mediawiki-webapi pod

webapi:
  requests:
    cpu: 500m
    memory: 350Mi
  limits:
    cpu: 1000m
    memory: 750Mi

But this didn't resolve the 502s.

None of these things seems to have helped, and I'm really no closer to understanding why the upstream (mediawiki) would close the connection. In the previously linked PR (https://github.com/wbstack/mediawiki/pull/265/) I've also added debug logs to see what the StopForumSpam extension is actually doing. See https://phabricator.wikimedia.org/P30707 for traces of a single API edit, where we apparently check whether the IP is blocked 4 times. The interesting part is that the failure always seems to happen when we enter the code run by https://github.com/wikimedia/ipset; however, there is no log output inside mediawiki to indicate what happened, it just stops. This makes me think it's either a configuration error, maybe related to the configuration changes in T309687: Registering on wikibase.cloud exposes internal IP rather than the registering user, or something in the php/apache configuration of the dockerfile.

	public function isIpDenyListed( $ip ) {
		if ( IPUtils::isIPAddress( $ip ) === null ) {
			return false;
		}

		return $this->getIpDenyListSet()->match( $ip );
	}

This commit, https://github.com/wikimedia/mediawiki-extensions-StopForumSpam/commit/ee89a5063c6c87a61b41c1db7295ea3cd9d92111, which we currently don't have deployed, looks to me like it would let anyone with sfsblock-bypass on their user circumvent this entire check, so they wouldn't hit this problem and would get much faster edit speeds.
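
Roughly, the shape of that change (simplified here for illustration; this is not the actual diff) is an early exit before the deny-list set is ever built or matched:

	// Illustration only: a user holding the sfsblock-bypass right never
	// reaches the expensive IPSet construction / match() call at all.
	if ( $user->isAllowed( 'sfsblock-bypass' ) ) {
		return false; // treated as not deny-listed
	}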

conny-kawohl_WMDE renamed this task from Frequent 502 responses when submitting edits to [Timebox: 18hrs] Frequent 502 responses when submitting edits. Jul 7 2022, 9:20 AM

I was also able to reproduce the 502s locally with just one copy of this updater.py script. It took around 5 minutes to see the failure. Nothing obviously appearing at all in the MW logs.

My suspicion is that we might be hitting the internal 30s timeout for PHP (https://www.php.net/manual/en/function.set-time-limit.php), and this is therefore killing things before a) we see a log and b) the connection to the DB is properly closed, which would explain the errors I do see in those logs, e.g.:
2022-07-18 9:51:57 875 [Warning] Aborted connection 875 to db: 'mwdb_de54a27940' user: 'mwu_ba21cf5817' host: '172.17.0.1' (Got an error reading communication packets)

Unassigned but will pick up again this afternoon. Time box remaining ~16.5 hrs

My suspicion is that we might be hitting the internal 30s timeout for php

Good thinking. I looked at this and tried bumping it to 240s, but I still saw the error within 1 minute. I also tried 10s just to force the error, and then you can see it in the output of the server response: Fatal error: Maximum execution time of 10 seconds exceeded

This is how I tried it out (mediawiki Dockerfile); the max_execution_time line is the one I added:

# Tweak other PHP.ini settings
RUN { \
		echo 'memory_limit = 256M'; \
		echo 'max_execution_time = 240'; \
	} > /usr/local/etc/php/conf.d/tweaks.ini

Time box remaining ~15 hrs

Attempted to adjust the HTTP timeouts by setting

$wgAsyncHTTPTimeout = 240;
$wgHTTPTimeout = 240;

in LocalSettings.php.

This then resulted in

{"error":{"code":"internal_api_error_DBQueryError","info":"[d43eb32f451c41bd203e100f] Exception caught: A database query error has occurred. This may indicate a bug in the software.","errorclass":"Wikimedia\\Rdbms\\DBQueryError","*":"Wikimedia\\Rdbms\\DBQueryError at /v
ar/www/html/w/includes/libs/rdbms/database/Database.php(1809)\nfrom /var/www/htm

coming from update.py, and I observed that the secondary sql pod had been killed due to a failed liveness check.

I could still repeat this today very clearly. After removing the readiness/liveness checks on the SQL pod I could happily run one update.py script for ages without it failing. I could also run a second update.py concurrently; that one would run for a while and then hit not a 502 but a 504 (seemingly hitting the 60s timeout in the platform nginx?).

Given that @toan noticed these problems seem to disappear when the SFS extension is unloaded, we looked at the docs: https://www.mediawiki.org/wiki/Extension:StopForumSpam. They make it clear that object caching must be enabled, and that seems to be something we don't have for web requests (redis is used for caching in maintenance scripts).
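
For illustration, giving web requests a main object cache would be something along these lines in LocalSettings.php (the cache backend and server address here are assumptions, not our actual production config):

// Sketch: give web requests a main object cache so SFS can cache its deny list.
$wgObjectCaches['redis'] = [
	'class' => 'RedisBagOStuff',
	'servers' => [ 'redis-master:6379' ], // placeholder address
];
$wgMainCacheType = 'redis';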

Timebox left: ~9 hrs, minus some time from Deniz

FWIW I think I managed to capture profile information while creating an item; maybe we can use this to analyze what is going on within the SFS extension: https://drive.google.com/file/d/1j_rtfSSsmcUMDriUkGLf_AQA-S61xxEE/view?usp=sharing


Another one, for which I tried loading the IP list from disk and enabling caching:

$wgSFSIPListLocation = '/var/www/html/w/listed_ip_30_ipv46.txt';
$wgMainCacheType = CACHE_DB;

https://drive.google.com/file/d/1aZmRSLcqUFHmg_d_5jP9VOD1G_JZWaZY/view?usp=sharing


And another one with SFS disabled: https://drive.google.com/file/d/1mLWiV_9DP3lWZfuzp6WULp_sVMbe0ojA/view?usp=sharing


edit: updated the links to the right request files

I spent some time yesterday capturing more profile information and looked into it a bit.

timebox left: ~3h

Timebox now empty; some last thoughts:

Stepping through the code and looking at the profiles we took, I noticed the following:

  • A large amount of time seems to be spent in this addCIDR method
  • This is used to build the IPSet object
  • The IPSet object doesn't actually appear to be cached by the DenyListManager
  • What is cached by the DenyListManager is a hex-encoded list of IPs
  • This is only cached if a cache is configured

I managed to get this hex list cached by regenerating the IP list manually using the maintenance script. However, I was unable to do this without substantially bumping the CPU limit on my machine (I moved it up to 15). With a lower CPU limit I saw [warning] [SQLBagOStuff] Lowered set() TTL for global:sfs-denylist-set due to snapshot lag (high regeneration time). When this warning is thrown, the TTL of the cached list is set to only one second.
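
Putting these observations together, the gap is that even when the hex list is cached, the IPSet built from it is reconstructed (addCIDR and all) on every request. A rough sketch of the kind of per-request memoization that would avoid this; the class and helper names are illustrative, not the actual extension code or the later patch:

use Wikimedia\IPSet;

class DenyListManagerSketch {
	/** @var IPSet|null built at most once per request */
	private $ipSet = null;

	public function getIpDenyListSet(): IPSet {
		if ( $this->ipSet === null ) {
			// The expensive step: constructing the IPSet walks every CIDR entry
			// (this is where the addCIDR time in the profiles goes).
			$this->ipSet = new IPSet( $this->getIpDenyList() );
		}
		return $this->ipSet;
	}

	/** Stand-in for "fetch the (possibly cached) deny list of IPs/CIDRs". */
	private function getIpDenyList(): array {
		return [ '192.0.2.1', '198.51.100.0/24' ]; // placeholder entries
	}
}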

Tarrow claimed this task.

Change 824415 had a related patch set uploaded (by Addshore; author: Addshore):

[mediawiki/extensions/StopForumSpam@master] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824415

Change 824415 merged by jenkins-bot:

[mediawiki/extensions/StopForumSpam@master] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824415

Change 824184 had a related patch set uploaded (by Reedy; author: Addshore):

[mediawiki/extensions/StopForumSpam@REL1_38] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824184

Change 824185 had a related patch set uploaded (by Reedy; author: Addshore):

[mediawiki/extensions/StopForumSpam@REL1_37] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824185

Change 824426 had a related patch set uploaded (by Reedy; author: Addshore):

[mediawiki/extensions/StopForumSpam@wmf/1.39.0-wmf.25] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824426

Change 824185 merged by jenkins-bot:

[mediawiki/extensions/StopForumSpam@REL1_37] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824185

Change 824426 merged by jenkins-bot:

[mediawiki/extensions/StopForumSpam@wmf/1.39.0-wmf.25] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824426

Change 824184 merged by jenkins-bot:

[mediawiki/extensions/StopForumSpam@REL1_38] Fix DenyListManager singleton

https://gerrit.wikimedia.org/r/824184

I think this got accidentally closed as it was turned into a timeboxed task in the past, but the good news is that T315594 will hopefully resolve it :)

I think this got accidentally closed as it was turned into a timeboxed task in the past, but the good news is that T315594 will hopefully resolve it :)

and it's deployed now!

@DL2204 please let us know if you still experience these issues from now on; it should be fixed now.
Also, as a convenient side effect, item creation is a bit faster now :)

Yes, I am still experiencing this issue. WBI v12.0 gets 502 responses quite frequently, then waits for 60 seconds (the default in WBI), and then the edit gets through. By "quite frequently" I mean about once every 2-3 minutes, while the script is constantly editing entities (with no sleep time between edits apart from the few milliseconds the script needs to put together the content of the edit).

Overall speed of API responses has increased now, yes!

This comment was removed by DL2204.
DL2204 reopened this task as Open. Edited Dec 27 2022, 8:09 PM

This is still an issue, as seen in this error logging (python mwclient). The scripts I have running are here.

I'm seeing similar issues in work I'm doing at eew-edgi.wikibase.cloud. In my case, it's frequent 504 errors being thrown. I've been trying a number of things to better handle large build-out scenarios where I need to populate thousands of reference items such as U.S. states, counties, and cities. Parallel processing using something like Dask just seems to make matters worse. I continue to see the problems I reported in a previous issue (https://phabricator.wikimedia.org/T330796), where full availability of an item's functionality is sometimes severely delayed: I see an item in recent changes and can get to it by QID, but I can't turn the item up in UI search or SPARQL queries.

Reading through the thread here, it looks like there is a whole "gauntlet" being run for any API write action to get all the parts and pieces of the Wikibase stack interoperating. Something like the RaiseWikibase approach doesn't seem like it will work with wikibase.cloud, as I assume there is no direct database access. For this whole experiment to work, though, we're going to need some way to instantiate millions of items in a given WB instance. Ideally, we can start trusting more and more of those items to link directly into Wikidata, but we still need some form of federation that allows a specific WB context to decide what it's going to trust and leverage from some other instance. And we'll be adding thousands to millions of new items unique to a given WB instance that can hopefully be contributed to the commons at some point.

A somewhat crude approach might be a dropbox of some kind where we could queue up JSON documents for items being built, with a processor handling them at whatever rate the wikibase.cloud resources will support. Using message-queuing tech, this could support FIFO, polling, and other features to let us check whether something we're working on has been processed or is still stuck in the queue. An item sitting in the dropbox document store would not yet be functional in the WB instance, but we would at least know it has been dropped off and will be processed at some point.

@Tarrow: If this is not Wikibase Cloud, then which other active project tag is this task about?

Tarrow removed Tarrow as the assignee of this task. Aug 3 2023, 1:26 PM
Tarrow claimed this task.

We quickly reviewed this and we believe that this is no longer an issue. See:

image.png (345×812 px, 21 KB)
which shows very few 5xx errors in the last 30 days, apart from when we had a known incident

In any case, if it is still an issue, we wouldn't want such a generic, large task and would rather have a more focused, shorter investigation.