Page MenuHomePhabricator

Quickstatements, "backend is overloaded"
Open, HighPublic

Description

For a few days now, people have reported errors and slowness on my quickstatements tool.
I just had a chance to look at the error.log and this is the latest:

2019-01-21 07:37:41: (mod_fastcgi.c.3002) backend is overloaded; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 0 load: 263
2019-01-21 07:37:41: (mod_fastcgi.c.3002) backend is overloaded; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 0 load: 263
2019-01-21 07:37:41: (mod_fastcgi.c.3594) all handlers for /quickstatements/api.php? on .php are down.
2019-01-21 07:37:57: (mod_fastcgi.c.2765) fcgi-server re-enabled:  0 /var/run/lighttpd/php.socket.quickstatements
2019-01-21 07:37:57: (mod_fastcgi.c.2765) fcgi-server re-enabled:  0 /var/run/lighttpd/php.socket.quickstatements
2019-01-21 07:38:02: (mod_fastcgi.c.3002) backend is overloaded; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 0 load: 263
2019-01-21 07:38:02: (mod_fastcgi.c.3002) backend is overloaded; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 0 load: 263
2019-01-21 07:38:02: (mod_fastcgi.c.3594) all handlers for /quickstatements/api.php?oauth_verifier=26c0fb6c9c8841ff17ca68bcb5d33d59&oauth_token=fc68e7d4aaa9f063b21be600ef627e9f on .php are down.
2019-01-21 07:38:21: (mod_fastcgi.c.2765) fcgi-server re-enabled:  0 /var/run/lighttpd/php.socket.quickstatements
2019-01-21 07:38:21: (mod_fastcgi.c.2765) fcgi-server re-enabled:  0 /var/run/lighttpd/php.socket.quickstatements

The log is basically filled with this.
Is there anything I can do? Anything for Labs to fix?

Event Timeline

Magnus created this task.Jan 21 2019, 8:25 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 21 2019, 8:25 AM
Aklapper renamed this task from backend issues? to "backend is overloaded".Jan 21 2019, 9:48 AM
Krenair edited projects, added Toolforge; removed Cloud-Services.Jan 21 2019, 9:56 AM
Addshore renamed this task from "backend is overloaded" to Quickstatements, "backend is overloaded".Jan 21 2019, 10:07 AM

As far as I can see, this is not limited to QuickStatements - when I'm logged out (as just happened again - my "usual" time is at 21 min into the hour), I cannot access SourceMD either. Access normally comes back around 37 min into the hour, even if I press the "log in" button immediately after being logged out.

As far as I can see, this is not limited to QuickStatements - when I'm logged out (as just happened again - my "usual" time is at 21 min into the hour), I cannot access SourceMD either. Access normally comes back around 37 min into the hour, even if I press the "log in" button immediately after being logged out.

I don't see the "backend is overloaded" errors in sourcemd. Might not be related.

bd808 added a subscriber: bd808.Jan 21 2019, 7:56 PM

The error message seems to indicate an overload of fcgi container that is running php scripts for the tool inside its Kubernetes container. See https://github.com/lighttpd/lighttpd1.4/blob/d1a23569161148f5acde8d4a6fb78c44284e1853/src/mod_fastcgi.c#L2906-L2909

The related part of the lighttpd configuration that is generated by webservice looks like this:

fastcgi.server += ( ".php" =>
        ((
                "bin-path" => "/usr/bin/php-cgi",
                "socket" => "/var/run/lighttpd/php.socket.quickstatements",
                "max-procs" => 2,
                "bin-environment" => (
                        "PHP_FCGI_CHILDREN" => "2",
                        "PHP_FCGI_MAX_REQUESTS" => "500"
                ),
                "bin-copy-environment" => (
                        "PATH", "SHELL", "USER"
                ),
                "broken-scriptfilename" => "enable",
                "allow-x-send-file" => "enable"
         ))
)

This spawns 2 /usr/bin/php-cgi processes (max-procs) each of which is configured to spawn 2 children to actually handle requests (PHP_FCGI_CHILDREN). This combination means the tool can handle 4 concurrent http requests. We do not currently have any way to tune these values per-tool via webservice when using either the grid engine or Kubernetes backends.

From my end, the repetitive logout issue looks resolved. Anyone know what caused and fixed it? Anyone still having problems?

There have been slow response from Toolforge services. I use sourcemd and scholia. sourcemd was just slow, and have been slow from time to time. In scholia I have seen error messages like:

Wed Jan 30 09:26:20 2019 - *** uWSGI listen queue of socket 3 full !!! (101/100) ***
Wed Jan 30 09:26:21 2019 - *** uWSGI listen queue of socket 3 full !!! (101/100) ***
Wed Jan 30 09:26:22 2019 - *** uWSGI listen queue of socket 3 full !!! (101/100) ***

I wonder if these problems are the same.

I am having those repetitive logout issues again.

It worked for a while today but now I'm logged out again and cannot log back in.
All edits through the tool seem to have stalled as well (for any user), and https://tools.wmflabs.org/quickstatements/api.php gives a HTTP 500 error.

Daniel_Mietchen triaged this task as High priority.Feb 15 2019, 2:57 PM
bd808 added a comment.Feb 16 2019, 1:03 AM

It worked for a while today but now I'm logged out again and cannot log back in.
All edits through the tool seem to have stalled as well (for any user), and https://tools.wmflabs.org/quickstatements/api.php gives a HTTP 500 error.

There is an active outage/degradation of the ToolsDB service (T216208) that is likely to be effecting Quickstatements and many other tools.

Hello. am I the only one who cannot use #QuickStatements on Wikidata today ? The url doesn't open https://tools.wmflabs.org/quickstatements
Please help.

bd808 added a comment.Jan 22 2020, 8:01 PM

Hello. am I the only one who cannot use #QuickStatements on Wikidata today ? The url doesn't open https://tools.wmflabs.org/quickstatements
Please help.

The tool seems to be up and responsive to requests, but I did notice that it had issues if the url used was https://tools.wmflabs.org/quickstatements with no trailing /. The lack of an automatic redirect is tracked in T242719: https://tools.wmflabs.org/{toolname} no longer redirects to https://tools.wmflabs.org/{toolname}/ on new k8s cluster. A full webservice stop; webservice --backend=kubernetes php7.3 start restart would fix this, but to be less intrusive I live edited the Kubernetes Ingress object for the tool to add the redirect annotation.

@Magnus Now that you have migrated quickstatements to the 2020 Kubernetes cluster, there is a way that you can scale the number of concurrent connections that the tool can process: /usr/bin/kubectl scale --current-replicas=1 --replicas=2 deployment/quickstatements

This command asks the Kubernetes cluster to check the number of running "pods" managed by the quickstatment deployment and if it finds only 1 running increase that to 2. Once there is more than one pod running, the Service object for quickstatements will use a round-robin algorithm to spread requests across the pool. This should be fine for a PHP based tool such as quickstatements because of PHP's inherent shared-nothing request state model.

Today we do not have a convenient way to do this scaling from the webservice command itself, but using kubectl directly as shown above is possible. The cluster should continue to recreate the increased number of pods following a webservice restart, but a webservice stop; webservice start cycle will reset the Deployment to the default config of requesting only 1 pod.

@bd808 Thanks, now running quickstatements as two repilcas