
Earwig's copyvio tool 504 gateway time-out issues
Open, Medium, Public

Description

We've been having issues with 504 Gateway Time-out errors with this tool since about February 12. The statistics shown at https://stats.uptimerobot.com/BN16RUOP5/784331770 are deceptive; even when the tool does load, I get a 504 Gateway Time-out error almost every time I try to perform a comparison.


Event Timeline


Meh, it seems he is not currently available on Discord. It would be faster if he were.

Masumrezarock100 reopened this task as Open. Edited Feb 17 2020, 1:31 PM

No, this is not the same problem. I am seeing a 504 gateway timeout error if I visit https://tools.wmflabs.org/copyvios/ while T243736 describes the error message as

An error occurred while using the search engine (Google Error: HTTP Error 403: Forbidden). Note: there is a daily limit on the number of search queries the tool is allowed to make. You may repeat the check without using the search engine.

I remember I used it yesterday and it was working but it is down at the moment.

I raised this on IRC in #wikimedia-cloud earlier. @arturo checked the logs and found uWSGI listen queue of socket ":8000" (fd: 7) full !!! (101/100) and restarted the web service.

He also found the following error:

open("/usr/lib/uwsgi/plugins/python3_plugin.so"): No such file or directory [core/utils.c line 3664]
! UNABLE to load uWSGI plugin: /usr/lib/uwsgi/plugins/python3_plugin.so: cannot open shared object file: No such file or directory !
[uWSGI] getting INI configuration from /data/project/copyvios/www/python/uwsgi.ini

He suggested that the metrics at https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&var-namespace=tool-copyvios&refresh=5m&from=now-1h&to=now indicated it needed higher CPU limits.

Earwig added a subscriber: bd808. Feb 17 2020, 4:34 PM

I'm not sure how to raise the limits beyond where they currently are (1 vCPU and 4GiB of RAM). Because we didn't have this issue with the old k8s cluster, I'm also not sure what has changed now. @bd808, any ideas?

Also, the error Arturo found is not a concern. That will happen for any Python 2 webservice because, as I understand it, the container image only makes Python 2 available but sets uwsgi up to load both the Python 2 and 3 plugins.

The CPU usage being reported by grafana does not seem correct. Inside the container, with ps, I see:

tools.copyvios@:~$ ps auxwwf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
tools.c+   102  0.4  0.0  44308  4192 pts/1    Ss   19:08   0:00 /bin/bash
tools.c+   115  0.0  0.0  41304  2700 pts/1    R+   19:09   0:00  \_ ps auxwwf
tools.c+    39  0.0  0.0  44312  4104 pts/0    Ss+  16:41   0:00 /bin/bash
tools.c+     1  0.0  0.8 208688 68936 ?        Ss   16:38   0:08 /usr/bin/uwsgi --plugin python,python3 --http-socket :8000 --chdir /data/project/copyvios/www/python/src --logto /data/project/copyvios/uwsgi.log --callable app --manage-script-name --workers 4 --mount /copyvios=/data/project/copyvios/www/python/src/app.py --die-on-term --strict --master --venv /data/project/copyvios/www/python/venv --ini /data/project/copyvios/www/python/uwsgi.ini
tools.c+     9  3.8  2.5 969832 210368 ?       Sl   16:39   5:44 /usr/bin/uwsgi --plugin python,python3 --http-socket :8000 --chdir /data/project/copyvios/www/python/src --logto /data/project/copyvios/uwsgi.log --callable app --manage-script-name --workers 4 --mount /copyvios=/data/project/copyvios/www/python/src/app.py --die-on-term --strict --master --venv /data/project/copyvios/www/python/venv --ini /data/project/copyvios/www/python/uwsgi.ini
tools.c+    10  3.7  2.7 963352 220660 ?       Sl   16:39   5:34 /usr/bin/uwsgi --plugin python,python3 --http-socket :8000 --chdir /data/project/copyvios/www/python/src --logto /data/project/copyvios/uwsgi.log --callable app --manage-script-name --workers 4 --mount /copyvios=/data/project/copyvios/www/python/src/app.py --die-on-term --strict --master --venv /data/project/copyvios/www/python/venv --ini /data/project/copyvios/www/python/uwsgi.ini
tools.c+    11  3.1  2.4 976900 196780 ?       Sl   16:39   4:39 /usr/bin/uwsgi --plugin python,python3 --http-socket :8000 --chdir /data/project/copyvios/www/python/src --logto /data/project/copyvios/uwsgi.log --callable app --manage-script-name --workers 4 --mount /copyvios=/data/project/copyvios/www/python/src/app.py --die-on-term --strict --master --venv /data/project/copyvios/www/python/venv --ini /data/project/copyvios/www/python/uwsgi.ini
tools.c+    12  7.4  4.1 1026328 335184 ?      Sl   16:39  11:09 /usr/bin/uwsgi --plugin python,python3 --http-socket :8000 --chdir /data/project/copyvios/www/python/src --logto /data/project/copyvios/uwsgi.log --callable app --manage-script-name --workers 4 --mount /copyvios=/data/project/copyvios/www/python/src/app.py --die-on-term --strict --master --venv /data/project/copyvios/www/python/venv --ini /data/project/copyvios/www/python/uwsgi.ini
tools.c+    14  9.5  3.1 985556 259056 ?       Sl   16:39  14:17 /usr/bin/uwsgi --plugin python,python3 --http-socket :8000 --chdir /data/project/copyvios/www/python/src --logto /data/project/copyvios/uwsgi.log --callable app --manage-script-name --workers 4 --mount /copyvios=/data/project/copyvios/www/python/src/app.py --die-on-term --strict --master --venv /data/project/copyvios/www/python/venv --ini /data/project/copyvios/www/python/uwsgi.ini
tools.c+    92  4.0  1.4 943992 121220 ?       Sl   19:02   0:16 /usr/bin/uwsgi --plugin python,python3 --http-socket :8000 --chdir /data/project/copyvios/www/python/src --logto /data/project/copyvios/uwsgi.log --callable app --manage-script-name --workers 4 --mount /copyvios=/data/project/copyvios/www/python/src/app.py --die-on-term --strict --master --venv /data/project/copyvios/www/python/venv --ini /data/project/copyvios/www/python/uwsgi.ini

That's about 30% usage, which top agrees with. However, grafana reports usage consistently around 100% for the past hour: https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&var-namespace=tool-copyvios&refresh=5s&from=1581963062073&to=1581966662073&panelId=1&fullscreen

Looking (for the first time) at that grafana graph, and with due respect to your 30% calculation: the usage reported there was spiky for the first ~9 days, but went to an almost-solid 100% about the time we noticed the timeouts.

It came back briefly, but it's gone again. From where I am, it connects to the server and then 504's after 3 minutes.

I have hacked my way to a higher apparent memory/CPU limit by adjusting the container resource requests to match the command-line limits instead of half (so 4 GiB RAM and 1 CPU instead of 2 GiB and 0.5, respectively). While this is certainly not a good long-term solution, I want to see if the tool behaves better under these conditions.
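For anyone curious what that adjustment looks like in practice: on Kubernetes, "requests" are what the scheduler reserves for the pod and "limits" are the hard cap, and setting them equal gives the pod the Guaranteed QoS class. A generic sketch of the resulting container resource stanza (values taken from this comment; the tool's actual manifest is not shown here, so treat the shape as an assumption):

```yaml
# Sketch only: a container spec where requests match limits, so the full
# 1 CPU / 4 GiB is reserved up front (Guaranteed QoS class).
resources:
  requests:
    cpu: "1"
    memory: 4Gi
  limits:
    cpu: "1"
    memory: 4Gi
```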

Initial results don't seem especially promising; the spiky CPU usage continues for about half an hour after restarting until reaching 100% and staying there seemingly indefinitely. It's not clear to me why this would happen. One conclusion I can make is that it's not directly related to memory usage; I had a theory earlier that we were bumping into the memory quota and swap thrashing, but I don't think this is happening. Memory usage is only around 600MB out of 4GB (per grafana) when we get stuck.
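One possibility worth ruling out here (my speculation, not something confirmed in this task): the grafana number may reflect CFS throttling, where the cgroup is pegged at its CPU quota even though each worker individually looks fairly idle to ps. The cgroup's own counters distinguish the two cases; a small sketch that parses them (cgroup v1 layout assumed; the path inside the pod may differ):

```python
# Sketch: parse the cgroup v1 cpu.stat counters to see whether the kernel
# is throttling the container. A climbing nr_throttled means the cgroup
# is hitting its CFS quota even if per-process CPU looks modest.

def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats

if __name__ == "__main__":
    try:
        with open("/sys/fs/cgroup/cpu,cpuacct/cpu.stat") as f:
            stats = parse_cpu_stat(f.read())
        print("periods=%s throttled=%s" % (
            stats.get("nr_periods"), stats.get("nr_throttled")))
    except (IOError, OSError):
        print("no cgroup cpu.stat here (not inside a limited container?)")
```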

I tried to debug one worker that seemed to be holding on to the CPU, but it's difficult. gdb is not available inside the pod, so to get a stack trace I made it dump core, but I can't really analyze the core dump because gdb on the bastion doesn't have access to the right binaries/libraries. uwsgi logs a stack trace, but it's for a random thread that was trying to acquire a lock, so that's not helpful.

It should be possible to solve/mitigate this by rearchitecting the tool to handle requests asynchronously, but this is a lot of work to get correct and I don't have time to do that right now. We have survived with the current blocking threaded worker model for years; why is it only an issue on the new cluster even with the limits raised so much? Were the limits on the old cluster really that much higher? I'm so perplexed by all of this.
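For the record, the asynchronous rearchitecting being contemplated usually takes the shape of "accept the request fast, run the slow check in the background, let the client poll for results". A toy sketch of that pattern (illustrative only; none of these names come from the tool's actual code):

```python
# Minimal sketch of an async job pattern: submit() returns immediately
# with a job id, the slow work runs in a background thread, and poll()
# reports completion. A real version would bound the worker pool and
# expire old jobs.
import threading
import uuid

_jobs = {}  # job id -> {"done": bool, "result": ...}

def slow_check(text):
    # Stand-in for the real copyvio comparison, which can take minutes.
    return "checked:" + text

def submit(text):
    job_id = str(uuid.uuid4())
    _jobs[job_id] = {"done": False, "result": None}

    def run():
        result = slow_check(text)
        _jobs[job_id] = {"done": True, "result": result}

    threading.Thread(target=run).start()
    return job_id

def poll(job_id):
    return _jobs[job_id]
```

The point of the pattern is that the HTTP worker is freed immediately, so the front proxy's read timeout and the uWSGI listen queue stop being the bottleneck for long checks.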

Thanks for the efforts so far. Hope you can track it down.

Coffee added a subscriber: Coffee.
Coffee triaged this task as High priority. Feb 18 2020, 8:30 PM

This is a very widely used tool that assists in administration and legal compliance, and as such needs quick resolution. Thank you @Earwig for beginning to tackle this.

bd808 lowered the priority of this task from High to Medium. Feb 18 2020, 9:40 PM

This is a very widely used tool that assists in administration and legal compliance, and as such needs quick resolution. Thank you @Earwig for beginning to tackle this.

The 'priority' field in Phabricator means something, but it does not mean "the perceived importance of a bug by those affected by the bug" as used in the quoted edit.

It should be possible to solve/mitigate this by rearchitecting the tool to handle requests asynchronously, but this is a lot of work to get correct and I don't have time to do that right now. We have survived with the current blocking threaded worker model for years; why is it only an issue on the new cluster even with the limits raised so much? Were the limits on the old cluster really that much higher? I'm so perplexed by all of this.

There are no namespace resource limits on the legacy Kubernetes cluster in Toolforge, so honestly I'm not completely surprised by a tool which has done a lot of manual tuning (T244107#5848724) having problems on the 2020 cluster. This is dangerously close to WP:BEANS, but on the legacy Kubernetes cluster the kernel's OOMKiller won't kick in until the entire host node is in memory trouble rather than when the Docker container's resource limit is reached.

The $HOME/uwsgi.log of the tool is flooded with lines like:

Tue Feb 18 14:11:33 2020 - *** uWSGI listen queue of socket ":8000" (fd: 7) full !!! (101/100) ***

These lines mean that the uWSGI processes' HTTP client listener backlog is being exceeded. That in turn means that requests are coming in faster than they can be processed. From what I know of what copyvios does to process a request, this is not entirely surprising. I think that the 101st request waiting in line would get a different error than a 504, but I have not set up a test environment to verify that yet.
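As background (generic uWSGI behavior, not something verified on this tool): the queue that is overflowing is uWSGI's listen backlog, which defaults to 100 entries. It can be raised with the listen option in uwsgi.ini, keeping in mind that the kernel's net.core.somaxconn limit also applies, so that sysctl may need raising on the host too:

```ini
; Hypothetical uwsgi.ini fragment: raise the socket listen backlog from
; the default of 100. The kernel's net.core.somaxconn must be at least
; this large for the value to take effect.
[uwsgi]
listen = 500
```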


The large accept backlog described above made me think of another possible problem however that would result in 503 responses.

In the nginx front proxy for tools.wmflabs.org (dynamicproxy, a.k.a. "yuviproxy"), we have configuration that says:

# Some projects have tools that take data in and process them
# for a long time. While ideally they should be made async, this
# is an interim solution that works for now.
# This is doubly important for websockets
proxy_read_timeout 3600s;

This gives any arbitrary tool 60 minutes to respond to a request. If no response happens in that time then the front proxy would return a 504 response to the caller. The default value for proxy_read_timeout is 60 seconds, which is quite likely not long enough for some (all?) copyvios requests to be handled.

A new thing in the 2020 Kubernetes cluster is the introduction of a second layer of nginx front proxy. This second proxy is a deployment of kubernetes/ingress-nginx which we are using as our bare-metal ingress layer. This nginx service uses a fairly complex combination of templates, ConfigMaps, and annotations on Ingress objects. I checked the live configuration of this proxy layer for the copyvios tool and found this generated config:

## start server copyvios.toolforge.org
server {
        server_name copyvios.toolforge.org ;

        listen 8080  ;
        listen 443  ssl http2 ;

        set $proxy_upstream_name "-";

        location / {

                set $namespace      "tool-copyvios";
                set $ingress_name   "copyvios-subdomain";
                set $service_name   "copyvios";
                set $service_port   "{0 8000 }";
                set $location_path  "/";

                rewrite_by_lua_block {
                        lua_ingress.rewrite({
                                force_ssl_redirect = false,
                                use_port_in_redirects = false,
                        })
                        balancer.rewrite()
                        plugins.run()
                }

                header_filter_by_lua_block {

                        plugins.run()
                }
                body_filter_by_lua_block {

                }

                log_by_lua_block {

                        balancer.log()

                        monitor.call()

                        plugins.run()
                }

                port_in_redirect off;

                set $balancer_ewma_score -1;
                set $proxy_upstream_name    "tool-copyvios-copyvios-8000";
                set $proxy_host             $proxy_upstream_name;
                set $pass_access_scheme $scheme;
                set $pass_server_port $server_port;
                set $best_http_host $http_host;
                set $pass_port $pass_server_port;

                set $proxy_alternative_upstream_name "";

                client_max_body_size                    128M;

                proxy_set_header Host                   $best_http_host;

                # Pass the extracted client certificate to the backend

                # Allow websocket connections
                proxy_set_header                        Upgrade  $http_upgrade;

                proxy_set_header                        Connection  $connection_upgrade;

                proxy_set_header X-Request-ID           $req_id;
                proxy_set_header X-Real-IP              $the_real_ip;

                proxy_set_header X-Forwarded-For        $the_real_ip;

                proxy_set_header X-Forwarded-Host       $best_http_host;
                proxy_set_header X-Forwarded-Port       $pass_port;
                proxy_set_header X-Forwarded-Proto      $pass_access_scheme;

                proxy_set_header X-Original-URI         $request_uri;

                proxy_set_header X-Scheme               $pass_access_scheme;

                # Pass the original X-Forwarded-For
                proxy_set_header X-Original-Forwarded-For $http_x_forwarded_for;

                # mitigate HTTPoxy Vulnerability
                # https://www.nginx.com/blog/mitigating-the-httpoxy-vulnerability-with-nginx/
                proxy_set_header Proxy                  "";
                # Custom headers to proxied server

                proxy_connect_timeout                   5s;
                proxy_send_timeout                      60s;
                proxy_read_timeout                      60s;

                proxy_buffering                         off;
                proxy_buffer_size                       4k;
                proxy_buffers                           4 4k;
                proxy_request_buffering                 on;
                proxy_http_version                      1.1;

                proxy_cookie_domain                     off;
                proxy_cookie_path                       off;

                # In case of errors try the next upstream server before returning an error
                proxy_next_upstream                     error timeout;
                proxy_next_upstream_timeout             0;
                proxy_next_upstream_tries               3;

                proxy_pass http://upstream_balancer;

                proxy_redirect                          off;

        }

}
## end server copyvios.toolforge.org

Buried in there is the line proxy_read_timeout 60s;. That is default configuration that we need to change for all tools to preserve the extended timeout behavior they would have seen when using the grid engine or the legacy Kubernetes cluster. I will work on a configuration patch for that.
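For reference, ingress-nginx exposes these defaults through its global ConfigMap (keys per ingress-nginx's documented ConfigMap options; the ConfigMap name here matches the "nginx-configuration" one mentioned in the SAL entry below, while the namespace is an assumption about how the chart was deployed):

```yaml
# Sketch: overriding the 60s proxy timeout defaults cluster-wide via the
# ingress-nginx ConfigMap. Timeout values are strings, in seconds.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
  namespace: ingress-nginx   # assumed; depends on the deployment layout
data:
  proxy-read-timeout: "3600"
  proxy-send-timeout: "3600"
```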

Change 573016 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: increase nginx-ingress proxy_read_timeout to match dynamicproxy

https://gerrit.wikimedia.org/r/573016

This patch is unmerged at the time I write this comment, but the change it represents has been made on the live cluster, so this one difference from the legacy cluster usage should be resolved.

Mentioned in SAL (#wikimedia-cloud) [2020-02-19T00:59:41Z] <bd808> Live hacked the "nginx-configuration" ConfigMap for T245426 (done several hours ago, but I forgot to !log it)

Chrisw80 added a comment. Edited Feb 19 2020, 1:51 AM

Well, I'm not getting a gateway timeout now, but I've had a query spinning for something like twenty minutes now. Perhaps a different issue? My queries usually complete in 90-180 seconds. Update: Additional queries still seem to fail with Gateway Timeout.

Change 573016 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: increase nginx-ingress proxy_read_timeout to match dynamicproxy

https://gerrit.wikimedia.org/r/573016

bd808 added a comment. Feb 20 2020, 3:20 AM

Well, I'm not getting a gateway timeout now, but I've had a query spinning for something like twenty minutes now. Perhaps a different issue? My queries usually complete in 90-180 seconds. Update: Additional queries still seem to fail with Gateway Timeout.

Slow responses from the tool are likely a different problem, and possibly the cause of the *** uWSGI listen queue of socket ":8000" (fd: 7) full !!! (101/100) *** log spam. @Earwig will need to figure that out, but he may need special help from me or another Toolforge admin to do that debugging, based on comments in T245426#5891797

@zhuyifei1999 can I nerd snipe you into helping @Earwig debug a misbehaving Flask app on the 2020 Kubernetes cluster? I know you are better at that kind of deep dive than I am. :)

Diannaa added a comment. Edited Feb 21 2020, 1:23 PM

Is anybody working on this? The tool is non-functional at present, timing out every time I try to perform a search. I am not getting the gateway time-out, but rather when attempting a comparison it just sits and spins for a lengthy period (5 minutes in the most recent one), eventually returning an error message such as "The URL http://www.isapindia.org/Newsletter/Feb16.html timed out before any data could be retrieved."

@Diannaa, I'm sincerely sorry for the recent problems. I've tried a few different configuration adjustments to get it to play nicely on the new infrastructure but nothing seems to help much. It seems there simply isn't enough available CPU with the current quota for the tool to do its job properly, or there is some other new limitation I can't figure out. This tells me the only solution will be a much more careful performance analysis that might require some substantial reworking of the code. I want to work on this, but I'm stuck due to some Real Life things that are eating my free time/energy for the moment. I can't promise a specific ETA.

@Earwig I was going to ask: Purely selfishly, how easy would it be to break out the "URL comparison" feature into a separate tool? Presumably it uses fewer resources, and (as I said, purely selfishly) it's the only feature I use.

But, as I was deciding to post this, I went back and found that the URL feature does, indeed, seem to be working, at least for a small number of tries. Is it possible that the changes you've made have, at least, fixed that?

@zhuyifei1999 can I nerd snipe you into helping @Earwig debug a misbehaving Flask app on the 2020 Kubernetes cluster? I know you are better at that kind of deep dive than I am. :)

ok. will look

Is this concerning the URL https://tools.wmflabs.org/copyvios/? If so, it is not giving me a 504 right now. Mind pinging me when it does?

Since there are mentions of some sockets above, I checked whether the fds look sane. I installed lsof and strace inside the pod, and:

08:34:00 0 ✓ zhuyifei1999@tools-k8s-worker-32: ~$ sudo nsenter -m -n -p -t 30297
root@tools-k8s-worker-32:/# ps auxfww
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 3521 0.6 0.0 39816 3816 ? S 08:34 0:00 -bash
root 3524 0.0 0.0 41304 2672 ? R+ 08:34 0:00 \_ ps auxfww
tools.c+ 1 0.0 0.8 208264 68488 ? Ss Feb19 3:01 /usr/bin/uwsgi --plugin python,python3 --http-socket :8000 --chdir /data/project/copyvios/www/python/src --logto /data/project/copyvios/uwsgi.log --callable app --manage-script-name --workers 4 --mount /copyvios=/data/project/copyvios/www/python/src/app.py --die-on-term --strict --master --venv /data/project/copyvios/www/python/venv --ini /data/project/copyvios/www/python/uwsgi.ini
tools.c+ 3388 90.8 6.9 1449776 570592 ? Sl 07:57 33:30 /usr/bin/uwsgi --plugin python,python3 --http-socket :8000 --chdir /data/project/copyvios/www/python/src --logto /data/project/copyvios/uwsgi.log --callable app --manage-script-name --workers 4 --mount /copyvios=/data/project/copyvios/www/python/src/app.py --die-on-term --strict --master --venv /data/project/copyvios/www/python/venv --ini /data/project/copyvios/www/python/uwsgi.ini
tools.c+ 3401 0.6 1.5 1155060 130404 ? Sl 08:02 0:12 /usr/bin/uwsgi --plugin python,python3 --http-socket :8000 --chdir /data/project/copyvios/www/python/src --logto /data/project/copyvios/uwsgi.log --callable app --manage-script-name --workers 4 --mount /copyvios=/data/project/copyvios/www/python/src/app.py --die-on-term --strict --master --venv /data/project/copyvios/www/python/venv --ini /data/project/copyvios/www/python/uwsgi.ini
root@tools-k8s-worker-32:/# lsof -p 1
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
uwsgi 1 tools.copyvios cwd DIR 0,48 4096 171443497 /data/project/copyvios/git/copyvios (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 1 tools.copyvios rtd DIR 0,110 4096 534336 /
uwsgi 1 tools.copyvios txt REG 0,110 801664 1065028 /usr/bin/uwsgi-core
[...]
uwsgi 1 tools.copyvios 0r CHR 1,3 0t0 25360896 /dev/null
uwsgi 1 tools.copyvios 1u unix 0xffff9[...]83c800 0t0 25361516 socket
uwsgi 1 tools.copyvios 2u unix 0xffff9[...]83c800 0t0 25361516 socket
uwsgi 1 tools.copyvios 3u unix 0xffff9[...]83fc00 0t0 25361510 socket
uwsgi 1 tools.copyvios 4u REG 0,48 1808831 96539886 /data/project/copyvios/uwsgi.log (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 1 tools.copyvios 5u unix 0xffff9[...]83d000 0t0 25361515 socket
uwsgi 1 tools.copyvios 6u unix 0xffff9[...]83c800 0t0 25361516 socket
uwsgi 1 tools.copyvios 7u IPv4 25361535 0t0 TCP *:8000 (LISTEN)
uwsgi 1 tools.copyvios 8u unix 0xffff9[...]f6c400 0t0 25360992 socket
uwsgi 1 tools.copyvios 9u unix 0xffff9[...]f6f400 0t0 25360993 socket
uwsgi 1 tools.copyvios 10u unix 0xffff9[...]f6d400 0t0 25360994 socket
uwsgi 1 tools.copyvios 11u unix 0xffff9[...]f6cc00 0t0 25360995 socket
uwsgi 1 tools.copyvios 12w REG 0,48 84015 171517789 /data/project/copyvios/git/copyvios/logs/.nfs000000000a39275d0000014a (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 1 tools.copyvios 13u unix 0xffff9[...]811400 0t0 25360233 socket
uwsgi 1 tools.copyvios 14u unix 0xffff9[...]810c00 0t0 25360234 socket
uwsgi 1 tools.copyvios 15u 0000 0,13 0 8269 anon_inode
uwsgi 1 tools.copyvios 16u IPv4 25360235 0t0 TCP localhost:9001 (LISTEN)
uwsgi 1 tools.copyvios 22r CHR 1,9 0t0 25360901 /dev/urandom
root@tools-k8s-worker-32:/# lsof -p 3388
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
uwsgi 3388 tools.copyvios cwd DIR 0,48 4096 171443497 /data/project/copyvios/git/copyvios (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 3388 tools.copyvios rtd DIR 0,110 4096 534336 /
uwsgi 3388 tools.copyvios txt REG 0,110 801664 1065028 /usr/bin/uwsgi-core
[...]
uwsgi 3388 tools.copyvios 0r CHR 1,3 0t0 25360896 /dev/null
uwsgi 3388 tools.copyvios 1u unix 0xffff9[...]83c800 0t0 25361516 socket
uwsgi 3388 tools.copyvios 2u unix 0xffff9[...]83c800 0t0 25361516 socket
uwsgi 3388 tools.copyvios 3u unix 0xffff9[...]83fc00 0t0 25361510 socket
uwsgi 3388 tools.copyvios 4u REG 0,48 1808831 96539886 /data/project/copyvios/uwsgi.log (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 3388 tools.copyvios 5u unix 0xffff9[...]83d000 0t0 25361515 socket
uwsgi 3388 tools.copyvios 6u unix 0xffff9[...]83c800 0t0 25361516 socket
uwsgi 3388 tools.copyvios 7u IPv4 25361535 0t0 TCP *:8000 (LISTEN)
uwsgi 3388 tools.copyvios 8u 0000 0,13 0 8269 anon_inode
uwsgi 3388 tools.copyvios 10u 0000 0,13 0 8269 anon_inode
uwsgi 3388 tools.copyvios 11u unix 0xffff9[...]f6cc00 0t0 25360995 socket
uwsgi 3388 tools.copyvios 12w REG 0,48 84015 171517789 /data/project/copyvios/git/copyvios/logs/.nfs000000000a39275d0000014a (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 3388 tools.copyvios 13u IPv4 41390617 0t0 TCP copyvios-68555b94f4-hkf8p:48528->text-lb.eqiad.wikimedia.org:443 (CLOSE_WAIT)
uwsgi 3388 tools.copyvios 14u unix 0xffff9[...]810c00 0t0 25360234 socket
uwsgi 3388 tools.copyvios 15u 0000 0,13 0 8269 anon_inode
uwsgi 3388 tools.copyvios 16u IPv4 25360235 0t0 TCP localhost:9001 (LISTEN)
uwsgi 3388 tools.copyvios 17u IPv4 41391532 0t0 TCP copyvios-68555b94f4-hkf8p:39204->clouddb1001.clouddb-services.eqiad1.wikimedia.cloud:3306 (CLOSE_WAIT)
uwsgi 3388 tools.copyvios 18u 0000 0,13 0 8269 anon_inode
uwsgi 3388 tools.copyvios 19w REG 0,48 271594 171517984 /data/project/copyvios/git/copyvios/.earwigbot/logs/bot.log (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 3388 tools.copyvios 20w REG 0,48 8910 171517942 /data/project/copyvios/git/copyvios/.earwigbot/logs/error.log (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 3388 tools.copyvios 21w REG 0,48 2536869 171517661 /data/project/copyvios/git/copyvios/.earwigbot/logs/debug.log.2020-02-22_06 (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 3388 tools.copyvios 22r CHR 1,9 0t0 25360901 /dev/urandom
uwsgi 3388 tools.copyvios 23u IPv4 41434269 0t0 TCP copyvios-68555b94f4-hkf8p:49412->text-lb.eqiad.wikimedia.org:443 (CLOSE_WAIT)
uwsgi 3388 tools.copyvios 24u IPv4 41461310 0t0 TCP copyvios-68555b94f4-hkf8p:40574->clouddb1001.clouddb-services.eqiad1.wikimedia.cloud:3306 (ESTABLISHED)
uwsgi 3388 tools.copyvios 25u IPv4 41406470 0t0 TCP copyvios-68555b94f4-hkf8p:48838->text-lb.eqiad.wikimedia.org:443 (CLOSE_WAIT)
uwsgi 3388 tools.copyvios 26u IPv4 41373364 0t0 TCP copyvios-68555b94f4-hkf8p:38572->clouddb1001.clouddb-services.eqiad1.wikimedia.cloud:3306 (CLOSE_WAIT)
uwsgi 3388 tools.copyvios 27u IPv4 41389383 0t0 TCP copyvios-68555b94f4-hkf8p:8000->192-168-81-4.ingress-nginx.ingress-nginx.svc.tools.local:58172 (CLOSE_WAIT)
uwsgi 3388 tools.copyvios 28u IPv4 41460363 0t0 TCP copyvios-68555b94f4-hkf8p:49900->text-lb.eqiad.wikimedia.org:443 (CLOSE_WAIT)
uwsgi 3388 tools.copyvios 29ur REG 0,48 116736 171448098 /data/project/copyvios/git/copyvios/.earwigbot/exclusions.db (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 3388 tools.copyvios 31u IPv4 41371299 0t0 TCP copyvios-68555b94f4-hkf8p:47782->text-lb.eqiad.wikimedia.org:443 (CLOSE_WAIT)
root@tools-k8s-worker-32:/# lsof -p 3401
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
uwsgi 3401 tools.copyvios cwd DIR 0,48 4096 171443497 /data/project/copyvios/git/copyvios (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 3401 tools.copyvios rtd DIR 0,110 4096 534336 /
uwsgi 3401 tools.copyvios txt REG 0,110 801664 1065028 /usr/bin/uwsgi-core
[...]
uwsgi 3401 tools.copyvios 0r CHR 1,3 0t0 25360896 /dev/null
uwsgi 3401 tools.copyvios 1u unix 0xffff9[...]83c800 0t0 25361516 socket
uwsgi 3401 tools.copyvios 2u unix 0xffff9[...]83c800 0t0 25361516 socket
uwsgi 3401 tools.copyvios 3u unix 0xffff9[...]83fc00 0t0 25361510 socket
uwsgi 3401 tools.copyvios 4u REG 0,48 1808831 96539886 /data/project/copyvios/uwsgi.log (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 3401 tools.copyvios 5u unix 0xffff9[...]83d000 0t0 25361515 socket
uwsgi 3401 tools.copyvios 6u unix 0xffff9[...]83c800 0t0 25361516 socket
uwsgi 3401 tools.copyvios 7u IPv4 25361535 0t0 TCP *:8000 (LISTEN)
uwsgi 3401 tools.copyvios 8u 0000 0,13 0 8269 anon_inode
uwsgi 3401 tools.copyvios 9u unix 0xffff9[...]f6f400 0t0 25360993 socket
uwsgi 3401 tools.copyvios 10u 0000 0,13 0 8269 anon_inode
uwsgi 3401 tools.copyvios 11u 0000 0,13 0 8269 anon_inode
uwsgi 3401 tools.copyvios 12w REG 0,48 84015 171517789 /data/project/copyvios/git/copyvios/logs/.nfs000000000a39275d0000014a (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 3401 tools.copyvios 13u 0000 0,13 0 8269 anon_inode
uwsgi 3401 tools.copyvios 14u unix 0xffff9[...]810c00 0t0 25360234 socket
uwsgi 3401 tools.copyvios 16u IPv4 25360235 0t0 TCP localhost:9001 (LISTEN)
uwsgi 3401 tools.copyvios 17w REG 0,48 271594 171517984 /data/project/copyvios/git/copyvios/.earwigbot/logs/bot.log (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 3401 tools.copyvios 18w REG 0,48 8910 171517942 /data/project/copyvios/git/copyvios/.earwigbot/logs/error.log (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 3401 tools.copyvios 19w REG 0,48 58889 171516931 /data/project/copyvios/git/copyvios/.earwigbot/logs/debug.log (nfs-tools-project.svc.eqiad.wmnet:/project/tools/project)
uwsgi 3401 tools.copyvios 20u IPv4 41493069 0t0 TCP copyvios-68555b94f4-hkf8p:50692->text-lb.eqiad.wikimedia.org:443 (ESTABLISHED)
uwsgi 3401 tools.copyvios 21u IPv4 41392949 0t0 TCP copyvios-68555b94f4-hkf8p:48552->text-lb.eqiad.wikimedia.org:443 (CLOSE_WAIT)
uwsgi 3401 tools.copyvios 22r CHR 1,9 0t0 25360901 /dev/urandom
uwsgi 3401 tools.copyvios 23u IPv4 41436690 0t0 TCP copyvios-68555b94f4-hkf8p:49466->text-lb.eqiad.wikimedia.org:443 (CLOSE_WAIT)
uwsgi 3401 tools.copyvios 24u IPv4 41493090 0t0 TCP copyvios-68555b94f4-hkf8p:41370->clouddb1001.clouddb-services.eqiad1.wikimedia.cloud:3306 (ESTABLISHED)
uwsgi 3401 tools.copyvios 25u IPv4 41451511 0t0 TCP copyvios-68555b94f4-hkf8p:49780->text-lb.eqiad.wikimedia.org:443 (CLOSE_WAIT)

Nothing seems unusual.

I just tried running it. Although it took a long time to load the first time (30 to 60 seconds), it did load; I used it for a real-life issue and it worked. I tried loading again and this time it loaded in a few seconds.

I've been having excellent results for the last few hours.

zhuyifei1999 changed the task status from Open to Stalled. Feb 23 2020, 12:02 AM

(Waiting for failure)

Diannaa changed the task status from Stalled to Open. Feb 23 2020, 1:11 PM

The tool will not load this morning - just sitting there spinning.

Can confirm. A lot of CPU is being used. gdb is taking forever to attach for some reason.

(gdb) bt
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x00007ff04b9a5068 in PyThread_acquire_lock (lock=lock@entry=0x1c36570, waitflag=waitflag@entry=1) at ../Python/thread_pthread.h:324
#2  0x00007ff04ba2c3e6 in PyEval_RestoreThread (tstate=0x1b9fc10) at ../Python/ceval.c:357
#3  0x00007ff04997a7d2 in _pysqlite_fetch_one_row () from /data/project/copyvios/www/python/venv/lib/python2.7/lib-dynload/_sqlite3.x86_64-linux-gnu.so
#4  0x00007ff04997a631 in pysqlite_cursor_iternext () from /data/project/copyvios/www/python/venv/lib/python2.7/lib-dynload/_sqlite3.x86_64-linux-gnu.so
#5  0x00007ff04ba2d93e in PyEval_EvalFrameEx (f=<p at remote 0x7ff03c007e88>, throwflag=1270237784) at ../Python/ceval.c:2510
#6  0x00007ff04ba3227d in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=<optimized out>, func=<optimized out>) at ../Python/ceval.c:4119
#7  call_function (oparg=<optimized out>, pp_stack=<optimized out>) at ../Python/ceval.c:4054
#8  PyEval_EvalFrameEx (f=< at remote 0x1b9fc10>, throwflag=1224238336) at ../Python/ceval.c:2679
#9  0x00007ff04baa5190 in PyEval_EvalCodeEx (co=0x7ff0496022b0, globals=<unknown at remote 0x80>, locals=0x0, args=0x7ff0341060d8, argcount=3, kws=0x7ff01aff8bf8, kwcount=0, defs=0x0, 
    defcount=0, closure=(<cell at remote 0x7ff01ad2ac58>,)) at ../Python/ceval.c:3265
#10 0x00007ff04ba32171 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=<optimized out>, func=<optimized out>) at ../Python/ceval.c:4129
#11 call_function (oparg=<optimized out>, pp_stack=<optimized out>) at ../Python/ceval.c:4054
#12 PyEval_EvalFrameEx (f=<unknown at remote 0x7ff0341060e0>, throwflag=430033928) at ../Python/ceval.c:2679
#13 0x00007ff04baa5190 in PyEval_EvalCodeEx (co=0x7ff0492e0930, globals=<unknown at remote 0x80>, locals=0x0, args=0x1e13d60, argcount=3, kws=0x7ff01aff8bf8, kwcount=0, 
    defs=0x7ff0492eaca8, defcount=1, closure=0x0) at ../Python/ceval.c:3265
#14 0x00007ff04ba32171 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=<optimized out>, func=<optimized out>) at ../Python/ceval.c:4129
#15 call_function (oparg=<optimized out>, pp_stack=<optimized out>) at ../Python/ceval.c:4054

The backtrace does not point to the right Python frame objects... o.O

zhuyifei1999 added a comment.EditedFeb 23 2020, 9:33 PM
(gdb) f
#8  PyEval_EvalFrameEx (f=< at remote 0x1b9fc10>, throwflag=1224238336) at ../Python/ceval.c:2679
2679	in ../Python/ceval.c
(gdb) info reg
rax            0xfffffffffffffe00	-512
rbx            0x7ff048f86500	140669993116928
rcx            0x7ff04f093010	140670094880784
rdx            0x0	0
rsi            0x80	128
rdi            0x1c36570	29582704
rbp            0x7ff0495fbfac	0x7ff0495fbfac
rsp            0x7fff99b48670	0x7fff99b48670
r8             0x3	3
r9             0x7ff01aff8bf8	140669221833720
r10            0x0	0
r11            0x246	582
r12            0x1b9fc10	28965904
r13            0x7ff019865f30	140669197115184
r14            0x3	3
r15            0x7ff03c007ca0	140669775543456
rip            0x7ff04ba3227d	0x7ff04ba3227d <PyEval_EvalFrameEx+22749>
eflags         0x246	[ PF ZF IF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb) p (PyObject *)0x7ff03c007ca0
$11 = Frame 0x7ff03c007ca0, for file /data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/exclusions.py, line 189, in check (self=<ExclusionsDB(_dbfile='.earwigbot/exclusions.db', _logger=<Logger(name='earwigbot.wiki.exclusionsdb', parent=<Logger(name='earwigbot.wiki', parent=<Logger(name='earwigbot', parent=<RootLogger(name='root', parent=None, handlers=[], level=30, disabled=0, propagate=1, filters=[]) at remote 0x7ff04a4c2a10>, handlers=[<TimedRotatingFileHandler(utc=False, interval=86400, backupCount=7, suffix='%Y-%m-%d', stream=<file at remote 0x7ff045b11f60>, encoding=None, lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7ff045983450>, _RLock__count=0) at remote 0x7ff045a8f290>, level=20, when='MIDNIGHT', _name=None, delay=False, rolloverAt=1582502400, baseFilename='/data/project/copyvios/git/copyvios/.earwigbot/logs/bot.log', mode='a', filters=[], extMatch=<_sre.SRE_Pattern at remote 0x7ff0459aac90>, formatter=<BotFormatter(_format=<instancemethod at r...(truncated)

Looking at the disassembly, r15 isn't something I can trust :/

Can't get the frame objects for quite a few frames by enumerating the registers. I can probably walk PyFrameObject's f_back pointers as a linked list, but that is not something the libpython.py gdb script supports. Time for custom code :/

Finally, a reliable backtrace:

# adapted some code from https://github.com/python/cpython/blob/2.7/Tools/gdb/libpython.py

class PyBtLinked(gdb.Command):
    def __init__(self):
        gdb.Command.__init__(self,
                             "py-bt-linked",
                             gdb.COMMAND_STACK,
                             gdb.COMPLETE_NONE)

    def invoke(self, args, from_tty):
        try:
            print('Traceback (most recent call first):')

            def p_stack(filename, lineno, funcname):
                print('  File "{}", line {}, in {}'.format(
                    filename if filename is not None else '<possibly built-in>',
                    lineno, funcname))

                if filename is not None:
                    import os
                    try:
                        cwd = os.readlink('/proc/{}/cwd'.format(gdb.selected_inferior().pid))
                    except IOError:
                        pass
                    else:
                        filename = os.path.join(cwd, filename)

                    try:
                        file = open(filename, 'rb')
                    except IOError:
                        print('    <can\'t open "{}">'.format(filename))
                    else:
                        with file:
                            print('    ' +
                                  file.read().split(b'\n')[int(lineno)-1]
                                  .decode('utf-8').strip())

            def pystr(pystrptr):
                assert pystrptr['ob_type']['tp_name'].string() == 'str'
                pystrptr = pystrptr.cast(gdb.lookup_type('PyStringObject').pointer())
                field_ob_size = int(pystrptr['ob_size'])
                field_ob_sval = pystrptr['ob_sval']
                return ''.join(chr(field_ob_sval[i]) for i in range(field_ob_size))

            def addr2line(code, addrq):
                co_lnotab = pystr(code['co_lnotab'])
                lineno = int(code['co_firstlineno'])

                addr = 0
                for addr_incr, line_incr in zip(co_lnotab[::2], co_lnotab[1::2]):
                    addr += ord(addr_incr)
                    if addr > addrq:
                        return lineno
                    lineno += ord(line_incr)
                return lineno

            filename, lineno = None, '?'

            frame = None

            gdbframe = gdb.newest_frame()
            while not frame:
                gdbframe = gdbframe.older()

                regs = ['rax', 'rbx', 'rcx', 'rdx', 'rsi', 'rdi', 'rbp', 'r8', 'r9', 'r10', 'r11', 'r12', 'r13', 'r14', 'r15']
                for reg in regs:
                    r = gdbframe.read_register(reg)
                    try:
                        r = r.cast(gdb.lookup_type('PyFrameObject').pointer())
                        if r['ob_type']['tp_name'].string() == 'frame':
                            frame = r
                    except Exception:
                        continue

            while frame:
                code = frame['f_code']
                filename = pystr(code['co_filename'])
                lineno = addr2line(code, int(frame['f_lasti']))
                funcname = pystr(code['co_name'])

                p_stack(filename, lineno, funcname)

                frame = frame['f_back']
        except Exception:
            import traceback
            traceback.print_exc()

PyBtLinked()
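The addr2line helper above walks CPython 2's co_lnotab, a flat byte string of (address_increment, line_increment) pairs that maps bytecode offsets to source lines. A standalone sketch of the same decoding, using a made-up table for illustration (the table bytes and function name here are hypothetical, not taken from the dump):

```python
def lnotab_addr2line(lnotab, firstlineno, addrq):
    """Map a bytecode offset to a source line by walking a CPython 2
    co_lnotab: a flat sequence of (addr_increment, line_increment) byte pairs."""
    lineno = firstlineno
    addr = 0
    for addr_incr, line_incr in zip(lnotab[::2], lnotab[1::2]):
        addr += addr_incr
        if addr > addrq:   # first entry past the queried offset: previous line wins
            return lineno
        lineno += line_incr
    return lineno

# Hypothetical table: line advances by 1 at offset 0, by 1 at offset 6, by 2 at offset 14.
table = bytes([0, 1, 6, 1, 8, 2])
print(lnotab_addr2line(table, firstlineno=10, addrq=7))  # → 12
```

(Note this simple walk is only valid for CPython 2's unsigned deltas; later CPython versions use signed line increments and, from 3.10, a different table format entirely.)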

(gdb) thread apply all py-bt-linked

Thread 12 (Thread 0x7ff01b7fe700 (LWP 5373)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 185, in _dequeue
    self._queues.lock.acquire()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 194, in _dequeue
    return self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 11 (Thread 0x7ff01bfff700 (LWP 5372)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 185, in _dequeue
    self._queues.lock.acquire()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 10 (Thread 0x7ff040c94700 (LWP 5371)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/markov.py", line 51, in _build
    chain[phrase] = 1
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/markov.py", line 36, in __init__
    self.chain = self._build()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 224, in _handle_once
    chain = MarkovChain(text) if text else None
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 9 (Thread 0x7ff041495700 (LWP 5370)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 185, in _dequeue
    self._queues.lock.acquire()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 8 (Thread 0x7ff041c96700 (LWP 5369)):
Traceback (most recent call first):
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/pdfminer-20140328-py2.7.egg/pdfminer/psparser.py", line 386, in _parse_float
    try:
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/pdfminer-20140328-py2.7.egg/pdfminer/psparser.py", line 483, in nexttoken
    self.charpos = self._parse1(self.buf, self.charpos)
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/pdfminer-20140328-py2.7.egg/pdfminer/psparser.py", line 557, in nextobject
    (pos, token) = self.nexttoken()
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/pdfminer-20140328-py2.7.egg/pdfminer/pdfdocument.py", line 175, in load
    (_, obj) = parser.nextobject()
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/pdfminer-20140328-py2.7.egg/pdfminer/pdfdocument.py", line 315, in __init__
    xref.load(parser)
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/pdfminer-20140328-py2.7.egg/pdfminer/pdfpage.py", line 121, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/parsers.py", line 293, in parse
    for page in pages:
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 159, in _open_url
    return handler(content, source.parser_args).parse()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 218, in _handle_once
    text = self._open_url(source)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 7 (Thread 0x7ff042497700 (LWP 5368)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 185, in _dequeue
    self._queues.lock.acquire()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 194, in _dequeue
    return self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 6 (Thread 0x7ff042c98700 (LWP 5367)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 185, in _dequeue
    self._queues.lock.acquire()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 5 (Thread 0x7ff043499700 (LWP 5366)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 185, in _dequeue
    self._queues.lock.acquire()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 194, in _dequeue
    return self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 4 (Thread 0x7ff043f9a700 (LWP 5365)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/search.py", line 159, in search
    result = self._open(url + urlencode(params))
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/__init__.py", line 155, in copyvio_check
    workspace.enqueue(searcher.search(chunk), exclude)
  File "./copyvios/checker.py", line 139, in _perform_check
    short_circuit=not query.noskip)
  File "./copyvios/checker.py", line 80, in _get_results
    _perform_check(query, page, use_engine, use_links)
  File "./copyvios/checker.py", line 41, in do_check
    _get_results(query, follow=not _coerce_bool(query.noredirect))
  File "/data/project/copyvios/www/python/src/app.py", line 103, in index
    query = do_check()
  File "/data/project/copyvios/www/python/src/app.py", line 38, in inner
    return func(*args, **kwargs)
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 2309, in __call__
    return self.wsgi_app(environ, start_response)

Thread 3 (Thread 0x7ff04479b700 (LWP 5364)):
Traceback (most recent call first):
  File "./copyvios/checker.py", line 139, in _perform_check
    short_circuit=not query.noskip)
  File "./copyvios/checker.py", line 80, in _get_results
    _perform_check(query, page, use_engine, use_links)
  File "./copyvios/checker.py", line 41, in do_check
    _get_results(query, follow=not _coerce_bool(query.noredirect))
  File "/data/project/copyvios/www/python/src/app.py", line 103, in index
    query = do_check()
  File "/data/project/copyvios/www/python/src/app.py", line 38, in inner
    return func(*args, **kwargs)
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 2309, in __call__
    return self.wsgi_app(environ, start_response)

Thread 2 (Thread 0x7ff044f9c700 (LWP 5363)):
Traceback (most recent call first):
  File "./copyvios/checker.py", line 139, in _perform_check
    short_circuit=not query.noskip)
  File "./copyvios/checker.py", line 80, in _get_results
    _perform_check(query, page, use_engine, use_links)
  File "./copyvios/checker.py", line 41, in do_check
    _get_results(query, follow=not _coerce_bool(query.noredirect))
  File "/data/project/copyvios/www/python/src/app.py", line 103, in index
    query = do_check()
  File "/data/project/copyvios/www/python/src/app.py", line 38, in inner
    return func(*args, **kwargs)
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 2309, in __call__
    return self.wsgi_app(environ, start_response)

Thread 1 (Thread 0x7ff04f4b0780 (LWP 5362)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/exclusions.py", line 189, in check
    for (excl,) in conn.execute(query, (sitename, "all")):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/__init__.py", line 129, in <lambda>
    exclude = lambda u: self._exclusions_db.check(self.site.name, u)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 334, in enqueue
    if exclude_check and exclude_check(url):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/__init__.py", line 145, in copyvio_check
    workspace.enqueue(parser.get_links(), exclude)
  File "./copyvios/checker.py", line 139, in _perform_check
    short_circuit=not query.noskip)
  File "./copyvios/checker.py", line 80, in _get_results
    _perform_check(query, page, use_engine, use_links)
  File "./copyvios/checker.py", line 41, in do_check
    _get_results(query, follow=not _coerce_bool(query.noredirect))
  File "/data/project/copyvios/www/python/src/app.py", line 103, in index
    query = do_check()
  File "/data/project/copyvios/www/python/src/app.py", line 38, in inner
    return func(*args, **kwargs)
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/data/project/copyvios/www/python/venv/local/lib/python2.7/site-packages/flask/app.py", line 2309, in __call__
    return self.wsgi_app(environ, start_response)

@Earwig I'm unfamiliar with the code base. Getting a flame graph profiler will probably be a massive PITA. Do you see anything odd in the backtraces?
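For context, the shape of the first dump (threads 2-12 all parked in `self._queues.lock.acquire()` while thread 1 runs the sqlite exclusions query, plausibly while holding that same queue lock via `enqueue`) is classic lock contention: one slow holder serializes every worker. A minimal, self-contained sketch of that pattern, with purely illustrative names not taken from the tool:

```python
import threading
import time

lock = threading.Lock()
waits = []

def slow_holder():
    # Stand-in for a long-running operation (e.g. a slow sqlite query)
    # performed while holding the shared lock.
    with lock:
        time.sleep(0.2)

def worker():
    # Stand-in for a worker that only needs the lock for a moment,
    # like _dequeue; measure how long it blocks on acquire.
    t0 = time.monotonic()
    with lock:
        waits.append(time.monotonic() - t0)

holder = threading.Thread(target=slow_holder)
holder.start()
time.sleep(0.05)  # let the holder grab the lock first
workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
holder.join()

# Every worker's wait is dominated by the slow holder, not by its own work.
print(all(w > 0.1 for w in waits))  # → True
```

If that is what is happening here, moving the slow work (the exclusions check) outside the critical section, or sharding the lock, would stop one request from stalling all the worker threads.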

(gdb) thread apply all py-bt-linked

Thread 12 (Thread 0x7ff01b7fe700 (LWP 5698)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 171, in _acquire_new_site
    site, queue = self._queues.unassigned.get(timeout=timeout)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 181, in _dequeue
    self._acquire_new_site()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 194, in _dequeue
    return self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 11 (Thread 0x7ff01bfff700 (LWP 5697)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 171, in _acquire_new_site
    site, queue = self._queues.unassigned.get(timeout=timeout)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 181, in _dequeue
    self._acquire_new_site()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 194, in _dequeue
    return self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 10 (Thread 0x7ff040c94700 (LWP 5696)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 171, in _acquire_new_site
    site, queue = self._queues.unassigned.get(timeout=timeout)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 181, in _dequeue
    self._acquire_new_site()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 194, in _dequeue
    return self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
---Type <return> to continue, or q <return> to quit---q
Python Exception <class 'KeyboardInterrupt'> Quit:
Error occurred in Python command: Quit
(gdb) set pagination off
(gdb) thread apply all py-bt-linked

Thread 12 (Thread 0x7ff01b7fe700 (LWP 5698)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 171, in _acquire_new_site
    site, queue = self._queues.unassigned.get(timeout=timeout)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 181, in _dequeue
    self._acquire_new_site()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 194, in _dequeue
    return self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 11 (Thread 0x7ff01bfff700 (LWP 5697)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 171, in _acquire_new_site
    site, queue = self._queues.unassigned.get(timeout=timeout)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 181, in _dequeue
    self._acquire_new_site()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 194, in _dequeue
    return self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 10 (Thread 0x7ff040c94700 (LWP 5696)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 171, in _acquire_new_site
    site, queue = self._queues.unassigned.get(timeout=timeout)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 181, in _dequeue
    self._acquire_new_site()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 194, in _dequeue
    return self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 9 (Thread 0x7ff041495700 (LWP 5695)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 171, in _acquire_new_site
    site, queue = self._queues.unassigned.get(timeout=timeout)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 181, in _dequeue
    self._acquire_new_site()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 194, in _dequeue
    return self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 8 (Thread 0x7ff041c96700 (LWP 5694)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 171, in _acquire_new_site
    site, queue = self._queues.unassigned.get(timeout=timeout)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 181, in _dequeue
    self._acquire_new_site()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 194, in _dequeue
    return self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 7 (Thread 0x7ff042497700 (LWP 5693)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 171, in _acquire_new_site
    site, queue = self._queues.unassigned.get(timeout=timeout)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 181, in _dequeue
    self._acquire_new_site()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 194, in _dequeue
    return self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 6 (Thread 0x7ff042c98700 (LWP 5692)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 171, in _acquire_new_site
    site, queue = self._queues.unassigned.get(timeout=timeout)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 181, in _dequeue
    self._acquire_new_site()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 194, in _dequeue
    return self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 5 (Thread 0x7ff043499700 (LWP 5691)):
Traceback (most recent call first):
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 171, in _acquire_new_site
    site, queue = self._queues.unassigned.get(timeout=timeout)
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 181, in _dequeue
    self._acquire_new_site()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 194, in _dequeue
    return self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 209, in _handle_once
    source = self._dequeue()
  File "/data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/workers.py", line 236, in _run
    if not self._handle_once():
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 783, in __bootstrap
    self.__bootstrap_inner()

Thread 4 (Thread 0x7ff043f9a700 (LWP 5690)):
Traceback (most recent call first):
Traceback (most recent call last):
  File "/root/py-bt.py", line 66, in invoke
    r = gdbframe.read_register(reg)
AttributeError: 'NoneType' object has no attribute 'read_register'

Thread 3 (Thread 0x7ff04479b700 (LWP 5689)):
Traceback (most recent call first):
Traceback (most recent call last):
  File "/root/py-bt.py", line 66, in invoke
    r = gdbframe.read_register(reg)
AttributeError: 'NoneType' object has no attribute 'read_register'

Thread 2 (Thread 0x7ff044f9c700 (LWP 5688)):
Traceback (most recent call first):
Traceback (most recent call last):
  File "/root/py-bt.py", line 66, in invoke
    r = gdbframe.read_register(reg)
AttributeError: 'NoneType' object has no attribute 'read_register'

Thread 1 (Thread 0x7ff04f4b0780 (LWP 5687)):
Traceback (most recent call first):
Traceback (most recent call last):
  File "/root/py-bt.py", line 66, in invoke
    r = gdbframe.read_register(reg)
AttributeError: 'NoneType' object has no attribute 'read_register'
(gdb) thread apply 1 bt

Thread 1 (Thread 0x7ff04f4b0780 (LWP 5687)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007ff04f08f479 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
241#2 0x00007ff04f08f2a0 in __GI___pthread_mutex_lock (mutex=0x6c5538 <uwsgi+6328>) at ../nptl/pthread_mutex_lock.c:79
242#3 0x0000000000417590 in wsgi_req_accept ()
243#4 0x000000000045f7a6 in simple_loop_run ()
244#5 0x0000000000463c78 in uwsgi_ignition ()
245#6 0x0000000000466fbd in uwsgi_worker_run ()
246#7 0x00000000004674de in uwsgi_run ()
247#8 0x000000000041699e in main ()
248(gdb) thread apply 2 bt
249
250Thread 2 (Thread 0x7ff044f9c700 (LWP 5688)):
251#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
252#1 0x00007ff04f08f479 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
253#2 0x00007ff04f08f2a0 in __GI___pthread_mutex_lock (mutex=0x6c5538 <uwsgi+6328>) at ../nptl/pthread_mutex_lock.c:79
254#3 0x0000000000417590 in wsgi_req_accept ()
255#4 0x000000000045f7a6 in simple_loop_run ()
256#5 0x00007ff04f08d064 in start_thread (arg=0x7ff044f9c700) at pthread_create.c:309
257#6 0x00007ff04d1c062d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
258(gdb) thread apply 3 bt
259
260Thread 3 (Thread 0x7ff04479b700 (LWP 5689)):
261#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
262#1 0x00007ff04f08f479 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
263#2 0x00007ff04f08f2a0 in __GI___pthread_mutex_lock (mutex=0x6c5538 <uwsgi+6328>) at ../nptl/pthread_mutex_lock.c:79
264#3 0x0000000000417590 in wsgi_req_accept ()
265#4 0x000000000045f7a6 in simple_loop_run ()
266#5 0x00007ff04f08d064 in start_thread (arg=0x7ff04479b700) at pthread_create.c:309
267#6 0x00007ff04d1c062d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
268(gdb) thread apply 4 bt
269
270Thread 4 (Thread 0x7ff043f9a700 (LWP 5690)):
271#0 0x00007ff04d1c0c03 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
272#1 0x0000000000456ed3 in event_queue_wait ()
273#2 0x0000000000417602 in wsgi_req_accept ()
274#3 0x000000000045f7a6 in simple_loop_run ()
275#4 0x00007ff04f08d064 in start_thread (arg=0x7ff043f9a700) at pthread_create.c:309
276#5 0x00007ff04d1c062d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

This is the other process.


Looks like all of them are idle. I might have messed some stuff up and uwsgi got restarted, clearing the queue.

I see in uwsgi.log:

Sun Feb 23 23:27:52 2020 - worker 1 (pid: 5362) is taking too much time to die...NO MERCY !!!
DAMN ! worker 1 (pid: 5362) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 1 (new pid: 5753)
[...]
DAMN ! worker 2 (pid: 5687) died, killed by signal 5 :( trying respawn ...
Respawned uWSGI worker 2 (new pid: 5758)

I assume is taking too much time to die...NO MERCY !!! means uwsgi killed its own worker... but why?
Signal 5 is SIGTRAP, which should be a gdb internal and should get caught by gdb... how did it get delivered to uwsgi?

Though, I don't see the uWSGI listen queue of socket ":8000" (fd: 7) full messages mentioned earlier.

Nothing unusual in the backtraces. In the one where there is actual activity, there's a PDF being parsed in one thread, ngrams being generated in another, and a URL being opened in another. These are all standard steps as part of a normal check/comparison.
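To make the ngram step concrete, here is a generic sketch of how ngram generation works in this kind of text comparison (this is an illustrative helper, not EarwigBot's actual implementation):

```python
def ngrams(words, n=3):
    """Yield each run of n consecutive words as a tuple."""
    for i in range(len(words) - n + 1):
        yield tuple(words[i : i + n])

# Comparing two texts then reduces to intersecting their ngram sets.
text = "the quick brown fox jumps over the lazy dog"
grams = list(ngrams(text.split(), n=3))
```

Generating and intersecting these sets is linear in the text length, but doing it for a large article against many candidate sources adds up quickly, which is why a check can legitimately spend noticeable CPU time in this phase.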

is taking too much time to die...NO MERCY !!!

There is a limit on how many requests a single worker will serve before uwsgi restarts it, as a band-aid against memory leaks (this is one limit I turned down recently to lower memory usage, though I don't think it's made a huge difference). It's possible this message is printed if uwsgi can't kill the workers in time (perhaps because their requests are taking too long to complete), though I'm not sure. Alternatively, if someone caused uwsgi to restart/reload, it would do this.
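For context, the per-worker request cap described here is usually configured with uwsgi's `max-requests` option, and the grace period before the "NO MERCY" kill with `worker-reload-mercy`. A hypothetical uwsgi.ini fragment (values are illustrative, not the tool's actual settings):

```ini
[uwsgi]
socket = :8000
processes = 2
threads = 4
; Recycle each worker after this many requests, as a band-aid
; against memory leaks.
max-requests = 500
; Seconds a worker gets to finish in-flight requests before uwsgi
; kills it with SIGKILL ("...is taking too much time to die...NO MERCY !!!").
worker-reload-mercy = 60
```

If in-flight checks regularly run longer than the mercy window, workers will be killed mid-request when they hit the cap, which matches the log lines above.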

uWSGI listen queue of socket ":8000" (fd: 7) full

This requires a lot of requests to get backed up and won't happen all the time, given how variable usage is, even if the underlying problem remains.
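To make the "(101/100)" concrete: uwsgi's listen queue holds at most 100 pending connections by default, and the message fires when a 101st connection arrives before any worker can accept one. The saturation behavior can be modeled with a bounded queue (illustrative sketch only, not uwsgi's actual mechanism):

```python
import queue

# Model uwsgi's listen queue as a bounded queue of size 100.
backlog = queue.Queue(maxsize=100)

# The first 100 pending requests queue up fine.
for request in range(100):
    backlog.put_nowait(request)

# The 101st overflows -- the analogue of "full !!! (101/100)".
try:
    backlog.put_nowait(100)
    overflowed = False
except queue.Full:
    overflowed = True
```

Once the queue is full, new clients see errors or timeouts at the gateway even though the workers themselves are still making (slow) progress.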

perhaps because their requests are taking too long to complete

Makes sense. I was gdb-ing it for a while, effectively pausing the execution.

I guess a flame graph is the way to go then. Gotta find the bottleneck.

...The work of process 5885 is done. Seeya!
worker 2 killed successfully (pid: 5885)
Respawned uWSGI worker 2 (new pid: 5911)

random uwsgi killing? o.O

Anyway, it still seems to be responding. I'll try to profile when it's badly unresponsive.

No, not random killing; this is the limit I mentioned earlier on the number of requests a single worker is allowed to handle before uwsgi restarts it.

zhuyifei1999 changed the task status from Open to Stalled.Feb 24 2020, 7:39 AM

I see. Thanks for explaining.

I'm getting 502 Bad Gateway at the moment.

Diannaa added a comment.EditedFri, Mar 6, 11:58 PM

Today was an exceptionally bad day, with long outages and a few short windows of opportunity where the tool could be used. We really need this tool. :(

DavidBrooks added a comment.EditedSat, Mar 7, 12:28 AM

From the for-what-it's-worth department, I have thrown together a clean-room simplified implementation of only the "URL comparison" feature in a Windows exe creatively called "Copyvios". It has limited applicability (i.e. it scratches my own personal itch so far as the exterior URL is concerned) explained in its README. It does the comparison in a few seconds at most, and the result is broadly comparable with the online tool.

Yes, I know, people who follow The Earwig's talk page have already seen this promo.

It won't ever do the Google search feature.

Source at https://github.com/DavidWBrooks/Copyvios. Be gentle because I don't have experience with pull requests etc. If you trust me, a built binary is at https://1drv.ms/u/s!AtuCZY0YF4hGop43fHSS2GiSK2fPUw?e=goT44W (.net 4.5+ required). There's a screenshot in both locations.

zhuyifei1999 changed the task status from Stalled to Open.Sat, Mar 7, 2:42 AM

Can confirm that there are lots of messages like this in the logs today:

Fri Mar  6 22:13:34 2020 - *** uWSGI listen queue of socket ":8000" (fd: 7) full !!! (101/100) ***

Waiting for the next time.

MusikAnimal added a comment.EditedSun, Mar 8, 7:35 AM

Can confirm that there are lots of messages like this in the logs today:
...
Waiting for the next time.

If you'd like I can add you to our UptimeRobot reports, where you get emailed as soon as there's downtime: https://stats.uptimerobot.com/BN16RUOP5/784331770 (or have Earwig add you to copyvios as a maintainer, since they all get emails). Indeed, on March 6 the copyvios tool was available for just 36% of the day!

Would it make sense, or is it even possible, to move the tool to the grid until we get the k8s issues sorted out? We could spin up a new test tool, copyvios-test, where we iron out k8s support before reattempting it in production.

I also still worry about bots. If we could get some UA logging we could at least get a better idea of what we're up against. In my other tools, bots significantly affected uptime. Copyvios is a very process-heavy tool that's linked to from all over the wikis, making it a likely victim. I don't think requiring OAuth is unreasonable for this power-user tool, and maybe key-based access for the API.

If you'd like I can add you to our UptimeRobot reports, where you get emailed as soon as there's downtime

Yes please. I've been tailing uwsgi.log and every time it's down I miss it :(.

zhuyifei1999 [at] gmail [dot] com

If you'd like I can add you to our UptimeRobot reports, where you get emailed as soon as there's downtime

Yes please. I've been tailing uwsgi.log and every time it's down I miss it :(.

You just need to confirm your email and then you're all set. Also make sure alert@uptimerobot.com is on your contact list, since Gmail mistakes it for spam.

Earwig added a comment.Mon, Mar 9, 4:53 AM

I also still worry about bots. If we could get some UA logging we could at least get a better idea of what we're up against.

I have been logging user agents in uwsgi.log for a few weeks now. When I first turned it on, I was able to block several bots that had been getting through the previous blacklist. It seemed to solve the quota issues we had been having. I think most of the current traffic is real human traffic (or authorized bots, none of whom are using the tool for actual searches, afaict).
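A user-agent blacklist of the kind described can be sketched like this (the patterns here are hypothetical; the tool's actual blacklist is not public):

```python
import re

# Hypothetical blacklist patterns; real deployments would tune these
# based on what shows up in the access logs.
BLACKLIST = [
    r"python-requests",
    r"curl/",
    r"\bbot\b",
]
_blacklist_re = re.compile("|".join(BLACKLIST), re.IGNORECASE)


def is_blocked(user_agent):
    """Return True if the User-Agent matches a blacklisted pattern."""
    if not user_agent:
        return True  # treat a missing UA as suspicious
    return bool(_blacklist_re.search(user_agent))
```

A check like this runs before any expensive work, so blocked clients cost almost nothing and can't contribute to the listen-queue backlog.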

zhuyifei1999 added a comment.EditedWed, Mar 11, 8:20 PM

You just need to confirm your email and then you're all set. Also make sure alert@uptimerobot.com is on your contact list, since Gmail mistakes it for spam.

I did both on Mar 8. Haven't received anything.

Clarification: What I did was add a filter so the address is never tagged as spam. It did not even appear in the spam box.

You just need to confirm your email and then you're all set. Also make sure alert@uptimerobot.com is on your contact list, since Gmail mistakes it for spam.

I did both on Mar 8. Haven't received anything.
Clarification: What I did was add a filter so the address is never tagged as spam. It did not even appear in the spam box.

Sorry about that, I needed to re-confirm on my end after you confirmed your email. You should get them now, and if not check the spam box.

I used the tool 5 minutes ago, and a "normal" analysis takes 5+ minutes.

I did this analysis: https://tools.wmflabs.org/copyvios/?lang=de&project=wikipedia&title=&oldid=197860890&action=search&use_engine=1&use_links=1&turnitin=1

My bot that uses the tool has been paused for some weeks.

Is it really slow even when it's not heavily loaded? In that case, if you could produce a test case, I'll see if I can find what is taking the time.

Hm, I've raised the timeout on my bot to 300; now it's mostly working, and I haven't had any big problems in the last few days.

Downtime today over 5 hours now according to uptime robot.

Thu Mar 26 04:40:44 2020 - *** uWSGI listen queue of socket ":8000" (fd: 7) full !!! (101/100) ***

is happening.

strace: full of futex(2)...

Considering that previous attempts to figure out what is going on using gdb were very invasive and could kill uwsgi, I'm going to try py-spy.

60-second, 25-samples-per-second profile results for both processes:


The former looks like it spends a lot of time in pdfminer; the latter spends a lot of time in bs4 and pdfminer.

I'm not able to tell whether the graphs are unnatural -- I mean, parsing, which is what it spends most of its time on, is an inherently expensive problem. Perhaps @Earwig can tell whether the graphs look as expected? If so, then there is probably little I can do other than say there is just not enough CPU power to serve the requests in a timely manner under k8s's CPU-limiting cgroups.
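For reference, a 60-second, 25 Hz capture like the one described above would be invoked with py-spy roughly as follows (the pid 1234 is a placeholder for the uwsgi worker's pid):

```shell
# Snapshot the current Python stacks of a running process
# (non-invasive, unlike attaching gdb):
py-spy dump --pid 1234

# Record a flame graph: 25 samples per second for 60 seconds.
py-spy record --pid 1234 --rate 25 --duration 60 --output profile.svg
```

Because py-spy samples the process from outside rather than stopping it, it avoids the SIGTRAP/"NO MERCY" worker deaths seen with gdb earlier in this thread.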

How long is too long? I really have no feel for what the official parsing logic does, but (with obvious apologies for still appearing to self-promote) my fairly simple-minded approach currently runs in 3.5 seconds from the end of download to installing the highlighted text in the UI control, for two fairly large articles ("West Virginia" in both WP and EB1911), and last time I used Earwig's tool I was getting pretty comparable results. That's on my Surface 4, an Intel i5 2-core laptop. I mean, if that's significantly longer than the server, then I'll step back, but otherwise you're free to re-use the logic (I realize C# to py may be a problem).

There's another optimization that I'll upload later today and shave off another half second.

@DavidBrooks and others: In the Earwig's Copyvio Detector window it's written: "We are still investigating recent performance issues/timeouts on the new infrastructure."

On the new infrastructure? I wonder why Copyvios cannot run on the old infrastructure as before until the performance issues and timeouts on the new infrastructure are fixed, so we can all use it as before in the meantime.

Can you tell me more about this?

Thank you