We've been having issues with 504 Gateway Time-out errors with this tool since about February 12. The statistics shown at https://stats.uptimerobot.com/BN16RUOP5/784331770 are deceptive; even when the tool does load, I am getting a 504 gateway time-out error almost every time I try to perform a comparison.
Related Objects
- Mentioned In
  - T246592: Establish processes for quota increases in Toolforge Kubernetes
- Mentioned Here
  - T246592: Establish processes for quota increases in Toolforge Kubernetes
  - P10481 (An Untitled Masterwork)
  - P10479 py-bt-linked.py
  - P10480 (An Untitled Masterwork)
  - P10476 (An Untitled Masterwork)
  - T244107: Copyvios tool webservice failed to start on new Kubernetes cluster
  - T243736: Earwig's Copyvio Detector down due to "The supplied API key is not configured for use from this IP address." from google-api-proxy
Event Timeline
This is a very widely used tool that assists in administration and legal compliance, and as such needs quick resolution. Thank you @Earwig for beginning to tackle this.
The 'priority' field in Phabricator means something, but it does not mean "the perceived importance of a bug by those affected by the bug" as used in the quoted edit.
- https://www.mediawiki.org/wiki/Phabricator/Project_management#Priority_levels
- "Report status and priority fields summarize and reflect reality and do not cause it." -- https://www.mediawiki.org/wiki/Bug_management/Phabricator_etiquette
There are no namespace resource limits on the legacy Kubernetes cluster in Toolforge, so honestly I'm not completely surprised by a tool which has done a lot of manual tuning (T244107#5848724) having problems on the 2020 cluster. This is dangerously close to WP:BEANS, but on the legacy Kubernetes cluster the kernel's OOMKiller won't kick in until the entire host node is in memory trouble rather than when the Docker container's resource limit is reached.
The $HOME/uwsgi.log of the tool is flooded with lines like:
Tue Feb 18 14:11:33 2020 - *** uWSGI listen queue of socket ":8000" (fd: 7) full !!! (101/100) ***
These lines mean that the uWSGI processes' HTTP client listener backlog is being exceeded. That in turn means that requests are coming in faster than they can be processed. From what I know of what copyvios does to process a request this is not entirely surprising. I think that the 101st request waiting in line would get a different error than a 504, but I have not set up a test environment to verify that yet.
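(For reference, the backlog being overflowed here is uwsgi's listen option, whose default of 100 matches the "(101/100)" in the log. A minimal sketch of the knob, with an illustrative value rather than the tool's actual config; raising it only takes effect if the kernel's net.core.somaxconn is at least as large, and it only buys queue space, not throughput:)

[uwsgi]
; accept backlog for the ":8000" socket; uwsgi's default is 100
listen = 500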
The large accept backlog described above made me think, however, of another possible problem that would result in 504 responses.
In the nginx front proxy for tools.wmflabs.org (dynamicproxy, a.k.a. "yuviproxy"), we have configuration that says:
# Some projects have tools that take data in and process them
# for a long time. While ideally they should be made async, this
# is an interim solution that works for now.
# This is doubly important for websockets
proxy_read_timeout 3600s;
This gives any arbitrary tool 60 minutes to respond to a request. If no response happens in that time then the front proxy would return a 504 response to the caller. The default value for proxy_read_timeout is 60 seconds, which is quite likely not long enough for some (all?) copyvios requests to be handled.
A new thing in the 2020 Kubernetes cluster is the introduction of a second layer of nginx front proxy. This second proxy is a deployment of kubernetes/ingress-nginx which we are using as our bare-metal ingress layer. This nginx service uses a fairly complex combination of templates, ConfigMaps, and annotations on Ingress objects. I checked the live configuration of this proxy layer for the copyvios tool and found this generated config:
## start server copyvios.toolforge.org
server {
    server_name copyvios.toolforge.org ;
    listen 8080 ;
    listen 443 ssl http2 ;
    set $proxy_upstream_name "-";
    location / {
        set $namespace "tool-copyvios";
        set $ingress_name "copyvios-subdomain";
        set $service_name "copyvios";
        set $service_port "{0 8000 }";
        set $location_path "/";
        rewrite_by_lua_block {
            lua_ingress.rewrite({
                force_ssl_redirect = false,
                use_port_in_redirects = false,
            })
            balancer.rewrite()
            plugins.run()
        }
        header_filter_by_lua_block {
            plugins.run()
        }
        body_filter_by_lua_block {
        }
        log_by_lua_block {
            balancer.log()
            monitor.call()
            plugins.run()
        }
        port_in_redirect off;
        set $balancer_ewma_score -1;
        set $proxy_upstream_name "tool-copyvios-copyvios-8000";
        set $proxy_host $proxy_upstream_name;
        set $pass_access_scheme $scheme;
        set $pass_server_port $server_port;
        set $best_http_host $http_host;
        set $pass_port $pass_server_port;
        set $proxy_alternative_upstream_name "";
        client_max_body_size 128M;
        proxy_set_header Host $best_http_host;
        # Pass the extracted client certificate to the backend
        # Allow websocket connections
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header X-Request-ID $req_id;
        proxy_set_header X-Real-IP $the_real_ip;
        proxy_set_header X-Forwarded-For $the_real_ip;
        proxy_set_header X-Forwarded-Host $best_http_host;
        proxy_set_header X-Forwarded-Port $pass_port;
        proxy_set_header X-Forwarded-Proto $pass_access_scheme;
        proxy_set_header X-Original-URI $request_uri;
        proxy_set_header X-Scheme $pass_access_scheme;
        # Pass the original X-Forwarded-For
        proxy_set_header X-Original-Forwarded-For $http_x_forwarded_for;
        # mitigate HTTPoxy Vulnerability
        # https://www.nginx.com/blog/mitigating-the-httpoxy-vulnerability-with-nginx/
        proxy_set_header Proxy "";
        # Custom headers to proxied server
        proxy_connect_timeout 5s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;
        proxy_buffering off;
        proxy_buffer_size 4k;
        proxy_buffers 4 4k;
        proxy_request_buffering on;
        proxy_http_version 1.1;
        proxy_cookie_domain off;
        proxy_cookie_path off;
        # In case of errors try the next upstream server before returning an error
        proxy_next_upstream error timeout;
        proxy_next_upstream_timeout 0;
        proxy_next_upstream_tries 3;
        proxy_pass http://upstream_balancer;
        proxy_redirect off;
    }
}
## end server copyvios.toolforge.org
Buried in there is the line proxy_read_timeout 60s;. That's default config that we need to change for all tools to preserve the extended timeout behavior that would have been seen when using the grid engine or legacy Kubernetes cluster. I will work on a configuration patch for that.
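For anyone following along, ingress-nginx reads global settings like this from its ConfigMap (the "nginx-configuration" ConfigMap mentioned further down). A rough sketch of the kind of data change involved, assuming that ConfigMap name and the usual ingress-nginx namespace; the values mirror dynamicproxy's 3600s, and the real change is managed through the puppet patch noted in the next comment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
  namespace: ingress-nginx        # namespace is an assumption
data:
  proxy-read-timeout: "3600"      # seconds; mirrors dynamicproxy's proxy_read_timeout 3600s
  proxy-send-timeout: "3600"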
Change 573016 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: increase nginx-ingress proxy_read_timeout to match dynamicproxy
This patch is unmerged at the time I write this comment, but the change it represents has been made on the live cluster, so this one difference from the legacy cluster usage should be resolved.
Mentioned in SAL (#wikimedia-cloud) [2020-02-19T00:59:41Z] <bd808> Live hacked the "nginx-configuration" ConfigMap for T245426 (done several hours ago, but I forgot to !log it)
Well, I'm not getting a gateway timeout now, but I've had a query spinning for something like twenty minutes. Perhaps a different issue? My queries usually complete in 90-180 seconds. Update: Additional queries still seem to fail with Gateway Timeout.
Change 573016 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: increase nginx-ingress proxy_read_timeout to match dynamicproxy
Slow responses from the tool are likely a different problem. And possibly the cause of the *** uWSGI listen queue of socket ":8000" (fd: 7) full !!! (101/100) *** log spam. @Earwig will need to figure that out, but he may need special help from me or another Toolforge admin to do that debugging based on comments in T245426#5891797
@zhuyifei1999 can I nerd snipe you into helping @Earwig debug a misbehaving Flask app on the 2020 Kubernetes cluster? I know you are better at that kind of deep dive than I am. :)
Is anybody working on this? The tool is non-functional at present, timing out every time I try to perform a search. I am not getting the gateway time-out, but rather when attempting a comparison it just sits and spins for a lengthy period (5 minutes in the most recent one), eventually returning an error message such as "The URL http://www.isapindia.org/Newsletter/Feb16.html timed out before any data could be retrieved."
@Diannaa, I'm sincerely sorry for the recent problems. I've tried a few different configuration adjustments to get it to play nicely on the new infrastructure but nothing seems to help much. It seems there simply isn't enough available CPU with the current quota for the tool to do its job properly, or there is some other new limitation I can't figure out. This tells me the only solution will be a much more careful performance analysis that might require some substantial reworking of the code. I want to work on this, but I'm stuck due to some Real Life things that are eating my free time/energy for the moment. I can't promise a specific ETA.
@Earwig I was going to ask: Purely selfishly, how easy would it be to break out the "URL comparison" feature into a separate tool? Presumably it uses fewer resources, and (as I said, purely selfishly) it's the only feature I use.
But, as I was deciding to post this, I went back and found that the URL feature does, indeed, seem to be working, at least for a small number of tries. Is it possible that the changes you've made have, at least, fixed that?
Is this concerning the URL https://tools.wmflabs.org/copyvios/? If so, it is not giving me 504 right now. Mind pinging when it does?
Since there are mentions of some sockets above, I checked whether the fds look sane. Installed lsof and strace inside the pod, and:
Nothing seems unusual.
I just tried running it. Although it took a long time to load the first time (30 to 60 seconds) it did load; I used it for a real-life issue and it worked. I tried loading again and this time it loaded in a few seconds.
Can confirm. A lot of CPU is being used. gdb is taking forever to attach for some reason.
(gdb) bt
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x00007ff04b9a5068 in PyThread_acquire_lock (lock=lock@entry=0x1c36570, waitflag=waitflag@entry=1) at ../Python/thread_pthread.h:324
#2  0x00007ff04ba2c3e6 in PyEval_RestoreThread (tstate=0x1b9fc10) at ../Python/ceval.c:357
#3  0x00007ff04997a7d2 in _pysqlite_fetch_one_row () from /data/project/copyvios/www/python/venv/lib/python2.7/lib-dynload/_sqlite3.x86_64-linux-gnu.so
#4  0x00007ff04997a631 in pysqlite_cursor_iternext () from /data/project/copyvios/www/python/venv/lib/python2.7/lib-dynload/_sqlite3.x86_64-linux-gnu.so
#5  0x00007ff04ba2d93e in PyEval_EvalFrameEx (f=<p at remote 0x7ff03c007e88>, throwflag=1270237784) at ../Python/ceval.c:2510
#6  0x00007ff04ba3227d in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=<optimized out>, func=<optimized out>) at ../Python/ceval.c:4119
#7  call_function (oparg=<optimized out>, pp_stack=<optimized out>) at ../Python/ceval.c:4054
#8  PyEval_EvalFrameEx (f=< at remote 0x1b9fc10>, throwflag=1224238336) at ../Python/ceval.c:2679
#9  0x00007ff04baa5190 in PyEval_EvalCodeEx (co=0x7ff0496022b0, globals=<unknown at remote 0x80>, locals=0x0, args=0x7ff0341060d8, argcount=3, kws=0x7ff01aff8bf8, kwcount=0, defs=0x0, defcount=0, closure=(<cell at remote 0x7ff01ad2ac58>,)) at ../Python/ceval.c:3265
#10 0x00007ff04ba32171 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=<optimized out>, func=<optimized out>) at ../Python/ceval.c:4129
#11 call_function (oparg=<optimized out>, pp_stack=<optimized out>) at ../Python/ceval.c:4054
#12 PyEval_EvalFrameEx (f=<unknown at remote 0x7ff0341060e0>, throwflag=430033928) at ../Python/ceval.c:2679
#13 0x00007ff04baa5190 in PyEval_EvalCodeEx (co=0x7ff0492e0930, globals=<unknown at remote 0x80>, locals=0x0, args=0x1e13d60, argcount=3, kws=0x7ff01aff8bf8, kwcount=0, defs=0x7ff0492eaca8, defcount=1, closure=0x0) at ../Python/ceval.c:3265
#14 0x00007ff04ba32171 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=<optimized out>, func=<optimized out>) at ../Python/ceval.c:4129
#15 call_function (oparg=<optimized out>, pp_stack=<optimized out>) at ../Python/ceval.c:4054
The backtrace does not point to the right Python frame objects... o.O
(gdb) f
#8  PyEval_EvalFrameEx (f=< at remote 0x1b9fc10>, throwflag=1224238336) at ../Python/ceval.c:2679
2679	in ../Python/ceval.c
(gdb) info reg
rax            0xfffffffffffffe00	-512
rbx            0x7ff048f86500	140669993116928
rcx            0x7ff04f093010	140670094880784
rdx            0x0	0
rsi            0x80	128
rdi            0x1c36570	29582704
rbp            0x7ff0495fbfac	0x7ff0495fbfac
rsp            0x7fff99b48670	0x7fff99b48670
r8             0x3	3
r9             0x7ff01aff8bf8	140669221833720
r10            0x0	0
r11            0x246	582
r12            0x1b9fc10	28965904
r13            0x7ff019865f30	140669197115184
r14            0x3	3
r15            0x7ff03c007ca0	140669775543456
rip            0x7ff04ba3227d	0x7ff04ba3227d <PyEval_EvalFrameEx+22749>
eflags         0x246	[ PF ZF IF ]
cs             0x33	51
ss             0x2b	43
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb) p (PyObject *)0x7ff03c007ca0
$11 = Frame 0x7ff03c007ca0, for file /data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/exclusions.py, line 189, in check (self=<ExclusionsDB(_dbfile='.earwigbot/exclusions.db', _logger=<Logger(name='earwigbot.wiki.exclusionsdb', parent=<Logger(name='earwigbot.wiki', parent=<Logger(name='earwigbot', parent=<RootLogger(name='root', parent=None, handlers=[], level=30, disabled=0, propagate=1, filters=[]) at remote 0x7ff04a4c2a10>, handlers=[<TimedRotatingFileHandler(utc=False, interval=86400, backupCount=7, suffix='%Y-%m-%d', stream=<file at remote 0x7ff045b11f60>, encoding=None, lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7ff045983450>, _RLock__count=0) at remote 0x7ff045a8f290>, level=20, when='MIDNIGHT', _name=None, delay=False, rolloverAt=1582502400, baseFilename='/data/project/copyvios/git/copyvios/.earwigbot/logs/bot.log', mode='a', filters=[], extMatch=<_sre.SRE_Pattern at remote 0x7ff0459aac90>, formatter=<BotFormatter(_format=<instancemethod at r...(truncated)
Looking at the disassembly r15 isn't something I can trust :/
Can't get the frame object for quite a few frames by enumerating the registers. I can probably use PyFrameObject's f_back as a linked list but that is not something libpython.py gdb script supports. Time for custom code :/
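For the record, the custom code would look roughly like the sketch below. This is an assumption about the approach, not necessarily what the py-bt-linked.py paste actually does; it reuses the libpython pretty-printer the same way the manual p (PyObject *) above does.

# walk_frames.py -- load inside gdb with "source walk_frames.py"
import gdb

def walk_py_frames(addr):
    """Follow PyFrameObject.f_back from a known frame address and print each
    Python-level frame (CPython 2.7 struct layout assumed)."""
    frame = gdb.parse_and_eval('(PyFrameObject *) %s' % addr)
    while int(frame) != 0:
        # casting to PyObject * lets libpython.py render the "Frame ..." repr
        print(gdb.execute('p (PyObject *) %d' % int(frame), to_string=True))
        frame = frame['f_back']

# e.g. walk_py_frames('0x7ff03c007ca0')  # frame address found in r15 above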
@Earwig I'm unfamiliar with the code base. Getting a flame graph profiler will probably be a massive PITA. Do you see anything odd in the backtraces?
This is the other process.
Looks like all of them are idle. Might have messed some stuff up and uwsgi got restarted, clearing the queue.
I see this in the uwsgi.log:
Sun Feb 23 23:27:52 2020 - worker 1 (pid: 5362) is taking too much time to die...NO MERCY !!!
DAMN ! worker 1 (pid: 5362) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 1 (new pid: 5753)
[...]
DAMN ! worker 2 (pid: 5687) died, killed by signal 5 :( trying respawn ...
Respawned uWSGI worker 2 (new pid: 5758)
I assume is taking too much time to die...NO MERCY !!! means uwsgi killed itself... why?
signal 5 is SIGTRAP, should be some gdb internals and should get caught by gdb... how did it get sent to the uwsgi?
Though, I don't see the uWSGI listen queue of socket ":8000" (fd: 7) full messages mentioned earlier.
Nothing unusual in the backtraces. In the one where there is actual activity, there's a PDF being parsed in one thread, ngrams being generated in another, and a URL being opened in another. These are all standard steps as part of a normal check/comparison.
is taking too much time to die...NO MERCY !!!
There is a limit on how many requests a single worker will serve before uwsgi restarts it, as a band-aid against memory leaks (this is one limit I turned down recently to lower memory usage, though I don't think it's made a huge difference). It's possible this message is printed if uwsgi can't kill the workers in time (perhaps because their requests are taking too long to complete), though I'm not sure. Alternatively, if someone caused uwsgi to restart/reload, it would do this.
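(For context, the two uwsgi options involved look roughly like this; the values are illustrative, not the tool's actual config. max-requests is the recycle limit described above, and worker-reload-mercy is the grace period after which uwsgi prints the "NO MERCY" line and sends SIGKILL:)

[uwsgi]
max-requests = 500          ; recycle each worker after this many requests
worker-reload-mercy = 60    ; seconds a recycled worker gets to finish before SIGKILL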
uWSGI listen queue of socket ":8000" (fd: 7) full
This requires a lot of requests to get backed up and won't happen all the time, given how variable usage is, even if the underlying problem remains.
perhaps because their requests are taking too long to complete
Makes sense. I was gdb-ing it for a while, effectively pausing the execution.
...The work of process 5885 is done. Seeya!
worker 2 killed successfully (pid: 5885)
Respawned uWSGI worker 2 (new pid: 5911)
random uwsgi killing? o.O
Anyway, it still seems to be responding. Will try to profile when it's badly unresponsive.
No, not random killing, this is the limit on the number of requests a single worker is allowed to handle before uwsgi restarts it that I mentioned earlier.
Today was an exceptionally bad day, with long outages and a few short windows of opportunity where the tool could be used. We really need this tool. :(
From the for-what-it's-worth department, I have thrown together a clean-room simplified implementation of only the "URL comparison" feature in a Windows exe creatively called "Copyvios". It has limited applicability (i.e. it scratches my own personal itch so far as the exterior URL is concerned) explained in its README. It does the comparison in a few seconds at most, and the result is broadly comparable with the online tool.
Yes, I know, people who follow The Earwig's talk page have already seen this promo.
It won't ever do the Google search feature.
Source at https://github.com/DavidWBrooks/Copyvios. Be gentle because I don't have experience with pull requests etc. If you trust me, a built binary is at https://1drv.ms/u/s!AtuCZY0YF4hGop43fHSS2GiSK2fPUw?e=goT44W (.net 4.5+ required). There's a screenshot in both locations.
Can confirm that there are lots of messages like this in the logs today:
Fri Mar 6 22:13:34 2020 - *** uWSGI listen queue of socket ":8000" (fd: 7) full !!! (101/100) ***
Waiting for the next time.
If you'd like I can add you to our UptimeRobot reports, where you get emailed as soon as there's downtime: https://stats.uptimerobot.com/BN16RUOP5/784331770 (or, have Earwig add you to copyvios as a maintainer since they all get emails). Indeed, on March 6 the copyvios tool was available for just 36% of the day!
Would it make sense or is it even possible to move the tool to the grid until we get the k8s issues sorted out? We could spin up a new test tool, copyvios-test, where we iron out k8s support before reattempting it on production.
I also still worry about bots. If we could get some UA logging we could at least get a better idea of what we're up against. In my other tools, bots significantly affected uptime. Copyvios is a super process-heavy tool that's linked to from all over the wikis, making it a likely victim. I don't think requiring OAuth is unreasonable for this super-user tool, and maybe key-based access to the API.
You just need to confirm your email and then you're all set. Also make sure alert@uptimerobot.com is on your contact list, since Gmail mistakes it for spam.
I also still worry about bots. If we could get some UA logging we could at least get a better idea of what we're up against.
I have been logging user agents in uwsgi.log for a few weeks now. When I first turned it on, I was able to block several bots that had been getting through the previous blacklist. It seemed to solve the quota issues we had been having. I think most of the current traffic is real human traffic (or authorized bots, none of whom are using the tool for actual searches, afaict).
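(For anyone wanting to do the same elsewhere: one way to get user agents into uwsgi.log is uwsgi's log-format request variables; a minimal sketch, not necessarily the exact format string this tool uses:)

[uwsgi]
log-format = %(addr) [%(ltime)] "%(method) %(uri)" %(status) "%(uagent)"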
I did both on Mar 8. Haven't received anything.
Clarification: What I did was add a filter so the address is never tagged as spam. It did not even appear in the spam box.
Sorry about that, I needed to re-confirm on my end after you confirmed your email. You should get them now, and if not check the spam box.
I used the tool 5 mins ago, and a "normal" analysis takes 5 minutes+
I did this analysis: https://tools.wmflabs.org/copyvios/?lang=de&project=wikipedia&title=&oldid=197860890&action=search&use_engine=1&use_links=1&turnitin=1
My bot that uses the tool has been paused for some weeks.
Is it really slow even when it's not heavily loaded? In that case, if you could produce a test case I'll see if I can find what is taking the time.
Hm, I've raised the timeout on my bot to 300, and now it's mostly working; I haven't had bigger problems in the last few days.
Thu Mar 26 04:40:44 2020 - *** uWSGI listen queue of socket ":8000" (fd: 7) full !!! (101/100) ***
is happening.
strace: full of futex(2)...
Considering that the previous attempts to figure out what is going on using gdb were very invasive and could kill uwsgi, I'm going to try py-spy.
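(Roughly the kind of invocation that implies, with placeholder pids; py-spy samples a running process from outside rather than stopping it the way gdb does:)

$ py-spy dump --pid <uwsgi-worker-pid>
$ py-spy record --pid <uwsgi-worker-pid> --duration 60 --rate 25 --output worker.svg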
60 second, 25 samples per second profile result for both processes:
The former looks like it spends a lot of time in pdfminer; the latter spends a lot of time in bs4 and pdfminer.
I'm not able to tell whether the graphs are unnatural -- I mean, parsing, which is what it is spending a lot of time on, is an expensive problem. Perhaps @Earwig can tell whether the graphs look expected? If so, then there is probably little I can do other than say there is just not enough CPU power to serve the requests in a timely manner under the CPU-limiting cgroups of k8s.
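(One way to sanity-check that theory, assuming cgroup v1 paths inside the pod -- a sketch, not something that was run here: the throttling counters in cpu.stat climb whenever the container hits its CPU quota.)

$ kubectl exec -n tool-copyvios <pod-name> -- cat /sys/fs/cgroup/cpu,cpuacct/cpu.stat
nr_periods 12345          # hypothetical output
nr_throttled 678          # nonzero and growing means the CPU limit is biting
throttled_time 9012345678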
How long is too long? I really have no feel for what the official parsing logic does, but (with obvious apologies for still appearing to self-promote) my fairly simple-minded approach currently runs 3.5 seconds from the end of download to installing the highlighted text in the UI control, for two fairly large articles ("West Virginia" in both WP and EB1911), and last time I used earwig I was getting pretty comparable results. That's on my Surface 4, an Intel i5 2-core laptop. I mean, if that's significantly longer than the server, then I'll step back, but otherwise you're free to re-use the logic (I realize C# to py may be a problem).
There's another optimization that I'll upload later today and shave off another half second.
@DavidBrooks and others: In the Earwig's Copyvio Detector window it's written: "We are still investigating recent performance issues/timeouts on the new infrastructure."
On the new infrastructure? I wonder why Copyvios cannot run on the old infrastructure as before until the performance issues and timeouts on the new infrastructure are fixed, so we can all use it as before in the meantime.
Can you tell me more about it?
Thank you
Yes, the 2020 Kubernetes cluster has per-namespace quotas. These quotas can be adjusted however.
I read this comment as ruling out quotas as the problem however. Are you now convinced otherwise?
@zhuyifei1999 I didn't comment earlier, but I don't see anything unusual in these graphs.
@bd808 I think we've ruled out the memory quota but not the CPU quota. Memory usage remains comfortably low per grafana but CPU frequently tops out at the limit.
@zhuyifei1999 I didn't comment earlier, but I don't see anything unusual in these graphs.
Then I think it is simply impossible to satisfy the timing constraint (timeout) with the given quota and the needed computation.
Let's try to bump the quota then and see if a 2 cpu cap works? We don't have a solid process yet for asking for this (T246592: Establish processes for quota increases in Toolforge Kubernetes), but we have the ability to make the change.
Mentioned in SAL (#wikimedia-cloud) [2020-04-02T23:41:32Z] <bd808> Increased limits.cpu and requests.cpu to 4 (T245426)
$ ssh tools-k8s-control-1.tools.eqiad.wmflabs
$ kubectl edit resourcequota -n tool-copyvios tool-copyvios
resourcequota/tool-copyvios edited
$ kubectl describe resourcequota tool-copyvios -n tool-copyvios
Name:                   tool-copyvios
Namespace:              tool-copyvios
Resource                Used         Hard
--------                ----         ----
configmaps              1            10
limits.cpu              1            4
limits.memory           4Gi          8Gi
persistentvolumeclaims  0            3
pods                    1            4
replicationcontrollers  0            1
requests.cpu            500m         4
requests.memory         2147483648   6Gi
secrets                 1            10
services                1            1
services.nodeports      0            0
$ sudo become copyvios
$ webservice stop; webservice --backend=kubernetes --mem 4Gi --cpu 2 python2 start
DEPRECATED: 'python2' type is deprecated.
See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes for currently supported types.
Stopping webservice
DEPRECATED: 'python2' type is deprecated.
See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Kubernetes for currently supported types.
Starting webservice...............
$ kubectl get pods -o wide
No resources found.
I did something wrong here obviously. I restarted things with --cpu 1.
@Bstorm can I bug you to help figure out what part I messed up?
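(If it helps narrow things down, the reason a pod never appeared usually shows up in the namespace events or the ReplicaSet status; a hedged sketch of the kind of commands that would show it, assuming webservice creates a deployment named after the tool:)

$ kubectl get events -n tool-copyvios --sort-by=.lastTimestamp | tail
$ kubectl describe deployment copyvios -n tool-copyvios
$ kubectl get replicaset -n tool-copyvios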
The tool has been functioning much better the last 2-3 days, with lengthy periods where it's functioning well and far shorter timeouts.
Uptime stats look good! https://stats.uptimerobot.com/BN16RUOP5/784331770
Whatever you did, it worked.
Good to hear, though I haven't changed anything on my end and it looks like @bd808's change did not stick due to some unknown issue?