Sun, Mar 29
Sat, Mar 28
Hence this project focuses on building such a tool, named “WikiCommons Image Verification Tool”.
Probably the most practical thing to do would be to apply this to the beta cluster to test it.
Ah I see. Wrong path.
@Bstorm @herron I can't see T175964, but I'm pretty sure https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/379239/ somehow broke it.
root@tools-mail-02:~# puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tools-mail-02.tools.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '(1a925f799b) Bstorm - Add rate limiting to profile::toolforge::mailrelay with warn action'
Notice: The LDAP client stack for this host is: classic/sudoldap
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: classic/sudoldap'
Error: /Stage[main]/Profile::Toolforge::Mailrelay/File[/etc/exim4/ratelimits/sender_hourly_limits]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/profile/toolforge/mailrelay/ratelimits/sender_hourly_limits
Error: /Stage[main]/Profile::Toolforge::Mailrelay/File[/etc/exim4/ratelimits/host_hourly_limits]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/profile/toolforge/mailrelay/ratelimits/host_hourly_limits
Notice: /Stage[main]/Profile::Toolforge::Mailrelay/Letsencrypt::Cert::Integrated[tools_mail]/Exec[acme-setup-acme-tools_mail]/returns: executed successfully
Info: Class[Profile::Toolforge::Mailrelay]: Unscheduling all events on Class[Profile::Toolforge::Mailrelay]
Info: Stage[main]: Unscheduling all events on Stage[main]
Notice: Applied catalog in 12.70 seconds
failed to open /etc/exim4/ratelimits/host_hourly_limits for linear search: No such file or directory
I see lots of failures that look like a config issue in the exim4 mainlog:
DNS looks fine:
Thu, Mar 26
I'm not able to tell whether the graphs are unnatural -- I mean, parsing, which is what it is spending a lot of time on, is an expensive problem. Perhaps @Earwig can tell whether the graphs look expected? If so, then there is probably little I can do other than say there is just not enough CPU power to serve the requests in a timely manner under the CPU-limiting cgroups on k8s.
60 second, 25 samples per second profile result for both processes:
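For context, a wall-clock sampling profiler of this kind just snapshots the stack at a fixed rate and counts what it sees. A minimal pure-Python sketch (illustrative only -- not the actual tool used above, and `busy_loop` is a hypothetical workload):

```python
import collections
import sys
import threading
import time

def sample_stacks(target_tid, duration=0.5, hz=25):
    """Sample the stack of thread `target_tid` at `hz` samples/second
    and count which function is on top of the stack each time."""
    counts = collections.Counter()
    deadline = time.time() + duration
    while time.time() < deadline:
        # sys._current_frames() maps thread id -> that thread's current frame
        frame = sys._current_frames().get(target_tid)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(1.0 / hz)
    return counts

def busy_loop(stop):
    # Hypothetical CPU-bound workload standing in for the real worker
    while not stop.is_set():
        sum(i * i for i in range(1000))
```

At 25 Hz for 60 seconds this yields roughly 1500 snapshots, so hot functions dominate the counts even with a low sampling rate.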
strace: full of futex(2)...
Thu Mar 26 04:40:44 2020 - *** uWSGI listen queue of socket ":8000" (fd: 7) full !!! (101/100) ***
Fri, Mar 20
How does this project relate to Wikimedia?
Thu, Mar 19
Is it really slow even when it's not heavily loaded? In that case, if you could produce a test case, I'll see if I can find what is taking the time.
Tue, Mar 17
a simple request system at https://www.mediawiki.org/wiki/Talk:Quarry
@zhuyifei1999 Perhaps there could be some sort of a trusted user set on quarry that can run things for longer?
Sat, Mar 14
Fri, Mar 13
The query was executing for too long then.
Thu, Mar 12
Wed, Mar 11
Sorry, I was extremely busy the last two weeks. I think if it's a bug it should stay open. I'll work on it next week.
Sun, Mar 8
zhuyifei1999 [at] gmail [dot] com
Sat, Mar 7
I don't see quarry's killer doing anything. The last command at T246970#5946798 still yields nothing.
Can confirm that there are lots of messages like this in the logs today:
Fri, Mar 6
Hmm, I can see that quarry is indeed running on web rather than analytics.
Thu, Mar 5
This isn't just a quarry issue: (logs)
Wed, Mar 4
However, we can consider adding some caching for our cdnjs proxy for better response times. @Bstorm, thoughts?
Feb 28 2020
Whichever is simplest.
I will be running the script with pdb and saving all sseclient traces over the weekend.
Feb 27 2020
Can't reproduce. Probably related to NFS maintenance a few hours ago.
Feb 26 2020
I don't see any local crats, and nom has no user groups on zhwp.
Yes, what I was saying was: the first and the third are two separate consumers, so events on the first should also be received on the third. If there were something fundamentally wrong with the event data, then both would crash. Since this is not the case, there is nothing fundamentally wrong with the event data, and therefore the error must be elsewhere, such as in transmission / decoding, which leads to the linked bug report / PR.
Feb 25 2020
So while the event data are loaded from JSON, hex-escaping non-ASCII characters is optional:
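To illustrate the point with a generic sketch of Python's json behavior (not the tool's actual code; the event payload here is made up): the serializer decides whether non-ASCII characters are \uXXXX-escaped, and both forms decode to identical data, so a consumer cannot rely on one escaping style being used on the wire.

```python
import json

event = {"title": "Zürich"}  # hypothetical event payload with non-ASCII text

escaped = json.dumps(event)                  # default: ensure_ascii=True
raw = json.dumps(event, ensure_ascii=False)  # keeps the raw non-ASCII text

print(escaped)  # {"title": "Z\u00fcrich"}
print(raw)      # {"title": "Zürich"}

# Both decode back to the same data structure.
assert json.loads(escaped) == json.loads(raw) == event
```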
Would you mind posting the code of the 'minimal test case' somewhere?
Feb 24 2020
I see. Thanks for explaining.
...The work of process 5885 is done. Seeya!
worker 2 killed successfully (pid: 5885)
Respawned uWSGI worker 2 (new pid: 5911)
I guess a flame graph is the way to go then. Gotta find the bottleneck.
perhaps because their requests are taking too long to complete
Feb 23 2020
Though, I don't see the uWSGI listen queue of socket ":8000" (fd: 7) full messages mentioned earlier.
I see this in the uwsgi.log:
This is the other process.
This is the other process.
@Earwig I'm unfamiliar with the code base. Getting a flame graph profiler will probably be a massive PITA. Do you see anything odd in the backtraces?
Finally. Reliable backtrace:
Can't get the frame object for quite a few frames by enumerating the registers. I can probably use PyFrameObject's f_back as a linked list but that is not something libpython.py gdb script supports. Time for custom code :/
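For illustration: Python frames really do form a singly linked list through f_back, which is the structure a custom gdb script would have to walk. The same traversal in pure Python (a sketch of the concept, not the gdb-side code):

```python
import sys

def walk_frames():
    """Collect function names by following the f_back linked list
    from the innermost frame outward."""
    names = []
    frame = sys._getframe()
    while frame is not None:  # the outermost frame's f_back is None
        names.append(frame.f_code.co_name)
        frame = frame.f_back
    return names

def outer():
    return walk_frames()

print(outer())  # innermost first: ['walk_frames', 'outer', ...]
```

In gdb the equivalent walk would dereference the PyFrameObject's f_back pointer instead of using sys._getframe.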
(gdb) f
#8  PyEval_EvalFrameEx (f=< at remote 0x1b9fc10>, throwflag=1224238336) at ../Python/ceval.c:2679
2679    in ../Python/ceval.c
(gdb) info reg
rax            0xfffffffffffffe00   -512
rbx            0x7ff048f86500       140669993116928
rcx            0x7ff04f093010       140670094880784
rdx            0x0                  0
rsi            0x80                 128
rdi            0x1c36570            29582704
rbp            0x7ff0495fbfac       0x7ff0495fbfac
rsp            0x7fff99b48670       0x7fff99b48670
r8             0x3                  3
r9             0x7ff01aff8bf8       140669221833720
r10            0x0                  0
r11            0x246                582
r12            0x1b9fc10            28965904
r13            0x7ff019865f30       140669197115184
r14            0x3                  3
r15            0x7ff03c007ca0       140669775543456
rip            0x7ff04ba3227d       0x7ff04ba3227d <PyEval_EvalFrameEx+22749>
eflags         0x246                [ PF ZF IF ]
cs             0x33                 51
ss             0x2b                 43
ds             0x0                  0
es             0x0                  0
fs             0x0                  0
gs             0x0                  0
(gdb) p (PyObject *)0x7ff03c007ca0
$11 = Frame 0x7ff03c007ca0, for file /data/project/copyvios/git/earwigbot/earwigbot/wiki/copyvios/exclusions.py, line 189, in check (self=<ExclusionsDB(_dbfile='.earwigbot/exclusions.db', _logger=<Logger(name='earwigbot.wiki.exclusionsdb', parent=<Logger(name='earwigbot.wiki', parent=<Logger(name='earwigbot', parent=<RootLogger(name='root', parent=None, handlers=, level=30, disabled=0, propagate=1, filters=) at remote 0x7ff04a4c2a10>, handlers=[<TimedRotatingFileHandler(utc=False, interval=86400, backupCount=7, suffix='%Y-%m-%d', stream=<file at remote 0x7ff045b11f60>, encoding=None, lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7ff045983450>, _RLock__count=0) at remote 0x7ff045a8f290>, level=20, when='MIDNIGHT', _name=None, delay=False, rolloverAt=1582502400, baseFilename='/data/project/copyvios/git/copyvios/.earwigbot/logs/bot.log', mode='a', filters=, extMatch=<_sre.SRE_Pattern at remote 0x7ff0459aac90>, formatter=<BotFormatter(_format=<instancemethod at r...(truncated)
(gdb) bt
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x00007ff04b9a5068 in PyThread_acquire_lock (lock=lock@entry=0x1c36570, waitflag=waitflag@entry=1) at ../Python/thread_pthread.h:324
#2  0x00007ff04ba2c3e6 in PyEval_RestoreThread (tstate=0x1b9fc10) at ../Python/ceval.c:357
#3  0x00007ff04997a7d2 in _pysqlite_fetch_one_row () from /data/project/copyvios/www/python/venv/lib/python2.7/lib-dynload/_sqlite3.x86_64-linux-gnu.so
#4  0x00007ff04997a631 in pysqlite_cursor_iternext () from /data/project/copyvios/www/python/venv/lib/python2.7/lib-dynload/_sqlite3.x86_64-linux-gnu.so
#5  0x00007ff04ba2d93e in PyEval_EvalFrameEx (f=<p at remote 0x7ff03c007e88>, throwflag=1270237784) at ../Python/ceval.c:2510
#6  0x00007ff04ba3227d in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=<optimized out>, func=<optimized out>) at ../Python/ceval.c:4119
#7  call_function (oparg=<optimized out>, pp_stack=<optimized out>) at ../Python/ceval.c:4054
#8  PyEval_EvalFrameEx (f=< at remote 0x1b9fc10>, throwflag=1224238336) at ../Python/ceval.c:2679
#9  0x00007ff04baa5190 in PyEval_EvalCodeEx (co=0x7ff0496022b0, globals=<unknown at remote 0x80>, locals=0x0, args=0x7ff0341060d8, argcount=3, kws=0x7ff01aff8bf8, kwcount=0, defs=0x0, defcount=0, closure=(<cell at remote 0x7ff01ad2ac58>,)) at ../Python/ceval.c:3265
#10 0x00007ff04ba32171 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=<optimized out>, func=<optimized out>) at ../Python/ceval.c:4129
#11 call_function (oparg=<optimized out>, pp_stack=<optimized out>) at ../Python/ceval.c:4054
#12 PyEval_EvalFrameEx (f=<unknown at remote 0x7ff0341060e0>, throwflag=430033928) at ../Python/ceval.c:2679
#13 0x00007ff04baa5190 in PyEval_EvalCodeEx (co=0x7ff0492e0930, globals=<unknown at remote 0x80>, locals=0x0, args=0x1e13d60, argcount=3, kws=0x7ff01aff8bf8, kwcount=0, defs=0x7ff0492eaca8, defcount=1, closure=0x0) at ../Python/ceval.c:3265
#14 0x00007ff04ba32171 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=<optimized out>, func=<optimized out>) at ../Python/ceval.c:4129
#15 call_function (oparg=<optimized out>, pp_stack=<optimized out>) at ../Python/ceval.c:4054
Can confirm. A lot of CPU is being used. gdb is taking forever to attach for some reason.
rcworker.py would just be reduced to a few lines
(Waiting for failure)
Feb 22 2020
What is the traceback?
Since there are mentions of some sockets above, I checked if the fds look sane. Installed lsof and strace inside the pod, and:
Is this concerning the URL https://tools.wmflabs.org/copyvios/? If so, it is not giving me 504 right now. Mind pinging when it does?
screen: Ctrl-A D
ssh: Enter ~ .
docker: Ctrl-P Ctrl-Q
Feb 19 2020
Have you thought of using https://wikitech.wikimedia.org/wiki/Nova_Resource:Video ?
Feb 18 2020
Yeah, thanks. I 'fixed' it by deleting the apicache directory.
Looks like the API cache has the namespace dict.
Looks like the localization of namespace names is broken:
Feb 15 2020
I think too long is fine because it will soon be unsupported. Annoyance is sometimes a good thing :)
Feb 2 2020
I can do some profiling. Do you have the data that is passed into the service right before it fails?
Jan 31 2020
for x in range(sumdelegates['head']):
    poslist['head'].append(
        [5.0+blocksize*(x+optionlist['spacing']/2), centertop])
# Cross-bench parties are 5 from the edge, vertically centered:
for x in range(optionlist['centercols']):
    # How many rows in this column of the cross-bench
    thiscol = int(min(centerrows, sumdelegates['center']-x*centerrows))
    for y in range(thiscol):
        poslist['center'].append(
            [svgwidth-5.0-(optionlist['centercols']-x-optionlist['spacing']/2) * blocksize,
             ((svgheight-thiscol*blocksize)/2)+blocksize*(y+optionlist['spacing']/2)])
poslist['center'].sort(key=lambda point: point)
# Left parties are in the top block:
for x in range(wingcols):
    for y in range(optionlist['wingrows']['left']):
        poslist['left'].append(
            [5+(leftoffset+x+optionlist['spacing']/2)*blocksize,
             centertop-(1.5+y)*blocksize])
# Right parties are in the bottom block:
for x in range(wingcols):
    for y in range(optionlist['wingrows']['right']):
        poslist['right'].append(
            [5+(leftoffset+x+optionlist['spacing']/2)*blocksize,
             centertop+(1.5+y)*blocksize])
So you mean read accesses should ignore maxlag? hmm
maxvmem is above 4GiB and h_vmem=4G, so yes, this is killed by the virtual memory (VMS) limit being exceeded. Are you mapping files into memory?
sge_status(5): failed status 37 means qmaster enforced h_rt, h_cpu, or h_vmem limit
Checking last entry, adapting code from https://phabricator.wikimedia.org/source/tool-grid-jobs/browse/master/grid_jobs/__init__.py$27:
10:18:08 0 ✓ zhuyifei1999@tools-sgebastion-09: ~$ grep parliamentdiagram /data/project/.system_sge/gridengine/default/common/accounting
task:tools-sgeexec-0933.tools.eqiad.wmflabs:tools.parliamentdiagram:tools.parliamentdiagram:cron-tools.parliamentdiagram-1:143498:sge:0:1580021582:1580021592:1580021593:0:0:1:0.004000:0.012000:3852.000000:0:0:0:0:983:0:0:8.000000:8:0:0:0:457:10:NONE:defaultdepartment:NONE:1:0:0.016000:0.000000:0.000000:-q task -l h_vmem=524288k:0.000000:NONE:0.000000:0:0
[...]
webgrid-lighttpd:tools-sgewebgrid-lighttpd-0916.tools.eqiad.wmflabs:tools.parliamentdiagram:tools.parliamentdiagram:lighttpd-parliamentdiagram:460252:sge:0:1579694757:1579694764:1580077607:37:0:382843:0.016000:0.004000:6456.000000:0:0:0:0:857:28:0:6280.000000:0:0:0:0:72:9:NONE:defaultdepartment:NONE:1:0:191.790000:72.662859:9.654144:-q webgrid-lighttpd -l h_vmem=4G:0.000000:NONE:4779155456.000000:0:0
[...]
webgrid-lighttpd:tools-sgewebgrid-lighttpd-0912.tools.eqiad.wmflabs:tools.parliamentdiagram:tools.parliamentdiagram:lighttpd-parliamentdiagram:237421:sge:0:1580161168:1580161182:1580401877:37:0:240695:0.020000:0.004000:6348.000000:0:0:0:0:863:2:0:24.000000:8:0:0:0:10:8:NONE:defaultdepartment:NONE:1:0:1641.300000:34.918114:4.651640:-q webgrid-lighttpd -l h_vmem=4G:0.000000:NONE:4420464640.000000:0:0
[...]
webgrid-lighttpd:tools-sgewebgrid-lighttpd-0922.tools.eqiad.wmflabs:tools.parliamentdiagram:tools.parliamentdiagram:lighttpd-parliamentdiagram:391323:sge:0:1580412371:1580412372:1580474692:37:0:62320:0.068000:0.036000:6216.000000:0:0:0:0:820:0:0:0.000000:0:0:0:0:9:15:NONE:defaultdepartment:NONE:1:0:239.950000:233.663917:0.889195:-q webgrid-lighttpd -l h_vmem=4G:0.000000:NONE:4334555136.000000:0:0
[...]
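These records are colon-separated; a small sketch pulls out the failed code and maxvmem from one of the records above. The field positions follow the SGE accounting(5) layout (failed is the 12th field, maxvmem the 43rd, 1-based) -- treat the exact indices as an assumption:

```python
# One webgrid-lighttpd record from the grep output above.
record = (
    "webgrid-lighttpd:tools-sgewebgrid-lighttpd-0916.tools.eqiad.wmflabs:"
    "tools.parliamentdiagram:tools.parliamentdiagram:lighttpd-parliamentdiagram:"
    "460252:sge:0:1579694757:1579694764:1580077607:37:0:382843:0.016000:"
    "0.004000:6456.000000:0:0:0:0:857:28:0:6280.000000:0:0:0:0:72:9:NONE:"
    "defaultdepartment:NONE:1:0:191.790000:72.662859:9.654144:"
    "-q webgrid-lighttpd -l h_vmem=4G:0.000000:NONE:4779155456.000000:0:0"
)

fields = record.split(":")
failed = int(fields[11])     # 37 = qmaster enforced an h_rt/h_cpu/h_vmem limit
maxvmem = float(fields[42])  # peak virtual memory, in bytes

print(failed, maxvmem / 2**30)  # maxvmem is ~4.45 GiB, above h_vmem=4G
```

This is the same failed=37 / maxvmem>4G reading described in the comments above, just made mechanical.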
So this is https://www.mediawiki.org/wiki/Manual:Maxlag_parameter. I guess in theory you could increase it to make it more aggressive, but I think it's better to get the lag fixed.
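For reference, a client honoring maxlag sends it as a request parameter and backs off when the API returns a "maxlag" error. A hedged sketch of the response-handling side only (no network; the helper name is mine, and the error shape follows the Manual:Maxlag_parameter docs):

```python
def maxlag_backoff(response, default_wait=5):
    """Given a decoded MediaWiki API JSON response, return how many
    seconds to wait before retrying, or 0 if no maxlag error occurred.
    Sketch only: a real client should prefer the Retry-After header."""
    error = response.get("error", {})
    if error.get("code") != "maxlag":
        return 0
    # The error body reports the current replication lag in 'lag' (seconds)
    return max(error.get("lag", default_wait), default_wait)

# Example error shape per the maxlag documentation (values made up):
lagged = {"error": {"code": "maxlag",
                    "info": "Waiting for a database server: 6 seconds lagged",
                    "lag": 6}}
ok = {"query": {}}

print(maxlag_backoff(lagged))  # 6
print(maxlag_backoff(ok))      # 0
```

Raising the maxlag value means the client tolerates more replication lag before backing off, which is why a higher setting makes a bot "more aggressive".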
Do you have some pywikibot-only test case?
Yes. You need to file a ticket for that. https://wikitech.wikimedia.org/wiki/Help:Shared_storage