Is this normal load? Is something up with PoolCounter changes?
Description
Details
Subject | Repo | Branch | Lines +/- | |
---|---|---|---|---|
Bumps mwparserfromhell to 0.5.2 | research/ores/wheels | master | +3 -3 |
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | Halfak | T206654 ORES workers using dramatically higher CPU, increasing linearly with time | |||
Invalid | Ladsgroup | T206763 Upgrade mwparserfromhell in production. |
Event Timeline
Looks like the trouble started on Sept 22nd, 16:00 UTC.
Restarting celery workers on each codfw box reset the usage back to zero, which buys us some time. I'm assuming that it'll behave the same way but on eqiad now. We'll see!
Some detailed graph of the jump: https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=5&fullscreen&orgId=1&from=1537477850460&to=1537720679875
It's Saturday and nothing happened around the time.
Poolcounter probably is not at fault here, it was deployed three days before this jump.
One weird thing is that ores1009 has high cpu usage right now but eqiad should not get much requests.
Might be related: https://github.com/celery/celery/issues/1558
Our current kombu version is 3.0.37, people reported such issues with that version and said upgrading to kombu 4.0.1 would solve it
And also https://github.com/celery/celery/issues/2142 Both of them basically mean we need to upgrade kombu, celery 3.1.26 (latest pre 4 version) doesn't use new version of kombu: https://github.com/celery/celery/blob/11e98c48357e589a045c5397bf3d8f5e7f11603e/requirements/default.txt i.e. This will be resolved if we upgrade to celery 4
Both issues mention a poll syscall returning POLLNVAL that we should be able to see in the strace if it's the same problem. Is it the worker manager or the worker threads which go 100% CPU? Do the straces match this profile?
Here's an interesting comment on a related celery bug: https://github.com/celery/celery/issues/2906#issuecomment-153082379
Are we still killing the workers after serving every N tasks?
The bugs you found suggest that it's the connection to Redis which is causing the grief, so I tried to correlate Redis glitches with the CPU spikes with no luck. However, look at these strange cut-outs, which might indicate network flakiness:
https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=31&fullscreen&orgId=1&from=1537647142541&to=1537699593504
https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=10&fullscreen&orgId=1&from=1537647142541&to=1537699593504
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd8867000 munmap(0x7fbfd8867000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd8867000 munmap(0x7fbfd8867000, 262144) = 0 munmap(0x7fbfd88e7000, 262144) = 0 munmap(0x7fbfd89a7000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd89a7000 munmap(0x7fbfd89a7000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd89a7000 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd88e7000 munmap(0x7fbfd88e7000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd88e7000 munmap(0x7fbfd88e7000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd88e7000 munmap(0x7fbfd88e7000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd88e7000 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd8867000 munmap(0x7fbfd8867000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd8867000 munmap(0x7fbfd8867000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd8867000 munmap(0x7fbfd8867000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd8867000 munmap(0x7fbfd8867000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd8867000 munmap(0x7fbfd8867000, 262144) = 0 munmap(0x7fbfd88e7000, 262144) = 0 munmap(0x7fbfd89a7000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd89a7000 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd88e7000 munmap(0x7fbfd88e7000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd88e7000 munmap(0x7fbfd88e7000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd88e7000 munmap(0x7fbfd88e7000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd88e7000 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd8867000 munmap(0x7fbfd8867000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd8867000 munmap(0x7fbfd8867000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd8867000 munmap(0x7fbfd8867000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd8867000 munmap(0x7fbfd8867000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd8867000 munmap(0x7fbfd8867000, 262144) = 0 munmap(0x7fbfd88e7000, 262144) = 0 munmap(0x7fbfd89a7000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd89a7000 munmap(0x7fbfd89a7000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd89a7000 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd88e7000 munmap(0x7fbfd88e7000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd88e7000 munmap(0x7fbfd88e7000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd88e7000 munmap(0x7fbfd88e7000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd88e7000
Here's the strace profile
And gdb bt output
#0 0x00005628905a4140 in ?? () #1 0x00007fbf68c51b8b in Tokenizer_push_textbuffer (self=self@entry=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_support.c:61 #2 0x00007fbf68c51d3e in Tokenizer_pop_keeping_context (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_support.c:123 #3 0x00007fbf68c518fd in Tokenizer_handle_table_cell (self=self@entry=0x7fbfd9867718, markup=markup@entry=0x7fbf68c53105 "|", tag=tag@entry=0x7fbf68c52f63 "td", line_context=line_context@entry=34359738368) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2349 #4 0x00007fbf68c4f0aa in Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2782 #5 0x00007fbf68c4f193 in Tokenizer_handle_table_row (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2311 #6 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2776 #7 0x00007fbf68c4e33e in Tokenizer_parse_table (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2251 #8 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2728 #9 0x00007fbf68c4f193 in Tokenizer_handle_table_row (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2311 #10 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2776 #11 0x00007fbf68c4e33e in Tokenizer_parse_table (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2251 #12 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2728 #13 0x00007fbf68c4f193 in Tokenizer_handle_table_row (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2311 #14 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2776 #15 0x00007fbf68c4e33e in Tokenizer_parse_table (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2251 #16 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2728 #17 0x00007fbf68c4f193 in Tokenizer_handle_table_row (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2311 #18 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2776 #19 0x00007fbf68c4e33e in Tokenizer_parse_table (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2251 #20 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2728 #21 0x00007fbf68c4f193 in Tokenizer_handle_table_row (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2311 #22 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2776 #23 0x00007fbf68c4e33e in Tokenizer_parse_table (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2251 #24 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2728 #25 0x00007fbf68c4f193 in Tokenizer_handle_table_row (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2311 #26 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2776 #27 0x00007fbf68c4e33e in Tokenizer_parse_table (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2251 #28 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2728 #29 0x00007fbf68c4f193 in Tokenizer_handle_table_row (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2311 #30 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2776 #31 0x00007fbf68c4e33e in Tokenizer_parse_table (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2251 #32 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2728 #33 0x00007fbf68c4f193 in Tokenizer_handle_table_row (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2311 #34 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2776 #35 0x00007fbf68c4e33e in Tokenizer_parse_table (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2251 #36 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2728 #37 0x00007fbf68c4f193 in Tokenizer_handle_table_row (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2311 #38 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2776 #39 0x00007fbf68c4e33e in Tokenizer_parse_table (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2251 #40 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2728 #41 0x00007fbf68c4f193 in Tokenizer_handle_table_row (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2311 #42 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2776 #43 0x00007fbf68c4e33e in Tokenizer_parse_table (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2251 #44 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2728 #45 0x00007fbf68c4f193 in Tokenizer_handle_table_row (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2311 #46 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2776 #47 0x00007fbf68c4e33e in Tokenizer_parse_table (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2251 #48 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2728 ---Type <return> to continue, or q <return> to quit--- #49 0x00007fbf68c4ee2e in Tokenizer_parse_italics_and_bold (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:1886 #50 Tokenizer_parse_style (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:1967 #51 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2703 #52 0x00007fbf68c50a1d in Tokenizer_parse_wikilink (self=self@entry=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:357 #53 0x00007fbf68c4d86d in Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2639 #54 0x00007fbf68c4ee2e in Tokenizer_parse_italics_and_bold (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:1886 #55 Tokenizer_parse_style (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:1967 #56 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2703 #57 0x00007fbf68c50a1d in Tokenizer_parse_wikilink (self=self@entry=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:357 #58 0x00007fbf68c4d86d in Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2639 #59 0x00007fbf68c518fd in Tokenizer_handle_table_cell (self=self@entry=0x7fbfd9867718, markup=markup@entry=0x7fbf68c53105 "|", tag=tag@entry=0x7fbf68c52f63 "td", line_context=line_context@entry=34359738368) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2349 #60 0x00007fbf68c4f0aa in Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2782 #61 0x00007fbf68c4f193 in Tokenizer_handle_table_row (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2311 #62 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2776 #63 0x00007fbf68c4e33e in Tokenizer_parse_table (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2251 #64 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2728 #65 0x00007fbf68c4ee2e in Tokenizer_parse_italics_and_bold (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:1886 #66 Tokenizer_parse_style (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:1967 #67 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2703 #68 0x00007fbf68c50a1d in Tokenizer_parse_wikilink (self=self@entry=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:357 #69 0x00007fbf68c4d86d in Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2639 #70 0x00007fbf68c4ee2e in Tokenizer_parse_italics_and_bold (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:1886 #71 Tokenizer_parse_style (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:1967 #72 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2703 #73 0x00007fbf68c50a1d in Tokenizer_parse_wikilink (self=self@entry=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:357 #74 0x00007fbf68c4d86d in Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2639 #75 0x00007fbf68c518fd in Tokenizer_handle_table_cell (self=self@entry=0x7fbfd9867718, markup=markup@entry=0x7fbf68c53105 "|", tag=tag@entry=0x7fbf68c52f63 "td", line_context=line_context@entry=34359738368) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2349 #76 0x00007fbf68c4f0aa in Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2782 #77 0x00007fbf68c4f193 in Tokenizer_handle_table_row (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2311 #78 Tokenizer_parse (self=0x7fbfd9867718, context=<optimized out>, push=<optimized out>) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2776 #79 0x00007fbf68c4e33e in Tokenizer_parse_table (self=0x7fbfd9867718) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2251 #80 Tokenizer_parse (self=self@entry=0x7fbfd9867718, context=<optimized out>, push=push@entry=1) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2728 #81 0x00007fbf68c52640 in Tokenizer_tokenize (self=0x7fbfd9867718, args=0x7fbf540a95e8) at mwparserfromhell/parser/ctokenizer/tokenizer.c:167 #82 0x000056289059d6df in PyCFunction_Call () #83 0x000056289055d6e9 in PyEval_EvalFrameEx () #84 0x0000562890562286 in ?? () #85 0x000056289055e0a9 in PyEval_EvalFrameEx () #86 0x0000562890562286 in ?? () #87 0x000056289055e0a9 in PyEval_EvalFrameEx () #88 0x00005628905630df in PyEval_EvalCodeEx () #89 0x000056289059f7bf in ?? () #90 0x00005628905e6647 in PyObject_Call () #91 0x000056289055b180 in PyEval_EvalFrameEx () #92 0x00005628905630df in PyEval_EvalCodeEx () #93 0x000056289059f5d3 in ?? () #94 0x00005628905e6647 in PyObject_Call () #95 0x000056289050694e in ?? () #96 0x00005628905e6647 in PyObject_Call () #97 0x00005628905a6c46 in ?? () #98 0x00005628905e6647 in PyObject_Call () #99 0x000056289055b180 in PyEval_EvalFrameEx () #100 0x0000562890562286 in ?? () #101 0x000056289055e0a9 in PyEval_EvalFrameEx () #102 0x0000562890562286 in ?? () #103 0x000056289055e0a9 in PyEval_EvalFrameEx () #104 0x0000562890562286 in ?? () ---Type <return> to continue, or q <return> to quit--- #105 0x000056289055e0a9 in PyEval_EvalFrameEx () #106 0x0000562890562286 in ?? () #107 0x000056289055e0a9 in PyEval_EvalFrameEx () #108 0x0000562890562286 in ?? () #109 0x000056289055e0a9 in PyEval_EvalFrameEx () #110 0x0000562890562286 in ?? () #111 0x000056289055e0a9 in PyEval_EvalFrameEx () #112 0x00005628905a0ccc in ?? () #113 0x00005628905d8de4 in ?? () #114 0x00005628905ae5c4 in ?? () #115 0x00005628905e6647 in PyObject_Call () #116 0x000056289055dee1 in PyEval_EvalFrameEx () #117 0x0000562890562286 in ?? () #118 0x000056289055e0a9 in PyEval_EvalFrameEx () #119 0x0000562890562286 in ?? () #120 0x000056289055e0a9 in PyEval_EvalFrameEx () #121 0x00005628905630df in PyEval_EvalCodeEx () #122 0x000056289059f7bf in ?? () #123 0x00005628905e6647 in PyObject_Call () #124 0x000056289055b180 in PyEval_EvalFrameEx () #125 0x00005628905627bf in ?? () #126 0x000056289055e4c9 in PyEval_EvalFrameEx () #127 0x0000562890562286 in ?? () #128 0x000056289055e4c9 in PyEval_EvalFrameEx () #129 0x0000562890563817 in PyEval_EvalCodeEx () #130 0x000056289059f7bf in ?? () #131 0x00005628905e6647 in PyObject_Call () #132 0x000056289055b180 in PyEval_EvalFrameEx () #133 0x0000562890563817 in PyEval_EvalCodeEx () #134 0x000056289059f7bf in ?? () #135 0x00005628905e6647 in PyObject_Call () #136 0x000056289050694e in ?? () #137 0x00005628905e6647 in PyObject_Call () #138 0x00005628905a6c46 in ?? () #139 0x00005628905e6647 in PyObject_Call () #140 0x000056289055b180 in PyEval_EvalFrameEx () #141 0x00005628905627bf in ?? () #142 0x000056289055e0a9 in PyEval_EvalFrameEx () #143 0x00005628905630df in PyEval_EvalCodeEx () #144 0x000056289059f7bf in ?? () #145 0x00005628905e6647 in PyObject_Call () #146 0x000056289055b180 in PyEval_EvalFrameEx () #147 0x00005628905627bf in ?? () #148 0x000056289055e0a9 in PyEval_EvalFrameEx () #149 0x00005628905627bf in ?? () #150 0x000056289055e4c9 in PyEval_EvalFrameEx () #151 0x000056289055d93f in PyEval_EvalFrameEx () #152 0x00005628905630df in PyEval_EvalCodeEx () #153 0x000056289059f5d3 in ?? () #154 0x00005628905e6647 in PyObject_Call () #155 0x000056289050694e in ?? () #156 0x00005628905e6647 in PyObject_Call () #157 0x00005628905a6aa9 in ?? () #158 0x00005628905ae5c4 in ?? () #159 0x00005628905e6647 in PyObject_Call () #160 0x000056289055dee1 in PyEval_EvalFrameEx () #161 0x000056289055d93f in PyEval_EvalFrameEx () #162 0x000056289055d93f in PyEval_EvalFrameEx () #163 0x00005628905627bf in ?? () #164 0x000056289055e4c9 in PyEval_EvalFrameEx () #165 0x000056289055d93f in PyEval_EvalFrameEx () #166 0x000056289055d93f in PyEval_EvalFrameEx () #167 0x000056289055d93f in PyEval_EvalFrameEx () #168 0x00005628905630df in PyEval_EvalCodeEx () #169 0x000056289059f5d3 in ?? () #170 0x00005628905e6647 in PyObject_Call () #171 0x000056289055b180 in PyEval_EvalFrameEx () #172 0x00005628905a0ccc in ?? () #173 0x00005628905649d1 in ?? () #174 0x000056289059d6df in PyCFunction_Call () #175 0x000056289055d6e9 in PyEval_EvalFrameEx () #176 0x00005628905630df in PyEval_EvalCodeEx () #177 0x000056289059f648 in ?? () ---Type <return> to continue, or q <return> to quit--- #178 0x00005628905e6647 in PyObject_Call () #179 0x000056289055b180 in PyEval_EvalFrameEx () #180 0x000056289055d93f in PyEval_EvalFrameEx () #181 0x000056289055d93f in PyEval_EvalFrameEx () #182 0x000056289055d93f in PyEval_EvalFrameEx () #183 0x000056289055d93f in PyEval_EvalFrameEx () #184 0x000056289055d93f in PyEval_EvalFrameEx () #185 0x000056289055d93f in PyEval_EvalFrameEx () #186 0x00005628905630df in PyEval_EvalCodeEx () #187 0x000056289059f7bf in ?? () #188 0x00005628905e6647 in PyObject_Call () #189 0x000056289055b180 in PyEval_EvalFrameEx () #190 0x00005628905630df in PyEval_EvalCodeEx () #191 0x000056289059f7bf in ?? () #192 0x00005628905e6647 in PyObject_Call () #193 0x000056289050694e in ?? () #194 0x00005628905e6647 in PyObject_Call () #195 0x00005628905a6c46 in ?? () #196 0x00005628905e6647 in PyObject_Call () #197 0x000056289055b180 in PyEval_EvalFrameEx () #198 0x0000562890562286 in ?? () #199 0x000056289055e0a9 in PyEval_EvalFrameEx () #200 0x0000562890562286 in ?? () #201 0x000056289055e0a9 in PyEval_EvalFrameEx () #202 0x000056289055d93f in PyEval_EvalFrameEx () #203 0x0000562890562286 in ?? () #204 0x000056289055e0a9 in PyEval_EvalFrameEx () #205 0x00005628905627bf in ?? () #206 0x000056289055e0a9 in PyEval_EvalFrameEx () #207 0x0000562890562286 in ?? () #208 0x000056289055e0a9 in PyEval_EvalFrameEx () #209 0x000056289055d93f in PyEval_EvalFrameEx () #210 0x0000562890562286 in ?? () #211 0x0000562890562f9f in PyEval_EvalCode () #212 0x00005628906308f2 in ?? () #213 0x0000562890632e1d in PyRun_FileExFlags () #214 0x00005628906335be in PyRun_SimpleFileExFlags () #215 0x00005628906614d7 in Py_Main () #216 0x00005628904f1c01 in main ()
Unfortunately due to the venv I could not get (yet) the niceties of py-bt and py-list working but there's already an indication that this is busy looping parsing text. The ltrace output reinforces that idea
memset(0x7fbfd88ca7c8, '\0', 56) = 0x7fbfd88ca7c8 malloc(224) = 0x5628e48e4e10 free(0x5628e48e4d20) = <void> realloc(0, 32) = 0x5628e2db1390 memset(0x7fbfd88ca818, '\0', 56) = 0x7fbfd88ca818 malloc(224) = 0x5628e48e4d20 free(0x5628e48e4c30) = <void> realloc(0, 40) = 0x5628e36fda40 memmove(0x5628e36fda50, "", 0) = 0x5628e36fda50 free(0x5628e2db1390) = <void> malloc(224) = 0x5628e48e4c30 strlen("text") = 4 memcpy(0x7fbfd88c9258, "text", 4) = 0x7fbfd88c9258 memcmp(0x7fbfe0dcaed0, 0x7fbfd88c9258, 4, 64) = 0 memset(0x7fbfd88ca868, '\0', 56) = 0x7fbfd88ca868 malloc(224) = 0x5628e48e4f00 free(0x5628e48e4c30) = <void> realloc(0, 32) = 0x5628e2db1390 malloc(224) = 0x5628e48e4c30 strlen("|") = 1 strlen("wiki_markup") = 11 memcpy(0x7fbfd9b1ede0, "wiki_markup", 11) = 0x7fbfd9b1ede0 memcmp(0x7fbf7410df20, 0x7fbfd9b1ede0, 11, 64) = 0 memset(0x7fbfd88ca8b8, '\0', 56) = 0x7fbfd88ca8b8 malloc(224) = 0x5628e48e4ff0 free(0x5628e48e4c30) = <void> malloc(224) = 0x5628e48e4c30 strlen("text") = 4 memcpy(0x7fbfd88c9290, "text", 4) = 0x7fbfd88c9290 memcmp(0x7fbfe0dcaed0, 0x7fbfd88c9290, 4, 64) = 0 memset(0x7fbfd88ca908, '\0', 56) = 0x7fbfd88ca908 malloc(224) = 0x5628e48e50e0 free(0x5628e48e4c30) = <void> memmove(0x5628e48e4a20, "", 0) = 0x5628e48e4a20 free(0x5628e36fda40) = <void> malloc(224) = 0x5628e48e4c30 strlen("|") = 1 strlen("wiki_markup") = 11 memcpy(0x7fbfd9b1ede0, "wiki_markup", 11) = 0x7fbfd9b1ede0 memcmp(0x7fbf7410df20, 0x7fbfd9b1ede0, 11, 64) = 0 strlen("padding") = 7 memcpy(0x7fbfd88c9290, "padding", 7) = 0x7fbfd88c9290 memcmp(0x7fbfdd6c5760, 0x7fbfd88c9290, 7, 64) = 0 memset(0x7fbfd88ca958, '\0', 56) = 0x7fbfd88ca958 malloc(224) = 0x5628e48e51d0 free(0x5628e48e4c30) = <void> memmove(0x5628e48e4a30, "", 0) = 0x5628e48e4a30 free(0x5628e2db1390) = <void> malloc(224) = 0x5628e48e4c30 strlen("") = 0 strlen("wiki_markup") = 11 memcpy(0x7fbfd9b1ede0, "wiki_markup", 11) = 0x7fbfd9b1ede0 memcmp(0x7fbf7410df20, 0x7fbfd9b1ede0, 11, 64) = 0 memset(0x7fbfd88ca9a8, '\0', 56) = 0x7fbfd88ca9a8 malloc(224) = 0x5628e48e52c0 free(0x5628e48e4c30) = <void> malloc(224) = 0x5628e48e4c30 strlen("text")
I found this config on production, so in theory our workers should be restarting themselves:
CELERYD_MAX_TASKS_PER_CHILD: 100
Following up from IRC, I set loggers.celery.level to DEBUG in logging_config.yaml, and was able to demonstrate locally that celery debug logs would appear in /srv/log/ores/app.log.
I might be wrong but IIRC this works by the master counting the number of tasks a child completes (successfully or unsuccessfully, it's irrelevant) and killing them when that count reaches 100, starting a new process after that. It actually works, we 've witnessed that and can be easily witnessed by the age of celery workers on a machine (it's usually a few mins up to a few tens of mins).
But if a child goes into a forever loop and never completes ? This setting ain't gonna help.
Conceptually speaking, one solution to that would be a heartbeat between the celery master and the child and the master killing the child if it hasn't received a heartbeat in a preconfigured and sensible threshold. I do see celery supporting heartbeats per http://docs.celeryproject.org/en/2.1-archived/internals/reference/celery.worker.heartbeat.html since version 2.1 at least. Maybe it could help ?
Celery has timeouts and it should catch them. I guess revoking tasks might be at fault here.
One other observation, CPU usage doesn't go up linearly, it actually goes up with jumps. Since we restarted. Some nodes stayed stable but ores2004, ores2003 and ores2009 had the jump on midnight and stayed like this since then. If you look at other cases, it's all sudden jumps.
From @akosiaris in chat
[06:45:33] <akosiaris> mwparserfromhell honors it's name [06:45:43] <akosiaris> (gdb) py-bt [06:45:43] <akosiaris> Traceback (most recent call first): [06:45:44] <akosiaris> <built-in method tokenize of _tokenizer.CTokenizer object at remote 0x7fbfd9867718> [06:45:44] <akosiaris> Python Exception <class 'RuntimeError'> Type does not have a target.: [06:45:46] <akosiaris> sigh [06:46:12] <akosiaris> good luck figuring out now the python stacktrace
I think that we ought to (1) monitor CPU usage. Should never go above 50% for any sustained period of time and (2) update mwparserfromhell in hopes that the issue has already been addressed.
@Ladsgroup just noted in IRC that we are running the most recent version of mwparserfromhell already (0.5.1). So I'm looking into why these processes are not timing out.
We are *supposedly* using Unix signals to time out a processing job that takes too long. See https://github.com/wikimedia/ores/blob/master/ores/util.py#L25
See this for a discussion of threading/signal timeouts: https://pypi.org/project/stopit/#comparing-thread-based-and-signal-based-timeout-control
I've demonstrated that ORES timeout mechanism works for an old mwparserfromhell recursion bug in version 0.4.4 based on this report: https://github.com/earwig/mwparserfromhell/issues/190
See my work here: https://gist.github.com/halfak/281a692873d538e77c6096f5042e8be2
I also checked if ores extension has changed the way it does requests and jobs. Nothing of interest came out.
Is this however being used when calling out to mwparserfromhell ? The only usage of it I tracked down is https://github.com/wikimedia/ores/blob/master/ores/scoring_systems/scoring_system.py#L163. I don't know the code enough to understand if it's sufficient though.
I think I 've managed to reproduce the problem. After fetching and trying the last 500 revisions from https://es.wikipedia.org/wiki/Usuario:Danielalfredo/Taller and failing to reproduce it, I returned to the gdb process. After a very long fight understanding Tokenizer and with how the data is stored internally in Tokenizer values I 've managed to obtain the following
(gdb) p *(Tokenizer *) self $37 = { ob_base = { ob_refcnt = 2, ob_type = 0x7fa7df1004a0 <TokenizerType> }, text = { object = 0x56178bc3ab40, length = 633838, kind = 2, data = 0x56178bc3ab88 }, topstack = 0x7fa84fc9d720, head = 633087, global = 0, depth = 39, route_state = 0, route_context = 0, bad_routes = 0x7fa849fd91f8, skip_style_tags = 0 }
which points out we have a text unicode object, with kind=2, that is UCS2 (since pep393 unicode strings in python are size variable, and that's the way it's implemeted AFAICT), of length 633838 starting at address 0x56178bc3ab88. . Calculating 0x56178bc3ab88 + 633838*2 bytes for the end address we have 0x56178BD70364. I 've dumped then that memory.
dump memory esbugmem 0x56178bc3ab88 0x56178BD70364
And crafted the following simple python script to parse it
import mwparserfromhell import time import os import glob # The file is binary memory dump, so read it as binary and then decode it as utf16 which for the purposes of this reproduction is close enough to UCS2. with open('esbugmem', 'rb') as f: data = f.read() text = data.decode('utf16') mwparserfromhell.parse(text)
using the virtualenv environment we have on ores hosts in order to reproduce fully the environment
Sure enough the thing has gone into a forever loop. It also is fully unresponsive to Ctrl-C making me think there is some catching of the SIGINT signal somewhere in mwparserfromhell.
For purposes of reproduction the memory dump is at
Now to figure out what is going on in that file.
The dumped memory when converted back to utf8 has the exact same MD5 as https://es.wikipedia.org/w/index.php?title=Usuario:Danielalfredo/Taller&oldid=111186880
which DOES NOT trigger the bug in its own.
This is weird.
OK trying to replicate... I have found that you absolutely *must* use the file provided by @akosiaris directly. Even re-saving it UTF-16 will not allow one to replicate the issue.
Here's some notes from IRC:
[09:34:05] <halfak> Aha! I wasn't using the right file. I have also managed to replicate! [09:34:38] <halfak> Indeed the signal does not work to stop it. [09:34:55] <halfak> Confirmed that ^C does nothing. [09:35:40] <halfak> OK. So now, I want to see if I can process this revision in labs without anything going crazy. [09:37:53] <halfak> Confirmed. I can reproduce on ores-staging with http://ores-staging.wmflabs.org/v3/scores/eswiki/111186880/damaging [09:38:54] <halfak> OK! We're getting somewhere. [09:39:06] <halfak> Now can I reproduce on my localhost in ORES? [09:49:01] <halfak> Aha! So, the problem persists on my local machine too (ubuntu 16.04) [09:49:45] <halfak> I need to kill -9 the process to get it to stop
So, now the trick is to figure out how the text can work and not work at the same time. It seems like celery (maybe?) is doing some memory gymnastics that are causing this to be totally un-parseable.
I tried simply pickling and un-pickling the text, but that doesn't seem to cause an issue.
I've shown that the timeout still works with the same revision in mwparserfromhell==0.5.0 So the problem seems to exist with mwparserfromhell==0.5.1
I can reproduce it with the version downloaded from wikipedia:
$ wget "https://es.wikipedia.org/w/index.php?title=Usuario:Danielalfredo/Taller&oldid=111186880&action=raw" -O 111186880.txt
Plus the code of P7674 which is only a tiny variation from the original P7672.
Maybe when moving the wikitext from the wikipedia revision to a local file something 'fixed' it?
$ md5sum 111186880.txt e306383a70dae20a0d6451a619ee5af2 111186880.txt
OK we have a good working demo script and an issue filed against mwparserfromhell. See https://github.com/earwig/mwparserfromhell/issues/206
Turns out version 0.5.0 and 0.5.1 don't matter. I was able to get the timeout to work in 0.5.0 because the compilation was failing and I was using a pure-python version of mwparserfromhell. When solving the compiling issue, it still refused to be interrupted.
OK so my proposal is that we deploy the python-only build of mwparserfromhell so that we can timeout effectively. I've run a basic performance test using the following code on Barack Obama -- a very large and difficult to parse article.
import time import mwapi from mwparserfromhell import parse result = mwapi.Session("https://en.wikipedia.org", user_agent="mwp test").get( action="query", prop="revisions", revids=863662486, rvprop=["content"], rvslots=["main"], formatversion=2) wikitext = result['query']['pages'][0]['revisions'][0]['slots']['main']['content'] start = time.time() for i in range(50): parse(wikitext) print("Parsing took {0} seconds on average.".format((time.time() - start) / 50))
With the binary C parser:
$ python demo_parser_performance.py Parsing took 0.16366254806518554 seconds on average.
With the pure python parser:
$ python demo_parser_performance.py Parsing took 0.3874843120574951 seconds on average.
So we're about 2x slower. I think that is acceptable when trading for robustness until this bug is fixed.
Thanks for that. I did figure out why I could not reproduce it before. My locally downloaded version of the revision had a single newline at the end of file.
--- 111186880.txt 2018-10-12 21:19:24.246644446 +0000 +++ rev111186880 2018-10-12 09:19:27.321432415 +0000 @@ -20597,4 +20597,4 @@ | '''{{bandera2|Montenegro}}''' | '''{{bandera2|San Marino}}''' | '''{{bandera2|Australia}}''' -|- \ No newline at end of file +|-
I have no idea why that newline makes it work, but here it is.
Overall probably yes, but keep in mind this specific article has revisions that go into the 2+ secs of parsing time with mwparserfromhell. Sure, it may very well be an edge case, just keep it in mind.
python3 lala.py 'rev1*' revision: rev110083490, seconds 2.119253635406494 revision: rev110266010, seconds 2.717501401901245 revision: rev109308712, seconds 1.0890047550201416 revision: rev110241454, seconds 2.605189323425293 revision: rev109663898, seconds 1.7907164096832275 revision: rev109181061, seconds 1.0200517177581787 revision: rev109070247, seconds 1.0225756168365479 revision: rev110530543, seconds 2.9767608642578125 revision: rev109273505, seconds 1.1006214618682861 revision: rev110335113, seconds 2.7629189491271973 revision: rev109005759, seconds 0.916459321975708 revision: rev109204713, seconds 1.0664558410644531 revision: rev109110956, seconds 1.0236058235168457 revision: rev109096194, seconds 1.0882329940795898 revision: rev110550913, seconds 2.8583431243896484 revision: rev109504701, seconds 1.1853852272033691 revision: rev110083820, seconds 2.0748698711395264 revision: rev109665173, seconds 1.7461316585540771 revision: rev109235455, seconds 1.0167853832244873 revision: rev110105440, seconds 2.2856404781341553 revision: rev110265710, seconds 2.5583033561706543 revision: rev110310626, seconds 2.757080316543579 revision: rev109111816, seconds 1.0812571048736572 revision: rev109265752, seconds 1.1079812049865723 revision: rev110084923, seconds 2.384281635284424 revision: rev109099274, seconds 1.000997543334961 revision: rev109104295, seconds 0.9784274101257324 revision: rev109006184, seconds 0.8562686443328857 revision: rev109988094, seconds 2.334792137145996 revision: rev109071346, seconds 0.879063606262207 revision: rev109082529, seconds 0.9722130298614502
Also note that the wikitext of this page ranges in the 600k bytes whereas the wikitext for the Barack Obama page is ~300k
We're still waiting on a release of mwparsefromhell to allow us to interrupt long jobs like this. If this revision gets scored again, it's going to lock up a worker.
I just pinged @Earwig again here: https://github.com/earwig/mwparserfromhell/issues/206#issuecomment-435079037
Sorry for a bunch of pings, @Earwig! Thanks for your help. :)
Change 477310 had a related patch set uploaded (by Halfak; owner: Halfak):
[research/ores/wheels@master] Bumps mwparserfromhell to 0.5.2