Ores hosts: mwparserfromhell tokenizer random segfault
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Volans
	May 9 2019, 10:03 AM

Description

Debugging something completely different I've noticed that the ores hosts are logging few segfault errors in syslog at random.
It seems that they are all the same and reference _tokenizer.cpython-35m-x86_64-linux-gnu.so that seems to come from the mwparserfromhell Python package included in ores's virtualenvs.

$ sudo cumin -x 'A:ores' "zgrep -c 'segfault' /var/log/syslog.* | grep -v ':0$'"
IGNORE EXIT CODES mode enabled, all commands executed will be considered successful
18 hosts will be targeted:
ores[2001-2009].codfw.wmnet,ores[1001-1009].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) ores1002.eqiad.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.7.gz:2
===== NODE GROUP =====
(2) ores2009.codfw.wmnet,ores1003.eqiad.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.1:1
===== NODE GROUP =====
(1) ores2007.codfw.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.1:1
/var/log/syslog.3.gz:1
/var/log/syslog.4.gz:1
/var/log/syslog.5.gz:1
===== NODE GROUP =====
(1) ores1009.eqiad.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.1:1
/var/log/syslog.4.gz:1
===== NODE GROUP =====
(1) ores2001.codfw.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.7.gz:1
===== NODE GROUP =====
(1) ores2003.codfw.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.4.gz:1
/var/log/syslog.7.gz:1
===== NODE GROUP =====
(1) ores1006.eqiad.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.4.gz:1
/var/log/syslog.6.gz:1
===== NODE GROUP =====
(1) ores2004.codfw.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.6.gz:1
===== NODE GROUP =====
(1) ores2002.codfw.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.2.gz:1
===== NODE GROUP =====
(1) ores1004.eqiad.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.2.gz:1
/var/log/syslog.3.gz:1
===== NODE GROUP =====
(2) ores2005.codfw.wmnet,ores1005.eqiad.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.5.gz:1
===== NODE GROUP =====
(1) ores1001.eqiad.wmnet
----- OUTPUT of 'zgrep -c 'segfau... | grep -v ':0$'' -----
/var/log/syslog.5.gz:1
/var/log/syslog.7.gz:2
================
PASS:  |██████████████████████████████████████████████████████████████████████████████████████████████████| 100% (18/18) [00:00<00:00, 20.95hosts/s]
FAIL:  |                                                                                                           |   0% (0/18) [00:00<?, ?hosts/s]
100.0% (18/18) success ratio (>= 100.0% threshold) for command: 'zgrep -c 'segfau... | grep -v ':0$''.
100.0% (18/18) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

This is a specific log example:

May  8 22:57:12 ores2007 kernel: [4695113.152807] celery[24047]: segfault at 8 ip 00007f0b48d8af10 sp 00007ffeb72bb610 error 4 in _tokenizer.cpython-35m-x86_64-linux-gnu.so[7f0b48d84000+d000]
May  8 22:57:12 ores2007 celery-ores-worker[13749]: [2019-05-08 22:57:12,361: ERROR/MainProcess] Process 'ForkPoolWorker-14771' pid:24047 exited with 'signal 11 (SIGSEGV)'

I didn't find any related bug upstream that shouldn't be already included into our deployed version of mwparserfromhell.
Unless is something known we might need to enable core dumps and wait for a repro.

Details

Subject	Repo	Branch	Lines +/-
Updates wheels for revscoring 2.4.0	research/ores/wheels	master	+66 -57
ores::worker: allow celery to emit a core dump upon segfault	operations/puppet	production	+3 -0
ores::web: allow uwsgi to dump a core after a segfault	operations/puppet	production	+1 -0

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Open	None	T227094 Update RC Filters for new ORES capacities (July, 2019)
Resolved	SBisson	T225561 Update ORES thresholds for nlwiki
Open	None	T223273 Update srwiki thresholds for goodfaith model
Resolved	SBisson	T225562 Deploy ORES filters for zhwiki
Open	None	T225563 Deploy ORES filters for jawiki
Resolved	Halfak	T224484 ORES deployment: Early June
Resolved	Halfak	T222866 Ores hosts: mwparserfromhell tokenizer random segfault

Event Timeline

Volans created this task.May 9 2019, 10:03 AM

Restricted Application added a project: Machine-Learning-Team. · View Herald TranscriptMay 9 2019, 10:03 AM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

elukey subscribed.May 9 2019, 10:22 AM

Change 509059 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] ores::web: allow uwsgi to dump a core after a segfault

https://gerrit.wikimedia.org/r/509059

Change 509059 abandoned by Elukey:
ores::web: allow uwsgi to dump a core after a segfault

https://gerrit.wikimedia.org/r/509059

Change 509060 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] ores::worker: allow celery to emit a core dump upon segfault

https://gerrit.wikimedia.org/r/509060

Change 509060 merged by Elukey:
[operations/puppet@production] ores::worker: allow celery to emit a core dump upon segfault

https://gerrit.wikimedia.org/r/509060

Mentioned in SAL (#wikimedia-operations) [2019-05-13T07:08:24Z] <elukey> slow roll restart of celery on ores* nodes to allow cores to be generated upon segfault - T222866

Legoktm added a subscriber: Earwig.May 13 2019, 7:29 AM

Legoktm subscribed.

Harej triaged this task as Medium priority.May 13 2019, 6:20 PM

Harej moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.

On ores2009 there is already a stacktrace. In order to get the python bt, I had to do some quirks:

install python3-dbg (that gives a lot of tools like py-bt in gdb)
manually fix /usr/share/gdb/auto-load/usr/bin/python3.5-gdb.py like indicated in https://stackoverflow.com/a/53139161 (was getting the same weird exception)
modify MAX_OUTPUT_LEN in /usr/share/gdb/auto-load/usr/bin/python3.5-gdb.py to allow printing long strings.
execute source /usr/share/gdb/auto-load/usr/bin/python3.5-gdb.py in gdb to be able to use py-bt

sudo gdb /srv/deployment/ores/deploy/venv/bin/python /var/tmp/core/core.ores2009.celery.29574.1557752410
(gdb) source /usr/share/gdb/auto-load/usr/bin/python3.5-gdb.py
(gdb) py-bt
Traceback (most recent call first):
  <built-in method tokenize of _tokenizer.CTokenizer object at remote 0x7f5876720c60>
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/mwparserfromhell/parser/__init__.py", line 93, in parse
    tokens = self._tokenizer.tokenize(text, context, skip_style_tags)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/mwparserfromhell/utils.py", line 58, in parse_anything
    return Parser().parse(value, context, skip_style_tags)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/revscoring/features/wikitext/datasources/parsed.py", line 210, in _process_wikicode
    return mwparserfromhell.parse(text)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/revscoring/dependencies/dependent.py", line 54, in __call__
    return self.process(*args, **kwargs)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/revscoring/dependencies/functions.py", line 244, in _solve
    value = dependent(*args)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/revscoring/dependencies/functions.py", line 237, in _solve
    profile=profile)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/revscoring/dependencies/functions.py", line 237, in _solve
    profile=profile)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/revscoring/dependencies/functions.py", line 237, in _solve
    profile=profile)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/revscoring/dependencies/functions.py", line 237, in _solve
    profile=profile)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/revscoring/dependencies/functions.py", line 237, in _solve
    profile=profile)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/revscoring/dependencies/functions.py", line 269, in _solve_many
    profile=profile)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/ores/scoring_context.py", line 101, in _solve_features
    return list(self.extractor.solve(features, cache=dependency_cache))
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/ores/scoring_context.py", line 121, in _process_score
    feature_values = self._solve_features(model_name, dependency_cache)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/ores/scoring_context.py", line 85, in process_model_scores
    self._process_score(model_name, dependency_cache=root_cache)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/ores/util.py", line 25, in timeout
    result = func(*args, **kwargs)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/ores/scoring_systems/scoring_system.py", line 166, in _process_score_map
    seconds=self.timeout)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/ores/scoring_systems/celery_queue.py", line 71, in _process_score_map
    root_cache=root_cache)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/celery/app/trace.py", line 632, in __protected_call__
    return self.run(*args, **kwargs)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/celery/app/trace.py", line 375, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/celery/app/trace.py", line 540, in _fast_trace_task
    uuid, args, kwargs, request,
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/billiard/pool.py", line 358, in workloop
    result = (True, prepare_result(fun(*args, **kwargs)))
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/billiard/pool.py", line 289, in __call__
    sys.exit(self.workloop(pid=pid))
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/billiard/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/deployment/ores/deploy-cache/revs/52e9759ac46cbcde58078d444fbb804f50a4ecc0/venv/lib/python3.5/site-packages/billiard/process.py", line 327, in _bootstrap
[..]

And now the C backtrace:

(gdb) bt
#0  Tokenizer_parse_heading (self=0x7f5876720c60) at mwparserfromhell/parser/ctokenizer/tok_parse.c:817
#1  Tokenizer_parse (self=self@entry=0x7f5876720c60, context=<optimized out>, push=push@entry=1) at mwparserfromhell/parser/ctokenizer/tok_parse.c:2665
#2  0x00007f589450e6a0 in Tokenizer_tokenize (self=0x7f5876720c60, args=
    ("{{Wikidata list\n|sparql=SELECT ?item WHERE { ?item wdt:P31 wd:Q39715 }\n|section=131\n|sort=label\n|columns=label:Artikel,P17:Land,P625:Koordinaten,P18\n|thumb=128\n|min_section=2\n}}\n== <no value> ==\n{| class='wikitable sortable' style='width:100%'\n!Artikel\n!Land\n!Koordinaten\n!Bild\n|-\n| ''[[:d:Q38821070|Finngrundet]]''\n| <no value>\n| {{Coordinate|text=DMS|NS=60.984361111111|EW=18.605277777778|name=Q38821070|simple=y|type=landmark|region=x}}\n| \n|-\n| ''[[:d:Q10495401|Finnish lion]]''\n| <no value>\n| {{Coordinate|text=DMS|NS=59.4727|EW=20.81306667|name=Q10495401|simple=y|type=landmark|region=x}}\n| \n|-\n| ''[[:d:Q62103847|Finnish lion]]''\n| <no value>\n| {{Coordinate|text=DMS|NS=59.472666666667|EW=20.813138888889|name=Q62103847|simple=y|type=landmark|region=x}}\n| [[Datei:Marker light.jpg|center|128px]]\n|-\n| ''[[:d:Q10684952|Svenska Björn]]''\n| <no value>\n| {{Coordinate|text=DMS|NS=59.54833333|EW=20.01944444|name=Q10684952|simple=y|type=landmark|region=x}}\n| \n|-\n| ''[[:d:Q34187519|Tröskeln Östra]]'", 0, False)) at mwparserfromhell/parser/ctokenizer/tokenizer.c:167
#3  0x000056500b9af25f in PyCFunction_Call () at ../Objects/methodobject.c:109
#4  0x000056500b979d0f in call_function (oparg=<optimized out>, pp_stack=0x7ffd453e65d0) at ../Python/ceval.c:4720
#5  PyEval_EvalFrameEx () at ../Python/ceval.c:3251
#6  0x000056500b97eb06 in _PyEval_EvalCodeWithName.lto_priv.1929 () at ../Python/ceval.c:4033
#7  0x000056500b97a639 in fast_function (nk=<optimized out>, na=4, n=<optimized out>, pp_stack=0x7ffd453e67e0, func=<optimized out>) at ../Python/ceval.c:4828
#8  call_function (oparg=<optimized out>, pp_stack=0x7ffd453e67e0) at ../Python/ceval.c:4745
#9  PyEval_EvalFrameEx () at ../Python/ceval.c:3251
#10 0x000056500b97eb06 in _PyEval_EvalCodeWithName.lto_priv.1929 () at ../Python/ceval.c:4033
#11 0x000056500b97a639 in fast_function (nk=<optimized out>, na=1, n=<optimized out>, pp_stack=0x7ffd453e69f0, func=<optimized out>) at ../Python/ceval.c:4828
#12 call_function (oparg=<optimized out>, pp_stack=0x7ffd453e69f0) at ../Python/ceval.c:4745
#13 PyEval_EvalFrameEx () at ../Python/ceval.c:3251
#14 0x000056500b96716f in _PyEval_EvalCodeWithName.lto_priv.1929 (qualname=0x0, name=<optimized out>, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwcount=0, kws=<optimized out>, argcount=<optimized out>, args=<optimized out>,
    locals=<optimized out>, globals=<optimized out>, _co=<optimized out>) at ../Python/ceval.c:4033
#15 PyEval_EvalCodeEx () at ../Python/ceval.c:4054
#16 0x000056500b9b11ef in function_call.lto_priv () at ../Objects/funcobject.c:627
#17 0x000056500b9fae17 in PyObject_Call () at ../Objects/abstract.c:2166
#18 0x000056500b97784e in ext_do_call (nk=<optimized out>, na=<optimized out>, flags=<optimized out>, pp_stack=0x7ffd453e6ca8, func=<function at remote 0x7f589475ed08>) at ../Python/ceval.c:5049
#19 PyEval_EvalFrameEx () at ../Python/ceval.c:3290
#20 0x000056500b96716f in _PyEval_EvalCodeWithName.lto_priv.1929 (qualname=0x0, name=<optimized out>, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwcount=0, kws=<optimized out>, argcount=<optimized out>, args=<optimized out>,
    locals=<optimized out>, globals=<optimized out>, _co=<optimized out>) at ../Python/ceval.c:4033
#21 PyEval_EvalCodeEx () at ../Python/ceval.c:4054
#22 0x000056500b9b10e3 in function_call.lto_priv () at ../Objects/funcobject.c:627
#23 0x000056500b9fae17 in PyObject_Call () at ../Objects/abstract.c:2166
#24 0x000056500b91834e in method_call.lto_priv () at ../Objects/classobject.c:330
#25 0x000056500b9fae17 in PyObject_Call () at ../Objects/abstract.c:2166
#26 0x000056500b9a7a66 in slot_tp_call () at ../Objects/typeobject.c:6072
#27 0x000056500b9fae17 in PyObject_Call () at ../Objects/abstract.c:2166
#28 0x000056500b97784e in ext_do_call (nk=<optimized out>, na=<optimized out>, flags=<optimized out>, pp_stack=0x7ffd453e7048,
    func=<Datasource(calls=1, name='wikitext.revision.parent.wikicode', dependencies=[<Datasource(calls=0, name='revision.parent.text', dependencies=[], process=<function at remote 0x7f596c0f1e18>) at remote 0x7f5892058048>], process=<function at remote 0x7f589475ed08>) at remote 0x7f5892058c50>) at ../Python/ceval.c:5049
#29 PyEval_EvalFrameEx () at ../Python/ceval.c:3290

Setting this task to custom visibility (WMF-NDA) since the backtrace contains a way to make ores segfault (not sure if needed but we can remove it later on).

We are running version 0.5.2, but the failing code is still the same (checked downloading the source of 0.5.2 from GH):

https://github.com/earwig/mwparserfromhell/blob/develop/mwparserfromhell/parser/ctokenizer/tok_parse.c#L817

(gdb) frame
#0  Tokenizer_parse_heading (self=0x7f5876720c60) at mwparserfromhell/parser/ctokenizer/tok_parse.c:817
817	in mwparserfromhell/parser/ctokenizer/tok_parse.c
(gdb) print heading
$1 = (HeadingData *) 0x0

elukey changed the visibility from "Custom Policy" to "Custom Policy".May 14 2019, 8:08 AM

I 've dumped the memory of that variable.

p *(Tokenizer *) self
$1 = {
  ob_base = {
    ob_refcnt = 2, 
    ob_type = 0x7f58947114e0 <TokenizerType>
  }, 
  text = {
    object = "{{Wikidata list\n|sparql=SELECT ?item WHERE { ?item wdt:P31 wd:Q39715 }\n|section=131\n|sort=label\n|columns=label:Artikel,P17:Land,P625:Koordinaten,P18\n|thumb=128\n|min_section=2\n}}\n== <no value> ==\n{| class='wikitable sortable' style='width:100%'\n!Artikel\n!Land\n!Koordinaten\n!Bild\n|-\n| ''[[:d:Q38821070|Finngrundet]]''\n| <no value>\n| {{Coordinate|text=DMS|NS=60.984361111111|EW=18.605277777778|name=Q38821070|simple=y|type=landmark|region=x}}\n| \n|-\n| ''[[:d:Q10495401|Finnish lion]]''\n| <no value>\n| {{Coordinate|text=DMS|NS=59.4727|EW=20.81306667|name=Q10495401|simple=y|type=landmark|region=x}}\n| \n|-\n| ''[[:d:Q62103847|Finnish lion]]''\n| <no value>\n| {{Coordinate|text=DMS|NS=59.472666666667|EW=20.813138888889|name=Q62103847|simple=y|type=landmark|region=x}}\n| [[Datei:Marker light.jpg|center|128px]]\n|-\n| ''[[:d:Q10684952|Svenska Björn]]''\n| <no value>\n| {{Coordinate|text=DMS|NS=59.54833333|EW=20.01944444|name=Q10684952|simple=y|type=landmark|region=x}}\n| \n|-\n| ''[[:d:Q34187519|Tröskeln Östra]]'", 
    length = 1639948, 
    kind = 2, 
    data = 0x565061469788
  }, 
  topstack = 0x7f5968ff2210, 
  head = 237028, 
  global = 1, 
  depth = 9, 
  route_state = 0, 
  route_context = 128, 
  bad_routes = 0x7f5964eda248, 
  skip_style_tags = 0
}

UCS2 string (because of kind == 2) with a length of 1639948 chars (x2 for bytes), starting address 0x565061469788.

Then used the dump memory command with the calculated START/STOP parameters as per T206654#4661442

File is attached, but I am unable to reproduce using the following, despite using the exact same venv

import mwparserfromhell
import time
import os
import glob

# The file is binary memory dump, so read it as binary and then decode it as utf16 which for the purposes of this reproduction is close enough to UCS2.
with open('T222866', 'rb') as f:
    data = f.read()
    text = data.decode('utf16')
    mwparserfromhell.parse(text)

T2228663 MBDownload

akosiaris added subscribers: Halfak, Ladsgroup.May 14 2019, 9:42 AM

I've just released mwparserfromhell 0.5.4 with a fix for this specific bug (guarding that read with a NULL check and propagating the error instead). The interesting thing is that the conditions that lead to this crash should be very rare: the only situations I can think of are running out of memory or an exception being raised (like a KeyboardInterrupt) while we are in the middle of parsing a heading. I guess the latter is probably the cause, due to the timeout logic mentioned in T206654, which would also explain the reproducibility issue. (On my machine this page parses correctly in a couple seconds, but perhaps it's slow enough on ORES to trigger the timeout?)

Anyway, I've scanned through the code looking for similar issues and only found one other which should be even rarer. If there are frequent segfaults, they might be explained by other bugs... how often are we seeing them? (% of pages parsed?) I haven't aggressively tested the parser's behavior when exceptions are thrown arbitrarily in the middle of parsing; while it should work, it's difficult to test and I'm unfortunately not surprised that a bug like this slipped through.

@Earwig thanks a lot for the quick fix! The segfaults for this specific use case are very rare, once every day maximum for every ores host.. @Volans noticed them by chance while looking on ores hosts for a completely unrelated reason.

Next step is then to upgrade the library to 0.5.4.. @Halfak, @Ladsgroup I guess that the ball is in your court :)

In T222866#5183433, @Earwig wrote:

I've just released mwparserfromhell 0.5.4 with a fix for this specific bug (guarding that read with a NULL check and propagating the error instead). The interesting thing is that the conditions that lead to this crash should be very rare: the only situations I can think of are running out of memory or an exception being raised (like a KeyboardInterrupt) while we are in the middle of parsing a heading. I guess the latter is probably the cause, due to the timeout logic mentioned in T206654, which would also explain the reproducibility issue. (On my machine this page parses correctly in a couple seconds, but perhaps it's slow enough on ORES to trigger the timeout?)

Spot on! Good catch.

$ python3 T222866.py 
^CSegmentation fault
$ sudo dmesg | tail -1
[5246130.532404] python3[39407]: segfault at 8 ip 00007f07cea7ef10 sp 00007ffd9857a040 error 4 in _tokenizer.cpython-35m-x86_64-linux-gnu.so[7f07cea78000+d000]

Which is the same (minus the address differences which are expected) as the one we are investigating.

[Mon May 13 12:52:48 2019] celery[29574]: segfault at 8 ip 00007f589450af10 sp 00007ffd453e6410 error 4 in _tokenizer.cpython-35m-x86_64-linux-gnu.so[7f5894504000+d000]

For what is worth in production we do indeed see a pretty high time parsing that revision so indeed the timeout signal sending behavior could explain it.

time python3 T222866.py 

real	0m15.716s
user	0m15.228s
sys	0m0.488s

We currently have mwparserfromhell 0.5.2 deployed. So it looks like we're due for an update. I'll get on that.

In T222866#5185764, @Halfak wrote:

We currently have mwparserfromhell 0.5.2 deployed. So it looks like we're due for an update. I'll get on that.

@Halfak any plan to upgrade mwparserfromhell ?

Yeah. We're gearing up for a new deployment that will have a few updates to our core library and that will bring a dependency upgrade for mwparserfromhell. We're understaffed at the moment so I'm mostly blocked on review. :|

Halfak edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team.May 29 2019, 4:15 PM

https://gerrit.wikimedia.org/r/#/c/research/ores/wheels/+/513159

Halfak mentioned this in T224484: ORES deployment: Early June.May 29 2019, 6:12 PM

Halfak moved this task from Parked to Review on the Machine-Learning-Team (Active Tasks) board.May 29 2019, 6:14 PM

Halfak mentioned this in rORESWHEELS3d0f8101d075: Updates wheels for revscoring 2.4.0.May 29 2019, 10:51 PM

Halfak mentioned this in rORESWHEELS96a7587dc344: Updates wheels for revscoring 2.4.0.

Halfak moved this task from Review to Pending deployment on the Machine-Learning-Team (Active Tasks) board.Jun 10 2019, 4:08 PM

Halfak added a parent task: T224484: ORES deployment: Early June.Jun 11 2019, 7:36 PM

Halfak moved this task from Pending deployment to Completed on the Machine-Learning-Team (Active Tasks) board.Jun 18 2019, 1:38 PM

Halfak closed this task as Resolved.Jun 18 2019, 1:39 PM

@Halfak @akosiaris OK to make this task public now?