Page MenuHomePhabricator

Investigate why current es2r daemon is randomly hanging
Open, LowPublic

Description

es2r (even streams to redis) is component of XmlRcs which is fetching events and puts them to redis. It's a 24 / 7 daemon, with trivial source code based on sample code: https://github.com/huggle/XMLRCS/blob/master/src/es2r/es2r.py

It however does hang-up for unknown reasons, sometimes after several days, sometimes after weeks and it doesn't show any obvious errors. It's necessary to fix it so that connection to EventStream is persistent and capable to handle any kind of connection problems.

Event Timeline

An example python code from wikimedia stream team which is capable to hold a working connection forever and handles all kinds of connection errors would be great.

Petrb triaged this task as High priority.Nov 11 2017, 10:28 AM

Just in case for XmlRcs maintainers the workaround right now in case the python script hangs, from labs:

ssh xmlrcs
petrb@xmlrcs:~$ ps -ef | grep xml
xmlrcs    3539     1  2  2017 ?        3-06:14:50 ././xmlrcsd -vvvd
xmlrcs   10811     1  0  2017 ?        00:00:01 /bin/sh ./start
xmlrcs   28573 10811  1  2017 ?        21:11:37 /usr/bin/python ./es2r.py
petrb    31268 31181  0 03:02 pts/1    00:00:00 grep --color=auto xml
petrb@xmlrcs:~$ sudo kill 28573
zhuyifei1999 added a subscriber: zhuyifei1999.

sseclient hanging was also observed in another consumer of mine (based on pywikibot). I was unable to get a full traceback last time this happened, but I will investigate this.

I think you should set a timeout for the sse client to prevent hangings. You also could use pywikibot’s EventStreams which provides additional features like automatic re-establishing of the connection. Since few weeks requests >= 2.9 and sseclient >= 0.0.18 is required for recentchange stream.

Can you show me pywiki bot EventStream so that I don't need to spend 2 weeks searching its source code tree? :)

Sure. Either you may use pagegenerators -live recentchanges option or the module itself which is in pywikibot.comms.eventstreams

Unlicking this- wasn't able to reproduce in a month's continuous running of three sseclients.

it just happened now, the problem with this issue is that it happens randomly like once in a year :( so it's not easy to reproduce

in stdout I see

Traceback (most recent call last):
  File "./es2r.py", line 18, in <module>
    for event in EventSource(url):
  File "build/bdist.linux-x86_64/egg/sseclient.py", line 39, in __init__
  File "build/bdist.linux-x86_64/egg/sseclient.py", line 52, in _connect
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 893, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://stream.wikimedia.org/v2/stream/recentchange

But there is lot of these errors, it's possible that when it locks up there is no error msg, it just stops sending any data to redis

Indeed, both of the daemons I kept running earlier is affected. The 60-second time SIGALRM is at 2018-03-29T07:44:35.389710 & 2018-03-29T07:44:32.640241, so the last message happened around 60 seconds prior to the timestamps.

Last events received:

>>> event
<sseclient.Event object at 0x7fd948994310>
>>> dir(event)
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'data', 'dump', 'event', 'id', 'parse', 'retry', 'sse_line_pattern']
>>> event.id
u'[{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":27021971},{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":802547042}]'
>>> event.data
u'{"bot":false,"comment":"/* wbsetclaim-update:2||1 */ [[Property:P19]]: [[Q142102]]","id":693710948,"length":{"new":5086,"old":5084},"meta":{"domain":"www.wikidata.org","dt":"2018-03-29T07:43:31+00:00","id":"e0e61373-3324-11e8-b520-b083fecf0db3","request_id":"dd980328-f56d-405c-a58c-647e2f54cee1","schema_uri":"mediawiki/recentchange/1","topic":"eqiad.mediawiki.recentchange","uri":"https://www.wikidata.org/wiki/Q30223516","partition":0,"offset":802547041},"minor":false,"namespace":0,"parsedcomment":"\u200e<span dir=\\"auto\\"><span class=\\"autocomment\\">Changed claim: </span></span> <a href=\\"/wiki/Property:P19\\" title=\\"Property:P19\\">Property:P19</a>: <a href=\\"/wiki/Q142102\\" title=\\"Q142102\\">Q142102</a>","patrolled":true,"revision":{"new":658277449,"old":630033240},"server_name":"www.wikidata.org","server_script_path":"/w","server_url":"https://www.wikidata.org","timestamp":1522309411,"title":"Q30223516","type":"edit","user":"FromZeFild","wiki":"wikidatawiki"}'
>>> event
<sseclient.Event object at 0x7f3e7421f1d0>
>>> event.id
u'[{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":27021971},{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":802547069}]'
>>> event.data
u'{"bot":false,"comment":"AWB \u0c35\u0c3e\u0c21\u0c3f RETF \u0c2e\u0c3e\u0c30\u0c4d\u0c2a\u0c41\u0c32\u0c41 \u0c1a\u0c47\u0c38\u0c3e\u0c28\u0c41, typos fixed: \u0c05\u0c38\u0c46\u0c02\u0c2c\u0c4d\u0c32\u0c40 \u2192 \u0c36\u0c3e\u0c38\u0c28\u0c38\u0c2d, \u0c2a\u0c4b\u0c32\u0c3f\u0c02\u0c17\u0c4d \u0c38\u0c4d\u0c1f\u0c47\u0c37\u0c28\u0c4d \u2192 \u0c2a\u0c4b\u0c32\u0c3f\u0c02\u0c17\u0c4d \u0c15\u0c47\u0c02\u0c26\u0c4d\u0c30\u0c02,   using [[Project:AWB|AWB]]","id":3321024,"length":{"new":15396,"old":15402},"meta":{"domain":"te.wikipedia.org","dt":"2018-03-29T07:43:34+00:00","id":"e284238d-3324-11e8-91e6-1866da98024d","request_id":"8e37a345-15e4-4353-9920-a64dbea62aed","schema_uri":"mediawiki/recentchange/1","topic":"eqiad.mediawiki.recentchange","uri":"https://te.wikipedia.org/wiki/%E0%B0%AC%E0%B0%BE%E0%B0%A8%E0%B0%BE%E0%B0%9C%E0%B1%80%E0%B0%AC%E0%B0%82%E0%B0%A7%E0%B0%82","partition":0,"offset":802547068},"minor":true,"namespace":0,"parsedcomment":"AWB \u0c35\u0c3e\u0c21\u0c3f RETF \u0c2e\u0c3e\u0c30\u0c4d\u0c2a\u0c41\u0c32\u0c41 \u0c1a\u0c47\u0c38\u0c3e\u0c28\u0c41, typos fixed: \u0c05\u0c38\u0c46\u0c02\u0c2c\u0c4d\u0c32\u0c40 \u2192 \u0c36\u0c3e\u0c38\u0c28\u0c38\u0c2d, \u0c2a\u0c4b\u0c32\u0c3f\u0c02\u0c17\u0c4d \u0c38\u0c4d\u0c1f\u0c47\u0c37\u0c28\u0c4d \u2192 \u0c2a\u0c4b\u0c32\u0c3f\u0c02\u0c17\u0c4d \u0c15\u0c47\u0c02\u0c26\u0c4d\u0c30\u0c02,   using <a href=\\"/wiki/%E0%B0%B5%E0%B0%BF%E0%B0%95%E0%B1%80%E0%B0%AA%E0%B1%80%E0%B0%A1%E0%B0%BF%E0%B0%AF%E0%B0%BE:AWB\\" class=\\"mw-redirect\\" title=\\"\u0c35\u0c3f\u0c15\u0c40\u0c2a\u0c40\u0c21\u0c3f\u0c2f\u0c3e:AWB\\">AWB</a>","revision":{"new":2323442,"old":2278745},"server_name":"te.wikipedia.org","server_script_path":"/w","server_url":"https://te.wikipedia.org","timestamp":1522309414,"title":"\u0c2c\u0c3e\u0c28\u0c3e\u0c1c\u0c40\u0c2c\u0c02\u0c27\u0c02","type":"edit","user":"ChaduvariAWBNew","wiki":"tewiki"}'

Nothing obvious...

The gdb bt full: P6915, nothing obvious either. I kind of believe this is a server side issue since all clients are affected.

Is it possible that wikimedia design is breaking the ability of library to reconnect? This setup seem to be behind proxy and when something on WMF side breaks, the server will start returning 502 bad gateway instead of some "event stream" standard error, which the ES python library may not be designed to handle, so it just simply break and stop operating indefinitely?

So far the only workaround that comes to my mind is to write another daemon that would be checking if es2r is producing any data and in case that it doesn't for like 5 seconds, it would restart the es2r python script

So far the only workaround that comes to my mind is to write another daemon that would be checking if es2r is producing any data and in case that it doesn't for like 5 seconds, it would restart the es2r python script

Sounds exactly what I would use signal.alarm for.

Petrb lowered the priority of this task from High to Low.Sep 25 2018, 8:54 PM

I created a watchdog that kills the daemon when it's stuck, it's a nasty workaround but for now it's working https://github.com/huggle/XMLRCS/commit/bf5c3117e8baf00a70caecb5f83cf06372fe5f3c

I've been getting a similar issue at the Hashtags tool - my EventStream monitoring script, using sseclient, appears to hang and stop receiving data randomly every few days/weeks, but doesn't actually exit with any error (T214060).

Samwalton9 added a subscriber: Ottomata.

@Ottomata Do you have any insight here - is this likely to be an issue on our end or something from the EventStream itself?

I suppose that the connection is left open but for some reason server stops sending any new data to these "old" connections?

That means - I believe it's a server side issue, nothing that can be easily fixed on client side

Hmmmmm! If this happens only once or twice a year, I'd expect it has something to do with EventStreams service restarts, and clients failing reconnects. It seems like you are all using Python sseclient here, yes? If it was totally server side, I'd expect a client side process restart to not fix this.

@Xqt, do your appveyor CI tests always fail like that? If so I we can try to reproduce and I can capture some server side logs.

I have also observed this behaviour in various projects of mine, for instance EditGroups, which also uses the Python sseclient library to consume the stream.
My workaround is ugly: I have another script which checks how old the last event ingested in the tool's local database is, and restarts the listener if it gets too old.
It looks like we could do with a small Python library which would reliably expose WMF streams as Python iterators.