Page MenuHomePhabricator

Investigate why current es2r daemon is randomly hanging
Open, LowPublic

Description

es2r (even streams to redis) is component of XmlRcs which is fetching events and puts them to redis. It's a 24 / 7 daemon, with trivial source code based on sample code: https://github.com/huggle/XMLRCS/blob/master/src/es2r/es2r.py

It however does hang-up for unknown reasons, sometimes after several days, sometimes after weeks and it doesn't show any obvious errors. It's necessary to fix it so that connection to EventStream is persistent and capable to handle any kind of connection problems.

Event Timeline

Petrb created this task.Nov 7 2017, 9:31 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 7 2017, 9:31 PM
Petrb added a comment.Nov 7 2017, 9:32 PM

An example python code from wikimedia stream team which is capable to hold a working connection forever and handles all kinds of connection errors would be great.

Petrb triaged this task as High priority.Nov 11 2017, 10:28 AM
stjn added a subscriber: stjn.Nov 11 2017, 6:31 PM
Petrb added a comment.Jan 15 2018, 3:04 AM

Just in case for XmlRcs maintainers the workaround right now in case the python script hangs, from labs:

ssh xmlrcs
petrb@xmlrcs:~$ ps -ef | grep xml
xmlrcs    3539     1  2  2017 ?        3-06:14:50 ././xmlrcsd -vvvd
xmlrcs   10811     1  0  2017 ?        00:00:01 /bin/sh ./start
xmlrcs   28573 10811  1  2017 ?        21:11:37 /usr/bin/python ./es2r.py
petrb    31268 31181  0 03:02 pts/1    00:00:00 grep --color=auto xml
petrb@xmlrcs:~$ sudo kill 28573
Xqt added a subscriber: Xqt.Jan 15 2018, 5:43 AM
zhuyifei1999 added a subscriber: zhuyifei1999.

sseclient hanging was also observed in another consumer of mine (based on pywikibot). I was unable to get a full traceback last time this happened, but I will investigate this.

Xqt added a comment.Jan 30 2018, 8:07 PM

I think you should set a timeout for the sse client to prevent hangings. You also could use pywikibot’s EventStreams which provides additional features like automatic re-establishing of the connection. Since few weeks requests >= 2.9 and sseclient >= 0.0.18 is required for recentchange stream.

Petrb added a comment.Jan 30 2018, 9:02 PM

Can you show me pywiki bot EventStream so that I don't need to spend 2 weeks searching its source code tree? :)

Xqt added a comment.EditedJan 31 2018, 6:19 AM

Sure. Either you may use pagegenerators -live recentchanges option or the module itself which is in pywikibot.comms.eventstreams

zhuyifei1999 removed zhuyifei1999 as the assignee of this task.Feb 18 2018, 7:54 PM

Unlicking this- wasn't able to reproduce in a month's continuous running of three sseclients.

Petrb added a comment.Mar 29 2018, 7:14 PM

it just happened now, the problem with this issue is that it happens randomly like once in a year :( so it's not easy to reproduce

Petrb added a comment.Mar 29 2018, 7:15 PM

in stdout I see

Traceback (most recent call last):
  File "./es2r.py", line 18, in <module>
    for event in EventSource(url):
  File "build/bdist.linux-x86_64/egg/sseclient.py", line 39, in __init__
  File "build/bdist.linux-x86_64/egg/sseclient.py", line 52, in _connect
  File "/usr/lib/python2.7/dist-packages/requests/models.py", line 893, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://stream.wikimedia.org/v2/stream/recentchange

But there is lot of these errors, it's possible that when it locks up there is no error msg, it just stops sending any data to redis

Indeed, both of the daemons I kept running earlier is affected. The 60-second time SIGALRM is at 2018-03-29T07:44:35.389710 & 2018-03-29T07:44:32.640241, so the last message happened around 60 seconds prior to the timestamps.

Last events received:

>>> event
<sseclient.Event object at 0x7fd948994310>
>>> dir(event)
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'data', 'dump', 'event', 'id', 'parse', 'retry', 'sse_line_pattern']
>>> event.id
u'[{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":27021971},{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":802547042}]'
>>> event.data
u'{"bot":false,"comment":"/* wbsetclaim-update:2||1 */ [[Property:P19]]: [[Q142102]]","id":693710948,"length":{"new":5086,"old":5084},"meta":{"domain":"www.wikidata.org","dt":"2018-03-29T07:43:31+00:00","id":"e0e61373-3324-11e8-b520-b083fecf0db3","request_id":"dd980328-f56d-405c-a58c-647e2f54cee1","schema_uri":"mediawiki/recentchange/1","topic":"eqiad.mediawiki.recentchange","uri":"https://www.wikidata.org/wiki/Q30223516","partition":0,"offset":802547041},"minor":false,"namespace":0,"parsedcomment":"\u200e<span dir=\\"auto\\"><span class=\\"autocomment\\">Changed claim: </span></span> <a href=\\"/wiki/Property:P19\\" title=\\"Property:P19\\">Property:P19</a>: <a href=\\"/wiki/Q142102\\" title=\\"Q142102\\">Q142102</a>","patrolled":true,"revision":{"new":658277449,"old":630033240},"server_name":"www.wikidata.org","server_script_path":"/w","server_url":"https://www.wikidata.org","timestamp":1522309411,"title":"Q30223516","type":"edit","user":"FromZeFild","wiki":"wikidatawiki"}'
>>> event
<sseclient.Event object at 0x7f3e7421f1d0>
>>> event.id
u'[{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":27021971},{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":802547069}]'
>>> event.data
u'{"bot":false,"comment":"AWB \u0c35\u0c3e\u0c21\u0c3f RETF \u0c2e\u0c3e\u0c30\u0c4d\u0c2a\u0c41\u0c32\u0c41 \u0c1a\u0c47\u0c38\u0c3e\u0c28\u0c41, typos fixed: \u0c05\u0c38\u0c46\u0c02\u0c2c\u0c4d\u0c32\u0c40 \u2192 \u0c36\u0c3e\u0c38\u0c28\u0c38\u0c2d, \u0c2a\u0c4b\u0c32\u0c3f\u0c02\u0c17\u0c4d \u0c38\u0c4d\u0c1f\u0c47\u0c37\u0c28\u0c4d \u2192 \u0c2a\u0c4b\u0c32\u0c3f\u0c02\u0c17\u0c4d \u0c15\u0c47\u0c02\u0c26\u0c4d\u0c30\u0c02,   using [[Project:AWB|AWB]]","id":3321024,"length":{"new":15396,"old":15402},"meta":{"domain":"te.wikipedia.org","dt":"2018-03-29T07:43:34+00:00","id":"e284238d-3324-11e8-91e6-1866da98024d","request_id":"8e37a345-15e4-4353-9920-a64dbea62aed","schema_uri":"mediawiki/recentchange/1","topic":"eqiad.mediawiki.recentchange","uri":"https://te.wikipedia.org/wiki/%E0%B0%AC%E0%B0%BE%E0%B0%A8%E0%B0%BE%E0%B0%9C%E0%B1%80%E0%B0%AC%E0%B0%82%E0%B0%A7%E0%B0%82","partition":0,"offset":802547068},"minor":true,"namespace":0,"parsedcomment":"AWB \u0c35\u0c3e\u0c21\u0c3f RETF \u0c2e\u0c3e\u0c30\u0c4d\u0c2a\u0c41\u0c32\u0c41 \u0c1a\u0c47\u0c38\u0c3e\u0c28\u0c41, typos fixed: \u0c05\u0c38\u0c46\u0c02\u0c2c\u0c4d\u0c32\u0c40 \u2192 \u0c36\u0c3e\u0c38\u0c28\u0c38\u0c2d, \u0c2a\u0c4b\u0c32\u0c3f\u0c02\u0c17\u0c4d \u0c38\u0c4d\u0c1f\u0c47\u0c37\u0c28\u0c4d \u2192 \u0c2a\u0c4b\u0c32\u0c3f\u0c02\u0c17\u0c4d \u0c15\u0c47\u0c02\u0c26\u0c4d\u0c30\u0c02,   using <a href=\\"/wiki/%E0%B0%B5%E0%B0%BF%E0%B0%95%E0%B1%80%E0%B0%AA%E0%B1%80%E0%B0%A1%E0%B0%BF%E0%B0%AF%E0%B0%BE:AWB\\" class=\\"mw-redirect\\" title=\\"\u0c35\u0c3f\u0c15\u0c40\u0c2a\u0c40\u0c21\u0c3f\u0c2f\u0c3e:AWB\\">AWB</a>","revision":{"new":2323442,"old":2278745},"server_name":"te.wikipedia.org","server_script_path":"/w","server_url":"https://te.wikipedia.org","timestamp":1522309414,"title":"\u0c2c\u0c3e\u0c28\u0c3e\u0c1c\u0c40\u0c2c\u0c02\u0c27\u0c02","type":"edit","user":"ChaduvariAWBNew","wiki":"tewiki"}'

Nothing obvious...

The gdb bt full: P6915, nothing obvious either. I kind of believe this is a server side issue since all clients are affected.

Petrb added a comment.Aug 25 2018, 6:13 PM

Is it possible that wikimedia design is breaking the ability of library to reconnect? This setup seem to be behind proxy and when something on WMF side breaks, the server will start returning 502 bad gateway instead of some "event stream" standard error, which the ES python library may not be designed to handle, so it just simply break and stop operating indefinitely?

Petrb added a comment.Aug 25 2018, 6:15 PM

So far the only workaround that comes to my mind is to write another daemon that would be checking if es2r is producing any data and in case that it doesn't for like 5 seconds, it would restart the es2r python script

So far the only workaround that comes to my mind is to write another daemon that would be checking if es2r is producing any data and in case that it doesn't for like 5 seconds, it would restart the es2r python script

Sounds exactly what I would use signal.alarm for.

Petrb lowered the priority of this task from High to Low.Sep 25 2018, 8:54 PM

I created a watchdog that kills the daemon when it's stuck, it's a nasty workaround but for now it's working https://github.com/huggle/XMLRCS/commit/bf5c3117e8baf00a70caecb5f83cf06372fe5f3c

Petrb moved this task from Backlog to Code / Bug on the Huggle board.Oct 10 2018, 6:16 PM
Samwalton9 added a subscriber: Samwalton9.EditedMay 20 2019, 8:46 AM

I've been getting a similar issue at the Hashtags tool - my EventStream monitoring script, using sseclient, appears to hang and stop receiving data randomly every few days/weeks, but doesn't actually exit with any error (T214060).

Samwalton9 added a subscriber: Ottomata.

@Ottomata Do you have any insight here - is this likely to be an issue on our end or something from the EventStream itself?

I suppose that the connection is left open but for some reason server stops sending any new data to these "old" connections?

That means - I believe it's a server side issue, nothing that can be easily fixed on client side

Hmmmmm! If this happens only once or twice a year, I'd expect it has something to do with EventStreams service restarts, and clients failing reconnects. It seems like you are all using Python sseclient here, yes? If it was totally server side, I'd expect a client side process restart to not fix this.

@Xqt, do your appveyor CI tests always fail like that? If so I we can try to reproduce and I can capture some server side logs.