Page MenuHomePhabricator

Investigate how to keep the EventStream tracking script connected [8HR]
Closed, ResolvedPublicSpike

Description

The eventstream tracking script is seemingly randomly stopping data collection, sometimes in a way which produces errors, and other times producing no errors and not bringing the container down.

Silent errors

On occasion, the tool simply stops adding new entries to the database until the container is rebooted. No error is shown in the container logs. T214060 and T179986 have some discussion about potential causes and fixes, from the same issue in the Hashtags tool. This may not have happened since the June upgrades documented in T242767.

Server errors

Recently we have also been receiving the following style of error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 697, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 437, in _error_catcher
    yield
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 764, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 701, in _update_chunk_length
    raise httplib.IncompleteRead(line)
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/requests/models.py", line 751, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 572, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 793, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/sseclient.py", line 66, in __next__
    next_chunk = next(self.resp_iterator)
  File "/usr/local/lib/python3.5/site-packages/requests/models.py", line 754, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "manage.py", line 20, in <module>
    main()
  File "manage.py", line 16, in main
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.5/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.5/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.5/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.5/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/app/extlinks/links/management/commands/linkevents_collect.py", line 47, in handle
    for event in EventSource(url):
  File "/usr/local/lib/python3.5/site-packages/sseclient.py", line 74, in __next__
    self._connect()
  File "/usr/local/lib/python3.5/site-packages/sseclient.py", line 53, in _connect
    self.resp.raise_for_status()
  File "/usr/local/lib/python3.5/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://stream.wikimedia.org/v2/stream/page-links-change?since=2020-07-17T23:40:53Z

This has been more common than the silent error in recent months.

Investigation

We need to investigate solutions to maintaining the eventstream connection and/or reconnecting when a disconnect occurs.

In addition to connectivity issues we may also find that the script runs into an error or encounters an unexpected type of downtime. In this case we probably want to let the developers know so that further investigation can take place. As part of this investigation we would also like to understand what the best mechanism is here - for example, we could trigger an email to librarycard-dev@lists.wikimedia.org if the script errors or the latest data in the database is more than 6 hours out of date.

It may be worth noting that the eventstream client we're using, SSEClient, is out of date (0.0.22, most recent is 0.0.26). This discussion may be useful.

Avenues of investigation should include:

  • Understanding how the data collection script currently functions
  • Reading the discussions at T214060, T179986, and T250912
  • Chatting with the Analytics team if further input would be valuable
  • Understanding if we need direct support/work from the Analytics team
  • Determining whether there are steps we can take to maintain the connection on an ongoing basis
  • Evaluate the degree to which T258793 solved this situation
  • Solutions for detecting a loss of connection and reconnecting
  • A recommendation for how best to notify developers of a potential issue.

Event Timeline

Samwalton9 triaged this task as High priority.Apr 13 2020, 2:19 PM
Samwalton9 created this task.
Samwalton9 updated the task description. (Show Details)Jul 8 2020, 10:24 AM
Samwalton9 renamed this task from Eventstream tracking script stops working without errors to Eventstream tracking script should remain connected at all times.Jul 14 2020, 2:38 PM
Samwalton9 renamed this task from Eventstream tracking script should remain connected at all times to Investigate how to keep the EventStream tracking script connected.Jul 16 2020, 9:51 AM
Samwalton9 added a project: Spike.
Samwalton9 updated the task description. (Show Details)
Restricted Application changed the subtype of this task from "Task" to "Spike". · View Herald TranscriptJul 16 2020, 9:51 AM
Samwalton9 updated the task description. (Show Details)Jul 16 2020, 9:55 AM
Samwalton9 updated the task description. (Show Details)Jul 16 2020, 10:02 AM
Samwalton9 updated the task description. (Show Details)Jul 16 2020, 2:12 PM
Samwalton9 updated the task description. (Show Details)Aug 5 2020, 10:09 AM
Samwalton9 updated the task description. (Show Details)
Samwalton9 updated the task description. (Show Details)Aug 5 2020, 10:14 AM
Samwalton9 updated the task description. (Show Details)Aug 6 2020, 8:48 AM
Samwalton9 updated the task description. (Show Details)

In our initial exploration of our task priorities for Wikilink yesterday we noted the following on this topic:

  • Docker Swarm integration (T258793) may provide a quick fix for rebooting the container - Swarm actively monitors containers and will reboot any which go down.
  • We might want to explore a Django SSEClient. This could update the asynchronous computations we're exploring in T240673 directly.
  • We need to be aware of the issue that data from the EventStream doesn't always come in in date order - actively restarting the container too often can lead to issues where we create data gaps.
ARamirez_WMF renamed this task from Investigate how to keep the EventStream tracking script connected to Investigate how to keep the EventStream tracking script connected [8HR].Aug 11 2020, 2:09 PM

If y'all end up needing follow-up support from Analytics or any other teams in the Technology Department, please let me know.

I have been investigating this issue and have a couple of questions:

  1. According to the EvenStreams documentation, production services that consume streams should do it directly from Kafka. Is there a reason why we are not doing this?
  2. Does the EventStreams time out after a certain amount of time if a time stream connection hasn't finished?

I am asking the last question because I was experimenting with the EventStream API via curl and it always seems to timeout after approximately 15 minutes. It's not exactly 15 minutes every time, but 15:02, 15:43, etc. Since I was using the curlcommand and not the SSEClient, I thought the stream would not disconnect and return every page link change from a certain date successfully. Instead, it times out after 15 minutes and never finishes properly. If I open the files where I dump the result of the curl command, I see the information download is always interrupted in different points of the stream (sometimes there is more information downloaded than others). This leads me to believe the problem might not be the SSEClient in Python, but the EventStream API itself. We could try connecting directly to Kafka (if we are allowed to) and check if this problem persists, or we could maybe try downloading the data in chunks (although this might be hard to implement because the stream only accepts a since or Event-ID parameter).

@Samwalton9 @jsn.sherman Let me know if this makes any sense to you. We can discuss my findings in tomorrow's meeting.

According to the EvenStreams documentation, production services that consume streams should do it directly from Kafka. Is there a reason why we are not doing this?

Interesting - I hadn't seen this. There's very little information in that section so it's not clear to me if by "production services" they mean "in MediaWiki", or if Cloud VPS tools are also covered by this. @Ottomata could you shine any light on this?

Does the EventStreams time out after a certain amount of time if a time stream connection hasn't finished?

Not from what I've seen. For example I last rebooted the Wikilink eventstream connection a day or two ago, and it's still live now.

actively restarting the container too often can lead to issues where we create data gaps.

Hm, this shouldn't be true, if you save the Last-Event-ID between restarts. You might get duplicate data that you've already consumed for a short amount of time after a restart, but not gaps.

Does the EventStreams time out after a certain amount of time if a time stream connection hasn't finished?

Unfortunately, yes. See: T242767: EventStreams drops the connection after 15 minutes, which makes it unreliable. I also just added some docs about this here: https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams#Server_side_enforced_timeout

If I open the files where I dump the result of the curl command, I see the information download is always interrupted in different points of the stream (sometimes there is more information downloaded than others). This leads me to believe the problem might not be the SSEClient in Python, but the EventStream API itself.

As I started to read the description of this task, I began to suspect this is the problem too. If the server is disconnecting after 15 minutes, there is no guarantee that it does so after a full SSE message has finished sending. It just closes the connection, and the client might get an incomplete message.

If we tried to use this as an argument in T242767 for why the timeout should be more than 15 minutes, the SREs in the traffic team would respond that even if the timeout was higher, clients need to be able to handle disconnects properly, as they can happen due to other network problems in the same way.

A good SSE client should be able to reconnect properly and start consuming from the Last-Event-ID, so in practice you might see error messages, but a disconnect shouldn't result in lost data.

According to the EvenStreams documentation, production services that consume streams should do it directly from Kafka. Is there a reason why we are not doing this?

Kafka is only available from within WMF production networks, not externally or from Cloud VPS. If you can develop and deploy your service inside WMF production, then yes by all means consuming directly from Kafka makes sense. :)

Thanks for the insights, @Ottomata! Based on this information, I will create a set of tickets to address this and will link to them here.

I have created several tickets to address the refactoring of the EventStream tracking script in T261807. Let me know if these tasks are clear and if I missed something!

Tasks are looking great. Marking this as Done for now, we can come back to it if we determine further investigation is needed.

After poking around with deploying this, my thought is that we may just be able to wrap the sse client calls here:
https://github.com/WikipediaLibrary/externallinks/blob/e735c29e7da3b8241996581b2398a771713570b0/extlinks/links/management/commands/linkevents_collect.py#L48-L61
with a method that has a timeout of something like 30 minutes.
We could also add some additional logging here. I'm thinking of this as a quick fix alternative to T261807

jsn.sherman added a comment.EditedTue, Sep 29, 3:19 PM

And actually, it looks like this client can pass arguments through to the underlying requests library, which supports connect and read timeouts. The default behavior is *no timeout*.
https://requests.readthedocs.io/en/latest/user/advanced/#timeouts

so we should be able to do
for event in EventSource(url, timeout=1800):
for a 30 minute connect and read timeout.

jsn.sherman added a comment.EditedWed, Sep 30, 4:17 PM

Retries are also supported. We'll configure that too.

Samwalton9 closed this task as Resolved.Tue, Oct 13, 12:28 PM