Investigate how to keep the EventStream tracking script connected [8HR]
Closed, ResolvedPublicSpike
Actions

Assigned To

Authored By

	Samwalton9-WMF
	Apr 13 2020, 2:19 PM

Description

The eventstream tracking script is seemingly randomly stopping data collection, sometimes in a way which produces errors, and other times producing no errors and not bringing the container down.

Silent errors

On occasion, the tool simply stops adding new entries to the database until the container is rebooted. No error is shown in the container logs. T214060 and T179986 have some discussion about potential causes and fixes, from the same issue in the Hashtags tool. This may not have happened since the June upgrades documented in T242767.

Server errors

Recently we have also been receiving the following style of error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 697, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 437, in _error_catcher
    yield
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 764, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 701, in _update_chunk_length
    raise httplib.IncompleteRead(line)
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/requests/models.py", line 751, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 572, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 793, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/sseclient.py", line 66, in __next__
    next_chunk = next(self.resp_iterator)
  File "/usr/local/lib/python3.5/site-packages/requests/models.py", line 754, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "manage.py", line 20, in <module>
    main()
  File "manage.py", line 16, in main
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.5/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.5/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.5/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.5/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/app/extlinks/links/management/commands/linkevents_collect.py", line 47, in handle
    for event in EventSource(url):
  File "/usr/local/lib/python3.5/site-packages/sseclient.py", line 74, in __next__
    self._connect()
  File "/usr/local/lib/python3.5/site-packages/sseclient.py", line 53, in _connect
    self.resp.raise_for_status()
  File "/usr/local/lib/python3.5/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://stream.wikimedia.org/v2/stream/page-links-change?since=2020-07-17T23:40:53Z

This has been more common than the silent error in recent months.

Investigation

We need to investigate solutions to maintaining the eventstream connection and/or reconnecting when a disconnect occurs.

In addition to connectivity issues we may also find that the script runs into an error or encounters an unexpected type of downtime. In this case we probably want to let the developers know so that further investigation can take place. As part of this investigation we would also like to understand what the best mechanism is here - for example, we could trigger an email to librarycard-dev@lists.wikimedia.org if the script errors or the latest data in the database is more than 6 hours out of date.

It may be worth noting that the eventstream client we're using, SSEClient, is out of date (0.0.22, most recent is 0.0.26). This discussion may be useful.

Avenues of investigation should include:

Understanding how the data collection script currently functions
Reading the discussions at T214060, T179986, and T250912
Chatting with the Analytics team if further input would be valuable
Understanding if we need direct support/work from the Analytics team
Determining whether there are steps we can take to maintain the connection on an ongoing basis
Evaluate the degree to which T258793 solved this situation
Solutions for detecting a loss of connection and reconnecting
A recommendation for how best to notify developers of a potential issue.

Related Objects

Mentioned In: T264211: Configure SSE Client retries and timeouts to improve reliability
T261807: Refactor Wikilink data collection from EventStream
T261806: Create script that parses and saves EventStream data into Wikilink
T261801: Create script that reads from EventStream and saves the data into a file
T261798: Upgrade Wikilink to Django 3.1
T259483: Updates to Event Stream API
T258793: Update the Wikilink production server to use Docker Swarm
T255144: Add uptime robot alerts for site downtime
T256211: Investigate whether we require Analytics support for Wikilink EventStream issues
Mentioned Here: T261807: Refactor Wikilink data collection from EventStream
T240673: Pages shouldn't time out due to expensive database queries
T242767: EventStreams drops the connection after 15 minutes, which makes it unreliable
T258793: Update the Wikilink production server to use Docker Swarm
T250912: EventStreams socket stays connected without any traffic incoming
T179986: Investigate why current es2r daemon is randomly hanging
T214060: Collection script seemingly randomly stops working

Event Timeline

Samwalton9-WMF triaged this task as High priority.Apr 13 2020, 2:19 PM

Samwalton9-WMF created this task.

MBinder_WMF added a project: The-Wikipedia-Library (Kanban).Jun 2 2020, 7:43 PM

ARamirez_WMF edited projects, added The-Wikipedia-Library; removed The-Wikipedia-Library (Kanban).Jun 2 2020, 9:04 PM

ARamirez_WMF moved this task from Backlog to Bugs on the The-Wikipedia-Library board.

Samwalton9-WMF moved this task from Bugs to Open Tasks WikiLink on the The-Wikipedia-Library board.Jun 3 2020, 1:26 PM

Samwalton9-WMF mentioned this in T256211: Investigate whether we require Analytics support for Wikilink EventStream issues.Jun 24 2020, 9:19 AM

Samwalton9-WMF updated the task description. (Show Details)Jul 8 2020, 10:24 AM

Samwalton9-WMF mentioned this in T255144: Add uptime robot alerts for site downtime.Jul 8 2020, 10:26 AM

Samwalton9-WMF renamed this task from Eventstream tracking script stops working without errors to Eventstream tracking script should remain connected at all times.Jul 14 2020, 2:38 PM

Samwalton9-WMF renamed this task from Eventstream tracking script should remain connected at all times to Investigate how to keep the EventStream tracking script connected.Jul 16 2020, 9:51 AM

Samwalton9-WMF added a project: Spike.

Samwalton9-WMF updated the task description. (Show Details)

Restricted Application changed the subtype of this task from "Task" to "Spike". · View Herald TranscriptJul 16 2020, 9:51 AM

Samwalton9-WMF moved this task from Open Tasks WikiLink to Ready for Dev on the The-Wikipedia-Library board.Jul 16 2020, 9:52 AM

Samwalton9-WMF updated the task description. (Show Details)Jul 16 2020, 9:55 AM

Samwalton9-WMF updated the task description. (Show Details)Jul 16 2020, 10:02 AM

Samwalton9-WMF updated the task description. (Show Details)Jul 16 2020, 2:12 PM

Samwalton9-WMF updated the task description. (Show Details)Aug 5 2020, 10:09 AM

Samwalton9-WMF updated the task description. (Show Details)

Samwalton9-WMF updated the task description. (Show Details)Aug 5 2020, 10:14 AM

Samwalton9-WMF updated the task description. (Show Details)Aug 6 2020, 8:48 AM

Samwalton9-WMF updated the task description. (Show Details)

In our initial exploration of our task priorities for Wikilink yesterday we noted the following on this topic:

Docker Swarm integration (T258793) may provide a quick fix for rebooting the container - Swarm actively monitors containers and will reboot any which go down.
We might want to explore a Django SSEClient. This could update the asynchronous computations we're exploring in T240673 directly.
We need to be aware of the issue that data from the EventStream doesn't always come in in date order - actively restarting the container too often can lead to issues where we create data gaps.

Samwalton9-WMF mentioned this in T258793: Update the Wikilink production server to use Docker Swarm.Aug 6 2020, 9:48 AM

Samwalton9-WMF moved this task from Ready for Dev to Kanban on the The-Wikipedia-Library board.Aug 6 2020, 10:02 AM

Samwalton9-WMF edited projects, added The-Wikipedia-Library (Kanban); removed The-Wikipedia-Library.

ARamirez_WMF renamed this task from Investigate how to keep the EventStream tracking script connected to Investigate how to keep the EventStream tracking script connected [8HR].Aug 11 2020, 2:09 PM

If y'all end up needing follow-up support from Analytics or any other teams in the Technology Department, please let me know.

kaldari mentioned this in T259483: Updates to Event Stream API.Aug 19 2020, 8:13 PM

Scardenasmolinar claimed this task.Aug 31 2020, 5:38 PM

Scardenasmolinar moved this task from Ready to In Progress on the The-Wikipedia-Library (Kanban) board.

I have been investigating this issue and have a couple of questions:

According to the EvenStreams documentation, production services that consume streams should do it directly from Kafka. Is there a reason why we are not doing this?
Does the EventStreams time out after a certain amount of time if a time stream connection hasn't finished?

I am asking the last question because I was experimenting with the EventStream API via curl and it always seems to timeout after approximately 15 minutes. It's not exactly 15 minutes every time, but 15:02, 15:43, etc. Since I was using the curlcommand and not the SSEClient, I thought the stream would not disconnect and return every page link change from a certain date successfully. Instead, it times out after 15 minutes and never finishes properly. If I open the files where I dump the result of the curl command, I see the information download is always interrupted in different points of the stream (sometimes there is more information downloaded than others). This leads me to believe the problem might not be the SSEClient in Python, but the EventStream API itself. We could try connecting directly to Kafka (if we are allowed to) and check if this problem persists, or we could maybe try downloading the data in chunks (although this might be hard to implement because the stream only accepts a since or Event-ID parameter).

@Samwalton9 @jsn.sherman Let me know if this makes any sense to you. We can discuss my findings in tomorrow's meeting.

According to the EvenStreams documentation, production services that consume streams should do it directly from Kafka. Is there a reason why we are not doing this?

Interesting - I hadn't seen this. There's very little information in that section so it's not clear to me if by "production services" they mean "in MediaWiki", or if Cloud VPS tools are also covered by this. @Ottomata could you shine any light on this?

Does the EventStreams time out after a certain amount of time if a time stream connection hasn't finished?

Not from what I've seen. For example I last rebooted the Wikilink eventstream connection a day or two ago, and it's still live now.

actively restarting the container too often can lead to issues where we create data gaps.

Hm, this shouldn't be true, if you save the Last-Event-ID between restarts. You might get duplicate data that you've already consumed for a short amount of time after a restart, but not gaps.

Does the EventStreams time out after a certain amount of time if a time stream connection hasn't finished?

Unfortunately, yes. See: T242767: EventStreams drops the connection after 15 minutes, which makes it unreliable. I also just added some docs about this here: https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams#Server_side_enforced_timeout

If I open the files where I dump the result of the curl command, I see the information download is always interrupted in different points of the stream (sometimes there is more information downloaded than others). This leads me to believe the problem might not be the SSEClient in Python, but the EventStream API itself.

As I started to read the description of this task, I began to suspect this is the problem too. If the server is disconnecting after 15 minutes, there is no guarantee that it does so after a full SSE message has finished sending. It just closes the connection, and the client might get an incomplete message.

If we tried to use this as an argument in T242767 for why the timeout should be more than 15 minutes, the SREs in the traffic team would respond that even if the timeout was higher, clients need to be able to handle disconnects properly, as they can happen due to other network problems in the same way.

A good SSE client should be able to reconnect properly and start consuming from the Last-Event-ID, so in practice you might see error messages, but a disconnect shouldn't result in lost data.

According to the EvenStreams documentation, production services that consume streams should do it directly from Kafka. Is there a reason why we are not doing this?

Kafka is only available from within WMF production networks, not externally or from Cloud VPS. If you can develop and deploy your service inside WMF production, then yes by all means consuming directly from Kafka makes sense. :)

Thanks for the insights, @Ottomata! Based on this information, I will create a set of tickets to address this and will link to them here.

Scardenasmolinar mentioned this in T261798: Upgrade Wikilink to Django 3.1.Sep 1 2020, 9:52 PM

Scardenasmolinar mentioned this in T261801: Create script that reads from EventStream and saves the data into a file.Sep 1 2020, 10:31 PM

Scardenasmolinar mentioned this in T261806: Create script that parses and saves EventStream data into Wikilink.Sep 1 2020, 10:50 PM

Scardenasmolinar mentioned this in T261807: Refactor Wikilink data collection from EventStream.Sep 1 2020, 10:53 PM

I have created several tickets to address the refactoring of the EventStream tracking script in T261807. Let me know if these tasks are clear and if I missed something!

Tasks are looking great. Marking this as Done for now, we can come back to it if we determine further investigation is needed.

After poking around with deploying this, my thought is that we may just be able to wrap the sse client calls here:
https://github.com/WikipediaLibrary/externallinks/blob/e735c29e7da3b8241996581b2398a771713570b0/extlinks/links/management/commands/linkevents_collect.py#L48-L61
with a method that has a timeout of something like 30 minutes.
We could also add some additional logging here. I'm thinking of this as a quick fix alternative to T261807

And actually, it looks like this client can pass arguments through to the underlying requests library, which supports connect and read timeouts. The default behavior is *no timeout*.
https://requests.readthedocs.io/en/latest/user/advanced/#timeouts

so we should be able to do
for event in EventSource(url, timeout=1800):
for a 30 minute connect and read timeout.

Retries are also supported. We'll configure that too.

jsn.sherman mentioned this in T264211: Configure SSE Client retries and timeouts to improve reliability.Sep 30 2020, 4:18 PM

• jlinehan subscribed.Oct 8 2020, 4:20 PM

Samwalton9-WMF closed this task as Resolved.Oct 13 2020, 12:28 PM

Investigate how to keep the EventStream tracking script connected [8HR]Closed, ResolvedPublicSpikeActions

Description

Investigation

Related Objects

Event Timeline

Investigate how to keep the EventStream tracking script connected [8HR]
Closed, ResolvedPublicSpike
Actions