The eventstream tracking script intermittently stops collecting data. Sometimes it fails with an error; other times it fails silently, logging nothing and leaving the container running.
**Silent errors**
On occasion, the tool simply stops adding new entries to the database until the container is rebooted. No error is shown in the container logs. T214060 discusses potential causes and fixes for the same issue in the #hashtags tool.
**Server errors**
Recently we have also been receiving errors of the following form:
```
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 697, in _update_chunk_length
self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 437, in _error_catcher
yield
File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 764, in read_chunked
self._update_chunk_length()
File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 701, in _update_chunk_length
raise httplib.IncompleteRead(line)
http.client.IncompleteRead: IncompleteRead(0 bytes read)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/requests/models.py", line 751, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 572, in stream
for line in self.read_chunked(amt, decode_content=decode_content):
File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 793, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.5/contextlib.py", line 77, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 455, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/sseclient.py", line 66, in __next__
next_chunk = next(self.resp_iterator)
File "/usr/local/lib/python3.5/site-packages/requests/models.py", line 754, in generate
raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "manage.py", line 20, in <module>
main()
File "manage.py", line 16, in main
execute_from_command_line(sys.argv)
File "/usr/local/lib/python3.5/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
utility.execute()
File "/usr/local/lib/python3.5/site-packages/django/core/management/__init__.py", line 375, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/usr/local/lib/python3.5/site-packages/django/core/management/base.py", line 323, in run_from_argv
self.execute(*args, **cmd_options)
File "/usr/local/lib/python3.5/site-packages/django/core/management/base.py", line 364, in execute
output = self.handle(*args, **options)
File "/app/extlinks/links/management/commands/linkevents_collect.py", line 47, in handle
for event in EventSource(url):
File "/usr/local/lib/python3.5/site-packages/sseclient.py", line 74, in __next__
self._connect()
File "/usr/local/lib/python3.5/site-packages/sseclient.py", line 53, in _connect
self.resp.raise_for_status()
File "/usr/local/lib/python3.5/site-packages/requests/models.py", line 941, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://stream.wikimedia.org/v2/stream/page-links-change?since=2020-07-17T23:40:53Z
```
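The traceback shows the connection dropping mid-stream (`ChunkedEncodingError`), sseclient attempting to reconnect, and the reconnect request receiving a 500 response, which `raise_for_status()` turns into an unhandled `HTTPError` that ends the management command. This has been more common than the silent error in recent months.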
# Investigation
We need to investigate ways to maintain the eventstream connection and/or reconnect when a disconnect occurs.
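One possible shape for the reconnect logic is sketched below. This is a minimal sketch under stated assumptions, not the current implementation: `collect_with_reconnect`, `connect`, and `handle_event` are hypothetical names, the stream is assumed to yield `(timestamp, event)` pairs, and the connection is injected as a callable so the retry behaviour can be exercised without a network.

```python
import time


def collect_with_reconnect(connect, handle_event, retry_on=(Exception,),
                           max_retries=5, base_delay=1.0, max_delay=60.0,
                           sleep=time.sleep):
    """Consume events from connect(since), reconnecting when the stream fails.

    connect(since) is assumed to return an iterable of (timestamp, event)
    pairs; since is the timestamp of the last handled event, or None on the
    first connection. Returns the last handled timestamp if the stream ends.
    """
    since = None
    failures = 0
    while True:
        try:
            for timestamp, event in connect(since):
                handle_event(event)
                since = timestamp  # remember where to resume from
                failures = 0       # the stream is healthy again
            return since           # only reached for finite iterables
        except retry_on:
            failures += 1
            if failures > max_retries:
                raise  # give up so the failure is visible in the logs
            # Exponential backoff: 1s, 2s, 4s, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (failures - 1)))
```

In the collection script, `connect` might wrap `EventSource` and pass the last handled timestamp as the `since` parameter (as in the URL in the traceback above), and `retry_on` might be narrowed to `requests.exceptions.HTTPError` and `requests.exceptions.ChunkedEncodingError`. This only covers failures that surface as exceptions; the silent stall would still need a read timeout or a separate staleness check.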
In addition to connectivity issues, the script may hit an unrelated error or an unexpected kind of downtime. In those cases we probably want to notify the developers so that they can investigate further. As part of this investigation we would also like to identify the best notification mechanism. For example, we could send an email to librarycard-dev@lists.wikimedia.org if the script errors or if the latest data in the database is more than 6 hours old.
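The staleness check could look roughly like the following. It is a sketch only: `is_stale` and `build_alert` are hypothetical helpers, the sender address is a placeholder, and the query for the latest timestamp and the actual delivery (for example via `smtplib` or Django's mail utilities) are left out.

```python
from datetime import datetime, timedelta, timezone
from email.message import EmailMessage

MAX_AGE = timedelta(hours=6)  # threshold suggested in this task
ALERT_TO = "librarycard-dev@lists.wikimedia.org"


def is_stale(latest, now=None, max_age=MAX_AGE):
    """Return True if the newest stored event is older than max_age.

    latest is a timezone-aware datetime, or None if the table is empty.
    """
    if latest is None:
        return True
    if now is None:
        now = datetime.now(timezone.utc)
    return now - latest > max_age


def build_alert(latest, sender="extlinks@example.org"):
    """Build (but do not send) the notification email.

    The sender address is a placeholder assumption.
    """
    msg = EmailMessage()
    msg["Subject"] = "extlinks: no new link events since {}".format(latest)
    msg["From"] = sender
    msg["To"] = ALERT_TO
    msg.set_content(
        "The most recent link event in the database is dated {}.\n"
        "The eventstream collection script may have stopped.".format(latest)
    )
    return msg
```

A check like this would need to run outside the collection process itself (for example from a scheduled job), since a stalled collector cannot be relied on to report its own stall.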
It may be worth noting that the eventstream client we're using, [[ https://pypi.org/project/sseclient/ | SSEClient ]], is out of date (0.0.22, most recent is 0.0.26). [[ https://github.com/btubbs/sseclient/issues/34 | This discussion ]] may be useful.
Avenues of investigation should include:
- Understanding how the [[ https://github.com/WikipediaLibrary/externallinks/blob/master/extlinks/links/management/commands/linkevents_collect.py | data collection script ]] currently functions
- Reading the discussion at T214060
- Chatting with the Analytics team if further input would be valuable
- Understanding if we need direct support/work from the Analytics team
- Determining whether there are steps we can take to maintain the connection on an ongoing basis
- Solutions for detecting a loss of connection and reconnecting
- A recommendation for how best to notify developers of a potential issue