The eventstream tracking script intermittently stops collecting data, with no obvious pattern. Sometimes it raises errors; at other times it fails silently, producing no errors and not bringing the container down.
Silent errors
On occasion, the tool simply stops adding new entries to the database until the container is restarted. No error is shown in the container logs. T214060 and T179986 discuss potential causes and fixes for the same issue in the Hashtags tool. This may not have happened since the June upgrades documented in T242767.
Server errors
Recently we have also been receiving errors like the following:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 697, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 437, in _error_catcher
    yield
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 764, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 701, in _update_chunk_length
    raise httplib.IncompleteRead(line)
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/requests/models.py", line 751, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 572, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 793, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.5/site-packages/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/sseclient.py", line 66, in __next__
    next_chunk = next(self.resp_iterator)
  File "/usr/local/lib/python3.5/site-packages/requests/models.py", line 754, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "manage.py", line 20, in <module>
    main()
  File "manage.py", line 16, in main
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.5/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.5/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.5/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.5/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
  File "/app/extlinks/links/management/commands/linkevents_collect.py", line 47, in handle
    for event in EventSource(url):
  File "/usr/local/lib/python3.5/site-packages/sseclient.py", line 74, in __next__
    self._connect()
  File "/usr/local/lib/python3.5/site-packages/sseclient.py", line 53, in _connect
    self.resp.raise_for_status()
  File "/usr/local/lib/python3.5/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://stream.wikimedia.org/v2/stream/page-links-change?since=2020-07-17T23:40:53Z
This has been more common than the silent error in recent months.
Investigation
We need to investigate ways to maintain the eventstream connection and/or reconnect when a disconnect occurs.
In addition to connectivity issues we may also find that the script runs into an error or encounters an unexpected type of downtime. In this case we probably want to let the developers know so that further investigation can take place. As part of this investigation we would also like to understand what the best mechanism is here - for example, we could trigger an email to librarycard-dev@lists.wikimedia.org if the script errors or the latest data in the database is more than 6 hours out of date.
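As a starting point for discussion, the staleness check could look something like the sketch below. The 6-hour threshold and the list address come from this task; `is_stale`, `build_alert`, and the sender address are hypothetical names chosen for illustration, and fetching the latest timestamp from the database is deliberately left out.

```python
from datetime import datetime, timedelta, timezone
from email.message import EmailMessage

MAX_AGE = timedelta(hours=6)  # threshold suggested in this task
ALERT_TO = "librarycard-dev@lists.wikimedia.org"


def is_stale(latest, now, max_age=MAX_AGE):
    """Return True if the newest stored entry is older than max_age.

    latest is the timestamp of the newest database entry, or None if
    the table is empty (treated as stale).
    """
    if latest is None:
        return True
    return now - latest > max_age


def build_alert(latest, now, sender="extlinks@example.org"):
    """Build (but do not send) a notification email.

    The sender address is a placeholder assumption.
    """
    age = "unknown" if latest is None else str(now - latest)
    msg = EmailMessage()
    msg["Subject"] = "linkevents_collect: no new data (age: {})".format(age)
    msg["From"] = sender
    msg["To"] = ALERT_TO
    msg.set_content(
        "Latest entry: {}\nChecked at: {}\nAge: {}\n".format(latest, now, age)
    )
    return msg
```

A periodic job (cron or a management command, for instance) could call `is_stale` and, when it returns True, hand the message to whatever mail transport is available, such as `smtplib.SMTP(...).send_message(msg)` or Django's `send_mail`. Which transport works from inside the container is one of the open questions for this investigation.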
Note that the eventstream client we're using, SSEClient, is out of date (we run 0.0.22; the most recent release is 0.0.26). This discussion may be useful.
Avenues of investigation should include:
- Understanding how the data collection script currently functions
- Reading the discussions at T214060, T179986, and T250912
- Chatting with the Analytics team if further input would be valuable
- Understanding if we need direct support/work from the Analytics team
- Determining whether there are steps we can take to maintain the connection on an ongoing basis
- Evaluating the degree to which T258793 resolved this issue
- Solutions for detecting a loss of connection and reconnecting
- A recommendation for how best to notify developers of a potential issue.
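
For the detection and reconnection item, one option to evaluate is wrapping the stream in a retry loop with exponential backoff that resumes from the last timestamp seen. The traceback above suggests the client's own reconnect attempt is what raised the final HTTPError, so an outer loop like this would catch that case. This is a minimal sketch, not the current script's code: `resilient_events`, the `connect(since)` factory, and the `timestamp` key on events are all assumptions made for illustration.

```python
import time


def resilient_events(
    connect,
    timestamp_of=lambda event: event.get("timestamp"),
    retry_on=(OSError,),
    max_retries=5,
    base_delay=1.0,
    max_delay=60.0,
    sleep=time.sleep,
):
    """Yield events from connect(since), reconnecting when the stream breaks.

    connect is a hypothetical factory: given the timestamp of the last
    event seen (or None on first connect), it returns a fresh iterable
    of events. requests' exceptions derive from IOError/OSError, so the
    default retry_on should cover the errors in the traceback above.
    """
    last_seen = None
    failures = 0
    while True:
        try:
            for event in connect(last_seen):
                failures = 0  # a received event resets the backoff
                last_seen = timestamp_of(event) or last_seen
                yield event
        except retry_on:
            pass  # fall through to the backoff below
        # Reaching here means the stream raised or ended; both count
        # as a disconnect.
        failures += 1
        if failures > max_retries:
            raise RuntimeError(
                "giving up after %d consecutive failed connections" % max_retries
            )
        # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay.
        sleep(min(max_delay, base_delay * 2 ** (failures - 1)))
```

In the collection command, `connect` would build the stream URL with the `since` parameter and return the `EventSource` for it, and `timestamp_of` would extract the timestamp from each event's payload. Letting the final `RuntimeError` propagate keeps the failure visible, which is where the developer notification would hook in. Note that this only helps with disconnects that raise or end the stream; a connection that hangs without delivering data would still need a read timeout or the staleness check.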