Create script that reads from EventStream and saves the data into a file
Closed, Declined (Public)

Description

Background
After conducting an investigation in task T250084 to check why the EventStream kept disconnecting, the team concluded that the data collection script needs to be refactored. The current script uses the Python SSEClient to read from the stream and save all changed page links relevant to the Wikipedia Library project into the Wikilink database for further analysis and consumption.

Actions
The first stage of this refactoring is to create a script that uses the same Python SSEClient to read from the stream and write the raw events to a file, instead of parsing the data and saving it into the database (parsing will now be done at a later stage). The reason for this change is that the EventStream disconnects approximately every 15 minutes (see T242767) and the current script has no way to exit gracefully when a disconnection happens. The new script should handle the disconnection, close the current file, and send a disconnect error. Each new instance of the script should resume from the most recent Last-Event-ID in the previous file.
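
A minimal sketch of the resume step, to make the idea concrete. It assumes the dump file holds one JSON object per line with "id" and "data" keys; that layout and the function name last_event_id are hypothetical, since this ticket does not fix a file format. A truncated final line, which a mid-write disconnect could leave behind, is skipped.

```python
import json
from pathlib import Path
from typing import Optional


def last_event_id(path: Path) -> Optional[str]:
    """Return the "id" of the last well-formed record in a JSON-lines file.

    Assumed (hypothetical) format: one {"id": ..., "data": ...} object per line.
    Returns None when the file holds no usable record.
    """
    last = None
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # A disconnect may leave a half-written line; ignore it.
                continue
            if isinstance(record, dict) and record.get("id") is not None:
                last = record["id"]
    return last
```

The returned value would then be sent as the Last-Event-ID request header when the next instance connects, which is how the SSE protocol resumes a stream.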

Acceptance Criteria

  • The stream results should be stored in a file with a unique name.
  • The stream should start from the last Last-Event-ID stored in the database.
  • Whenever the EventStream disconnects, the script should close the file it was writing to and trigger the parsing script (to be implemented in another ticket) with the most recent file as the parameter.
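
One way the criteria above could fit together, as a rough sketch rather than the agreed design. The names unique_path, dump_stream and on_close are hypothetical, as is the JSON-lines output format. The function takes any iterable of objects with .id and .data attributes (the shape of events an SSE client typically yields), so it can be tried without a network connection. Catching OSError is an assumption: it covers the built-in ConnectionError and, as far as I know, the exceptions raised by requests, but it would also catch file-write errors.

```python
import json
import sys
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Iterable, Optional, Tuple


def unique_path(out_dir: Path) -> Path:
    """Build a file name from a UTC timestamp plus a random suffix."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return out_dir / f"eventstream-{stamp}-{uuid.uuid4().hex[:8]}.jsonl"


def dump_stream(
    events: Iterable,
    out_dir: Path,
    on_close: Callable[[Path], None],
) -> Tuple[Path, Optional[str]]:
    """Write events to a uniquely named file until the stream ends or drops.

    Returns the file path and the id of the last event written.
    """
    path = unique_path(out_dir)
    last_id = None
    with path.open("w", encoding="utf-8") as fh:
        try:
            for event in events:
                if not event.data:
                    continue  # skip keep-alive or empty events
                fh.write(json.dumps({"id": event.id, "data": event.data}) + "\n")
                last_id = event.id
        except OSError as exc:
            # Assumption: a dropped connection surfaces as an OSError subclass.
            print(f"EventStream disconnected: {exc}", file=sys.stderr)
    # The file is closed at this point; hand it to the (hypothetical) parser hook.
    on_close(path)
    return path, last_id
```

Here on_close stands in for whatever triggers the parsing script, and the returned last_id is what would be persisted for the next run.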

Event Timeline

Closing this - we fixed the underlying issue in T264211.