Coordinate with ServiceOps Team about a rework of the Search Update Pipeline
Closed, Resolved (Public)

Description

See parent task and https://docs.google.com/document/d/17tY05WoaT_BloTzaIncR939k3hvhcVQ-E-8DBjo284E/edit# for context.

The rework of the Search Update Pipeline is planned to move away from ChangeProp / JobRunner towards a Flink pipeline that will rely on API calls to MediaWiki. The ServiceOps team will be impacted by this new runtime architecture, so collaboration will be required. The plan isn't stable yet, but we expect the Search Platform team to need the following from the ServiceOps team:

  • collaboration on defining the exact solution, taking into account constraints from both teams
  • deployment and maintenance of a high availability Flink cluster
  • workload moving from the JobRunner cluster to the API cluster

AC:

  • we had a kickoff meeting

Event Timeline

I have a general problem with this plan: as things stand, the API and appserver clusters are reserved (as much as possible) for live requests from the website or from other services that need them to perform their duties.

Asynchronous processing should instead hit a separate cluster: one we can isolate or turn off temporarily in the event of extreme need and, most importantly, one we don't mind overloading by mistake, as that won't have a user-visible impact.

Also, on jobrunners we can keep laxer timeouts for requests, making it easier for jobs to complete even under pressure.

Thankfully there is a solution to this problem: we should finally make the jobrunners "regular" appservers with just tweaked parameters. @hnowlan did quite some work in that direction, which we can revive if we decide to go forward with this plan.
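
To make the isolation argument concrete, here is a minimal sketch of the routing rule described above. Everything in it is hypothetical and for illustration only: the pool names, the timeout values, and the `choose_pool` helper are assumptions, not the actual production configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    name: str
    request_timeout_s: int  # hypothetical value, not a real production setting
    user_facing: bool       # True if overload would be visible to users
    enabled: bool = True    # async pools can be switched off in an emergency


# Hypothetical pools: a strict-timeout pool for live traffic and a
# lax-timeout pool for asynchronous work (assumed numbers).
API = Pool("api", request_timeout_s=60, user_facing=True)
JOBRUNNER = Pool("jobrunner", request_timeout_s=1200, user_facing=False)


def choose_pool(is_async: bool, pools=(API, JOBRUNNER)) -> Pool:
    """Pick a pool for a request.

    Async work only ever goes to a non-user-facing pool and live traffic
    only to a user-facing one. There is deliberately no fallback from one
    kind to the other: if the async pool is disabled, async work fails
    instead of spilling onto the live cluster.
    """
    wanted_user_facing = not is_async
    for pool in pools:
        if pool.enabled and pool.user_facing == wanted_user_facing:
            return pool
    raise RuntimeError("no enabled pool for this kind of request")
```

With these assumed pools, `choose_pool(is_async=True)` returns the jobrunner-style pool with its longer timeout, and disabling that pool makes async requests fail rather than land on the API pool.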

There are a few other minor things to figure out (for instance, where we will keep a log like the one the jobqueue creates now); these can be analyzed further once the plan is better defined.

Kick off meeting done. Further collaboration is expected to happen on specific subtasks of T317045 or as discussion on https://docs.google.com/document/d/17tY05WoaT_BloTzaIncR939k3hvhcVQ-E-8DBjo284E/edit