During a recent WDQS incident, the Streaming Updater job died and we had to run it in Yarn instead of Kubernetes.
Per the incident report, we assume the following:
- Because the job was backfilling 1 day of data, it required more resources than usual, though this is not the first time a backfill has happened (e.g. backfills during k8s cluster upgrades went well).
- Because the job was resource-constrained, RocksDB compaction did not happen in a timely manner.
The prior incident was resolved by running the rdf-streaming-updater in Yarn as opposed to Kubernetes. Yarn has more CPU and memory to throw at the problem than the wikikube clusters, and developers are likely to have direct access to it, whereas wikikube needs SRE-level permissions for some things.
Creating this ticket to:
- Determine if the "yarn fix" is still applicable. In other words, is the same failure scenario likely to happen again? If so, will we need to use Yarn again?
- If this is likely enough to happen again and we don't have any better solutions, document the Yarn work so any Search Platform SWE or SRE can apply the fix in the future.