ORES uses 4 hosts, 2 per DC, in an active/passive manner for storage. The active/passive setup exists so that we have failover capabilities in case the primary redis database suffers catastrophic failure. That being said, the data in either database can be lost without irrevocable consequences. That is not to say that we should go around losing that data, but if it is lost, we can restore functionality within minutes, with some performance issues at the beginning.
The 2 components requiring storage are:
- ORES queue. This one listens on port 6379 and it's small (1GB). This is essentially a celery queue that uses redis as the queuing backend. It is also used as the celery result backend. The queue can be emptied on purpose (and has been in the past) by just restarting redis. That is, the queue is NOT persisted to disk, but is memory only. This is on purpose, as storing the queue to disk has proven to cause more issues than it was worth.
- ORES cache. This one listens on port 6380 and it's larger (6GB). This is just a cache that stores the last N scorings so that we avoid the expensive process of rescoring. For the most part, it is artificially populated using changeprop, in response to edits to MediaWiki. This one is still persisted to disk. There are some open questions about whether that is warranted, however that is outside the scope of this task.
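As a rough illustration of the two components above, the following sketch builds the redis URLs that a celery worker and the score cache might use. The host name is hypothetical (the task does not name the hosts), the helper `redis_url` is made up for this example, and the lowercase `broker_url` / `result_backend` keys follow celery's setting names; the actual ORES configuration may well look different.

```python
# Hypothetical host name; the task does not name the real hosts.
REDIS_HOST = "ores-redis.example.org"

# Ports, sizes and persistence are taken from the description above.
COMPONENTS = {
    "queue": {"port": 6379, "size_gb": 1, "persisted": False},
    "cache": {"port": 6380, "size_gb": 6, "persisted": True},
}


def redis_url(component, host=REDIS_HOST, db=0):
    """Build a redis:// URL for one of the two ORES storage components."""
    port = COMPONENTS[component]["port"]
    return f"redis://{host}:{port}/{db}"


# The queue instance serves as both the celery broker and the result backend.
celery_settings = {
    "broker_url": redis_url("queue"),
    "result_backend": redis_url("queue"),
}

# The cache lives on a separate redis instance on its own port.
score_cache_url = redis_url("cache")
```

Keeping the two instances on separate ports is what allows the queue to stay memory only while the cache is persisted to disk.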
The 2 DCs are independently populated, that is, there is no mechanism to sync either of the redis databases between the 2 DCs. That is a conscious design choice, as it is currently cheaper to score edits twice, once in each DC, than to fight redis replication across WAN.
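The trade-off can be pictured with a minimal sketch in which each DC keeps its own cache and scores an edit on its first request there. Everything in it is an assumption made for illustration: the class `DatacenterScorer`, the DC names, the placeholder `expensive_score` function, and the plain dict standing in for the DC-local redis cache.

```python
class DatacenterScorer:
    """One DC with its own, unsynchronised score cache."""

    def __init__(self, name, score_fn):
        self.name = name
        self.cache = {}  # stand-in for the DC-local redis cache
        self.score_fn = score_fn
        self.scored = 0  # how many times this DC actually ran the scorer

    def get_score(self, rev_id):
        # Only score on a cache miss; hits are served from the local cache.
        if rev_id not in self.cache:
            self.cache[rev_id] = self.score_fn(rev_id)
            self.scored += 1
        return self.cache[rev_id]


def expensive_score(rev_id):
    # Placeholder for the real (expensive) scoring of a revision.
    return {"rev_id": rev_id, "damaging": rev_id % 2 == 0}


dc_a = DatacenterScorer("dc-a", expensive_score)
dc_b = DatacenterScorer("dc-b", expensive_score)

# The same edit is requested twice in each DC.
for dc in (dc_a, dc_b):
    dc.get_score(123)
    dc.get_score(123)
```

Each DC scores the revision once, so the edit is scored twice overall, but no cache state ever has to cross the WAN.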
The 4 hosts run jessie, are a mixture of hardware and VM hosts, and are a bit of a snowflake in our day-to-day operations. Unlike the rest of our redis hosts, they don't have good monitoring, alerting, dashboards etc.
Move the functionality of those hosts into our misc redis cluster, as there is clearly available space there. That would allow us to decommission the hosts, get rid of some tech debt (e.g. puppet code) and obtain better monitoring and alerting.