(Not sure yet if these are the regular dumps or the wikidata-specific ones)
Lots of errors are coming from snapshot1008, related to replication lag and read-only mode.
This was probably caused by the eqiad s8 master (passive datacenter) being reimaged to pick up the latest kernel, security patches, and the MariaDB 10.1 upgrade. (Jaime's comment: This happens once every 3 years.)
We need to research why this happened and find a way to prevent it:
- Can server maintenance be done in a better way, without creating log spam?
- Can replication checks be made less spammy, so that we get properly alerted but don't get the same message millions of times? E.g., give an error once, but not on cached subsequent occurrences, or use some other model.
- Why do the snapshots require the replicas not to be lagged? Is lag important for them? If not, can they just skip those checks? If yes, shouldn't they be running in the active datacenter instead of eqiad?
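
To make the "error once, but not on subsequent occurrences" idea from the second question concrete, here is a minimal sketch in Python. It is only an illustration of the model, not how the replication checks or MediaWiki logging actually work; the class name `ThrottledReporter`, the `window` parameter, and the message keys are all hypothetical.

```python
import time


class ThrottledReporter:
    """Emit the first occurrence of a message, then suppress repeats
    for `window` seconds while counting how many were dropped.

    Hypothetical helper for illustration only.
    """

    def __init__(self, window=60.0, clock=time.monotonic):
        self.window = window
        self.clock = clock
        # key -> [time of last emitted report, repeats suppressed since then]
        self._seen = {}

    def report(self, key, message):
        """Return the line to log, or None if this repeat is suppressed."""
        now = self.clock()
        entry = self._seen.get(key)
        if entry is not None and now - entry[0] < self.window:
            # Same problem reported recently: count it, but stay quiet.
            entry[1] += 1
            return None
        suppressed = entry[1] if entry else 0
        self._seen[key] = [now, 0]
        if suppressed:
            return f"{message} (suppressed {suppressed} repeats)"
        return message


if __name__ == "__main__":
    reporter = ThrottledReporter(window=60.0)
    for _ in range(1000):
        line = reporter.report("s8-lag", "replication lag on s8")
        if line is not None:
            print(line)
```

With this model the first error still surfaces immediately, and the next report after the window closes carries the count of suppressed repeats, so the volume of the problem stays visible without flooding the logging pipeline.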
While this didn't cause issues for end users, it did cause some issues in production because of the impact on the logging infrastructure: https://grafana.wikimedia.org/dashboard/db/production-logging?orgId=1&from=1537184523025&to=1537197061054 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&from=1537184523025&to=1537197061054