For the last week or two, all of our replag has been confined to frdb2001 at codfw. In the same time frame we got Prometheus monitoring working. Because the other servers have not lagged since then, I have no corresponding graphs for them and can't say whether it's the same cause. But this seems to tell a story:
https://grafana.wikimedia.org/dashboard/db/frdb2001?orgId=1&from=1505747335168&to=1505854264832
During each instance of replag, db write activity appears to cease. I can log in during that time, so it isn't a full network outage, but we are wondering whether the replication thread has failed to recover from an earlier problem. strace shows SELECT queries from the monitoring software arriving as normal.
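
One way to test the replication-thread theory next time it lags would be to grab `SHOW SLAVE STATUS` on frdb2001 and look at the thread flags alongside the lag figure. Below is a rough sketch of how that row could be interpreted, assuming it has already been fetched into a dict keyed by column name. The function name `classify_replica`, the 300-second threshold, and the returned labels are my own placeholders, not anything from our monitoring setup; the column names are the standard ones from MySQL/MariaDB.

```python
def classify_replica(status, lag_threshold=300):
    """Classify one SHOW SLAVE STATUS row (dict of column name -> value).

    lag_threshold is a hypothetical cutoff in seconds, chosen for illustration.
    """
    # Anything other than "Yes" (e.g. "No", "Connecting") is treated as not running.
    io_running = status.get("Slave_IO_Running") == "Yes"
    sql_running = status.get("Slave_SQL_Running") == "Yes"
    # Seconds_Behind_Master is NULL (None here) when it cannot be determined.
    lag = status.get("Seconds_Behind_Master")

    if not io_running and not sql_running:
        return "replication stopped"
    if not io_running:
        return "io thread not running"
    if not sql_running:
        return "sql thread not running"
    if lag is None:
        return "lag unknown"
    if int(lag) > lag_threshold:
        return "threads running but lagging"
    return "ok"


if __name__ == "__main__":
    # Hypothetical values for illustration only, not captured from frdb2001.
    sample = {
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "No",
        "Seconds_Behind_Master": None,
    }
    print(classify_replica(sample))
```

If both threads report `Yes` while writes have stopped and lag keeps climbing, that would point away from a dead replication thread and toward something upstream of it, such as the SQL thread being stuck on a single statement.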