Quarry WMCloud (ruwiki_p, section s6) experiencing sustained replication lag (~16 h)
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

Go to https://quarry.wmcloud.org/ and run any query on ruwiki_p (e.g. SELECT COUNT(*) FROM page;).

Observe the warning about synchronization delay and stale data (≈ 11 h behind).

Confirm the actual lag at https://replag.toolforge.org/: section s6 shows ~16 h of lag.

What happens?:
Queries run on Quarry (quarry.wmcloud.org) against the Russian Wikipedia replica (ruwiki_p) are returning stale data and showing this warning:

“The database on which this query was executed has a synchronization delay with the wiki. This can be caused by maintenance or a database incident, and should be resolved soon. Modifications that were made in the last 11 hours on the wiki are not taken into account in the results below.”

Meanwhile, https://replag.toolforge.org/ reports a consistent ~16-hour lag for section s6 on both the analytics and web replica hosts for ruwiki_p (lag = 58 594 s ≈ 16 h 16 m 34 s).
This far exceeds normal expectations (web replicas: < 5 min; analytics: < 1 h).

Queries return data that is ~16 hours out-of-date.
Warning message states recent modifications (last ~11 h) are not included.
heartbeat_p view shows large lag_seconds on s6.
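
The lag figure quoted above can be sanity-checked with a little arithmetic. The following is only a minimal sketch: `format_lag` is a hypothetical helper name used for illustration and is not part of Quarry or any Wikimedia tool.

```python
def format_lag(seconds: int) -> str:
    """Render a lag given in whole seconds as 'H h M m S s'."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours} h {minutes} m {secs} s"


# Lag value reported for section s6 in this task.
print(format_lag(58594))  # 16 h 16 m 34 s
```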

Impact:
* Analytical queries on ruwiki are missing up to a day’s worth of edits.
* Public queries (e.g. dashboards, reports) based on Quarry are stale, leading to misinformation.
* Bot and service integrations relying on fresh data may fail or produce inconsistent results.

What should have happened instead?:
Replication lag should be under 5 minutes for the web host and under 1 hour for analytics.
No warning should be shown in Quarry for normal queries.

Other information (browser name/version, screenshots, etc.):
Preliminary diagnostics:
* No known scheduled maintenance on s6 at this time.
* Other sections (s3, s5) currently show no lag.

Please:
* Verify the replica I/O and SQL threads on section s6 (SHOW SLAVE STATUS\G).
* Check for any long-running or blocking transactions on s6 (SHOW PROCESSLIST).
* If replication is stuck, restart the replication worker or clear the problematic queries.
* Provide an ETA for full catch-up, or escalate if there are hardware/network issues.
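
The first check requested above could be scripted roughly as follows. This is a sketch under stated assumptions: it expects one row of SHOW SLAVE STATUS output already parsed into a dictionary keyed by the standard MariaDB/MySQL field names (`Slave_IO_Running`, `Slave_SQL_Running`, `Seconds_Behind_Master`). The function name, the returned labels, and the default threshold are hypothetical and chosen only for illustration.

```python
from typing import Optional


def diagnose_replica(status: dict, max_lag_seconds: int = 3600) -> str:
    """Classify one parsed row of SHOW SLAVE STATUS output.

    The returned labels and the default threshold are illustrative
    assumptions, not values used by any Wikimedia tooling.
    """
    if status.get("Slave_IO_Running") != "Yes":
        return "io-thread-stopped"
    if status.get("Slave_SQL_Running") != "Yes":
        return "sql-thread-stopped"

    lag: Optional[int] = status.get("Seconds_Behind_Master")
    if lag is None:
        # The server reports NULL when it cannot compute the lag.
        return "lag-unknown"
    if int(lag) > max_lag_seconds:
        return "lagging"
    return "ok"


# Example row using the lag figure quoted in this task.
row = {
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "Yes",
    "Seconds_Behind_Master": 58594,
}
print(diagnose_replica(row))
```

Both threads running while the lag is large would point to a replica that is working but behind, rather than one where replication has stopped.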

Event Timeline

This is due to a hardware issue with one of the hosts involved in the replication chain to the wiki replicas: T394624: db1155 HW memory errors

normal expectations (web replicas: < 5 min; analytics: < 1 h)

[citation needed]

The server has been fixed and it is now slowly catching up.
@Voyagerim I would like to understand where these expectations come from:

This far exceeds normal expectations (web replicas: < 5 min; analytics: < 1 h)

I was one of the people who set up the wiki replicas years ago, and I am pretty sure we didn't commit to any of this, because it is pretty much impossible to guarantee those numbers. Moreover, there are schema changes that can take days to replicate. So it is important to understand where those expectations come from, because there is some misalignment somewhere and it should be addressed.

Thanks

@Marostegui , expectations regarding replication latency thresholds - namely that web replicas should maintain a latency of less than 5 minutes and analytic replicas less than 1 hour - are not formal service level agreements (SLAs), but rather empirical observations.

The MediaWiki load balancer is configured to stop sending read requests to a replica if its latency exceeds 5 seconds. The maxlag parameter, introduced in MediaWiki 1.10, allows clients (e.g. bots) to check the replication latency before executing write requests. If the delay exceeds the specified threshold (typically 5 seconds), the server returns an error, prompting the client to repeat the request later.
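
The client-side retry behaviour described above for the MediaWiki Action API might look roughly like this. It is a sketch only: `send` is a hypothetical callable that performs the HTTP request and returns the decoded JSON response, and the retry count and wait time are arbitrary. The only API facts relied on are the `maxlag` request parameter and the `maxlag` error code returned when the lag exceeds it.

```python
import time
from typing import Callable


def call_with_maxlag(
    send: Callable[[dict], dict],
    params: dict,
    maxlag: int = 5,
    retries: int = 3,
    wait: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Send an API request with maxlag set, retrying while the server is lagged.

    `send` is a hypothetical transport function: it takes the request
    parameters and returns the decoded JSON response as a dict.
    """
    request = {**params, "maxlag": maxlag, "format": "json"}
    for attempt in range(retries + 1):
        response = send(request)
        error = response.get("error", {})
        if error.get("code") != "maxlag":
            # Either success or an unrelated error; let the caller handle it.
            return response
        if attempt < retries:
            sleep(wait)
    raise RuntimeError("replication lag still above maxlag after retries")
```

A real client would typically honour the `Retry-After` response header instead of a fixed wait.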

Thanks for the feedback, I think the problem is solved.

@Marostegui , expectations regarding replication latency thresholds - namely that web replicas should maintain a latency of less than 5 minutes and analytic replicas less than 1 hour - are not formal service level agreements (SLAs), but rather empirical observations.

I think it is important to differentiate between expectations and observations, especially with a service that is not considered mission critical. We are happy that they work well most of the time, and we hope to keep that level of service, but again, those cannot be called expectations.

The MediaWiki load balancer is configured to stop sending read requests to a replica if its latency exceeds 5 seconds. The maxlag parameter, introduced in MediaWiki 1.10, allows clients (e.g. bots) to check the replication latency before executing write requests. If the delay exceeds the specified threshold (typically 5 seconds), the server returns an error, prompting the client to repeat the request later.

These hosts are not behind the MediaWiki load balancer; in fact, they do not have any load balancer in front of them (they use a DNS system to map IPs and sections, but that is about it). These hosts aren't part of the normal production replicas: they hang behind a production replica, a host that filters the data, which we call sanitarium and which is, in fact, the one that had issues.

Marostegui claimed this task.
Marostegui added a project: Data-Persistence.

Lag is back to 0