
Watchlist fails to update
Closed, Resolved (Public)

Description

Author: qleah

Description:
Special:Watchlist appears to have stopped working at about 11:00 CST.


Version: unspecified
Severity: normal

Details

Reference
bz951
Title                                             | Reference                                | Author | Source Branch         | Dest Branch
Update metrics-server and overhaul its deployment | repos/cloud/toolforge/wmcs-k8s-metrics!2 | taavi  | update-metrics-server | main
Upgrade kube-state-metrics to 2.2.4               | repos/cloud/toolforge/wmcs-k8s-metrics!1 | taavi  | kube-state-metrics    | main

Related Objects

Status   | Subtype | Assigned | Task
Declined |         | None     |
Declined |         | None     |
Resolved |         | None     |

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 7:03 PM)
bzimport set Reference to bz951.
bzimport added a subscriber: Unknown Object (MLST).

jeluf wrote:

DB server lags, removed from replication

qleah wrote:

Watchlists have stopped again at about noon.

river wrote:

Will catch up shortly.

bugzilla_wikipedia_org.to.jamesd wrote:

For background, there are three main causes of this:

  1. Too much load on the slave, so it can't keep up with replication while it
handles queries. In this situation, we adjust load by adjusting the amount of
search we turn off. If the slaves get significantly behind, we turn off search
for some of the big wikis using that slave so it can catch up more quickly
(see the sketch after this list).

  2. An operating-system-version-related issue on the slave Bacon, which
causes it to stop replicating. We can't risk losing Bacon at present, so we
can't try different operating system versions yet. Because we get fast reports
of this problem from en, we have this machine set to serve the en and zh
Wikipedias. The rest are normally unaffected by this issue with the current
setup, though in the past any could be affected. The split is mainly for
performance reasons; we just had to choose which wikis got the machine with
the problem.

  3. Any other operation which causes replication to stop. There is a wide
range of possibilities. This is less common than 1 or 2.
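
To make the mitigation for cause 1 concrete, here is a minimal sketch, in
modern Python, of a lag-based load shedder: poll the replica's lag and turn
search off for the big wikis it serves while it is behind. The host name,
threshold, wiki list, and set_search helper are all invented for illustration;
only SHOW SLAVE STATUS is the real MySQL statement of that era, and nothing
here is Wikimedia's actual tooling.

    """Hypothetical lag-based load shedder; not Wikimedia's actual script."""
    import time

    import pymysql

    REPLICA_HOST = "bacon.example.org"  # invented host name for the slave
    LAG_THRESHOLD = 30                  # seconds behind master before shedding
    BIG_WIKIS = ["enwiki", "zhwiki"]    # the wikis this replica serves

    def replica_lag(conn):
        """Return seconds of lag, or None if replication has stopped."""
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            cur.execute("SHOW SLAVE STATUS")  # MySQL 4-era statement
            row = cur.fetchone()
        return row["Seconds_Behind_Master"] if row else None

    def set_search(wiki, enabled):
        """Placeholder: in reality this would flip the wiki's search switch."""
        print(f"search {'on' if enabled else 'off'} for {wiki}")

    def main():
        conn = pymysql.connect(host=REPLICA_HOST, user="monitor", password="...")
        while True:
            lag = replica_lag(conn)
            behind = lag is None or lag > LAG_THRESHOLD
            for wiki in BIG_WIKIS:
                set_search(wiki, enabled=not behind)
            time.sleep(10)

    if __name__ == "__main__":
        main()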

For 1 and 2, on 14 October 2004 we ordered two more database slaves to add to
the two we have. They are being set up now, after delays both at the vendor
(a compatibility issue) and on our side (our install person was unavailable).
The new ones have a different operating system version from Bacon and will
confirm whether that resolves the problem Bacon is having, as well as giving
us enough spare capacity to risk losing Bacon for a while if there is a
problem while switching it to that version.

bugzilla_wikipedia_org.to.jamesd wrote:

The Bacon problem is still around, but it has been worked around with a
modification to servmon which automatically corrects it. It's seen less often
on the new system with the later operating system version: only once so far.
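
The actual servmon modification isn't shown in this task, so the following is
only a guess at its shape: a minimal Python watchdog that notices the
replica's SQL thread has stopped (the Bacon failure mode) and restarts it.
The host name, credentials, and restart policy are assumptions.

    """Hypothetical replication watchdog; the real servmon fix is not shown."""
    import time

    import pymysql

    def watch_replica(host, check_every=15):
        """Restart replication whenever the slave SQL thread is stopped."""
        conn = pymysql.connect(host=host, user="monitor", password="...")
        while True:
            with conn.cursor(pymysql.cursors.DictCursor) as cur:
                cur.execute("SHOW SLAVE STATUS")
                status = cur.fetchone()
            if status and status["Slave_SQL_Running"] == "No":
                # Replication has silently stopped; kick it back on.
                with conn.cursor() as cur:
                    cur.execute("START SLAVE")
            time.sleep(check_every)

    watch_replica("bacon.example.org")  # invented host name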

The two new database servers have reduced the general lag problems. Search is
now on full time at full rate. Some MediaWiki 1.4 issues (changed queries)
which can cause lag are still being identified and dealt with, either with
querybane rules or with programming changes in MediaWiki.

Two common causes of significant lag have been removed: special page updating
is now done on a separate server that is not in service, and the results are
copied in without significant lag. Searchindex updating is also done while
slaves are offline and no longer causes lag.
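
As an illustration of that offline-update pattern, here is a minimal sketch:
take one replica out of the serving pool, run the expensive update there, then
repool it. The pool structure and host names are invented;
maintenance/updateSpecialPages.php is MediaWiki's special-page updater, though
whether it was invoked this way in 2004 is an assumption.

    """Hypothetical depool/update/repool helper; not the actual 2004 tooling."""
    import subprocess

    POOL = {"db2": True, "db3": True, "db4": True}  # invented replica pool

    def run_offline_update(host, command):
        """Run an expensive maintenance command on a depooled replica."""
        POOL[host] = False  # depool: user queries stop hitting this replica
        try:
            subprocess.run(command, check=True)
        finally:
            POOL[host] = True  # repool once the update has finished

    # Example: regenerate special pages on db4 while it is out of service.
    run_offline_update("db4", ["php", "maintenance/updateSpecialPages.php"])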

Guess this is no longer an issue. ;)

  • Bug 2637 has been marked as a duplicate of this bug.