
Enabling LocalisationUpdate vastly increases CPU activity
Closed, Resolved, Public

Description

CPU usage went way up when enabling LocalisationUpdate:
http://techblog.wikimedia.org/wp-content/uploads/2009/09/broke.png

The empty space in the middle is where bug 20773 killed the site after the extension was disabled; once that was fixed, CPU usage went back up, then returned to normal as the caches were rebuilt.

Not deployable in this state; the cause of the high CPU usage needs to be tracked down and cleared up. Is it breaking the cache infrastructure, or is it simply inefficient to pull extra data from the DBs when we've already got the main localization in CDB files?


Version: unspecified
Severity: enhancement
URL: http://techblog.wikimedia.org/wp-content/uploads/2009/09/broke.png

Details

Reference
bz20774

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:46 PM
bzimport set Reference to bz20774.

Note that l10nupdate's installation triggers invalidation of the l10ncache, causing it to be rebuilt from scratch. Try making a whitespace change to MessagesEn.php and syncing that, then see what the resulting CPU spike looks like. Also, please investigate how well-synchronized the Apaches' clocks are.

(In reply to comment #0)

> Not deployable in this state; the cause of the high CPU usage needs to be
> tracked down and cleared up. Is it breaking the cache infrastructure, or is
> it simply inefficient to pull extra data from the DBs when we've already got
> the main localization in CDB files?

I'll look into the CPU usage. However, debug logs from earlier local test runs show that l10nupdate is not pulling localizations from the DB once all its data is in the l10ncache. Offhand, I think the dependency check may hit the DB, but that shouldn't double CPU usage AFAICT.

Hopefully fixed with the rewrite in r56831.

Basically, the two major culprits were:

  1. the code checking the timestamp of the last update.php run (to determine whether to rebuild the l10ncache) pulled stale data from the slaves, and wasn't smart enough to use queriedTimestamp > expectedTimestamp instead of != (see the first sketch after this list)
  2. the initial update.php run inserted about a million rows into each of the five per-cluster localisation tables, using a separate REPLACE statement for each row; this presumably slowed down replication and made #1 worse (see the second sketch below)
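
To make culprit #1 concrete, here is a minimal PHP sketch (variable names and values are illustrative assumptions, not the actual LocalisationUpdate code) of why comparing with != keeps triggering rebuilds when a lagged slave returns a stale timestamp, while comparing with > does not:

```lang=php
<?php
// Hypothetical illustration of culprit #1; not the real LocalisationUpdate code.

// Timestamp of the last update.php run, as recorded on the master.
$expectedTimestamp = 20090910120000;

// Timestamp read back from a lagged slave, which may still show an older run.
$queriedTimestamp = 20090901080000;

// Buggy check: any stale read looks like a change, so the l10ncache is
// considered out of date and gets rebuilt from scratch on every check.
$rebuildBuggy = ( $queriedTimestamp != $expectedTimestamp ); // true  -> spurious rebuild

// Robust check: only rebuild when the stored timestamp is genuinely newer
// than the one the cache was built against; stale slave data is ignored.
$rebuildFixed = ( $queriedTimestamp > $expectedTimestamp );  // false -> cache left alone

var_dump( $rebuildBuggy, $rebuildFixed );
```

With replication lag the slave can only report an equal or older timestamp, so the > comparison never misfires, while != misfires on every lagged read.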
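
And for culprit #2, a sketch of the difference between one REPLACE per row and batched multi-row REPLACE statements. The table and column names are assumptions for illustration only, and real code would need proper escaping:

```lang=php
<?php
// Hypothetical illustration of culprit #2: batching rows into multi-row
// REPLACE statements instead of issuing one statement per row.

/**
 * Build multi-row REPLACE statements from ( key, lang, value ) tuples.
 *
 * @param array $rows      Each element is array( key, lang, value ).
 * @param int   $batchSize Rows per statement.
 * @return string[]        SQL statements, roughly count($rows)/$batchSize of them.
 */
function buildBatchedReplaces( array $rows, $batchSize = 1000 ) {
	$statements = array();
	foreach ( array_chunk( $rows, $batchSize ) as $chunk ) {
		$values = array();
		foreach ( $chunk as $row ) {
			// Values must be escaped/parameterized in real code; omitted here.
			$values[] = "('{$row[0]}', '{$row[1]}', '{$row[2]}')";
		}
		$statements[] = 'REPLACE INTO l10n_cache (lc_key, lc_lang, lc_value) VALUES '
			. implode( ', ', $values );
	}
	return $statements;
}

// A million rows become roughly a thousand statements instead of a million,
// which is far kinder to the binlog and to slave replication.
```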

LU now uses a file-based storage system.
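
For context, here is a minimal sketch of what file-based (CDB) localisation storage looks like, using PHP's dba extension with its cdb handlers; the filename and message keys are illustrative assumptions, not the actual LU on-disk layout:

```lang=php
<?php
// Hypothetical CDB example using PHP's dba extension (requires cdb support).
// Filename and message keys are illustrative, not the real LU layout.

$file = '/tmp/l10n-en.cdb';

// Write phase: 'cdb_make' builds a new constant database in one pass.
$writer = dba_open( $file, 'n', 'cdb_make' );
dba_insert( 'mainpage', 'Main Page', $writer );
dba_insert( 'search', 'Search', $writer );
dba_close( $writer );

// Read phase: lookups are cheap local file reads, with no DB round trip
// and no replication to fall behind.
$reader = dba_open( $file, 'r', 'cdb' );
echo dba_fetch( 'mainpage', $reader ), "\n"; // prints "Main Page"
dba_close( $reader );
```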

This is now believed to be fixed :)

Doing a more conservative progressive production rollout to confirm this...

Yay! System is much happier now :D