
Recentchanges old_id updates have long waits
Closed, ResolvedPublic

Description

Author: bugzilla_wikipedia_org.to.jamesd

Description:
Sequences like this for recentchanges happen several
times an hour:

DB      Time  Query or State
------  ----  --------------

enwiki 239 UPDATE recentchanges SET
rc_this_oldid=6638303 WHERE rc_namespace=0 AND
rc_title='Legend_tripping' AND
rc_timestamp='20040912055536'
enwiki 205 UPDATE recentchanges SET
rc_this_oldid=6638305 WHERE rc_namespace=0 AND
rc_title='Gholamhossein_Mosahab' AND
rc_timestamp='20040229205048'
enwiki 204 UPDATE recentchanges SET
rc_this_oldid=6638306 WHERE rc_namespace=6 AND
rc_title='ZeroG.jpg' AND rc_timestamp='20040623185548'
enwiki 190 UPDATE recentchanges SET
rc_this_oldid=6638307 WHERE rc_namespace=0 AND
rc_title='Australian_hornet' AND
rc_timestamp='20040828135413'
enwiki 189 UPDATE recentchanges SET
rc_this_oldid=6638308 WHERE rc_namespace=0 AND
rc_title='Nima_Yooshij' AND
rc_timestamp='20040326005112'
enwiki 184 UPDATE recentchanges SET
rc_this_oldid=6638309 WHERE rc_namespace=6 AND
rc_title='As15-86-11603.jpg' AND
rc_timestamp='20040704033819'
enwiki 170 UPDATE recentchanges SET
rc_this_oldid=6638313 WHERE rc_namespace=6 AND
rc_title='Apollo15missionpatch.png' AND
rc_timestamp='20040726010236'
enwiki 170 UPDATE recentchanges SET
rc_this_oldid=6638314 WHERE rc_namespace=0 AND
rc_title='City_of_Sydney' AND
rc_timestamp='20041009112904'

Notice the timestamps: they are all for a time before any existing
recentchanges records. SHOW INNODB STATUS indicates
that the queries are waiting for a lock on the rc_timestamp
index, presumably all waiting for the first block,
since all of the timestamps precede any entries.

Suggested change:

A. Check the timestamp to see if it is earlier than
the recentchanges purge time limit. If it is, don't
update; if the record still exists, it is going to be
purged soon anyway.

This would be less pleasant for low traffic wikis,
where purges may be relatively infrequent and the
records without the information would stay around for
a while.
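
Suggestion A amounts to a cheap timestamp comparison before the write. A minimal sketch, assuming a 7-day purge window (RC_MAX_AGE_DAYS) and the function name are illustrative, not MediaWiki code; the timestamp format matches the YYYYMMDDHHMMSS strings in the queries above:

```python
import datetime

# Assumed purge window; MediaWiki's actual setting is configurable.
RC_MAX_AGE_DAYS = 7

def should_update_rc(rc_timestamp: str, now: datetime.datetime) -> bool:
    """Return True only if the edit is recent enough that its
    recentchanges row could still exist (YYYYMMDDHHMMSS format)."""
    ts = datetime.datetime.strptime(rc_timestamp, "%Y%m%d%H%M%S")
    return now - ts <= datetime.timedelta(days=RC_MAX_AGE_DAYS)

# A 2004-09-12 edit checked in mid-October falls outside the window,
# so the rc_this_oldid update would be skipped entirely.
now = datetime.datetime(2004, 10, 15)
print(should_update_rc("20040912055536", now))   # False: too old, skip
print(should_update_rc("20041014120000", now))   # True: recent, update
```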

B. Unnecessary if A is done, better overall.

If the timestamp is more recent than the purge
threshold, just do the update. If not:

Use the rc_timestamp index and search for any entries
with the timestamp from the planned update, limit 1.
If no results, there's no work to do and no write
lock needed.

Do the search on a DB slave. Two possible wrong
result cases for a slave:

  1. New record or edit. The timestamp will be within
the purge threshold, so this code path will never be
taken.

  2. Record just purged from the master, with the delete
still replicating. The slave will show a hit and a
suggested need to update the record. The master will get
an unnecessary update, just as it does now. But there
will be far fewer of them.
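
Suggestion B can be sketched as a read-before-write: ask a replica whether any row with the timestamp exists before taking a write lock on the master. The schema below is reduced to the columns visible in the queries above, and a single in-memory SQLite connection stands in for the real master/replica pair, so this is only an illustration of the control flow:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
slave = master = conn  # one connection stands in for both roles here

conn.execute("""CREATE TABLE recentchanges (
    rc_id INTEGER PRIMARY KEY,
    rc_namespace INTEGER,
    rc_title TEXT,
    rc_timestamp TEXT,
    rc_this_oldid INTEGER)""")
conn.execute("INSERT INTO recentchanges VALUES "
             "(1, 0, 'City_of_Sydney', '20041009112904', 0)")

def maybe_update(ns, title, ts, oldid):
    # Cheap LIMIT 1 existence check on the replica: any row at all
    # with this timestamp?  No hit means no work and no write lock.
    hit = slave.execute(
        "SELECT 1 FROM recentchanges WHERE rc_timestamp = ? LIMIT 1",
        (ts,)).fetchone()
    if hit is None:
        return False  # row already purged; skip the master write
    master.execute(
        """UPDATE recentchanges SET rc_this_oldid = ?
           WHERE rc_namespace = ? AND rc_title = ? AND rc_timestamp = ?""",
        (oldid, ns, title, ts))
    return True

print(maybe_update(0, 'Old_Article', '20040101000000', 42))          # False
print(maybe_update(0, 'City_of_Sydney', '20041009112904', 6638314))  # True
```

The replication-lag case from the list above maps onto the `hit is not None` branch: a just-purged row still visible on the replica produces one unnecessary master update, exactly as happens today.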


Version: 1.3.x
Severity: normal

Details

Reference
bz730

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 7:00 PM
bzimport set Reference to bz730.
bzimport added a subscriber: Unknown Object (MLST).

Isn't this due to the infrequent, large, slow purges locking things up? Would it be better to just do smaller, more frequent purges?

bugzilla_wikipedia_org.to.jamesd wrote:

Purging from a cron job or jobs, in small bites off peak, would be ideal for the
bigger databases. There's no need to do purges as part of an end-user operation,
which could happen at a high-load time. DELETE LOW_PRIORITY ... WHERE timestamp <
threshold (for MySQL 4.0.0 and later: ORDER BY rc_id) LIMIT 20 at fairly
frequent intervals might be a reasonable approach. I expect that to be OK alongside
end-user activity at peak times as well, just not as nice to have the load then,
when we don't care about prompt purging.
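
The small-bite purge described above can be sketched as a batched delete loop. SQLite (used here as a stand-in) lacks MySQL's LOW_PRIORITY and DELETE ... LIMIT, so the batch is selected by rc_id first; the 20-row batch size comes from the comment, and everything else is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recentchanges "
             "(rc_id INTEGER PRIMARY KEY, rc_timestamp TEXT)")
# Seed 50 expired rows, all older than the purge threshold below.
conn.executemany("INSERT INTO recentchanges VALUES (?, ?)",
                 [(i, "20040101000000") for i in range(1, 51)])

def purge_batch(threshold: str, limit: int = 20) -> int:
    """Delete up to `limit` expired rows; return how many were removed."""
    ids = [r[0] for r in conn.execute(
        "SELECT rc_id FROM recentchanges "
        "WHERE rc_timestamp < ? ORDER BY rc_id LIMIT ?",
        (threshold, limit))]
    conn.executemany("DELETE FROM recentchanges WHERE rc_id = ?",
                     [(i,) for i in ids])
    return len(ids)

# Run from cron at frequent intervals; each pass holds locks only briefly
# instead of one large blocking DELETE.
while purge_batch("20041001000000"):
    pass
print(conn.execute("SELECT COUNT(*) FROM recentchanges").fetchone()[0])  # 0
```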

These queries are the longest-running ones on the server at the time they
happen. That seems to argue against them waiting for the purge. But something is
causing them to wait and I don't know what it is. Want a copy of the log with
SHOW PROCESSLIST and SHOW INNODB STATUS for some of these to see if you can see
what's causing them? I've looked but haven't found it yet, other than that the
rc_timestamp index is involved. It might be that the delete has happened and
something in the database engine is keeping a lock for a while. Maybe.

In any case, there's no point in even trying most updates with timestamps this
old, because they are to records well before the purge threshold. Trying just
wastes some master disk/CPU time without any benefit. A quick check on
the slave, discarding the potential update if the slave doesn't have the record,
is the way to go for anything less recent than the purge threshold. I'm
curious about the cause of the delay, but since the updates aren't necessary
anyway I can happily forget that curiosity this time.

jeluf wrote:

Cron jobs are a solution for wikipedia.org, but not for other cases,
where web space providers might not offer a way to schedule cron jobs.

bugzilla_wikipedia_org.to.jamesd wrote:

Definitely agreed. This sort of thing is one of the cases where the way
Wikimedia and other big projects work best is different from how smaller and
shared hosting places need to work.

bugzilla_wikipedia_org.to.jamesd wrote:

An update on these: it's not only older records that are
affected; it's all of them. The problem is a lock on the rc_timestamp
index, which is being used to find the rc_id value so the
record can be updated. rc_timestamp is more selective than the
combination of rc_namespace and rc_title (about half the
number of matches). Asking a slave to look up the rc_id value
and then updating via rc_id will reduce the demand for the
rc_timestamp index.
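
The refinement described here differs from the earlier existence check: the replica resolves the primary key, and the master's write then touches only rc_id, never the rc_timestamp index. A sketch under the same assumed reduced schema, with one in-memory connection standing in for both master and replica:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE recentchanges (
    rc_id INTEGER PRIMARY KEY, rc_namespace INTEGER,
    rc_title TEXT, rc_timestamp TEXT, rc_this_oldid INTEGER)""")
conn.execute("INSERT INTO recentchanges VALUES "
             "(7, 6, 'ZeroG.jpg', '20040623185548', 0)")

def update_this_oldid(ns, title, ts, oldid):
    # Read on the replica: rc_timestamp narrows the search,
    # but only rc_id is carried forward to the write.
    row = conn.execute(
        """SELECT rc_id FROM recentchanges
           WHERE rc_timestamp = ? AND rc_namespace = ? AND rc_title = ?
           LIMIT 1""",
        (ts, ns, title)).fetchone()
    if row is None:
        return None  # already purged: no master write at all
    # Write on the master by primary key only; no rc_timestamp
    # index lock is needed for the UPDATE itself.
    conn.execute("UPDATE recentchanges SET rc_this_oldid = ? WHERE rc_id = ?",
                 (oldid, row[0]))
    return row[0]

print(update_this_oldid(6, 'ZeroG.jpg', '20040623185548', 6638306))  # 7
```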

bugzilla_wikipedia_org.to.jamesd wrote:

Please implement at least a configurable time threshold, if not the
existence check, for 1.4. These rc_this_oldid updates are by far the most frequent
cause of save delays, happening many times a day, and it's completely unnecessary
to try to update RC records from several weeks ago when we know that RC entries are
kept for little more than a week.

bugzilla_wikipedia_org.to.jamesd wrote:

Assumed fixed, or at least different, in 1.5.