
Deal with logging query spam on crawler 404 floods
Closed, Resolved (Public)

Description

As seen at https://tendril.wikimedia.org/report/, we have a bunch of crawlers of various types hitting non-existent pages. We run a move/delete log query on each such page view, which is fine except when many of these requests arrive at once: the queries end up taking 16-18s.

A possible solution is to skip the LogEventList call in showMissingArticle when a Bloom filter kept in Redis indicates the title has no relevant log entries; the filter would be updated on the fly. It is not clear how to estimate the set size needed to keep the false-positive rate down.
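
For sizing, the usual Bloom filter formulas give the bit-array size m and hash count k from the expected number of entries n and the target false-positive rate p: m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2. A rough sketch of the arithmetic, where both n and p are purely illustrative assumptions (the real n would be the number of titles with move/delete log rows on the wiki):

```php
<?php
// Back-of-the-envelope Bloom filter sizing; n and p are illustrative
// assumptions, not measured figures from any wiki.

$n = 5000000; // assumed number of titles with move/delete log entries
$p = 0.001;   // target false-positive rate

$m = (int)ceil( -$n * log( $p ) / pow( log( 2 ), 2 ) ); // bits in the filter
$k = (int)round( ( $m / $n ) * log( 2 ) );              // number of hash functions

printf( "bits: %d (~%.1f MiB as a Redis string), hashes: %d\n",
	$m, $m / 8 / 1048576, $k );
// For these inputs: roughly 72 million bits (about 8.6 MiB) and 10 hashes,
// comfortably under Redis's 512 MB string size limit.
```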


Version: 1.24rc
Severity: normal

Details

Reference
bz67439

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:28 AM
bzimport set Reference to bz67439.

(In reply to Aaron Schulz from comment #0)

> As seen at https://tendril.wikimedia.org/report/, we have a bunch of
> crawlers of various types hitting non-existent pages. We run a move/delete
> log query on each such page view, which is fine except when many of these
> requests arrive at once: the queries end up taking 16-18s.
>
> A possible solution is to skip the LogEventList call in showMissingArticle
> when a Bloom filter kept in Redis indicates the title has no relevant log
> entries; the filter would be updated on the fly. It is not clear how to
> estimate the set size needed to keep the false-positive rate down.

Of course, a Bloom filter requires an initial scan of the whole logging table to populate it, plus an add() call for each new deletion afterwards. This is problematic if the Redis server is not durable or goes down, since repopulation cannot happen on the fly. Maybe the rebuild could be automatic and batched, switching the filter on only once it is complete.
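
A minimal sketch of what the check/add flow could look like over Redis bit operations (phpredis setBit/getBit). The class name, key layout, and hashing scheme here are invented for illustration and are not the BloomCache API from the merged change below:

```php
<?php
// Hypothetical sketch only -- not the actual BloomCache API.
// A Redis-backed Bloom filter: check before the log query, add() on log writes.

class TitleLogBloomSketch {
	/** @var Redis */
	protected $redis;
	protected $key;
	protected $bits;
	protected $hashes;

	public function __construct( Redis $redis, $key, $bits, $hashes ) {
		$this->redis = $redis;
		$this->key = $key;
		$this->bits = $bits;
		$this->hashes = $hashes;
	}

	/** Derive the bit offsets for a member from independent md5-based hashes. */
	protected function offsets( $member ) {
		$offsets = array();
		for ( $i = 0; $i < $this->hashes; $i++ ) {
			// 15 hex chars = 60 bits, which fits in a 64-bit PHP integer
			$offsets[] = hexdec( substr( md5( $i . ':' . $member ), 0, 15 ) ) % $this->bits;
		}
		return $offsets;
	}

	/** Called when a new move/delete log row is written for a title. */
	public function add( $member ) {
		foreach ( $this->offsets( $member ) as $offset ) {
			$this->redis->setBit( $this->key, $offset, 1 );
		}
	}

	/** False means "definitely no log entries", so the query can be skipped;
	 *  true means "possibly has entries", so run the normal log query. */
	public function mightContain( $member ) {
		foreach ( $this->offsets( $member ) as $offset ) {
			if ( !$this->redis->getBit( $this->key, $offset ) ) {
				return false;
			}
		}
		return true;
	}
}
```

For the batched rebuild, one option (again just a sketch) is to repopulate a filter under a temporary key from a logging table scan and then RENAME it over the live key, so readers switch to the rebuilt filter atomically once it is complete.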

It might also help to route non-user-based logging queries to all DBs rather than just db1055, since the user-based partitioning of that table is not needed for such queries.
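
A hedged sketch of the routing idea on the calling side, assuming a 'logpager' query group is what pins log queries to the replicas carrying the user-partitioned logging table; the group name, variable names, and exact call site are assumptions rather than details from this task:

```php
<?php
// Sketch: only queries that actually filter by performer need the replicas
// with the user-partitioned logging table; title-only lookups can use the
// default group and spread across all replicas.
// $title (the viewed Title) and $filterByUser are assumed caller-provided.

$group = $filterByUser ? 'logpager' : array();
$dbr = wfGetDB( DB_SLAVE, $group );

$res = $dbr->select(
	'logging',
	array( 'log_type', 'log_action', 'log_timestamp', 'log_user_text', 'log_comment' ),
	array(
		'log_namespace' => $title->getNamespace(),
		'log_title' => $title->getDBkey(),
		'log_type' => array( 'move', 'delete' ),
	),
	__METHOD__,
	array( 'ORDER BY' => 'log_timestamp DESC', 'LIMIT' => 10 )
);
```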

Change 143802 merged by jenkins-bot:
Added BloomCache classes

https://gerrit.wikimedia.org/r/143802

Deployed and populated (on enwiki, mostly automatically).