Tons of OCG jobs caused a massive increase in queue length
Closed, Resolved (Public)

Description

Since ~10:40 UTC, the OCG job queue has kept increasing:

Screen Shot 2016-10-03 at 7.17.49 PM.png (896×2 px, 130 KB)

More data in https://grafana.wikimedia.org/dashboard/db/ocg

Event Timeline

elukey triaged this task as High priority. Oct 3 2016, 5:22 PM

Someone is trying to render all of enwiktionary to PDF. At the moment they are up to "unadventurous".

cscott@ocg1001:/srv/deployment/ocg/ocg$ tail -f /var/log/ocg/ocg.log | fgrep enwiktionary
Oct  3 18:59:35 ocg1001 mw-ocg-service: {"name":"mw-ocg-service","hostname":"ocg1001","pid":2899,"level":30,"channel":"backend.bundler.bin","job":{"id":"acb9c57867b85b6cf8096bbd2904b2861541efac","writer":"rdf2latex"},"msg":"[33%] Fetching parsed articles: enwiktionary:unadulteratedly [complete]","time":"2016-10-03T18:59:35.787Z","v":0}
Oct  3 18:59:35 ocg1001 mw-ocg-service: {"name":"mw-ocg-service","hostname":"ocg1001","pid":2899,"level":30,"channel":"backend.bundler.bin","job":{"id":"acb9c57867b85b6cf8096bbd2904b2861541efac","writer":"rdf2latex"},"msg":"[73%] Creating attribution page: enwiktionary:<custom wikitext> [Parsoid]","time":"2016-10-03T18:59:35.796Z","v":0}
Oct  3 18:59:35 ocg1001 mw-ocg-service: {"name":"mw-ocg-service","hostname":"ocg1001","pid":425,"level":30,"channel":"backend.bundler.bin","job":{"id":"029e8c039f6acf99ef7ae1e346e0caf821051929","writer":"rdf2latex"},"msg":"[10%] Fetching wiki configuration: enwiktionary siteinfo","time":"2016-10-03T18:59:35.820Z","v":0}
Oct  3 18:59:35 ocg1001 mw-ocg-service: {"name":"mw-ocg-service","hostname":"ocg1001","pid":24955,"level":30,"channel":"backend.bundler.bin","job":{"id":"8f5788875e76c0722452b953d3bd5b283e0fac28","writer":"rdf2latex"},"msg":"[28%] Fetching parsed articles: enwiktionary:unadulterously [Parsoid, revision 39056132]","time":"2016-10-03T18:59:35.903Z","v":0}
Oct  3 18:59:36 ocg1001 mw-ocg-service: {"name":"mw-ocg-service","hostname":"ocg1001","pid":425,"level":30,"channel":"backend.bundler.bin","job":{"id":"029e8c039f6acf99ef7ae1e346e0caf821051929","writer":"rdf2latex"},"msg":"[28%] Fetching parsed articles: enwiktionary:unadventurous [Parsoid, revision 31584643]","time":"2016-10-03T18:59:36.009Z","v":0}
Oct  3 18:59:36 ocg1001 mw-ocg-service: {"name":"mw-ocg-service","hostname":"ocg1001","pid":24955,"level":30,"channel":"backend.bundler.bin","job":{"id":"8f5788875e76c0722452b953d3bd5b283e0fac28","writer":"rdf2latex"},"msg":"[30%] Fetching parsed articles: enwiktionary:unadulterously [authors, revision 39056132]","time":"2016-10-03T18:59:36.088Z","v":0}
Oct  3 18:59:36 ocg1001 mw-ocg-service: {"name":"mw-ocg-service","hostname":"ocg1001","pid":24955,"level":30,"channel":"backend.bundler.bin","job":{"id":"8f5788875e76c0722452b953d3bd5b283e0fac28","writer":"rdf2latex"},"msg":"[33%] Fetching parsed articles: enwiktionary:unadulterously [complete]","time":"2016-10-03T18:59:36.149Z","v":0}
Oct  3 18:59:36 ocg1001 mw-ocg-service: {"name":"mw-ocg-service","hostname":"ocg1001","pid":24955,"level":30,"channel":"backend.bundler.bin","job":{"id":"8f5788875e76c0722452b953d3bd5b283e0fac28","writer":"rdf2latex"},"msg":"[73%] Creating attribution page: enwiktionary:<custom wikitext> [Parsoid]","time":"2016-10-03T18:59:36.156Z","v":0}
Oct  3 18:59:36 ocg1001 mw-ocg-service: {"name":"mw-ocg-service","hostname":"ocg1001","pid":425,"level":30,"channel":"backend.bundler.bin","job":{"id":"029e8c039f6acf99ef7ae1e346e0caf821051929","writer":"rdf2latex"},"msg":"[30%] Fetching parsed articles: enwiktionary:unadventurous [authors, revision 31584643]","time":"2016-10-03T18:59:36.262Z","v":0}
Oct  3 18:59:36 ocg1001 mw-ocg-service: {"name":"mw-ocg-service","hostname":"ocg1001","pid":425,"level":30,"channel":"backend.bundler.bin","job":{"id":"029e8c039f6acf99ef7ae1e346e0caf821051929","writer":"rdf2latex"},"msg":"[33%] Fetching parsed articles: enwiktionary:unadventurous [complete]","time":"2016-10-03T18:59:36.381Z","v":0}
Oct  3 18:59:36 ocg1001 mw-ocg-service: {"name":"mw-ocg-service","hostname":"ocg1001","pid":425,"level":30,"channel":"backend.bundler.bin","job":{"id":"029e8c039f6acf99ef7ae1e346e0caf821051929","writer":"rdf2latex"},"msg":"[73%] Creating attribution page: enwiktionary:<custom wikitext> [Parsoid]","time":"2016-10-03T18:59:36.388Z","v":0}

I cleared the queue to make things a little more responsive for others:

cscott@ocg1001:/srv/deployment/ocg/ocg$ sudo -u ocg -g ocg nodejs-ocg mw-ocg-service/scripts/clear-queue.js -c /etc/ocg/mw-ocg-service.js

We're going to have to come up with a better solution -- rate-limiting by project or something like that? Or just a user ban (but first I need to add enough logging to figure out who is responsible).
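
For illustration only, a per-project limit at job intake could look roughly like the sketch below. This is a minimal token-bucket example under assumed names, not the OCG code: the CAPACITY and REFILL_PER_SEC numbers, the metabook.wiki field, and the rejectJob callback are all hypothetical.

// Hypothetical per-project token bucket for job intake (not the real implementation).
var buckets = {};
var CAPACITY = 50;        // max burst of jobs per project (made-up number)
var REFILL_PER_SEC = 0.5; // tokens restored per second (made-up number)

function allowJob(project) {
    var now = Date.now();
    var b = buckets[project] || (buckets[project] = { tokens: CAPACITY, last: now });
    // Refill proportionally to elapsed time, capped at CAPACITY.
    b.tokens = Math.min(CAPACITY, b.tokens + ((now - b.last) / 1000) * REFILL_PER_SEC);
    b.last = now;
    if (b.tokens < 1) {
        return false; // project is over budget; reject or delay the job
    }
    b.tokens -= 1;
    return true;
}

// Usage at submission time (hypothetical field and callback):
// if (!allowJob(metabook.wiki)) { return rejectJob('rate-limited'); }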

@elukey said:

(12:53:51 PM) elukey: cscott: one thing that was noted from https://grafana.wikimedia.org/dashboard/db/eventbus is that EventBus (and ChangeProp) saw a rapid increase in transcludes events ~20 mins after the OCG's queue started to increase
(12:54:07 PM) elukey: not sure if related or not (we were thinking about a template change somewhere)

It's probably the attribution pages being rendered in the logs above: OCG renders attributions via template transclusion, to allow wiki admins to edit the exact attribution text and format. So a surge in OCG renders will cause a surge in template transclusions.

Change 313876 had a related patch set uploaded (by C. Scott Ananian):
Allow administrators to blacklist renders of certain metabooks.

https://gerrit.wikimedia.org/r/313876

Change 313876 merged by jenkins-bot:
Allow administrators to blacklist renders of certain metabooks.

https://gerrit.wikimedia.org/r/313876

Change 313915 had a related patch set uploaded (by C. Scott Ananian):
Add en.wiktionary.org to blacklist temporarily to prevent DoS.

https://gerrit.wikimedia.org/r/313915

Change 313915 merged by C. Scott Ananian:
Add en.wiktionary.org to blacklist temporarily to prevent DoS.

https://gerrit.wikimedia.org/r/313915

OK, deployed a patch to blacklist en.wiktionary.org for the time being, rejecting jobs in the front end. When the front-end error rate on https://grafana.wikimedia.org/dashboard/db/ocg goes back down, we can revert 0bf27e3452dfdc770317f15793e93e6e89c7865a.

The next steps would be to add additional logging to the Collection front end -- IP or username or something like that -- and then we can use a more precise blacklist regexp. We could also try to tackle T97030 while we're at it. The blacklist patch does dump the complete metabook into logstash when it rejects a job, so if we stuff some additional information into the metabook we should be able to learn more about our mystery spider.
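
For reference, the shape of such a front-end blacklist check is roughly the sketch below. The config key, the metabook fields, and the logger call are illustrative assumptions, not the exact code from the merged patch.

// Hypothetical metabook blacklist check (names are illustrative).
var blacklist = (config.blacklistRegexps || []).map(function (s) {
    return new RegExp(s);
});

function isBlacklisted(metabook) {
    // Serialize the whole metabook so a regexp can match titles, wiki names, etc.
    var serialized = JSON.stringify(metabook);
    return blacklist.some(function (re) { return re.test(serialized); });
}

// On submission: reject the job and dump the metabook so the source can be traced later.
// if (isBlacklisted(metabook)) {
//     logger.warn('Rejected blacklisted render', { metabook: metabook });
//     return reportError('This collection cannot be rendered at the moment.');
// }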

This is the start of a good incident report with next steps/actionables.

Change 314058 had a related patch set uploaded (by C. Scott Ananian):
Revert "Add en.wiktionary.org to blacklist temporarily to prevent DoS."

https://gerrit.wikimedia.org/r/314058

Change 314177 had a related patch set uploaded (by C. Scott Ananian):
Log request source information passed along from PHP.

https://gerrit.wikimedia.org/r/314177

Change 314178 had a related patch set uploaded (by C. Scott Ananian):
Pass request source and user information to backend.

https://gerrit.wikimedia.org/r/314178

Since the deployment of the last changes, I was looking at disk usage and noticed that there are a ton of deleted-but-still-open files created by ocg in

/srv/deployment/ocg/postmortem/

This happens when a file is unlinked from the filesystem but its file handle is still kept open by the application.

I have no idea if this is a regression introduced by the latest deployment, though.
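
(For reference, the deleted-but-still-open handles can be confirmed with something like the command below; lsof's +L1 flag lists open files whose link count is zero, i.e. unlinked but still held open, and the grep pattern is just an example.)

sudo lsof +L1 | grep '/srv/deployment/ocg/postmortem'

Each matching line includes the owning command and PID, which should show whether the handles belong to the long-running mw-ocg-service workers or to something else.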

I'm guessing that's most likely caused by the actions I took to clear the queue pre-deploy. But I'll take a look.

elukey lowered the priority of this task from High to Medium. Oct 20 2016, 1:22 PM

Change 314177 merged by jenkins-bot:
Log request source information passed along from PHP.

https://gerrit.wikimedia.org/r/314177

Change 314058 merged by Arlolra:
Revert "Add en.wiktionary.org to blacklist temporarily to prevent DoS."

https://gerrit.wikimedia.org/r/314058