Page MenuHomePhabricator

Add logging to the mediawiki-services-chromium-render service
Closed, ResolvedPublic2 Estimated Story Points

Description

@pmiazga reports that he's needed to add logging to the service while testing changes for T178501: Limit resources used by Chromium in order to make the chromium-render service ready for production.

The chromium-render library looks like a really good candidate to replace old electron service. First, we need to do proper performance testing and check how this project tackles the production traffic. Without proper logging, it will be very difficult to monitor the service health.

Each log entry should allow us to identify the job (by URL or task ID). When we get many rejected/failed jobs we will be able to verify what is the problem (service overloaded vs problematic articles).

For now, let's keep logs very detailed, as this will help us determining the server requirements

Questions:

  • do we need queue health status?

Log on those actions:

List of actions when we should log

Queue health status

Might be helpful, every one hour log
Level: info
Include:

  • queue size
  • number of running tasks
  • memory usage

Reasoning: an indicator that everything works as expected and service is up and running, even if there are no incoming requests.

  • DONE
Job got queued

Level: info
Include:

  • ID
  • URL
  • size of the queue

Reasoning: an indicator that queue works properly, the size of the queue will help us determine the max_waiting_timeout. When every task has to wait it means we should allow more concurrent requests (if CPU|MEM resources allow that). From now, we will use the ID to search for all related job log entries.

  • DONE
Job got rejected because queue is full

Level: warning
Include:

  • URL

Reasoning: an indicator that queue is full and we need to throw more resources. The URL will help us finding which books weren't rendered (just in case someone creates a ticket with information that everytime he wants to create a book it fails)

  • DONE
Job started

Level: info
Include:

  • ID
  • URL
  • time spent in queue (in microseconds)
  • size of the queue

Reasoning: an indicator that queue is working and the concurrency is correct. If the queue size is always the same/bigger than in the moment when the task was queued it will mean everything goes into the queue and waits there. The queue should work as a buffer, not as a place where each job has to stay for couple seconds.

  • DONE
Job finished

Level: info
Include:

  • ID
  • time spent while rendering (in microseconds)
  • size of the queue

Reasoning: an indicator that chromium rendering is working correctly, also a good indicator how much time it takes to render an article (for tuning up the timeouts configuration for queue/rendering)

  • DONE
Job failed

Level: error
Include:

  • ID
  • URL
  • time spent while rendering (in microseconds)
  • reason

Reasoning: an indicator that chromium rendering is not working correctly, from URL we will be able to find which articles break chromium rendering

  • DONE
Connection closed while job in the queue|progress

Level: info
Include:

  • ID
  • total time spent in the queue (rendering + queue)
  • task state (queue|render)

Reasoning: an indicator that users don't want to wait so long for a PDFs

  • DONE

Stats

It would be also nice to have a graphana board to visualize the chromium-pdf traffic. Stats we would like to track:

  • rejected jobs
  • queued jobs
  • size of the queue when new job comes in
  • number of rendered jobs
  • number of failed renderings
  • time each job spends in the queue
  • time each job spends in the rendering state
  • might be helpful: generated pdf size

Event Timeline

phuedx renamed this task from Add logging to the mediawiki-services-chromium-render to Add logging to the mediawiki-services-chromium-render service.Nov 15 2017, 2:14 PM

Change 392051 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[mediawiki/services/chromium-render@master] Make chromium testing easier

https://gerrit.wikimedia.org/r/392051

Change 392051 merged by Bmansurov:
[mediawiki/services/chromium-render@master] Make chromium testing easier

https://gerrit.wikimedia.org/r/392051

ovasileva moved this task from Triaged but Future to Upcoming on the Web-Team-Backlog board.
pmiazga set the point value for this task to 2.Nov 21 2017, 5:15 PM

Sorry for the task linking spam, y'all. It occurred to me that we didn't actually have a task tracking building the new service!

Change 393664 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/chromium-render@master] Log important events

https://gerrit.wikimedia.org/r/393664

This comment was removed by Jdlrobson.

Change 393664 merged by Mobrovac:
[mediawiki/services/chromium-render@master] Log important events

https://gerrit.wikimedia.org/r/393664

phuedx claimed this task.

Per the discussion in the Sprint Kickoff - Reading Web meeting, I'm happy to sign this off if y'all (@pmiazga, @bmansurov, and @pmiazga) have checked over the service's logs and they're readable and easily understood (which I believe to be the case).

IIRC we noted that there may be a few minor tweaks that need to be made as part of T178278: Performance test the service.