@pmiazga reports that he's needed to add logging to the service while testing changes for T178501: Limit resources used by Chromium in order to make the chromium-render service ready for production.
The chromium-render library looks like a really good candidate to replace old electron service. First, we need to do proper performance testing and check how this project tackles the production traffic. Without proper logging, it will be very difficult to monitor the service health.
Each log entry should allow us to identify the job (by URL or task ID). When we get many rejected/failed jobs we will be able to verify what is the problem (service overloaded vs problematic articles).
For now, let's keep logs very detailed, as this will help us determining the server requirements
Questions:
- do we need queue health status?
Log on those actions:
List of actions when we should log
Queue health status
Might be helpful, every one hour log
Level: info
Include:
- queue size
- number of running tasks
- memory usage
Reasoning: an indicator that everything works as expected and service is up and running, even if there are no incoming requests.
- DONE
Job got queued
Level: info
Include:
- ID
- URL
- size of the queue
Reasoning: an indicator that queue works properly, the size of the queue will help us determine the max_waiting_timeout. When every task has to wait it means we should allow more concurrent requests (if CPU|MEM resources allow that). From now, we will use the ID to search for all related job log entries.
- DONE
Job got rejected because queue is full
Level: warning
Include:
- URL
Reasoning: an indicator that queue is full and we need to throw more resources. The URL will help us finding which books weren't rendered (just in case someone creates a ticket with information that everytime he wants to create a book it fails)
- DONE
Job started
Level: info
Include:
- ID
- URL
- time spent in queue (in microseconds)
- size of the queue
Reasoning: an indicator that queue is working and the concurrency is correct. If the queue size is always the same/bigger than in the moment when the task was queued it will mean everything goes into the queue and waits there. The queue should work as a buffer, not as a place where each job has to stay for couple seconds.
- DONE
Job finished
Level: info
Include:
- ID
- time spent while rendering (in microseconds)
- size of the queue
Reasoning: an indicator that chromium rendering is working correctly, also a good indicator how much time it takes to render an article (for tuning up the timeouts configuration for queue/rendering)
- DONE
Job failed
Level: error
Include:
- ID
- URL
- time spent while rendering (in microseconds)
- reason
Reasoning: an indicator that chromium rendering is not working correctly, from URL we will be able to find which articles break chromium rendering
- DONE
Connection closed while job in the queue|progress
Level: info
Include:
- ID
- total time spent in the queue (rendering + queue)
- task state (queue|render)
Reasoning: an indicator that users don't want to wait so long for a PDFs
- DONE
Stats
It would be also nice to have a graphana board to visualize the chromium-pdf traffic. Stats we would like to track:
- rejected jobs
- queued jobs
- size of the queue when new job comes in
- number of rendered jobs
- number of failed renderings
- time each job spends in the queue
- time each job spends in the rendering state
- might be helpful: generated pdf size