Following on from our work stress testing the service, there are some cases where we do not cleanly abort jobs that are no longer needed. This consumes memory unnecessarily, and we are worried it might impact the reliability of the service.
While stress testing I found that when puppeteer closes the browser connection, two things can happen:
- sometimes we log a spurious error after puppeteer closes the browser connection. This is because puppeteer still tries to manage chromium, but the browser is closed and it fails with an error (there are 3 or 4 distinct errors it may throw). We shouldn't log those errors, as they wrongly suggest that something is wrong. Fixed as part of https://gerrit.wikimedia.org/r/#/c/393664/
- sometimes after we abort the job it still renders and after some time logs "job completed", which means that the job wasn't aborted.
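One way to avoid logging the spurious errors is to remember that we closed the browser on purpose and to downgrade anything that arrives afterwards. The sketch below is only an illustration of that idea: `RenderJob`, `abort`, `handleError` and the logger shape are hypothetical names, not the service's actual code, and it deliberately does not match on specific puppeteer error messages.

```javascript
// Hypothetical wrapper around a single render job (illustrative only).
class RenderJob {
  constructor(logger) {
    this.logger = logger;   // assumed to expose error() and debug()
    this.aborted = false;   // set once we close the browser ourselves
  }

  // Called when the client disconnects and we close the browser on purpose.
  abort() {
    this.aborted = true;
  }

  // Called for every error surfaced while rendering.
  // Returns true if the error was logged as a real failure.
  handleError(err) {
    if (this.aborted) {
      // The browser is gone because we closed it, so follow-up
      // failures are expected and should not be reported as errors.
      this.logger.debug(`ignored after abort: ${err.message}`);
      return false;
    }
    this.logger.error(err.message);
    return true;
  }
}
```

With this shape, an error raised before `abort()` is still logged at error level, while anything raised after it only shows up in debug output.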
Developer notes
We estimated 3s and 5s as there is a little risk involved here.
We agreed, however, that we should not get too deep into this task. If we discover this is more complex than a 3, we should take a step back as a team and reconsider this task, either creating a spike to do more analysis or re-evaluating user value.
QA steps
To simplify the QA process I made a small tool, API Link generator. It helps to create links to the API: instead of building the URL by yourself, you fill in the fields and the tool creates a link for you, with a button to get the PDF. The tool also takes into account the fact that the Beta Cluster proton instance can print only articles from the beta cluster wiki.
The easiest way to test the tool is to download the repository locally, install chromium, edit the config.dev.yaml file to change render_concurrency to something bigger than 1 (5 should be fine), and then run:
npm install
npm run start
Then use the tool and point it at the local dev instance. With that approach the dev install will be able to print articles from production. The Beta Cluster instance doesn't have all articles/styles/templates/etc.; it can safely be used for smoke tests, but for anything more rigorous we should use the production instance (not present yet) or the dev instance.
Note: if you use chrome, or if you keep chromium somewhere else, please edit the config.dev.yaml file and change the executablePath to point to your local chromium instance (/usr/bin/chromium is the default path).
Note 2: The chromium-render outputs a very useful log: it tells you when the job got into the queue, when it started rendering, and when it finished. Keep an eye on the log; it will be helpful when, for example, you need to abort the render.
Note 3: For testing it will be useful to print very long articles (ones that take ~5-10s to render; I used some articles from the Long pages list).
Things to test
First, use the API Link generator to create a couple of URLs (for different articles). The easiest way to test this task is to use a command-line tool like wget or curl; it will be much easier to start/stop rendering.
- try to generate PDFs for a couple of random articles, check that the PDFs are rendered correctly, and verify that after each render is finished all resources are freed (memory is released, there are no chromium processes in the background)
- do the same but for concurrent renders (try to download the URLs in different tabs/terminals at once). All renders should output correct PDF files, and the chromium processes should exit by themselves
- try to generate the PDF and abort while it's still in the queue (rendering has not started). The request should end immediately, and the chromium browser shouldn't be spawned
- try to generate the PDF and abort while it's in the rendering state. The process should stop the chromium browser
- keep the service running and rendering for some time (like one hour; for constant testing you can use tools like siege or jmeter), and verify that there are no zombie processes and that no chromium processes are present once you stop rendering PDFs
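The two abort cases in the list above (still queued vs. already rendering) can be pictured with a small queue model. This is a minimal sketch under stated assumptions, not the service's implementation: `RenderQueue` and the job shape (`id`, `start()`, `cancel()`) are hypothetical, and `cancel()` stands in for whatever closes the chromium browser.

```javascript
// Hypothetical model of a render queue with a concurrency limit.
class RenderQueue {
  constructor(concurrency) {
    this.concurrency = concurrency; // plays the role of render_concurrency
    this.waiting = [];              // jobs that have not started rendering
    this.running = new Map();       // id -> job currently rendering
  }

  add(job) {
    this.waiting.push(job);
    this._drain();
  }

  abort(id) {
    // Case 1: still queued. Drop it; no browser was ever spawned.
    const idx = this.waiting.findIndex((j) => j.id === id);
    if (idx !== -1) {
      this.waiting.splice(idx, 1);
      return 'removed-from-queue';
    }
    // Case 2: rendering. Ask the job to stop (close its browser).
    const job = this.running.get(id);
    if (job) {
      job.cancel();
      this.running.delete(id);
      this._drain();
      return 'cancelled-while-rendering';
    }
    return 'not-found';
  }

  // Called when a job finishes rendering on its own.
  finish(id) {
    if (this.running.delete(id)) {
      this._drain();
    }
  }

  // Start waiting jobs while there is a free rendering slot.
  _drain() {
    while (this.running.size < this.concurrency && this.waiting.length > 0) {
      const job = this.waiting.shift();
      this.running.set(job.id, job);
      job.start();
    }
  }
}
```

In this model an aborted queued job never reaches `start()`, and aborting a rendering job frees its slot for the next waiting job, which is the behaviour the checks above are looking for.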
In short, we want to test that after both a successful render and an aborted render the script cleans up the environment properly: the memory is freed and there are no remaining chromium processes running in the background.
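For the "no remaining chromium processes" check, a small helper can save some manual `ps` reading during long runs. The sketch below is an assumption-laden convenience, not part of the service: `countProcesses` and `countChromium` are hypothetical names, it assumes a Unix-like system with `ps` available, and it assumes the process name contains "chrom" (adjust the needle if your binary is named differently).

```javascript
const { execFileSync } = require('child_process');

// Count lines in `ps` output whose command name contains `needle`.
// Zombie processes are typically listed with a "<defunct>" suffix,
// so they are counted as well.
function countProcesses(psOutput, needle) {
  return psOutput
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && line.includes(needle))
    .length;
}

// Ask the system for all command names and count the chromium-like ones.
// `ps -A -o comm=` lists every process's command name without a header.
function countChromium(needle = 'chrom') {
  const output = execFileSync('ps', ['-A', '-o', 'comm='], { encoding: 'utf8' });
  return countProcesses(output, needle);
}
```

Calling `countChromium()` after all renders have finished or been aborted should report 0, provided no other Chrome or chromium instance (for example your own browser) is running on the machine.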