
[EPIC] New service request: chromium-render/deploy
Closed, Resolved | Public | 5 Estimated Story Points

Description

This task encompasses the work to stand up the Chromium render service in production and staging.

name: chromium-render
description: a MediaWiki service for rendering wiki pages as PDFs using headless Chromium. This service replaces the Electron rendering service (which replaced OCG). Additional (but outdated) background is available on MW.org.
timeline: as soon as possible; the service as a whole is slated for completion by the end of the quarter.
point person: @pmiazga, @Niedzielski, @phuedx
technologies: a Node.js service that uses Chromium via Puppeteer bindings.
request flow diagram:

Requests flow downwards, responses flow upwards:

         [      MobileFrontend      ]
                     |
        [      chromium-render       ]
                     |
[      MediaWiki (desktop/mobile pages)      ]

Additionally, an internal flow description is available in the project readme.

project repos:

additional concerns:

  • Chromium, a dependency of the project, must be provided by the system. The environment variable, PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1, must be set prior to executing npm install.
  • The target OS must be Debian Stretch per T180037. Debian Stretch has the most up-to-date version of Chromium and there's no community-maintained new-ish Chromium build for Debian Jessie (and we don't want to build and maintain our own!).
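A minimal sketch of how a Node.js service might point Puppeteer at the system-provided Chromium once the bundled download has been skipped. The `CHROME_BIN` variable is mentioned later in this task; the helper name `buildLaunchOptions` and the fallback path `/usr/bin/chromium` are assumptions for illustration and are not taken from the project.

```javascript
// Hypothetical helper: build the options object for puppeteer.launch().
// Assumption: the binary path comes from CHROME_BIN, with an assumed
// Debian-style fallback path. Neither is confirmed by the project.
function buildLaunchOptions(env) {
  return {
    // `executablePath` is a documented puppeteer.launch() option that makes
    // Puppeteer start an existing browser instead of a bundled download.
    executablePath: env.CHROME_BIN || '/usr/bin/chromium',
    headless: true,
  };
}

// Intended use (needs puppeteer installed with
// PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=1 set during npm install):
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch(buildLaunchOptions(process.env));
console.log(buildLaunchOptions({ CHROME_BIN: '/usr/bin/chromium-browser' }));
```

Keeping the path in an environment variable means the same build can run on hosts where the Debian package installs the binary elsewhere.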

note: I (@Niedzielski) am new to this project, which has changed hands and plans a number of times, and I haven't deployed a service before, so please be wary of incorrect details in this task description. My apologies in advance for any confusion this ticket creates.

Acceptance criteria

  • Instance OS is Debian Stretch.
  • Chromium is installed as a Debian package (e.g., sudo apt install chromium-browser) on the instance.
  • npm install does not download Chromium.
  • Service is available for production and staging deployments.

Details

Repo                                       Branch      Lines +/-
mediawiki/services/restbase/deploy         master      +1 -1
mediawiki/services/chromium-render         master      +8 -11
mediawiki/services/chromium-render         master      +31 -2
mediawiki/services/restbase/deploy         master      +4 -0
operations/puppet                          production  +1 -0
operations/puppet                          production  +7 -0
operations/puppet                          production  +2 -2
operations/puppet                          production  +27 -0
operations/dns                             master      +2 -0
operations/puppet                          production  +4 -0
mediawiki/services/chromium-render         master      +1 -1
mediawiki/services/chromium-render         master      +22 -10
mediawiki/services/chromium-render         master      +5 -0
operations/puppet                          production  +8 -2
mediawiki/services/chromium-render/deploy  master      +2 -2
operations/puppet                          production  +6 -0
operations/puppet                          production  +9 -0
operations/dns                             master      +6 -0
operations/puppet                          production  +6 -0
operations/puppet                          production  +9 -1
operations/puppet                          production  +12 -0
operations/puppet                          production  +3 -1
operations/puppet                          production  +1 -1
mediawiki/services/chromium-render         master      +2 -1
mediawiki/services/chromium-render         master      +3 -3
mediawiki/services/chromium-render         master      +9 -9
operations/puppet                          production  +39 -0
mediawiki/services/chromium-render/deploy  master      +5 -0
operations/puppet                          production  +19 -0
mediawiki/services/chromium-render/deploy  master      +1 M -1
mediawiki/services/chromium-render         master      +2 -2
mediawiki/services/chromium-render/deploy  master      +5 -0
mediawiki/services/chromium-render         master      +5 -1
integration/config                         master      +5 -0
operations/puppet                          production  +11 -589

Related Objects

Status     Assigned
Resolved   ovasileva
Resolved   None
Resolved   Bawolff
Resolved   phuedx
Resolved   mobrovac
Resolved   mobrovac
Resolved   phuedx
Resolved   Jdrewniak
Resolved   phuedx
Resolved   phuedx
Resolved   phuedx
Resolved   phuedx
Declined   None
Resolved   bmansurov
Resolved   mobrovac
Resolved   ovasileva
Invalid    None
Resolved   Jdlrobson
Resolved   phuedx
Resolved   phuedx
Resolved   holger.knust
Resolved   Tgr
Resolved   jijiki
Resolved   MSantos
Resolved   mobrovac
Resolved   ovasileva
Resolved   phuedx
Declined   pmiazga
Resolved   Dzahn
Resolved   pmiazga
Duplicate  holger.knust
Resolved   MSantos
Resolved   Tgr
Resolved   Johan
Open       None
Resolved   ovasileva
Invalid    None
Resolved   mobrovac
Resolved   mobrovac
Resolved   akosiaris
Resolved   mobrovac
Resolved   fgiunchedi
Resolved   pmiazga
Resolved   faidon
Resolved   mobrovac
Resolved   mobrovac
Resolved   pmiazga
Resolved   Jdrewniak
Resolved   mobrovac
Resolved   phuedx
Resolved   pmiazga
Resolved   pmiazga

Event Timeline


Icinga is reporting that the new proton endpoints are not healthy, since about 2d 7h, on both proton1001 and proton2001. The reason given is:

	/{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200)
	/{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 404 (expecting: 200)

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=proton1001&service=proton+endpoints+health
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=proton2001&service=proton+endpoints+health

Since this task is open, this is still WIP, right? I just ACKed the check with a link to this task.

@Dzahn no need to worry - this is not used in prod yet.
@pmiazga could you please silence the icinga for the time being?

@Dzahn no need to worry - this is not used in prod yet.

Yep, thanks for confirming.

@pmiazga could you please silence the icinga for the time being?

I had already ACKed it, which means it will stay silent until the next state change, i.e. until it turns OK. I have additionally scheduled a 3-month downtime, so it can flap up and down while you are still testing without causing notifications, and it stays out of the "unacknowledged" column in the web UI. So you don't have to worry about it anymore. Except now we have to remember to remove that again once this goes production.

@Dzahn thank you for taking care of it!

Except now we have to remember to remove that again once this goes production.

Now we will not have to remember: https://phabricator.wikimedia.org/T203276

And @pmiazga, it would be great to find out why it is failing the checks.

I think (I didn't verify it yet) that it fails because of introduced restbase checks:

@Pchelolo @Dzahn I assume this error has been happening since https://gerrit.wikimedia.org/r/#/c/mediawiki/services/chromium-render/+/442152/. As @mobrovac asked, we added an extra check to verify that the article exists before adding it to the queue. The check is done via the RESTBase /page/title/{TITLE} query. I think the problem is that neither the Foo nor the Bar article exists on the wiki the service checker is pointing to.

@Dzahn could you provide me the full URL the service checker is requesting for those two calls? I'll add the test article and the redirection there and service should work properly again. Sorry for the inconvenience, I didn't think that the restbase change can break service checks.
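The existence check described above could be sketched roughly as follows. Only the /page/title/{TITLE} route comes from this discussion; `fetchStatus` (a function mapping a URL to a promise of an HTTP status code), `restbaseBase`, and both function names are hypothetical stand-ins, and this is not the code from the linked patch.

```javascript
// Sketch: look the title up before queueing a render job.
// `fetchStatus` is a hypothetical injected function: url -> Promise<number>.
// `restbaseBase` is an assumed base URL for the RESTBase API.
async function articleExists(fetchStatus, restbaseBase, title) {
  const url = `${restbaseBase}/page/title/${encodeURIComponent(title)}`;
  const status = await fetchStatus(url);
  return status === 200;
}

// Reject with a 404-style error instead of queueing when the article is missing.
async function enqueueIfExists(queue, fetchStatus, restbaseBase, title) {
  if (!(await articleExists(fetchStatus, restbaseBase, title))) {
    const err = new Error(`Article not found: ${title}`);
    err.status = 404;
    throw err;
  }
  queue.push(title);
  return queue.length;
}
```

Under this reading, a health check that requests a title missing from the target wiki would see a 404 even though the renderer itself is healthy.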

@Dzahn could you provide me the full URL the service checker is requesting for those two calls?

It's an NRPE check running on the proton servers themselves. The command is defined in @proton1001:/etc/nagios/nrpe.d/check_endpoints_proton.cfg. That file runs /usr/local/bin/check-proton, which contains:

STATSD_HOST="statsd.eqiad.wmnet" STATSD_PORT="8125" STATSD_PREFIX="service_checker.proton.proton1001" /usr/bin/service-checker-swagger -t 10 10.64.0.20 http://10.64.0.20:24766

10.64.0.20 is proton1001

When I strace service-checker-swagger, I see the following GET requests:

sendto(3, "GET /?spec HTTP/1.1\r\nAccept-Enco"..., 106, 0, NULL, 0) = 106
sendto(14, "GET /_info/version HTTP/1.1\r\nAcc"..., 114, 0, NULL, 0) = 114
sendto(13, "GET /_info HTTP/1.1\r\nAccept-Enco"..., 106, 0, NULL, 0) = 106
sendto(9, "GET /_info/home HTTP/1.1\r\nAccept"..., 111, 0, NULL, 0) = 111
sendto(8, "GET /en.wikipedia.org/v1/pdf/// "..., 127, 0, NULL, 0) = 127
sendto(7, "GET /en.wikipedia.org/v1/pdf/Non"..., 150, 0, NULL, 0) = 150
sendto(6, "GET /en.wikipedia.org/v1/pdf/Bar"..., 138, 0, NULL, 0) = 138
sendto(5, "GET /en.wikipedia.org/v1/pdf/Foo"..., 143, 0, NULL, 0) = 143
sendto(3, "GET /robots.txt HTTP/1.1\r\nAccept"..., 111, 0, NULL, 0) = 111
sendto(15, "GET /_info/name HTTP/1.1\r\nAccept"..., 111, 0, NULL, 0) = 111
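To relate the truncated request lines above to the path template in the Icinga message, here is a small sketch that expands /{domain}/v1/pdf/{title}/{format}/{type} into a concrete path. The function name is hypothetical, and the `type` value used below is an assumption, since the strace output cuts the URLs off before that segment.

```javascript
// Sketch: expand a path template such as the one reported by Icinga.
// Throws if a placeholder has no matching parameter.
function expandTemplate(template, params) {
  return template.replace(/\{(\w+)\}/g, (match, name) => {
    if (!(name in params)) {
      throw new Error(`Missing parameter: ${name}`);
    }
    return encodeURIComponent(params[name]);
  });
}

const template = '/{domain}/v1/pdf/{title}/{format}/{type}';
console.log(expandTemplate(template, {
  domain: 'en.wikipedia.org',
  title: 'Foo',
  format: 'letter',
  type: 'desktop', // assumed value, not visible in the strace output
}));
```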

The task is nearly ready, the remaining bits and pieces:

  • we need to merge the last patch for T181623 - the new flow for closing the chromium browser - blocked on Services
  • security review - blocked on the review
  • fix the Icinga issues - most probably a configuration issue - blocked on services
  • review the Grafana dashboard - @pmiazga will do it

Then we can stream traffic to the service to verify that it can withstand live load. The service will receive all requests sent to the old Electron service, but the PDF output will be discarded. We want to see how the service performs in the live environment.

Once the service has been tested against live traffic, we still have to:

  • change the logic flow in the queue - from async to promises (@pmiazga, services to review)
  • apply changes to existing extensions (like MobileFrontend/OCG) -- remove banners, point MobileFrontend downloadIcon to the new service etc
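For the queue item in the list above, a promise-based queue with a concurrency limit might look roughly like this. It is an illustrative sketch only: the class name, method names, and the concurrency parameter are hypothetical, and it does not reproduce the service's actual queue.

```javascript
// Sketch of a promise-based job queue with a concurrency limit.
// All names here are hypothetical; this is not the service's queue.
class PromiseQueue {
  constructor(concurrency) {
    this.concurrency = concurrency;
    this.running = 0;
    this.waiting = [];
  }

  // Schedule `job` (a function returning a value or a promise).
  // The returned promise settles with the job's result or error.
  push(job) {
    return new Promise((resolve, reject) => {
      this.waiting.push({ job, resolve, reject });
      this._next();
    });
  }

  // Start the next waiting job if a slot is free.
  _next() {
    if (this.running >= this.concurrency || this.waiting.length === 0) {
      return;
    }
    const { job, resolve, reject } = this.waiting.shift();
    this.running += 1;
    Promise.resolve()
      .then(job) // synchronous throws become rejections
      .then(resolve, reject)
      .then(() => {
        this.running -= 1;
        this._next();
      });
  }
}
```

Callers simply await `queue.push(...)`, so errors from a render job propagate as ordinary rejections rather than through callback arguments.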

/cc @ovasileva

Change 461160 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/restbase/deploy@master] Add new_uri and new_probability config params to PDF

https://gerrit.wikimedia.org/r/461160

Change 461160 merged by Ppchelko:
[mediawiki/services/restbase/deploy@master] Add new_uri and new_probability config params to PDF

https://gerrit.wikimedia.org/r/461160

Mentioned in SAL (#wikimedia-operations) [2018-09-18T18:04:35Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@55100d4]: Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements

Mentioned in SAL (#wikimedia-operations) [2018-09-18T18:08:13Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@55100d4]: Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements (duration: 03m 37s)

Mentioned in SAL (#wikimedia-operations) [2018-09-18T18:22:09Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@4c3128f]: Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints

Mentioned in SAL (#wikimedia-operations) [2018-09-18T18:31:04Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@4c3128f]: Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints (duration: 08m 55s)

Mentioned in SAL (#wikimedia-operations) [2018-09-18T18:32:51Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@4c3128f]: Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints, take 2, feed timing out

Mentioned in SAL (#wikimedia-operations) [2018-09-18T18:40:49Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@4c3128f]: Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints, take 2, feed timing out (duration: 07m 58s)

Mentioned in SAL (#wikimedia-operations) [2018-09-18T18:43:40Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@4c3128f]: Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints, take 2, feed timing out

Mentioned in SAL (#wikimedia-operations) [2018-09-18T18:49:54Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@4c3128f]: Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints, take 2, feed timing out (duration: 06m 14s)

I've deployed a change that splits traffic in RESTBase and also sends requests to Proton (25% of traffic)

Also, for testing, you can use the new_pdf query parameter to force a request for the Proton PDF: https://en.wikipedia.org/api/rest_v1/page/pdf/Main_Page?new_pdf=true

Let's monitor what happens and how the service responds to a little more traffic.
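As an illustration of the routing decision only (not RESTBase's actual implementation), a probability-based split with a forced override could be sketched like this. The names `new_uri`, `new_probability`, and `new_pdf` come from the patch title and the comment above; the function name, the config shape, the placeholder URI, and the injectable `random` argument are assumptions.

```javascript
// Sketch: decide whether a PDF request should be sent to the new backend.
// `random` is injectable so the decision can be tested deterministically.
function shouldUseNewBackend(query, config, random = Math.random) {
  if (query.new_pdf === 'true') {
    return true; // explicit opt-in, e.g. ?new_pdf=true for manual testing
  }
  return random() < config.new_probability;
}

// Assumed config shape; the URI is a placeholder, not a real host.
const config = { new_uri: 'http://proton.example.invalid', new_probability: 0.25 };
console.log(shouldUseNewBackend({ new_pdf: 'true' }, config));
```

With `new_probability` set to 0.25, roughly a quarter of requests without the override would take the new path; setting it to 1 corresponds to the later 100% rollout.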

Mentioned in SAL (#wikimedia-operations) [2018-09-18T19:11:46Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@4c3128f] (dev-cluster): Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints, take 2, feed timing out

Mentioned in SAL (#wikimedia-operations) [2018-09-18T19:18:18Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@4c3128f] (dev-cluster): Remove restrictions table T203835, Split 25% PDF traffic to Proton T186748, Metrics endpoints improvements, don't monitor PDF endpoints, take 2, feed timing out (duration: 06m 32s)

After sending 25% of traffic to Proton, there's not much to gather - a couple of crashers, which is ok for a new service, but nothing bad so far. I do not believe the metrics regarding latency are correct, but after fixing them I propose to go right to 100% of traffic to give it a real kick. The results are not served to the user anyway, which means we can be more brutal here.

I think (I didn't verify it yet) that it fails because of introduced restbase checks:

That's not the case. The 404s are produced by Chromium, not RESTBase, which suggests an incompatibility between the version of Chromium we have in production and puppeteer v1.8.0. This problem appeared with this deployment, which includes bumping puppeteer's version and the fix for T181623. IMO, the problem is with Puppeteer sending the needed headers to Chromium.

After sending 25% of traffic to Proton, there's not much to gather - a couple of crashers, which is ok for a new service, but nothing bad so far. I do not believe the metrics regarding latency are correct, but after fixing them I propose to go right to 100% of traffic to give it a real kick. The results are not served to the user anyway, which means we can be more brutal here.

Thank you @Pchelolo for deploying this. Let's not go to 100% yet, at least not until we understand what is going on with Proton returning 404s.

@mobrovac we're on 1.7.0, Puppeteer 1.8.0 got released like a week ago. Today I'll create a patch to update puppeteer to the latest version (we will be able to drop the CHROME_BIN variable)

Change 461366 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[mediawiki/services/chromium-render@master] Bump Puppeteer to latest 1.8.0 version

https://gerrit.wikimedia.org/r/461366

Change 461380 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[mediawiki/services/chromium-render@master] Handle undefined response

https://gerrit.wikimedia.org/r/461380

Change 461380 merged by jenkins-bot:
[mediawiki/services/chromium-render@master] Handle undefined response

https://gerrit.wikimedia.org/r/461380

Mentioned in SAL (#wikimedia-operations) [2018-10-04T13:43:39Z] <pmiazga@deploy1001> Started deploy [proton/deploy@ecb9a0e]: Bugfix:handle undefined response and fix grafana stats (T186748,T201158)

Mentioned in SAL (#wikimedia-operations) [2018-10-04T13:46:35Z] <pmiazga@deploy1001> Finished deploy [proton/deploy@ecb9a0e]: Bugfix:handle undefined response and fix grafana stats (T186748,T201158) (duration: 02m 55s)

Remaining tasks:

  • fix fonts: @mobrovac is going to provision a new server where we can run fonts-config once again. We will check the differences between a fresh install and the state after running the dpkg-reconfigure font-config command.
  • downgrade Puppeteer to 1.6.0 and put it on production
  • finish T204055
  • once T204055 is merged, verify and resolve T201158

Change 467696 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[mediawiki/services/chromium-render@master] Rollback Puppeteer to 1.5.0

https://gerrit.wikimedia.org/r/467696

Change 467696 merged by Mobrovac:
[mediawiki/services/chromium-render@master] Rollback Puppeteer to 1.5.0

https://gerrit.wikimedia.org/r/467696

Mentioned in SAL (#wikimedia-operations) [2018-10-16T14:51:04Z] <mobrovac@deploy1001> Started deploy [proton/deploy@a657059]: Rollback to puppeteer v1.5.0 - T186748

Mentioned in SAL (#wikimedia-operations) [2018-10-16T14:51:53Z] <mobrovac@deploy1001> Finished deploy [proton/deploy@a657059]: Rollback to puppeteer v1.5.0 - T186748 (duration: 00m 49s)

Change 468354 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Config: Send 100% of PDF traffic to Proton

https://gerrit.wikimedia.org/r/468354

Change 468354 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Config: Send 100% of PDF traffic to Proton

https://gerrit.wikimedia.org/r/468354

Mentioned in SAL (#wikimedia-operations) [2018-10-18T16:31:28Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@6c879fa]: Have 100% of traffic directed to Proton as well - T186748

Mentioned in SAL (#wikimedia-operations) [2018-10-18T16:52:20Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@6c879fa]: Have 100% of traffic directed to Proton as well - T186748 (duration: 20m 52s)

From today, chromium-render receives 100% of pdf-render traffic. It works under

There are 3 remaining bits:

  • update Puppeteer to 1.9.0 and test how the service performs. If it starts to fail (as 1.7.0 did), we will roll back the patch and stick with Puppeteer 1.5.0 for now.
  • update the fonts config. @mobrovac created a new server and found the differences. The ETA for this task is 2 weeks.
  • complete removing the async code and turn it into promises (T204055); the task is currently in review. I think we are in a good state and should be able to close this task in about a week from now.

/cc @ovasileva @phuedx

Jdlrobson renamed this task from New service request: chromium-render/deploy to [EPIC] New service request: chromium-render/deploy. Oct 18 2018, 8:22 PM

This is causing a lot of noise in our blocked by others column and devaluing the purpose of it.
To me at least this seems like an epic with subtasks so I'd like to treat it that way in our workflow. Moving to epic column and breaking out the subtask.

The service is ready, the remaining thing is to increase the CPU count (T197862). I'll talk with services today about this task.

There are a couple more things to do, like

but those are not related to the new service request but to different Epic - Deploy the mediawiki-services-chromium-render service (Proton)

I think that after services decide what to do about increasing the CPU count, we can resolve this epic.

Being bold. As discussed, this epic only tracks the building of the service and not its deployment (see T181084: [EPIC] Deploy the mediawiki-services-chromium-render service (Proton)).

Is this task still not ready to be announced to users?

If ready, how would it be summarized?

@Trizek-WMF the service is ready, but it doesn't handle production traffic yet. We're planning to replace ElectronPDF with chromium-render in January.

So no need to announce anything for now, since it is not going to impact anyone?

This should have no impact on anyone. We're experiencing stability issues with ElectronPDF, which is why we want to replace it with a new service. The switch will be transparent: the PDF endpoint in RESTBase, /page/pdf/{title}, will remain the same. Consumers will be able to use new optional parameters (PDF size, mobile mode), but the current way of generating PDFs will keep working. We're aiming for a smooth transition. After the switch, PDF generation will be more stable and reliable thanks to the new queuing system.

Regarding announcements, I cannot answer your question, but I think @ovasileva or @Jhernandez can help you with that.

OK, I leave this ticket as "not ready to announce" until something that impacts users will happen. When ready, please add one sentence to Tech News!

OK, I leave this ticket as "not ready to announce" until something that impacts users will happen. When ready, please add one sentence to Tech News!

See also T220648: Communications plan for Electron -> Proton switchover