We should do this soon after the release of MediaWiki 1.31 in June 2018.
| Status | Assigned | Task |
| Open | None | T91508 [Epic] overhaul fundraising cluster monitoring |
| Open | None | T185134 Prometheus 2 breaking change |
| Open | None | T185013 EPIC: migrate fundraising off of Debian Jessie |
| Resolved | Jgreen | T202290 test and deploy payments on Debian Stretch/PHP7.0 |
| Resolved | Ejegg | T184460 Upgrade PaymentsWiki to Mediawiki 1.31 (new LTS) |
| Resolved | Ejegg | T218661 Adjust codfw payments servers to log to frlog1001 as well as bellatrix for live mw upgrade test |
| Resolved | Jgreen | T218667 upgrade payments2001 & 2002 to stretch |
| Resolved | Jgreen | T218669 Reconfigure mariadb for codfw payments cluster |
Unlike 1004, which is working properly, Apache on 2003 is returning "(52) Empty reply from server".
The last thing in syslog is:
Feb 13 16:35:24 payments2003 SmashPig: | Entering logging context 'amazon'. | |
Feb 13 16:35:24 payments2003 amazon_gateway: Constructing! Creating a new adapter of type: [Amazon]
Feb 13 16:35:24 payments2003 amazon_gateway: xxx setUtmSource: Payment method is , recurring = NULL, utm_source =
Feb 13 16:35:24 payments2003 amazon_gateway: xxx setCountry: Country not set.
I guess PHP exits silently. Perhaps there is a config difference, or divergent behavior due to the source IP.
OK, I finally tracked it down. Here was the useful clue:
payments1004 amazon_gateway: 65738357:65738357-1 setCountry: GeoIP lookup function found nothing for 127.0.0.1! No country available.
payments2003 amazon_gateway: 65738276:65738276-1 setCountry: Country not set.
So the issue turned out to be the lack of a stale-yet-available GeoIP.dat on payments2003.
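A missing data file like this is easy to miss when PHP exits without logging anything, so a pre-flight check on each host might help catch it before a cutover. The following is only a rough sketch: the function name `geoip_db_problem` is made up for illustration, and the path in the usage comment is an assumption, not the location the payments hosts necessarily use.

```python
import os


def geoip_db_problem(path):
    """Describe what is wrong with the GeoIP database file at `path`.

    Returns None when the file exists, is a regular file, is readable,
    and is not empty. This only checks the file on disk; it does not
    validate the database contents or how fresh they are.
    """
    if not os.path.exists(path):
        return f"{path} does not exist"
    if not os.path.isfile(path):
        return f"{path} is not a regular file"
    if not os.access(path, os.R_OK):
        return f"{path} is not readable by this user"
    if os.path.getsize(path) == 0:
        return f"{path} is empty"
    return None


# Usage (the path is an assumption; substitute the host's real location):
# problem = geoip_db_problem("/usr/share/GeoIP/GeoIP.dat")
# if problem:
#     print("GeoIP check failed:", problem)
```

Run against both payments1004 and payments2003, a check along these lines would have flagged the difference between the hosts directly, without having to compare the setCountry log lines.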
I'd go down the list here https://en.wikipedia.org/wiki/Usage_share_of_web_browsers and get to a point where we're confident we won't flood Donor Services with trouble reports upon launch.
@DStrine and @Eileenmcnaughton can we recruit you to try a couple of the frdev links at the bottom of this etherpad and note how the donation attempt goes in desktop Safari? (with the frdev URL you don't need any ssh tunnel to get to the upgraded version)
I've got access to a Windows box to test IE and Edge. Looks like UC browser is mostly CN and not a big share on Wikipedia. Anyone have strong opinions about testing in Samsung browser on Android?
@DStrine wondered if it would be possible to roll the CSP header before the PHP upgrade. According to https://www.mediawiki.org/wiki/Compatibility we could upgrade PHP before MW. However I think there may be problems with ResourceLoader image inlining.
@Ejegg I noticed T203704 is resolved, even though we aren't sending the header. Was that intentional?
@mepps I think it would be more usefully discussed on a task where it can be seen by everyone, because my essential worry is there haven't been enough eyes on this change, and it is a likely source of unintended consequences.
How the CSP header is handled depends on how the client implements it. The one we are planning to send is complicated, meant to whitelist every external resource we currently use. We have only tested the devices the tech team has access to. A QA person would not consider that sufficient.
Then there is moving to PHP7 (there are a few seemingly unaddressed issues here: https://etherpad.wikimedia.org/p/PaymentsPhp7Test) and MW 1.31 at the same time, both possible causes of donor-facing regressions. All that put together seems like a risky deploy that would be wise to break into pieces, or at least to back with a more formal test plan and deploy checklist.
@DStrine suggested sending the CSP header first. That is easy from an ops perspective, but I think it means changing the current MW version to not inline some images. @Ejegg is that accurate/difficult?
I'd really like to do it all at once. Over the last few work days I've gone through the testing for IE and Edge on the combined MW 1.31 and PHP7 update. All of the issues I found were either related to the changed hostname (payments.frdev versus just payments), the different cluster IP addresses, or bad server configuration (the session issue). We've sorted all of those out, and the fully updated site seems to work great on all platforms.
I'm hearing that you are worried:
- We have not tested on enough devices. Are there devices you still feel are missing that we can test on?
- That we do not have a formal testing checklist. Would you like to work with @Ejegg to add to the Test plan you linked above? It sounds like you can offer some thoughts from the operations perspective that we might not have considered.
- About the CSP header. I'm not totally clear how sending the CSP header first fixes the issue you raised. @Ejegg what are your thoughts on this?
It does not address the concerns I raised or answer the question I asked.
Are there devices you still feel are missing that we can test on?
The three of us don't have physical access to the vast majority of devices. Someone with more frontend focus than me would know the options for this type of integration test.
Would you like to work with @Ejegg to add to the Test plan you linked above?
That was a previous test and I was asking for information about the unaddressed problems it raised.
@Jgreen and I have been working on a plan to deploy to codfw so that we can have a rollback strategy.
I'm not totally clear how sending the CSP header first fixes the issue you raised.
Any way to break a monolithic change into smaller ones reduces the chance of panic when you don't know which part of the change broke the site.
Thanks @cwdent for taking the time to answer each question. It sounds like you're very concerned about deploying this change, and I appreciate that given the importance of the affected systems.
@Ejegg Were the issues still open on the test plan resolved? I'm not clear on whether "Working now" means that the errors previously listed were resolved and I notice that Amazon does not say that.
@DStrine could we ask anyone from advancement to help with testing? I feel like we're running up against the limits of not having QA. I reached out to someone in Audiences who says she may be able to share her team's current list with me, but if you know anyone else @cwdent feel free to ask as well.
I missed the comment directed to me last week, sorry. There are a ton of links on that etherpad.
Can we get one or two links to test at a time? Are there any particular instructions we need to give? Will testers need special access?
I can send out an email with all this and hand enter bugs myself to help.
Thanks for all the detail there @Jgreen! I worry about one thing with option 2 (and with the CODFW cutover option in general) - the payments-wiki logs we use to fill in details of txns from audits. When I test on payments2003, those logs never make it to the archive mounted on civi1001. What can we do to get those logs available to the audit parser?
@DStrine so from line 45 down, there is 1 link for each payment processor. I've been going all the way through the donation attempt for everything except AstroPay and old-style PayPal. For those, payments-wiki doesn't do any processing on the donor return, so as long as the redirect to the processor works we should be fine. If you want to test a couple of them in desktop Safari, please go ahead. It would be great if you could open the dev tools and note any errors that show up in the console. If the payment attempt goes through without a hitch, please note that too, under the other results for that link.
When the Adyen iframe opens, that DOES trigger some warnings, but those are due to Adyen's server configuration, which has its content security policy set to 'report-only', i.e. don't break things. Those same warnings happen no matter what version of payments-wiki opens the iframe.
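To make the 'report-only' distinction concrete, here is a minimal sketch of how a whitelist-style policy maps onto the two standard header names. The helper `build_csp_header` and the host `https://processor.example` are hypothetical and used only for illustration; the actual policy planned for payments-wiki is not reproduced here.

```python
def build_csp_header(sources_by_directive, report_only=False):
    """Build a (header name, header value) pair from a whitelist.

    `sources_by_directive` maps a CSP directive such as "img-src" to the
    list of sources allowed for it. With report_only=True the browser
    reports violations but does not block anything.
    """
    name = (
        "Content-Security-Policy-Report-Only"
        if report_only
        else "Content-Security-Policy"
    )
    # Each directive becomes "directive source1 source2", joined by "; ".
    parts = [
        directive + " " + " ".join(sources)
        for directive, sources in sources_by_directive.items()
    ]
    return name, "; ".join(parts)


# Hypothetical whitelist; the real one would list every external resource in use.
policy = {
    "default-src": ["'self'"],
    "frame-src": ["'self'", "https://processor.example"],
}
print(build_csp_header(policy, report_only=True))
```

Sending the same policy in report-only mode first is one way to collect violations from real browsers without breaking donations, which is what Adyen's configuration does.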