Upgrade Parsoid servers to buster
Closed, Resolved (Public)

Description

Since parse2001 is a canary server and has no traffic, I suggest we use it to test at least the buster installation and check that everything is in order. Given the work we are already doing to upgrade the appservers/API MediaWiki servers, we do not expect many surprises here.

ParsoidJS is not in use anymore, so this is a chance to remove it from the servers as well.

Q2

  • try to have finished reimaging parse2001 to buster
  • run some tests

Q3

  • reimage scandium, testing setup
  • remove parsoidJS from production
  • reimage codfw
  • reimage eqiad

Event Timeline

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

parse2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202011302038_dzahn_11921_parse2001_codfw_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2020-11-30T20:39:32Z] <mutante> reimaging parse2001 (parsoid canary) with buster (T268524)

Completed auto-reimage of hosts:

['parse2001.codfw.wmnet']

and were ALL successful.

@jijiki parse2001 is now on buster. I noticed no puppet errors or warnings. It went surprisingly smoothly. Icinga is also looking good. It is currently back in "pooled=no" (from inactive).

I also scap pulled.

That sounds lovely! Are there any sanity tests we could possibly do? Maybe @ssastry can give us an idea?

So I used httpbb to test it and we still have this issue, which looks like firewalling..to my surprise.

[deploy1001:~] $ httpbb --hosts parse2001.codfw.wmnet /srv/deployment/httpbb-tests/appserver/baseurls.yaml
Sending to parse2001.codfw.wmnet...
https://de.wikipedia.org/wiki/Wikipedia:Hauptseite (/srv/deployment/httpbb-tests/appserver/baseurls.yaml:25)
 ERROR: HTTPSConnectionPool(host='parse2001.codfw.wmnet', port=443): Read timed out. (read timeout=10)
https://meta.wikimedia.org/wiki/List_of_Wikipedias (/srv/deployment/httpbb-tests/appserver/baseurls.yaml:98)
 ERROR: HTTPSConnectionPool(host='parse2001.codfw.wmnet', port=443): Read timed out. (read timeout=10)
https://fr.wikipedia.org/wiki/K.A.Z (/srv/deployment/httpbb-tests/appserver/baseurls.yaml:31)
 ERROR: HTTPSConnectionPool(host='parse2001.codfw.wmnet', port=443): Read timed out. (read timeout=10)
===
ERRORS: 19 requests attempted to parse2001.codfw.wmnet. Errors connecting to 1 host.

[deploy1001:~] $ httpbb --hosts parse2002.codfw.wmnet /srv/deployment/httpbb-tests/appserver/baseurls.yaml
Sending to parse2002.codfw.wmnet...
PASS: 19 requests sent to parse2002.codfw.wmnet. All assertions passed.

Mentioned in SAL (#wikimedia-operations) [2020-11-30T22:14:34Z] <mutante> parse2001 - systemctl restart ferm - had to restart ferm after reimaging (though there weren't any alerts about that) but it fixed running httpbb tests on it (T268524)

@jijiki @RLazarus I had to manually restart ferm (meh! no alerts about that and should not happen) but that made the httpbb tests work now:

[deploy1001:~] $ httpbb --hosts parse2001.codfw.wmnet /srv/deployment/httpbb-tests/appserver/baseurls.yaml 
Sending to parse2001.codfw.wmnet...
PASS: 19 requests sent to parse2001.codfw.wmnet. All assertions passed.
...
[deploy1001:~] $ httpbb --hosts parse2001.codfw.wmnet /srv/deployment/httpbb-tests/appserver/test_main.yaml 
Sending to parse2001.codfw.wmnet...
PASS: 23 requests sent to parse2001.codfw.wmnet. All assertions passed.
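Running both suites against a freshly reimaged host could be wrapped in a small helper. The following is only a sketch: the function name run_httpbb_suites and the HTTPBB override variable are hypothetical, and it assumes that httpbb exits non-zero when a request errors or an assertion fails, which this thread does not confirm. Only the `httpbb --hosts HOST FILE` invocation is taken from the commands above.

```shell
#!/bin/bash
# Sketch: run several httpbb test files against one host and report failures.
# run_httpbb_suites and HTTPBB are hypothetical names, not part of httpbb.
# Assumption: httpbb returns a non-zero exit status on errors or failed assertions.

run_httpbb_suites() {
    local host="$1"
    shift
    local httpbb="${HTTPBB:-httpbb}"
    local suite
    local failed=0
    for suite in "$@"; do
        # Same invocation as in the manual runs above.
        if ! "$httpbb" --hosts "$host" "$suite"; then
            echo "FAILED: ${suite} on ${host}" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

It would be called as: run_httpbb_suites parse2001.codfw.wmnet /srv/deployment/httpbb-tests/appserver/baseurls.yaml /srv/deployment/httpbb-tests/appserver/test_main.yaml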

@Dzahn is there something we need to fix on puppet ?

@jijiki Not sure, I think in the past the "official" instructions included rebooting once and that would also fix it.

I checked the journalctl of ferm on parse2001 and there are no errors logged, so it's not a case of Ferm failing to start (via syntax errors, failing DNS lookup or whatever). But we should get to the bottom of this, let's maybe reimage also parse2002 to show whether it's a reproducible error and if it happens again, keep it in the broken state to debug further?

let's maybe reimage also parse2002 to show whether it's a reproducible error and if it happens again, keep it in the broken state to debug further?

I can just repeat the reimaging process on 2001 as well. But then if it happens again leave it in the broken state on purpose and just downtime everything. But no strong preference.

Sounds good!

Alright, will do that and start in a couple minutes.

Ack, if it ends up broken again, I'll poke at it tomorrow.

Mentioned in SAL (#wikimedia-operations) [2020-12-03T19:11:12Z] <mutante> depooling parse2001 and repeating auto-reimage to see if ferm issue is repeatable (T268524)

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

parse2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202012031920_dzahn_27732_parse2001_codfw_wmnet.log.

Completed auto-reimage of hosts:

['parse2001.codfw.wmnet']

and were ALL successful.

What can I say.. it did not happen this time. Reimaged, ran puppet.. waited a bit, checked Icinga.. all green with the following exceptions:

  • Ensure local MW versions match expected deployment (fixed after running a scap pull because earlier there was a deployment)
  • mediawiki-installation DSH group (will be fixed by pooled=no from pooled=inactive but takes a while)
  • PHP opcache health (not sure about the "below 99.85%" needing any action)

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=parse2001&scroll=242

But either way, no ferm restart involved and things looking good.

Sounds good, I'd say we keep it as is, and if we ever run into this error again, keep it broken and ping me or John to investigate.

Mentioned in SAL (#wikimedia-operations) [2020-12-09T23:21:54Z] <mutante> repooling parse2001 after buster reimage - T268524

@ssastry is there some way we could check that parse2001, which is running on buster now, works as expected?

Run a few curl commands like these but while using parse2001 as a proxy. Here is the equivalent for scandium itself:

curl -L -x http://scandium.eqiad.wmnet:80 http://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/Hospet

I tried with parse2001 / parse2002 but that failed to resolve:

ssastry@scandium:~$ curl -L -x http://parse2002.eqiad.wmnet:80 http://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/Hospet
curl: (5) Could not resolve proxy: parse2002.eqiad.wmnet
ssastry@scandium:~$ curl -L -x http://parse2001.eqiad.wmnet:80 http://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/Hospet
curl: (5) Could not resolve proxy: parse2001.eqiad.wmnet

Not sure if I got the server names wrong, but that is what I would try.

For bonus points, verify that the HTML generated on parse2001 and parse2002 is identical. And try on a couple of different wikis (enwiki plus other language wikis). Can be easily scripted.

Ah, because parse2001 and parse2002 are codfw, not eqiad. Anyway, here goes:

ssastry@scandium:~$ curl -L -x http://parse2001.codfw.wmnet:80 http://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/Hospet > /tmp/p1.html
...
ssastry@scandium:~$ curl -L -x http://parse2002.codfw.wmnet:80 http://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/page/html/Hospet > /tmp/p2.html
...
ssastry@scandium:~$ diff /tmp/p1.html /tmp/p2.html
ssastry@scandium:~$
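The comparison above could be scripted roughly as follows. This is a minimal sketch rather than a tested tool: the function names fetch_html and compare_hosts and the FETCH override are hypothetical, and only the curl proxy invocation and the rest.php URL shape come from the commands in this thread. The -s and -f flags are added so that a failed fetch is reported as an error instead of comparing two empty outputs.

```shell
#!/bin/bash
# Sketch: fetch the same Parsoid HTML through two hosts and compare the results.
# fetch_html, compare_hosts and FETCH are hypothetical names used for illustration.

# Fetch Parsoid HTML for a page, using the given host as an HTTP proxy
# (same curl invocation as in the manual test, plus -s and -f).
fetch_html() {
    local host="$1" wiki="$2" page="$3"
    curl -s -f -L -x "http://${host}:80" \
        "http://${wiki}/w/rest.php/${wiki}/v3/page/html/${page}"
}

# Compare the output of two hosts for one wiki/page pair.
# Returns 0 if identical, 1 if different, 2 if a fetch failed.
compare_hosts() {
    local host_a="$1" host_b="$2" wiki="$3" page="$4"
    local fetch="${FETCH:-fetch_html}"
    local html_a html_b
    html_a=$("$fetch" "$host_a" "$wiki" "$page") || {
        echo "ERROR: fetch via ${host_a} failed" >&2
        return 2
    }
    html_b=$("$fetch" "$host_b" "$wiki" "$page") || {
        echo "ERROR: fetch via ${host_b} failed" >&2
        return 2
    }
    if [ "$html_a" = "$html_b" ]; then
        echo "SAME: ${wiki} ${page}"
    else
        echo "DIFF: ${wiki} ${page}"
        return 1
    fi
}
```

For example: compare_hosts parse2001.codfw.wmnet parse2002.codfw.wmnet en.wikipedia.org Hospet. Passing a page together with a revision ID (for instance Hospet/992733907) should make the comparison less sensitive to edits that land between the two requests.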

Change 648383 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] httpbb: add tests for parsoid servers

https://gerrit.wikimedia.org/r/648383

@jijiki @ssastry I wrote the following httpbb tests based on that:

# tests for parsoid appservers
# hosts: parse*, wtp*
http://en.wikipedia.org:
- path: /w/rest.php/en.wikipedia.org/v3/page/html/Hospet
  assert_status: 302 
- path: /w/rest.php/en.wikipedia.org/v3/page/html/Hospet/992733907
  assert_status: 200 
  assert_body_contains: data-parsoid
http://de.wikipedia.org:
- path: /w/rest.php/de.wikipedia.org/v3/page/html/Karnataka
  assert_status: 302 
- path: /w/rest.php/de.wikipedia.org/v3/page/html/Karnataka/206238030
  assert_status: 200 
  assert_body_contains: data-parsoid
http://es.wikipedia.org:
- path: /w/rest.php/es.wikipedia.org/v3/page/html/Bangalore
  assert_status: 302 
- path: /w/rest.php/es.wikipedia.org/v3/page/html/Bangalore/129309635
  assert_status: 200 
  assert_body_contains: data-parsoid
http://it.wikipedia.org:
- path: /w/rest.php/it.wikipedia.org/v3/page/html/Mysore
  assert_status: 302 
- path: /w/rest.php/it.wikipedia.org/v3/page/html/Mysore/112864552
  assert_status: 200 
  assert_body_contains: data-parsoid

So as suggested the "Hospet" page on en.wiki and then some others on de/es/it. The first URL is always a 302 and the second one a 200 where we check if "data-parsoid" is in the body.

They all pass on parse2001.

[deploy1001:~] $ httpbb --hosts parse2001.codfw.wmnet /home/dzahn/test_parse.yaml 
Sending to parse2001.codfw.wmnet...
PASS: 8 requests sent to parse2001.codfw.wmnet. All assertions passed.

If that looks good we can merge the patch that will install it on cumin and deployment servers for all to use.
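Since every wiki in the test file follows the same two-request pattern (a 302 for the bare title, then a 200 containing "data-parsoid" for the pinned revision), further entries could be generated rather than typed by hand. The helper below is a sketch under that assumption; parsoid_test_stanza is a hypothetical name and is not part of httpbb or of the merged patch.

```shell
#!/bin/bash
# Sketch: print one httpbb test stanza for a wiki, page title and revision ID,
# following the pattern used in the test file above.
# parsoid_test_stanza is a hypothetical helper name.

parsoid_test_stanza() {
    local wiki="$1" title="$2" revid="$3"
    local base="/w/rest.php/${wiki}/v3/page/html/${title}"
    cat <<EOF
http://${wiki}:
- path: ${base}
  assert_status: 302
- path: ${base}/${revid}
  assert_status: 200
  assert_body_contains: data-parsoid
EOF
}
```

For example, parsoid_test_stanza en.wikipedia.org Hospet 992733907 prints the first stanza of the file above.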

Change 648383 merged by Dzahn:
[operations/puppet@production] httpbb: add tests for parsoid servers

https://gerrit.wikimedia.org/r/648383

@ssastry do we still need ParsoidJS running on the parsoid servers? This is a good opportunity to clean this up. I ran into the issue described in T245757#6953720 when I tried to reimage one of them.

Change 676068 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] profile::parsoid: remove parsoid class from parsoid profile

https://gerrit.wikimedia.org/r/676068

Change 676071 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] modules: remove parsoidJS puppet module

https://gerrit.wikimedia.org/r/676071

Mentioned in SAL (#wikimedia-operations) [2021-04-01T19:30:03Z] <mutante> depooled parse2001 because on train deployment it caused "MWException: No localisation cache found for English" and then "HTTP CRITICAL: HTTP/1.1 500 Internal Server Error" (T268524)

Mentioned in SAL (#wikimedia-operations) [2021-04-01T19:37:51Z] <mutante> pooled parse2001 again after twentyaftefour rebuilt the l10n cache for wmf.37 which fixed it and made Apache alert recover (T268524)

Change 676068 abandoned by Effie Mouzeli:

[operations/puppet@production] profile::parsoid: remove parsoid class from parsoid profile

Reason:

in favour of Iab525436f3ae10327a2ac9c31f91009ae451611c

https://gerrit.wikimedia.org/r/676068

For this also see T273334

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['wtp1025.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104081545_jiji_14346.log.

jijiki updated the task description.

Completed auto-reimage of hosts:

['wtp1025.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['parse2002.codfw.wmnet', 'parse2003.codfw.wmnet', 'parse2004.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104130735_jiji_22041.log.

Completed auto-reimage of hosts:

['parse2002.codfw.wmnet', 'parse2003.codfw.wmnet', 'parse2004.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['parse2005.codfw.wmnet', 'parse2006.codfw.wmnet', 'parse2007.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104130850_jiji_8219.log.

Completed auto-reimage of hosts:

['parse2005.codfw.wmnet', 'parse2006.codfw.wmnet', 'parse2007.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['parse2008.codfw.wmnet', 'parse2009.codfw.wmnet', 'parse2010.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104131108_jiji_17443.log.

Completed auto-reimage of hosts:

['parse2008.codfw.wmnet', 'parse2009.codfw.wmnet', 'parse2010.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['parse2011.codfw.wmnet', 'parse2012.codfw.wmnet', 'parse2013.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104131336_jiji_29644.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['wtp1026.eqiad.wmnet', 'wtp1027.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104131338_jiji_31386.log.

Completed auto-reimage of hosts:

['parse2011.codfw.wmnet', 'parse2012.codfw.wmnet', 'parse2013.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp1026.eqiad.wmnet', 'wtp1027.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['wtp1028.eqiad.wmnet', 'wtp1029.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104131534_jiji_15418.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['parse2014.codfw.wmnet', 'parse2015.codfw.wmnet', 'parse2016.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104131539_jiji_21218.log.

Completed auto-reimage of hosts:

['parse2014.codfw.wmnet', 'parse2015.codfw.wmnet', 'parse2016.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp1028.eqiad.wmnet', 'wtp1029.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['parse2017.codfw.wmnet', 'parse2018.codfw.wmnet', 'parse2019.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104131728_jiji_26705.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['wtp1030.eqiad.wmnet', 'wtp1031.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104131729_jiji_26775.log.

Completed auto-reimage of hosts:

['parse2017.codfw.wmnet', 'parse2018.codfw.wmnet', 'parse2019.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp1030.eqiad.wmnet', 'wtp1031.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['parse2020.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104131856_jiji_16703.log.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['wtp1032.eqiad.wmnet', 'wtp1033.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104131857_jiji_16778.log.

Completed auto-reimage of hosts:

['parse2020.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['wtp1032.eqiad.wmnet', 'wtp1033.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['wtp1034.eqiad.wmnet', 'wtp1035.eqiad.wmnet', 'wtp1036.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104141031_jiji_25538.log.

Completed auto-reimage of hosts:

['wtp1034.eqiad.wmnet', 'wtp1035.eqiad.wmnet', 'wtp1036.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['wtp1037.eqiad.wmnet', 'wtp1038.eqiad.wmnet', 'wtp1039.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104141202_jiji_13531.log.

Completed auto-reimage of hosts:

['wtp1038.eqiad.wmnet', 'wtp1037.eqiad.wmnet', 'wtp1039.eqiad.wmnet']

and were ALL successful.

After the mw train we got some Icinga alerts which caused worry among deployers:

20:15 <+icinga-wm> PROBLEM - Ensure local MW versions match expected deployment on wtp1037 is CRITICAL: CRITICAL: 524 mismatched wikiversions 
                   https://wikitech.wikimedia.org/wiki/Application_servers
20:15 <+icinga-wm> PROBLEM - Ensure local MW versions match expected deployment on wtp1038 is CRITICAL: CRITICAL: 524 mismatched wikiversions 
                   https://wikitech.wikimedia.org/wiki/Application_servers
20:17 <+icinga-wm> PROBLEM - Ensure local MW versions match expected deployment on wtp1039 is CRITICAL: CRITICAL: 524 mismatched wikiversions 
                   https://wikitech.wikimedia.org/wiki/Application_servers

That happens when they are in pooled=inactive during reimage, which means they are not in the scap "dsh" groups and do not get deployments.

It's not a real issue as long as we remember to run scap pull before repooling, but the alerts are making people worry something is broken (and then when they try to ssh to the affected servers they see a key change due to the reimages).

I checked they are depooled and manually ran scap pull, which made the alerts recover.

I think the real issue is that this alert should not be there for hosts that are not pooled, but that may not be so easy to do.

Or it's just about expired or failed downtimes from the reimaging cookbook.

edit: yes, it was just about expired downtimes. Also ACKed 'not in mediawiki-installation dsh group' and set downtimes on wtp103[7-9] for 24 hours.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['wtp1040.eqiad.wmnet', 'wtp1041.eqiad.wmnet', 'wtp1042.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104151100_jiji_16692.log.

Completed auto-reimage of hosts:

['wtp1040.eqiad.wmnet', 'wtp1041.eqiad.wmnet', 'wtp1042.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['wtp1043.eqiad.wmnet', 'wtp1044.eqiad.wmnet', 'wtp1045.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104151224_jiji_6464.log.

Completed auto-reimage of hosts:

['wtp1043.eqiad.wmnet', 'wtp1044.eqiad.wmnet', 'wtp1045.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts:

['wtp1046.eqiad.wmnet', 'wtp1047.eqiad.wmnet', 'wtp1048.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104151402_jiji_27358.log.

Completed auto-reimage of hosts:

['wtp1046.eqiad.wmnet', 'wtp1047.eqiad.wmnet', 'wtp1048.eqiad.wmnet']

and were ALL successful.

jijiki claimed this task.
jijiki updated the task description.