upgrade scandium to buster
Closed, Resolved, Public

Description

Given the current situation with scandium and testreduce1001 (jessie / buster, different kinds of parsoid tests) and the planned upgrades of appservers, we need to figure out if an equivalent of scandium on buster or a special parse* server on buster is needed for testing.

Then we want to decide when parse*/wtp* can be upgraded to buster.

Event Timeline

Change 642070 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: rename wtp1025 to parse1001

https://gerrit.wikimedia.org/r/642070

@jijiki @hnowlan @RLazarus Here's the summary.

There is scandium.eqiad.wmnet which is a Parsoid test server that is like an appserver and on stretch. And then there is testreduce1001.eqiad.wmnet which is a different kind of Parsoid test server which is NOT like an appserver and uses nodejs/npm and is on buster.

This ticket was meant to be strictly about planning how to deal with the test servers and future upgrades and was not meant to replace a separate ticket like "upgrade all prod parsoid appservers to buster" which we should have as well but still need to create I think.

I talked to @ssastry about all this and how it fits into our general plans to upgrade all appservers to buster (T245757) and we came to the agreement that it's best to separate these issues. What "parsoid testing" means is not the same type of testing we want to do to see if and when we can upgrade the prod wtp* / parse* servers to buster.

So we should look at these things separately.

Now the second part of this is that we have an ongoing renaming task which is about turning wtp* servers into parse* servers (T245888) as requested by Effie back in February. That task is currently half way done. All parsoid servers in codfw are called parse* but in eqiad they are still called wtp*. But they are the same thing.

My suggestion to combine all this is:

We take wtp1025 (the lowest number) out of the pool and reimage it as "parse1001" on buster. Then we use it as a canary to test if everything is fine on buster, and once it is we continue to reimage and rename the other wtp* eqiad servers to parse* servers.

That way we achieve both the renaming and OS upgrade and having a temp. test server. As a bonus it will be obvious what is buster and what is still stretch based on the server name until wtp* is retired completely as a cluster name and finally codfw parse* servers are reimaged to buster without a rename.
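To illustrate the naming bonus, here is a minimal sketch of how a script could guess the OS from a hostname while the migration is in progress. The function name `expected_os` is hypothetical, and the reading of the first digit as the site (1 for eqiad) is an assumption inferred from names like wtp1025 and parse1001, not something confirmed in this ticket.

```python
import re


def expected_os(hostname):
    """Guess the OS of a Parsoid host from its name under the proposed plan.

    Returns "stretch", "buster", or None when the name alone does not tell.
    """
    short = hostname.split(".")[0]
    match = re.fullmatch(r"(wtp|parse)(\d)(\d{3})", short)
    if match is None:
        return None  # not a wtp*/parse* host, e.g. scandium
    prefix, site_digit, _ = match.groups()
    if site_digit != "1":
        # Assumption: only eqiad (1xxx) hosts are renamed during the upgrade.
        # codfw parse* hosts are reimaged without a rename, so the name
        # says nothing about their OS.
        return None
    return "buster" if prefix == "parse" else "stretch"
```

For example, `expected_os("wtp1025.eqiad.wmnet")` gives "stretch" and `expected_os("parse1001.eqiad.wmnet")` gives "buster". This only holds until wtp* is retired as a cluster name.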

This also avoids changing anything on existing test servers used by the Parsoid team so they are not interrupted and we don't need to coordinate. Upgrading or retiring those can and should be discussed but can be entirely separate.

This ticket should have a better title, but the point is that it is about scandium and testreduce and not about upgrading the prod parsoid appservers. That should be another ticket, which still needs to be created as we said in our last meeting.

Dzahn renamed this task from equivalent of scandium on buster? to equivalent of scandium on buster? (upgrading parsoid testing servers). Nov 20 2020, 5:11 AM

Change 643081 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: install npm on parsoid buster testserver

https://gerrit.wikimedia.org/r/643081

Change 643082 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] testreduce: move npm inclusion to the correct profile

https://gerrit.wikimedia.org/r/643082

Change 643081 abandoned by Dzahn:
[operations/puppet@production] testreduce: install npm on parsoid buster testserver

Reason:
replaced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/643082

https://gerrit.wikimedia.org/r/643081

Change 643082 abandoned by Dzahn:
[operations/puppet@production] testreduce: install npm on parsoid buster testserver

Reason:
not needed, it's already installed

https://gerrit.wikimedia.org/r/643082

If I am understanding correctly, the parsoid upgrade to buster has no dependencies on this task, correct?

Moving this discussion to T268524, thank you !

If I am understanding correctly, the parsoid upgrade to buster has no dependencies on this task, correct?

Yes, that's correct. The parsoid team can use testreduce1001 for their tests and if we assign parse2001 as the test server for the buster upgrade as you suggest on T268524 then these are unrelated things.

Though the question of whether the parsoid team needs a server that is an appserver running on buster PLUS their test setup is not fully answered yet, and that is what this ticket is about.

Dzahn triaged this task as Medium priority. Nov 23 2020, 11:10 PM

scandium is an appserver that runs Parsoid code that isn't yet deployed to production. We used this server for parsoid tests (that are co-ordinated via node.js code on testreduce1001). So, scandium should also be upgraded to buster, and in fact, it should be upgraded before you upgrade production wtp*/parse* Parsoid servers.

@ssastry What do you prefer, upgrading scandium in place so it keeps the same name or a new host with a new name that is on buster but uses the same puppet setup? The latter has the disadvantage that we need to replace "scandium" in more places but maybe the advantage that you can see what doesn't work before scandium on stretch is gone.

Also, is there any data on scandium that needs to be saved and won't just be recreated by puppet or deployment?

Actually, since scandium is hardware and not virtual, reimaging in place would be a lot easier and the other option would involve having to ask for hardware. So @ssastry Let me know if I can simply reimage anytime and whether any data needs to be saved.

Dzahn renamed this task from equivalent of scandium on buster? (upgrading parsoid testing servers) to upgrade scandium to buster. Mar 12 2021, 9:37 PM

Yes, reimaging in place works. I'll take a look at it later to see if there is any data there that needs saving.

Change 673592 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: switch scandium to use buster installer

https://gerrit.wikimedia.org/r/673592

Change 673592 merged by Dzahn:
[operations/puppet@production] DHCP: switch scandium to use buster installer

https://gerrit.wikimedia.org/r/673592

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

scandium.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202103192015_dzahn_31954_scandium_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-operations) [2021-03-19T21:11:02Z] <mutante> scandium - stop apache and rerun puppet which fails after reimaging because it tries to run an nginx on port 80 which is already used by apache T268248
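The port conflict described in that log entry is the kind of thing that can be checked before rerunning puppet. A minimal sketch, assuming a plain TCP connect test is good enough; the helper name `port_in_use` is hypothetical and is not part of any reimage tooling mentioned here:

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(80):
        print("port 80 is already taken; a second web server cannot bind to it")
    else:
        print("port 80 is free")
```

This only shows that a listener exists, not which service owns it.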

Completed auto-reimage of hosts:

['scandium.eqiad.wmnet']

and were ALL successful.

scandium is back and on buster, puppet run now shows no more errors