eqiad rack/setup 11 new DB servers
Closed, Resolved, Public

Description

@Cmjohnson let us know that he's received all 11 servers: T158580#3156239
This is a first suggestion of how we can maybe rack them:

Latest status:

hostname | rack | state
db1096 | a6 | temporarily used as dbstore_multiinstance test host (stretch)
db1097 | d1 | provisioned and serving s4
db1098 | b5 | ready for provisioning, default puppet role
db1099 | b2 | ready for provisioning, default puppet role
db1100 | c2 | ready for provisioning, default puppet role
db1101 | c2 (you can use db1057 slot, as db1057 can go away: T162135) | ready for provisioning, default puppet role
db1102 | d2 | temporarily used as sanitarium3 - T169510
db1103 | a3 | ready for provisioning, default puppet role
db1104 | b3 | ready for provisioning, default puppet role
db1105 | c3 | ready for provisioning, default puppet role
db1106 | d3 | ready for provisioning, default puppet role

Keep in mind that we are planning to free up space once we decommission all the hosts older than db1050 (around 30 servers).
@Cmjohnson let us know if this makes sense from your side or you see something not doable.

Thank you

Event Timeline

Marostegui added projects: DBA, ops-eqiad.
Marostegui updated the task description.
Marostegui added a subscriber: jcrespo.

Change 346580 had a related patch set uploaded (by Jcrespo):
[operations/puppet@production] Indicate install recipes for newest db1* and db2* DB servers

https://gerrit.wikimedia.org/r/346580

The guidance is the same as T162159#3157547 (documented for databases on https://wikitech.wikimedia.org/wiki/Raid_and_MegaCli#Raid_setup_at_Wikimedia ).

In the above patch I have added the right recipe to use for new (first install) servers, which is now different from the one for existing servers.

I hope you enjoy the vendor a bit more :-)

Change 346580 merged by Jcrespo:
[operations/puppet@production] Indicate install recipes for newest db1* and db2* DB servers

https://gerrit.wikimedia.org/r/346580

@Marostegui

db1096 a1 (No available U space; pick another location)
db1097 d1 (No issues)
db1098 a2 (Will definitely need a decom server. There is space, but no available power until we remove something)
db1099 b2 (Will need to remove decom servers to make this happen. No available U space or power.)
db1100 c2 (No issue)
db1101 c2 (you can use db1057 slot, as db1057 can go away: T162135) (No issue)
db1102 d2 (No issue)
db1103 a3 (There should be one available power outlet for this server)
db1104 b3 (Should be okay, 1 space is available.)
db1105 c3 (No issue)
db1106 d3 (No issue)

Thanks @Cmjohnson, what about these changes:

db1096 - a6
db1098 - b5
db1099 - d3

I need to check b5, it's a 24pt switch not 48. I believe there is 1 more available 1G port.

If not, we can try d4.
Let me know what works and I can edit the original task, to leave the final positions, so it is easier to read in the future.

btw @Cmjohnson db1042 can be decommissioned (it is on b2, so maybe db1099 can take its place?): https://phabricator.wikimedia.org/T149793

db1019 can also go away (T147309) and it is on b1.

Just mentioning it here to make sure you are aware, in case it is easier for you to completely decommission those two servers and free up space on b2 and b1. If not, let me know if what I wrote on T162233#3171602 works for you and I will update the task with the final positions.

Great! I will decom those 2 servers and utilize their space.

@Marostegui Final Placement

hostname rack
db1096 a6
db1097 d1
db1098 b5
db1099 b2
db1100 c2
db1101 c2 (you can use db1057 slot, as db1057 can go away: T162135)
db1102 d2
db1103 a3
db1104 b3
db1105 c3
db1106 d3
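
As a quick sanity check on the final placement, the number of hosts per rack row can be tallied. This is only a minimal sketch: the `PLACEMENT` mapping is copied from the table above, the helper name `hosts_per_row` is hypothetical, and it assumes the row is the leading letter of the rack name.

```python
from collections import Counter

# Final placement copied from the table above (hostname -> rack).
PLACEMENT = {
    "db1096": "a6",
    "db1097": "d1",
    "db1098": "b5",
    "db1099": "b2",
    "db1100": "c2",
    "db1101": "c2",
    "db1102": "d2",
    "db1103": "a3",
    "db1104": "b3",
    "db1105": "c3",
    "db1106": "d3",
}


def hosts_per_row(placement):
    """Count hosts per row, assuming the row is the rack's leading letter."""
    return Counter(rack[0] for rack in placement.values())


if __name__ == "__main__":
    for row, count in sorted(hosts_per_row(PLACEMENT).items()):
        print(f"row {row}: {count} host(s)")
```

With this placement the 11 hosts are spread over all four rows, with no row holding more than three of them.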

Great!! Original task edited with those changes
Thank you!

Change 348467 had a related patch set uploaded (by Cmjohnson):
[operations/dns@master] adding mgmt dns entries for db1096-1106 T162233

https://gerrit.wikimedia.org/r/348467

Change 348467 merged by Cmjohnson:
[operations/dns@master] adding mgmt dns entries for db1096-1106 T162233

https://gerrit.wikimedia.org/r/348467

Change 348750 had a related patch set uploaded (by Cmjohnson):
[operations/dns@master] Adding production dns entries for db servers T162233

https://gerrit.wikimedia.org/r/348750

Change 348750 merged by Cmjohnson:
[operations/dns@master] Adding production dns entries for db servers T162233

https://gerrit.wikimedia.org/r/348750

Change 348755 had a related patch set uploaded (by Cmjohnson):
[operations/puppet@production] adding mac address for new db's less db1098--not connecting will add this later T162233

https://gerrit.wikimedia.org/r/348755

Change 348755 merged by Cmjohnson:
[operations/puppet@production] adding mac address for new db's less db1098--not connecting will add this later T162233

https://gerrit.wikimedia.org/r/348755

10 of the 11 servers that arrived are racked, with switch configured, RAID completed, iDRAC set up, and DNS entries for both mgmt and production. They are ready for installs. Could @jcrespo or @Marostegui take over from here?

Still needed: racktables input.

@jcrespo and @Marostegui db1106 is racked; idrac/bios setup and switch cfg are done, and the dhcpd file is configured. Ready for install.

Hi Chris,

We will take it from here yes.
Thanks for getting all this sorted for us!

Hey @Cmjohnson
I have tried to install 3 servers just to make sure they work fine and we didn't miss anything, and also to make sure we have at least 3 available for the switch back to eqiad next week in case we need them.
I chose random servers:

db1099 -> went fine
db1100 -> after pxebooting I get a black screen and nothing happens. I have left it as it is to see if you can see something on the screen with a crash cart.
db1106 -> there is no RAID configured or at least no virtual disks are shown. I have tried to look for it on the BIOS/RAID controller menu but the latency kills it and I cannot see it.

The following servers were installed already (I guess they went into the installation as soon as you turned them on) as they had the puppet cert waiting to be signed:
db1097
db1101
db1102
db1103

I have signed it for those, and completed the installation and rebooted them.
We will take care of the rest of the servers after the DC switchover.

So I did another round:

db1096 -> installed

The ones that we would need @Cmjohnson to check (no need to be done this week!)

db1098 -> after attempting pxe boot: black screen
db1100 -> after pxebooting I get a black screen and nothing happens. I have left it as it is to see if you can see something on the screen with a crash cart.
db1104 -> after attempting pxe boot: black screen
db1105 -> after attempting pxe boot: black screen
db1106 -> there is no RAID configured or at least no virtual disks are shown. I have tried to look for it on the BIOS/RAID controller menu but the latency kills it and I cannot see it.

I believe all the ones that are not booting up are not even entering PXE, because they do not get an IP (or at least I cannot ping them).

@Cmjohnson yes, check the original task description so you can see that there are a few servers that cannot be installed :-(

I just checked db1098 for instance and it still has the same issue indeed so I assume the other ones will remain with the same issue.

Marostegui, quick question: do you know what the state of this is? Are those other servers still not coming up? Do you want me to have a second quick look in case it is a DNS/IP problem?

Last time I checked the problem remained the same as originally posted on the original task description :(
I wasn't able to see anything after PXE boot, as it goes completely blank.

Feel free to take a second look to see if you find something else!

I have checked puppet, and I do not see any error in the puppet configuration (IP, MAC of the new hosts). @ayounsi do you have time to help us check the network configuration? It is the only thing that I can see being wrong, other than the servers themselves having issues.

db1098, for example, should have IP 10.64.16.83 and mac 18:66:DA:F8:D5:E0 according to the server and puppet configuration, but PXE doesn't move forward with the installation.

Link Status                                           <Disconnected>

This could be a physical issue or a network configuration issue, could you help us check?

Oh, I think I have it: 18:66:DA:F8:D5:E1 says connected. I think we used the wrong port to configure the server. This may still need a network check, but most likely we configured the wrong network port/card.
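
The two addresses above differ only in the last octet. A minimal sketch of that comparison follows; the helper names `normalize_mac` and `mac_offset` are hypothetical, and reading an offset of 1 as "the neighbouring port on the same card" is an assumption, not something the server reports directly.

```python
def normalize_mac(mac):
    """Lowercase a MAC address and strip separators so formats compare equal."""
    return mac.lower().replace(":", "").replace("-", "")


def mac_offset(configured, linked):
    """Return the integer difference (linked - configured) between two MACs."""
    return int(normalize_mac(linked), 16) - int(normalize_mac(configured), 16)


# Addresses quoted in the comments above for db1098.
CONFIGURED = "18:66:DA:F8:D5:E0"  # MAC in the puppet configuration, no link
LINKED = "18:66:DA:F8:D5:E1"      # MAC of the port that shows link

if __name__ == "__main__":
    offset = mac_offset(CONFIGURED, LINKED)
    # An offset of +1 is consistent with the cable sitting in the next port.
    print(f"linked MAC is configured MAC {offset:+d}")
```

Either the configured MAC or the cabling can then be changed so the two agree.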

Change 363563 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Change db1098 MAC address to the one that shows link

https://gerrit.wikimedia.org/r/363563

Change 363563 merged by Jcrespo:
[operations/puppet@production] install_server: Change db1098 MAC address to the one that shows link

https://gerrit.wikimedia.org/r/363563

Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts:

['db1098.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201707060938_jynus_9802.log.

Sadly, I still cannot see it booting.

@Cmjohnson this is not urgent, but can you check the link of the initially configured device, 18:66:DA:F8:D5:E0 aka network card1? @ayounsi can see link, but cannot see the mac address (normal, as the server is not up). Could you double check the port connections for db1098? I see link on the non-configured network card.

@jcrespo: the issue should be resolved. The cable was in the wrong eth port. Confirmed MAC
cmjohnson@asw-b-eqiad> ... ethernet-switching table brief |grep ge-5/0/5

private1-b-eqiad  18:66:da:f8:d5:e0 Learn          0 ge-5/0/5.0
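
A check like the one above can be scripted by comparing the MAC learned on the switch port with the configured one. This is a rough sketch only: the column order (VLAN, MAC, type, age, interface) is inferred from the single output line above, and `parse_switching_line` and `port_matches` are hypothetical helpers, not part of any switch tooling.

```python
def parse_switching_line(line):
    """Split one switching-table line into named fields.

    Assumed column order, taken from the output line above:
    VLAN, MAC address, entry type, age, interface.
    """
    vlan, mac, entry_type, age, interface = line.split()
    return {
        "vlan": vlan,
        "mac": mac.lower(),
        "type": entry_type,
        "age": age,
        "interface": interface,
    }


def port_matches(line, expected_mac, expected_port):
    """True if the line shows expected_mac learned on expected_port."""
    entry = parse_switching_line(line)
    # Ignore the logical unit suffix (".0") when comparing port names.
    port = entry["interface"].split(".")[0]
    return entry["mac"] == expected_mac.lower() and port == expected_port


if __name__ == "__main__":
    output = "private1-b-eqiad  18:66:da:f8:d5:e0 Learn          0 ge-5/0/5.0"
    print(port_matches(output, "18:66:DA:F8:D5:E0", "ge-5/0/5"))
```

The comparison is case-insensitive because the server and puppet report the MAC in upper case while the switch prints it in lower case.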

May I ask you to check db1100, db1104 and db1105? It is probably the same issue.

Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts:

['db1098.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201707111542_jynus_26478.log.

Completed auto-reimage of hosts:

['db1098.eqiad.wmnet']

and were ALL successful.

@jcrespo db1100 and db1105 had the same issue; db1104 is something else. I will update once I figure it out.

Thanks @jcrespo and @Cmjohnson for advancing a lot on this task!
The only host still pending before we can resolve this task is db1106, which looks like it doesn't have the RAID created.
I have tried to do it myself from the idrac, but the latency makes it impossible to operate the raid menu, so I believe we will need @Cmjohnson's magic hands here!

@jcrespo please remind me how you would like the RAID set up. RAID 10?

Thanks @Cmjohnson I have restarted the host for its reinstallation. I will close this task when done.

Marostegui updated the task description.

Thank you all for the help.