rack/setup/install labsdb1012.eqiad.wmnet
Closed, Resolved · Public

Description

This task will track the racking/setup/installation of the new labsdb1012 used by Analytics

Racking Proposal: Either row B or row D (as none of the other 3 labsdb hosts are on those rows) - any rack within those rows.

labsdb1012:

  • receive in system on procurement task T211135
  • rack system with proposed racking plan (see above) & update netbox (include all system info plus location)
  • RAID 10
  • bios/iLO/serial setup/testing
  • mgmt dns entries added for both asset tag and hostname
  • network port setup (description, enable, vlan)
    (end of on-site-specific steps)
  • production dns entries added
  • operations/puppet update (install_server at minimum, other files if possible)
  • OS installation
  • puppet accept/initial run
  • handoff for service implementation

Related Objects

Status: Resolved · Assigned: elukey

Event Timeline

Marostegui created this task.

I have suggested using labsdb1012 as the hostname, as this host has the same hardware as labsdb1009-1011 and will be set up the same way as those hosts.

+1

Change 487997 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Allow install labsdb1012

https://gerrit.wikimedia.org/r/487997

Change 487997 merged by Marostegui:
[operations/puppet@production] install_server: Allow install labsdb1012

https://gerrit.wikimedia.org/r/487997

I don't think we should set up new hosts using multi-source.

This host will act like the current labsdb hosts, but will be dedicated to Analytics.

> current labsdb hosts

Those will be on multi-instance soon(TM).

Yes, that was discussed during the meetings, this new host will follow that same "deadline" :-)

From our perspective, we will use the replica only for ETL (with Sqoop) and we won't grant any user access to the host. Having multi-source or multi-instance is not a big problem for us, so I'll leave the choice to you guys since it is more a matter of maintainability.

Since this host is important for the Analytics team, I'd be happy to take over the OS install to remove some work from DC Ops :)

@elukey is this a 1G or 10G rack?

I see that labsdb1010 and 1011 have 1G, so I'd go for 1G as well to keep everything as close as possible to the other ones!

EDIT (after a chat with my team): since we will use Hadoop to pull data from this host, we might generate a lot of traffic. I think that a 1G link is enough, but would it be possible to switch to 10G in the future if needed? Or is this going to affect the choice of the rack and hence be a burden?

@elukey, it will affect how it's racked: 10G racks have different switches, but we are also limited in space for those racks. If 1G works now and for the foreseeable future, stick with that; if a change is needed, we will make adjustments in the future.

Let's go for 1G then, thanks!
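As a rough sanity check on the 1G choice, here is the back-of-envelope math (a sketch with illustrative numbers: ~14 TB matches the logical drive size reported for this host, and line rate is an upper bound on real Sqoop throughput):

```shell
# How long would a full copy of the ~14 TB volume take at 1 Gbit/s line rate?
bytes=$((14 * 1000 * 1000 * 1000 * 1000))   # ~14 TB of data (illustrative)
rate=$((1000 * 1000 * 1000 / 8))            # 1 Gbit/s expressed in bytes/s
hours=$(awk -v b="$bytes" -v r="$rate" 'BEGIN { printf "%.1f", b / r / 3600 }')
echo "full copy at 1 Gbit/s: ~${hours}h"
# prints: full copy at 1 Gbit/s: ~31.1h
```

Since routine ETL pulls only a fraction of the dataset, a 1G link leaves plenty of headroom.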

Change 492009 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns for labsdb1012

https://gerrit.wikimedia.org/r/492009

Change 492009 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns for labsdb1012

https://gerrit.wikimedia.org/r/492009

Cmjohnson subscribed.

@arzhel This server needs to go into the cloud-support vlan, but it's not available to me for row C. Can you update it, please?

The port is asw2-c ge-8/0/29 (labsdb1012). Once complete, assign to @RobH for install.

ayounsi subscribed.

Talked to Chris, the server is actually in B8

Talking to @elukey about that.
In eqiad only rows A and C have the cloud-support vlan: https://netbox.wikimedia.org/ipam/vlans/?q=cloud-support&site=eqiad

As this vlan is being deprecated, I'm reluctant to create that vlan on more rows as it means updating ACLs, allocating IPs, configuring routers, switches, dhcp, etc...
So it comes down to either:
a) You really need it in row B
b) The machine needs to move to A or C

Had a chat with Manuel: if possible, let's move the host to row A so we have 2 labsdb hosts there and 2 in row C :)

@Cmjohnson would it be possible to move the host to a new rack?

@elukey I moved the host to A6 and updated Netbox. Arzhel updated the network switch config. DNS will need to be updated, and then it's ready for install.

Change 493295 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Add A/PTR records for labsdb1012

https://gerrit.wikimedia.org/r/493295

Change 493295 merged by Elukey:
[operations/dns@master] Add A/PTR records for labsdb1012

https://gerrit.wikimedia.org/r/493295
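For reference, the change boils down to an A/PTR pair of this shape (a sketch, not the literal diff; the address is the one that later appears in the router filter):

```
; production records for labsdb1012 (sketch of the patched zone files)
labsdb1012.eqiad.wmnet.     IN A    10.64.4.16
16.4.64.10.in-addr.arpa.    IN PTR  labsdb1012.eqiad.wmnet.
```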

Change 493299 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add labsdb1012 basic puppet settings

https://gerrit.wikimedia.org/r/493299

Change 493299 merged by Elukey:
[operations/puppet@production] Add labsdb1012 basic puppet settings

https://gerrit.wikimedia.org/r/493299

@Cmjohnson DHCP works fine and I can PXE boot, but then the Debian installer complains about "no partition found". I checked via the installer's shell and I can see the hardware RAID 10 as /dev/sdb, while the partman recipe wants /dev/sda. iLO shows "Logical Drive 0: RAID10, 13.968TB, Optimal", so I am wondering if some setting needs to be tuned for the array to show up as /dev/sda?

FYI, I solved the issue by disabling SD card support in System Configuration -> BIOS/Platform Configuration -> System Options -> USB Support (this is an HP Gen10). Now the remaining issue seems to be remote IPMI (wmf-auto-reimage fails with Error: Unable to establish IPMI v2 / RMCP+ session).

Via the HPE CLI I tried to flip /map1/config1/oemHPE_ipmi_dcmi_overlan_enable=yes, but it didn't work AFAICS.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['labsdb1012.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201903010815_elukey_158975.log.

The above setting seems to have done the trick; wmf-auto-reimage now works. I got this:

08:27:57 | labsdb1012.eqiad.wmnet | WARNING: unable to verify that BIOS boot parameters are back to normal, got:
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0004000000
 Boot Flags :
   - Boot Flag Invalid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : Force PXE
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

EDIT: followed up with:

elukey@cumin1001:~$ sudo ipmitool -I lanplus -H "labsdb1012.mgmt.eqiad.wmnet" -U root -E chassis bootparam get 5
Unable to read password from environment
Password:
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0004000000
 Boot Flags :
   - Boot Flag Invalid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : Force PXE
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

elukey@cumin1001:~$ sudo ipmitool -I lanplus -H "labsdb1012.mgmt.eqiad.wmnet" -U root -E chassis bootdev none
Unable to read password from environment
Password:
Set Boot Device to none

elukey@cumin1001:~$ sudo ipmitool -I lanplus -H "labsdb1012.mgmt.eqiad.wmnet" -U root -E chassis bootparam get 5
Unable to read password from environment
Password:
Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 8000000000
 Boot Flags :
   - Boot Flag Valid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : No override
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

Completed auto-reimage of hosts:

['labsdb1012.eqiad.wmnet']

and were ALL successful.
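The manual verification above (running `chassis bootparam get 5` and eyeballing the flags) can be scripted; here is a minimal sketch, where the function name is ours and the demo line is a fragment of the output captured above:

```shell
# Succeeds (exit 0) when the bootparam output shows the PXE override
# cleared, i.e. "Boot Device Selector : No override".
check_boot_override() {
  grep -q 'Boot Device Selector : No override'
}

# Real usage would pipe the live output, e.g.:
#   ipmitool -I lanplus -H "$host" -U root -E chassis bootparam get 5 | check_boot_override
# Demo against a line captured above:
printf '%s\n' '   - Boot Device Selector : No override' | check_boot_override \
  && echo 'PXE override cleared'
# prints: PXE override cleared
```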

elukey updated the task description.

Change 493653 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] [WIP] Assign role labs::db::wikireplica_analytics to labsdb1012

https://gerrit.wikimedia.org/r/493653

Change 493653 merged by Elukey:
[operations/puppet@production] Assign role labs::db::wikireplica_analytics to labsdb1012

https://gerrit.wikimedia.org/r/493653

Next step is to copy data from labsdb1011, then start MySQL and configure the Analytics-specific settings.

Change 494408 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] dbproxy1010: Depool labsdb1011

https://gerrit.wikimedia.org/r/494408

Change 494408 merged by Marostegui:
[operations/puppet@production] dbproxy1010: Depool labsdb1011

https://gerrit.wikimedia.org/r/494408

Mentioned in SAL (#wikimedia-operations) [2019-03-05T07:08:54Z] <marostegui> Start transferring data from labsdb1011 to labsdb1012 - T215231

The data transfer from labsdb1011 to labsdb1012 finished.
I deleted the following files on labsdb1012 before starting MySQL on that host:

relay-log-sX.info
multi-master.info
master-sX.info

After that MySQL started fine - once it was up I ran this just in case:

reset slave all

Then I configured replication based on the coordinates in https://phabricator.wikimedia.org/P8153
Right now labsdb1012 is replicating. Let's see if it all goes well:

root@cumin1001:~# mysql.py -hlabsdb1012 -e "show all slaves status\G" | egrep "Connection_name|Seconds"
              Connection_name: s1
        Seconds_Behind_Master: 63670
              Connection_name: s2
        Seconds_Behind_Master: 35191
              Connection_name: s3
        Seconds_Behind_Master: 41018
              Connection_name: s4
        Seconds_Behind_Master: 75818
              Connection_name: s5
        Seconds_Behind_Master: 69958
              Connection_name: s6
        Seconds_Behind_Master: 55365
              Connection_name: s7
        Seconds_Behind_Master: 73627
              Connection_name: s8
        Seconds_Behind_Master: 82809
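The reconfiguration steps above amount to roughly the following, in MariaDB multi-source terms (the connection name is real, but the host and binlog coordinates are placeholders, not the values recorded in P8153):

```sql
-- Wipe any replication state carried over from labsdb1011's datadir.
RESET SLAVE ALL;

-- Recreate one named connection per section (s1..s8). Host, log file
-- and position here are hypothetical stand-ins for the P8153 coordinates.
CHANGE MASTER 's1' TO
  MASTER_HOST = 'sanitarium-master.example',
  MASTER_USER = 'repl',
  MASTER_LOG_FILE = 'master-bin.000123',
  MASTER_LOG_POS  = 4;

START SLAVE 's1';
SHOW SLAVE 's1' STATUS\G
```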

Mentioned in SAL (#wikimedia-operations) [2019-03-06T06:27:28Z] <marostegui> Add labsdb1012 to tendril and zarcillo - T215231

Mentioned in SAL (#wikimedia-operations) [2019-03-06T07:09:43Z] <elukey> raised analytics user's max_user_connection from 10 to 100 on labsdb1012 - T215231

elukey updated the task description.

Change 494648 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] labsdb1012: Enable notifications

https://gerrit.wikimedia.org/r/494648

Change 494648 merged by Elukey:
[operations/puppet@production] labsdb1012: Enable notifications

https://gerrit.wikimedia.org/r/494648

elukey reopened this task as Open. (Edited Mar 6 2019, 7:11 PM)

I added the following bit on cr1/cr2:

elukey@re0.cr1-eqiad# show | compare
[edit firewall family inet filter analytics-in4 term mysql from destination-address]
         10.64.48.18/32 { ... }
+        /* labsdb1012.eqiad.wmnet */
+        10.64.4.16/32;

But the analytics hosts still can't contact labsdb1012. I checked iptables on the host and port 3306 is indeed filtered, so I guess we'll need another rule for hosts like an-coord1001.eqiad.wmnet. Another option is to add a special dbproxy domain only for Analytics, without touching the firewall rules (probably cleaner).

@jcrespo @Marostegui thoughts? What would be best in your opinion? I'd prefer another dbproxy-based domain, but I'm not sure how complicated it would be for you to create/maintain. (My knowledge of dbproxies was overlapping with LVS, sorry for the confusion.)

@elukey the problem is that if we add it to the existing proxies, it will be reachable by wikireplica users, as there is a round robin there, so nothing would prevent it from being shared, which is not what we want for now.
This is an example from dbproxy1010:

profile::mariadb::proxy::replicas::servers:
  labsdb1010:
    address: '10.64.37.23:3306'
  labsdb1011:
    address: '10.64.37.24:3306'
profile::mariadb::proxy::firewall: 'disabled'

So maybe you need to work out a ferm rule specifically for that.
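Such a rule could look roughly like this with the ferm puppetization (a sketch only: the resource title and host list are illustrative, not the change that was eventually merged):

```puppet
# Illustrative only: open MySQL to a specific Analytics host instead of
# putting labsdb1012 behind the shared dbproxy round robin.
ferm::service { 'labsdb1012-mysql-analytics':
    proto  => 'tcp',
    port   => '3306',
    srange => '@resolve((an-coord1001.eqiad.wmnet))',
}
```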

Change 494874 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::labs::db::wikireplica: add special ferm rules for analytics

https://gerrit.wikimedia.org/r/494874

Change 494874 merged by Elukey:
[operations/puppet@production] Introduce role::labs::db::wikireplica_analytics::dedicated

https://gerrit.wikimedia.org/r/494874

As per our earlier chat, this seems to be working fine after the puppet change to open the firewall for labsdb1012.