
Install Debian Buster on Hadoop
Open, Medium, Public, 0 Estimated Story Points

Description

The upgrade to Debian Buster for the Hadoop cluster(s) might be a bit more complicated than we thought, because openjdk-8 is not available on Debian Buster. In T229347 Andrew was able to install it on stat1005 since openjdk-8 was present in Buster before its final release, but it is not anymore (so if we reimage a host, for example, we won't find it).

The above becomes problematic due to the following constraints:

  1. Spark 2.3 (our current version) doesn't support Java 11 (see also T229347#5394326). IIUC this is due to the Scala version used (2.11), which doesn't support Java 11 (https://docs.scala-lang.org/overviews/jdk-compatibility/overview.html)
  2. Support for Java 11 in Scala 2.12+ is still incomplete - https://docs.scala-lang.org/overviews/jdk-compatibility/overview.html#jdk-11-compatibility-notes
  3. Spark 2.4 comes with Scala 2.12, which offers experimental support for Java 11

Also, in stretch-backports we do have openjdk-11: https://packages.debian.org/stretch-backports/openjdk-11-jdk
Last but not least, we also need to make sure that the HDFS/Yarn daemons work correctly on Buster and Java 11. CDH of course supports Java 11 only from 6.3 onward: https://www.cloudera.com/documentation/enterprise/upgrade/topics/ug_jdk8.html

It is also true that CDH 6.3 ships with Spark 2.4, so either they support Java 11 as an experimental feature or there is a way to make Spark 2.4 work: https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_63_packaging.html

Considerations:

  • I am not a Scala/Spark expert, so what I wrote above might not be accurate; please double check and correct me if needed :)
  • Backporting openjdk-8 to Buster is possible, but it would require a big effort from the SRE team. The last backport of openjdk-8 for Cassandra on Debian Jessie still needs to be maintained (applying patches for Debian Security Advisories, etc.), so it would be preferable not to go down that road again.

Event Timeline


CDH of course supports Java 11 only from 6.3 onward: https://www.cloudera.com/documentation/enterprise/upgrade/topics/ug_jdk8.html

Oh, if that's true, this could be a problem. We'd have to upgrade to CDH 6.3 first?

elukey added a comment.Sep 9 2019, 5:20 PM

In my mind, there are three major things that we'd need to do for Hadoop:

  • Complete the work on Kerberos, roll out the new config and handle the fallout of problems that we didn't test/take into account. Even if we did a ton of testing, there will be a lot of people to train and use cases to fix, requiring a lot of time.
  • Test Hadoop 3 in the test cluster, and see if it can work with our code (refinery) and, if not, what changes would need to be made. It seems easy, but it will require a lot of time and effort.
  • Migrate to Buster, which also means migrating to Java 11. Another long task that will require extensive testing and resources from the Analytics team.

The last two steps must also take into account that testing will not involve only the Hadoop code, but all the dependent systems (Druid, Notebooks, Spark, etc..) and also our own code (most notably, the Analytics refinery with our jobs etc..).

Given the current size of our team, I see only two of the above goals as doable; three would be overcommitting in my opinion. Even if I'd love to test Java 11 as soon as possible, a realistic plan could be:

  1. Port, if possible, openjdk-8 to wikimedia-buster and establish how much effort is needed by SRE to maintain the package(s). Co-ownership with Analytics could also be possible to split the pain. I volunteer to maintain the openjdk-8 package(s) if needed (with supervision :).
  2. Unblock the Buster migration of the Hadoop cluster, and at the same time allow the testing of Hadoop 3 (CDH6) on the testing cluster.

Eventually, when the above goals (and Kerberos) are done, we'll be able to decide/plan for Java 11 (likely next FY in my opinion).

Nuria added a comment.Sep 9 2019, 5:35 PM

Agreed with @elukey, and priority-wise I think we cannot test any Hadoop upgrades until we have rolled out Kerberos.

@MoritzMuehlenhoff what do you think about our plan? Would it be reasonable to consider backporting openjdk-8 to Buster? The added value would not be limited to Hadoop: all Java-based systems like Kafka/Druid/Zookeeper/etc. would be able to migrate to Buster without testing Java 11 first (sadly a non-trivial step).

For Kafka I see https://issues.apache.org/jira/browse/KAFKA-7264, which was fixed in 2.1.0. In theory we could migrate to Buster this year with openjdk-8, and then think about the Kafka 2.x upgrade plus Java 11 next fiscal year. Same path for other systems.

So, let me summarise to make sure I got this correctly. We have the following two options:

  1. Upgrade to CDH 6.3 on Stretch which provides Hadoop and Scala supporting both Java 8 and 11 and then reimage each server from "CDH 6.3/Stretch" to "CDH 6.3/Buster"
  2. Build Java 8 for Buster and install the current CDH 5 packages on Buster (do we know if they are supported on Buster, though?) and then migrate from "CDH 5 + Java 8 /Buster" to "CDH 6.3+Java 11/Buster" later

Is that correct? With the additional constraint that we want to run the GPU stuff, which is Buster-only, on a stat host with Hadoop access, right?

Then 2. is the only feasible option, so let's do it. We might run into similar migration issues with Elastic as well, so maybe that work is also useful on a wider scale.

I have to note though that these temporary things always tend to stick around; I built Java 8 for jessie something like four years ago as a Cassandra performance enhancement for Restbase (which used Java 7 at the time) and to this date we still have to keep it updated :-)

A tricky part about Java upgrades is that, from what I can tell, inter-JVM communication seems to fail between different Java versions. So Hadoop <-> Hadoop traffic will fail if the processes are running different JVM versions, which means we'd have to take a full cluster downtime to upgrade to Java 11. I'm not 100% sure this is always true, it is just something I've noticed from trying. I'm not sure if this is true of Kafka. If it is... I'm not sure what we are gonna do! :)

So, let me summarise to make sure I got this correctly. We have the following two options:

  1. Upgrade to CDH 6.3 on Stretch which provides Hadoop and Scala supporting both Java 8 and 11 and then reimage each server from "CDH 6.3/Stretch" to "CDH 6.3/Buster"

Yes, correct, with two caveats: 1) as Andrew mentioned, all hosts will need to be migrated at once; 2) we (as Analytics) have to test and port a ton of code that uses core Hadoop functions to the new major version.

  2. Build Java 8 for Buster and install the current CDH 5 packages on Buster (do we know if they are supported on Buster, though?) and then migrate from "CDH 5 + Java 8 /Buster" to "CDH 6.3+Java 11/Buster" later

We don't know yet; my plan was to start testing a Buster node on the Testing cluster as soon as the Kerberos work is in a good state. In theory it shouldn't be a problem, but in practice we'll need time to test. If Java 8 is available on Buster we'll also be able to convert a couple of Hadoop Analytics nodes (not test-cluster ones, I mean) and observe them for a couple of weeks to spot anomalies (and either attempt to fix them or just roll everything back; the worst that can happen is a few failed jobs).

Is that correct? With the additional constraint that we want to run the GPU stuff, which is Buster-only, on a stat host with Hadoop access, right?

Yep!

Then 2. is the only feasible option, so let's do it. We might run into similar migration issues with Elastic as well, so maybe that work is also useful on a wider scale.

I also have another early candidate for Java 8 on Buster, namely the new Zookeeper Analytics nodes T217057. Zookeeper clients use Java libraries to contact the cluster (as opposed to using a more agnostic protocol like HTTP, for example), so running Java 11 on the servers and 8 on the clients might end up in serialization issues (the same ones Andrew mentioned).

I have to note though that these temporary things always tend to stick around; I built Java 8 for jessie something like four years ago as a Cassandra performance enhancement for Restbase (which used Java 7 at the time) and to this date we still have to keep it updated :-)

I completely agree, and my team is committed to testing Hadoop 3 + Java 11 as soon as possible. To share the pain I also offered to help maintain Java 8 on Buster if needed :)

T233604 tracks the work to import the openjdk-8 package to a special component for Debian Buster, thanks Moritz!

Change 538844 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::java::analytics: deploy openjdk-8 on Buster

https://gerrit.wikimedia.org/r/538844

Change 538844 merged by Elukey:
[operations/puppet@production] profile::java::analytics: deploy openjdk-8 on Buster

https://gerrit.wikimedia.org/r/538844

T214364 has to be taken into consideration since it lists the missing dependencies that we had to create for CDH on stretch.

elukey added a comment.Jan 3 2020, 2:52 PM

Given that the Java 8 vs 11 problem has been resolved, I'd say we can concentrate on the Hadoop workers for the moment, leaving aside other corner cases like Hue (which can stay on Stretch for longer, there is no real rush).

Change 561869 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Set Buster for analytics1031

https://gerrit.wikimedia.org/r/561869

Change 561869 merged by Elukey:
[operations/puppet@production] Set Buster for analytics1031

https://gerrit.wikimedia.org/r/561869

elukey added a comment.Jan 3 2020, 4:55 PM

I forgot that libssl1.0.0 is also a dependency of hadoop-* packages, following up in T214364 to see how to solve the problem.

elukey changed the task status from Open to Stalled.Feb 18 2020, 2:36 PM

The current idea is to move to BigTop first (on Stretch) and then wait for the upcoming 1.5 release that should natively support Buster.

Marking this task as stalled until T244499 is completed.

elukey changed the task status from Stalled to Open.Thu, Feb 18, 7:29 AM

Change 665005 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] hadoop: set Buster for all worker nodes

https://gerrit.wikimedia.org/r/665005

Change 665005 merged by Elukey:
[operations/puppet@production] hadoop: set Buster for all worker nodes

https://gerrit.wikimedia.org/r/665005

Change 665049 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] install_server: add custom reuse recipes for Hadoop test

https://gerrit.wikimedia.org/r/665049

Change 665049 merged by Elukey:
[operations/puppet@production] install_server: add custom reuse recipes for Hadoop test

https://gerrit.wikimedia.org/r/665049

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

an-test-worker1003.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102180807_elukey_11734_an-test-worker1003_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['an-test-worker1003.eqiad.wmnet']

and were ALL successful.

elukey added a comment (edited). Thu, Feb 18, 10:46 AM

The reimage of an-test-worker1003 (preserving the /srv/hadoop dir) went fine! Bigtop works fine on Buster, and the host was re-added to HDFS nicely.

One problem keeps re-occurring though, namely that the hdfs/yarn/etc. system users are created by the Hadoop packages, and hence they don't get a fixed uid/gid combination. In the an-test-worker1003 case the datanode directory under /srv ended up owned by something else instead of hdfs:hdfs, and the datanode refused to start until a chown -R hdfs:hdfs was issued. It is not something super horrible, but a little annoying to do for all the workers that we have, and I am afraid we won't be able to skip it.

What we can do is see if we can set fixed uids in puppet and apply them during the upgrade to Buster, so that we'll be set for the Bullseye upgrade! :)

The reimage of an-test-worker1003 (preserving the /srv/hadoop dir) went fine! Bigtop works fine on Buster, and the host was re-added to HDFS nicely.

One problem keeps re-occurring though, namely that the hdfs/yarn/etc. system users are created by the Hadoop packages, and hence they don't get a fixed uid/gid combination. In the an-test-worker1003 case the datanode directory under /srv ended up owned by something else instead of hdfs:hdfs, and the datanode refused to start until a chown -R hdfs:hdfs was issued. It is not something super horrible, but a little annoying to do for all the workers that we have, and I am afraid we won't be able to skip it.

What we can do is see if we can set fixed uids in puppet and apply them during the upgrade to Buster, so that we'll be set for the Bullseye upgrade! :)

We recently added a new mechanism to configure system UIDs/GIDs via data.yaml, see the reprepro examples already in there. We need to research/test what happens if the user/group already exists with a different GID/UID, but this could be rolled out to the existing Stretch hosts, and then with the reimage they would simply gain the new settings upfront.

elukey added a subscriber: razzi.Thu, Feb 18, 11:27 AM

@razzi this is an interesting problem, I am going to add some context in here :)

At the moment we rely on the Bigtop deb packages for the creation of users like hdfs, yarn, etc. When the packages are installed, they allocate new users using the first available uid, not a specific one. For example:

elukey@cumin1001:~$ sudo cumin 'A:hadoop-worker' 'id hdfs'
59 hosts will be targeted:
an-worker[1078-1116].eqiad.wmnet,analytics[1058-1077].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                            
(21) an-worker1078.eqiad.wmnet,analytics[1058-1077].eqiad.wmnet                                                                                   
----- OUTPUT of 'id hdfs' -----                                                                                                                   
uid=117(hdfs) gid=123(hdfs) groups=123(hdfs),120(hadoop)                                                                                          
===== NODE GROUP =====                                                                                                                            
(38) an-worker[1079-1116].eqiad.wmnet                                                                                                             
----- OUTPUT of 'id hdfs' -----                                                                                                                   
uid=116(hdfs) gid=122(hdfs) groups=122(hdfs),119(hadoop)

In the case of the Hadoop cluster, it seems that we have some hosts with uid 116 and others with 117 (gids are different too). At the cluster level this is not really important, since authentication etc. works on names, but in the case of an OS reinstall the issue can be sneaky. For example, say that we reimage an-worker1078 preserving the datanode dirs (where the HDFS blocks are stored, owned by user hdfs with uid 117). If the first puppet run on the new OS (installing the Hadoop packages) ends up creating the hdfs user with uid 130, then all files in the datanode dirs will have incorrect ownership (since files are owned by uid, and the uid is then mapped to a name). In the case that I brought up above, the reimage of an-test-worker1003 led to the HDFS Datanode daemon not starting, due to permission denied errors on some files. The fix is easy, namely doing a chown -R hdfs:hdfs, and we'll need to do it for all the reimaged hosts I am afraid, but what we are discussing here is introducing a way to force the creation of some system users with fixed uids, so that the next OS upgrades will be less annoying.
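
To make the failure mode and the fix a bit more concrete, a minimal sketch (the /srv/hadoop path is the one from the test host above and is just an example, it differs per host):

# List anything under the preserved datanode dirs that is no longer owned by
# hdfs after the reimage, then fix the ownership with the chown -R mentioned above.
find /srv/hadoop -xdev \( ! -user hdfs -o ! -group hdfs \) -print | head
chown -R hdfs:hdfs /srv/hadoop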

Filippo documented a similar issue for Swift in https://phabricator.wikimedia.org/T123918, where he ended up adding a special use case to late_command.sh (the script used after the end of debian installs) to create swift users with fixed uid/gids.

I had a chat with Moritz on IRC and he suggested another alternative approach, namely having a package called wmf-analytics-users, that simply ships hdfs/yarn/presto/etc.. users on hosts with fixed uid/gids. It would need to be installed before anything else, but it would be a nice option as well.

Long term, if Bigtop ships systemd configs, we'll be able to apply overrides via puppet, but for the moment we have to choose one of the above solutions :)

@Ottomata @razzi thoughts/preferences?
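
For reference, a rough sketch of what the wmf-analytics-users option could boil down to (the uid/gid values below are just example placeholders, and this assumes the Bigtop package maintainer scripts skip user creation when the user already exists):

# Pre-create the Hadoop system users/groups with fixed ids before any Bigtop
# package is installed; 450/451 are example values, not the real reservations.
getent group hadoop >/dev/null || groupadd --system --gid 451 hadoop
getent group hdfs   >/dev/null || groupadd --system --gid 450 hdfs
getent passwd hdfs  >/dev/null || useradd --system --uid 450 --gid hdfs \
    --groups hadoop --home-dir /var/lib/hadoop-hdfs --shell /usr/sbin/nologin hdfs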

I am trying to figure out if there is a quick way to do this in puppet, but the main problem is that if we try to declare a user with a specific uid/gid in puppet then puppet will override it, if already present, during the first puppet run.

the main problem is that if we try to declare a user with a specific uid/gid in puppet then puppet will override it, if already present, during the first puppet run.

If we are reinstalling these, could we declare the user in puppet, and then make the bigtop package installation depend on that first? Assuming bigtop won't recreate the user if the user already exists before it is installed. E.g.

if $os == 'buster' {
  user { 'hdfs': ... }
}

package { 'bigtop-hdfs': ..., require => User['hdfs'] }

Hm, I guess we'd have to deal with the existing users on the client nodes that are already on Buster.

@Ottomata this may work, even if we have some hosts already on Buster (say stat100x, an-launcher, etc., but we can fix those manually in theory).

In order to use the require => User['hdfs'] we should have a corresponding user { 'hdfs': } anyway, so possibly in the Stretch use case we should just create the user without a fixed uid/gid?

anyway, so possibly in the Stretch use case we should just create the user without a fixed uid/gid?

Oh ya that could work. Alternatively we could just do user { 'hdfs': ..., before => Package['bigtop-hdfs'] }, but I like your idea better.

Luca from the past already added hdfs/yarn/mapred users to puppet! Completely forgot about it.. Of course we didn't set any specific uid/gid

Some cumin magic:

elukey@cumin1001:~$ sudo cumin 'P{R:User=hdfs} and P{F:lsbdistcodename=buster}'
31 hosts will be targeted:
an-airflow1001.eqiad.wmnet,an-druid[1001-1002].eqiad.wmnet,an-launcher1002.eqiad.wmnet,an-presto[1001-1005].eqiad.wmnet,an-test-client1001.eqiad.wmnet,an-test-druid1001.eqiad.wmnet,an-test-presto1001.eqiad.wmnet,an-test-ui1001.eqiad.wmnet,an-test-worker1003.eqiad.wmnet,an-tool[1008-1009].eqiad.wmnet,druid[1001-1008].eqiad.wmnet,labstore[1006-1007].wikimedia.org,stat[1004-1008].eqiad.wmnet
DRY-RUN mode enabled, aborting

We could do something like this:

  1. Add specific require => User[..] to package resources in core hadoop classes (should be one/two maximum, easy)
  2. Introduce a conditional in puppet to have the fixed uid/gid for hdfs/yarn/mapred only for Buster
  3. On the above nodes, right after running puppet, the hdfs/yarn/mapred gids will change. So we'll save the original uid/gids somewhere and then run
find / -group Y -exec chgrp -h hdfs {} \;
find / -user X -exec chown -h hdfs {} \;

After this we should be able to just reimage new nodes with the fixed gids, and get consistency once all hosts have been migrated to Buster.

@MoritzMuehlenhoff does it make sense? It could be an alternative solution to the package one, lemme know :)

After this we should be able to just reimage new nodes with the fixed gids, and get consistency once all hosts have been migrated to Buster.

@MoritzMuehlenhoff does it make sense? It could be an alternative solution to the package one, lemme know :)

Sounds good to me. Make sure to grab GIDs in the < 500 range, since 500-1000 can clash with system users managed via data.yaml.
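
For what it's worth, a quick way to check that a candidate uid/gid is still free on the target hosts could be something like the following (490 is an example id; the host alias is the one used earlier in this task):

# Look up an example uid/gid on all Hadoop workers; no passwd/group entry
# anywhere means the id is free to reserve.
sudo cumin 'A:hadoop-worker' 'getent passwd 490 || echo "uid 490 free"; getent group 490 || echo "gid 490 free"'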

Change 665360 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] bigtop: require hadoop users before installing daemon packages

https://gerrit.wikimedia.org/r/665360

Change 665360 merged by Elukey:
[operations/puppet@production] bigtop: require hadoop users before installing daemon packages

https://gerrit.wikimedia.org/r/665360

A high level script to change the user hdfs could be:

#!/bin/bash

set -ex

OLD_UID=$(id -u hdfs)
OLD_GID=$(id -g hdfs)

usermod -u 200 hdfs
groupmod -g 200 hdfs

find / \( -path /proc -o -path /mnt -o -path /sys -o -path /dev -o -path /media \) -prune -false -o -user $OLD_UID -exec chown hdfs {} \;
find / \( -path /proc -o -path /mnt -o -path /sys -o -path /dev -o -path /media \) -prune -false -o -group $OLD_GID -exec chgrp hdfs {} \;

For Hadoop workers, the idea is to run the above right before the reimage: stop the yarn/hdfs daemons and let the worker drain, then manually change the hdfs/yarn/etc. users, and then kick off the reimage (a rough sketch of this sequence follows below). This should avoid issues when the node comes back up on Buster and puppet runs for the first time (potentially starting the Hadoop daemons).
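
A rough sketch of that per-worker sequence (the service names are the ones I'd expect from the Bigtop packages, and the script name is just a placeholder for the uid/gid script above):

# Disable puppet, stop the Hadoop daemons and let the worker drain,
# remap the ids, then kick off the reimage as usual.
puppet agent --disable "uid/gid remap before Buster reimage"
systemctl stop hadoop-yarn-nodemanager hadoop-hdfs-datanode
bash ./change-hadoop-uids.sh    # placeholder name for the script above
# ...then launch wmf-auto-reimage for the host from cumin.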

We can repeat it for multiple users/groups, which I think should be:

  • group hadoop
  • user/group analytics
  • user/group analytics-search
  • user/group analytics-privatedata
  • user/group analytics-product
  • user/group druid

Basically all the ones in profile::analytics::cluster::users. There are other ones like presto/superset/etc.. but I would concentrate only on the ones present on the Hadoop workers.

Some considerations:

  • I tried to reduce the amount of chown/chgrp work by reviewing the current uid/gid usage across the nodes, but sadly the same uid/gid numbers are used for different things on different hosts (like uid 116 mapping to hdfs on some nodes, and to presto on others).
  • The final idea that I have is to just pick new uids/gids not used anywhere, allocate them for all the above users/groups and apply the fix. With the above find commands it shouldn't take long.
  • Adding system users to data.yaml (to reserve uid/gids) means that our users would be deployed fleet-wide, which seems too much. The idea that Moritz and I had was to limit the scope of the deployment of these users to Analytics-land, but now I am wondering if it may become a problem in the future. We have deployed our users to systems like the labstore nodes, which are outside our realm, and we could have the same use case on different systems in the future. If we enforce a fixed gid/uid for some users and it then clashes with other ones, we might need to apply some hacks/conditionals to make everything work. Maybe we could reserve uid/gids in puppet anyway, with a comment to prevent people from using them (but without forcing their deployment)? @MoritzMuehlenhoff would it make sense?

Change 666092 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] bigtop: add the hadoop group to the catalog

https://gerrit.wikimedia.org/r/666092

  • Adding system users to data.yaml (to reserve uid/gids) means that our users would be deployed fleet-wide, which seems too much. The idea that Moritz and I had was to limit the scope of the deployment of these users to Analytics-land, but now I am wondering if it may become a problem in the future. We have deployed our users to systems like the labstore nodes, which are outside our realm, and we could have the same use case on different systems in the future. If we enforce a fixed gid/uid for some users and it then clashes with other ones, we might need to apply some hacks/conditionals to make everything work.

Sure, if we see other use cases where the Hadoop-related users may be used outside of Analytics-land, why not!

Maybe we could reserve uid/gids in puppet anyway, with a comment to prevent people from using them (but without forcing their deployment)? @MoritzMuehlenhoff would it make sense?

People won't use them for unrelated tasks; the names are descriptive enough. I mean, no one is currently using the reprepro group either :-)

Change 666133 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: reserve gid/uid for various Hadoop daemons

https://gerrit.wikimedia.org/r/666133

Change 666134 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] bigtop: set uid/gid for yarn/hdfs/mapred/hadoop user/groups for Buster

https://gerrit.wikimedia.org/r/666134

Change 666135 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] druid::bigtop::hadoop::user: add fixed uid/gid from Buster onward

https://gerrit.wikimedia.org/r/666135

Change 666092 merged by Elukey:
[operations/puppet@production] bigtop: add the hadoop/hdfs/mapred/yarn groups to the catalog

https://gerrit.wikimedia.org/r/666092

Change 666133 merged by Elukey:
[operations/puppet@production] admin: reserve gid/uid for various Hadoop daemons

https://gerrit.wikimedia.org/r/666133

elukey added a comment (edited). Tue, Feb 23, 8:48 AM

Updated script after uid/gid reservation (the script can be refactored in 100 ways but I prefer to keep it simple and clear):

#!/bin/bash

set -x

change_uid() {
    # $1 new uid
    # $2 username
    if id "$2" &>/dev/null
    then
        OLD_UID=$(id -u $2)
        usermod -u $1 $2
        find / \( -path /proc -o -path /mnt -o -path /sys -o -path /dev -o -path /media \) -prune -false -o -user $OLD_UID -print0 | xargs -0 chown $1
    fi
}

change_gid() {
    # $1 new gid
    # $2 username
    if getent group $2 &>/dev/null
    then
        OLD_GID=$(getent group $2 | cut -d ":" -f 3)
        groupmod -g $1 $2
        find / \( -path /proc -o -path /mnt -o -path /sys -o -path /dev -o -path /media \) -prune -false -o -group $OLD_GID -print0  | xargs -0 chgrp $1
    fi
}

## hdfs


change_uid 903 hdfs
change_gid 903 hdfs

## yarn

change_uid 904 yarn
change_gid 904 yarn

## mapred

change_uid 905 mapred
change_gid 905 mapred

## analytics

change_uid 906 analytics
change_gid 906 analytics

## druid

change_uid 907 druid
change_gid 907 druid

## hadoop

change_gid 908 hadoop

I tried to apply the above on an-test-worker1003 (already on Buster) doing the following:

  1. Stop all hadoop daemons + puppet disabled
  2. run the script
  3. enable + run puppet

Some notes:

  • find + chown is incredibly slow; going through the hdfs datanode dirs/files takes many minutes. It may be unfeasible for production workers, so I'll try chown -R --from or something similar (only for the datanode hdfs dirs); see the sketch after this list.
  • the chown step is not completely safe, since it may remove the setuid bit from executables. More specifically, it happened to /usr/lib/hadoop-yarn/bin/container-executor, leading to the Yarn NodeManager refusing to start. It shouldn't be a problem for us since we'll reimage all the workers, but it is something to keep in mind.
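
The chown -R --from idea from the first note, as a hedged sketch (the old/new uid:gid pairs and the datanode path are example values):

# Only walk the datanode dirs, and only touch files still owned by the old ids;
# GNU chown's --from flag leaves everything else untouched.
for dir in /var/lib/hadoop/data/*; do
    chown -R --from=116:122 903:903 "$dir"
done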

Tobias reviewed my horrible code and suggested some changes:

  • avoid using the name of the user/group in find to avoid unnecessary calls to getent
  • use find -print0 | xargs -0 to avoid the slow find/exec model and batch groups of chown/chgrp calls together.

The script now runs like 100 times faster, now the timings are acceptable :D
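
To make the difference concrete, the change is roughly of this shape (illustrative commands, not the exact diff; 903 is the new hdfs uid from the reservation above):

# Before: one chown process forked per matching file.
find / -user "$OLD_UID" -exec chown 903 {} \;
# After: matching paths are batched, so chown runs only a handful of times.
find / -user "$OLD_UID" -print0 | xargs -0 chown 903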

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

an-test-worker1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202102231401_elukey_6277_an-test-worker1002_eqiad_wmnet.log.

I am testing something on an-test-worker1002, but the next step is to merge the change to enforce uid/gid for Buster nodes and see how the reimages go. Before that, we need to manually fix these hosts:

elukey@cumin1001:~$ sudo cumin 'P{R:User=hdfs or R:User=analytics or R:User=yarn or R:User=mapred or R:Group=hadoop} and P{F:lsbdistcodename=buster}'
32 hosts will be targeted:
an-airflow1001.eqiad.wmnet,an-druid[1001-1002].eqiad.wmnet,an-launcher1002.eqiad.wmnet,an-presto[1001-1005].eqiad.wmnet,an-test-client1001.eqiad.wmnet,an-test-druid1001.eqiad.wmnet,an-test-presto1001.eqiad.wmnet,an-test-ui1001.eqiad.wmnet,an-test-worker[1002-1003].eqiad.wmnet,an-tool[1008-1009].eqiad.wmnet,druid[1001-1008].eqiad.wmnet,labstore[1006-1007].wikimedia.org,stat[1004-1008].eqiad.wmnet
DRY-RUN mode enabled, aborting

I have already done the -test nodes, and the Druid ones can be done on a later step. The remaining ones are:

an-airflow1001.eqiad.wmnet,an-launcher1002.eqiad.wmnet,an-presto[1001-1005].eqiad.wmnet,an-tool[1008-1009].eqiad.wmnet,labstore[1006-1007].wikimedia.org,stat[1004-1008].eqiad.wmnet

Completed auto-reimage of hosts:

['an-test-worker1002.eqiad.wmnet']

and were ALL successful.

Change 666134 merged by Elukey:
[operations/puppet@production] bigtop: set uid/gid for hadoop user/groups for Buster

https://gerrit.wikimedia.org/r/666134

Change 666135 merged by Elukey:
[operations/puppet@production] druid::bigtop::hadoop::user: add fixed uid/gid from Buster onward

https://gerrit.wikimedia.org/r/666135

The druid/mapred/yarn/hdfs/analytics users all have fixed uid/gids on Buster nodes now. But I realized that we forgot a few, namely:

  • analytics-privatedata
  • analytics-product
  • analytics-search

The main reason is the following:

root@an-worker1080:/home/elukey# ls -l /var/lib/hadoop/data/b/yarn/local/usercache
total 132
drwxr-s--- 4 aikochou              yarn 4096 Feb 17 17:18 aikochou
drwxr-s--- 4 analytics             yarn 4096 Feb 16 21:30 analytics
drwxr-s--- 4 analytics-privatedata yarn 4096 Feb 16 22:00 analytics-privatedata
drwxr-s--- 4 analytics-product     yarn 4096 Feb 17 00:22 analytics-product
drwxr-s--- 4 analytics-search      yarn 4096 Feb 16 22:25 analytics-search
[..]

Regular users already have fixed uids/gids, but not the three above. Since we use containers in Yarn and they execute as the user who launched the job, the log files etc. will be owned by the user running the container as well. If we reimage without fixed gid/uids for all these users, we risk running into trouble later on.

Change 666657 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Allocate fixed uid/gid for analytics-related system daemons

https://gerrit.wikimedia.org/r/666657

elukey added a comment (edited). Wed, Feb 24, 6:30 PM

https://gerrit.wikimedia.org/r/666657 needs some follow up on the following nodes first:

elukey@cumin1001:~$ sudo cumin 'P{c:profile::analytics::cluster::users} and P{F:lsbdistcodename=buster}'
10 hosts will be targeted:
an-airflow1001.eqiad.wmnet,an-launcher1002.eqiad.wmnet,an-test-client1001.eqiad.wmnet,an-test-worker[1002-1003].eqiad.wmnet,stat[1004-1008].eqiad.wmnet
DRY-RUN mode enabled, aborting

More precisely:

#!/bin/bash

set -x

change_uid() {
    # $1 new uid
    # $2 username
    if id "$2" &>/dev/null
    then
        OLD_UID=$(id -u $2)
        usermod -u $1 $2
        find / \( -path /proc -o -path /mnt -o -path /sys -o -path /dev -o -path /media \) -prune -false -o -user $OLD_UID -print0 | xargs -0 chown $1
    fi
}

change_gid() {
    # $1 new gid
    # $2 username
    if getent group $2 &>/dev/null
    then
        OLD_GID=$(getent group $2 | cut -d ":" -f 3)
        groupmod -g $1 $2
        find / \( -path /proc -o -path /mnt -o -path /sys -o -path /dev -o -path /media \) -prune -false -o -group $OLD_GID -print0  | xargs -0 chgrp $1
    fi
}

change_uid 909 analytics-privatedata
change_gid 909 analytics-privatedata

change_uid 910 analytics-product
change_gid 910 analytics-product

change_uid 911 analytics-search
change_gid 911 analytics-search

Change 666657 merged by Elukey:
[operations/puppet@production] Allocate fixed uid/gid for analytics-related system daemons

https://gerrit.wikimedia.org/r/666657

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1117.eqiad.wmnet', 'an-worker1118.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102251023_elukey_19345.log.

Completed auto-reimage of hosts:

['an-worker1117.eqiad.wmnet', 'an-worker1118.eqiad.wmnet']

and were ALL successful.

Change 666865 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add an-worker111[7,8] to the Analytics Hadoop cluster

https://gerrit.wikimedia.org/r/666865

Change 666865 merged by Elukey:
[operations/puppet@production] Add an-worker111[7,8] to the Analytics Hadoop cluster

https://gerrit.wikimedia.org/r/666865

Change 666872 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] bigtop::hadoop: create system users before installing hadoop-client

https://gerrit.wikimedia.org/r/666872

Change 666872 merged by Elukey:
[operations/puppet@production] bigtop::hadoop: create system users before installing hadoop-client

https://gerrit.wikimedia.org/r/666872

Change 666875 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] bigtop::hadoop: avoid a dependency between hadoop-client and users

https://gerrit.wikimedia.org/r/666875

Change 666875 merged by Elukey:
[operations/puppet@production] bigtop::hadoop: avoid a dependency between hadoop-client and users

https://gerrit.wikimedia.org/r/666875

Change 666880 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] bigtop::hadoop: hdfs/mapred/yarn system users needs to be in grp hadoop

https://gerrit.wikimedia.org/r/666880

Change 666880 merged by Elukey:
[operations/puppet@production] bigtop::hadoop: hdfs/mapred/yarn system users needs to be in grp hadoop

https://gerrit.wikimedia.org/r/666880

All right, an-worker111[7,8] (previously in the backup cluster) were bootstrapped fine with the new fixed gid/uids; I will proceed with the rest of the backup cluster in T274795 before moving on to the reimage of the current Analytics Hadoop nodes.

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['analytics1058.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102260748_elukey_30366.log.

Change 667122 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add specific settings for Hadoop workers on Buster with GPUs

https://gerrit.wikimedia.org/r/667122

Final script to use for workers:

#!/bin/bash

set -x

change_uid() {
    # $1 new uid
    # $2 username
    if id "$2" &>/dev/null
    then
        OLD_UID=$(id -u $2)
        usermod -u $1 $2
        find / \( -path /proc -o -path /mnt -o -path /sys -o -path /dev -o -path /media \) -prune -false -o -user $OLD_UID -print0 | xargs -0 chown -h $1
    fi
}

change_gid() {
    # $1 new gid
    # $2 username
    if getent group $2 &>/dev/null
    then
        OLD_GID=$(getent group $2 | cut -d ":" -f 3)
        groupmod -g $1 $2
        find / \( -path /proc -o -path /mnt -o -path /sys -o -path /dev -o -path /media \) -prune -false -o -group $OLD_GID -print0  | xargs -0 chgrp -h $1
    fi
}

## hdfs


change_uid 903 hdfs
change_gid 903 hdfs

## yarn

change_uid 904 yarn
change_gid 904 yarn

## mapred

change_uid 905 mapred
change_gid 905 mapred

## analytics

change_uid 906 analytics
change_gid 906 analytics

## druid

change_uid 907 druid
change_gid 907 druid

## hadoop

change_gid 908 hadoop

change_uid 909 analytics-privatedata
change_gid 909 analytics-privatedata

change_uid 910 analytics-product
change_gid 910 analytics-product

change_uid 911 analytics-search
change_gid 911 analytics-search

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1096.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102261000_elukey_6199.log.

Change 667122 merged by Elukey:
[operations/puppet@production] Add specific settings for Hadoop workers on Buster with GPUs

https://gerrit.wikimedia.org/r/667122

Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts:

['an-worker1096.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202102261103_elukey_20564.log.

Completed auto-reimage of hosts:

['an-worker1096.eqiad.wmnet']

and were ALL successful.

Change 667180 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] install_server: switch to partman's reuse-parts.cfg for hadoop workers

https://gerrit.wikimedia.org/r/667180

Change 667180 merged by Elukey:
[operations/puppet@production] install_server: switch to partman's reuse-parts.cfg for hadoop workers

https://gerrit.wikimedia.org/r/667180

elukey added a comment (edited). Fri, Feb 26, 4:42 PM

Current status:

  • reimaged analytics1058 (regular hadoop worker, 12 disks) - all good! (the reuse partman recipe preserved the datanode dirs)
  • reimaged an-worker1096 (GPU worker, 24 disks) - sort of good, see T275896 (but the reuse partman recipe preserved the datanode dirs)
  • added an-worker111[7,8] from the old Hadoop Backup cluster (new worker nodes) - all good