
Bring stat1009 into service
Closed, Resolved · Public

Description

stat1009 is a new stats box, which is currently in the insetup::data_engineering role.

We should aim to bring this server into production as soon as it is convenient to do so.

It is running bullseye, so it will depend on first fixing any outstanding issues in T329363: Upgrade Hadoop test cluster to Bullseye that relate to an-test-client1002.
We encountered some upgrade-related errors; we are tracking them via the checklist below:

  • Create the /etc/jupyter directory required for jupyter_notebook_config.py
  • Add stat1009 to nfs_clients
  • Reduce mariadb version constraints on Bullseye

Once that is done, we should look to decommission stat1006:

  • update Wikitech documentation: updated for stat1009 and as part of Data Engineering/Systems/Clients
  • announce availability to users
  • update refinery deployment targets
  • update any other references to stats servers

Event Timeline


Change 919791 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[labs/private@master] Add stat1009 dummykeytabs

https://gerrit.wikimedia.org/r/919791

Change 919791 merged by Stevemunene:

[labs/private@master] Add stat1009 dummykeytabs

https://gerrit.wikimedia.org/r/919791

Change 919826 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Bring stat1009 into service

https://gerrit.wikimedia.org/r/919826

Change 919826 merged by Stevemunene:

[operations/puppet@production] Bring stat1009 into service

https://gerrit.wikimedia.org/r/919826

Merged the patch to bring stat1009 into role(statistics::explorer), and there are some errors/conflicts, arising from this being our first stat node on bullseye, that I am looking into. Details below:
Error with mounting clouddumps

Notice: /Stage[main]/Statistics::Dataset_mount/Mount[/mnt/nfs/dumps-clouddumps1001.wikimedia.org]/ensure: ensure changed 'unmounted' to 'mounted' (corrective)
Error: /Stage[main]/Statistics::Dataset_mount/Mount[/mnt/nfs/dumps-clouddumps1001.wikimedia.org]: Could not evaluate: Execution of '/usr/bin/mount /mnt/nfs/dumps-clouddumps1001.wikimedia.org' returned 32: mount.nfs: access denied by server while mounting clouddumps1001.wikimedia.org:/
Notice: /Stage[main]/Statistics::Dataset_mount/Mount[/mnt/nfs/dumps-clouddumps1002.wikimedia.org]/ensure: ensure changed 'unmounted' to 'mounted' (corrective)
Error: /Stage[main]/Statistics::Dataset_mount/Mount[/mnt/nfs/dumps-clouddumps1002.wikimedia.org]: Could not evaluate: Execution of '/usr/bin/mount /mnt/nfs/dumps-clouddumps1002.wikimedia.org' returned 32: mount.nfs: access denied by server while mounting clouddumps1002.wikimedia.org:/

An error to do with the MariaDB package; we may need to repackage this for bullseye:

Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install mariadb-client-10.3' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
Package mariadb-client-10.3 is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
However the following packages replace it:
  mariadb-client-10.5

E: Package 'mariadb-client-10.3' has no installation candidate
Error: /Stage[main]/Profile::Analytics::Cluster::Packages::Statistics/Package[mariadb-client-10.3]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install mariadb-client-10.3' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
Package mariadb-client-10.3 is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
However the following packages replace it:
  mariadb-client-10.5

E: Package 'mariadb-client-10.3' has no installation candidate

An error with the /etc/jupyter folder, previously mentioned here, which was solved manually; however, we need a puppet patch for this. This will be achieved by creating the folder in puppet before passing the config path.

Error: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/jupyter/jupyter_notebook_config.py20230518-257510-eq1xq0.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/jupyterhub/manifests/server.pp, line: 81)
Error: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/jupyter/jupyter_notebook_config.py20230518-257510-eq1xq0.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/jupyterhub/manifests/server.pp, line: 81)
Wrapped exception:
No such file or directory - A directory component in /etc/jupyter/jupyter_notebook_config.py20230518-257510-eq1xq0.lock does not exist or is a dangling symbolic link
Error: /Stage[main]/Jupyterhub::Server/File[/etc/jupyter/jupyter_notebook_config.py]/ensure: change from 'absent' to 'file' failed: Could not set 'file' on ensure: No such file or directory - A directory component in /etc/jupyter/jupyter_notebook_config.py20230518-257510-eq1xq0.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/jupyterhub/manifests/server.pp, line: 81)

More puppet folder errors, relating to the stats user's .gitconfig and .git-credentials files

Error: Could not set 'file' on ensure: No such file or directory - A directory component in /var/lib/stats/.git-credentials20230518-257510-fju91e.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/statistics/manifests/user.pp, line: 62)
Error: Could not set 'file' on ensure: No such file or directory - A directory component in /var/lib/stats/.git-credentials20230518-257510-fju91e.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/statistics/manifests/user.pp, line: 62)
Wrapped exception:
No such file or directory - A directory component in /var/lib/stats/.git-credentials20230518-257510-fju91e.lock does not exist or is a dangling symbolic link
Error: /Stage[main]/Statistics::User/File[/var/lib/stats/.git-credentials]/ensure: change from 'absent' to 'file' failed: Could not set 'file' on ensure: No such file or directory - A directory component in /var/lib/stats/.git-credentials20230518-257510-fju91e.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/statistics/manifests/user.pp, line: 62)
Error: Could not set 'present' on ensure: No such file or directory - A directory component in /var/lib/stats/.gitconfig20230518-257510-19zertf.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/git/manifests/userconfig.pp, line: 25)
Error: Could not set 'present' on ensure: No such file or directory - A directory component in /var/lib/stats/.gitconfig20230518-257510-19zertf.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/git/manifests/userconfig.pp, line: 25)
Wrapped exception:
No such file or directory - A directory component in /var/lib/stats/.gitconfig20230518-257510-19zertf.lock does not exist or is a dangling symbolic link
Error: /Stage[main]/Statistics::User/Git::Userconfig[stats]/File[/var/lib/stats/.gitconfig]/ensure: change from 'absent' to 'present' failed: Could not set 'present' on ensure: No such file or directory - A directory component in /var/lib/stats/.gitconfig20230518-257510-19zertf.lock does not exist or is a dangling symbolic link (file: /etc/puppet/modules/git/manifests/userconfig.pp, line: 25)
Notice: /Stage[main]/Statistics::Compute/File[/srv/mediawiki]: Dependency File[/var/lib/stats/.gitconfig] has failures: true
Notice: /Stage[main]/Statistics::Compute/File[/srv/mediawiki]: Dependency File[/var/lib/stats/.git-credentials] has failures: true
Warning: /Stage[main]/Statistics::Compute/File[/srv/mediawiki]: Skipping because of failed dependencies
Warning: /Stage[main]/Statistics::Compute/Tidy[/tmp]: Skipping because of failed dependencies
Warning: /Stage[main]/Statistics::Compute/Git::Clone[statistics_mediawiki]/File[/srv/mediawiki/core]: Skipping because of failed dependencies

Let's look at these one at a time.
For the first one, this is an NFS mount. The two NFS servers are clouddumps100[1-2].wikimedia.org.

The file that exists on an NFS server to manage client permissions is: /etc/exports

In this case, we can check the contents of that file like this.

btullis@clouddumps1001:~$ cat /etc/exports 
#  THIS FILE IS MANAGED BY PUPPET
#
# Export dumps to Cloud VPS instances
#
# /etc/exports: the access control list for filesystems which may be exported
#		to NFS clients.  See exports(5).
#

/srv/dumps/xmldatadumps/public  -ro,insecure,fsid=0,sec=sys,no_subtree_check,all_squash,nocrossmnt 10.68.0.0/16 172.16.0.0/21 185.15.56.0/25 172.16.128.0/24 185.15.57.0/29 185.15.57.16/29 208.80.153.190/32 stat1004.eqiad.wmnet stat1005.eqiad.wmnet stat1006.eqiad.wmnet stat1007.eqiad.wmnet stat1008.eqiad.wmnet an-launcher1002.eqiad.wmnet wcqs2001.codfw.wmnet wdqs1009.eqiad.wmnet wdqs1010.eqiad.wmnet wdqs2009.codfw.wmnet wdqs2010.codfw.wmnet

stat1009 is definitely missing from that list, but it confirms that it's managed by puppet.

The template file is here: https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/templates/dumps/distribution/nfs-exports.erb

So we can see that the list is generated by a variable in puppet called nfs_clients which is defined here:
https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common/profile/dumps/distribution.yaml#L19-L37
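
For illustration, the change is just to append the new host to that list. A minimal sketch of the hieradata edit (the key name nfs_clients comes from the link above, but the surrounding layout is an assumption; check the linked file before copying):

# Sketch only: the real definition lives in
# hieradata/common/profile/dumps/distribution.yaml, linked above.
nfs_clients:
  - stat1004.eqiad.wmnet
  - stat1005.eqiad.wmnet
  - stat1006.eqiad.wmnet
  - stat1007.eqiad.wmnet
  - stat1008.eqiad.wmnet
  - stat1009.eqiad.wmnet  # new entry; puppet then regenerates /etc/exports
                          # on the clouddumps hosts from the ERB template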

For the second one, the mariadb version, this requirement comes from this file: https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/analytics/cluster/packages/statistics.pp#L65

That file contains the following guidance in the header:

# NOTE: If done carefully, most if not all of the packages declared in this
# class could probably be removed.  It doesn't hurt to have them, but if
# you ever run into an issue (e.g. after an OS upgrade), be bold and
# remove packages at will.
# See also: https://phabricator.wikimedia.org/T275786

I'm not sure that I'd go as far as removing the mariadb-client-10.3 package from the list, but one thing we could do is relax the version constraint.

We can use the command apt rdepends to find out what depends on a particular package.

Compare the output of the following two commands:

  • apt rdepends mariadb-client-10.3 on stat1008
  • apt rdepends mariadb-client-10.5 on stat1009
btullis@stat1008:~$ apt rdepends mariadb-client-10.3
mariadb-client-10.3
Reverse Depends:
  Depends: mariadb-client (>= 1:10.3.38-0+deb10u1)
  Depends: mariadb-client-10.3-dbgsym (= 1:10.3.34-0+deb10u1)
  Depends: mariadb-test (= 1:10.3.38-0+deb10u1)
  Depends: mariadb-server-10.3 (>= 1:10.3.38-0+deb10u1)
  Depends: mariadb-plugin-gssapi-client
  Depends: mariadb-client (>= 1:10.3.34-0+deb10u1)
  Depends: default-mysql-client
  Depends: mariadb-test (= 1:10.3.34-0+deb10u1)
  Depends: mariadb-server-10.3 (>= 1:10.3.34-0+deb10u1)
  Depends: mariadb-plugin-gssapi-client
btullis@stat1009:~$ apt rdepends mariadb-client-10.5
mariadb-client-10.5
Reverse Depends:
  Depends: mariadb-client (>= 1:10.5.19-0+deb11u2)
  Depends: mariadb-client-10.5-dbgsym (= 1:10.5.19-0+deb11u2)
  Depends: default-mysql-client
  Depends: mariadb-test (= 1:10.5.19-0+deb11u2)
  Depends: mariadb-server-10.5 (>= 1:10.5.19-0+deb11u2)
  Depends: mariadb-plugin-gssapi-client

You'll see that the package mariadb-client pulls in the respective mariadb-client-$version for each distribution.

Therefore, I think that I would investigate replacing the specific version with the less specific version in statistics.pp. It should be a noop on buster boxes and should pull in mariadb-client-10.5 on bullseye.
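
A minimal sketch of that change to statistics.pp, assuming the package is declared as a plain package resource (the real class may declare it differently; see the link above):

# Before: pins the buster-era version, which has no installation
# candidate on bullseye.
package { 'mariadb-client-10.3':
    ensure => 'installed',
}

# After: the unversioned package resolves to mariadb-client-10.3 on
# buster and mariadb-client-10.5 on bullseye, per the rdepends output.
package { 'mariadb-client':
    ensure => 'installed',
}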

An error with the /etc/jupyter folder, previously mentioned here, which was solved manually; however, we need a puppet patch for this. This will be achieved by creating the folder in puppet before passing the config path.

Yes, I agree. I think that you'll need to manage the /etc/jupyter directory before this file resource: https://github.com/wikimedia/operations-puppet/blob/production/modules/jupyterhub/manifests/server.pp#L77-L84
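
Something along these lines should do it (a sketch only; owner, group and mode are assumptions to be checked against the module's conventions):

# Declared before File['/etc/jupyter/jupyter_notebook_config.py'].
# Puppet autorequires a managed parent directory, so the config file
# resource will then be able to create its lock file.
file { '/etc/jupyter':
    ensure => directory,
    owner  => 'root',
    group  => 'root',
    mode   => '0755',
}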

No such file or directory - A directory component in /var/lib/stats/.gitconfig20230518-257510-19zertf.lock does not exist or is a dangling symbolic link

This looks like something we should fix too.
The user account for stats is created here: https://github.com/wikimedia/operations-puppet/blob/production/modules/statistics/manifests/user.pp#L14

The defined type systemd::sysuser specifically states:

home_dir home directory, must be pre-existing and does not get added by the define

So I think that we should add management of the directory /var/lib/stats around line 13 here: https://github.com/wikimedia/operations-puppet/blob/production/modules/statistics/manifests/user.pp#L13
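
For example, something like this just above the systemd::sysuser declaration (a sketch; ownership, mode and ordering details are assumptions):

# Ensure the home directory exists before systemd::sysuser is applied,
# since the define does not create it itself.
file { '/var/lib/stats':
    ensure => directory,
    owner  => 'stats',
    group  => 'wikidev',
    mode   => '0755',
}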

Icinga downtime and Alertmanager silence (ID=58b1da63-7fca-4cbc-8725-c1ba80542dd9) set by stevemunene@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Bringing stat1009 into service

stat1009.eqiad.wmnet

Change 921885 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Create the jupyter notebook config folder

https://gerrit.wikimedia.org/r/921885

Change 922091 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Grant stat1009 access to cloud dumps

https://gerrit.wikimedia.org/r/922091

Change 922115 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Create stat user home directory

https://gerrit.wikimedia.org/r/922115

Change 922158 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Set mariadb-client to pull the right version

https://gerrit.wikimedia.org/r/922158

Change 921885 merged by Stevemunene:

[operations/puppet@production] Create the jupyter notebook config folder

https://gerrit.wikimedia.org/r/921885

Change 922115 merged by Stevemunene:

[operations/puppet@production] Create stat user home directory

https://gerrit.wikimedia.org/r/922115

Change 922158 merged by Stevemunene:

[operations/puppet@production] Set mariadb-client to pull the right version

https://gerrit.wikimedia.org/r/922158

Change 922091 merged by Stevemunene:

[operations/puppet@production] Grant stat1009 access to cloud dumps

https://gerrit.wikimedia.org/r/922091

We are getting an error: CRITICAL - degraded: The following units failed: rsync-published.service. Details here:

May 23 10:34:23 stat1009 systemd[1]: Failed to start Rsync push to analytics.wikimedia.org web host.
May 23 10:45:00 stat1009 systemd[1]: Starting Rsync push to analytics.wikimedia.org web host...
May 23 10:49:22 stat1009 published-sync[2979602]: rsync: [sender] failed to connect to analytics-web.discovery.wmnet (2620:0:861:105:10:64:21:14): Connection timed out (110)
May 23 10:49:22 stat1009 published-sync[2979602]: rsync: [sender] failed to connect to analytics-web.discovery.wmnet (10.64.21.14): Connection timed out (110)
May 23 10:49:22 stat1009 published-sync[2979602]: rsync error: error in socket IO (code 10) at clientserver.c(137) [sender=3.2.3]
May 23 10:49:22 stat1009 systemd[1]: rsync-published.service: Main process exited, code=exited, status=10/n/a
May 23 10:49:22 stat1009 systemd[1]: rsync-published.service: Failed with result 'exit-code'.
May 23 10:49:22 stat1009 systemd[1]: Failed to start Rsync push to analytics.wikimedia.org web host.

This is probably caused by the fact that stat1009 is not part of the stat hosts rsync hosts_allow list mentioned here.

@BTullis should I also add it to the list of profile::dumps::rsync_internal_clients? I noticed we only have 2 stat servers there.

This is probably caused by the fact that stat1009 is not part of the stat hosts rsync hosts_allow list mentioned here.

Yes, I believe you've found the correct place. I've been tracing through the various puppet modules involved in setting up an-web1001:/etc/rsync.d/frag-published-destination and it's not easy.

Anyway, if you add stat1009 it should make a few related changes to rsync modules on all of the stats servers.

Should I also add it to the list of profile::dumps::rsync_internal_clients? I noticed we only have 2 stat servers there.

I would hesitate before doing this. Maybe someone like @ArielGlenn and/or @taavi would know why we only have two stats servers listed under: profile::dumps::rsync_internal_clients and profile::dumps::stats_hosts in the file: hieradata/common/profile/dumps.yaml

I think it might be something to do with this legacy set of dumps: https://phabricator.wikimedia.org/T238243#8739981 although from what I can see it only references stat1007. I haven't looked at it in much depth.

Also experiencing SQLAlchemy errors with jupyterhub-conda.service on stat1009, which have not been experienced on other hosts, including the bullseye host an-test-client1002.

May 24 09:33:56 stat1009 jupyterhub-conda[3472435]: [D 2023-05-24 09:33:56.584 JupyterHub app:1721] Connecting to db: sqlite:////srv/jupyterhub-conda/data/jupyterhub.sqlite.db
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]: [E 2023-05-24 09:33:56.591 JupyterHub app:2989]
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:     Traceback (most recent call last):
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/app.py", line 2986, in launch_instance_async
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         await self.initialize(argv)
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/app.py", line 2521, in initialize
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         self.init_db()
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/app.py", line 1726, in init_db
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         self.session_factory = orm.new_session_factory(
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/orm.py", line 880, in new_session_factory
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         check_db_revision(engine)
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/orm.py", line 771, in check_db_revision
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         current_table_names = set(inspect(engine).get_table_names())
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/inspection.py", line 111, in inspect
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         ret = reg(subject)
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/engine/reflection.py", line 304, in _engine_insp
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         return Inspector._construct(Inspector._init_engine, bind)
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/engine/reflection.py", line 237, in _construct
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         init(self, bind)
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/engine/reflection.py", line 248, in _init_engine
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         engine.connect().close()
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3264, in connect
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         return self._connection_cls(self)
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 174, in __init__
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         self.dispatch.engine_connect(self)
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/event/attr.py", line 487, in __call__
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         fn(*args, **kw)
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/event/legacy.py", line 100, in wrap_leg
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         return fn(*conv(*args))
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/jupyterhub/orm.py", line 737, in ping_connection
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         connection.scalar(select([1]))
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/sql/_selectable_constructors.py", line 489, in select
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         return Select(*entities)
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/sql/selectable.py", line 5135, in __init__
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         self._raw_columns = [
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/sql/selectable.py", line 5136, in <listcomp>
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         coercions.expect(
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/sql/coercions.py", line 413, in expect
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         resolved = impl._literal_coercion(
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/sql/coercions.py", line 652, in _literal_coercion
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         self._raise_for_expected(element, argname)
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/sql/coercions.py", line 1143, in _raise_for_expected
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         return super()._raise_for_expected(
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/sql/coercions.py", line 711, in _raise_for_expected
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         super()._raise_for_expected(
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:       File "/opt/conda-analytics/lib/python3.10/site-packages/sqlalchemy/sql/coercions.py", line 536, in _raise_for_expected
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:         raise exc.ArgumentError(msg, code=code) from err
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:     sqlalchemy.exc.ArgumentError: Column expression, FROM clause, or other columns clause element expected, got [1]. Did you mean to say select(1)?
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]:     
May 24 09:33:56 stat1009 jupyterhub-conda[3472435]: [D 2023-05-24 09:33:56.593 JupyterHub application:1028] Exiting application: jupyterhub
May 24 09:33:56 stat1009 systemd[1]: jupyterhub-conda.service: Main process exited, code=exited, status=1/FAILURE

Looking into the error led me to the jupyterhub issues page, where I found "JupyterHub is not compatible with sqlalchemy 2" (#4312). However, this should not affect us much, since we are running jupyterhub 1.5.0. Still exploring the logs for more.
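
For context, the failing call in the traceback is JupyterHub's ping_connection in jupyterhub/orm.py using the legacy SQLAlchemy 1.x column-list form, which SQLAlchemy 2 removed. A minimal illustration of the incompatibility (not a fix):

from sqlalchemy import select

select(1)    # SQLAlchemy 2.x form, as the error message itself suggests
select([1])  # legacy 1.x list form; raises ArgumentError under SQLAlchemy 2,
             # which is exactly the failure shown in the traceback above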

Also experiencing SQLAlchemy errors with jupyterhub-conda.service on stat1009, which have not been experienced on other hosts, including the bullseye host an-test-client1002. Looking into the error led me to the jupyterhub issues page, where I found "JupyterHub is not compatible with sqlalchemy 2" (#4312). However, this should not affect us much, since we are running jupyterhub 1.5.0.

This is being worked on here: https://phabricator.wikimedia.org/T337471, with a temp fix to downgrade conda-analytics.

Icinga downtime and Alertmanager silence (ID=e14a6c2c-888d-45b4-94a2-edc04252cc36) set by stevemunene@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Bringing stat1009 into service

stat1009.eqiad.wmnet

Change 926422 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Add new stat1009 to the stat servers rsync hosts_allow

https://gerrit.wikimedia.org/r/926422

I see that we are affected by the same issue as T336281: Fix hiveserver2 related errors on bullseye hadoop clients and workers, in that consecutive runs of puppet alternately add and then remove hive.

The last time we looked at it, it was caused by a dependency on python-is-python2 and then python-is-python3, so they kept swapping over on each puppet run.

It may be related to the presto-cli package, since when we looked at the issue on an-test-worker1001 it was related to the presto-server package.

btullis@stat1009:~$ apt-cache policy presto-cli
presto-cli:
  Installed: 0.273.3-1
  Candidate: 0.273.3-1
  Version table:
 *** 0.273.3-1 1001
       1001 http://apt.wikimedia.org/wikimedia bullseye-wikimedia/main amd64 Packages
        100 /var/lib/dpkg/status

Change 926422 merged by Stevemunene:

[operations/puppet@production] Add new stat1009 to the stat servers rsync hosts_allow

https://gerrit.wikimedia.org/r/926422

The rsync-published.service is still not functioning as expected on the stat host.

Jun 06 16:15:08 stat1009 systemd[1]: Starting Rsync push to analytics.wikimedia.org web host...
Jun 06 16:15:08 stat1009 published-sync[4151358]: rsync: mkdir "/stat1009" (in published-destination) failed: Permission denied (13)
Jun 06 16:15:08 stat1009 published-sync[4151358]: rsync error: error in file IO (code 11) at main.c(682) [Receiver=3.1.3]
Jun 06 16:15:08 stat1009 systemd[1]: rsync-published.service: Main process exited, code=exited, status=11/n/a
Jun 06 16:15:08 stat1009 systemd[1]: rsync-published.service: Failed with result 'exit-code'.
Jun 06 16:15:08 stat1009 systemd[1]: Failed to start Rsync push to analytics.wikimedia.org web host

The error occurs as we try to rsync $source to $destination/$::hostname using the published-sync script. The destination host is expected to use the statistics::published class to merge the per-$::hostname directories into a single directory.
However, in this case the rsync service user does not have enough permissions to create the $::hostname directory in the destination folder.

The rsync-published.service is still not functioning as expected on the stat host. The error occurs as we try to rsync $source to $destination/$::hostname using the published-sync script; in this case the rsync service user does not have enough permissions to create the $::hostname directory in the destination folder.

Let's just add this manually for now. I don't think that there is significant value in automating this, given that adding a new stat host is infrequent and we only have a single destination server.

I did this:

btullis@an-web1001:/srv/published-rsynced$ sudo mkdir stat1009 && sudo chown stats:wikidev stat1009 && sudo chmod 775 stat1009

Let's see how this affects the rsync-published.service on stat1009.

The service is up and running as expected, thanks @BTullis

stevemunene@stat1009:/srv$ systemctl status rsync-published.service
● rsync-published.service - Rsync push to analytics.wikimedia.org web host
     Loaded: loaded (/lib/systemd/system/rsync-published.service; static)
     Active: inactive (dead) since Tue 2023-06-06 17:15:02 UTC; 2min 5s ago
TriggeredBy: ● rsync-published.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 4166156 ExecStart=/usr/local/bin/published-sync -q (code=exited, status=0/SUCCESS)
   Main PID: 4166156 (code=exited, status=0/SUCCESS)
        CPU: 14ms

Jun 06 17:15:02 stat1009 systemd[1]: Starting Rsync push to analytics.wikimedia.org web host...
Jun 06 17:15:02 stat1009 systemd[1]: rsync-published.service: Succeeded.
Jun 06 17:15:02 stat1009 systemd[1]: Finished Rsync push to analytics.wikimedia.org web host.

I see that we are affected by the same issue as T336281: Fix hiveserver2 related errors on bullseye hadoop clients and workers, in that consecutive runs of puppet alternately add and then remove hive.

The last time we looked at it, it was caused by a dependency on python-is-python2 and then python-is-python3, so they kept swapping over on each puppet run.

Ah, I think that this is easier than we thought. We just need to set the following hiera parameter:

profile::base::remove_python2_on_bullseye: false

We have it set for the role::analytics_test_cluster::client but we don't yet have it set for role::statistics::explorer.
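
A sketch of the hieradata change (the key is quoted above; the exact role hieradata file path is an assumption):

# Assumed location: the hieradata file for role::statistics::explorer,
# e.g. hieradata/role/common/statistics/explorer.yaml.
profile::base::remove_python2_on_bullseye: false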

Change 929675 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Prevent removal of python2 on bullseye stat hosts

https://gerrit.wikimedia.org/r/929675

Change 928918 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[analytics/refinery/scap@master] Add stat1009 to scap targets

https://gerrit.wikimedia.org/r/928918

Change 929675 merged by Stevemunene:

[operations/puppet@production] Prevent removal of python2 on bullseye stat hosts

https://gerrit.wikimedia.org/r/929675

Change 928918 merged by Stevemunene:

[analytics/refinery/scap@master] Add stat1009 to scap targets

https://gerrit.wikimedia.org/r/928918

Scap deployment of analytics/refinery to stat1009 failed for me today due to:

"git: 'fat' is not a git command."

Looks like it should be installed by the puppet standard_packages class, but isn't, because of an "lt bullseye" (older than bullseye) conditional.

I manually installed git-fat on stat1009 (sudo apt-get install git-fat), and then scap deploy completed successfully.

The default python on stat1009 is still Python 2, so I'm not so sure this conditional is necessary?

@MoritzMuehlenhoff maybe will know? https://gerrit.wikimedia.org/r/c/operations/puppet/+/677807

@Ottomata we ran into this problem when deploying WDQS hosts on Bullseye. The default Bullseye config removes all Python 2 packages. Here's the patch we used to allow Python 2 on Bullseye.

Kay.

@BTullis @Stevemunene FYI I removed the git-fat package from stat1009, so the next time a deploy happens this will bite us again, until we do as @bking suggests.

Kay.

@BTullis @Stevemunene FYI I removed the git-fat package from stat1009, so the next time a deploy happens this will bite us again, until we do as @bking suggests.

We had implemented the python solution that @bking suggests earlier in the day, here; that's why the default python is still python2. Having another look at the git-fat discussions.

If git-fat will be needed on an ongoing basis (also for Bookworm and later, which no longer have any Python 2 at all), it needs to be ported to Python 3; see the older discussion starting at https://phabricator.wikimedia.org/T279509#7868911

It seems fine to use the Py2 packages in Bullseye as an interim to unblock the upgrades of the stat hosts.

stat1009 is experiencing a jupyterhub error whenever a single user tries to create a server.
I created an SSH tunnel with ssh -N stat1009.eqiad.wmnet -L 8880:127.0.0.1:8880 and logged in via the user interface accessible at 127.0.0.1:8880. When I try to create a server, the single-user service fails with:

Jun 23 09:53:14 stat1009 systemd[1631091]: jupyter-stevemunene-singleuser-conda-analytics.service: Failed to set up mount namespacing: /run/systemd/unit-root/srv/spark-tmp: No such file or directory
Jun 23 09:53:14 stat1009 systemd[1631091]: jupyter-stevemunene-singleuser-conda-analytics.service: Failed at step NAMESPACE spawning /bin/bash: No such file or directory
Jun 23 09:53:14 stat1009 systemd[1]: jupyter-stevemunene-singleuser-conda-analytics.service: Main process exited, code=exited, status=226/NAMESPACE

This is similar to what we had on the test client after upgrading to bullseye https://phabricator.wikimedia.org/T333511#8823084

I succeeded in creating a jupyterhub single-user session.
The error was due to a missing directory, namely /srv/spark-tmp, which is part of c.SystemdSpawner.readwrite_paths.
I checked the other hosts and manually created the folder with the same permissions, as below.

stevemunene@stat1009:~$ sudo mkdir /srv/spark-tmp
stevemunene@stat1009:~$ sudo chmod 1777 /srv/spark-tmp/

With that, the server started successfully for my user.


Working on a puppet patch to automate the creation of the folder to avoid this error as we upgrade the stat hosts to bullseye.
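
A minimal sketch of what that patch could look like (resource placement is an assumption; the mode mirrors the manual chmod 1777 above):

# Matches the manually-applied permissions: world-writable with the
# sticky bit, like /tmp.
file { '/srv/spark-tmp':
    ensure => directory,
    owner  => 'root',
    group  => 'root',
    mode   => '1777',
}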

Change 935444 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Create spark3 local directory

https://gerrit.wikimedia.org/r/935444

Change 935444 merged by Stevemunene:

[operations/puppet@production] Create spark3 local directory

https://gerrit.wikimedia.org/r/935444

Gehel triaged this task as High priority. Jul 21 2023, 12:50 PM
Gehel moved this task from Needs Reporting to Done on the Data-Platform-SRE board.