Deploy magnum cluster for quarry
Closed, Resolved, Public

Description

This should probably be done in tofu.

Event Timeline

https://quarry-test.wmcloud.org now serves a running, but not working, Quarry on k8s. When I run a query it returns:

Can't connect to MySQL server on 'enwiki' ([Errno -2] Name or service not known)

Presumably something is failing to connect to the replica. Also, on a second run of a query the favicon changes to "working" while the status text does not, and the query doesn't seem to complete. So something else is probably wrong as well.

Neither kubectl logs -f deployment.apps/worker nor kubectl logs -f deployment.apps/web seems to show anything.

I installed with:
bash deploy.sh
Followed by:
mysql -uquarry -h ythbv6lgsvs.svc.trove.eqiad1.wikimedia.cloud -p < schema.sql

Maybe more setup is needed? Seemed to be enough on the minikube setup.

Long and short of it: @SD0001 @Audiodude, I could use help troubleshooting.

k8s creds are on the quarry-bastion node in /opt/quarry-123-1.config
db creds are in the git branch (T349032) helm-quarry/prod-config.yaml

@rook This is due to misconfigured db config. I can see config.yaml has REPLICA_DOMAIN: '' which could be overriding the valid value provided a few lines above it.
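
One way to check for the kind of override described here is to scan the config file for keys that are defined more than once. Some YAML loaders (PyYAML, for example) keep the last occurrence without complaint, so an empty value further down can silently replace a valid one. This is only a minimal sketch assuming a flat KEY: value layout; the helper name find_duplicate_keys, the sample text and the placeholder domain are hypothetical and not taken from the Quarry repository.

```python
import re
from collections import defaultdict

# Matches an unindented "KEY: value" line; nested keys are ignored on purpose.
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\s*:")


def find_duplicate_keys(text):
    """Return {key: [line numbers]} for top-level keys defined more than once."""
    seen = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = TOP_LEVEL_KEY.match(line)
        if match:
            seen[match.group(1)].append(lineno)
    return {key: lines for key, lines in seen.items() if len(lines) > 1}


# Hypothetical sample; the domain value is a placeholder, not the real one.
SAMPLE = """\
REPLICA_DOMAIN: 'replicas.example'
REPLICA_USER: 'quarry'
REPLICA_DOMAIN: ''
"""

if __name__ == "__main__":
    print(find_duplicate_keys(SAMPLE))
```

Running something like this against the rendered config map, rather than only the source file, would also show whether the deployed copy still carries the duplicate.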

Ooh, so it is! I found that one, but it appears that helm is not updating it... or recreating it at all when the config map is removed...

Looks like ansible is the culprit, as invoking helm directly seems to restore it. And on to new errors with NFS. Thank you for getting me unblocked!

Looks like the web pod had some db connection issues a little after it started. Restarting seems to have cleared it, though let's see if it comes back.

[2023-11-01 12:20:32 +0000] [1] [INFO] Starting gunicorn 21.2.0
[2023-11-01 12:20:32 +0000] [1] [INFO] Listening at: http://0.0.0.0:5000 (1)
[2023-11-01 12:20:32 +0000] [1] [INFO] Using worker: sync
[2023-11-01 12:20:32 +0000] [10] [INFO] Booting worker with pid: 10
[2023-11-01 12:20:32 +0000] [11] [INFO] Booting worker with pid: 11
[2023-11-01 13:09:26 +0000] [11] [ERROR] Error handling request /
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 756, in _write_bytes
    self._sock.sendall(data)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1772, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 148, in execute
    result = self._query(query)
  File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 310, in _query
    conn.query(q)
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 547, in query
    self._execute_command(COMMAND.COM_QUERY, sql)
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 814, in _execute_command
    self._write_bytes(packet)
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 760, in _write_bytes
    CR.CR_SERVER_GONE_ERROR, "MySQL server has gone away (%r)" % (e,)
pymysql.err.OperationalError: (2006, "MySQL server has gone away (ConnectionResetError(104, 'Connection reset by peer'))")

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 135, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 178, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2464, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2450, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1867, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/quarry/web/app.py", line 82, in index
    stats_count_users=global_conn.session.query(User).count(),
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3091, in count
    return self._from_self(col).enable_eagerloads(False).scalar()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2832, in scalar
    ret = self.one()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2809, in one
    return self._iter().one()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2850, in _iter
    execution_options={"_sa_orm_load_options": self.load_options},
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1689, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1583, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 324, in _execute_on_connection
    self, multiparams, params, execution_options
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1462, in _execute_clauseelement
    cache_hit=cache_hit,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1815, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1996, in _handle_dbapi_exception
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1772, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 148, in execute
    result = self._query(query)
  File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 310, in _query
    conn.query(q)
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 547, in query
    self._execute_command(COMMAND.COM_QUERY, sql)
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 814, in _execute_command
    self._write_bytes(packet)
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 760, in _write_bytes
    CR.CR_SERVER_GONE_ERROR, "MySQL server has gone away (%r)" % (e,)
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (ConnectionResetError(104, 'Connection reset by peer'))")
[SQL: SELECT count(*) AS count_1 
FROM (SELECT user.id AS user_id, user.username AS user_username, user.wiki_uid AS user_wiki_uid 
FROM user) AS anon_1]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
[2023-11-02 10:46:48 +0000] [11] [ERROR] Error handling request /
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 135, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 178, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2464, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2450, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1867, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/quarry/web/app.py", line 82, in index
    stats_count_users=global_conn.session.query(User).count(),
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3091, in count
    return self._from_self(col).enable_eagerloads(False).scalar()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2832, in scalar
    ret = self.one()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2809, in one
    return self._iter().one()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2850, in _iter
    execution_options={"_sa_orm_load_options": self.load_options},
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1689, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1583, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 324, in _execute_on_connection
    self, multiparams, params, execution_options
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1462, in _execute_clauseelement
    cache_hit=cache_hit,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1669, in _execute_context
    conn = self._revalidate_connection()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 560, in _revalidate_connection
    self._invalid_transaction()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 540, in _invalid_transaction
    code="8s2b",
sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b)
[2023-11-02 17:00:43 +0000] [11] [ERROR] Error handling request /
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 135, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 178, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2464, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2450, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1867, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/app/quarry/web/app.py", line 82, in index
    stats_count_users=global_conn.session.query(User).count(),
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3091, in count
    return self._from_self(col).enable_eagerloads(False).scalar()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2832, in scalar
    ret = self.one()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2809, in one
    return self._iter().one()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2850, in _iter
    execution_options={"_sa_orm_load_options": self.load_options},
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1689, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1583, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 324, in _execute_on_connection
    self, multiparams, params, execution_options
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1462, in _execute_clauseelement
    cache_hit=cache_hit,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1669, in _execute_context
    conn = self._revalidate_connection()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 560, in _revalidate_connection
    self._invalid_transaction()
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 540, in _invalid_transaction
    code="8s2b",
sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b)

Looks like it happened again:

[2023-11-02 17:02:06 +0000] [11] [INFO] Booting worker with pid: 11
[2023-11-02 23:03:43 +0000] [11] [ERROR] Error handling request /
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 756, in _write_bytes
    self._sock.sendall(data)
ConnectionResetError: [Errno 104] Connection reset by peer

About six hours after startup this time.

Perhaps gunicorn maintains connections differently than uwsgi?

The gunicorn migration sounds like an unlikely culprit, since it's the db connections referenced here - which are managed by pymysql in any case.
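
If the resets come from pooled connections sitting idle until the server drops them (not confirmed in this thread), a commonly suggested mitigation is to set SQLAlchemy's pool_pre_ping and pool_recycle options on the engine. The sketch below only builds those keyword arguments. The helper name engine_options and the timeout values are assumptions, since the database's actual idle timeout is not stated here.

```python
def engine_options(server_idle_timeout_s, safety_margin_s=60):
    """Build keyword arguments for sqlalchemy.create_engine().

    pool_pre_ping tests a pooled connection before handing it out;
    pool_recycle replaces connections older than the given number of seconds.
    """
    if server_idle_timeout_s <= safety_margin_s:
        raise ValueError("idle timeout must exceed the safety margin")
    return {
        "pool_pre_ping": True,
        # Recycle slightly before the server would drop the connection.
        "pool_recycle": server_idle_timeout_s - safety_margin_s,
    }


# Usage (not executed here; db_url and the 3600 s timeout are placeholders):
#   from sqlalchemy import create_engine
#   engine = create_engine(db_url, **engine_options(3600))
```

The PendingRollbackError entries in the log above are what SQLAlchemy reports when a session is reused after a failed statement without a rollback, so calling session.rollback() in the request error path may also be worth checking.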

Did I install the db correctly with:

mysql -uquarry -h ythbv6lgsvs.svc.trove.eqiad1.wikimedia.cloud -p < schema.sql

?

NFS appears to cause a permissions issue: NFS creates files and directories as nfsmanager/498, whereas quarry is trying to create files as quarry/999.
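
A quick way to confirm this kind of uid mismatch from inside the pod is to compare the owner of the mounted directory with the uid the process runs as. This is a minimal sketch; the helper name owner_mismatch is hypothetical, and a temporary directory stands in for the NFS mount, whose real path is not given in this thread.

```python
import os
import tempfile


def owner_mismatch(path, expected_uid):
    """Return (actual_uid, expected_uid) if the owner differs, else None."""
    actual_uid = os.stat(path).st_uid
    if actual_uid != expected_uid:
        return (actual_uid, expected_uid)
    return None


if __name__ == "__main__":
    # A freshly created temporary directory stands in for the NFS mount.
    with tempfile.TemporaryDirectory() as scratch:
        print(owner_mismatch(scratch, os.geteuid()))
```

Pointed at the real mount from the quarry container, a result such as (498, 999) would match the nfsmanager versus quarry ownership described above.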

@SD0001 @Audiodude could yinz take a look at quarry-test.wmcloud.org and see if there are any obvious problems? The data is all a duplicate of the production quarry, though now everything is running in k8s (except for NFS and trove, which are also separate instances from the current production quarry). If it seems alright we can probably cut over next week.

Had a brief look and it looks good to me. Thanks!

I'm no longer able to invoke kubectl from quarry-bastion, though. It says "Unable to connect to the server: dial tcp 172.16.4.237:6443: connect: no route to host".
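
To separate a network-level failure from a kubectl or credentials problem, a plain TCP connection attempt to the API server address can help. This is a minimal sketch using only the standard library; the helper name can_connect is hypothetical.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return (True, None) if a TCP connection succeeds, else (False, error text)."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, None
    except OSError as exc:
        # Covers timeouts, "connection refused" and "no route to host".
        return False, str(exc)


# Usage from the bastion (address and port taken from the error message above):
#   print(can_connect("172.16.4.237", 6443))
```

If this also fails with "no route to host", the address in the kubeconfig most likely no longer points at a reachable cluster, independent of the credentials.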

Oh there's a new k8s cluster for it. I've put the updated config in /opt/quarry-123-2.config

I might update to a newer k8s before deploying, as there is one more version we can upgrade to on this version of openstack. So it might change again.

I've opened T360233 to manage the tofu state in an object store. That way things shouldn't be stored locally, and tofu/deploy.sh can be run from wherever in the bastion.

Quarry is now on kubernetes.