This should probably be done in tofu.
Description
| Status | Assigned | Task |
|---|---|---|
| Resolved | taavi | T349605 Remove quarry.wsgi on move to k8s |
| Duplicate | None | T349690 Move away from nfs? |
| Resolved | rook | T349029 Move quarry to magnum |
| Resolved | rook | T364753 remove buster systems |
| Resolved | rook | T361470 Shutdown quarry VMs |
| Resolved | rook | T360233 store quarry state in object storage |
| Resolved | rook | T349032 Deploy magnum cluster for quarry |
Event Timeline
https://quarry-test.wmcloud.org offers a running, but not working, quarry on k8s. When I run a query it is giving:
Can't connect to MySQL server on 'enwiki' ([Errno -2] Name or service not known)
Presumably something is failing to connect to the replica. On a second run of a query, though, the favicon changes to "working" while the status text does not, and the query never seems to complete. So probably something else is up as well.
Neither kubectl logs -f deployment.apps/worker nor kubectl logs -f deployment.apps/web seems to show anything.
I installed with:
bash deploy.sh
Followed by:
mysql -uquarry -h ythbv6lgsvs.svc.trove.eqiad1.wikimedia.cloud -p < schema.sql
Maybe more setup is needed? Seemed to be enough on the minikube setup.
Long and short of it: @SD0001 @Audiodude, I could use help troubleshooting.
k8s creds are on the quarry-bastion node in /opt/quarry-123-1.config
db creds are in the git branch (T349032) helm-quarry/prod-config.yaml
@rook This is due to misconfigured db config. I can see config.yaml has REPLICA_DOMAIN: '' which could be overriding the valid value provided a few lines above it.
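A quick way to check for this kind of override is to scan the config for keys that appear more than once. Many YAML loaders (PyYAML among them) silently keep the last value for a repeated key, so a trailing REPLICA_DOMAIN: '' would win. The sketch below is stdlib-only and assumes a flat file of top-level KEY: value lines; the sample fragment and its values are made up for illustration and are not the real config.yaml.

```python
import re
from collections import defaultdict


def find_duplicate_keys(yaml_text):
    """Return {key: [line numbers]} for top-level keys that appear more than once."""
    seen = defaultdict(list)
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        # Match only unindented "KEY:" lines; comments and nested entries are skipped.
        match = re.match(r"^([A-Za-z_][A-Za-z0-9_]*)\s*:", line)
        if match:
            seen[match.group(1)].append(lineno)
    return {key: lines for key, lines in seen.items() if len(lines) > 1}


# Hypothetical fragment: the first value is a placeholder, not the real domain,
# and SOME_OTHER_SETTING is an invented key.
sample = """\
REPLICA_DOMAIN: 'replica.example'
SOME_OTHER_SETTING: 'value'
REPLICA_DOMAIN: ''
"""

duplicates = find_duplicate_keys(sample)
print(duplicates)
```

Running this against the rendered config map rather than the source file would also show whether the deploy tooling is the one reintroducing the empty value.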
ooo so it is! I found that one, but it would appear that helm is not updating it... or putting it back at all when the config map is removed...
Looks like ansible is the culprit, as invoking helm directly seems to get it back. And on to new errors with NFS. Thank you for getting me unblocked!
Looks like the web pod had some db connection issues a little after it started. Restarting seems to have cleared it, though let's see if it comes back.
[2023-11-01 12:20:32 +0000] [1] [INFO] Starting gunicorn 21.2.0
[2023-11-01 12:20:32 +0000] [1] [INFO] Listening at: http://0.0.0.0:5000 (1)
[2023-11-01 12:20:32 +0000] [1] [INFO] Using worker: sync
[2023-11-01 12:20:32 +0000] [10] [INFO] Booting worker with pid: 10
[2023-11-01 12:20:32 +0000] [11] [INFO] Booting worker with pid: 11
[2023-11-01 13:09:26 +0000] [11] [ERROR] Error handling request /
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 756, in _write_bytes
self._sock.sendall(data)
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1772, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
cursor.execute(statement, parameters)
File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 148, in execute
result = self._query(query)
File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 310, in _query
conn.query(q)
File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 547, in query
self._execute_command(COMMAND.COM_QUERY, sql)
File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 814, in _execute_command
self._write_bytes(packet)
File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 760, in _write_bytes
CR.CR_SERVER_GONE_ERROR, "MySQL server has gone away (%r)" % (e,)
pymysql.err.OperationalError: (2006, "MySQL server has gone away (ConnectionResetError(104, 'Connection reset by peer'))")
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 135, in handle
self.handle_request(listener, req, client, addr)
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 178, in handle_request
respiter = self.wsgi(environ, resp.start_response)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2464, in __call__
return self.wsgi_app(environ, start_response)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2450, in wsgi_app
response = self.handle_exception(e)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1867, in handle_exception
reraise(exc_type, exc_value, tb)
File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/app/quarry/web/app.py", line 82, in index
stats_count_users=global_conn.session.query(User).count(),
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3091, in count
return self._from_self(col).enable_eagerloads(False).scalar()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2832, in scalar
ret = self.one()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2809, in one
return self._iter().one()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2850, in _iter
execution_options={"_sa_orm_load_options": self.load_options},
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1689, in execute
result = conn._execute_20(statement, params or {}, execution_options)
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1583, in _execute_20
return meth(self, args_10style, kwargs_10style, execution_options)
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 324, in _execute_on_connection
self, multiparams, params, execution_options
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1462, in _execute_clauseelement
cache_hit=cache_hit,
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1815, in _execute_context
e, statement, parameters, cursor, context
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1996, in _handle_dbapi_exception
sqlalchemy_exception, with_traceback=exc_info[2], from_=e
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
raise exception
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1772, in _execute_context
cursor, statement, parameters, context
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 717, in do_execute
cursor.execute(statement, parameters)
File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 148, in execute
result = self._query(query)
File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 310, in _query
conn.query(q)
File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 547, in query
self._execute_command(COMMAND.COM_QUERY, sql)
File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 814, in _execute_command
self._write_bytes(packet)
File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 760, in _write_bytes
CR.CR_SERVER_GONE_ERROR, "MySQL server has gone away (%r)" % (e,)
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (ConnectionResetError(104, 'Connection reset by peer'))")
[SQL: SELECT count(*) AS count_1
FROM (SELECT user.id AS user_id, user.username AS user_username, user.wiki_uid AS user_wiki_uid
FROM user) AS anon_1]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
[2023-11-02 10:46:48 +0000] [11] [ERROR] Error handling request /
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 135, in handle
self.handle_request(listener, req, client, addr)
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 178, in handle_request
respiter = self.wsgi(environ, resp.start_response)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2464, in __call__
return self.wsgi_app(environ, start_response)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2450, in wsgi_app
response = self.handle_exception(e)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1867, in handle_exception
reraise(exc_type, exc_value, tb)
File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/app/quarry/web/app.py", line 82, in index
stats_count_users=global_conn.session.query(User).count(),
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3091, in count
return self._from_self(col).enable_eagerloads(False).scalar()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2832, in scalar
ret = self.one()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2809, in one
return self._iter().one()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2850, in _iter
execution_options={"_sa_orm_load_options": self.load_options},
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1689, in execute
result = conn._execute_20(statement, params or {}, execution_options)
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1583, in _execute_20
return meth(self, args_10style, kwargs_10style, execution_options)
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 324, in _execute_on_connection
self, multiparams, params, execution_options
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1462, in _execute_clauseelement
cache_hit=cache_hit,
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1669, in _execute_context
conn = self._revalidate_connection()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 560, in _revalidate_connection
self._invalid_transaction()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 540, in _invalid_transaction
code="8s2b",
sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b)
[2023-11-02 17:00:43 +0000] [11] [ERROR] Error handling request /
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 135, in handle
self.handle_request(listener, req, client, addr)
File "/usr/local/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 178, in handle_request
respiter = self.wsgi(environ, resp.start_response)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2464, in __call__
return self.wsgi_app(environ, start_response)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2450, in wsgi_app
response = self.handle_exception(e)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1867, in handle_exception
reraise(exc_type, exc_value, tb)
File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/app/quarry/web/app.py", line 82, in index
stats_count_users=global_conn.session.query(User).count(),
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3091, in count
return self._from_self(col).enable_eagerloads(False).scalar()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2832, in scalar
ret = self.one()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2809, in one
return self._iter().one()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 2850, in _iter
execution_options={"_sa_orm_load_options": self.load_options},
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1689, in execute
result = conn._execute_20(statement, params or {}, execution_options)
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1583, in _execute_20
return meth(self, args_10style, kwargs_10style, execution_options)
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 324, in _execute_on_connection
self, multiparams, params, execution_options
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1462, in _execute_clauseelement
cache_hit=cache_hit,
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1669, in _execute_context
conn = self._revalidate_connection()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 560, in _revalidate_connection
self._invalid_transaction()
File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 540, in _invalid_transaction
code="8s2b",
sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back. (Background on this error at: https://sqlalche.me/e/14/8s2b)
Looks like it happened again:
[2023-11-02 17:02:06 +0000] [11] [INFO] Booting worker with pid: 11
[2023-11-02 23:03:43 +0000] [11] [ERROR] Error handling request /
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 756, in _write_bytes
self._sock.sendall(data)
ConnectionResetError: [Errno 104] Connection reset by peer
About six hours after startup this time.
The gunicorn migration sounds like an unlikely culprit, since it's the db connections referenced here - which are managed by pymysql in any case.
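Reading the traces above, there seem to be two stages: first a pooled connection that has sat idle gets reset by the server (error 2006), then every later request fails with PendingRollbackError because the session was never rolled back. A rough sketch of the usual mitigations follows. pool_pre_ping and pool_recycle are real sqlalchemy.create_engine parameters, but whether quarry's global_conn is set up in a way that can take them is an assumption, the one-hour recycle value is a guess, and FakeSession is a stand-in so the example runs without a database.

```python
# Options that could be passed as sqlalchemy.create_engine(url, **ENGINE_OPTIONS).
ENGINE_OPTIONS = {
    "pool_pre_ping": True,  # test each pooled connection on checkout, replace dead ones
    "pool_recycle": 3600,   # assumed value: retire connections older than one hour
}


def run_with_rollback(session, work):
    """Run work(session); roll the session back on failure so it stays usable."""
    try:
        return work(session)
    except Exception:
        # Without this rollback, later queries on the same session would raise
        # PendingRollbackError, as in the second and third traces.
        session.rollback()
        raise


class FakeSession:
    """Stand-in for a SQLAlchemy session, used only to demonstrate the wrapper."""

    def __init__(self):
        self.rolled_back = False

    def rollback(self):
        self.rolled_back = True


def failing_query(session):
    # Simulates the dropped connection seen in the log.
    raise ConnectionResetError(104, "Connection reset by peer")


session = FakeSession()
try:
    run_with_rollback(session, failing_query)
except ConnectionResetError:
    pass
print(session.rolled_back)
```

The pre-ping option addresses the first stage and the rollback wrapper the second, so either alone would likely leave part of the problem in place.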
Did I install the db correctly with:
mysql -uquarry -h ythbv6lgsvs.svc.trove.eqiad1.wikimedia.cloud -p < schema.sql
?
NFS appears to result in a permissions issue as nfs is creating files and directories as nfsmanager/498 where quarry is trying to create files as quarry/999
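To confirm the mismatch from inside a pod, something like the following could be run against the mounted directory. It is a stdlib-only sketch for Unix systems that compares the owner of a path with the uid/gid of the running process; the temporary directory here only stands in for the real NFS mount, whose path is not given in this task.

```python
import os
import tempfile


def ownership_report(path):
    """Compare the owner of path with the uid/gid of the current process."""
    info = os.stat(path)
    return {
        "path": path,
        "owner_uid": info.st_uid,
        "owner_gid": info.st_gid,
        "process_uid": os.geteuid(),
        "process_gid": os.getegid(),
        # False here would correspond to the nfsmanager/498 vs quarry/999 situation.
        "uid_matches": info.st_uid == os.geteuid(),
        "writable": os.access(path, os.W_OK),
    }


# Demonstration on a temporary directory created by this process;
# replace tmp with the NFS mount point when checking a real pod.
with tempfile.TemporaryDirectory() as tmp:
    report = ownership_report(tmp)
    print(report)
```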
@SD0001 @Audiodude could yinz take a look at quarry-test.wmcloud.org and see if there are any obvious problems? The data is all a duplicate of the production quarry, though now everything is running in k8s (except for nfs and trove, which are also separate instances from the ones used by the current production quarry). If it seems alright we can probably cut over next week.
Had a brief look and it looks good to me. Thanks!
I'm no longer able to invoke kubectl from quarry-bastion, though. It says Unable to connect to the server: dial tcp 172.16.4.237:6443: connect: no route to host.
Oh there's a new k8s cluster for it. I've put the updated config in /opt/quarry-123-2.config
I might update to a newer k8s before deploying, as there is one more version we can upgrade to on this version of openstack. So it might change again.
I've opened T360233 to manage the tofu state in an object store. That way things shouldn't be stored locally, and tofu/deploy.sh can be run from wherever in the bastion.
Mentioned in SAL (#wikimedia-cloud) [2024-04-01T12:48:21Z] <Rook> quarry moved to k8s T349032