
Migrate 'Quarry' project to eqiad1
Closed, Resolved (Public)

Description

I'm not sure if we should do the extra work to make this move without downtime... @zhuyifei1999 how big of a deal would it be to build a duplicate cluster in eqiad1-r and switch traffic over there rather than stopping and copying each host?

Event Timeline

Andrew triaged this task as Medium priority.

The main thing would be copying over the db and creating the config file (containing the secrets). Everything else should be automatic.

However, I think it should be fine to have downtime for a machine copy to the region, as long as we put up a warning message beforehand. @Framawiki any opinions?

I see:

  • Set up a maintenance message a few days beforehand (a well-named variable in /srv/quarry/config.yaml, then restart the web service).
  • When no queries are running or pending (check the recent queries page on the site, then verify that the workers are idle by running, from /srv/quarry, sudo /srv/quarry/venv/bin/celery -A quarry.web.worker inspect active), stop the celery service on the workers, then lock the MySQL db (@zhuyifei1999 can you add the SQL query that you used in the last deployment window to the doc?).
  • If you're building new instances from scratch, take a look at the logs from the last window, particularly uid and chown (and of course please log everything interesting in SAL :) ).

Everything else should be handled by puppet.
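
The "workers are idle" check in the steps above could be scripted along these lines. This is only a sketch: it assumes the data comes from Celery's app.control.inspect().active(), which returns a mapping of worker name to running tasks (or None when no worker replies), and the worker names below are made up for illustration.

```python
from typing import Dict, List, Optional


def workers_idle(active: Optional[Dict[str, List[dict]]]) -> bool:
    """Return True only if at least one worker replied and none has active tasks.

    `active` is the mapping returned by Celery's `app.control.inspect().active()`:
    worker name -> list of running tasks, or None when no worker replied.
    """
    if not active:
        # None or {} means no worker answered, which is not proof of idleness.
        return False
    return all(len(tasks) == 0 for tasks in active.values())


if __name__ == "__main__":
    # Hypothetical replies; the worker names are made up for illustration.
    print(workers_idle({"celery@worker-a": [], "celery@worker-b": []}))
    print(workers_idle({"celery@worker-a": [{"id": "abc"}]}))
    print(workers_idle(None))
```

Treating "no reply" as not idle is a deliberate choice here: an unreachable worker might still be running a query.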

@zhuyifei1999 can you add the SQL query that you used in the last deployment window to the doc?

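-- Before the copy: block writes and make the server read-only
FLUSH TABLES WITH READ LOCK;
SET GLOBAL read_only = 1;
-- After the copy: allow writes again and release the lock
SET GLOBAL read_only = 0;
UNLOCK TABLES;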
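
If the lock and unlock were ever scripted, one way to keep the two halves paired is a context manager. The sketch below is an illustration, not what was run during the window: read_only_window and RecordingCursor are hypothetical names, and the cursor is assumed to be any DB-API style object with an execute() method.

```python
from contextlib import contextmanager


@contextmanager
def read_only_window(cursor):
    """Hold the database read-only while the body runs.

    `cursor` is any DB-API style cursor with an execute() method. It must stay
    open for the whole window: FLUSH TABLES WITH READ LOCK is tied to the
    session that issued it.
    """
    cursor.execute("FLUSH TABLES WITH READ LOCK")
    cursor.execute("SET GLOBAL read_only = 1")
    try:
        yield
    finally:
        # Undo in reverse order, even if the copy step raised.
        cursor.execute("SET GLOBAL read_only = 0")
        cursor.execute("UNLOCK TABLES")


class RecordingCursor:
    """Stand-in cursor for a dry run; it records statements instead of running them."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)


if __name__ == "__main__":
    cur = RecordingCursor()
    with read_only_window(cur):
        pass  # the machine copy would happen here
    print("\n".join(cur.statements))
```

The dry run prints the four statements in the same order as above. With a real connection, the session has to stay open for the whole window, since closing it releases the read lock.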

Can I ask one of you to put up the maintenance message and suggest a window for this move? Anytime during US work hours (let's say after 14:00 UTC) will suit me. Thank you!

OK, I'll go first :) How about if we schedule downtime for Monday the 5th?

That should work, though I will probably be off 10-11 AM, 1-2 PM, and 3-5 PM Central Time.

How about noon CST on that Monday? (that's probably 17:00 UTC although that week is the week-of-timezone-slip so I can't make any promises)
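
For reference, the conversion can be checked with fixed UTC offsets. US daylight saving time ended on Sunday, November 4, 2018, so on Monday the 5th Central time was CST (UTC-6) rather than CDT (UTC-5). The helper name to_utc is just for this sketch.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, written out so the example does not depend on a tz database.
CST = timezone(timedelta(hours=-6), "CST")  # Central Standard Time
CDT = timezone(timedelta(hours=-5), "CDT")  # Central Daylight Time


def to_utc(year, month, day, hour, tz):
    """Convert a wall-clock hour in the given fixed-offset zone to UTC."""
    return datetime(year, month, day, hour, tzinfo=tz).astimezone(timezone.utc)


if __name__ == "__main__":
    print(to_utc(2018, 11, 5, 12, CDT).strftime("%H:%M UTC"))  # noon under daylight time
    print(to_utc(2018, 11, 5, 12, CST).strftime("%H:%M UTC"))  # noon under standard time
    print(to_utc(2018, 11, 5, 11, CST).strftime("%H:%M UTC"))  # 11 AM under standard time
```

Noon is 17:00 UTC only under daylight time. Under standard time, noon is 18:00 UTC and 17:00 UTC corresponds to 11 AM.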

11AM-1PM sounds good to me. @Framawiki is it okay for you? (if not, are you fine with me handling it?)

@zhuyifei1999 or @Framawiki, can one of you announce this downtime to interested parties? Or at least rattle off a list of contacts here so I can do that?

I added a maintenance message:

MAINTENANCE_MSG: 'Quarry will be down for maintenance on Monday, November 5 at 5 PM UTC. (T207677)'

I can't be sure I'll be present, but at worst I'm certain @zhuyifei1999 will be able to handle it :p
Thank you both!

Mentioned in SAL (#wikimedia-cloud) [2018-11-05T17:24:24Z] <zhuyifei1999_> shutting down all workers sudo kill -TERM $(systemctl show -p MainPID celery-quarry-worker.service | cut -d= -f2) T207677
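
The shell one-liner in this log entry reads the unit's MainPID from systemctl and sends it SIGTERM, which Celery treats as a warm shutdown (running tasks are allowed to finish). A rough Python equivalent, as a sketch only: parse_main_pid and terminate_unit are hypothetical helper names, and actually sending the signal requires sufficient privileges.

```python
import os
import signal
import subprocess


def parse_main_pid(show_output: str) -> int:
    """Extract the PID from `systemctl show -p MainPID` output such as 'MainPID=1234'."""
    key, _, value = show_output.strip().partition("=")
    if key != "MainPID":
        raise ValueError("unexpected systemctl output: %r" % show_output)
    return int(value)


def terminate_unit(unit: str) -> int:
    """Send SIGTERM to the main process of a systemd unit and return its PID."""
    out = subprocess.run(
        ["systemctl", "show", "-p", "MainPID", unit],
        check=True, capture_output=True, text=True,
    ).stdout
    pid = parse_main_pid(out)
    if pid == 0:
        # systemd reports MainPID=0 when the unit has no running main process.
        raise RuntimeError("%s is not running" % unit)
    os.kill(pid, signal.SIGTERM)
    return pid
```

Calling terminate_unit("celery-quarry-worker.service") would mirror the logged command. The MainPID=0 guard matters because signalling PID 0 would target the caller's own process group.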

Change 471770 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Horizon: move quarry project to eqiad1

https://gerrit.wikimedia.org/r/471770

Change 471770 merged by Andrew Bogott:
[operations/puppet@production] Horizon: move quarry project to eqiad1

https://gerrit.wikimedia.org/r/471770

Mentioned in SAL (#wikimedia-cloud) [2018-11-05T17:40:15Z] <zhuyifei1999_> set db read-only FLUSH TABLES WITH READ LOCK; SET GLOBAL read_only = 1; T207677

Mentioned in SAL (#wikimedia-cloud) [2018-11-05T18:22:28Z] <zhuyifei1999_> unset db read-only SET GLOBAL read_only = 0; UNLOCK TABLES; T207677

This happened after the restore:

Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]: [2018-11-05 18:25:17,158] ERROR in app: Exception on /api/query/run [POST]
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]: Traceback (most recent call last):
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/app.py", line 2292, in wsgi_app
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     response = self.full_dispatch_request()
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/app.py", line 1815, in full_dispatch_request
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     rv = self.handle_user_exception(e)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/app.py", line 1718, in handle_user_exception
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     reraise(exc_type, exc_value, tb)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/_compat.py", line 35, in reraise
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     raise value
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/app.py", line 1813, in full_dispatch_request
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     rv = self.dispatch_request()
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/app.py", line 1799, in dispatch_request
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self.view_functions[rule.endpoint](**req.view_args)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "./quarry/web/app.py", line 236, in api_run_query
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     if not result.ready():
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/result.py", line 311, in ready
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self.state in self.backend.READY_STATES
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/result.py", line 471, in state
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self._get_task_meta()['status']
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/result.py", line 410, in _get_task_meta
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self._maybe_set_cache(self.backend.get_task_meta(self.id))
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/backends/base.py", line 365, in get_task_meta
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     meta = self._get_task_meta_for(task_id)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/backends/base.py", line 680, in _get_task_meta_for
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self.decode_result(meta)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/backends/base.py", line 284, in decode_result
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self.meta_from_decoded(self.decode(payload))
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/backends/base.py", line 280, in meta_from_decoded
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     meta['result'] = self.exception_to_python(meta['result'])
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/backends/base.py", line 260, in exception_to_python
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     exc = cls(*exc_msg if isinstance(exc_msg, tuple) else exc_msg)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]: TypeError: __init__() missing 2 required positional arguments: 'params' and 'orig'

Not sure what is going on. It isn't reproducible by creating a new query. Worked around by running FLUSHALL in redis-cli.
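
The thread leaves the cause open. One possible reading of the last frame is that a stored task result described an exception whose constructor needs more positional arguments than were saved with it, so rebuilding it fails. The sketch below reproduces that class of failure in isolation. StatementLikeError and rebuild_exception are hypothetical stand-ins, not Celery or Quarry code, and nothing here confirms this is what actually happened.

```python
class StatementLikeError(Exception):
    """Hypothetical exception whose constructor needs four positional arguments."""

    def __init__(self, message, statement, params, orig):
        super().__init__(message)
        self.statement = statement
        self.params = params
        self.orig = orig


def rebuild_exception(cls, exc_msg):
    """Simplified stand-in for the failing step: call the class with the stored args."""
    args = exc_msg if isinstance(exc_msg, tuple) else (exc_msg,)
    return cls(*args)


if __name__ == "__main__":
    # Only two of the four constructor arguments survived in the stored result.
    stored = ("query failed", "SELECT 1")
    try:
        rebuild_exception(StatementLikeError, stored)
    except TypeError as err:
        print("TypeError:", err)
```

If something like this was the cause, it would fit the observation that new queries were unaffected: only a result already stored in Redis would trigger it, and FLUSHALL deletes every key in Redis, stale results included.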