I'm not sure if we should do the extra work to make this move without downtime... @zhuyifei1999 how big of a deal would it be to build a duplicate cluster in eqiad1-r and switch traffic over there rather than stopping and copying each host?
Description
Details
Project | Branch | Lines +/- | Subject
---|---|---|---
operations/puppet | production | +1 -1 | Horizon: move quarry project to eqiad1
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T53494 Use Beta cluster as a true canary for code deployments (epic)
Open | | None | T87220 Minimize infrastructure differences between Beta Cluster and production
Open | | None | T196662 Set up LVS in beta like prod
Resolved | | bd808 | T166396 Program 1 Outcome 4: VPS hosting
Resolved | | None | T167293 Nova-network to Neutron migration
Resolved | | Andrew | T204745 cloudvps: migrate projects from main to eqiad1
Resolved | | None | T207677 Migrate 'Quarry' project to eqiad1
Event Timeline
The main thing would be copying over the db and creating the config file (containing the secrets). Everything else should be automatic.
However, I think it should be fine to have downtime for a machine copy to the region as long as we post a warning message beforehand. @Framawiki any opinions?
I see:
- Set up a maintenance message a few days beforehand (a well-named variable in /srv/quarry/config.yaml, then restart the web service).
- When no queries are running or pending (check the recent queries page on the site, then verify that the workers are idle by running sudo /srv/quarry/venv/bin/celery -A quarry.web.worker inspect active from /srv/quarry), stop the celery service on the workers, then lock the MySQL db (@zhuyifei1999 can you add the SQL query that you used in the last deployment window to the doc?).
- If you're building new instances from scratch, take a look at the logs from the last window, particularly uid and chown (and of course please log everything interesting in SAL :) ).
Everything else should be handled by puppet.
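The "workers are idle" check in the list above could also be scripted. This is only a minimal sketch: it assumes the mapping comes from Celery's `inspect().active()` call (worker name to list of active tasks, or `None` when no worker replies), and the function name `workers_idle` and the worker names are hypothetical.

```python
from typing import Mapping, Optional, Sequence


def workers_idle(active: Optional[Mapping[str, Sequence[dict]]]) -> bool:
    """Return True only if at least one worker replied and none has an active task.

    `active` is assumed to have the shape returned by
    `<celery app>.control.inspect().active()`.
    """
    if not active:
        # None or {} means no worker answered: the state is unknown, not idle.
        return False
    return all(len(tasks) == 0 for tasks in active.values())


if __name__ == "__main__":
    # Hypothetical replies, for illustration only.
    print(workers_idle({"celery@worker-a": [], "celery@worker-b": []}))
    print(workers_idle({"celery@worker-a": [{"id": "abc"}]}))
```

Treating "no reply" as not idle is a deliberate choice here: it avoids stopping the service just because the inspect call timed out.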
FLUSH TABLES WITH READ LOCK; SET GLOBAL read_only = 1;
SET GLOBAL read_only = 0; UNLOCK TABLES;
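For reference, the two statement pairs above could be wrapped so the unlock always runs, even if the copy step fails. This is a rough sketch, not what was used in the window: `read_only_window` and `RecordingCursor` are hypothetical names, and the cursor is assumed to be any DB-API style object with an `execute` method. Note that the lock from FLUSH TABLES WITH READ LOCK is only held while the session that issued it stays open, so the same connection has to live for the whole window.

```python
from contextlib import contextmanager

# Statements taken from the comment above, in the same order.
LOCK_STATEMENTS = ("FLUSH TABLES WITH READ LOCK", "SET GLOBAL read_only = 1")
UNLOCK_STATEMENTS = ("SET GLOBAL read_only = 0", "UNLOCK TABLES")


@contextmanager
def read_only_window(cursor):
    """Lock the database on entry and always unlock on exit."""
    try:
        for statement in LOCK_STATEMENTS:
            cursor.execute(statement)
        yield cursor
    finally:
        # Runs even if locking or the body raised, so read_only is reset.
        for statement in UNLOCK_STATEMENTS:
            cursor.execute(statement)


class RecordingCursor:
    """Stand-in cursor that records statements instead of talking to a server."""

    def __init__(self):
        self.executed = []

    def execute(self, statement):
        self.executed.append(statement)


if __name__ == "__main__":
    cursor = RecordingCursor()
    with read_only_window(cursor):
        pass  # the instance copy would happen here
    print(cursor.executed)
```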
Can I ask one of you to put up the maintenance message and suggest a window for this move? Anytime during US work hours (let's say after 14:00 UTC) will suit me. Thank you!
That should work, though I will probably be off 10-11 AM, 1-2PM, & 3-5PM Central Time
How about noon CST on that Monday? (that's probably 17:00 UTC although that week is the week-of-timezone-slip so I can't make any promises)
11AM-1PM sounds good to me. @Framawiki is it okay for you? (if not, are you fine with me handling it?)
@zhuyifei1999 or @Framawiki, can one of you announce this downtime to interested parties? Or at least rattle off a list of contacts here so I can do that?
I added a maintenance message:
MAINTENANCE_MSG: 'Quarry will be down for maintenance on Monday, November 5 at 5 PM UTC. (T207677)'
I can't be sure I'll be present, but at worst I'm certain @zhuyifei1999 will be able to handle it :p
Thank you both!
Mentioned in SAL (#wikimedia-cloud) [2018-11-05T17:24:24Z] <zhuyifei1999_> shutting down all workers sudo kill -TERM $(systemctl show -p MainPID celery-quarry-worker.service | cut -d= -f2) T207677
Change 471770 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Horizon: move quarry project to eqiad1
Change 471770 merged by Andrew Bogott:
[operations/puppet@production] Horizon: move quarry project to eqiad1
Mentioned in SAL (#wikimedia-cloud) [2018-11-05T17:40:15Z] <zhuyifei1999_> set db read-only FLUSH TABLES WITH READ LOCK; SET GLOBAL read_only = 1; T207677
Mentioned in SAL (#wikimedia-cloud) [2018-11-05T18:22:28Z] <zhuyifei1999_> unset db read-only SET GLOBAL read_only = 0; UNLOCK TABLES; T207677
This happened after restore:
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]: [2018-11-05 18:25:17,158] ERROR in app: Exception on /api/query/run [POST]
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]: Traceback (most recent call last):
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/app.py", line 2292, in wsgi_app
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     response = self.full_dispatch_request()
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/app.py", line 1815, in full_dispatch_request
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     rv = self.handle_user_exception(e)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/app.py", line 1718, in handle_user_exception
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     reraise(exc_type, exc_value, tb)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/_compat.py", line 35, in reraise
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     raise value
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/app.py", line 1813, in full_dispatch_request
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     rv = self.dispatch_request()
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/app.py", line 1799, in dispatch_request
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self.view_functions[rule.endpoint](**req.view_args)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "./quarry/web/app.py", line 236, in api_run_query
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     if not result.ready():
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/result.py", line 311, in ready
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self.state in self.backend.READY_STATES
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/result.py", line 471, in state
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self._get_task_meta()['status']
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/result.py", line 410, in _get_task_meta
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self._maybe_set_cache(self.backend.get_task_meta(self.id))
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/backends/base.py", line 365, in get_task_meta
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     meta = self._get_task_meta_for(task_id)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/backends/base.py", line 680, in _get_task_meta_for
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self.decode_result(meta)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/backends/base.py", line 284, in decode_result
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self.meta_from_decoded(self.decode(payload))
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/backends/base.py", line 280, in meta_from_decoded
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     meta['result'] = self.exception_to_python(meta['result'])
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/backends/base.py", line 260, in exception_to_python
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     exc = cls(*exc_msg if isinstance(exc_msg, tuple) else exc_msg)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]: TypeError: __init__() missing 2 required positional arguments: 'params' and 'orig'
Not sure what is going on. It is not reproducible by creating a new query. Worked around by running FLUSHALL in redis-cli.
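One possible reading of the TypeError, offered as a guess rather than a diagnosis: the last frame rebuilds a stored exception by calling its class with the saved arguments, and that fails for any exception class whose constructor needs more than those arguments. The sketch below reproduces the same kind of failure in isolation. `StatementLikeError`, `rebuild` and `try_rebuild` are hypothetical names and are not taken from Quarry, Celery or any other library.

```python
class StatementLikeError(Exception):
    """Hypothetical exception whose constructor needs more than a message."""

    def __init__(self, message, params, orig):
        # Only `message` ends up in self.args; params and orig are kept aside.
        super().__init__(message)
        self.params = params
        self.orig = orig


def rebuild(cls, stored_args):
    """Recreate an exception from its stored args, as a naive deserializer might."""
    return cls(*stored_args)


def try_rebuild(cls, stored_args):
    """Return the rebuilt exception, or the TypeError raised while rebuilding."""
    try:
        return rebuild(cls, stored_args)
    except TypeError as exc:
        return exc


if __name__ == "__main__":
    original = StatementLikeError("boom", params={}, orig=ValueError("inner"))
    # original.args is ("boom",), so the rebuild lacks params and orig.
    print(repr(try_rebuild(StatementLikeError, original.args)))
```

If something like this is what happened, it would fit the workaround: a result stored before the move could not be decoded, and clearing redis removed it.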
Mentioned in SAL (#wikimedia-cloud) [2018-11-05T18:43:22Z] <framawiki> migration is over T207677