
Migrate 'Quarry' project to eqiad1
Closed, ResolvedPublic

Description

I'm not sure if we should do the extra work to make this move without downtime... @zhuyifei1999 how big of a deal would it be to build a duplicate cluster in eqiad1-r and switch traffic over there rather than stopping and copying each host?

Event Timeline

Andrew triaged this task as Normal priority. Oct 22 2018, 5:53 PM
Andrew created this task.
Restricted Application removed a project: Patch-For-Review. Oct 22 2018, 5:53 PM

The main thing would be copying over the db and creating the config file (containing the secrets). Everything else should be automatic.

However, I think it should be fine to have downtime for a machine copy to the region as long as we post a warning message beforehand. @Framawiki any opinions?

Framawiki added a comment. Edited Oct 23 2018, 4:58 PM

I see:

  • set up a maintenance message a few days beforehand (a well-named variable in /srv/quarry/config.yaml, then restart the web service)
  • when no queries are running or pending (check the recent queries page on the site, then verify that the workers are empty by running sudo /srv/quarry/venv/bin/celery -A quarry.web.worker inspect active in /srv/quarry), stop the celery service on the workers, then lock the mysql db (@zhuyifei1999 can you add the sql query that you used in the last deployment window to the doc?)
  • if you're building new instances from scratch, take a look at the logs from the last window, particularly uid and chown (of course, please log everything interesting in SAL :) )

Everything else should be handled by puppet.
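
For the "verify that workers are empty" step, here is a minimal sketch of the check in Python. It assumes the input has the shape Celery's `app.control.inspect().active()` returns (a dict of worker name to list of running tasks, or None when no worker replies). The function name and the worker names in the usage lines are made up for illustration. It only covers running tasks; pending queries still need the recent queries page.

```python
def workers_idle(active):
    """Return True only if every responding worker reports no running task.

    `active` is assumed to look like the result of Celery's
    `app.control.inspect().active()`: a dict mapping worker names to lists
    of task descriptions, or None when no worker replied.
    """
    if not active:
        # No reply (None or empty dict): we cannot claim the workers are idle.
        return False
    return all(len(tasks) == 0 for tasks in active.values())


# Hypothetical worker names, for illustration only.
idle = workers_idle({"celery@worker-a": [], "celery@worker-b": []})
busy = workers_idle({"celery@worker-a": [{"id": "example-task"}], "celery@worker-b": []})
```

Treating "no reply" as not idle is deliberate: a worker that does not answer may still be executing a query.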

@zhuyifei1999 can you add the sql query that you used in the last deployment window to the doc?

FLUSH TABLES WITH READ LOCK;
SET GLOBAL read_only = 1;
SET GLOBAL read_only = 0;
UNLOCK TABLES;
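
The four statements come in two pairs: the first two before the copy, the last two after it. A sketch of that ordering as a Python context manager, assuming a DB-API style cursor with an `execute` method. `read_only_window` and `RecordingCursor` are hypothetical names; the recording cursor is just a stand-in so the sketch runs without a database. Note that the global read lock belongs to the session that took it, so the same connection has to stay open for as long as the lock is wanted.

```python
from contextlib import contextmanager


@contextmanager
def read_only_window(cursor):
    """Hold the database read-only for the duration of the `with` block."""
    try:
        cursor.execute("FLUSH TABLES WITH READ LOCK")
        cursor.execute("SET GLOBAL read_only = 1")
        yield cursor
    finally:
        # Always undo both steps, even if the work inside the block failed.
        cursor.execute("SET GLOBAL read_only = 0")
        cursor.execute("UNLOCK TABLES")


class RecordingCursor:
    """Stand-in cursor that records statements instead of running them."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)


cursor = RecordingCursor()
with read_only_window(cursor):
    pass  # the copy of the database host would happen here
```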

Can I ask one of you to put up the maintenance message and suggest a window for this move? Anytime during US work hours (let's say after 14:00 UTC) will suit me. Thank you!

OK, I'll go first :) How about if we schedule downtime for Monday the 5th?

That should work, though I will probably be off 10-11 AM, 1-2PM, & 3-5PM Central Time

How about noon CST on that Monday? (that's probably 17:00 UTC although that week is the week-of-timezone-slip so I can't make any promises)
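
The timezone slip can be checked with a little date arithmetic. This sketch assumes the post-2007 US rule that daylight time ends on the first Sunday of November, and only handles November dates; the function names are made up for illustration.

```python
from datetime import date, datetime, timedelta, timezone


def first_sunday_of_november(year):
    """Date on which US daylight time ends (assumed post-2007 rule)."""
    first = date(year, 11, 1)
    # weekday(): Monday is 0, Sunday is 6.
    return first + timedelta(days=(6 - first.weekday()) % 7)


def central_offset_in_november(day):
    """UTC offset of US Central time at noon on a November date."""
    # The switch happens early in the morning, so by noon on the
    # changeover Sunday standard time (UTC-6) already applies.
    if day >= first_sunday_of_november(day.year):
        return timedelta(hours=-6)
    return timedelta(hours=-5)


def central_noon_as_utc(day):
    """Noon US Central time on `day`, converted to UTC."""
    local_zone = timezone(central_offset_in_november(day))
    local_noon = datetime(day.year, day.month, day.day, 12, 0, tzinfo=local_zone)
    return local_noon.astimezone(timezone.utc)


monday = central_noon_as_utc(date(2018, 11, 5))
friday_before = central_noon_as_utc(date(2018, 11, 2))
```

Under that rule the changeover in 2018 fell on Sunday, November 4, so noon Central on Monday the 5th is 18:00 UTC. Noon Central on the Friday before would have been 17:00 UTC.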

11AM-1PM sounds good to me. @Framawiki is it okay for you? (if not, are you fine with me handling it?)

@zhuyifei1999 or @Framawiki, can one of you announce this downtime to interested parties? Or at least rattle off a list of contacts here so I can do that?

zhuyifei1999 added a comment. Edited Oct 31 2018, 3:01 PM

I added a maintenance message:

MAINTENANCE_MSG: 'Quarry will be down for maintenance on Monday, November 5 at 5 PM UTC. (T207677)'

I can't be sure I'll be present, but at worst I'm certain @zhuyifei1999 will be able to handle it :p
Thank you both!

Mentioned in SAL (#wikimedia-cloud) [2018-11-05T17:24:24Z] <zhuyifei1999_> shutting down all workers sudo kill -TERM $(systemctl show -p MainPID celery-quarry-worker.service | cut -d= -f2) T207677

Change 471770 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] Horizon: move quarry project to eqiad1

https://gerrit.wikimedia.org/r/471770

Change 471770 merged by Andrew Bogott:
[operations/puppet@production] Horizon: move quarry project to eqiad1

https://gerrit.wikimedia.org/r/471770

Mentioned in SAL (#wikimedia-cloud) [2018-11-05T17:40:15Z] <zhuyifei1999_> set db read-only FLUSH TABLES WITH READ LOCK; SET GLOBAL read_only = 1; T207677

Mentioned in SAL (#wikimedia-cloud) [2018-11-05T18:22:28Z] <zhuyifei1999_> unset db read-only SET GLOBAL read_only = 0; UNLOCK TABLES; T207677

This happened after restore:

Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]: [2018-11-05 18:25:17,158] ERROR in app: Exception on /api/query/run [POST]
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]: Traceback (most recent call last):
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/app.py", line 2292, in wsgi_app
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     response = self.full_dispatch_request()
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/app.py", line 1815, in full_dispatch_request
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     rv = self.handle_user_exception(e)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/app.py", line 1718, in handle_user_exception
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     reraise(exc_type, exc_value, tb)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/_compat.py", line 35, in reraise
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     raise value
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/app.py", line 1813, in full_dispatch_request
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     rv = self.dispatch_request()
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/flask/app.py", line 1799, in dispatch_request
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self.view_functions[rule.endpoint](**req.view_args)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "./quarry/web/app.py", line 236, in api_run_query
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     if not result.ready():
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/result.py", line 311, in ready
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self.state in self.backend.READY_STATES
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/result.py", line 471, in state
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self._get_task_meta()['status']
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/result.py", line 410, in _get_task_meta
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self._maybe_set_cache(self.backend.get_task_meta(self.id))
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/backends/base.py", line 365, in get_task_meta
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     meta = self._get_task_meta_for(task_id)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/backends/base.py", line 680, in _get_task_meta_for
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self.decode_result(meta)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/backends/base.py", line 284, in decode_result
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     return self.meta_from_decoded(self.decode(payload))
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/backends/base.py", line 280, in meta_from_decoded
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     meta['result'] = self.exception_to_python(meta['result'])
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:   File "/srv/quarry/venv/lib/python3.5/site-packages/celery/backends/base.py", line 260, in exception_to_python
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]:     exc = cls(*exc_msg if isinstance(exc_msg, tuple) else exc_msg)
Nov 05 18:25:17 quarry-web-01 uwsgi-quarry-web[1815]: TypeError: __init__() missing 2 required positional arguments: 'params' and 'orig'

Not sure what is going on. It is not reproducible by creating a new query. Worked around by running FLUSHALL in redis-cli.
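
One plausible reading of the traceback, offered as a guess and not a confirmed diagnosis: a stored task result held a failed state, and rebuilding the exception from its stored message called a constructor that needs more arguments than were stored. The sketch below reproduces that failure mode with a hypothetical `QueryError` class and a simplified stand-in for the `cls(*exc_msg)` call on the last frame. The parameter names are copied from the error message; which real exception class was involved is not known from the log.

```python
class QueryError(Exception):
    """Hypothetical exception whose constructor needs three arguments."""

    def __init__(self, statement, params, orig):
        super().__init__(statement)
        self.statement = statement
        self.params = params
        self.orig = orig


def rebuild_exception(cls, exc_msg):
    """Simplified stand-in for rebuilding an exception from stored data."""
    args = exc_msg if isinstance(exc_msg, tuple) else (exc_msg,)
    return cls(*args)


def try_rebuild(cls, exc_msg):
    """Return the rebuilt exception, or the TypeError raised while rebuilding."""
    try:
        return rebuild_exception(cls, exc_msg)
    except TypeError as err:
        return err


# Only the message survived storage, so two constructor arguments are missing.
failure = try_rebuild(QueryError, "stored message only")
# An exception that accepts a single message rebuilds without trouble.
fine = try_rebuild(ValueError, "stored message only")
```

If that reading is right, it would fit both observations above: a new query has no stale failed result to decode, and FLUSHALL removes the stored results, assuming they were kept in Redis.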

zhuyifei1999 closed this task as Resolved. Nov 5 2018, 6:37 PM

Mentioned in SAL (#wikimedia-cloud) [2018-11-05T18:43:22Z] <framawiki> migration is over T207677