error 47 from memcached_set: (0x7f000800eda0) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY, host: mcrouter:11213 -> libmemcached/connect.cc:720
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/utils/deprecation.py", line 93, in __call__
    response = self.process_request(request)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/middleware/locale.py", line 21, in process_request
    language = translation.get_language_from_request(request, check_path=i18n_patterns_used)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/utils/translation/__init__.py", line 236, in get_language_from_request
    return _trans.get_language_from_request(request, check_path)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/utils/translation/trans_real.py", line 532, in get_language_from_request
    lang_code = request.session.get(LANGUAGE_SESSION_KEY)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/contrib/sessions/backends/base.py", line 65, in get
    return self._session.get(key, default)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/contrib/sessions/backends/base.py", line 194, in _get_session
    self._session_cache = self.load()
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/contrib/sessions/backends/cached_db.py", line 38, in load
    self._cache.set(self.cache_key, data, self.get_expiry_age(expiry=s.expire_date))
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/core/cache/backends/memcached.py", line 83, in set
    if not self._cache.set(key, value, self.get_backend_timeout(timeout)):
Event Timeline
The memcached library change was https://gerrit.wikimedia.org/r/c/wikimedia/toolhub/+/770987
Change 771068 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):
[wikimedia/toolhub@main] bug(memcached): roll back to python-memcached and pin django-prometheus
Change 771068 merged by jenkins-bot:
[wikimedia/toolhub@main] bug(memcached): roll back to python-memcached and pin django-prometheus
Change 771070 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):
[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-214735-production
Change 771070 merged by jenkins-bot:
[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-214735-production
Leaving this open for now. We need to do a retrospective on the good and not-so-good decisions that went into today's deploy and file tasks for the things to follow up on. It's after 17:00 local time and I'm not in the mood to be constructively reflective, so I will start on that tomorrow.
Quick dump of things that this event highlighted:
- Never testing memcached outside of production is not ideal.
- No monitoring of site for uptime/errors
- No solid plan for rolling back following a database migration
- Current "run maintenance actions from Bryan's laptop" system is slooooow for no obvious reason
- Bryan is the only person on the team with access to production servers
- No logstash dashboard, ad hoc queries needed instead
- metrics dashboard at https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub?orgId=1&refresh=1m has very little information currently
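For the first item in the list above (memcached never being exercised outside of production), one possible follow-up is a small round-trip smoke test that can run against whatever cache a given environment is configured with. The sketch below is illustrative only and is not taken from the Toolhub codebase: `cache_round_trip`, `InMemoryCache`, and `BrokenCache` are hypothetical names, and the function only assumes an object exposing `set(key, value, timeout)`, `get(key)`, and `delete(key)`, which is the shape of Django's cache API.

```python
import uuid


def cache_round_trip(cache, timeout=30):
    """Set, read back, and delete a throwaway key; return (ok, detail).

    `cache` is any object with set/get/delete methods shaped like
    Django's cache API. Nothing here is specific to memcached.
    """
    key = "smoke-test:" + uuid.uuid4().hex
    value = uuid.uuid4().hex
    try:
        cache.set(key, value, timeout)
        fetched = cache.get(key)
        cache.delete(key)
    except Exception as exc:
        # Client libraries raise different exception types, so this
        # sketch catches broadly and reports what it saw.
        return False, "cache raised {}: {}".format(type(exc).__name__, exc)
    if fetched != value:
        return False, "expected {!r}, got {!r}".format(value, fetched)
    return True, "ok"


class InMemoryCache:
    """Hypothetical dict-backed stand-in used only to exercise the check."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, timeout=None):
        self._data[key] = value
        return True

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


class BrokenCache(InMemoryCache):
    """Hypothetical stand-in that fails on set, like the error in this task."""

    def set(self, key, value, timeout=None):
        raise RuntimeError("SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY")


if __name__ == "__main__":
    print(cache_round_trip(InMemoryCache()))
    print(cache_round_trip(BrokenCache()))
```

In a Django project the same function could presumably be handed `django.core.cache.cache` from a test or a management command, so that a staging deploy fails loudly when the configured backend cannot complete a set/get cycle.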