
Toolhub broken in prod by memcached client library change
Closed, Resolved · Public · PRODUCTION ERROR

Description

error 47 from memcached_set: (0x7f000800eda0) SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY,  host: mcrouter:11213 -> libmemcached/connect.cc:720

  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/utils/deprecation.py", line 93, in __call__
    response = self.process_request(request)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/middleware/locale.py", line 21, in process_request
    language = translation.get_language_from_request(request, check_path=i18n_patterns_used)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/utils/translation/__init__.py", line 236, in get_language_from_request
    return _trans.get_language_from_request(request, check_path)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/utils/translation/trans_real.py", line 532, in get_language_from_request
    lang_code = request.session.get(LANGUAGE_SESSION_KEY)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/contrib/sessions/backends/base.py", line 65, in get
    return self._session.get(key, default)
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/contrib/sessions/backends/base.py", line 194, in _get_session
    self._session_cache = self.load()
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/contrib/sessions/backends/cached_db.py", line 38, in load
    self._cache.set(self.cache_key, data, self.get_expiry_age(expiry=s.expire_date))
  File "/opt/lib/poetry/toolhub-2uZo5AhP-py3.7/lib/python3.7/site-packages/django/core/cache/backends/memcached.py", line 83, in set
    if not self._cache.set(key, value, self.get_backend_timeout(timeout)):
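The traceback ends with the `cached_db` session backend writing a freshly loaded session back to the cache, and the error from the memcached client propagates up through the locale middleware and fails the request. As a minimal sketch of one possible mitigation (not the fix applied in this task, which rolled the client library back), a wrapper could treat cache failures as misses so that callers fall back to the database. `FlakyClient` and `FailSoftCache` are hypothetical names used only for illustration; the sketch does not use any Django or memcached client API.

```python
import logging

logger = logging.getLogger("cache")


class FlakyClient:
    """Hypothetical stand-in for a memcached client whose server is marked dead."""

    def set(self, key, value, timeout=0):
        raise RuntimeError("SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY")

    def get(self, key):
        raise RuntimeError("SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY")


class FailSoftCache:
    """Hypothetical wrapper: report cache errors as misses instead of raising."""

    def __init__(self, client):
        self._client = client

    def set(self, key, value, timeout=0):
        try:
            return bool(self._client.set(key, value, timeout))
        except Exception:
            # Log and continue; the caller still has the value it tried to store.
            logger.warning("cache set failed for %r", key, exc_info=True)
            return False

    def get(self, key, default=None):
        try:
            value = self._client.get(key)
        except Exception:
            logger.warning("cache get failed for %r", key, exc_info=True)
            return default
        return default if value is None else value


cache = FailSoftCache(FlakyClient())
stored = cache.set("session:abc", {"lang": "en"}, 300)  # no exception raised
loaded = cache.get("session:abc", default={})           # falls back to default
```

The trade-off is that a dead cache then shows up only in logs and latency rather than as request failures, so this approach assumes error logging is actually being watched.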

Event Timeline

Restricted Application added a subscriber: Aklapper.
bd808 triaged this task as High priority. Mar 15 2022, 9:17 PM
bd808 changed the subtype of this task from "Task" to "Production Error".

Change 771068 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] bug(memcached): roll back to python-memcached and pin django-prometheus

https://gerrit.wikimedia.org/r/771068

bd808 changed the task status from Open to In Progress. Mar 15 2022, 9:41 PM

Change 771068 merged by jenkins-bot:

[wikimedia/toolhub@main] bug(memcached): roll back to python-memcached and pin django-prometheus

https://gerrit.wikimedia.org/r/771068

Change 771070 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-214735-production

https://gerrit.wikimedia.org/r/771070

Change 771070 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-214735-production

https://gerrit.wikimedia.org/r/771070

Leaving this open for now. We need to do a retrospective on the good and not-so-good decisions made during today's deploy and create tasks for the follow-up work. It's after 17:00 local time and I'm not in the mood to be constructively reflective, so I will start on that tomorrow.

Quick dump of things that this event highlighted:

  • Never testing memcached outside of production is not ideal.
  • No monitoring of the site for uptime or errors.
  • No solid plan for rolling back following a database migration.
  • The current "run maintenance actions from Bryan's laptop" system is slooooow for no obvious reason.
  • Bryan is the only person on the team with access to production servers.
  • No Logstash dashboard; ad hoc queries are needed instead.
  • The metrics dashboard at https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub?orgId=1&refresh=1m currently has very little information.
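The first two items (no memcached testing outside production, no uptime or error monitoring) could both be approached with a small round-trip check that runs in a pre-production environment or behind a health endpoint. The following is a sketch under stated assumptions, not something that exists in Toolhub: `cache_round_trip`, `DictClient`, and `DeadClient` are hypothetical names, and the client is assumed to expose `set`, `get`, and `delete` methods.

```python
import uuid


def cache_round_trip(client, prefix="smoketest"):
    """Return True if a unique value can be written, read back, and deleted."""
    key = f"{prefix}:{uuid.uuid4().hex}"
    value = uuid.uuid4().hex
    try:
        client.set(key, value, 30)
        ok = client.get(key) == value
        client.delete(key)
        return ok
    except Exception:
        # Any client error counts as an unhealthy cache.
        return False


class DictClient:
    """Hypothetical in-memory client used only to exercise the check."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, timeout=0):
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


class DeadClient(DictClient):
    """Hypothetical client that fails on write, like the error in this task."""

    def set(self, key, value, timeout=0):
        raise ConnectionError("server disabled until timed retry")


healthy = cache_round_trip(DictClient())
broken = cache_round_trip(DeadClient())
```

Running a check like this against the real cache service before a deploy is promoted would exercise the client library and the server connection together, which is the combination that failed here.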