
ores.wmflabs.org - 503 icinga alerts
Closed, Resolved · Public

Event Timeline

Dzahn created this task. Wed, Jan 15, 12:59 AM
Restricted Application added a project: Scoring-platform-team. Wed, Jan 15, 12:59 AM
Restricted Application added a subscriber: Aklapper.
Dzahn updated the task description. Wed, Jan 15, 1:00 AM
elukey added a subscriber: Halfak. Wed, Jan 15, 7:45 AM

Looks like celery has shut down on all of the workers. I'm looking into it now.

I think we might be too close to the memory ceiling and an OOM is what's killing them.

That said, when I restart celery, available memory never drops below ~5GB (out of 8GB), so it doesn't look like we're *really* running out of memory. Could there be another reason we see:

MemoryError: [Errno 12] Cannot allocate memory

Something really strange is going on. I cut our celery workers in half and we're still not able to start celery, because we get a MemoryError during the startup process. We haven't done a deployment here in a while. What could have changed?

Halfak claimed this task. Wed, Jan 15, 4:38 PM
Dzahn triaged this task as Medium priority. Wed, Jan 15, 8:20 PM

Looks like the OOM error might have been old. Here's what I have now:

$ sudo -u www-data ../venv/bin/python ores_celery.py 
/srv/ores/venv/lib/python3.5/site-packages/smart_open/smart_open_lib.py:398: UserWarning: This function is deprecated, use smart_open.open instead. See the migration notes for details: https://github.com/RaRe-Technologies/smart_open/blob/master/README.rst#migrating-to-the-new-open-function
  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
Hspell: can't open /usr/share/hspell/hebrew.wgz.sizes.
Hspell: can't open /usr/share/hspell/hebrew.wgz.sizes.
Traceback (most recent call last):
  File "ores_celery.py", line 6, in <module>
    application = celery.build()
  File "/srv/ores/config/ores/applications/celery.py", line 41, in build
    config, config['ores']['scoring_system'])
  File "/srv/ores/config/ores/scoring_systems/celery_queue.py", line 232, in from_config
    config, name, section_key=section_key)
  File "/srv/ores/config/ores/scoring_systems/scoring_system.py", line 308, in _kwargs_from_config
    config, section['metrics_collector'])
  File "/srv/ores/config/ores/metrics_collectors/metrics_collector.py", line 62, in from_config
    return Class.from_config(config, name)
  File "/srv/ores/config/ores/metrics_collectors/statsd.py", line 151, in from_config
    return cls.from_parameters(**kwargs)
  File "/srv/ores/config/ores/metrics_collectors/statsd.py", line 131, in from_parameters
    statsd_client = statsd.StatsClient(*args, **kwargs)
  File "/srv/ores/venv/lib/python3.5/site-packages/statsd/client.py", line 146, in __init__
    host, port, fam, socket.SOCK_DGRAM)[0]
  File "/usr/lib/python3.5/socket.py", line 733, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

Looks like it is failing because statsd isn't there to connect to anymore.
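
The last frame of the traceback is a `socket.getaddrinfo` call, so the lookup can be checked on its own without starting celery. A minimal sketch, assuming a hypothetical helper name `host_resolves`; port 8125 is the conventional statsd UDP port, not a value taken from the ORES config:

```python
import socket


def host_resolves(host, port=8125):
    """Return True if `host` resolves for a UDP socket, False on a lookup failure.

    Hypothetical helper: it makes the same kind of getaddrinfo call that
    fails at the bottom of the traceback above.
    """
    try:
        socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_DGRAM)
    except socket.gaierror:
        # Same exception class as "[Errno -2] Name or service not known".
        return False
    return True
```

Calling `host_resolves()` with the hostname from the deployed config should show whether name resolution alone reproduces the failure.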

Turns out it was the statsd host. It changed from labsmon1001 to cloudmetrics1001. Now that I've done a new deployment with an updated config, we're back online.
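
Since the stale hostname only showed up as a bare `socket.gaierror` deep in startup, one option would be a pre-flight check that names the configured host in the error. This is a sketch under assumptions, not the ORES implementation: the function name `check_statsd_config` and the config layout (`metrics_collectors`, `host`, `port` keys) are hypothetical.

```python
import socket


def check_statsd_config(config, name):
    """Raise a descriptive error if the configured statsd host does not resolve.

    `config` is assumed to be a nested dict; the key names used here are
    hypothetical and only illustrate the idea.
    """
    section = config["metrics_collectors"][name]
    host = section["host"]
    port = section.get("port", 8125)  # 8125 is the conventional statsd port
    try:
        socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_DGRAM)
    except socket.gaierror as e:
        raise RuntimeError(
            "statsd host {0!r} for metrics collector {1!r} does not resolve: {2}"
            .format(host, name, e)) from e
    return host, port
```

With a check like this, a renamed metrics host would fail with a message that points at the config entry instead of at the socket module.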

Dzahn closed this task as Resolved. Wed, Jan 15, 11:53 PM