
Quarry is down with 502 Bad Gateway message
Closed, Resolved · Public

Description

All https://quarry.wmflabs.org/ pages give a 502 Bad Gateway error message.

Event Timeline

Framawiki triaged this task as Unbreak Now! priority.
Restricted Application added subscribers: Liuxinyu970226, TerraCodes, Aklapper. · Mar 16 2019, 8:21 AM
Framawiki closed this task as Resolved. · Mar 16 2019, 8:29 AM
Framawiki claimed this task.

Solved.

framawiki@quarry-web-01:/var/log/nginx$ sudo tail access.log -n 1
172.16.0.164 - - [16/Mar/2019:08:22:28 +0000] "GET /query/20733 HTTP/1.1" 502 173 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)"
framawiki@quarry-web-01:/var/log/nginx$ sudo tail error.log -n 1
2019/03/16 08:22:28 [error] 2447#2447: *3618074 connect() to unix:/run/uwsgi/quarry-web.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 172.16.0.164, server: , request: "GET /query/20733 HTTP/1.1", upstream: "uwsgi://unix:/run/uwsgi/quarry-web.sock:", host: "quarry.wmflabs.org"

uwsgi was receiving and processing the requests, but its responses did not seem to reach nginx. Restarting uwsgi solved the problem.
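For future occurrences, the upstream socket can be probed directly instead of inferring its state from the nginx logs. The following is only a minimal sketch, not something that was run during this incident: the function name `probe_unix_socket` and the timeout value are made up for illustration, and the socket path is the one shown in the nginx error log above. On Linux, errno 11 (EAGAIN) on a Unix socket connect often means the listen queue is full, but that is an assumption and was not verified here.

```python
import errno
import socket


def probe_unix_socket(path, timeout=2.0):
    """Try to connect to a Unix stream socket.

    Returns "ok" on success, "timeout" if the connect timed out,
    or the symbolic errno name (e.g. "EAGAIN", "ECONNREFUSED", "ENOENT").
    Hypothetical helper, for illustration only.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return "ok"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Map the numeric errno (11 in the nginx log) to its symbolic name.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()


if __name__ == "__main__":
    # Socket path taken from the nginx error log above.
    print(probe_unix_socket("/run/uwsgi/quarry-web.sock"))
```

The script would probably need to run as a user with access to the socket (for example via sudo). A result other than "ok" would suggest the problem is on the uwsgi side of the socket rather than in nginx.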

I see this in the logs, just after the first down alert (02:55:39):

Mar 16 03:01:16 quarry-web-01 nslcd[632]: [13c5e3] <group/member="quarry"> ldap_start_tls_s() failed (uri=ldap://ldap-labs.codfw.wikimedia.org:389): Timed out: Operation now in progress
Mar 16 03:01:16 quarry-web-01 nslcd[632]: [13c5e3] <group/member="quarry"> failed to bind to LDAP server ldap://ldap-labs.codfw.wikimedia.org:389: Timed out: Operation now in progress
Mar 16 03:01:16 quarry-web-01 nslcd[632]: [13c5e3] <group/member="quarry"> connected to LDAP server ldap://ldap-labs.eqiad.wikimedia.org:389
Mar 16 03:01:19 quarry-web-01 nslcd[632]: [6a2fa8] <group/member="puppet"> connected to LDAP server ldap://ldap-labs.codfw.wikimedia.org:389
...
Mar 16 03:04:52 quarry-web-01 puppet-agent[16534]: /usr/bin/timeout -k 5s 20s /bin/mkdir /mnt/nfs/labstore-secondary-project returned 1 instead of one of [0]
Mar 16 03:04:52 quarry-web-01 puppet-agent[16534]: (/Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[project-on-labstore-secondary]/Exec[create-/mnt/nfs/labstore-secondary-project]/returns) change from notrun to 0 failed: /usr/bin/timeout -k 5s 20s /bin/mkdir /mnt/nfs/labstore-secondary-project returned 1 instead of one of [0]
Mar 16 03:04:52 quarry-web-01 puppet-agent[16534]: (/Stage[main]/Role::Labs::Nfsclient/Labstore::Nfs_mount[project-on-labstore-secondary]/Exec[ensure-nfs-project-on-labstore-secondary]) Dependency Exec[create-/mnt/nfs/labstore-secondary-project] has failures: true

Could it be related to T217280?