
Platform unable to resolve database hostname as expected
Open, Medium, Public

Description

The Library Card platform went down today at 3:05pm UTC, seemingly at random. The culprit was the application container being unable to resolve the database hostname through Docker (Unknown MySQL server host 'db'). It's not clear why this isn't working as expected, but a hotfix has been deployed.

Now that we've fixed the immediate issue, this is going to be parked under our longer-term issues to come back to and investigate. It's worth noting that the Hashtags tool had a very similar issue - it's deployed slightly differently (not using Docker Swarm), but also failed to resolve its database hostname through Docker. Wikilink-Tool, on the other hand, despite being very similar to Hashtags (more similar than to TWLight), functioned as expected.


Event Timeline

Adding Cloud-VPS in case this is an indicator of a broader issue.

It's still unclear why the database isn't resolving - it works by direct IP connection.

We've deployed a hotfix directly to production and the tool is now up. I'll leave this task open until that gets merged back to the repository.


What is the exact hostname that you were attempting to use? The message above seems to imply that you used the hostname db rather than a fully qualified domain name. Is the database deployed as a Cloud VPS instance or as a Docker container that you are managing?


Yeah, to clarify, this is about the connection between two docker containers in the same instance. The issue is between the twlight and db containers - until now the database has resolved with a hostname of db as expected.
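For anyone hitting the same symptom, a rough way to tell a name-resolution failure apart from a plain connection failure is to run something like the following from inside the application container. This is only a sketch using the Python standard library: the function name `diagnose`, the default port 3306, and the return labels are assumptions made for illustration, not anything taken from TWLight.

```python
import socket


def diagnose(host, port=3306, timeout=3.0,
             resolver=socket.getaddrinfo,
             connector=socket.create_connection):
    """Classify a connection attempt as 'dns-failure', 'connect-failure', or 'ok'.

    `resolver` and `connector` are injectable so the check can be exercised
    without a real network (hypothetical parameters for this sketch).
    """
    try:
        # Step 1: can the name be resolved at all?
        infos = resolver(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return "dns-failure"

    # Step 2: the name resolves, so try a TCP connection to the first address.
    address = infos[0][4][:2]  # (ip, port) for both IPv4 and IPv6 results
    try:
        conn = connector(address, timeout=timeout)
    except OSError:
        return "connect-failure"
    conn.close()
    return "ok"
```

If `diagnose("db")` reports a DNS failure while `diagnose` with the database's IP address succeeds, that matches the behaviour described in this task: the database is reachable, but the alias is not resolving.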

Hotfix posted to master. I was able to work around this issue by using the name entry used internally by the Docker network manager, e.g. s/db/tasks.production_db/g, where production is the stack name and db is the service alias.
https://github.com/WikipediaLibrary/TWLight/commit/a5c77fdbd0323114fd3c525be56e3dd7b726c6d8
I'll stash the change on prod and pull again once it makes it to the production branch. Will do the same for staging as part of the upcoming rebase.
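To illustrate the naming pattern behind the workaround: the following is a minimal sketch, not the content of the linked commit. It builds the `tasks.<stack>_<service>` name and picks the first candidate hostname that resolves. The helper names `swarm_tasks_name` and `pick_db_host` are hypothetical, and the injectable `resolver` parameter exists only so the logic can be tried without a Swarm network.

```python
import socket


def swarm_tasks_name(stack, service):
    """Build the Swarm-internal DNS name for a stack service,
    e.g. stack 'production' and service 'db' give 'tasks.production_db'."""
    return f"tasks.{stack}_{service}"


def pick_db_host(candidates, resolver=socket.gethostbyname):
    """Return the first hostname in `candidates` that resolves.

    Raises RuntimeError if none of them resolve.
    """
    for name in candidates:
        try:
            resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            continue
        return name
    raise RuntimeError(f"none of these hostnames resolve: {candidates}")
```

A settings module could then, hypothetically, use `pick_db_host(["db", swarm_tasks_name("production", "db")])` for the database host, so that the short alias is preferred when it works and the task name is used otherwise. Whether a fallback like this is preferable to hard-coding one name is a judgment call; it would hide the underlying resolution problem rather than explain it.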

It's not clear why the alias name spontaneously stopped resolving on both staging and production hosts.

Before making that change, I tried a stack redeploy, some manual network dis/reconnects, package updates, and a reboot.

Okay, the hotfix made it to the production branch. I dropped the local changes on the production host, pulled, and redeployed the stack. Production is looking happy. The staging rebase will be another day.

As an additional data point - Wikilink-Tool uses the same name resolution for its database via django + docker and hasn't run into any issues today.

Samwalton9-WMF renamed this task from Library Card platform is down - Unknown MySQL server host 'db' to Platform unable to resolve database hostname. Mar 10 2020, 5:34 PM
Samwalton9-WMF updated the task description.
Samwalton9-WMF renamed this task from Platform unable to resolve database hostname to Platform unable to resolve database hostname as expected. Mar 10 2020, 5:34 PM
Samwalton9-WMF lowered the priority of this task from High to Medium.

Now that the platform is back up, I've rescoped this task for future investigation. A Cloud issue seems unlikely, given that other tools have continued to function as expected, so if this doesn't look like one, feel free to remove Cloud-VPS.

bd808 edited projects, added Library-Card-Platform; removed VPS-Projects.

Sorry, did not mean to remove the actual project. Phabricator sub-projects can be confusing.