
Verify checkwiki tool against excessive DB usage
Closed, ResolvedPublic

Description

We had a DB outage on Feb 14 caused by an excessive number of DB connections from tools. checkwiki was among these.

It has not been entirely established whether the cause was excessively long queries or some other issue causing queries to get stuck and pile up, which then overloaded the server. But there were enough connections from the checkwiki tool to merit investigation alongside the other problems.

Event Timeline

GTirloni created this task.Feb 14 2019, 7:07 PM
Restricted Application added a subscriber: Aklapper.Feb 14 2019, 7:07 PM
bd808 added a comment.Feb 14 2019, 7:26 PM

We do not currently have any evidence that this tool's code or job frequency has changed. It is possible that the spike in active connections to ToolsDB is caused by some external issue. The tool did migrate from the Trusty job grid to the Stretch job grid recently.

Bstorm updated the task description.Feb 15 2019, 12:59 AM

I have no idea why I am a maintainer of this project. I would help in any way possible, though. Let me check.

I don't know what you want from the maintainer for you to unsuspend the web service for this project.
Maybe this:
Web service migration to stretch: 15 January 2019 per this announcement - https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:WikiProject_Check_Wikipedia&diff=878481252&oldid=877912003
Last web service source code change: Nov 21, 2018 per github - https://github.com/bamyers99/checkwiki/tree/master/cgi-bin

bd808 added a comment.Feb 15 2019, 6:35 PM

> I don't know what you want from the maintainer for you to unsuspend the web service for this project.

This really may all be collateral damage from a general system failure of the ToolsDB server. We created this ticket early in the debugging cycle before we became more aware of the extent of the issues that ToolsDB is still having. At the moment all tools are under a concurrent connection cap and the system continues to degrade. We are actively working to build a replacement server and migrate the ToolsDB data to it. Once that work is complete we can revisit the impact of any individual tools.

All of that is a really long way of saying that today I do not know what we want from you as tool maintainers either. The best support you can give today is some patience as we try to correct the global problem and get back to a state where we can evaluate individual usage by each tool.

What is the exact status now? Will the project remain deactivated, letting the quality of the entire project fall by the wayside?

If it must remain deactivated, I ask you to provide appropriate solutions with Cirrus as well, since not all contributors here are programmers.

So database owners (or similar) should have sufficient means to scan or log the database load in production. Why isn't this being done?

PS: Could the increased load be related to the activation of ID16? Previously it ran only on the English Wikipedia, but now it runs on all Wikipedias!

> What is the exact status now? Will the project remain deactivated, letting the quality of the entire project fall by the wayside?

@Crazy1880 Following the parent task at T216208: ToolsDB overload and cleanup and/or reading posts on the cloud-announce mailing list is the best way to track the overall ToolsDB outage problem. Once we have that handled we will circle back on this task to find out if anything needs to change in this particular tool's usage of the ToolsDB database.

> If it must remain deactivated, I ask you to provide appropriate solutions with Cirrus as well, since not all contributors here are programmers.

> So database owners (or similar) should have sufficient means to scan or log the database load in production. Why isn't this being done?

I'm not sure that I understand the question here.

> PS: Could the increased load be related to the activation of ID16? Previously it ran only on the English Wikipedia, but now it runs on all Wikipedias!

Possibly? I do not know the codebase of this particular tool, but this may be something for the maintainers to investigate if there are material issues left once the ToolsDB service is fixed and we try to measure impact on the shared service from this tool again.

I have added the following to $HOME/.lighttpd.conf to reduce the server load from bot traffic:

# Deny access to clients whose User-Agent contains "spider", or "bot"
# adjacent to whitespace or punctuation (e.g. "Googlebot/2.1")
$HTTP["useragent"] =~ "(?:spider|bot[\s_+:,\.\;\/\\\-]|[\s_+:,\.\;\/\\\-]bot)" {
  url.access-deny = ( "" )
}

The above code is working on the other project that I just added it to: bambots
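For anyone adapting that rule, the lighttpd condition uses a PCRE-style regex. The following Python sketch (illustrative only, not part of the tool) replicates the pattern so its behavior can be checked against sample User-Agent strings before deploying:

```python
import re

# The bot-matching pattern from the lighttpd config above, translated to a
# Python raw string. Matches "spider", or "bot" adjacent to whitespace or
# one of the listed punctuation characters.
BOT_PATTERN = re.compile(r"(?:spider|bot[\s_+:,\.\;\/\\\-]|[\s_+:,\.\;\/\\\-]bot)")

def is_bot(user_agent: str) -> bool:
    """Return True if this User-Agent would be denied by the rule above."""
    return BOT_PATTERN.search(user_agent) is not None
```

Note the pattern is case-sensitive as written, so a User-Agent containing only "Bot" (capitalized) would slip through unless the rule is extended.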

Per the announcement at https://lists.wikimedia.org/pipermail/cloud-announce/2019-February/000137.html, ToolsDB has been migrated to new hardware and is currently operating normally. The checkwiki tool can be re-enabled, and we will try to keep an eye on issues that may arise. Hopefully this report was just a false positive that was noticed as the failing server struggled to keep up.

bd808 moved this task from Backlog to ToolsDB on the Data-Services board.Feb 19 2019, 1:09 AM
Bstorm closed this task as Resolved.Feb 22 2019, 5:52 PM
Bstorm claimed this task.
Bstorm added a subscriber: Bstorm.

T216170 may also have made this a less important issue in general. During the outage, any tool running a number of queries could be shown to hang, even on small queries, so connections would previously pile up without end. Even killing connections didn't work correctly and would hang, as would some status queries against system databases during testing. I think we are safe to close this for now. Every user currently has a pool of 20 connections available; if you run out, perhaps the tool should be configured around that limit if possible.
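The per-user cap of 20 connections mentioned above can be respected on the client side by bounding concurrency in the tool itself. A minimal hypothetical sketch (the MAX_CONNECTIONS value and the connect/close callables are illustrative placeholders, not checkwiki code):

```python
import threading
from contextlib import contextmanager

# Stay safely under the 20-connection per-user ToolsDB cap described above.
MAX_CONNECTIONS = 15
_slots = threading.BoundedSemaphore(MAX_CONNECTIONS)

@contextmanager
def toolsdb_connection(connect, close):
    """Acquire a pool slot, open a connection, and always release both.

    `connect` opens a database connection; `close` closes one. Both are
    passed in so this sketch stays independent of any particular DB driver.
    """
    _slots.acquire()
    try:
        conn = connect()
        try:
            yield conn
        finally:
            close(conn)
    finally:
        _slots.release()
```

With this wrapper, if more than MAX_CONNECTIONS jobs try to open a connection at once, the extra callers block locally instead of exceeding the server-side cap.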