
Determine reason for daily increasing proc count on fran1001
Closed, Resolved · Public

Description

fran1001 has started alerting on total proc count during evening (PDT) runs. The proc count starts growing around 02:00 UTC and returns to normal around 10:00 UTC, climbing from the steady state of ~430 to nearly 2000. This has happened each of the last three evenings. The major change that happened recently was the upgrade of frdb1003 to buster and MariaDB 10.4 (T255066) on 2020-06-22.

What we are seeing is a growth of total procs on fran1001 and an increase of connected db threads on frdb1003. The current hypothesis is that a query or set of queries is delaying database responses and causing the other connections to stack up behind it.

Todo:

  • investigate the db performance, connection status, and queries running during the trouble period
  • check the analytics processes and scripts to see how they are interacting with the db during that time period
  • adjust analytics scripts as needed to eliminate concurrent running processes where possible (see the lock-file sketch after this list). Opened T256924 to handle the updating of scripts as needed.
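
As a rough illustration of the last item, a minimal lock-file sketch in Python. This is not the actual analytics code; the lock path and the main() body are placeholders.

import fcntl
import sys

# Hypothetical lock path; the real scripts live in wikimedia/fundraising/analytics.
LOCK_PATH = "/var/lock/banner_cube_raw_update.lock"

def main():
    # Placeholder for the actual update work.
    pass

if __name__ == "__main__":
    with open(LOCK_PATH, "w") as lock:
        try:
            # Take an exclusive, non-blocking lock; bail out if another run holds it.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("another instance is already running; exiting", file=sys.stderr)
            sys.exit(0)
        main()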

Event Timeline

Temporarily bumping the alert threshold to 3000 while we find the root cause.

[frack::puppet] 9e777eba Adjust nagios check_procs alert upward to 3000

Rolled the prometheus processes collector to buster hosts to collect stats on total procs running and their states.

[frack::puppet] 0f300450 Re-enable the proccesses collector for prometheus on buster

Captured some data during the incident, and it looks like backed-up queries on the banner_cube_raw table. We'll need to add either a lock file or some logic to detect these cases. On the upside, nothing bad, and it's totally fixable.

When   Load  Cxns    QPS   Slow   Se/In/Up/De%   QCacheHit  KCacheHit  BpsIn   BpsOut
Now    3.58  72      0.20  0      50/ 0/ 0/ 0        0.00%    100.00%  22.49    3.86k
Total  0.00   4.88k  1.14  1.14k  63/700/150/ 8      0.00%    100.00%   7.65k  17.11k

Cmd    ID      State               User    Host          DB         Time          Query                                                                                                                            
Query   27333  Waiting for table   eyener  10.64.40.110  analytics  01:01:33.769  CREATE OR REPLACE TABLE banner_cube_raw ( id int(10) unsigned ,contact_id int(10) unsigned ,financial_type_id int(10) unsigned ,f
Query   27334  Waiting for table   eyener  10.64.40.110  analytics  01:01:33.764  INSERT INTO banner_cube_raw ( id ,contact_id ,financial_type_id ,financial_type_name ,donation_type ,original_amount ,original_cu
Query   27337  Waiting for table   eyener  10.64.40.110  analytics  01:00:33.639  SELECT MAX(id) FROM analytics.banner_cube_raw                                                                                    
Query   27340  Waiting for table   eyener  10.64.40.110  analytics     59:33.946  SELECT MAX(id) FROM analytics.banner_cube_raw                                                                                    
Query   27344  Waiting for table   eyener  10.64.40.110  analytics     58:33.737  SELECT MAX(id) FROM analytics.banner_cube_raw                                                                                    
Query   27347  Waiting for table   eyener  10.64.40.110  analytics     57:33.522  SELECT MAX(id) FROM analytics.banner_cube_raw                                                                                    
Query   27350  Waiting for table   eyener  10.64.40.110  analytics     56:33.290  SELECT MAX(id) FROM analytics.banner_cube_raw                                                                                    
Query   27353  Waiting for table   eyener  10.64.40.110  analytics     55:34.087  SELECT MAX(id) FROM analytics.banner_cube_raw                                                                                    
Query   27356  Waiting for table   eyener  10.64.40.110  analytics     54:33.558  SELECT MAX(id) FROM analytics.banner_cube_raw
[snip]
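
For the "logic to detect these cases" mentioned above, one option is to watch the processlist for long-lived "Waiting for table" threads like the ones captured here. A minimal sketch, assuming PyMySQL and a read-only monitoring account; the host, credentials, and threshold are placeholders.

import pymysql

# Placeholder connection details for frdb1003; real credentials would come from config.
conn = pymysql.connect(host="frdb1003", user="monitor", password="...", database="analytics")

WAIT_THRESHOLD_SECONDS = 300  # arbitrary example threshold

with conn.cursor() as cur:
    # Threads stuck behind a table/metadata lock show up with state 'Waiting for table ...'.
    cur.execute(
        "SELECT id, user, time, state, LEFT(info, 80) "
        "FROM information_schema.processlist "
        "WHERE state LIKE 'Waiting for table%%' AND time > %s",
        (WAIT_THRESHOLD_SECONDS,),
    )
    for thread_id, user, seconds, state, query in cur.fetchall():
        print(f"thread {thread_id} ({user}) has waited {seconds}s: {query}")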

Change 608659 had a related patch set uploaded (by Dwisehaupt; owner: Dwisehaupt):
[wikimedia/fundraising/analytics@master] Add logging and start/end timings for banner_cube_raw_update.py

https://gerrit.wikimedia.org/r/c/wikimedia/fundraising/analytics/+/608659

Having dug into this, the cause appears to be long-running queries, with smaller queries backing up behind them while they wait for the table lock to be released. I'm working with @EYener to add some logging and timing to a few of the scripts to get accurate runtimes. We'll also discuss what each of the processes does and whether we should introduce lockfiles to some of them to keep concurrent processes from running where they don't need to.
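
The gist of the logging/timing change is roughly the following; this is a hedged sketch, not the actual patch to banner_cube_raw_update.py, and the function body is a placeholder.

import logging
import time

# The real change presumably logs to a file; logging to stderr keeps the sketch self-contained.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def rebuild_banner_cube_raw():
    # Placeholder for the CREATE OR REPLACE TABLE / INSERT work seen in the processlist.
    pass

start = time.monotonic()
logging.info("banner_cube_raw update starting")
try:
    rebuild_banner_cube_raw()
finally:
    logging.info("banner_cube_raw update finished in %.1fs", time.monotonic() - start)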

Change 608659 merged by Erin Yener:
[wikimedia/fundraising/analytics@master] Add logging and start/end timings for banner_cube_raw_update.py

https://gerrit.wikimedia.org/r/c/wikimedia/fundraising/analytics/+/608659

We have nailed this down to the scripts running concurrently. Going to close this out in favor of T256924, where the work to fix up these scripts is being done.

Dwisehaupt updated the task description. (Show Details)
Dwisehaupt moved this task from In Progress to Done on the fundraising-tech-ops board.