
Determine reason for daily increasing proc count on fran1001
Closed, Resolved · Public

Description

fran1001 has started alerting on total proc count during evening (PDT) runs. The proc count starts growing around 02:00 UTC and returns to normal around 10:00 UTC, climbing from the steady state of ~430 to nearly 2000. This has happened each of the last three evenings. The major change that happened recently was the upgrade of frdb1003 to buster and MariaDB 10.4 (T255066) on 2020-06-22.

What we are seeing is a growth of total procs on fran1001 and an increase of connected db threads on frdb1003. The current hypothesis is that a query or set of queries is delaying database responses and causing the other connections to stack up behind it.

Todo:

  • investigate the db performance, connection status, and queries running during the trouble period
  • check the analytics processes and scripts to see how they are interacting with the db during that time period
  • adjust analytics scripts as needed to eliminate concurrent running processes where possible (see the lock-file sketch after this list). Opened T256924 to handle the updating of scripts as needed.
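
As a rough illustration of the last item, a minimal lock-file sketch in Python. This is not the actual analytics code; the lock path and the main() body are placeholders.

import fcntl
import sys

# Hypothetical lock path; the real scripts live in wikimedia/fundraising/analytics.
LOCK_PATH = "/var/lock/banner_cube_raw_update.lock"

def main():
    # Placeholder for the actual update work.
    pass

if __name__ == "__main__":
    with open(LOCK_PATH, "w") as lock:
        try:
            # Take an exclusive, non-blocking lock; bail out if another run holds it.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            print("another instance is already running; exiting", file=sys.stderr)
            sys.exit(0)
        main()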

Event Timeline

Temporarily bumping the alert threshold to 3000 while we find the root cause.

[frack::puppet] 9e777eba Adjust nagios check_procs alert upward to 3000

Rolled the prometheus processes collector to buster hosts to collect stats on total procs running and their states.

[frack::puppet] 0f300450 Re-enable the proccesses collector for prometheus on buster

Captured some data during the incident, and it looks like backed-up queries on the banner_cube_raw table. We'll need to add either a lock file or some logic to detect these cases. On the upside, nothing bad, and it's totally fixable.

When   Load  Cxns    QPS   Slow   Se/In/Up/De%   QCacheHit  KCacheHit  BpsIn   BpsOut
Now    3.58  72      0.20  0      50/ 0/ 0/ 0        0.00%    100.00%  22.49    3.86k
Total  0.00   4.88k  1.14  1.14k  63/700/150/ 8      0.00%    100.00%   7.65k  17.11k

Cmd    ID      State               User    Host          DB         Time          Query                                                                                                                            
Query   27333  Waiting for table   eyener  10.64.40.110  analytics  01:01:33.769  CREATE OR REPLACE TABLE banner_cube_raw ( id int(10) unsigned ,contact_id int(10) unsigned ,financial_type_id int(10) unsigned ,f
Query   27334  Waiting for table   eyener  10.64.40.110  analytics  01:01:33.764  INSERT INTO banner_cube_raw ( id ,contact_id ,financial_type_id ,financial_type_name ,donation_type ,original_amount ,original_cu
Query   27337  Waiting for table   eyener  10.64.40.110  analytics  01:00:33.639  SELECT MAX(id) FROM analytics.banner_cube_raw                                                                                    
Query   27340  Waiting for table   eyener  10.64.40.110  analytics     59:33.946  SELECT MAX(id) FROM analytics.banner_cube_raw                                                                                    
Query   27344  Waiting for table   eyener  10.64.40.110  analytics     58:33.737  SELECT MAX(id) FROM analytics.banner_cube_raw                                                                                    
Query   27347  Waiting for table   eyener  10.64.40.110  analytics     57:33.522  SELECT MAX(id) FROM analytics.banner_cube_raw                                                                                    
Query   27350  Waiting for table   eyener  10.64.40.110  analytics     56:33.290  SELECT MAX(id) FROM analytics.banner_cube_raw                                                                                    
Query   27353  Waiting for table   eyener  10.64.40.110  analytics     55:34.087  SELECT MAX(id) FROM analytics.banner_cube_raw                                                                                    
Query   27356  Waiting for table   eyener  10.64.40.110  analytics     54:33.558  SELECT MAX(id) FROM analytics.banner_cube_raw
[snip]
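
For the "logic to detect these cases" mentioned above, one option is to watch the processlist for long-lived "Waiting for table" threads like the ones captured here. A minimal sketch, assuming PyMySQL and a read-only monitoring account; the host, credentials, and threshold are placeholders.

import pymysql

# Placeholder connection details for frdb1003; real credentials would come from config.
conn = pymysql.connect(host="frdb1003", user="monitor", password="...", database="analytics")

WAIT_THRESHOLD_SECONDS = 300  # arbitrary example threshold

with conn.cursor() as cur:
    # Threads stuck behind a table/metadata lock show up with state 'Waiting for table ...'.
    cur.execute(
        "SELECT id, user, time, state, LEFT(info, 80) "
        "FROM information_schema.processlist "
        "WHERE state LIKE 'Waiting for table%%' AND time > %s",
        (WAIT_THRESHOLD_SECONDS,),
    )
    for thread_id, user, seconds, state, query in cur.fetchall():
        print(f"thread {thread_id} ({user}) has waited {seconds}s: {query}")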

Change 608659 had a related patch set uploaded (by Dwisehaupt; owner: Dwisehaupt):
[wikimedia/fundraising/analytics@master] Add logging and start/end timings for banner_cube_raw_update.py

https://gerrit.wikimedia.org/r/c/wikimedia/fundraising/analytics/+/608659

Having dug into this, the cause appears to be long-running queries, with smaller queries backing up behind them while they wait for the table lock to be released. I'm working with @EYener to add some logging and timing to a few of the scripts to get accurate runtimes. We'll also discuss what each of the processes does and whether we should introduce lockfiles to some of them to keep concurrent processes from running where they don't need to.
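
The gist of the logging/timing change is roughly the following; this is a hedged sketch, not the actual patch to banner_cube_raw_update.py, and the function body is a placeholder.

import logging
import time

# The real change presumably logs to a file; logging to stderr keeps the sketch self-contained.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def rebuild_banner_cube_raw():
    # Placeholder for the CREATE OR REPLACE TABLE / INSERT work seen in the processlist.
    pass

start = time.monotonic()
logging.info("banner_cube_raw update starting")
try:
    rebuild_banner_cube_raw()
finally:
    logging.info("banner_cube_raw update finished in %.1fs", time.monotonic() - start)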

Change 608659 merged by Erin Yener:
[wikimedia/fundraising/analytics@master] Add logging and start/end timings for banner_cube_raw_update.py

https://gerrit.wikimedia.org/r/c/wikimedia/fundraising/analytics/+/608659

We have nailed this down to the scripts running concurrently. Going to close this out in favor of T256924, where the work to fix up these scripts is being done.

Dwisehaupt updated the task description. (Show Details)
Dwisehaupt moved this task from In Progress to Done on the fundraising-tech-ops board.