Database busted for CiviCRM tests?
Open, Normal · Public

Description

All the CiviCRM tests seem to be failing in CI lately, with mysterious DB errors.

E.g. https://integration.wikimedia.org/ci/job/wikimedia-fundraising-civicrm/7367/console, https://integration.wikimedia.org/ci/job/wikimedia-fundraising-civicrm/7359/console.

Are there new limits on RAM or query timeouts for CI databases?

Ejegg created this task. Oct 2 2018, 3:09 AM
Restricted Application added a subscriber: Aklapper. Oct 2 2018, 3:09 AM
Krenair added a subscriber: Krenair. Oct 2 2018, 3:12 PM

https://integration.wikimedia.org/ci/job/wikimedia-fundraising-civicrm/7367/console says it ran on integration-slave-jessie-1001.
Sep 06 00:55:13 <wmf-insecte> maintenance-disconnect-full-disks build 713 integration-slave-jessie-1001: OFFLINE due to disk space
(It may have recovered since then; I'm not sure, as this is based on my IRC logs.)
df -h on that host shows, among other things:

Filesystem                          Size  Used Avail Use% Mounted on
/dev/vda3                            19G   16G  2.2G  88% /
none                                256M  249M  7.9M  97% /var/lib/mysql
/dev/mapper/vd-second--local--disk   21G   14G  5.7G  71% /srv
none                                256M  7.4M  249M   3% /srv/home/jenkins-deploy/tmpfs

So no wonder the MySQL stuff is failing: the 256M tmpfs mounted on /var/lib/mysql is 97% full.
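A check like this one can be scripted so that a nearly full tmpfs shows up before the tests start failing. The following is a minimal Python sketch, assuming the six-column df -h layout shown above; the function name and the 90% threshold are illustrative choices, not anything the CI hosts are known to run.

```python
from __future__ import annotations

# df -h output copied from integration-slave-jessie-1001 (see above).
DF_OUTPUT = """\
Filesystem                          Size  Used Avail Use% Mounted on
/dev/vda3                            19G   16G  2.2G  88% /
none                                256M  249M  7.9M  97% /var/lib/mysql
/dev/mapper/vd-second--local--disk   21G   14G  5.7G  71% /srv
none                                256M  7.4M  249M   3% /srv/home/jenkins-deploy/tmpfs
"""


def nearly_full_mounts(df_text: str, threshold: int = 90) -> list[tuple[str, int]]:
    """Return (mount point, Use%) for each filesystem at or above `threshold` percent."""
    flagged = []
    for line in df_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        use_pct = int(fields[4].rstrip("%"))
        mount_point = " ".join(fields[5:])
        if use_pct >= threshold:
            flagged.append((mount_point, use_pct))
    return flagged


if __name__ == "__main__":
    for mount, pct in nearly_full_mounts(DF_OUTPUT):
        print(f"{mount} is {pct}% full")  # prints "/var/lib/mysql is 97% full"
```

With the default threshold only /var/lib/mysql is flagged; the root filesystem at 88% would be caught by lowering the threshold to 80.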

I found some fundraising tables lying around on some integration hosts, but they don't look responsible, since each schema is only about 20MB.
Relevant output from krenair@integration-cumin:~$ sudo cumin '*' 'mysql -e "select table_schema, sum((data_length+index_length)/1024/1024) AS MB from information_schema.tables group by 1;"':

===== NODE GROUP =====
(1) integration-slave-jessie-1003.integration.eqiad.wmflabs
----- OUTPUT of 'mysql -e "select...les group by 1;"' -----
table_schema	MB
civicrm_jenkins_wikimedia_fundraising_civicrm_7359_3	19.95312500
information_schema	0.00878904
mysql	0.68926619
performance_schema	0.00000000
===== NODE GROUP =====
(1) integration-slave-jessie-1001.integration.eqiad.wmflabs
----- OUTPUT of 'mysql -e "select...les group by 1;"' -----
table_schema	MB
civicrm_jenkins_wikimedia_fundraising_civicrm_7351_3	19.95312500
information_schema	0.00878904
mysql	0.68926619
performance_schema	0.00000000
Ejegg added a comment. Oct 2 2018, 4:36 PM

Looks like the ibdata1 file is taking up almost all the space:
root@integration-slave-jessie-1003:/var/lib/mysql# du -sh *
1.3M civicrm_jenkins_wikimedia_fundraising_civicrm_7389_1
6.7M civicrm_jenkins_wikimedia_fundraising_civicrm_7389_2
2.4M civicrm_jenkins_wikimedia_fundraising_civicrm_7389_3
235M ibdata1
5.0M ib_logfile0
5.0M ib_logfile1
1.1M mysql
208K performance_schema
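The du numbers make the same point from inside the directory: ibdata1 alone takes up about 92% of the tmpfs. Below is a short Python sketch of that arithmetic. It assumes GNU du's 1024-based units and assumes that jessie-1003 has the same 256M tmpfs as the df output from jessie-1001. Because du -h rounds sizes up, the figure is approximate. The helper names are hypothetical.

```python
from __future__ import annotations

# du -sh * output from /var/lib/mysql on integration-slave-jessie-1003 (see above).
DU_OUTPUT = """\
1.3M civicrm_jenkins_wikimedia_fundraising_civicrm_7389_1
6.7M civicrm_jenkins_wikimedia_fundraising_civicrm_7389_2
2.4M civicrm_jenkins_wikimedia_fundraising_civicrm_7389_3
235M ibdata1
5.0M ib_logfile0
5.0M ib_logfile1
1.1M mysql
208K performance_schema
"""

# GNU du -h and df -h use powers of 1024.
UNIT_BYTES = {"K": 1024, "M": 1024**2, "G": 1024**3}

# Assumption: the same 256M tmpfs on /var/lib/mysql as in the df output above.
TMPFS_BYTES = 256 * 1024**2


def parse_human_size(size: str) -> float:
    """Convert a du -h size such as '235M' or '208K' to an approximate byte count."""
    if size[-1] in UNIT_BYTES:
        return float(size[:-1]) * UNIT_BYTES[size[-1]]
    return float(size)  # no suffix: already bytes


def sizes_by_entry(du_text: str) -> dict[str, float]:
    """Map each directory entry to its size in bytes."""
    sizes = {}
    for line in du_text.splitlines():
        if line.strip():
            size, entry = line.split(maxsplit=1)
            sizes[entry] = parse_human_size(size)
    return sizes


if __name__ == "__main__":
    share = sizes_by_entry(DU_OUTPUT)["ibdata1"] / TMPFS_BYTES
    print(f"ibdata1 fills {share:.1%} of the tmpfs")  # prints "ibdata1 fills 91.8% of the tmpfs"
```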

Since none of the databases on the affected boxes had any schema left, I've backed up the ibdata1 file and restarted MySQL. No hung transaction could have finished without its underlying database anyway.

Let me know if this helps with your problem; if not, we can take a deeper dive.

Ejegg added a comment. Oct 3 2018, 6:32 PM

@thcipriani thanks! It seems to help temporarily, but those files grow back to consume most of the disk pretty quickly. Any chance we can increase the size of that partition? And maybe have a nightly job to clear that file out, since none of those DBs need to persist?

Ejegg added a comment. Oct 3 2018, 7:47 PM

It appears to be mostly index pages. Here's the output of the modified innochecksum utility from https://bugs.mysql.com/bug.php?id=57611&files=1:

0 bad checksum
13528 FIL_PAGE_INDEX
398 FIL_PAGE_UNDO_LOG
62 FIL_PAGE_INODE
8 FIL_PAGE_IBUF_FREE_LIST
445 FIL_PAGE_TYPE_ALLOCATED
2 FIL_PAGE_IBUF_BITMAP
130 FIL_PAGE_TYPE_SYS
1 FIL_PAGE_TYPE_TRX_SYS
2 FIL_PAGE_TYPE_FSP_HDR
0 FIL_PAGE_TYPE_XDES
435 FIL_PAGE_TYPE_BLOB
0 FIL_PAGE_TYPE_ZBLOB
0 other
9 max index_id
undo type: 139 insert, 259 update, 0 other
undo state: 0 active, 250 cached, 0 to_free, 101 to_purge, 0 prepared, 47 other
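Tallying the page-type counts turns "mostly indexes" into a number. In InnoDB the rows themselves live in the clustered (primary key) index, so FIL_PAGE_INDEX pages cover table data as well as secondary indexes. This Python sketch assumes the default 16 KiB innodb_page_size, which the thread doesn't confirm. Under that assumption the pages add up to about 234.5 MiB, close to the 235M that du reported for ibdata1 earlier, and roughly 90% of them are index pages.

```python
from __future__ import annotations

import re

# Output of the modified innochecksum utility, copied from above.
INNOCHECKSUM_OUTPUT = """\
0 bad checksum
13528 FIL_PAGE_INDEX
398 FIL_PAGE_UNDO_LOG
62 FIL_PAGE_INODE
8 FIL_PAGE_IBUF_FREE_LIST
445 FIL_PAGE_TYPE_ALLOCATED
2 FIL_PAGE_IBUF_BITMAP
130 FIL_PAGE_TYPE_SYS
1 FIL_PAGE_TYPE_TRX_SYS
2 FIL_PAGE_TYPE_FSP_HDR
0 FIL_PAGE_TYPE_XDES
435 FIL_PAGE_TYPE_BLOB
0 FIL_PAGE_TYPE_ZBLOB
0 other
9 max index_id
undo type: 139 insert, 259 update, 0 other
undo state: 0 active, 250 cached, 0 to_free, 101 to_purge, 0 prepared, 47 other
"""

PAGE_SIZE = 16 * 1024  # assumption: default innodb_page_size
PAGE_LINE = re.compile(r"(\d+) (FIL_PAGE_\w+)")


def page_type_counts(text: str) -> dict[str, int]:
    """Collect '<count> FIL_PAGE_*' lines; summary lines such as 'max index_id' are skipped."""
    counts = {}
    for line in text.splitlines():
        match = PAGE_LINE.fullmatch(line.strip())
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def summarize(text: str) -> tuple[int, float, float]:
    """Return (total pages, share of FIL_PAGE_INDEX pages, total size in MiB)."""
    counts = page_type_counts(text)
    total_pages = sum(counts.values())
    index_share = counts["FIL_PAGE_INDEX"] / total_pages
    total_mib = total_pages * PAGE_SIZE / 1024**2
    return total_pages, index_share, total_mib


if __name__ == "__main__":
    pages, share, mib = summarize(INNOCHECKSUM_OUTPUT)
    # prints "15011 pages, 234.5 MiB, 90.1% FIL_PAGE_INDEX"
    print(f"{pages} pages, {mib:.1f} MiB, {share:.1%} FIL_PAGE_INDEX")
```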

Change 464594 had a related patch set uploaded (by Ejegg; owner: Ejegg):
[wikimedia/fundraising/crm@master] Delete a duplicate table with a big chunk of data

https://gerrit.wikimedia.org/r/464594

Ejegg added a comment. Oct 4 2018, 3:48 PM

...and the index data are mostly on tables that get manipulated during the tests. We just push a lot of data around to get a realistic setup for complex scenarios, e.g. de-duplication.

+----------------------------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-----------------+----------+----------------+---------+
| Name                       | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time         | Update_time | Check_time | Collation       | Checksum | Create_options | Comment |
+----------------------------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-----------------+----------+----------------+---------+
| civicrm_activity           | InnoDB |      10 | Compact    |    0 |              0 |       16384 |               0 |       196608 | 165675008 |              1 | 2018-10-03 18:01:07 | NULL        | NULL       | utf8_unicode_ci | NULL     |                |         |
| civicrm_address            | InnoDB |      10 | Compact    |    0 |              0 |       16384 |               0 |       163840 | 165675008 |              1 | 2018-10-03 18:01:07 | NULL        | NULL       | utf8_unicode_ci | NULL     |                |         |
| civicrm_contact            | InnoDB |      10 | Compact    |    2 |           8192 |       16384 |               0 |       327680 | 165675008 |              3 | 2018-10-03 18:01:07 | NULL        | NULL       | utf8_unicode_ci | NULL     |                |         |
| civicrm_contribution       | InnoDB |      10 | Compact    |    0 |              0 |       16384 |               0 |       245760 | 165675008 |              1 | 2018-10-03 18:01:07 | NULL        | NULL       | utf8_unicode_ci | NULL     |                |         |
| civicrm_contribution_recur | InnoDB |      10 | Compact    |    0 |              0 |       16384 |               0 |       147456 | 165675008 |              1 | 2018-10-03 18:01:07 | NULL        | NULL       | utf8_unicode_ci | NULL     |                |         |
| civicrm_loc_block          | InnoDB |      10 | Compact    |    0 |              0 |       16384 |               0 |       131072 | 165675008 |              1 | 2018-10-03 18:01:08 | NULL        | NULL       | utf8_unicode_ci | NULL     |                |         |
| civicrm_mailing            | InnoDB |      10 | Compact    |    0 |              0 |       16384 |               0 |       229376 | 165675008 |              1 | 2018-10-03 18:01:08 | NULL        | NULL       | utf8_unicode_ci | NULL     |                |         |
| civicrm_option_value       | InnoDB |      10 | Compact    |  861 |            152 |      131072 |               0 |       147456 | 165675008 |           1007 | 2018-10-03 18:01:08 | NULL        | NULL       | utf8_unicode_ci | NULL     |                |         |
| civicrm_participant        | InnoDB |      10 | Compact    |    0 |              0 |       16384 |               0 |       147456 | 165675008 |              1 | 2018-10-03 18:01:08 | NULL        | NULL       | utf8_unicode_ci | NULL     |                |         |
| civicrm_state_province     | InnoDB |      10 | Compact    | 4418 |             48 |      212992 |               0 |       294912 | 165675008 |          14033 | 2018-10-03 18:01:08 | NULL        | NULL       | utf8_unicode_ci | NULL     |                |         |
| civicrm_value_prospect_6   | InnoDB |      10 | Compact    |    0 |              0 |       16384 |               0 |       311296 | 165675008 |              1 | 2018-10-03 18:01:08 | NULL        | NULL       | utf8_unicode_ci | NULL     |                |         |
| wmf_contribution_extra     | InnoDB |      10 | Compact    |    0 |              0 |       16384 |               0 |       393216 | 165675008 |              1 | 2018-10-03 18:01:08 | NULL        | NULL       | utf8_unicode_ci | NULL     |                |         |
| wmf_donor                  | InnoDB |      10 | Compact    |    0 |              0 |       16384 |               0 |       606208 | 165675008 |              1 | 2018-10-03 18:01:08 | NULL        | NULL       | utf8_unicode_ci | NULL     |                |         |
+----------------------------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+-----------------+----------+----------------+---------+
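Adding up the table-status columns shows how small the live tables are compared with ibdata1. They hold about 0.5 MiB of data and 3.2 MiB of indexes, while every table reports the same Data_free of 165675008 bytes (158 MiB). For tables stored in the shared system tablespace, MySQL reports that tablespace's free space as Data_free. InnoDB also never shrinks ibdata1 on its own, so space freed by dropped test tables is reused but not returned to the tmpfs. The tuples below are transcribed by hand from the output above, and the function name is illustrative.

```python
from __future__ import annotations

MIB = 1024**2

# (Name, Data_length, Index_length, Data_free) transcribed from the table status above.
TABLES = [
    ("civicrm_activity", 16384, 196608, 165675008),
    ("civicrm_address", 16384, 163840, 165675008),
    ("civicrm_contact", 16384, 327680, 165675008),
    ("civicrm_contribution", 16384, 245760, 165675008),
    ("civicrm_contribution_recur", 16384, 147456, 165675008),
    ("civicrm_loc_block", 16384, 131072, 165675008),
    ("civicrm_mailing", 16384, 229376, 165675008),
    ("civicrm_option_value", 131072, 147456, 165675008),
    ("civicrm_participant", 16384, 147456, 165675008),
    ("civicrm_state_province", 212992, 294912, 165675008),
    ("civicrm_value_prospect_6", 16384, 311296, 165675008),
    ("wmf_contribution_extra", 16384, 393216, 165675008),
    ("wmf_donor", 16384, 606208, 165675008),
]


def summarize(tables: list[tuple[str, int, int, int]]) -> tuple[int, int, set[int]]:
    """Return total Data_length, total Index_length and the distinct Data_free values."""
    data = sum(row[1] for row in tables)
    index = sum(row[2] for row in tables)
    # Every row reports the same Data_free, consistent with all tables sharing the
    # system tablespace, so it is one figure rather than something to add up per table.
    free_values = {row[3] for row in tables}
    return data, index, free_values


if __name__ == "__main__":
    data, index, free_values = summarize(TABLES)
    print(f"data {data / MIB:.2f} MiB, indexes {index / MIB:.2f} MiB")
    for free in sorted(free_values):
        print(f"free space in tablespace: {free / MIB:.0f} MiB")
```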

I did find a dataset that we'd duplicated, and am deleting the obsolete copy. That wasn't one of the heavily indexed tables, though, so I don't think it'll shrink ibdata1 much.

Any chance we can get more space?

Change 464594 merged by jenkins-bot:
[wikimedia/fundraising/crm@master] Delete a duplicate table with a big chunk of data

https://gerrit.wikimedia.org/r/464594

Ejegg added a comment. Oct 16 2018, 8:59 PM

Those partitions still fill up pretty fast. @thcipriani, would it be possible to increase the size of /var/lib/mysql at all, at least until we can rewrite the test setup to work in containers?

Would it help to switch to file-per-table tablespaces (innodb_file_per_table)? See https://stackoverflow.com/questions/3456159/how-to-shrink-purge-ibdata1-file-in-mysql
ibdata1 is pretty big, given that the MySQL databases don't hold any persistent data.

greg added a subscriber: greg. Nov 1 2018, 8:00 PM

(I added this to the 'slipway' CI milestone to indicate that migrating this job to the Docker-based system would alleviate this issue: single-use Docker images instead of long-running executors.)

greg added a comment. Nov 29 2018, 4:42 AM


See also: T210287

greg triaged this task as Normal priority.