
Switch MySQL storage to tmpfs
Closed, Resolved (Public)

Event Timeline

Krinkle raised the priority of this task from to Medium.
Krinkle updated the task description.
Krinkle added subscribers: Krinkle, tstarling, ori.

Disk I/O on labs is not that good, and I think Precise instances have slightly lower I/O capabilities than Trusty ones. Instances run on different compute nodes, which might have different I/O load as well.

I have created T96249 as a tracking task. We should further tune the innodb settings as well.

Krinkle set Security to None.
Krinkle added a subscriber: coren.

Change 204528 had a related patch set uploaded (by Krinkle):
contint: Put mysql db on tmpfs for role::ci::slave::labs

https://gerrit.wikimedia.org/r/204528

Running slave-scripts/bin/mw-install-mysql.sh and slave-scripts/bin/mw-teardown-mysql.sh alternately on a slave with /var/lib/mysql as tmpfs and on another slave without tmpfs did not show any notable difference. I ran them several dozen times. On both nodes it took about 5-10 seconds most of the time.
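For anyone wanting to repeat this kind of comparison, here is a minimal timing sketch. It is not the method used above; the helper names `time_command` and `summarize` are made up for illustration, and it assumes the command can be run without arguments or extra environment.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=5):
    """Run cmd `runs` times and return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError when the command fails,
        # so a broken run is never recorded as a timing sample.
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (min, median, max) of a list of durations."""
    return min(durations), statistics.median(durations), max(durations)
```

Something like `summarize(time_command(["slave-scripts/bin/mw-install-mysql.sh"], runs=30))` on each slave would give comparable numbers, assuming the script runs standalone from that working directory.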

Since we can't reliably reproduce the minute-long stall from T96229, we'll have to see after deployment whether that stall was caused by an I/O bottleneck in the mysql datadir. If it's still there, we can investigate further. Perhaps the mysql tmpdir comes into play (it is still disk-bound, defaulting to /tmp).
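To check which of these directories are actually memory-backed on a given slave, something like the following sketch could help. The function name `filesystem_type` is hypothetical. It assumes Linux's /proc/mounts format, does not resolve symlinks, and ignores the octal escaping that /proc/mounts uses for spaces in paths.

```python
def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is the content of /proc/mounts. The longest matching
    mount point wins, i.e. the innermost mount containing `path`.
    """
    best_mount, best_type = "", None
    normalized = path.rstrip("/") or "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/")
        # Match the mount point itself or anything strictly below it,
        # so /var/lib/mysqlx does not match a mount at /var/lib/mysql.
        inside = normalized == mount_point or normalized.startswith(prefix + "/")
        # ">=" lets a later mount on the same mount point shadow an earlier one.
        if inside and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type
```

For example, `filesystem_type("/tmp", open("/proc/mounts").read())` would report whether the default mysql tmpdir sits on tmpfs or on a disk-backed filesystem.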

While installation showed little to no difference, test execution did run notably faster (as expected). Using:

php phpunit.php --with-phpunitdir /srv/deployment/integration/phpunit/vendor/phpunit/phpunit --exclude-group Broken,ParserFuzz,Stub includes/PrefixSearchTest.php

integration-slave-trusty-1014 (using regular disk for mysql datadir; depooled; no jobs running)

  • [phpunit w/ PrefixSearchTest.php] Time: 10.52 seconds, Memory: 19.08Mb
  • [phpunit w/ PrefixSearchTest.php] Time: 11.54 seconds, Memory: 19.08Mb
  • [phpunit w/ PrefixSearchTest.php] Time: 14.34 seconds, Memory: 19.08Mb

integration-slave-trusty-1012 (using tmpfs for mysql datadir; depooled; no jobs running)

  • [phpunit w/ PrefixSearchTest.php] Time: 6.62 seconds, Memory: 19.07Mb
  • [phpunit w/ PrefixSearchTest.php] Time: 7.33 seconds, Memory: 19.07Mb
  • [phpunit w/ PrefixSearchTest.php] Time: 6.95 seconds, Memory: 19.07Mb
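As a quick summary of the six samples above, this sketch computes the mean run time per slave and the resulting ratio. The variable names are made up for illustration, and three runs per node is too few to draw strong conclusions from.

```python
from statistics import mean

# Times in seconds, copied from the PrefixSearchTest.php runs above.
disk_seconds = [10.52, 11.54, 14.34]   # integration-slave-trusty-1014 (regular disk)
tmpfs_seconds = [6.62, 7.33, 6.95]     # integration-slave-trusty-1012 (tmpfs)

disk_mean = mean(disk_seconds)
tmpfs_mean = mean(tmpfs_seconds)
speedup = disk_mean / tmpfs_mean

print(f"disk mean:  {disk_mean:.2f} s")
print(f"tmpfs mean: {tmpfs_mean:.2f} s")
print(f"speedup:    {speedup:.2f}x")
```

That works out to roughly 12.1 seconds on disk versus 7.0 seconds on tmpfs, or about 1.7 times faster for this one test file.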

This was rolled out between 17:20 and 18:00 on 2014-04-16. I took samples from jobs for MediaWiki core master and wmf branches only (other branches, e.g. REL1_23, are not comparable). I also excluded builds that ran on the slave currently being used for libeatmydata (T96308).

https://integration.wikimedia.org/ci/job/mediawiki-phpunit-zend/
Before:

  • [4626] Time: 12.76 minutes, Memory: 1038.25Mb; Tests: 9637, Assertions: 405813, Skipped: 21
  • [4683] Time: 24.16 minutes, Memory: 1034.50Mb; Tests: 9685, Assertions: 369499, Skipped: 21
  • [4801] Time: 20.04 minutes, Memory: 1041.00Mb; Tests: 9859, Assertions: 232016, Skipped: 21
  • [4811] Time: 9.12 minutes, Memory: 1042.00Mb; Tests: 9859, Assertions: 262723, Skipped: 21

After:

  • [4818] Time: 8.54 minutes, Memory: 1035.75Mb; Tests: 9685, Assertions: 410804, Skipped: 21
  • [4822] Time: 7.53 minutes, Memory: 1036.25Mb; Tests: 9685, Assertions: 404107, Skipped: 21
  • [4830] Time: 6.72 minutes, Memory: 1042.00Mb; Tests: 9879, Assertions: 239291, Skipped: 21

https://integration.wikimedia.org/ci/job/mediawiki-phpunit-hhvm/
Before:

  • [6433] Time: 10.14 minutes, Memory: 772.88Mb; Tests: 9688, Assertions: 642820, Skipped: 15
  • [6435] Time: 9.12 minutes, Memory: 776.35Mb; Tests: 9862, Assertions: 393158, Skipped: 15
  • [6437] Time: 4.84 minutes, Memory: 776.13Mb; Tests: 9862, Assertions: 340143, Skipped: 15

After:

  • [6481] Time: 4.08 minutes, Memory: 776.17Mb; Tests: 9862, Assertions: 295498, Skipped: 15
  • [6483] Time: 3.2 minutes, Memory: 776.70Mb; Tests: 9862, Assertions: 409342, Skipped: 15
  • [6488] Time: 2.66 minutes, Memory: 773.68Mb; Tests: 9688, Assertions: 833676, Skipped: 15
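To put numbers on the before/after samples for both jobs, here is a small sketch that computes the mean build time and the spread (max minus min) per group. The `samples` layout and the `summarize` helper are illustrative only, and the sample sizes are small.

```python
from statistics import mean

# Build times in minutes, copied from the samples listed above.
samples = {
    "zend": {"before": [12.76, 24.16, 20.04, 9.12], "after": [8.54, 7.53, 6.72]},
    "hhvm": {"before": [10.14, 9.12, 4.84], "after": [4.08, 3.2, 2.66]},
}


def summarize(minutes):
    """Return (mean, spread), where spread is max minus min."""
    return mean(minutes), max(minutes) - min(minutes)


for job, phases in samples.items():
    for phase in ("before", "after"):
        avg, spread = summarize(phases[phase])
        print(f"{job} {phase}: mean {avg:.2f} min, spread {spread:.2f} min")
```

On these samples the Zend mean drops from roughly 16.5 to 7.6 minutes and the HHVM mean from roughly 8.0 to 3.3 minutes, with the spread shrinking in both cases.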

The trend shows that build times are shorter and more stable (fewer extremes). The arrow indicates the switch. This switch appears further back on the HHVM build graph because those are triggered more often (we only run Zend builds during the gate pipeline).

For Precise/Zend:

Screen_Shot_2015-04-17_at_15.22.41.png (288×505 px, 87 KB)

For Trusty/HHVM:
Screen_Shot_2015-04-17_at_15.24.14.png (289×499 px, 69 KB)

Excellent! I love the arrows on the build time graphs.

This is most probably the cause of T126699, i.e. mysql randomly restarting, losing tables, etc.

I was wrong to reopen this task. It was completed and has run well for a while.

Change 204528 merged by Filippo Giunchedi:
contint: Put mysql db on tmpfs for role::ci::slave::labs

https://gerrit.wikimedia.org/r/204528