Switch MySQL storage to tmpfs
Closed, ResolvedPublic

Event Timeline

Krinkle created this task. Apr 16 2015, 5:16 AM
Krinkle raised the priority of this task to Normal.
Krinkle updated the task description.
Krinkle added subscribers: Krinkle, tstarling, ori.
Restricted Application added a subscriber: Aklapper. Apr 16 2015, 5:16 AM
hashar added a subscriber: hashar. Apr 16 2015, 8:58 AM

Disk I/O on labs is not that good, and I think Precise instances have slightly lower I/O capabilities than Trusty ones. Instances also run on different compute nodes, which might have different I/O load.

I have created T96249 as a tracking task. We should further tune the innodb settings as well.

Krinkle claimed this task. Apr 16 2015, 5:09 PM
Krinkle set Security to None.
Krinkle added a subscriber: coren.

Change 204528 had a related patch set uploaded (by Krinkle):
contint: Put mysql db on tmpfs for role::ci::slave::labs

https://gerrit.wikimedia.org/r/204528

Krinkle added a comment. Edited Apr 16 2015, 6:56 PM

Running slave-scripts/bin/mw-install-mysql.sh and slave-scripts/bin/mw-teardown-mysql.sh in alternation on a slave with /var/lib/mysql on tmpfs and on another slave without tmpfs did not show any notable difference. I ran them several dozen times. On both nodes it took about 5-10 seconds most of the time.
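
A comparison like this could be scripted roughly as follows. This is a minimal Python sketch, not the procedure actually used here: the helper names time_command and alternate are hypothetical, and it assumes the two scripts can be invoked directly, which may not hold outside a Jenkins job environment.

```python
import subprocess
import time


def time_command(argv):
    """Run one command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)  # raises CalledProcessError on non-zero exit
    return time.perf_counter() - start


def alternate(install_argv, teardown_argv, rounds):
    """Run install, then teardown, `rounds` times.

    Returns a list of (install_seconds, teardown_seconds) pairs.
    """
    return [
        (time_command(install_argv), time_command(teardown_argv))
        for _ in range(rounds)
    ]
```

Calling alternate(["slave-scripts/bin/mw-install-mysql.sh"], ["slave-scripts/bin/mw-teardown-mysql.sh"], 30) on each slave would give two lists of timings that can be compared directly.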

Since we can't reliably reproduce the minute-long stall from T96229, we'll have to see after deployment whether it was caused by an I/O bottleneck in the mysql datadir. If it's still there, we can investigate further. Perhaps the mysql tmpdir comes into play (it is still disk-bound, defaulting to /tmp).
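
To confirm which filesystem actually backs the datadir and the tmpdir on a given slave, the mount table can be inspected. The following is a sketch under assumptions: the function name filesystem_type is hypothetical, it expects text in the format of /proc/mounts on Linux, and it ignores the octal escaping that /proc/mounts applies to spaces in paths.

```python
import os


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is text in /proc/mounts format:
    "<device> <mount point> <fs type> <options> <dump> <pass>" per line.
    The longest matching mount point wins; on ties the later entry wins,
    because a later mount on the same point shadows the earlier one.
    """
    path = os.path.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        inside = path == mount_point or path.startswith(mount_point.rstrip("/") + "/")
        if inside and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type
```

On a slave, reading /proc/mounts and calling filesystem_type("/var/lib/mysql", text) and filesystem_type("/tmp", text) would show whether only the datadir moved to tmpfs while the tmpdir stayed on disk.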

While installation showed little to no difference, test execution was notably faster (as expected). Using:

php phpunit.php --with-phpunitdir /srv/deployment/integration/phpunit/vendor/phpunit/phpunit --exclude-group Broken,ParserFuzz,Stub includes/PrefixSearchTest.php

integration-slave-trusty-1014 (using regular disk for mysql datadir; depooled; no jobs running)

  • [phpunit w/ PrefixSearchTest.php] Time: 10.52 seconds, Memory: 19.08Mb
  • [phpunit w/ PrefixSearchTest.php] Time: 11.54 seconds, Memory: 19.08Mb
  • [phpunit w/ PrefixSearchTest.php] Time: 14.34 seconds, Memory: 19.08Mb

integration-slave-trusty-1012 (using tmpfs for mysql datadir; depooled; no jobs running)

  • [phpunit w/ PrefixSearchTest.php] Time: 6.62 seconds, Memory: 19.07Mb
  • [phpunit w/ PrefixSearchTest.php] Time: 7.33 seconds, Memory: 19.07Mb
  • [phpunit w/ PrefixSearchTest.php] Time: 6.95 seconds, Memory: 19.07Mb
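
For a rough sense of the difference, the six timings above can be averaged. This is only an illustrative Python calculation on three samples per node, so the ratio should be read as indicative rather than as a benchmark result.

```python
from statistics import mean

# Wall-clock times in seconds, copied from the runs listed above.
disk_seconds = [10.52, 11.54, 14.34]   # integration-slave-trusty-1014
tmpfs_seconds = [6.62, 7.33, 6.95]     # integration-slave-trusty-1012

disk_mean = mean(disk_seconds)
tmpfs_mean = mean(tmpfs_seconds)
speedup = disk_mean / tmpfs_mean

print(f"disk mean:  {disk_mean:.2f} s")    # 12.13 s
print(f"tmpfs mean: {tmpfs_mean:.2f} s")   # 6.97 s
print(f"speedup:    {speedup:.2f}x")       # 1.74x
```
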
Krinkle updated the task description. Apr 16 2015, 9:57 PM

This was rolled out between 17:20 and 18:00 on 2015-04-16. I took samples from jobs for MediaWiki core master and wmf branches (e.g. REL1_23 is not comparable). I also excluded builds that ran on the slave currently being used for libeatmydata (T96308).

https://integration.wikimedia.org/ci/job/mediawiki-phpunit-zend/
Before:

  • [4626] Time: 12.76 minutes, Memory: 1038.25Mb; Tests: 9637, Assertions: 405813, Skipped: 21
  • [4683] Time: 24.16 minutes, Memory: 1034.50Mb; Tests: 9685, Assertions: 369499, Skipped: 21
  • [4801] Time: 20.04 minutes, Memory: 1041.00Mb; Tests: 9859, Assertions: 232016, Skipped: 21
  • [4811] Time: 9.12 minutes, Memory: 1042.00Mb; Tests: 9859, Assertions: 262723, Skipped: 21

After:

  • [4818] Time: 8.54 minutes, Memory: 1035.75Mb; Tests: 9685, Assertions: 410804, Skipped: 21
  • [4822] Time: 7.53 minutes, Memory: 1036.25Mb; Tests: 9685, Assertions: 404107, Skipped: 21
  • [4830] Time: 6.72 minutes, Memory: 1042.00Mb; Tests: 9879, Assertions: 239291, Skipped: 21

https://integration.wikimedia.org/ci/job/mediawiki-phpunit-hhvm/
Before:

  • [6433] Time: 10.14 minutes, Memory: 772.88Mb; Tests: 9688, Assertions: 642820, Skipped: 15
  • [6435] Time: 9.12 minutes, Memory: 776.35Mb; Tests: 9862, Assertions: 393158, Skipped: 15
  • [6437] Time: 4.84 minutes, Memory: 776.13Mb; Tests: 9862, Assertions: 340143, Skipped: 15

After:

  • [6481] Time: 4.08 minutes, Memory: 776.17Mb; Tests: 9862, Assertions: 295498, Skipped: 15
  • [6483] Time: 3.2 minutes, Memory: 776.70Mb; Tests: 9862, Assertions: 409342, Skipped: 15
  • [6488] Time: 2.66 minutes, Memory: 773.68Mb; Tests: 9688, Assertions: 833676, Skipped: 15
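
The sampled build times can be summarized by mean and by spread (longest minus shortest build). This is a small Python sketch over the handful of builds listed above, so it illustrates the direction of the change rather than measuring it precisely; the helper name summarize is hypothetical.

```python
from statistics import mean


def summarize(minutes):
    """Return the mean and the spread (max - min) of a list of build times."""
    return {"mean": mean(minutes), "spread": max(minutes) - min(minutes)}


# Build times in minutes, copied from the samples listed above.
samples = {
    "zend before": [12.76, 24.16, 20.04, 9.12],
    "zend after": [8.54, 7.53, 6.72],
    "hhvm before": [10.14, 9.12, 4.84],
    "hhvm after": [4.08, 3.2, 2.66],
}

for label, minutes in samples.items():
    stats = summarize(minutes)
    print(f"{label:12s} mean {stats['mean']:5.2f} min, spread {stats['spread']:5.2f} min")
```

For these samples the mean drops from 16.52 to 7.60 minutes for Zend and from 8.03 to 3.31 minutes for HHVM, while the spread shrinks from 15.04 to 1.82 minutes and from 5.30 to 1.42 minutes respectively.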

The trend shows that build times are shorter and more stable (fewer extremes). The arrow indicates the switch. This switch appears further back on the HHVM build graph because those are triggered more often (we only run Zend builds during the gate pipeline).

For Precise/Zend: [build time graph]

For Trusty/HHVM: [build time graph]

Excellent! I love the arrows on the build time graphs.

Krinkle moved this task from Backlog to Recently announced on the Developer-notice board.
Krinkle removed a subscriber: gerritbot.
hashar reopened this task as Open. Feb 17 2016, 5:35 PM

This is most probably the cause of T126699, i.e. mysql randomly restarting, losing tables, etc.

hashar closed this task as Resolved. Feb 18 2016, 10:59 PM

I was wrong to reopen this task. It was completed and ran well for a while.

Change 204528 merged by Filippo Giunchedi:
contint: Put mysql db on tmpfs for role::ci::slave::labs

https://gerrit.wikimedia.org/r/204528