Page MenuHomePhabricator

Jenkins jobs using MediaWiki frequently fail due to database locks
Closed, ResolvedPublic40 Story Points

Description

It doesn't happen every time, but often. It's intermittent. Even within a single job some writes succeed, and then another one fails.

Tue Feb 10 5:58:38 UTC 2015
	integration-slave1006
	build276
	SqlBagOStuff::set/single-row
	5
	database is locked
REPLACE INTO objectcache (keyname,value,exptime) VALUES ('build276:resourceloader:modulemodifiedhash:mediawiki.language.data:ade3504093df2f931359f1c5028d085d',x'15cacb09c0200c00d05d3241cc879a388d54450f4221bd95ee5efaceaf3af9132e0eb3c68412cee4505b674541e336681827561be954a4dc306bfb9f39dc6bf7b8ebbea02c4f42ac7258cae5fd00','20380119031407')
DatabaseBase::query: Writes done: REPLACE INTO objectcache (keyname,value,exptime) VALUES ('X')
SQL ERROR: database is locked
[SQLBagOStuff] DBError: A database error has occurred. Did you forget to run maintenance/update.php after upgrading?  See: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script
Query: REPLACE INTO objectcache (keyname,value,exptime) VALUES ('build276:resourceloader:modulemodifiedhash:mediawiki.language.data:ade3504093df2f931359f1c5028d085d',x'15cacb09c0200c00d05d3241cc879a388d54450f4221bd95ee5efaceaf3af9132e0eb3c68412cee4505b674541e336681827561be954a4dc306bfb9f39dc6bf7b8ebbea02c4f42ac7258cae5fd00','20380119031407')
Function: SqlBagOStuff::set/single-row
Error: 5 database is locked

[SQLBagOStuff] SqlBagOStuff::handleWriteError: ignoring query error
LoadBalancer::getConnection: using server  for group
LoadBalancer::getConnection: using server  for group
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
Connected to database 0 at
DatabaseBase::query: Writes done: REPLACE INTO objectcache (keyname,value,exptime) VALUES ('X')
SQL ERROR (ignored): database is locked
DatabaseBase::query: Writes done: DELETE FROM l10n_cache WHERE lc_lang = 'X'
LoadBalancer::getConnection: using server  for group
LoadBalancer::getConnection: using server  for group
LoadBalancer::getConnection: using server  for group
LocalisationCache::isExpired(ar): cache missing, need to make one
DatabaseBase::query: Writes done: INSERT INTO l10n_cache (lc_lang,lc_key,lc_value) VALUES ('X')
DatabaseBase::query: Writes done: INSERT INTO l10n_cache (lc_lang,lc_key,lc_value) VALUES ('X')
LocalisationCache::recache: got localisation for ar from source
DatabaseBase::query: Writes done: INSERT INTO l10n_cache (lc_lang,lc_key,lc_value) VALUES ('X')
DatabaseBase::query: Writes done: INSERT INTO l10n_cache (lc_lang,lc_key,lc_value) VALUES ('X')

Related Objects

Event Timeline

Krinkle created this task.Feb 10 2015, 11:05 PM
Krinkle raised the priority of this task from to Needs Triage.
Krinkle updated the task description. (Show Details)
Krinkle added a subscriber: Krinkle.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 10 2015, 11:05 PM
Krinkle triaged this task as High priority.Feb 10 2015, 11:12 PM
Krinkle set Security to None.
Krinkle added a comment.EditedFeb 12 2015, 3:46 AM

During a build, the tmpfs looks like this:

$ l
total 1.3M
drwxrwxrwx  3 jenkins-deploy wikidev   100 Feb 12 03:43 ./
drwxrwxrwt 11 root           root      220 Feb 12 03:13 ../
-rw-rw-rw-  1 jenkins-deploy wikidev  1.3M Feb 12 03:43 build361.sqlite
-rw-rw-rw-  1 www-data       www-data  16K Feb 12 03:43 build361.sqlite-journal
drwxr-xr-x  2 www-data       www-data   40 Feb 12 03:42 locks/
[03:43 UTC] krinkle at integration-slave1006.eqiad.wmflabs in /mnt/home/jenkins-deploy/tmpfs/mediawiki-core-qunit-karma
$ l
total 5.3M
drwxrwxrwx  3 jenkins-deploy wikidev    80 Feb 12 03:43 ./
drwxrwxrwt 11 root           root      220 Feb 12 03:13 ../
-rw-rw-rw-  1 jenkins-deploy wikidev  5.3M Feb 12 03:43 build361.sqlite
drwxr-xr-x  2 www-data       www-data   40 Feb 12 03:42 locks/
[03:44 UTC] krinkle at integration-slave1006.eqiad.wmflabs in /mnt/home/jenkins-deploy/tmpfs/mediawiki-core-qunit-karma
$ l
total 0
drwxrwxrwx  3 jenkins-deploy wikidev   60 Feb 12 03:44 ./
drwxrwxrwt 11 root           root     220 Feb 12 03:13 ../
drwxr-xr-x  2 www-data       www-data  40 Feb 12 03:42 locks/

The sqlite file is owned by jenkins-deploy, but read-writable by group and world. So that shouldn't be an issue.

Besides, the job is consistently passing on gallium/lanthanum. And only sometimes failing on labs instances.

Krinkle added a comment.EditedFeb 12 2015, 5:28 AM

I've ruled out permissions as a possible cause. Digging further in the logs, build logs for failing jobs tend to be much more interleaved then those of passing jobs.

https://integration.wikimedia.org/ci/job/mediawiki-core-qunit-karma/361/ (FAILING)
Heavily filtered/reduced log:

Start request GET /jenkins-mediawiki-core-qunit-karma-361/index.php?title=Special:JavaScriptTest/qunit/export
..
Start request GET /jenkins-mediawiki-core-qunit-karma-361/load.php?debug=false&lang=en&modules=jquery.accessKeyLabel%2CautoEllipsis%2CbyteLength%2CbyteLimit%2CcheckboxShiftClick%2Cclient%2Ccolor%2CcolorUtil%2Ccookie%2CgetAttrs%2Chidpi%2ChighlightText%2Clocalize%2CmakeCollapsible%2Cmw-jump%2CmwExtension%2Cplaceholder%2CtabIndex%2Ctablesorter%2CtextSelection%7Cmediawiki.Title%2CUri%2Capi%2Ccldr%2Ccookie%2CjqueryMsg%2Clanguage%2Cnotify%2Cspecial%2Ctemplate%2Ctoc%2Cuser%2Cutil%7Cmediawiki.api.category%2Cparse%2Cwatch%7Cmediawiki.language.data%2Cinit%2Cmonths%7Cmediawiki.libs.pluralruleparser%7Cmediawiki.page.ready%2Cstartup%7Cmediawiki.special.recentchanges%7Ctest.mediawiki.qunit.suites%2Ctestrunner%7Ctest.sinonjs%7Cuser.defaults&skin=fallback&version=20150212T034325Z&*
HTTP HEADERS:
[caches] main: EmptyBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
[caches] LocalisationCache: using store LCStoreDB
Fully initialised
LoadBalancer::getConnection: using server  for group
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
Connected to database 0 at
LoadBalancer::getConnection: using server  for group
LoadBalancer::getConnection: using server  for group
..
Unstubbing $wgLang on call of $wgLang::getCode from Message::inLanguage
IP: 127.0.0.1
[MessageCache] MessageCache::load: Loading en... got from global cache
DatabaseBase::query: Writes done: INSERT OR IGNORE INTO msg_resource (mr_lang,mr_resource,mr_blob,mr_timestamp) VALUES ('X')
..
LoadBalancer::getConnection: using server  for group
..
Start request GET /jenkins-mediawiki-core-qunit-karma-361/load.php?skin=fallback&lang=en&debug=false&modules=mediawiki.language.data%7Cmediawiki.language&only=scripts&_=
HTTP HEADERS:
[caches] main: EmptyBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
[caches] LocalisationCache: using store LCStoreDB
Fully initialised
Start request GET /jenkins-mediawiki-core-qunit-karma-361/load.php?skin=fallback&lang=ar&debug=false&modules=mediawiki.language.data%7Cmediawiki.language&only=scripts&_=
HTTP HEADERS:
[caches] main: EmptyBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
[caches] LocalisationCache: using store LCStoreDB
Fully initialised
Start request GET /jenkins-mediawiki-core-qunit-karma-361/load.php?skin=fallback&lang=jp&debug=false&modules=mediawiki.language.data%7Cmediawiki.language&only=scripts&_=
HTTP HEADERS:
[caches] main: EmptyBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
[caches] LocalisationCache: using store LCStoreDB
Start request GET /jenkins-mediawiki-core-qunit-karma-361/load.php?skin=fallback&lang=zh&debug=false&modules=mediawiki.language.data%7Cmediawiki.language&only=scripts&_=
HTTP HEADERS:
Fully initialised
[caches] main: EmptyBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
[caches] LocalisationCache: using store LCStoreDB
Connected to database 0 at
Fully initialised
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
Start request GET /jenkins-mediawiki-core-qunit-karma-361/load.php?skin=fallback&lang=fr&debug=false&modules=mediawiki.language.data%7Cmediawiki.language&only=scripts&_=
HTTP HEADERS:
Connected to database 0 at
[caches] main: EmptyBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
[caches] LocalisationCache: using store LCStoreDB
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
Connected to database 0 at
LocalisationCache::isExpired(jp): cache missing, need to make one
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
Connected to database 0 at
Fully initialised
LocalisationCache::isExpired(zh): cache missing, need to make one
LocalisationCache::recache: got localisation for zh from source
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
LocalisationCache::isExpired(ar): cache missing, need to make one
LocalisationCache::recache: got localisation for ar from source
Connected to database 0 at
SQL ERROR: database is locked
[SQLBagOStuff] DBError: A database error has occurred. Did you forget to run maintenance/update.php after upgrading?  See: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script
Query: REPLACE INTO objectcache (keyname,value,exptime) VALUES ('build361:resourceloader:filter:minify-js:7:13e5f34230336e2347ec3d1b2a411dae',x'..','20380119031407')
Function: SqlBagOStuff::set/single-row
Error: 5 database is locked
[SQLBagOStuff] SqlBagOStuff::handleWriteError: ignoring query error
LocalisationCache::isExpired(fr): cache missing, need to make one
LocalisationCache::recache: got localisation for fr from source
DatabaseBase::query: Writes done: DELETE FROM l10n_cache WHERE lc_lang = 'X'
DatabaseBase::query: Writes done: DELETE FROM msg_resource
DatabaseBase::query: Writes done: DELETE FROM msg_resource_links
..
LocalisationCache::isExpired(zh-hans): cache missing, need to make one
LocalisationCache::recache: got localisation for zh-hans from source
DatabaseBase::query: Writes done: DELETE FROM l10n_cache WHERE lc_lang = 'X'
DatabaseBase::query: Writes done: DELETE FROM msg_resource
DatabaseBase::query: Writes done: DELETE FROM msg_resource_links
..

Compared to:
https://integration.wikimedia.org/ci/job/mediawiki-core-qunit-karma/362/ (PASSING)

Start request GET /jenkins-mediawiki-core-qunit-karma-362/index.php?title=Special:JavaScriptTest/qunit/export
..
Start request GET /jenkins-mediawiki-core-qunit-karma-362/load.php?debug=false&lang=en&modules=jquery.accessKeyLabel%2CautoEllipsis%2CbyteLength%2CbyteLimit%2CcheckboxShiftClick%2Cclient%2Ccolor%2CcolorUtil%2Ccookie%2CgetAttrs%2Chidpi%2ChighlightText%2Clocalize%2CmakeCollapsible%2Cmw-jump%2CmwExtension%2Cplaceholder%2CtabIndex%2Ctablesorter%2CtextSelection%7Cmediawiki.Title%2CUri%2Capi%2Ccldr%2Ccookie%2CjqueryMsg%2Clanguage%2Cnotify%2Cspecial%2Ctemplate%2Ctoc%2Cuser%2Cutil%7Cmediawiki.api.category%2Cparse%2Cwatch%7Cmediawiki.language.data%2Cinit%2Cmonths%7Cmediawiki.libs.pluralruleparser%7Cmediawiki.page.ready%2Cstartup%7Cmediawiki.special.recentchanges%7Ctest.mediawiki.qunit.suites%2Ctestrunner%7Ctest.sinonjs%7Cuser.defaults&skin=fallback&version=20150212T034919Z&*
HTTP HEADERS:
[caches] main: EmptyBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
[caches] LocalisationCache: using store LCStoreDB
Fully initialised
LoadBalancer::getConnection: using server  for group
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
Connected to database 0 at
LoadBalancer::getConnection: using server  for group
LoadBalancer::getConnection: using server  for group
..
Unstubbing $wgLang on call of $wgLang::getCode from Message::inLanguage
IP: 127.0.0.1
[MessageCache] MessageCache::load: Loading en... got from global cache
DatabaseBase::query: Writes done: INSERT OR IGNORE INTO msg_resource (mr_lang,mr_resource,mr_blob,mr_timestamp) VALUES ('X')
..
LoadBalancer::getConnection: using server  for group
..
Start request GET /jenkins-mediawiki-core-qunit-karma-362/load.php?skin=fallback&lang=fr&debug=false&modules=mediawiki.language.data%7Cmediawiki.language&only=scripts&_=
HTTP HEADERS:
Start request GET /jenkins-mediawiki-core-qunit-karma-362/load.php?skin=fallback&lang=en&debug=false&modules=mediawiki.language.data%7Cmediawiki.language&only=scripts&_=
HTTP HEADERS:
[caches] main: EmptyBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
[caches] main: EmptyBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
[caches] LocalisationCache: using store LCStoreDB
[caches] LocalisationCache: using store LCStoreDB
Start request GET /jenkins-mediawiki-core-qunit-karma-362/load.php?skin=fallback&lang=ar&debug=false&modules=mediawiki.language.data%7Cmediawiki.language&only=scripts&_=
HTTP HEADERS:
[caches] main: EmptyBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
[caches] LocalisationCache: using store LCStoreDB
Fully initialised
..
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
Connected to database 0 at
Start request GET /jenkins-mediawiki-core-qunit-karma-362/load.php?skin=fallback&lang=jp&debug=false&modules=mediawiki.language.data%7Cmediawiki.language&only=scripts&_=
HTTP HEADERS:
[caches] main: EmptyBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
Connected to database 0 at
[caches] LocalisationCache: using store LCStoreDB
Fully initialised
LocalisationCache::isExpired(ar): cache missing, need to make one
LocalisationCache::isExpired(fr): cache missing, need to make one
LocalisationCache::recache: got localisation for ar from source
LocalisationCache::recache: got localisation for fr from source
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
Connected to database 0 at
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
Connected to database 0 at
LocalisationCache::isExpired(jp): cache missing, need to make one
Start request GET /jenkins-mediawiki-core-qunit-karma-362/load.php?skin=fallback&lang=zh&debug=false&modules=mediawiki.language.data%7Cmediawiki.language&only=scripts&_=
HTTP HEADERS:
[caches] main: EmptyBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
[caches] LocalisationCache: using store LCStoreDB
Fully initialised
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
Connected to database 0 at
LocalisationCache::isExpired(zh): cache missing, need to make one
LocalisationCache::recache: got localisation for zh from source
SQL ERROR: database is locked
[SQLBagOStuff] DBError: A database error has occurred. Did you forget to run maintenance/update.php after upgrading?  See: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script
Query: REPLACE INTO objectcache (keyname,value,exptime) VALUES ('build362:resourceloader:filter:minify-js:7:13e5f34230336e2347ec3d1b2a411dae',x'ad585b73a338167ecfafa0ed2983628560..b7871335f788bc50f74beb8794ba3c57af6382737b3d99a50ebf27a14fc1f','20380119031407')
Function: SqlBagOStuff::set/single-row
Error: 5 database is locked
[SQLBagOStuff] SqlBagOStuff::handleWriteError: ignoring query error
DatabaseBase::query: Writes done: DELETE FROM l10n_cache WHERE lc_lang = 'X'
DatabaseBase::query: Writes done: DELETE FROM msg_resource
DatabaseBase::query: Writes done: DELETE FROM msg_resource_links
..
LocalisationCache::isExpired(zh-hans): cache missing, need to make one
LocalisationCache::recache: got localisation for zh-hans from source
DatabaseBase::query: Writes done: DELETE FROM l10n_cache WHERE lc_lang = 'X'
DatabaseBase::query: Writes done: DELETE FROM msg_resource
DatabaseBase::query: Writes done: DELETE FROM msg_resource_links
..
Start request GET /jenkins-mediawiki-core-qunit-karma-362/load.php?skin=fallback&lang=nl&debug=false&modules=mediawiki.language.data%7Cmediawiki.language&only=scripts&_=
HTTP HEADERS:
[caches] main: EmptyBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
[caches] LocalisationCache: using store LCStoreDB
Fully initialised
Start request GET /jenkins-mediawiki-core-qunit-karma-362/load.php?skin=fallback&lang=ml&debug=false&modules=mediawiki.language.data%7Cmediawiki.language&only=scripts&_=
HTTP HEADERS:
[caches] main: EmptyBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
Start request GET /jenkins-mediawiki-core-qunit-karma-362/load.php?skin=fallback&lang=hi&debug=false&modules=mediawiki.language.data%7Cmediawiki.language&only=scripts&_=
HTTP HEADERS:
[caches] main: EmptyBagOStuff, message: SqlBagOStuff, parser: SqlBagOStuff
[caches] LocalisationCache: using store LCStoreDB
[caches] LocalisationCache: using store LCStoreDB
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
Connected to database 0 at
Fully initialised
Fully initialised
LocalisationCache::isExpired(nl): cache missing, need to make one
LocalisationCache::recache: got localisation for nl from source
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
DatabaseBase::query: Writes done: PRAGMA case_sensitive_like = N
Connected to database 0 at
Connected to database 0 at
LocalisationCache::isExpired(hi): cache missing, need to make one
LocalisationCache::isExpired(ml): cache missing, need to make one
LocalisationCache::recache: got localisation for hi from source
LocalisationCache::recache: got localisation for ml from source
DatabaseBase::query: Writes done: DELETE FROM l10n_cache WHERE lc_lang = 'X'
DatabaseBase::query: Writes done: DELETE FROM msg_resource
DatabaseBase::query: Writes done: DELETE FROM msg_resource_links
..

The failing job has messages interleaved from multiple processes writing to this log file.

A unit test in MediaWiki core is unintentionally making these concurrent load.php requests. Possible solution at https://gerrit.wikimedia.org/r/#/c/190154/

Passing array( PDO::ATTR_TIMEOUT => $timeout ) as the $options argument in the PDO constructor, with $timeout set to some sensible value, should reduce the incidence of SQLITE_BUSY errors by making the library wait for a lock instead of immediately returning an error.

gerritbot added a subscriber: gerritbot.

Change 190851 had a related patch set uploaded (by Krinkle):
db: Enable ATTR_TIMEOUT for DatabaseSqlite

https://gerrit.wikimedia.org/r/190851

Patch-For-Review

So it looks like ATTR_TIMEOUT doesn't help. Krinkle helped me to reproduce the issue on integration-slave1009. My slightly reduced test case looks like this:

#!/bin/bash

export MW_SERVER=http://localhost:9412
export MW_SCRIPT_PATH=/jenkins-mwext-VisualEditor-qunit-karma-debug-2 
export CHROME_BIN=/usr/bin/chromium-browser
export DISPLAY=:94
export CHROME_BIN=/usr/bin/chromium-browser
export DISPLAY=:94

echo "delete from msg_resource_links;
delete from msg_resource;" | \
	sudo -u jenkins-deploy php5 /mnt/jenkins-workspace/workspace/mwext-VisualEditor-qunit-karma-debug/src/maintenance/sql.php

for url in $(<~/urls.txt); do
	curl -s "$url" >&/dev/null &
done
wait

Where urls.txt is P298.

I ran strace attaching to all Apache children simultaneously while running this script. Ignoring gettimeofday() calls, which I think were from profiling, the observed sequence was:

Acquire PENDING lock

[pid 23850] 1424150534.958507 fcntl(22, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824, len=1} <unfinished ...>
[pid 23850] 1424150534.958707 <... fcntl resumed> ) = 0

Acquire SHARED lock

[pid 23850] 1424150534.958763 fcntl(22, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741826, len=510} <unfinished ...>
[pid 23850] 1424150534.958926 <... fcntl resumed> ) = 0

Release PENDING lock

[pid 23850] 1424150534.958988 fcntl(22, F_SETLK, {type=F_UNLCK, whence=SEEK_SET, start=1073741824, len=1} <unfinished ...>
[pid 23850] 1424150534.959047 <... fcntl resumed> ) = 0

Query RESERVED lock

[pid 23850] 1424150534.960212 fcntl(22, F_GETLK <unfinished ...>
[pid 23850] 1424150534.960334 <... fcntl resumed> , {type=F_WRLCK, whence=SEEK_SET, start=1073741825, len=1, pid=24091}) = 0

It is held by PID 24091 so is not available. Then it reads from the DB for a while, and writes some MW debug log messages. After about 300ms of this, it tries to acquire the RESERVED lock, which fails:

[pid 23850] 1424150535.291681 fcntl(22, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=1073741825, len=1}) = -1 EAGAIN (Resource temporarily unavailable)

And then it immediately writes to the DB error log:

[pid 23850] 1424150535.294242 write(23, "Tue Feb 17 5:22:15 UTC 2015\tintegration-slave1009\tbuild2\tMessageBlobStore::insertMessageBlob/single-row\t\t5\tdatabase is locked\tINSERT OR IGNORE INTO msg_resource (mr_lang,mr_resource,mr_blob,mr_timestamp) VALUES ('en','jquery.ui.button','{}','20150217052215')\n", 259) = 259

No sleeping is attempted. Contrast this to an attempt by another process to obtain the SHARED lock. A sleep is triggered:

[pid 22740] 1424150535.065787 fcntl(22, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824, len=1} <unfinished ...>
[pid 22740] 1424150535.065853 <... fcntl resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 22740] 1424150535.065930 nanosleep({0, 1000000},  <unfinished ...>

This shared/exclusive/reserved range lock code in SQLite is fairly new, it wasn't in 3.6 but it is in 3.8 which we are using. I guess it is buggy.

SQLite is very pleasant to read, with lots of comments. This is a possible location for the issue, in sqlite3PagerBegin():

/* Obtain a RESERVED lock on the database file. If the exFlag parameter
** is true, then immediately upgrade this to an EXCLUSIVE lock. The
** busy-handler callback can be used when upgrading to the EXCLUSIVE
** lock, but not when obtaining the RESERVED lock.
*/
rc = pagerLockDb(pPager, RESERVED_LOCK);
if( rc==SQLITE_OK && exFlag ){
  rc = pager_wait_on_lock(pPager, EXCLUSIVE_LOCK);
}

So it is documented, the busy-handler callback (which waits) cannot be used when obtaining a RESERVED lock.

I attempted to file a bug upstream, which involves subscribing to the sqlite-users mailing list and posting a message to it. Subscription apparently requires moderator approval: I didn't receive any welcome message and then my post was held.

Workaround ideas:

  • Downgrade to SQLite 3.6
  • Reduce apache concurrency
  • Create tables in the temp.* namespace and use "PRAGMA temp_store = MEMORY". This should avoid hitting the lock code in pager.c.

Change 190851 abandoned by Krinkle:
db: Enable ATTR_TIMEOUT for DatabaseSqlite

Reason:
Yeah, this isn't the issue. Continuing to investigate other avenues at T89180.

https://gerrit.wikimedia.org/r/190851

Krinkle added a comment.EditedFeb 18 2015, 12:47 PM

Workaround ideas:

  • Downgrade to SQLite 3.6
  • Reduce apache concurrency
  • Create tables in the temp.* namespace and use "PRAGMA temp_store = MEMORY". This should avoid hitting the lock code in pager.c.

Apache: Each build has its own db, but builds on the same slave do share their Apache vhost. We need parallel builds on a slave. Apache's MaxClients config (now MaxRequestWorkers) can be set to e.g. 1 to let requests be handled serially from a queue.

tmp_store: That sounds promising. We'd add this pragma to either DatabaseSqlite, or in sqlite-specific handling in the CloneDatabase used by our PHPUnit test suite.

I attempted to file a bug upstream, which involves subscribing to the sqlite-users mailing list [..]

From sqlite-users:

In #mediawiki-core on Freenode IRC:

TimStarling: the BEGIN IMMEDIATE workaround would probably work for testing at least
TimStarling: just make all transactions BEGIN IMMEDIATE instead of BEGIN
TimStarling: it'll serialize the transactions instead of having concurrent readers, but that's OK for this test, right?

Unfortunately, it seems this is abstracted by PDO. Unlike DatabaseBase::doBegin (as for MySQL) which does query( 'BEGIN' ), for SQLite we use PDO::beginTransaction, which doesn't take any parameters (https://bugs.php.net/42766).

Symfony's PdoSessionHandler encountered the same:

[..] we have to acquire a reserved lock on the database immediately. Because of https://bugs.php.net/42766 we have to create such a transaction manually which also means we cannot use PDO::commit or PDO::rollback or PDO::inTransaction for SQLite.

$this->pdo->exec('BEGIN IMMEDIATE TRANSACTION');

We'd abandon all use of native beginTransaction(), commit(), rollback() etc since PDO internally keeps track of transactions the same way MediaWiki does (e.g. without talking to the database). We already have logic for this in place in the DatabaseBase class for MySQL (we'd only override doBegin to query send BEGIN IMMEDIATE instead of BEGIN). Though I'm not sure if those are otherwise compatible with SQLite.

Change 191516 had a related patch set uploaded (by Krinkle):
database: Replace use of PDO transaction system for SQLite

https://gerrit.wikimedia.org/r/191516

Patch-For-Review

Change 191517 had a related patch set uploaded (by Krinkle):
database: Use BEGIN IMMEDIATE instead of BEGIN in SQLite

https://gerrit.wikimedia.org/r/191517

Patch-For-Review

Change 191516 merged by jenkins-bot:
database: Avoid use of PDO transaction system for SQLite

https://gerrit.wikimedia.org/r/191516

Change 192476 had a related patch set uploaded (by Aaron Schulz):
Made the SQLite installer make a separate objectcache DB

https://gerrit.wikimedia.org/r/192476

Patch-For-Review

Change 192480 had a related patch set uploaded (by Aaron Schulz):
Allow configuring the transaction locking mode for SQLite

https://gerrit.wikimedia.org/r/192480

Patch-For-Review

Change 192476 merged by jenkins-bot:
Made the SQLite installer make a separate objectcache DB

https://gerrit.wikimedia.org/r/192476

Change 192480 merged by jenkins-bot:
Allow configuring the transaction locking mode for SQLite

https://gerrit.wikimedia.org/r/192480

Change 191517 abandoned by Krinkle:
database: Use BEGIN IMMEDIATE instead of BEGIN in SQLite

https://gerrit.wikimedia.org/r/191517

Krinkle added a subscriber: aaron.Feb 26 2015, 3:55 AM

Change 192480 merged by jenkins-bot:
Allow configuring the transaction locking mode for SQLite
https://gerrit.wikimedia.org/r/192480

This patch by @aaron has been merged and using a separate db in wal mode for objectcache removed all remaining db errors that were consistently occurring in the builds. I'll observe for another few days to see if anything pops up. But this seems resolved for now. Also Ib18347299ea7 allows code to provide a transaction mode to avoid this error where possible.

Krinkle closed this task as Resolved.Feb 26 2015, 3:56 AM
Krinkle claimed this task.
Krinkle removed a project: Patch-For-Review.
Krinkle removed a subscriber: Unknown Object (MLST).
Krinkle reopened this task as Open.EditedMar 12 2015, 5:26 PM

https://integration.wikimedia.org/ci/job/mwext-VisualEditor-qunit-karma/lastBuild/ is still failing.

I imagine @aaron's https://gerrit.wikimedia.org/r/193261 rMW80a3a30d6a90: Made insertMessageBlob use (start|end)Atomic in case DBO_TRX is off was supposed to address this, but it had to be reverted as it caused the failure to be more prominent and affect even more mediawiki-core builds.

mw-dberror.log:

 Thu Mar 12 16:05:25 UTC 2015	integration-slave1405	my_wiki	MessageBlobStore::insertMessageBlob/single-row		5	database is locked
 INSERT OR IGNORE INTO msg_resource (mr_lang,mr_resource,mr_blob,mr_timestamp) VALUES ('en','Base64.js','{}','20150312160525')

mw-www-debug.log

SQL ERROR: database is locked
MessageBlobStore::insertMessageBlob failed to update DB: exception 'DBQueryError' with message 'A database error has occurred. Did you forget to run maintenance/update.php after upgrading?  See: https://www.mediawiki.org/wiki/Manual:Upgrading#Run_the_update_script
Query: INSERT OR IGNORE INTO msg_resource (mr_lang,mr_resource,mr_blob,mr_timestamp) VALUES ('en','Base64.js','{}','20150312160525')
Function: MessageBlobStore::insertMessageBlob/single-row
Error: 5 database is locked
' in /mnt/jenkins-workspace/workspace/mwext-VisualEditor-qunit-karma/src/includes/db/Database.php:1249
Stack trace:
#0 /mnt/jenkins-workspace/workspace/mwext-VisualEditor-qunit-karma/src/includes/db/Database.php(1205): DatabaseBase->reportQueryError('database is loc...', 5, 'INSERT OR IGNOR...', 'MessageBlobStor...', false)
#1 /mnt/jenkins-workspace/workspace/mwext-VisualEditor-qunit-karma/src/includes/db/Database.php(2090): DatabaseBase->query('INSERT OR IGNOR...', 'MessageBlobStor...')
#2 /mnt/jenkins-workspace/workspace/mwext-VisualEditor-qunit-karma/src/includes/db/DatabaseSqlite.php(584): DatabaseBase->insert('msg_resource', Array, 'MessageBlobStor...', Array)
#3 /mnt/jenkins-workspace/workspace/mwext-VisualEditor-qunit-karma/src/includes/MessageBlobStore.php(104): DatabaseSqlite->insert('msg_resource', Array, 'MessageBlobStor...', Array)
#4 /mnt/jenkins-workspace/workspace/mwext-VisualEditor-qunit-karma/src/includes/MessageBlobStore.php(68): MessageBlobStore->insertMessageBlob('Base64.js', Object(ResourceLoaderFileModule), 'en')
#5 /mnt/jenkins-workspace/workspace/mwext-VisualEditor-qunit-karma/src/includes/resourceloader/ResourceLoader.php(875): MessageBlobStore->get(Object(ResourceLoader), Array, 'en')
#6 /mnt/jenkins-workspace/workspace/mwext-VisualEditor-qunit-karma/src/includes/resourceloader/ResourceLoader.php(620): ResourceLoader->makeModuleResponse(Object(ResourceLoaderContext), Array, Array)
#7 /mnt/jenkins-workspace/workspace/mwext-VisualEditor-qunit-karma/src/load.php(44): ResourceLoader->respond(Object(ResourceLoaderContext))
#8 {main}
SQL ERROR (ignored): database is locked
SQL ERROR (ignored): database is locked
SQL ERROR (ignored): database is locked
Unstubbing $wgLang on call of $wgLang::getCode from Message::inLanguage
IP: 127.0.0.1
[MessageCache] MessageCache::load: Loading en... got from global cache
SQL ERROR (ignored): database is locked
SQL ERROR (ignored): database is locked
SQL ERROR (ignored): database is locked
SQL ERROR (ignored): database is locked
Krinkle removed Krinkle as the assignee of this task.Mar 12 2015, 5:27 PM
Krinkle renamed this task from MediaWiki installs in Jenkins frequently fail to access their sqlite database due to locks to Jenkins jobs using MediaWiki frequently fail due to database locks .Mar 18 2015, 4:31 PM
Krinkle removed a subscriber: gerritbot.

Making MediaWiki core work properly with current versions of SQLite is not a Continuous Integration goal. That is for MediaWiki core to prioritise accordingly (eg. T90001 and other undecorated atomic/locking queries in MediaWiki).

Let's resolve this by switching to MySQL instead, which will also help resolve T37912: Jenkins: Set up PHPUnit testing on MySQL backend which is a separate goal for CI that we'll be doing at some point anyway.

Krinkle closed this task as Resolved.Mar 20 2015, 8:20 AM

Underlying MediaWiki core issue tracked as T90001 and T93097.

CI issue resolved for Karma jobs via T57788.

Jdforrester-WMF moved this task from Nominated to Done on the VisualEditor 2014/15 Q3 blockers board.