Summary
The Beta-Cluster-Infrastructure is a farm of wikis we use for experimentation and integration testing. It is updated continuously: new code is deployed every ten minutes and the databases are updated every hour by running MediaWiki's maintenance/update.php. The scheduling and running are driven by Jenkins jobs, whose statuses can be seen on the Beta view:
On top of that, Jenkins emits notification messages to IRC whenever one of the update jobs fails. One of them started failing on July 25th, and this is how the alarm first appeared to me (times are for France, UTC+2):
(wmf-insecte is the Jenkins bot; insecte is French for bug (the animal), and the wmf- prefix identifies it as a Wikimedia Foundation robot).
Clicking on the link gives the output of the update script which eventually fails with:
+ /usr/local/bin/mwscript update.php --wiki=wikifunctionswiki --quick --skip-config-validation
20:31:09 ...wikilambda_zlanguages table already exists.
20:31:09 ...have wlzl_label_primary field in wikilambda_zobject_labels table.
20:31:09 ...have wlzl_return_type field in wikilambda_zobject_labels table.
20:31:09 /usr/local/bin/mwscript: line 27:  1822 Segmentation fault      sudo -u "$MEDIAWIKI_WEB_USER" $PHP "$MEDIAWIKI_DEPLOYMENT_DIR_DIR_USE/multiversion/MWScript.php" "$@"
The important bit is Segmentation fault, which indicates the program (php) performed an invalid memory access and was rightfully killed by the Linux kernel. Looking at the instance's kernel messages via dmesg -T:
[Mon Jul 24 23:33:55 2023] php[28392]: segfault at 7ffe374f5db8 ip 00007f8dc59fc807 sp 00007ffe374f5da0 error 6 in libpcre2-8.so.0.7.1[7f8dc59b9000+5d000]
[Mon Jul 24 23:33:55 2023] Code: ff ff 31 ed e9 74 fb ff ff 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 41 54 55 48 89 d5 53 44 89 c3 48 81 ec 98 52 00 00 <48> 89 7c 24 18 4c 8b a4 24 d0 52 00 00 48 89 74 24 10 48 89 4c 24
[Mon Jul 24 23:33:55 2023] Core dump to |/usr/lib/systemd/systemd-coredump 28392 33 33 11 1690242166 0 php pipe failed
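That kernel line packs the faulting address, the instruction pointer, and the shared object the crash happened in. As a sketch, a small Python parser (the regular expression is my own, not a standard tool) can pull the offending library out of such a line:

```python
import re

# Hypothetical parser for a kernel segfault report as printed by dmesg.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\w+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

line = ("php[28392]: segfault at 7ffe374f5db8 ip 00007f8dc59fc807 "
        "sp 00007ffe374f5da0 error 6 in libpcre2-8.so.0.7.1[7f8dc59b9000+5d000]")

m = SEGFAULT_RE.search(line)
print(m.group("obj"))  # → libpcre2-8.so.0.7.1, the library where the crash occurred
```

Here the crash sits inside libpcre2, PHP's regular expression engine, which is what pointed me at installing its debug symbols later on.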
With those data, I had enough for the most urgent step: filing a task (T342769) to serve as an audit trail and a reference for the future. It is the single most important step whenever I debug an issue: if I have to stop due to time constraints or lack of technical ability, others can step in and continue. It also provides a historical record that can be looked up later, and indeed this specific problem had already been investigated and fully documented a couple of years ago. For PHP segmentation faults, we even have a dedicated project: php-segfault.
With the task filed, I continued the investigation. The previous successful build had:
19:30:18 ...have wlzl_label_primary field in wikilambda_zobject_labels table.
19:30:18 ...have wlzl_return_type field in wikilambda_zobject_labels table.
19:30:18 ❌ Unable to make a page for Z7138: The provided content's label clashes with Object 'Z10138' for the label in 'Z1002'.
19:30:18 ❌ Unable to make a page for Z7139: The provided content's label clashes with Object 'Z10139' for the label in 'Z1002'.
19:30:18 ❌ Unable to make a page for Z7140: The provided content's label clashes with Object 'Z10140' for the label in 'Z1002'.
19:30:18 ...site_stats is populated...done.
The successful build started at 19:20 UTC and the failing one finished at 20:30 UTC, which gives a short time window to investigate. Since the failure seems to happen after updating the WikiLambda MediaWiki extension, I went to inspect the few commits that got merged during that window. I took advantage of Gerrit recording review actions as git notes, notably the exact time a change got submitted and subsequently merged. The process:
Clone the suspect repository:
git clone https://gerrit.wikimedia.org/r/extensions/WikiLambda
cd WikiLambda
Fetch the Gerrit review notes:
git fetch origin refs/notes/review:refs/notes/review
The review notes can be shown below each commit by passing --notes=review to git log or git show; an example for the current HEAD of the repository:
$ git show -q --notes=review
commit c7f8071647a1aeb2cef6b9310ccbf3a87af2755b (HEAD -> master, origin/master, origin/HEAD)
Author: Genoveva Galarza <ggalarzaheredero@wikimedia.org>
Date:   Thu Jul 27 00:34:03 2023 +0200

    Initialize blank function when redirecting to FunctionEditor from DefaultView

    Bug: T342802
    Change-Id: I09d3400db21983ac3176a0bc325dcfe2ddf23238

Notes (review):
    Verified+1: SonarQube Bot <kharlan+sonarqubebot@wikimedia.org>
    Verified+2: jenkins-bot
    Code-Review+2: Jforrester <jforrester@wikimedia.org>
    Submitted-by: jenkins-bot
    Submitted-at: Wed, 26 Jul 2023 22:47:59 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/942026
    Project: mediawiki/extensions/WikiLambda
    Branch: refs/heads/master
This shows the change was approved by Jforrester and entered the repository on Wed, 26 Jul 2023 22:47:59 UTC. Then, to find the commits in that range, I asked git log to list them:
I can then scroll to the commits with a Submitted-at in the time window of 19:20 UTC to 20:30 UTC. I have trimmed the output below to remove most of the review notes, except for the first commit:
$ git log --oneline --since=2023/07/25 --reverse --notes=review --no-merges --topo-order
<scroll>
653ea81a Handle oldid url param to view a particular revision
Notes (review):
    Verified+1: SonarQube Bot <kharlan+sonarqubebot@wikimedia.org>
    Verified+2: jenkins-bot
    Code-Review+2: Jforrester <jforrester@wikimedia.org>
    Submitted-by: jenkins-bot
    Submitted-at: Tue, 25 Jul 2023 19:26:53 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941482
    Project: mediawiki/extensions/WikiLambda
    Branch: refs/heads/master

fe4b0446 AUTHORS: Update for July 2023
Notes (review):
    Submitted-at: Tue, 25 Jul 2023 19:49:43 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941507

73fcb4a4 Update function-schemata sub-module to HEAD (1c01f22)
Notes (review):
    Submitted-at: Tue, 25 Jul 2023 19:59:23 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941384

598f5fcc PageRenderingHandler: Don't make 'read' selected if we're on the edit tab
Notes (review):
    Submitted-at: Tue, 25 Jul 2023 20:16:05 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941456
Or, shown in a Phabricator task in a more human-friendly way:
The Update function-schemata sub-module to HEAD (1c01f22) commit has a short log of the changes it introduces:
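The scrolling step above can also be scripted: once the Submitted-at timestamps are extracted from the review notes, filtering on the time window is trivial. A small Python sketch (with the sha/timestamp pairs transcribed from the notes shown earlier):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime  # parses RFC 2822 dates like Submitted-at

# (sha, Submitted-at) pairs as they appear in the review notes above.
commits = [
    ("653ea81a", "Tue, 25 Jul 2023 19:26:53 +0000"),
    ("fe4b0446", "Tue, 25 Jul 2023 19:49:43 +0000"),
    ("73fcb4a4", "Tue, 25 Jul 2023 19:59:23 +0000"),
    ("598f5fcc", "Tue, 25 Jul 2023 20:16:05 +0000"),
]

# The last good build started at 19:20 UTC; the failing one finished at 20:30 UTC.
start = datetime(2023, 7, 25, 19, 20, tzinfo=timezone.utc)
end = datetime(2023, 7, 25, 20, 30, tzinfo=timezone.utc)

suspects = [sha for sha, ts in commits
            if start <= parsedate_to_datetime(ts) <= end]
print(suspects)  # all four commits land inside the suspect window
```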
Since the update script fails on WikiLambda, I reached out to its developers so they could investigate their code and perhaps find what triggers the issue.
On the PHP side we need a trace. That can be done by configuring the Linux kernel to take a dump of the program before terminating it and store it on disk. It did not quite work, due to a configuration issue on the machine, and on the first attempt we also forgot to ask bash to allow dump generation (ulimit -c unlimited). Drawing on a past debugging session, I instead went to run the command directly under the GNU debugger: gdb.
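The ulimit -c unlimited step corresponds to raising the RLIMIT_CORE resource limit: the kernel only writes a core file on SIGSEGV when the soft limit is non-zero. As a sketch, Python's resource module wraps the same setrlimit(2) call bash uses, so the idea can be shown in a few lines:

```python
import resource

# Core dumps are suppressed whenever the soft RLIMIT_CORE is 0.
# Raise the soft limit to whatever the hard limit permits so the
# kernel may write a core file if this process crashes.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
print(resource.getrlimit(resource.RLIMIT_CORE))
```

Note this only affects the calling process and its children, which is exactly why the ulimit command has to be run in the shell that launches the crashing program.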
There are a few preliminary steps to debug the PHP program. First, one needs to install the debug symbols, which let the debugger map binary addresses back to lines of the original source code. Since the error mentions libpcre2, I also installed its debugging symbols:
$ sudo apt-get -y install php7.4-common-dbgsym php7.4-cli-dbgsym libpcre2-dbg
I then used gdb to start a debugging session:
sudo -s -u www-data gdb --args /usr/bin/php /srv/mediawiki-staging/multiversion/MWScript.php update.php --wiki=wikifunctionswiki --quick --skip-config-validation
gdb>
Then ask gdb to start the program by entering run ⏎ at the prompt. After several minutes, it caught the segmentation fault:
gdb> run
<output>
<output freezes for several minutes while update.php is doing something>

Thread 1 "php" received signal SIGSEGV, Segmentation fault.
0x00007ffff789e807 in pcre2_match_8 (code=0x555555ce1fb0, subject=subject@entry=0x7fffcb410a98 "Z1002", length=length@entry=5, start_offset=start_offset@entry=0, options=0, match_data=match_data@entry=0x555555b023e0, mcontext=0x555555ad5870) at src/pcre2_match.c:6001
6001    src/pcre2_match.c: No such file or directory.
I could not find a debugging symbol package containing src/pcre2_match.c, but that was not needed after all.
To retrieve the stacktrace, enter bt ⏎ at the gdb prompt:
gdb> bt
#0  0x00007ffff789e807 in pcre2_match_8 (code=0x555555ce1fb0, subject=subject@entry=0x7fffcb410a98 "Z1002", length=length@entry=5, start_offset=start_offset@entry=0, options=0, match_data=match_data@entry=0x555555b023e0, mcontext=0x555555ad5870) at src/pcre2_match.c:6001
#1  0x00005555556a3b24 in php_pcre_match_impl (pce=0x7fffe83685a0, subject_str=0x7fffcb410a80, return_value=0x7fffcb44b220, subpats=0x0, global=0, use_flags=<optimized out>, flags=0, start_offset=0) at ./ext/pcre/php_pcre.c:1300
#2  0x00005555556a493b in php_do_pcre_match (execute_data=0x7fffcb44b710, return_value=0x7fffcb44b220, global=0) at ./ext/pcre/php_pcre.c:1149
#3  0x00007ffff216a3cb in tideways_xhprof_execute_internal () from /usr/lib/php/20190902/tideways_xhprof.so
#4  0x000055555587ddee in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at ./Zend/zend_vm_execute.h:1732
#5  execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#6  0x00007ffff2169c89 in tideways_xhprof_execute_ex () from /usr/lib/php/20190902/tideways_xhprof.so
#7  0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at ./Zend/zend_vm_execute.h:1714
#8  execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#9  0x00007ffff2169c89 in tideways_xhprof_execute_ex () from /usr/lib/php/20190902/tideways_xhprof.so
#10 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at ./Zend/zend_vm_execute.h:1714
#11 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#12 0x00007ffff2169c89 in tideways_xhprof_execute_ex () from /usr/lib/php/20190902/tideways_xhprof.so
#13 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at ./Zend/zend_vm_execute.h:1714
#14 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#15 0x00007ffff2169c89 in tideways_xhprof_execute_ex () from /usr/lib/php/20190902/tideways_xhprof.so
#16 0x000055555587c63c in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER () at ./Zend/zend_vm_execute.h:1602
#17 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53535
#18 0x00007ffff2169c89 in tideways_xhprof_execute_ex () from /usr/lib/php/20190902/tideways_xhprof.so
#19 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at ./Zend/zend_vm_execute.h:1714
#20 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#21 0x00007ffff2169c89 in tideways_xhprof_execute_ex () from /usr/lib/php/20190902/tideways_xhprof.so
#22 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at ./Zend/zend_vm_execute.h:1714
#23 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#24 0x00007ffff2169c89 in tideways_xhprof_execute_ex () from /usr/lib/php/20190902/tideways_xhprof.so
#25 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER () at ./Zend/zend_vm_execute.h
…
Which is not that helpful. Thankfully, the PHP project provides a set of macros for gdb which let one map the low-level C frames back to the PHP code being executed. It is provided as /.gdbinit in their source repository, and one should use the version matching the PHP branch being debugged; since we use PHP 7.4, I took the version from the latest 7.4 release (7.4.30 at the time of this writing): https://raw.githubusercontent.com/php/php-src/php-7.4.30/.gdbinit
Download the file to your home directory (ex: /home/hashar/gdbinit) and ask gdb to import it with, for example, source /home/hashar/gdbinit ⏎:
(gdb) source /home/hashar/gdbinit
This provides a few new commands to show PHP Zend values and to generate a very helpful stacktrace (zbacktrace):
(gdb) zbacktrace
[0x7fffcb44b710] preg_match("\7^Z[1-9]\d*$\7u", "Z1002") [internal function]
[0x7fffcb44aba0] Opis\JsonSchema\Validator->validateString(reference, reference, array(0)[0x7fffcb44ac10], array(7)[0x7fffcb44ac20], object[0x7fffcb44ac30], object[0x7fffcb44ac40], object[0x7fffcb44ac50]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:1219
[0x7fffcb44a760] Opis\JsonSchema\Validator->validateProperties(reference, reference, array(0)[0x7fffcb44a7d0], array(7)[0x7fffcb44a7e0], object[0x7fffcb44a7f0], object[0x7fffcb44a800], object[0x7fffcb44a810], NULL) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:943
[0x7fffcb44a4c0] Opis\JsonSchema\Validator->validateKeywords(reference, reference, array(0)[0x7fffcb44a530], array(7)[0x7fffcb44a540], object[0x7fffcb44a550], object[0x7fffcb44a560], object[0x7fffcb44a570]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:519
[0x7fffcb44a310] Opis\JsonSchema\Validator->validateSchema(reference, reference, array(0)[0x7fffcb44a380], array(7)[0x7fffcb44a390], object[0x7fffcb44a3a0], object[0x7fffcb44a3b0], object[0x7fffcb44a3c0]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:332
[0x7fffcb449350] Opis\JsonSchema\Validator->validateConditionals(reference, reference, array(0)[0x7fffcb4493c0], array(7)[0x7fffcb4493d0], object[0x7fffcb4493e0], object[0x7fffcb4493f0], object[0x7fffcb449400]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:703
[0x7fffcb4490b0] Opis\JsonSchema\Validator->validateKeywords(reference, reference, array(0)[0x7fffcb449120], array(7)[0x7fffcb449130], object[0x7fffcb449140], object[0x7fffcb449150], object[0x7fffcb449160]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:523
[0x7fffcb448f00] Opis\JsonSchema\Validator->validateSchema(reference, reference, array(0)[0x7fffcb448f70], array(7)[0x7fffcb448f80], object[0x7fffcb448f90], object[0x7fffcb448fa0], object[0x7fffcb448fb0]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:332
<loop>
The stacktrace shows the code entered an infinite recursion while validating a JSON schema, until the process exhausted its stack and was stopped.
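The failure mode can be mimicked in miniature: mutually recursive validation with no base case grows the stack until the runtime stops it. PHP crashed with SIGSEGV once the C stack overflowed; Python instead guards the same situation with RecursionError. A toy sketch (the function names are invented to mirror the validateSchema/validateKeywords loop in the trace, not the actual Opis code):

```python
def validate_schema(value, depth=0):
    # No terminating condition: mirrors the validateSchema ->
    # validateKeywords cycle seen in the zbacktrace output.
    return validate_keywords(value, depth + 1)

def validate_keywords(value, depth):
    return validate_schema(value, depth + 1)

try:
    validate_schema({"pattern": r"^Z[1-9]\d*$"})
except RecursionError as exc:
    print("stack exhausted:", type(exc).__name__)
```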
The arguments can be further inspected by using printzv, giving it an object reference as argument. For the line:
[0x7fffcb44aba0] Opis\JsonSchema\Validator->validateString(reference, reference, array(0)[0x7fffcb44ac10], array(7)[0x7fffcb44ac20], object[0x7fffcb44ac30], object[0x7fffcb44ac40], object[0x7fffcb44ac50]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:1219
(gdb) printzv 0x7fffcb44ac10
[0x7fffcb44ac10] (refcount=2) array: Hash(0)[0x5555559d7f00]: { }

(gdb) printzv 0x7fffcb44ac20
[0x7fffcb44ac20] (refcount=21) array: Packed(7)[0x7fffcb486118]: {
    [0] 0 => [0x7fffcb445748] (refcount=17) string: Z2K2
    [1] 1 => [0x7fffcb445768] (refcount=18) string: Z4K2
    [2] 2 => [0x7fffcb445788] long: 1
    [3] 3 => [0x7fffcb4457a8] (refcount=15) string: Z3K3
    [4] 4 => [0x7fffcb4457c8] (refcount=10) string: Z12K1
    [5] 5 => [0x7fffcb4457e8] long: 1
    [6] 6 => [0x7fffcb445808] (refcount=6) string: Z11K1
}

(gdb) printzv 0x7fffcb44ac30
[0x7fffcb44ac30] (refcount=22) object(Opis\JsonSchema\Schema) #485450 {
    id => [0x7fffcb40f508] (refcount=3) string: /Z6#
    draft => [0x7fffcb40f518] (refcount=1) string: 07
    internal => [0x7fffcb40f528] (refcount=1) reference: [0x7fffcb6704e8] (refcount=1) array: Hash(1)[0x7fffcb4110e0]: {
        [0] "/Z6#" => [0x7fffcb71d280] (refcount=1) object(stdClass) #480576
}

(gdb) printzv 0x7fffcb44ac40
[0x7fffcb44ac40] (refcount=5) object(stdClass) #483827 Properties Hash(1)[0x7fffcb6aa2a0]: {
    [0] "pattern" => [0x7fffcb67e3c0] (refcount=1) string: ^Z[1-9]\d*$
}

(gdb) printzv 0x7fffcb44ac50
[0x7fffcb44ac50] (refcount=5) object(Opis\JsonSchema\ValidationResult) #486348 {
    maxErrors => [0x7fffcb4393e8] long: 1
    errors => [0x7fffcb4393f8] (refcount=2) array: Hash(0)[0x5555559d7f00]: { }
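As a sanity check, the pattern recovered above is perfectly well behaved on the subject seen in frame #0: the crash came from the unbounded recursion, not from the regex itself. Python's re syntax is close enough to PCRE2 for this pattern:

```python
import re

# Pattern and subject recovered from the gdb session above:
# the ZID format "Z" followed by a number not starting with 0.
pattern = r"^Z[1-9]\d*$"
subject = "Z1002"

print(bool(re.match(pattern, subject)))  # the subject matches the ZID format
```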
Extracting the parameters was enough for the WikiLambda developers to find the immediate root cause: they removed some definitions which triggered the infinite loop and manually ran a script to reload the data into the database. Eventually the Jenkins job managed to update the wiki database:
16:30:26 <wmf-insecte> Project beta-update-databases-eqiad build #69029: FIXED in 10 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/69029/
One problem solved!
References:
If you've submitted patches for MediaWiki core, skins or extensions, you've seen this output in Gerrit:
That is a list of links to each job's console output for a patch that failed verification.
You can see a job that failed at 1m 54s. But jenkins-bot does not post a comment on the patch until all jobs have completed. That means you won't get email/IRC notifications for test failures on your patch until the longest running job completes, in this case, after 14m 57s.[0] ⏳⏱️
With all due respect to xkcd/303... wouldn't it be nice to get notified as soon as a failure occurs, so you can fix your patch earlier to avoid context switching, or losing time during a backport window?
IMHO, yes, and, now it's possible!
Commit a quibble.yaml file (documentation, example patch) to your MediaWiki project[1]:
earlywarning:
  should_vote: 1
  should_comment: 1
The next time that there is a test failure[2] in your repository, you will see a comment from the Early warning bot and a Verified: -1 vote.
Here's an example of how that might look in practice:
[3] So, the bot announces 2 minutes after the patch is updated that there's a problem, along with the output of the failed command. The full report from jenkins-bot arrives 14 minutes later.
For details on how this works, please see the documentation for the Early warning bot. Your feedback and contributions are very welcome on T323750: Provide early feedback when a patch has job failures (feel free to tag T323750 with patches adding quibble.yaml to your project.)
Cheers,
Kosta
[0] An alternative for getting real-time progress is to watch Zuul TV. There is also the excellent work in T214068: Display Zuul status of jobs for a change on Gerrit UI, but this does not generate email/IRC notifications or set a verification label.

Our code review system Gerrit has several caches, the largest of which are persisted on disk. The disk caches offload memory usage and keep the data across restarts. Gerrit being a Java application, the caches are stored in H2 database files, and I recently had to find out how to connect to them in order to inspect their content and reduce their size.
In short: java -Dh2.maxCompactTime=15000 ... would cause the H2 driver to compact the database upon disconnection.
During an upgrade, the Gerrit installation filled up the system root partition entirely (incident report for Gerrit 3.5 upgrade). The reason: two caches occupying 9G and 11G out of the 40G system partition. Those caches hold the file diffs introduced by patchsets and are stored in two files:
| /var/lib/gerrit2/review_site/cache/ | Size (MB) |
|---|---|
| git_file_diff.h2.db | 8376 |
| gerrit_file_diff.h2.db | 11597 |
An easy fix would have been to stop the service, delete all caches, restart the service and let the application refill the cold caches. That is only a short-term solution, though: if the root cause is an issue in the application, we would have to do the same all over again in a few weeks. The large discrepancy also triggered my curiosity, and I had to know the exact root cause to find a definitive fix. There started my journey of debugging.
Looking at the caches through the application shows they are way smaller, at around 150 MB:
Name                 | Entries              | AvgGet | Hit Ratio |
                     |  Mem   Disk    Space |        | Mem  Disk |
---------------------+----------------------+--------+-----------+
D gerrit_file_diff   | 24562 150654 157.36m | 14.9ms | 72%   44% |
D git_file_diff      | 12998 143329 158.06m | 14.8ms |  3%   14% |
                                    ^^^^^^^
One could assume some overhead, but there is no reason for metadata to occupy a hundred times more space than the actual data it describes, especially given each cached item is a file diff, which is more than a few bytes. To retrieve the files locally I compressed them with gzip, and they shrank to a mere 32 MB! That is a strong indication the files are filled mostly with empty data, which suggests the database layer never reclaims no-longer-used blocks. Reclaiming is known as compacting in H2 and vacuuming in SQLite.
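The gzip observation is a handy heuristic: a file dominated by zeroed, never-reclaimed pages compresses dramatically, while genuinely full data does not. A quick Python sketch with synthetic data illustrates the idea (the sizes are made up, not measurements from the Gerrit files):

```python
import gzip

# A buffer standing in for a database file whose freed pages were
# never compacted away: a little live data plus a lot of dead space.
payload = b"some live cache data" * 500   # ~10 KB of "real" data
sparse = payload + b"\x00" * 1_000_000    # plus ~1 MB of zeroed pages

compressed = gzip.compress(sparse)
ratio = len(compressed) / len(sparse)
print(f"compressed to {ratio:.1%} of the original size")
```

A compression ratio this extreme on a supposedly data-dense file is exactly the smoking gun described above.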
Once I retrieved the files, I tried to connect to them using the H2 database jar, and made mistake after mistake due to my complete lack of knowledge on that front:
Version matters
At first I tried with the latest version, h2-2.1.214.jar, and it did not find any data. I eventually found out the underlying storage format changed after version 1.3.176, the one used by Gerrit. I thus had to use an older version, which can be retrieved from the Gerrit.war package.
File parameter which is not a file
I then wanted a SQL dump of the database to inspect it, using the Script Java class: java -cp h2-1.3.176.jar org.h2.tools.Script. It requires a -url option, a JDBC URI containing the database name. Intuitively, I gave the full file name:
java -cp h2-1.3.176.jar org.h2.tools.Script -url jdbc:h2:git_file_diff.h2.db
It returns instantly and generates the dump:
CREATE USER IF NOT EXISTS "" SALT '' HASH '' ADMIN;
Essentially an empty file. Looking at the files on disk, it had created a git_file_diff.h2.db.h2.db file of 24 KB. Lesson learned: the .h2.db suffix must be removed from the URI. I was then able to create the dump using:
java -cp h2-1.3.176.jar org.h2.tools.Script -url jdbc:h2:git_file_diff
Which resulted in a properly sized backup.sql.
Web based admin
I altered the SQL to make it fit SQLite in order to load it in SqliteBrowser (a graphical interface which is very convenient for inspecting such databases). Then I found that invoking the jar directly starts a background process attached to the database and opens my web browser on a web UI: java -jar h2-1.3.176.jar -url jdbc:h2:git_file_diff:
That is very convenient to inspect the file. The caches are key-value stores with a column keeping track of the size of each record. Summing them is how gerrit show-caches finds out the size of the caches (roughly 150 MB for the two diff caches).
The H2 Database feature page mentions empty space is re-used, which is not the case as seen above. The documentation states that when the database connection is closed, the database is compacted, for up to 200 milliseconds. Gerrit establishes the connection on start-up and keeps it until shutdown, at which point the compaction occurs. That is not frequent enough, and the small delay is apparently not sufficient to compact our huge databases. To run a full compaction, several methods are possible:
SHUTDOWN COMPACT: this requests an explicit compaction and terminates the connection. The documentation implies it is not subject to the time limit. It would have required a change to the Gerrit Java code to issue the command.
org.h2.samples.Compact script: H2 provides org.h2.samples.Compact to manually compact a given database. It would need some instrumentation to run against each file after Gerrit is shut down, possibly as a systemd.service ExecStopPost iterating over the files.
JDBC URL parameter MAX_COMPACT_TIME: the 200 milliseconds can be bumped by adding this parameter to the JDBC connection URL (separated by a semicolon ;). Again, it would require a change in the Gerrit Java code to modify the way it connects.
The beauty of open source is that I could read the database's source code. It is hosted at https://github.com/h2database/h2database, with a version-1.3 tag holding a subdirectory for each sub-version. When looking up a setting, the database driver uses the following piece of code (licensed under Mozilla Public License Version 2.0 or Eclipse Public License 1.0):
/**
 * Get the setting for the given key.
 *
 * @param key the key
 * @param defaultValue the default value
 * @return the setting
 */
protected String get(String key, String defaultValue) {
    StringBuilder buff = new StringBuilder("h2.");
    boolean nextUpper = false;
    for (char c : key.toCharArray()) {
        if (c == '_') {
            nextUpper = true;
        } else {
            // Character.toUpperCase / toLowerCase ignores the locale
            buff.append(nextUpper ? Character.toUpperCase(c) : Character.toLowerCase(c));
            nextUpper = false;
        }
    }
    String sysProperty = buff.toString();
    String v = settings.get(key);
    if (v == null) {
        v = Utils.getProperty(sysProperty, defaultValue);
        settings.put(key, v);
    }
    return v;
}
When retrieving the setting MAX_COMPACT_TIME, it forges a camel-case version of the setting name prefixed by h2., which gives h2.maxCompactTime, then looks it up in the JVM system properties and, if set, uses its value.
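To double check which JVM property name a given setting maps to, the transformation can be sketched as a direct Python port of the Java loop above:

```python
def h2_property_name(key: str) -> str:
    """Port of H2's setting-name-to-system-property mapping:
    prefix with "h2." and camel-case on underscores."""
    out = ["h2."]
    next_upper = False
    for c in key:
        if c == "_":
            next_upper = True
        else:
            out.append(c.upper() if next_upper else c.lower())
            next_upper = False
    return "".join(out)

print(h2_property_name("MAX_COMPACT_TIME"))  # → h2.maxCompactTime
```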
Raising the compaction time limit to 15 seconds is thus just a matter of passing -Dh2.maxCompactTime=15000 to java.
7f6215e039 in our Puppet applies the fix and summarizes the above. Once applied, I restarted Gerrit once to have the setting taken into account, and a second time to have it disconnect from the databases with the setting in effect. The results speak for themselves. Here are the largest gains:
| File | Before | After |
|---|---|---|
| approvals.h2.db | 610M | 313M |
| gerrit_file_diff.h2.db | 12G | 527M |
| git_file_diff.h2.db | 8.2G | 532M |
| git_modified_files.h2.db | 899M | 149M |
| git_tags.h2.db | 1.1M | 32K |
| modified_files.h2.db | 905M | 208M |
| oauth_tokens.h2.db | 1.1M | 32K |
| pure_revert.h2.db | 1.1M | 32K |
The gerrit_file_diff and git_file_diff caches went from 12G and 8.2G respectively to roughly 0.5G each, which addresses the issue.
Setting the Java property -Dh2.maxCompactTime=15000 was a straightforward fix which does not require any change to the application code. It also guarantees the databases will keep being compacted each time Gerrit is restarted, so the issue that led to a longer maintenance window than expected should not reappear.
Happy end of year 2022!
References:
MediaWiki developers, have you ever thought, “I wish I could deploy my own code for MediaWiki”? Now you can! More deploys! More fun!
Next time you want to get some code deployed, why not try scap backport?
scap backport is one command that will +2 your patch, deploy to mwdebug and wait for your approval, and finally sync to all servers. You only need to provide the change number or Gerrit URL of your change.
You can run scap backport on patches that have already merged, or re-run scap backport if you decided to cancel in the middle of a run. scap backport can also handle multiple patches at a time. After all the patches have been merged, they’ll be deployed all together. scap backport will confirm that your patches are deployable before merging, and double check no extra patches have sneaked into your deployment.
And if your code didn’t work out, don’t worry, there’s scap backport --revert, which will create a revert patch, send it to Gerrit, and run all steps of scap backport to revert your work. You’re offered the choice to give a reason for the revert, which will show up in the commit message. Just be aware that you'll need to wait for tests to run and your code to merge before it gets synced, so in an emergency this might not be the best option.
You can also list available backports or reverts using the --list flag!
If you'd like some guidance on deploying backports, please sign up here to join us for backport training, which happens once a week on Thursday during the UTC late backport window!
For comparison, the previous way to backport would require the user to enter the following commands on the deployment host:
cd /srv/mediawiki-staging/php-<version>
git status
git fetch
git log -p HEAD..@{u}
git rebase
Then, if there were changes to an extension: git submodule update [extensions|skins]/<name>
Then, log in to mwdebug and run scap pull
Then, back on the deployment host: scap sync-file php-<version>/<path to file> 'Backport: [[gerrit:<change no>|<subject> (<bug no>)]]' for each changed file
List backports
scap backport --list
Backport change(s)
scap backport 1234
scap backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1234
scap backport 1234 5678
Merge but do not sync
scap backport --stop-before-sync 1234
List revertable changes
scap backport --revert --list
Revert change(s)
scap backport --revert 1234
scap backport --revert 1234 5678
That's all for now, and happy backporting!
How are we doing in our strive for operational excellence? Read on to find out!
7 documented incidents in July, and 4 in August (Incident graphs). Read more about past incidents at Incident status on Wikitech.
2022-07-03 shellbox
Impact: For 16 minutes, edits and previews for pages with Score musical notes were slow or unavailable.
2022-07-10 thumbor
Impact: For several days, Thumbor p75 service response times gradually regressed by several seconds.
2022-07-11 FrontendUnavailable cache text
Impact: For 5 minutes, the MediaWiki API cluster in eqiad responded with higher latencies or errors.
2022-07-11 Shellbox and parsoid saturation
Impact: For 13 minutes, the mobileapps service was serving HTTP 503 errors to clients.
2022-07-12 codfw A5 power cycle
Impact: No observed public-facing impact. Internal clean up took some work, e.g. for Ganeti VMs.
2022-07-13 eqsin bandwidth
Impact: For 20 minutes, there was a small increase in error responses for thumbnails served from the Eqsin data center (Singapore).
2022-07-20 eqiad network
Impact: For 10-15 minutes, a portion of wiki traffic from Eqiad-served regions was lost (about 1M uncached requests). For ~30 minutes, Phabricator was unable to access its database.
2022-08-10 cassandra disk space
Impact: During planned downtime, other hosts ran out of space due to accumulating logs. No external impact.
2022-08-10 confd all hosts
Impact: No external impact.
2022-08-16 Beta Cluster 502
Impact: For 7 hours, all Beta Cluster sites were unavailable.
2022-08-16 x2 database replication
Impact: For 36 minutes, errors were noticeable for some editors. Saving edits was unaffected.
Recently completed incident follow-up:
Replace certificate on elastic09 in Beta Cluster
Brian (@bking, WMF Search) noticed during an incident review that an internal server used an expired cert and renewed it in accordance with a documented process.
Localisation cache must be purged after train deploy
@Tchanders (WMF AHT) filed this in 2020 after a recurring issue with stale interface labels. Work led by Ahmon (@dancy, WMF RelEng).
Remember to review and schedule Incident Follow-up work in Phabricator! These are preventive measures and tech debt mitigations written down after an incident is concluded.
Highlight from the "Oldest incident follow-up" query:
The month of July saw 22 new production errors, of which 9 are still open today. In August we encountered 29 new production errors, of which 10 remain open today and have carried over to September.
Take a look at the Wikimedia-production-error workboard and look for tasks that could use your help.
For the month-over-month numbers, refer to the spreadsheet data.
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
How are we doing in our strive for operational excellence? Read on to find out!
There were 6 incidents in June this year. That's double the median of three per month, over the past two years (Incident graphs).
2022-06-01 cloudelastic
Impact: For 41 days, Cloudelastic was missing search results about files from commons.wikimedia.org.
2022-06-10 overload varnish haproxy
Impact: For 3 minutes, wiki traffic was disrupted in multiple regions for cached and logged-in responses.
2022-06-12 appserver latency
Impact: For 30 minutes, wiki backends were intermittently slow or unresponsive, affecting a portion of logged-in requests and uncached page views.
2022-06-16 MariaDB password
Impact: For 2 hours, a current production database password was publicly known. Other measures ensured that no data could be compromised (e.g. firewalls and selective IP grants).
2022-06-21 asw-a2-codfw power
Impact: For 11 minutes, one of the Codfw server racks lost network connectivity. Among the affected servers was an LVS host. Another LVS host in Codfw automatically took over its load balancing responsibility for wiki traffic. During the transition, there was a brief increase in latency for regions served by Codfw (Mexico, and parts of US/Canada).
2022-06-30 asw-a4-codfw power
Impact: For 18 minutes, servers in the A4-codfw rack lost network connectivity. Little to no external impact.
Recently completed incident follow-up:
Audit database usage of GlobalBlocking extension
Filed by Amir (@Ladsgroup) in May following an outage due to DB load from GlobalBlocking. Amir reduced the extension's DB load by 10% by avoiding checks for edit traffic from WMCS and Toolforge, and implemented stats for monitoring GlobalBlocking DB queries going forward.
Reduce Lilypond shellouts from VisualEditor
Filed by Reuven (@RLazarus) and Kunal (@Legoktm) after a shellbox incident. Ed (@Esanders) and Sammy (@TheresNoTime) improved the Score extension's VisualEditor plugin to increase its debounce duration.
Remember to review and schedule Incident Follow-up work in Phabricator! These are preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status on Wikitech.
In June and July (which is almost over), we reported 27 and 25 new production errors respectively. Of these 52 new issues, 27 were closed in the weeks since, and 25 remain unresolved and will carry over to August.
We also addressed 25 stagnant problems carried over from previous months, so the workboard overall remains at exactly 299 unresolved production errors.
Take a look at the Wikimedia-production-error workboard and look for tasks that could use your help.
For the month-over-month numbers, refer to the spreadsheet data.
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
"Mr. Vice President. No numbers, no bubbles."
— 🔴🟠🟡🟢🔵🟣
Release Engineering's "GitLab-a-thon" sprint for May 10th-24th (roughly) focused on the mechanics of migrating a Wikimedia service to GitLab, setting up a CI pipeline, building container images from that service, and publishing images to the Wikimedia registry. We selected the Blubber project as a good candidate for experimentation:
We evaluated build mechanisms including GitLab's suggested docker-in-docker, Kaniko, Podman, and BuildKit:
We ultimately landed on BuildKit as the least constraining for future options, and the most in line with features we'd like to offer.
We explored a range of options for building and publishing, including variations on:
We eventually landed on the latter, and work is well underway on implementation: T308501: Authenticate trusted runners for registry access against GitLab using temporary JSON Web Token
Other work included implementing CI for Blubber on GitLab (T307534), improvements to user-facing documentation (T307535, T307538), enforcing the allowlist for container images in GitLab CI (T291978), experimentation with the GitLab Container Registry (T307537), and extensive discussions with ServiceOps on GitLab infrastructure.
How’d we do in our strive for operational excellence last month? Read on to find out!
By golly, we've had quite the month! 10 documented incidents, which is more than three times the two-year median of 3. The last time we experienced ten or more incidents in one month was June 2019, when we had eleven (Incident graphs, Excellence monthly of June 2019).
I'd like to draw your attention to something positive. As you read the below, take note of incidents that did not impact public services, and did not have lasting impact or data loss. For example, the Apache incident benefited from PyBal's automatic health-based depooling. The deployment server incident recovered without loss thanks to Bacula. The Etcd incident impact was limited by serving stale data. And, the Hadoop incident recovered by resuming from Kafka right where it left off.
2022-05-01 etcd
Impact: For 2 hours, Conftool could not sync Etcd data between our core data centers. Puppet and some other internal services were unavailable or out of sync. The issue was isolated, with no impact on public services.
2022-05-02 deployment server
Impact: For 4 hours, we could not update or deploy MediaWiki and other services, due to corruption on the active deployment server. No impact on public services.
2022-05-05 site outage
Impact: For 20 minutes, all wikis were unreachable for logged-in users and non-cached pages. This was due to a GlobalBlocks schema change causing significant slowdown in a frequent database query.
2022-05-09 Codfw confctl
Impact: For 5 minutes, all web traffic routed to Codfw received error responses. This affected central USA and South America (local time after midnight). The cause was human error and lack of CLI parameter validation.
2022-05-09 exim-bdat-errors
Impact: Over five days, about 14,000 incoming emails from Gmail users to wikimedia.org were rejected and returned to sender.
2022-05-21 varnish cache busting
Impact: For 2 minutes, all wikis and services behind our CDN were unavailable to all users.
2022-05-24 failed Apache restart
Impact: For 35 minutes, numerous internal services that use Apache on the backend were down. This included Kibana (logstash) and Matomo (piwik). For 20 of those minutes, there was also reduced MediaWiki server capacity, but no measurable end-user impact for wiki traffic.
2022-05-25 de.wikipedia.org
Impact: For 6 minutes, a portion of logged-in users and non-cached pages experienced a slower response or an error. This was due to increased load on one of the databases.
2022-05-26 m1 database hardware
Impact: For 12 minutes, internal services hosted on the m1 database (e.g. Etherpad) were unavailable or at reduced capacity.
2022-05-31 Analytics Hadoop failure
Impact: For 1 hour, all HDFS writes and reads were failing. After recovery, ingestion from Kafka resumed and caught up. No data loss or other lasting impact on the Data Lake.
Recently completed incident follow-up:
Invalid confctl selector should either error out or select nothing
Filed by Amir (@Ladsgroup) after the confctl incident this past month. Giuseppe (@Joe) implemented CLI parameter validation to prevent human error from causing a similar outage in the future.
Backup opensearch dashboards data
Filed back in 2019 by Filippo (@fgiunchedi). The OpenSearch homepage dashboard (at logstash.wikimedia.org) was accidentally deleted last month. Bryan (@bd808) tracked down its content and re-created it. Cole (@colewhite) and Jaime (@jcrespo) worked out a strategy and set up automated backups going forward.
Remember to review and schedule Incident Follow-up work in Phabricator! These are preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status on Wikitech.
In May we discovered 28 new production errors, of which 20 remain unresolved and have come with us to June.
Last month, the workboard totalled 292 tasks still open from prior months. Since the last edition, we completed 11 tasks from previous months, gained 11 additional errors from May (some of May was already counted in last month's edition), and have 7 fresh errors in the current month of June. As of today, the workboard houses 299 open production error tasks (spreadsheet, phab report).
Take a look at the workboard and look for tasks that could use your help.
View Workboard
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
How’d we do in our strive for operational excellence last month? Read on to find out!
Last month we experienced 2 (public) incidents. This is below the three-year median of 3 incidents a month (Incident graphs).
2022-04-06 esams network
Impact: For 30 minutes, wikis were slow or unreachable for a portion of clients to the Esams data center. Esams is one of two DCs primarily serving Europe, Middle East, and Africa.
2022-04-26 cr2-eqord down
Impact: No external impact. Internally, for 2 hours we were unable to access our Eqord routers by any means. This was due to a fiber cut on a redundant link to Eqiad, which then coincided with planned vendor maintenance on the links to Ulsfo and Eqiad. See also Network design.
Remember to review and schedule Incident Follow-up work in Phabricator, which are preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status on Wikitech.
Recently resolved incident follow-up:
Reduce mysql grants for wikiadmin scripts
Filed in 2020 after the wikidata drop-table incident (details). Carried out over the last six months by Amir @Ladsgroup (SRE Data Persistence).
Improve reliability of Toolforge k8s cron jobs and Re-enable CronJobControllerV2
Filed earlier this week after a Toolforge incident and carried out by Taavi @Majavah.
During the month of April we reported 27 new production errors. Of these new errors, we resolved 14, and the remaining 13 are still open and have carried over to May.
Last month, the workboard totalled 298 unresolved error reports. Of these older reports that carried over from previous months, 16 were resolved. Most of these were reports from before 2019.
The new total, including some tasks for the current month of May, is 292. A slight decrease! (spreadsheet).
Take a look at the workboard and look for tasks that could use your help.
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
How’d we do in our strive for operational excellence last month? Read on to find out!
We've had quite the month, with 8 documented incidents. That's more than double the two-year median of three a month (Incident graphs).
2022-03-01 ulsfo network
Impact: For 20 minutes, clients normally routed to Ulsfo were unable to reach our projects. This includes New Zealand, parts of Canada, and the United States west coast.
2022-03-04 esams availability banner sampling
Impact: For 1.5 hours, all wikis were largely unreachable from Europe (via Esams), with more limited impact across the globe via other data centers as well.
2022-03-06 wdqs-categories
Impact: For 1.5 hours, some requests to the public Wikidata Query Service API were sporadically blocked.
2022-03-10 site availability
Impact: For 12 min, all wikis were unreachable to logged-in users, and to unregistered users trying to access uncached content.
2022-03-27 api
Impact: For ~4 hours, in three segments of 1-2 hours each over two days, there were higher levels of failed or slow MediaWiki API requests.
2022-03-27 wdqs outage
Impact: For 30 minutes, all WDQS queries failed due to an internal deadlock.
2022-03-29 network
Impact: For approximately 5 minutes, Wikipedia and other Wikimedia sites were slow or inaccessible for many users, mostly in Europe/Africa/Asia. (Details not public at this time.)
2022-03-31 api errors
Impact: For 22 minutes, API server and app server availability were slightly decreased (~0.1% errors, all for s7-hosted wikis such as Spanish Wikipedia), and the latency of API servers was elevated as well.
Remember to review and schedule Incident Follow-up (Sustainability) in Phabricator, which are preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status on Wikitech. Some recently completed sustainability work:
Add linecard diversity to router-to-router interconnect at Codfw
Filed by Chris @CDanis (SRE Infra) in 2020 after an incident where all hosts in the Codfw data center lost connectivity at once. Completed by Arzhel @ayounsi and Cathal @cmooney (SRE Infra), and @Papaul (DC Ops), including in Esams where the same issue existed.
Expand parser tests to cover language conversion variants in table-of-contents output
Suggested and carried out by @cscott (Parsoid) after reviewing an incident in November. The TOC on wikis that rely on the LanguageConverter service (such as Chinese Wikipedia) was no longer localized.
Fix unquoted URL parameters in Icinga health checks
Suggested by Riccardo @Volans (SRE Infra) in response to an early warning signal for TLS certificate expiry. He realized that automated checks for a related cluster were still claiming to be in good health, when they in fact should have been firing a similar warning. Carried out by Filippo @fgiunchedi and Daniel @Dzahn.
Provide automation to quickly show replication status when primary is down
Filed in April by Jaime (SRE Data Persistence), carried out by John @jbond and Amir @Ladsgroup.
Since the last edition, we resolved 24 of the 301 unresolved errors that carried over from previous months.
In March, we reported 54 new production errors. That's quite high compared to the twenty-odd reports we find most months. Of these, 17 remain open today, a month later.
In the month of April, so far, we have reported 20 new errors, of which 17 also remain open today.
The production error workboard once again adds up to exactly 298 open tasks (spreadsheet).
Take a look at the workboard and look for tasks that could use your help.
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
Developers should own the process of putting their code into production. They should decide when to deploy, monitor their deployment, and make decisions about rollback.
But that’s not how we work at Wikimedia today, and we on Release Engineering aren’t sure how to get there, so we’ve decided to experiment.
Typically a deployment takes us a full week to complete—the week of March 21st, 2022, we deployed MediaWiki four times.
We called that week 🚂🧪Trainsperiment Week.
MediaWiki's mainline branch is changing constantly, but we deploy MediaWiki weekly (kind of). We keep stats that measure how far our main branch is from production.
The trainsperiment changed our deployment frequency, which affected all the other metrics, too. Faster deployment means smaller batch size, and shorter change lead time.
The number that we knew would change during trainsperiment week was change lead time—the time from merge to deploy. If I merge a change, then a minute later I deploy it, that change’s lead time is one minute.
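As a toy illustration of that definition (the timestamps below are invented; the real numbers come from the train-stats data):

```python
from datetime import datetime, timedelta

# Hypothetical merge/deploy timestamps for a few patches riding one train.
patches = [
    {"merged": datetime(2022, 3, 14, 9, 0),   "deployed": datetime(2022, 3, 17, 20, 0)},
    {"merged": datetime(2022, 3, 15, 13, 30), "deployed": datetime(2022, 3, 17, 20, 0)},
    {"merged": datetime(2022, 3, 17, 19, 59), "deployed": datetime(2022, 3, 17, 20, 0)},
]

def lead_time(patch) -> timedelta:
    """Change lead time: time from merge to deploy, per the definition above."""
    return patch["deployed"] - patch["merged"]

times = [lead_time(p) for p in patches]
average = sum(times, timedelta()) / len(times)
print(average)  # average lead time across the train
```

The last patch merged a minute before deploy has a one-minute lead time, exactly as in the example above.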
This chart shows the average lead time of all patches in a given train:
The chart below compares a typical week (1.38.0-wmf.1) to trainsperiment week (1.38.0-wmf.2, wmf.3, and wmf.4). Each dot is a change in a particular version—fewer dots mean fewer changes.
During trainsperiment week, we deployed faster. Each deployment was smaller, and the lead time of each patch in a release was shorter.
Here’s the same data on a logarithmic scale. During trainsperiment week there were only a few hours between trains, so the lead time could be measured in hours, not days!
At the end of the week, we asked for feedback via the Wikitech-l mailing list. We collected comments from the mediawiki.org talk page and the summaries of candid conversations.
A small number of people took the time to respond to the survey—20 people answered our questions.
Almost everyone who took the survey seemed satisfied with communication. Most were satisfied with the experiment overall.
There were concerns on the talk page and in the survey responses about testing. Testing felt time-crunched, and everyone was worried about the time pressure on our Quality and Test Engineering Team (QTE).
Less than half of our respondents felt that the Trainsperiment positively impacted their work, with one respondent strongly disagreeing that there was a positive impact.
Most people were neutral about the impact of this experiment on their work.
The person who felt that there was a negative impact was concerned about the lack of time allotted for testing—they urged us to rethink testing if we wanted to try this again.
The survey contained free-form prompts for feedback. Below is a smattering of representative responses. Most of the comments below are amalgamations and simplifications, but the reactions in quotes are verbatim.
What should RelEng have done differently?
What would you need to change if we did this every week?
Other Feedback
We talked individually to people who had concerns about the experiment on Slack and IRC, in meetings, in the survey feedback, and on the talk page.
People were concerned about shortening the time for review. This is understandable given that we shortened a 168-hour process to a 12-hour process.
Our QA process takes time. Our overburdened principal engineers take time to review code going live on a weekly basis. Due to some esoteric details, even our CI system gives us more confidence the more time it has: it was possible for MediaWiki to break compatibility with an extension without anyone being alerted.
We have come to rely on the weekly cadence to make a careful release, and a faster process would mean rethinking our process pipeline to production.
The weekly train hides a lot of technical debt—it’s a giant feature flag and the missing testing environment rolled into one. It goes out every week (mostly), and Release Engineering spends about 20% of its time monitoring the release.
During trainsperiment week, we spent 100% of our time deploying—that’s not sustainable for our team.
We surfaced process pain points with this experiment, which was a success. We added to the already overlarge burdens of our principal engineers and quality engineers, which was a failure.
But this isn’t the end of the experiments. We endeavor to bring developers and production closer together—preferably with us standing back a healthy distance. If you’d like to help us get there—get in touch.
Thanks to @kchapman, @brennen, and @Krinkle for reading earlier drafts of this post and offering their feedback.
Over here in the Release-Engineering-Team, Train Deployment is usually a rotating duty. We've written about it before, so I won't go into the exact process, but I want to tell you something new about it.
It's awful, incredibly stressful, and a bit lonely.
And last week we ran an experiment where we endeavored to perform the full train cycle four times in a single week... What is wrong with us? (Okay. I need to own this. It was technically my idea.) So what is wrong with me? Why did I wish this on my team? Why did everyone agree to it?
First I think it's important to portray (and perhaps with a little more color) how terrible running the train can be.
Here's a little chugga-choo with a captain and a crew. Would the llama like a ride? Llama Llama tries to hide.
―Llama Llama, Llama Llama Misses Mama
At the outset of many a week I have wondered why, when the kids are safely in childcare and I'm finally in a quiet house well fed and preparing a nice hot shower to not frantically use but actually enjoy, my shoulder is cramping and there's a strange buzzing ballooning in my abdomen.
Am I getting sick? Did I forget something? This should be nice. Why can't I have nice things? Why... Oh. Yes. Right. I'm on train this week.
Train begins in the body before it terrorizes the mind, and I'm not the only one who feels that way.
A week of periodic drudgery which at any moment threatens to tip into the realm of waking nightmare.
―Stoic yet Hapless Conductor
Aptly put. The nightmare is anything from a tiny visual regression to taking some of the largest sites on the Internet down completely.
Giving a presentation but you have no idea what the slides are.
―Bravely Befuddled Conductor
Yes. There's no visibility into what we are deploying. It's a week's worth of changes, other teams' changes, changes from teams with different workflows and development cycles, all touching hundreds of different codebases. The changes have gone through review, they've been hammered by automated tests, and yet we are still too far removed from them to understand what might happen when they're exposed to real world conditions.
It's like throwing a penny into a well, a well of snakes, bureaucratic snakes that hate pennies, and they start shouting at you to fill out oddly specific sounding forms of which you have none.
―Lost Soul been 'round these parts
Kafkaesque.
When under the stress and threat of the aforementioned nightmare, it's difficult to think straight. But we have to. We have to parse and investigate intricate stack traces, run git blames on the deployment server, navigate our bug reporting forms and try to recall which teams are responsible for which parts of the aggregate MediaWiki codebase we've put together which itself is highly specific to WMF's production installation and really only becomes that long after changes merge to main branches of the constituent codebases.
We have to exercise clear judgement and make decisive calls of whether to rollback partially (previous group) or completely (all groups to previous version). We may have to halt everything and start hollering in IRC, Slack channels, mailing lists, to get the signal to the right folks (wonderful and gracious folks) that no more code changes will be deployed until what we're seeing is dealt with. We have to play the bad guys and gals to get the train back on track.
Study after study shows that having a good support network constitutes the single most powerful protection against becoming traumatized. Safety and terror are incompatible. When we are terrified, nothing calms us down like a reassuring voice or the firm embrace of someone we trust.
―Bessel Van Der Kolk, M.D., The Body Keeps the Score
Four trains in a single week and everyone in Release Engineering is onboard. What could possibly be better about that?
Well, there is safety in numbers, as they say, and not in some Darwinistic way where most of us will be picked off by the train demons and the others will somehow take solace in their incidental fitness, but in a way where we are mutually trusting, supportive, and feeling collectively resourced enough to do the needful with aplomb.
So we set up video meetings for all scheduled deployment windows, and had synchronous handoffs between our European colleagues and our North American ones. We welcomed folks from other teams into our deployments to show them the good, the bad, and the ugly of how their code gets its final send-off 'round the bend and into the setting hot fusion reaction that is production. We found and fixed longstanding and mysterious bugs in our tooling. We deployed four full trains in a single week.
And it felt markedly different.
One of those barn raising projects you read about where everybody pushes the walls up en masse.
―Our Stoic Now Softened but Still Sardonic Conductor
Yes! Lonely and unwitnessed work is de facto drudgery. Toiling safely together we have a greater chance at staving off the stress and really feeling the accomplishment.
Giving a presentation with your friends and everyone contributes one slide.
―Our No Longer Befuddled but Simply Brave Conductor
Many hands make light work!
It was like throwing a handful of pennies into a well, a well of snakes, still bureaucratic and shouty, oh hey but my friends are here and they remind me these are just stack traces, words on a screen, and my friends happen to be great at filling out forms.
―Our Once Lost Now Found Conductor
When no one person is overwhelmed or unsafe, we all think and act more clearly.
So how should what we've learned during our Trainsperiment Week inform our future deployment strategies and processes? How should train deployments change?
The known hypothesis we wanted to test by performing this experiment was in essence:
I don't know if we've proven that yet, but we got an inkling that yes, the smaller subsequent deployments of the week did seem to go more smoothly. One week, however, even a week of four deployment cycles, is not a large enough sample to say definitively whether running the train more frequently will result in safer, more frequent deployments with fewer failures.
What was not apparent until we did our retrospective, however, is that it simply felt easier to do deployments together. It was still a kind of drudgery, but it was not abjectly terrible.
My personal takeaway is that a conductor who feels resourced and safe is the basis for all other improvements to the deployment process, and I want conductors to not only have tooling that works reliably with actionable logging at their disposal, but to feel a sense of community there with them when they're pushing the buttons. I want them to feel that the hard calls of whether or not to halt everything and rollback are not just their calls but shared in the moment among numerous people with intimate knowledge of the overall MediaWiki software ecosystem.
Tooling, particularly around error reporting and escalation, is a barrier to entry for sure. Once we've made sufficient improvements there, we need to get that tooling into other people's hands and show them that this process does not have to be so terrifying. And I think we're on the right track here with increased frequency and smaller sets of changes, but we can't lose sight of the human/social element and the foundational basis of safety.
More than anything else, I want wider participation in the train deployment process by engineers in the entire organization along with volunteers.
Thanks to @thcipriani for reading my drafts and unblocking me from myself a number of times. Thanks to @jeena and @brennen for the inspirational analogies.
I'll start with a bit of general administrivia. First, our migration of Wikimedia code review & CI to GitLab continues, and we're mindful that people could use regular updates on progress. Second, I need to think through some stuff about the project, and doing that in writing is helpful for all involved. I'm going to try writing occasional blog entries here for both purposes.
Now on to the main topic of this post: Access control for groups and projects on the Wikimedia GitLab instance.
The tl;dr: We've been modeling access to things on GitLab by using groups under /people to contain individual users and then granting those groups access to things under /repos. This has been tricky to explain and doesn't work as well at a technical level as we'd hoped, so we're mostly scrapping the distinction, and moving control of project access to individual memberships in groups under /repos. This should be easier to think about, simpler to manage, and seems like it will suit our needs better. Read on for the nitty-gritty detail.
(Thanks to @Dzahn, @Majavah, @bd808, @AntiCompositeNumber, and @thcipriani for helping me think through the issues underlying this post.)
During the GitLab consultation, when we were working on building up a model of how we'd use GitLab for Wikimedia projects, we wrote up a draft policy for managing users and their access to projects.
GitLab supports Groups. GitLab groups are similar to GitHub's concept of organizations, although the specifics differ. Groups can contain:
We've since changed the original draft policy in some small ways - in particular, we decided to move most projects into a top-level /repos group in order to offer shared CI runners (see T292094). You can read the policy we landed on at the latest revision of GitLab/Policy on mediawiki.org.
The basic idea was that we would separate groups out into:
Groups in /people could then be given access to projects under /repos.
Our hope was that this would let us decouple the management of groups of humans from the individual projects they work on, and ease onboarding for new contributors. A new member of the WMF Release Engineering team, for example, could be added to a single group and then have access to all the things they need to do their job.
We intended for most /people groups to be owned by their members, who would in turn have ownership-level access to their projects under /repos, allowing for contributors to a project to manage access and invite new contributors.
As a concrete example:
I've been proceeding under this plan as people request the creation of GitLab project groups, but there turn out to be some problems.
First, it doesn't seem like permission inheritance for nested groups with other groups as members works the way you'd expect & hope: See T300939 - "GitLab group permissions are not inherited by sub-groups for groups of users invited to the parent repo".
Second, users have concerns about equity of access and tight coupling of things like employment with a specific organization to project access. We didn't have any intention of modeling any group of users as second-class citizens within this scheme, but it seems to create the impression of one all the same. It's also striking that the set of projects people work on just isn't that cleanly mapped to any particular organizational structure. Once you've been a technical contributor for a while, you've almost certainly collected responsibilities that no org chart reflects accurately.
Finally, and maybe most importantly, this is a complicated way to do things. People have a hard time thinking about it, and it requires a lot of explanation. That seems bad for an abstraction that we'd like to be basically self-serve for most users.
Mostly, my plan is to use groups closer to how they seem to be designed:
There are some unanswered questions here, but I plan to redraft the policy doc, move existing project layouts to this scheme, and start creating new project groups on this basis in the coming week or so.
My main philosophical takeaway here is that I work with a bunch of anarchists, and it's always best to plan accordingly.
Originally, one of our goals for this migration was avoiding a repeat of the weird, nested morass that is our current set of Gerrit permissions. While it would be a good idea to keep the structure of things on GitLab flatter and easier to think about, I'm no longer that worried about it. Some of the complexity is inherent to any large set of projects and contributors; some of it just reflects a long-lived technical culture that's emergent and largely self-governing, tendencies that nearly always resist well-intentioned efforts to rationalize and map structure to things like official organizational layout.
If you’ve ever experienced the pride of seeing your name on MediaWiki's contributor list, you've been involved in our deployment process (whether you knew it or not).
The Wikimedia deployment process — 🚂🌈 The Train — pushed over 13,000 developer changes to production in 2021. That's more than a change per hour for every single hour of the year—24 hours per day, seven days per week!
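That back-of-the-envelope rate is easy to verify:

```python
changes = 13_000          # changes deployed in 2021 (from the post)
hours_in_year = 365 * 24  # 8760 hours
rate = changes / hours_in_year
print(f"{rate:.2f} changes per hour")  # comfortably more than one per hour
```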
As you deploy more software to production, you may begin to wonder: is anything I've been working on going to be deployed this week? What's the status of production? Where can I find data about any of this?
Bryan Davis (@bd808) created the versions toolforge tool in 2017. The versions tool is a dashboard showing the current status of Wikimedia's more than 900 wikis.
Other places to find info about the current deployment:
There's an aphorism in management: you can't manage what you can't measure. For years the train chugged along steadily, but it's only recently that we've begun to collect data on its chuggings.
The train stats project started in early 2021 and contains train data going back to March 2016.
Now we're able to talk about our deployments informed by the data. Release-Engineering-Team partnered with Research late last year to explore the data we have.
We're able to see metrics like lead time and cycle time:
We measured product delivery lead time as the time it takes to go from code committed to code running in production.
– Accelerate (pg. 14, 15)
Our lead time — the time to go from commit in mainline to production — is always less than a week. In the scatter plots above, we can see some evidence of work-life balance: not many patches land two days before deployment — that's the weekend!
For the software delivery process, the most important global metric is cycle time. This is the time between deciding that a feature needs to be implemented and having that feature released to users.
– Continuous Delivery (pg 138)
Our cycle time — the time between a patch requesting code review and its deployment — varies. Some trains have massive outliers. In the chart above, for example, you can see one train that had a patch that was five years old!
It is now possible to see what we on Release Engineering had long suspected: the number of patches for each train has slowly been ticking up over time:
Also shown above: as the number of patches continues to rise, the number of comments per patch — that is, code-review comments per patch — has dropped.
The data also show that the average number of lines of code per patch is slightly going up:
The train-stats repo has data on blockers and delays. Most trains have a small number of blockers and deploy without fanfare. Other trains are plagued by problems that explode into an endless number of blockers — cascading into a series of psychological torments, haunting deployers like the train-equivalent of ringwraiths. Trainwraiths, let’s say.
The shape of the histogram of this data shows that blockers per train follows a power law — most trains have a few blockers:
Surprisingly, most of our blockers happen before we even start a train: bugs from the previous week that we couldn't justify halting everything to fix, but that need to be fixed before we lay down more code on top.
The data also let us correlate train characteristics with failure signals. Here we see that the number of patches (“patches”) per train (trending ↑) positively correlates with blockers, and lines of code review (“loc_per_train_bug”) per patch (trending ↓) negatively correlates with blockers — more patches and less code review are both correlated with more blockers:
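To make that kind of check concrete, Pearson's correlation coefficient is the standard way to quantify such a relationship. The sketch below uses invented numbers (not our actual train data) purely to illustrate the computation:

```python
# Toy illustration only: the data points are invented, not real train stats.
# It shows a positive r between patch count and blockers, and a negative r
# between review comments per patch and blockers.

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

patches  = [150, 300, 450, 600, 800, 950]   # patches per train (invented)
comments = [2.9, 2.6, 2.3, 2.0, 1.7, 1.5]   # review comments per patch (invented)
blockers = [1, 2, 2, 4, 6, 8]               # blockers per train (invented)

print(round(pearson_r(patches, blockers), 2))    # strongly positive
print(round(pearson_r(comments, blockers), 2))   # strongly negative
```

With numbers shaped like these, the first correlation comes out strongly positive and the second strongly negative, mirroring the trends described above.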
Contrast this with Facebook's view of train risk. In a 2016 paper entitled "Development and Deployment at Facebook," Facebook's researchers documented how their Release Engineering team quantified deployment risk:
Inputs affecting the amount of oversight exercised over new code are the size of the change and the amount of discussion about it during code reviews; higher levels for either of these indicate higher risk.
– Development and Deployment at Facebook (emphasis added)
In other words, to Facebook, more code, and more discussion about code, means riskier code. Our preliminary data seem to only partially support this: more code is riskier, but more discussion seems to lessen our risk.
This train data is open for anyone to explore. You can download the sqlite database that contains all train data from our gitlab repo, or play with it live on our datasette install.
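If you'd rather explore the download locally, Python's built-in sqlite3 module is enough. The sketch below builds a tiny stand-in database in memory because the table and column names here are assumptions for illustration; inspect the real train-stats schema (e.g. on the datasette install) before querying the actual file:

```python
import sqlite3

# Illustrative only: a made-up stand-in schema, built in memory.
# The real train-stats database differs; check its schema first.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE train (
        version  TEXT,
        patches  INTEGER,
        blockers INTEGER
    )
""")
conn.executemany(
    "INSERT INTO train VALUES (?, ?, ?)",
    [
        ("1.38.0-wmf.1", 410, 2),
        ("1.38.0-wmf.2", 520, 5),
        ("1.38.0-wmf.3", 380, 1),
    ],
)

# Average patches per train, and the train with the most blockers.
avg_patches, = conn.execute("SELECT AVG(patches) FROM train").fetchone()
worst = conn.execute(
    "SELECT version FROM train ORDER BY blockers DESC LIMIT 1"
).fetchone()[0]
print(avg_patches, worst)
```

The same style of query works unchanged against the downloaded sqlite file once you substitute the real table and column names.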
There are a few Jupyter notebooks that explore the data:
An audacious dream for the future of this data is to build a model to quantify exactly how risky a patchset is. We keep data on everything from bugs to rollbacks. Perhaps in future a model will help us roll out code faster and safer.
Thanks to @Miriam, @bd808, and @brennen for reading early drafts of this post: it'd be wronger without their input 💖.
How’d we do in our strive for operational excellence last month? Read on to find out!
3 documented incidents last month.
2022-02-01 ulsfo network
Impact: For 3 minutes, clients served by the ulsfo POP were not able to contribute or display un-cached pages.
2022-02-22 wdqs updater codfw
Impact: For 2 hours, WDQS updates failed to be processed. Most bots and tools were unable to edit Wikidata during this time.
2022-02-22 vrts
Impact: For 12 hours, incoming emails to a specific recently created VRTS queue were not processed with senders receiving a bounce with an SMTP 550 Error.
Figure from Incident graphs.
Remember to review and schedule Incident Follow-up work in Phabricator, which are preventive measures and tech debt mitigations written down after an incident is concluded. Read about past incidents at Incident status on Wikitech.
Recently conducted incident follow-up:
Create a dashboard for Prometheus metrics about health of Prometheus itself.
Pitched by CDanis after an April 2019 incident, carried out by Filippo (@fgiunchedi).
Improve wording around AbuseFilter messages about throttling functionality.
Originally filed in 2018. This came up last month during an incident where the wording may've led to a misunderstanding. Now resolved by @Daimona.
Exclude restart procedure from automated Elasticsearch provisioning.
There can be too much automation! Filed after an incident last September. Fixed by @RKemper.
Take a look at the workboard and look for tasks that could use your help.
I skip breakdowns most months as each breakdown has its flaws. However, I hear people find them useful, so I'll try to do them from time to time, with my caveats noted. The last breakdown was in the December edition, which focussed on throughput during a typical month. It's important to recognise that neither high nor low throughput is per se good or bad. It's good when issues are detected, reported, and triaged correctly. It's also good if a team's components are stable and don't produce any errors. A report may be found to be invalid or a duplicate, which is sometimes only determined a few weeks later.
The "after six months" breakdown below takes more of that into consideration by looking at what's still on the table after six months (tasks up to Sept 2021). This may be considered "fairer" in some sense, although it has the drawback of suffering from hindsight bias, and of possibly not highlighting the most current or urgent areas.
WMF Product:
WMF Tech:
WMDE:
Other:
In February, we reported 25 new production errors. Of those, 13 have since been resolved, and 12 remain open as of today (two weeks into the following month). We also resolved 22 errors that remained open from previous months. The overall workboard has grown slightly to a total of 301 outstanding error reports.
For the month-over-month numbers, refer to the spreadsheet data.
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
How’d we do in our strive for operational excellence last month? Read on to find out!
There were no incidents this January. Phew! Remember to review and schedule Incident Follow-up work in Phabricator. These are preventive measures and tech debt mitigations written down after an incident is concluded. Read about past incidents at Incident status on Wikitech.
During 2021, I compared us to the median of 4 incidents per month, as measured over the two years prior (2019-2020).
I'm glad to announce our median has lowered to 3 per month over the past two years (2020-2021). For more plots and numbers about our incident documentation, refer to Incident stats.
Since the previous edition, we resolved 17 tasks from previous months. In January, there were 45 new error reports, of which 28 were resolved within the same month; the remaining 17 have carried over to February.
With precisely 17 tasks both closed and added, the workboard remains at the exact total of 298 open tasks, for the third month in a row. That's quite the coincidence.
Take a look at the workboard and look for tasks that could use your help.
For the month-over-month numbers, refer to the spreadsheet data.
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
How’d we do in our strive for operational excellence last month? Read on to find out!
One documented incident last month:
2021-12-03 mx
Impact: A portion of outgoing email from wikimedia.org was delivered with a delay of up to 24 hours. This affected staff Gmail, and Znuny/Phabricator notifications. No mail was lost; it was eventually delivered.
Image from Incident graphs.
Remember to review and schedule Incident Follow-up work in Phabricator. These are preventive measures and tech debt mitigations written down after an incident. Read about past incidents at Incident status on Wikitech.
Recently resolved incident follow-up:
Create paging alert for high MX queues.
Filed in December after the mail delivery incident, resolved later that month by Keith (Herron).
Limit db execution time of expensive MW special pages.
Filed in December after various incidents due to high DB/appserver load, carried out by Amir (Ladsgroup).
In December, we reported 22 new errors, of which 5 have since been resolved; 17 remain open and have carried over to January. From the 298 issues previously carried over, we also resolved 17, thus the workboard still adds up to 298 in total.
In previous editions, we sometimes looked at the breakdown of tasks that remained unresolved. This time, I'd like to draw attention to the throughput and distribution of tasks that did get resolved.
Production errors resolved in the month of December, by team and component (query):
For the month-over-month numbers, refer to the spreadsheet data.
Take a look at the workboard and look for tasks that could use your help.
View Workboard
Oldest unresolved errors:
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
How’d we do in our strive for operational excellence last month? Read on to find out!
6 documented incidents last month. That's above the two-year and five-year median of 4 per month (per Incident graphs).
2021-11-04 large file upload timeouts
Impact: For 9 months, editors were unable to upload large files (e.g. to Commons). Editors would receive generic error messages, typically after a timeout. In retrospect, a dozen distinct but related production errors had been reported and regularly observed, each providing different clues; however, most of these remained untriaged and uninvestigated for months. This may be related to the affected components having no active code steward.
2021-11-05 TOC language converter
Impact: For 6 hours, wikis experienced a blank or missing table of contents on many pages. For up to 3 days prior, wikis that have multiple language variants (such as Chinese Wikipedia) displayed the table of contents in an incorrect or inconsistent language variant (which are not understandable to some readers).
2021-11-10 cirrussearch commonsfile outage
Impact: For ~2.5 hours, the Search results page was unavailable on many wikis (except English Wikipedia). On Wikimedia Commons the search suggestions feature was unresponsive as well.
2021-11-18 codfw ipv6 network
Impact: For 8 minutes, the Codfw cluster experienced partial loss of IPv6 connectivity for upload.wikimedia.org. This did not affect availability of the service because the "Happy Eyeballs" algorithm ensures browsers (and other clients) automatically fall back to IPv4. The Codfw cluster generally serves Mexico and parts of the US and Canada. The upload.wikimedia.org service serves photos and other media/document files, such as those displayed in Wikipedia articles.
2021-11-23 core network routing
Impact: For about 12 minutes, Eqiad was unable to reach hosts in other data centers via public IP addresses. This was due to a BGP routing error. There was no impact on end-user traffic, and impact on internal traffic was limited (only Icinga alerts themselves) because internal traffic generally uses local IP subnets which we currently route with OSPF instead of BGP.
2021-11-25 eventgate-main outage
Impact: For about 3 minutes, eventgate-main was down. This resulted in 25,000 MediaWiki backend errors due to inability to queue new jobs. About 1000 user-facing web requests failed (HTTP 500 Error). Event production briefly dropped from ~3000 per second to 0 per second.
Remember to review and schedule Incident Follow-up work in Phabricator, which are preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status on Wikitech.
Recently resolved incident follow-up:
Disable DPL on wikis that aren't using it.
Filed after a July 2021 incident, done by Amir (Ladsgroup) and Kunal (Legoktm).
Create easy access to MySQL ports for faster incident response and maintenance.
Filed in Sep 2021, and carried out by Stevie (Kormat).
Create paging alert for primary DB hosts.
Filed after a Sept 2019 incident, done by Stevie (Kormat).
November saw 27 new production error reports of which 14 were resolved, and 13 remain open and carry over to the next month.
Of the 301 errors still open from previous months, 16 were resolved. Together with the 13 carried over from November that brings the workboard to 298 unresolved tasks.
For the month-over-month numbers, refer to the spreadsheet data.
Issues carried over from recent months:
| Month | Status |
|---|---|
| Apr 2021 | 9 of 42 issues left. |
| May 2021 | 16 of 54 issues left. |
| Jun 2021 | 9 of 26 issues left. |
| Jul 2021 | 11 of 31 issues left. |
| Aug 2021 | 10 of 46 issues left. |
| Sep 2021 | 10 of 24 issues left. |
| Oct 2021 | 20 of 49 issues left. |
| Nov 2021 | 13 of 27 new issues are carried forward. |
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
How’d we do in our strive for operational excellence last month? Read on to find out!
There were 4 documented incidents last month. This is about average compared to the past five years (per Incident graphs).
2021-10-08 network provider
Impact: For up to an hour, some regions experienced a partial connectivity outage. This primarily affected the US East Coast for ~13 minutes, and Russia for 1 hour. It was caused by a routing problem with one of several redundant network providers.
2021-10-22 eqiad networking
Impact: For ~40 minutes clients that are normally geographically routed to Eqiad experienced connection or timeout errors. We lost about 7K req/s during this time. After initial recovery, Eqiad was ready and repooled in ~10 minutes.
2021-10-25 s3 db replica
Impact: For ~30min MediaWiki backends were slower than usual. For ~12 hours, many wiki replicas were stale for Wikimedia Cloud Services such as Toolforge.
2021-10-29 graphite
Impact: During a server upgrade, historical data was lost for a subset of Graphite metrics. Some were recovered via the redundant server, but others were lost for good, as the redundant server had also been upgraded since then and had lost some data in a similar fashion.
Remember to review and schedule Incident Follow-up work in Phabricator, which are preventive measures and tech debt mitigations written down after an incident is concluded. Read about past incidents at Incident status on Wikitech.
In October, 49 new tasks were reported as production errors. Of these, we resolved 26, and 23 remain unresolved and carry forward to the next month.
Previously, the production error workboard held an accumulated total of 298 still-open error reports. We resolved 20 of those. Together with the 23 new errors carried over from October, this brings us to 301 unresolved errors on the board.
For the month-over-month numbers, refer to the spreadsheet data.
Take a look at the workboard and look for tasks that could use your help.
Issues carried over from recent months:
| Month | Status |
|---|---|
| Apr 2021 | 9 of 42 issues left. |
| May 2021 | 16 of 54 issues left. |
| Jun 2021 | 9 of 26 issues left. |
| Jul 2021 | 12 of 31 issues left. |
| Aug 2021 | 12 of 46 issues left. |
| Sep 2021 | 11 of 24 issues left. |
| Oct 2021 | 23 of 49 new issues are carried forward. |
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
How’d we do in our strive for operational excellence last month? Read on to find out!
We've had quite an eventful month, with 8 documented incidents in September. That's the highest since last year (Feb 2020) and one of the three worst months of the last five years.
Remember to review and schedule Incident Follow-up work in Phabricator, which are preventive measures and tech debt mitigations written down after an incident is concluded.
Image from Incident graphs.
The month of September saw 24 new production error reports of which 11 have since been resolved, and today, three to six weeks later, 13 remain open and have thus carried over to the next month. This is about average, although it makes it no less sad that we continue to introduce (and carry over) more errors than we rectify in the same time frame.
On the other hand, last month we did have a healthy focus on some of the older reports. The workboard stood at 301 unresolved errors last month. Of those, 16 were resolved. With the 13 new errors from September, this reduces the total slightly, to 298 open tasks.
For the month-over-month numbers, refer to the spreadsheet data.
Take a look at the workboard and look for tasks that could use your help.
Summary over recent months:
| Month | Status |
|---|---|
| Jan 2021 (50 issues) | 3 left. Unchanged. |
| Feb 2021 (20 issues) | 5 > 4 left. |
| Mar 2021 (48 issues) | 10 > 9 left. |
| Apr 2021 (42 issues) | 17 > 10 left. |
| May 2021 (54 issues) | 20 > 17 left. |
| Jun 2021 (26 issues) | 10 > 9 left. |
| Jul 2021 (31 issues) | 12 left. Unchanged. |
| Aug 2021 (46 issues) | 17 > 12 left. |
| Sep 2021 (24 issues) | 13 unresolved issues remaining. |

| Tally | |
|---|---|
| 301 | issues open, as of Excellence #35 (August 2021). |
| -16 | issues closed, of the previous 301 open issues. |
| +13 | new issues that survived September 2021. |
| 298 | issues open, as of today (19 Oct 2021). |
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
This post gives a quick introduction to a benchmarking tool, phpbench, ready for you to experiment with in core and skins/extensions.[1]
From their documentation:
PHPBench is a benchmark runner for PHP analogous to PHPUnit but for performance rather than correctness.
In other words, while a PHPUnit test will tell you if your code behaves a certain way given a certain set of inputs, a PHPBench benchmark only cares how long that same piece of code takes to execute.
The tooling and boilerplate will be familiar to you if you've used PHPUnit. There's a command-line runner at vendor/bin/phpbench, benchmarks are discoverable by default in tests/Benchmark, a configuration file (benchmark.json) allows for setting defaults across all benchmarks, and the benchmark classes and tests look pretty similar to PHPUnit tests.
Here's an example test for the Html::openElement() function:
namespace MediaWiki\Tests\Benchmark;

class HtmlBench {

	/**
	 * @Assert("mode(variant.time.avg) < 85 microseconds +/- 10%")
	 */
	public function benchHtmlOpenElement() {
		\Html::openElement( 'a', [ 'class' => 'foo' ] );
	}
}
So, taking it line by line:
If we run the test with composer phpbench, we will see that the test passes. One thing to be careful with, though, is adding assertions that are too strict – you would not want a patch to fail CI because the assertion for execution time was not flexible enough (more on this later on).
One neat feature in PHPBench is the ability to tag current results and compare with another run. Looking at the HtmlBench benchmark from above, for example, we can compare the work done in rMW5deb6a2a4546: Html::openElement() micro-optimisations to get before and after comparisons of the performance changes.
Here's a benchmark of e82c5e52d50a9afd67045f984dc3fb84e2daef44, the commit before the performance improvements added to Html::openElement() in rMW5deb6a2a4546: Html::openElement() micro-optimisations
❯ git checkout -b html-before-optimizations e82c5e52d50a9afd67045f984dc3fb84e2daef44  # get the old HTML::openElement code before optimizations
❯ git review -x 727429  # get the core patch which introduces phpbench support
❯ composer phpbench -- tests/Benchmark/includes/HtmlBench.php --tag=original
And the output [2]:
Note that we've used --tag=original to store the results. Now we can check out the newer code, and use --ref=original to compare with the baseline:
❯ git checkout -b html-after-optimizations 5deb6a2a4546318d1fa94ad8c3fa54e9eb8fc67c  # get the new HTML::openElement code with optimizations
❯ git review -x 727429  # get the core patch which introduces phpbench support
❯ composer phpbench -- tests/Benchmark/includes/HtmlBench.php --ref=original --report=aggregate
And the output [3]:
We can see that the execution time roughly halved, from 18 microseconds to 8 microseconds. (For understanding the other columns in the report, it's best to read through the Quick Start guide for phpbench.) PHPBench can also provide an error exit code if the performance decreased. One way that PHPBench might fit into our testing stack would be to have a job similar to Fresnel, where a non-voting comment on a patch alerts developers whether the PHPBench performance decreased in the patch.
A slightly more complex example is available in GrowthExperiments (patch). That patch makes use of setUp/tearDown methods to prepopulate the database entries needed for the code being benchmarked:
/**
 * @BeforeMethods ("setUpLinkRecommendation")
 * @AfterMethods ("tearDownLinkRecommendation")
 * @Assert("mode(variant.time.avg) < 20000 microseconds +/- 10%")
 */
public function benchFilter() {
	$this->linkRecommendationFilter->filter( $this->tasks );
}
The setUpLinkRecommendation and tearDownLinkRecommendation methods have access to MediaWikiServices, and generally you can do similar things you'd do in an integration test to setup and teardown the environment. This test is towards the opposite end of the spectrum from the core test discussed above which looks at Html::openElement(); here, the goal is to look at a higher level function that involves database queries and interacting with MediaWiki services.
You can experiment with the tooling and see if it is useful to you. Some open questions:
Looking forward to your feedback! [4]
[1] Thank you, @hashar, for working with me to include this in Quibble and roll out to CI to help with evaluation!
[2]
> phpbench run --config=tests/Benchmark/phpbench.json --report=aggregate 'tests/Benchmark/includes/HtmlBench.php' '--tag=original'
PHPBench (1.1.2) running benchmarks...
with configuration file: /Users/kostajh/src/mediawiki/w/tests/Benchmark/phpbench.json
with PHP version 7.4.24, xdebug ✔, opcache ❌

\MediaWiki\Tests\Benchmark\HtmlBench
    benchHtmlOpenElement....................R1 I1 ✔ Mo18.514μs (±1.94%)

Subjects: 1, Assertions: 1, Failures: 0, Errors: 0
Storing results ... OK
Run: 1346543289c75373e513cc3b11fbf5215d8fb6d0

+-----------+----------------------+-----+------+-----+----------+----------+--------+
| benchmark | subject              | set | revs | its | mem_peak | mode     | rstdev |
+-----------+----------------------+-----+------+-----+----------+----------+--------+
| HtmlBench | benchHtmlOpenElement |     | 50   | 5   | 2.782mb  | 18.514μs | ±1.94% |
+-----------+----------------------+-----+------+-----+----------+----------+--------+
[3]
> phpbench run --config=tests/Benchmark/phpbench.json --report=aggregate 'tests/Benchmark/includes/HtmlBench.php' '--ref=original' '--report=aggregate'
PHPBench (1.1.2) running benchmarks...
with configuration file: /Users/kostajh/src/mediawiki/w/tests/Benchmark/phpbench.json
with PHP version 7.4.24, xdebug ✔, opcache ❌
comparing [actual vs. original]

\MediaWiki\Tests\Benchmark\HtmlBench
    benchHtmlOpenElement....................R5 I4 ✔ [Mo8.194μs vs. Mo18.514μs] -55.74% (±0.50%)

Subjects: 1, Assertions: 1, Failures: 0, Errors: 0

+-----------+----------------------+-----+------+-----+---------------+-----------------+----------------+
| benchmark | subject              | set | revs | its | mem_peak      | mode            | rstdev         |
+-----------+----------------------+-----+------+-----+---------------+-----------------+----------------+
| HtmlBench | benchHtmlOpenElement |     | 50   | 5   | 2.782mb 0.00% | 8.194μs -55.74% | ±0.50% -74.03% |
+-----------+----------------------+-----+------+-----+---------------+-----------------+----------------+
[4] Thanks to @zeljkofilipin for reviewing a draft of this post.
Last week I spoke to a few of my Wikimedia Foundation (WMF) colleagues about how we deploy code—I completely botched it. I got too complex too fast. It only hit me later—to explain deployments, I need to start with a lie.
M. Jagadesh Kumar explains:
Every day, I am faced with the dilemma of explaining some complex phenomena [...] To realize my goal, I tell "lies to students."
This idea comes from Terry Pratchett's "lies-to-children" — a false statement that leads to a more accurate explanation. Asymptotically approaching truth via approximation.
Every section of this post is a subtle lie, but approximately correct.
The first lie I need to tell is that we deploy code once a week.
Every Thursday, Release-Engineering-Team deploys a MediaWiki release to all 978 wikis. The "release branch" is 198 different branches—one branch each for mediawiki/core, mediawiki/vendor, 188 MediaWiki extensions, and eight skins—that get bundled up via git submodule.
The next lie gets a bit closer to the truth: we don't deploy on Thursday; we deploy Tuesday through Thursday.
The cleverly named TrainBranchBot creates a weekly train branch at 2 am UTC every Tuesday.
Progressive rollouts give users time to spot bugs. We have an experienced user-base—as Risker attested on the Wikitech-l mailing list:
It's not always possible for even the best developer and the best testing systems to catch an issue that will be spotted by a hands-on user, several of whom are much more familiar with the purpose, expected outcomes and change impact on extensions than the people who have written them or QA'd them.
Now I'm nearing the complete truth: we deploy every day except for Fridays.
Brace yourself: we don't write perfect software. When we find serious bugs, they block the release train — we will not progress from Group1 to Group2 (for example) until we fix the blocking issue. We fix the blocking issue by backporting a patch to the release branch. If there's a bug in this release, we patch that bug in our mainline branch, then git cherry-pick that patch onto our release branch and deploy that code.
We deploy backports three times a day during backport deployment windows. In addition to backports, developers may opt to deploy new configuration or enable/disable features in the backport deployment windows.
Release engineers train others to deploy backports twice a week.
We deploy on Fridays when there are major issues. Examples of major issues are:
We avoid deploying on Fridays because we have a small team of people to respond to incidents. We want those people to be away from computers on the weekends (if they want to be), not responding to emergencies.
There are 42 microservices on Kubernetes deployed via helm. And there are 64 microservices running on bare metal. The service owners deploy those microservices outside of the train process.
We coordinate deployments on our deployment calendar wiki page.
We progressively deploy a large bundle of MediaWiki patches (between 150 and 950) every week. There are 12 backport windows a week where developers can add new features, fix bugs, or deploy new configurations. There are microservices deployed by developers at their own pace.
Thanks to @brennen, @greg, @KSiebert, @Risker, and @VPuffetMichel for reading early drafts of this post. The feedback was very helpful. Stay tuned for "How we deploy code: Part II."
How’d we do in our strive for operational excellence last month? Read on to find out!
Zero documented incidents last month. Isn't that something!
Learn about past incidents at Incident status on Wikitech. Remember to review and schedule Incident Follow-up in Phabricator, which are preventive measures and other action items to learn from.
Image from Incident graphs.
In August we resolved 18 of the 156 reports that carried over from previous months, and reported 46 new failures in production. Of the new ones, 17 remain unresolved as of writing and will carry over to next month.
The number of new error reports in August was fairly high at 46, compared to 31 reports in July, and 26 reports in June.
The backlog of "Old" issues saw no progress this past month and remained constant at 146 open error reports.
Unified graph:
Take a look at the workboard and look for tasks that could use your help.
Last few months in review:
| Month | Status |
|---|---|
| Jan 2021 (50 issues) | 3 left. |
| Feb 2021 (20 issues) | 6 > 5 left. |
| Mar 2021 (48 issues) | 13 > 10 left. |
| Apr 2021 (42 issues) | 18 > 17 left. |
| May 2021 (54 issues) | 22 > 20 left. |
| Jun 2021 (26 issues) | 11 > 10 left. |
| Jul 2021 (31 issues) | 16 > 12 left. |
| Aug 2021 (46 issues) | + 17 new unresolved issues. |

| Tally | |
|---|---|
| 156 | issues open, as of Excellence #34 (July 2021). |
| -18 | issues closed, of the previously open issues. |
| +17 | new issues that survived August 2021. |
| 155 | issues open, as of today (3 Sep 2021). |
For more month-over-month numbers refer to the spreadsheet.
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
How’d we do in our strive for operational excellence last month? Read on to find out!
3 documented incidents last month. That's at the median for the past twelve months, and slightly below the median of 4 over the past five years (Incident stats).
Learn about past incidents at Incident status on Wikitech. Remember to review and schedule Incident Follow-up in Phabricator, which are preventive measures and other action items filed after an incident.
Last month the workboard held 154 non-old unresolved error reports. Over the past thirty days, the collective efforts of our volunteers and engineering teams have closed 14 of those.
In the month of July, we also introduced or discovered 31 new error reports (that's an average of one production regression every day!). Of those new reports, 15 were resolved and 16 remain unresolved. The workboard now tallies up to 156 tasks.
Take a look at the workboard and look for tasks that could use your help.
Over on the backlog, we're continuing to ploddingly present progress on production problems from phantoms of christmases past.
For more month-over-month numbers refer to the spreadsheet data.
Below are various older issues that may have fallen by the wayside, taken from somewhat-random stab-in-the-dark queries.
Oldest unresolved errors that are still reproducible (Phab query):
Stalled error reports (Phab query):
Oldest error with a patch for review (Phab query):
| Month | Status |
|---|---|
| Jan 2021 (3 of 50 issues left) | ⚠️ Unchanged. Have a look-see! |
| Feb 2021 (6 of 20 issues left) | ⚠️ Unchanged. Take a gander! |
| Mar 2021 (13 of 48 issues left) | ⚠️ Unchanged. Check it out! |
| Apr 2021 (18 of 42 issues left) | -1 |
| May 2021 (22 of 54 issues left) | -3 |
| June 2021 (11 of 26 issues left) | -4 |
| July 2021 (16 of 31 issues left) | +31; -15 |

| Tally | |
|---|---|
| 154 | issues open, as of Excellence #33 (June 2021). |
| -14 | issues closed, of the previous 154 open issues. |
| +16 | new issues that survived July 2021. |
| 156 | issues open, as of today. |
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
The release engineering team triages tasks flagged Release-Engineering-Team on a weekly basis. It is an all-hands-on-deck, one-hour meeting in which we pick tasks one by one and decide what to do with them. We started with more than a hundred of them and are now down to just a dozen or so, most filed since the last meeting.
I have been doing those routine triages for the projects I closely manage, often on Friday afternoons. I have recently started being a bit more serious about it and even allocated a couple of weeks entirely dedicated to acting on the backlog. This post summarizes some of my discoveries; hopefully it will inspire the reader to tackle their own backlogs and technical debt, and in the end we will have improved our ecosystem.
I keep filing tasks rather than taking notes or writing emails. I find the Phabricator interface convenient since it lets me flag a task with whatever labels I want (Technical-Debt, Documentation, MediaWiki-General) and subscribe individuals or even a whole team. It is great. With time those tasks pile up and it is easy to forget old ones, so they have to be revisited from time to time. It is as easy as searching for any open tasks I have filed and ordering them by creation date:
Authors | Current viewer |
Statuses | Open, Stalled |
Group By | None |
Order By | Creation (Oldest First) |
https://phabricator.wikimedia.org/maniphest/query/Wws2E0C7IaFd/#R
The first task in the list is the oldest you have created and most probably deserves to be acted on. From there, pick the tasks one by one.
Some will surely be obsolete, because they have already been acted on or the underlying infrastructure has entirely changed. An example of a six-year-old task I declined is T100099; it followed a meeting about deploying MediaWiki services to Beta-Cluster-Infrastructure. The task was partially achieved for a few services (notably Parsoid) and was left open since we never moved all services to the same system. Nowadays developers deploy a Docker image and restart the Docker container. The notes are obsolete and the task thus has no purpose anymore.
T149924 came from deploying static web assets using git directly to /srv. However, the partition also hosted dynamically generated content, such as all the content from https://doc.wikimedia.org/ , https://integration.wikimedia.org/ or state from a CI daemon. The issue is problematic when we reimage the server, especially during OS upgrades which we do every two years, and the task history reflects that:
I completed it because that task showed up in my list of oldest bugs; it kept showing up whenever I did the triage and that was an incentive to get it gone. We are in much better shape now: the services have been decoupled onto different machines, and the static assets are deployed using our deployment tool, Scap.
Besides your team projects, you surely have pet side projects or legacy tags you might want to revisit. They can be found by searching for the projects you are a member of (assuming you made yourself a member): https://phabricator.wikimedia.org/project/query/JS0zmX.yalpI/#R
For example, I introduced Doxygen to generate the MediaWiki PHP documentation, and git-review to assist interactions with Gerrit, for which bugs are tracked in a column of the Gerrit project, and I am probably the only one actively acting on those tasks.
You can again list tasks filed against each project sorted by creation date, and since you are a member of the project you will most probably be able to act on those old tasks.
One of the oldest tasks I had was T48148, which is about hiding CI and robot comments on a Gerrit change. The task was filed in 2013; I found the upstream proposed solution back in 2019 and, well, *cough*, forgot about it. Since I encountered the task during a triage, I went on to tackle it, and in short the required change boils down to adding a single line to the CI configuration:
gerrit:
verified: 2
+ tag: autogenerated:ci
That took almost 9 months, since I was not actively triaging old tasks.
Just like we have the generic Documentation tag for any task relating to documentation, we have Technical-Debt to mark a task as requiring extra effort to bring us to modernity. When triaging your own or your projects' tasks, you can flag them as technical debt to easily find them later on.
Some tasks can immediately be filed as technical debt. That was the case for T141324, which is about sending the logs of the Gerrit code review system to Logstash, making them easier to dig through and discover. Sounds simple? Well, not quite.
The story is a bit complicated, but in short: Gerrit is a Java application, our team does not necessarily have much experience with it, and the state of Java logging is a bit unclear (Gerrit uses log4j). Luckily we had some support from actual Java developers and managed to get some logs ingested; though the fields were not properly formatted, it was progress.
After I got assigned as the primary maintainer of our Gerrit setup, I definitely needed proper logging. When we upgraded Gerrit to 3.2, the library we used to format the logs to Json was no longer provided by upstream, forcing us to maintain a fork of Gerrit just for that purpose.
Luckily upstream has made improvements and I found out it supports JSON logging out of the box, while our logging infrastructure learned to ingest JSON logs. We even got as far as supporting Elastic Common Schema to use predefined field names.
That task had been technical debt for 5 years, but since I kept seeing it I kept being reminded of it, and eventually managed to address it.
Some tasks cannot be acted on because they depend on an upstream change that might be delayed for various reasons. A massive issue we had encountered since at least 2015 was slowness when doing a git fetch from our busiest repository. I previously blogged about it (Blog Post: Faster source code fetches thanks to git protocol version 2): Google addressed the problem by proposing version 2 of the git protocol. It was one of the incentives for us to upgrade Gerrit, and as soon as we upgraded I made a point to test the fix and make it well known to our developers (do use protocol.version=2 in your .gitconfig).
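For anyone wanting to apply it, the client-side setting is a one-liner (it writes to your ~/.gitconfig):

```shell
# Enable git wire protocol version 2 for all your repositories.
git config --global protocol.version 2

# Confirm the setting was written.
git config --global protocol.version
```

With protocol version 2 the server no longer has to advertise every ref on each fetch, which is what made fetches from busy repositories so slow.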
When processing old tasks, you may find it hard to tackle ones that need focus for a few days if not weeks, as in the example above. But there are also a bunch of little annoying tasks that are surprisingly easy to solve and give immediate reward. The positive feedback loop will get you in the mood to find more easy tasks and thus reduce your backlog. A few more examples:
T221510, filed in 2019 and addressed two years later, requested exposing a machine-readable test coverage report. The file was there (clover.xml); it was simply not linked from the web page. A simple <a href="clover.xml">clover.xml</a> was all that was required.
My favorite tasks are obviously the ones that have already been solved and are just pending the paperwork to mark them resolved. T138653 was about a user unable to log in to Gerrit due to a duplicate account; 3 years after it had been filed, the user reported they were able to log in properly, and I marked it resolved one hour later. I guess that user was grooming their old tasks as well.
And finally, some old tasks might not be worth fixing. We are probably too kind with those and should be more strict in declining very old tasks. An example is T63733: the MediaWiki source code is deployed to the Wikimedia production cluster under a directory named php-<version>. Surely the php- prefix does not offer any meaningful information. However, since it is hardcoded in various places, removing it would require moving files around on the whole fleet of servers; it would be a bit challenging and definitely a risky change. Should we drop that useless prefix? For sure. Is it worth facing an outage and possibly multiple degraded services? Definitely not, and I have thus just declined it.
How’d we do in our strive for operational excellence last month? Read on to find out!
3 documented incidents. That's lower than June of the previous five years, when the month saw 5–9 incidents. I've added a new panel ⭐️ to the Incident statistics tool. This one plots monthly statistics on top of previous years, to more easily compare them:
Learn more from the Incident documents on Wikitech, and remember to review and schedule Incident Follow-up in Phabricator, which are preventive measures and other action items filed after an incident.
In June, work on production errors appears to have stagnated a bit. Or more precisely, the work only resulted in relatively few tasks being resolved. 15 of the 26 new tasks are still open as of writing.
Of the tasks from previous months, only 11 were resolved, leaving most columns unchanged. See the table further down for a more detailed breakdown and links to Phabricator queries for the tasks in question.
With the 15 remaining new tasks, and the 11 tasks resolved from our backlog, this raises the chart from 150 to 154 tasks.
Take a look at the workboard and look for tasks that could use your help.
Month-over-month plots based on spreadsheet data.
Summary over recent months:
Jan 2020 (1 of 7 left) | ⚠️ Unchanged (over one year old). | |
Mar 2020 (2 of 2 left) | ⚠️ Unchanged (over one year old). | |
Apr 2020 (4 of 14 left) | ⚠️ Unchanged (over one year old). | |
May 2020 (5 of 14 left) | ⚠️ Unchanged (over one year old). | |
Jun 2020 (5 of 14 left) | ⚠️ Unchanged (over one year old). | |
Jul 2020 (4 of 24 issues) | ⚠️ Unchanged (over one year old). | |
Aug 2020 (11 of 53 issues) | ⬇️ One task resolved. | -1 |
Sep 2020 (7 of 33 issues) | ⚠️ Unchanged (over one year old). | |
Oct 2020 (19 of 69 issues) | ⚠️ Unchanged (over one year old). | |
Nov 2020 (8 of 38 issues) | ⚠️ Unchanged (over one year old). | |
Dec 2020 (7 of 33 issues) | ⚠️ Unchanged (over one year old). | |
Jan 2021 (3 of 50 issues) | ⚠️ Unchanged (over one year old). | |
Feb 2021 (6 of 20 issues) | ⬇️ One task resolved. | -1 |
Mar 2021 (13 of 48 issues) | ⬇️ One task resolved. | -1 |
Apr 2021 (19 of 42 issues) | ⬇️ Four tasks resolved. | -4 |
May 2021 (25 of 54 issues) | ⬇️ Four tasks resolved. | -4 |
June 2021 (15 of 26 issues) | 📌 26 new issues, of which 11 were closed. | +26, -11 |
Tally | |
---|---|
150 | issues open, as of Excellence #32 (May 2021). |
-11 | issues closed, of the previous 150 open issues. |
+15 | new issues that survived June 2021. |
154 | issues open as of yesterday. |
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.
Until next time,
– Timo Tijhof
🕳 O'Neill: We've done this!
Dr Jackson: We do this every day.
O'Neill: I'm not talking about briefings in general, Daniel, I'm talking about this briefing; I'm talking about this day.
Teal'c: Col. O'Neill is correct. Events do appear to be repeating themselves.
How’d we do in our strive for operational excellence last month? Read on to find out!
Zero incidents recorded in the past month. Yay! That's only five months after November 2020, the last month without documented incidents (Incident stats).
Remember to review Preventive measures in Phabricator, which are action items filed after an incident.
In May, we unfortunately saw a repeat of the worrying pattern we saw in April, but with higher numbers. We found 54 new errors. This is the most new errors in a single month since the Excellence monthly began three years ago in 2018. About half of these (29 of 54) remain unresolved as of writing, two weeks into the following month.
Month-over-month plots based on spreadsheet data.
Below is a snapshot of just the 54 new issues found last month, listed by their code steward.
Be mindful that the reporting of errors is not itself a negative point per se. I think it should be celebrated when teams have good telemetry, detect their issues early, and address them within their development cycle. It might be more worrisome when teams lack the telemetry or time to find such issues, or can't keep up with the pace at which issues are found.
Anti Harassment Tools | None. | |
---|---|---|
Community Tech | None. | |
Editing Team | +2, -1 | Cite (T283755); OOUI (T282176). |
Growth Team | +17, -4 | Add-Link (T281960); GrowthExperiments (T281525 T281703 T283546 T283638 T283924); Echo (T282446); Recent-changes (T282047 T282726); StructuredDiscussions (T281521 T281523 T281782 T281784 T282069 T282146 T282599 T282605). |
Language Team | +1 | Translate extension (T283828). |
Parsing Team | +1 | Parsoid (T281932). |
Reading Web | None. | |
Structured Data | None. | |
Product Infra Team | +1 | WikimediaEvents (T282580). |
Analytics | None. | |
Performance Team | None. | |
Platform Engineering | +16, -11 | MediaWiki-API (T282122); MediaWiki-General (T282173); MediaWiki-Page-derived-data (T281714 T281802 T282180 T283282), MediaWiki-Revision-backend (T282145 T282723 T282825 T283170); MediaWiki-User-management (T283167); MW Expedition (T281526 T281981 T282038 T282181 T283196). |
Search Platform | +3, -2 | CirrusSearch (T282036 T282207); GeoData (T282735). |
WMDE TechWish | +2, -1 | Revision-Slider (T282067); VisualEditor Template dialog (T283511). |
WMDE Wikidata | +3, -1 | Wikibase (T282534 T283198 T283862). |
No owner | +7, -6 | CentralAuth (T282834 T283635); Change-tagging (T283098 T283099); MapSources (T282833); MediaWiki-Page-information (T283751); Other (T283252). |
Take a look at the workboard and look for tasks that could use your help.
Summary over recent months:
Aug 2019 (0 of 14 left) | ✅ Last task resolved! | -1 |
Jan 2020 (1 of 7 left) | ⚠️ Unchanged (over one year old). | |
Mar 2020 (2 of 2 left) | ⚠️ Unchanged (over one year old). | |
Apr 2020 (4 of 14 left) | ⬇️ One task resolved. | -1 |
May 2020 (5 of 14 left) | ⚠️ Unchanged (over one year old). | |
Jun 2020 (5 of 14 left) | ⚠️ Unchanged (over one year old). | |
Jul 2020 (4 of 24 issues) | ⏸ — | |
Aug 2020 (12 of 53 issues) | ⬇️ One task resolved. | -1 |
Sep 2020 (7 of 33 issues) | ⏸ — | |
Oct 2020 (19 of 69 issues) | ⬇️ One task resolved. | -1 |
Nov 2020 (8 of 38 issues) | ⬇️ One task resolved. | -1 |
Dec 2020 (7 of 33 issues) | ⏸ — | |
Jan 2021 (3 of 50 issues) | ⏸ — | |
Feb 2021 (7 of 20 issues) | ⬇️ One task resolved. | -1 |
Mar 2021 (14 of 48 issues) | ⬇️ Four tasks resolved. | -4 |
Apr 2021 (23 of 42 issues) | ⬇️ Two tasks resolved. | -2 |
May 2021 (29 of 54 issues) | 54 new issues found, of which 29 remain open. | +54; -25 |
Tally | |
---|---|
133 | issues open, as of Excellence #31 (12 May 2021). |
-12 | issues closed, of the previous 133 open issues. |
+29 | new issues that survived May 2021. |
150 | issues open, as of today (12 June 2021). |
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.
Until next time,
– Timo Tijhof
Footnotes:
Incident status, Wikitech.
Wikimedia incident stats by Krinkle, CodePen.
Production error data (spreadsheet and plots).
Phabricator report charts for Wikimedia-production-error project.
How’d we do in our strive for operational excellence last month? Read on to find out!
6 documented incidents. That's above the historical average of 3–4 per month.
Learn about recent incidents at Incident status on Wikitech, or Preventive measures in Phabricator.
In April, we saw a continuation of the healthy trend that started this January — a trend where the back of the line is moving forward at least as quickly as the front of the line. We did take a little breather in March where we almost broke even, but otherwise the trend is going well.
Last month we bade farewell to the production errors we found in July 2019. This month we cleared out the column for October 2019.
One point of concern is that we did encounter a high number of new production errors — errors that we failed to catch during development, code review, continuous integration, beta testing, or pre-deployment checks. Where we used to discover about a dozen of those a month, we found 42 during this month. As of writing, 17 of the 42 April-discovered errors have been resolved.
The "Old" column (generally tracking pre-2019 tasks) grew for the first time in six months. This increase can largely be attributed to improved telemetry of client-side errors uncovering issues in under-resourced products, such as the old Kaltura video player.
Month-over-month plots based on spreadsheet data.
Summary over recent months, per spreadsheet:
Aug 2019 (1 of 14 left) | ⚠️ Unchanged (over one year old). | |
Oct 2019 (0 of 12 left) | ✅ Last three tasks resolved! | -3 |
Jan 2020 (1 of 7 left) | ⚠️ Unchanged (over one year old). | |
Mar 2020 (2 of 2 left) | ⚠️ Unchanged (over one year old). | |
Apr 2020 (5 of 14 left) | ⚠️ Unchanged (over one year old). | |
May 2020 (5 of 14 left) | ⏸ — | |
Jun 2020 (5 of 14 left) | ⬇️ One task resolved. | -1 |
Jul 2020 (4 of 24 issues) | ⬇️ One task resolved. | -1 |
Aug 2020 (13 of 53 issues) | ⬇️ Two tasks resolved. | -2 |
Sep 2020 (7 of 33 issues) | ⏸ — | |
Oct 2020 (20 of 69 issues) | ⬇️ Two tasks resolved. | -2 |
Nov 2020 (9 of 38 issues) | ⏸ — | |
Dec 2020 (7 of 33 issues) | ⬇️ Four tasks resolved. | -4 |
Jan 2021 (3 of 50 issues) | ⬇️ One task resolved. | -1 |
Feb 2021 (8 of 20 issues) | ⬇️ One task resolved. | -1 |
Mar 2021 (18 of 48 issues) | ⬇️ Sixteen tasks resolved. | -16 |
Apr 2021 (25 of 42 issues) | 42 new issues found, of which 25 remained open. | +42; -17 |
Tally | |
---|---|
139 | issues open, as of Excellence #30 (March 2021). |
-31 | issues closed, of the previously open issues. |
+25 | new issues that survived April 2021. |
133 | issues open, as of today (12 May 2021). |
Take a look at the workboard and look for tasks that could use your help:
Thank you to everyone who helped by reporting, investigating, or resolving problems in production!
Until next time,
– Timo Tijhof
🎥 McMurphy: That nurse, man... she, uh, she ain't honest.
Doctor: Ah now, look. Miss Ratched is one of the finest nurses we've got in this institution.
McMurphy: Ha! Well […] She likes a rigged game, know what I mean?
One of the critical pieces of our infrastructure is Gerrit. It hosts most of our git repositories and is the primary code review interface. Gerrit is written in the Java programming language, which runs in the Java Virtual Machine (JVM). For a couple of years we had been struggling with memory issues which eventually led to an unresponsive service and unattended restarts. The symptoms were the usual ones: application responses slowing and degrading until server-side errors rendered the service unusable. Eventually the JVM terminates with:
java.lang.OutOfMemoryError: Java heap space
This post is my journey toward identifying the root cause and having it fixed by the upstream developers. Given I barely knew anything about Java, and much less about its ecosystem and tooling, I learned more than a few things along the road and felt it was worth sharing.
The first meaningful task was in June 2019 (T225166) which over several months has led us to:
All of those were sane operations that are part of any application life-cycle; some were meant to address other issues. Raising the maximum heap size (20G to 32G) definitely reduced the frequency of crashes.
Still, we had memory filling up over and over. The graph below shows the memory usage from September 2019 to September 2020. The increase in maximum heap usage in October 2020 is the JVM heap being raised from 20G to 32G. Each of the "little green hills" corresponds to memory filling up until we either restarted Gerrit or the JVM crashed unattended:
Zooming in on a week, one can clearly see the memory filling up almost entirely until we had to restart:
This had to stop: complaints about Gerrit being unresponsive, SRE having to respond to java.lang.OutOfMemoryError: Java heap space, or us having to "proactively" restart before a weekend. Those were not good practices. Back and fresh from vacation, I filed a new task (T263008) in September 2020 and started to tackle the problem in my spare time. Would I be able to find my way in an ecosystem totally unknown to me?
Challenge accepted!
Stuff learned
Since the JVM runs out of memory, let's look at memory allocation. The JDK provides several utilities to interact with a running JVM, be it attaching a debugger, writing a copy of the whole heap, or sending admin commands to the JVM.
jmap lets one take a full capture of the memory used by a Java virtual machine. It has to run as the same user as the application (we use Unix username gerrit2) and when having multiple JDKs installed, one has to make sure to invoke the jmap that is provided by the Java version running the targeted JVM.
Dumping the memory is then a matter of:
sudo -u gerrit2 /usr/lib/jvm/java-8-openjdk-amd64/bin/jmap \
    -dump:live,format=b,file=/var/lib/gerrit-202009170755.hprof \
    <pid of java process here>
It takes a few minutes depending on the number of objects. The resulting .hprof file is a binary format, which can be interpreted by various tools.
jhat, a Java heap analyzer, is provided by the JDK alongside jmap. I ran it disabling tracking of object allocations (-stack false) as well as references to objects (-refs false), since even with 64G of RAM and 32 cores it took a few hours and eventually crashed. That is due to the insane amount of live objects. On the server I thus ran:
/usr/lib/jvm/java-8-openjdk-amd64/bin/jhat -stack false -refs false gerrit-202009170755.hprof
It spawns a web service which I can reach from my machine over ssh, using port redirection, and open a web browser for it:
ssh -C -L 8080:ip6-localhost:7000 gerrit1001.wikimedia.org & xdg-open http://ip6-localhost:8080/
Instance Counts for All Classes (excluding native types)
2237744 instances of class org.eclipse.jgit.lib.ObjectId
2128766 instances of class org.eclipse.jgit.lib.ObjectIdRef$PeeledNonTag
 735294 instances of class org.eclipse.jetty.util.thread.Locker
 735294 instances of class org.eclipse.jetty.util.thread.Locker$Lock
 735283 instances of class org.eclipse.jetty.server.session.Session
...
Another view shows 3.5G of byte arrays.
I got pointed to https://heaphero.io/ , however the file is too large to upload and it contains sensitive information (credentials, users' personal information) which we cannot share with a third party.
Nothing really conclusive at this point; the heap dump had been taken shortly after a restart and Gerrit was not in trouble.
Eventually I found out JavaMelody has a view providing the exact same information, without all the trouble of figuring out the proper set of parameters for jmap, jhat and ssh. Just browse to the monitoring page and:
Stuff learned
An idea was to take a heap dump whenever the JVM encounters an out of memory error. That can be turned on by passing the extended option HeapDumpOnOutOfMemoryError to the JVM and specifying where the dump will be written to with HeapDumpPath:
java \
    -XX:+HeapDumpOnOutOfMemoryError \
    -XX:HeapDumpPath=/srv/gerrit \
    -jar gerrit.war ...
And surely next time it ran out of memory:
Nov 07 13:43:35 gerrit2001 java[30197]: java.lang.OutOfMemoryError: Java heap space
Nov 07 13:43:35 gerrit2001 java[30197]: Dumping heap to /srv/gerrit/java_pid30197.hprof ...
Nov 07 13:47:02 gerrit2001 java[30197]: Heap dump file created [35616147146 bytes in 206.962 secs]
That resulted in a 34GB dump file, which was not convenient for a full analysis: even with 16G of heap for the analysis and a couple hours of CPU churning, it was not any help.
And at this point the JVM is still around: the java process is still there, and thus systemd does not restart the service for us even though we have instructed it to do so:
[Service]
ExecStart=java -jar gerrit.war
Restart=always
RestartSec=2s
That led to our Gerrit replica being down for a whole weekend with no alarm whatsoever (T267517). I imagine the JVM does not exit on an OutOfMemoryError in order to let one investigate the cause. Just like the heap dump, the behavior can be configured, via the ExitOnOutOfMemoryError extended option:
java -XX:+ExitOnOutOfMemoryError
Next time, the JVM will exit, causing systemd to notice that the service went away; it will then happily restart it again.
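Putting the pieces together, the service definition conceptually becomes something like this sketch (simplified; our real unit carries more options):

```ini
[Service]
ExecStart=java \
    -XX:+HeapDumpOnOutOfMemoryError \
    -XX:HeapDumpPath=/srv/gerrit \
    -XX:+ExitOnOutOfMemoryError \
    -jar gerrit.war
Restart=always
RestartSec=2s
```

With ExitOnOutOfMemoryError the process actually dies on an out-of-memory error, so Restart=always can finally do its job.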
Stuff learned
When I filed the task, I suspected that enabling git protocol version 2 (J199) on CI might have been the root cause. That eventually led me to look at how Gerrit caches git operations. Being a Java application, it does not use the regular git command but a pure-Java implementation, jgit, a project started by the same author as Gerrit (Shawn Pearce).
To speed up operations, jgit keeps git objects in memory, with various tuning settings. You can read more about it at T263008#6601490, but in the end it was of no use for the problem. @thcipriani would later point out that the jgit cache does not grow past its limit:
The investigation was not a good lead, but it surely prompted us to get a better view of what is going on in the jgit cache. To do so, we would need to expose historical metrics of the cache status.
Stuff learned
We always had trouble determining whether our jgit cache was properly sized, and tuned it randomly with little information. Eventually I found out that Gerrit has a wide range of metrics available, described at https://gerrit.wikimedia.org/r/Documentation/metrics.html . I had always wondered how we could access them without having to write a plugin.
The first step was to add the metrics-reporter-jmx plugin. It registers all the metrics with JMX, a Java system to manage resources. That is then exposed by JavaMelody and at least lets us browse the metrics:
I had long had a task to get those metrics exposed (T184086), but never a strong enough incentive to work on it. The idea was to expose the metrics to the Prometheus monitoring system, which would scrape them and make them available in Grafana. They can be exposed using the metrics-reporter-prometheus plugin. Some configuration is required to create an authentication token that lets Prometheus scrape the metrics, and it is then all set and collected.
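On the Prometheus side the scrape job is then conceptually simple; a hypothetical sketch (the metrics path, token and target are illustrative assumptions, not our actual configuration):

```yaml
scrape_configs:
  - job_name: 'gerrit'
    scheme: https
    # Path served by the metrics-reporter-prometheus plugin (assumed).
    metrics_path: '/plugins/metrics-reporter-prometheus/metrics'
    # Token created in the Gerrit configuration for Prometheus.
    bearer_token: 'REPLACE-WITH-THE-TOKEN-CONFIGURED-IN-GERRIT'
    static_configs:
      - targets: ['gerrit.example.org']
```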
In Grafana, discovering which metrics are of interest can be daunting. For the jgit cache there are only a few metrics we are interested in, and crafting a basic dashboard for them is simple enough. But since we now collect all those metrics, surely we should have dashboards for anything that could be of interest to us.
While browsing the Gerrit upstream repositories, I found an unadvertised one: gerrit/gerrit-monitoring. The project aims at deploying a monitoring stack for Gerrit to Kubernetes, composed of Grafana, Loki, Prometheus and Promtail. While browsing the code, I found out they already had a Grafana template which I could import into our Grafana instance with a few small modifications.
During the Gerrit Virtual Summit I raised it as a potentially interesting project for the whole community, and surely a few days later:
In the end we have a few useful Grafana dashboards, the ones imported from the gerrit-monitoring repo are suffixed with (upstream): https://grafana.wikimedia.org/dashboards/f/5AnaHr2Mk/gerrit
And I crafted one dedicated to jgit cache: https://grafana.wikimedia.org/d/8YPId9hGz/jgit-block-cache
Stuff learned
After a couple of months, there was no good lead. The issue had been around for a while, in a programming language I don't know, with assisting tooling completely alien to me. I even found jcmd to issue commands to the JVM, such as dumping a class histogram, the same view provided by JavaMelody:
$ sudo -u gerrit2 jcmd 2347 GC.class_histogram
 num     #instances         #bytes  class name
----------------------------------------------
   5:      10042773     1205132760  org.eclipse.jetty.server.session.SessionData
   8:      10042773      883764024  org.eclipse.jetty.server.session.Session
  11:      10042773      482053104  org.eclipse.jetty.server.session.Session$SessionInactivityTimer$1
  13:      10042779      321368928  org.eclipse.jetty.util.thread.Locker
  14:      10042773      321368736  org.eclipse.jetty.server.session.Session$SessionInactivityTimer
  17:      10042779      241026696  org.eclipse.jetty.util.thread.Locker$Lock
That is quite handy when already in a terminal; it saves a few clicks to switch to a browser, head to JavaMelody and find the link.
But it is the last week of work of the year.
Christmas is in two days.
Kids are messing up all around the home office since we are under lockdown.
Despair.
Out of rage I just stalled the task, shamelessly hoping for the Java 11 and Gerrit 3.3 upgrades to solve this. Much like we had previously hoped the system would be fixed by upgrading.
Wait..
1 million?
ONE MILLION ??
TEN TO THE POWER OF SIX ???
WHY IS THERE A MILLION HTTP SESSIONS HELD IN GERRIT !!!!!!?11??!!??
10042773 org.eclipse.jetty.server.session.SessionData
There. Right there. It had been there since the start. In plain sight. And surely, 19 hours later, Gerrit had created 500k sessions for 56 MBytes of memory. It is slowly but surely leaking memory.
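A quick back-of-the-envelope check on those figures (using only the numbers quoted above):

```shell
# 500k sessions accumulated over 19 hours, for 56 MBytes of memory.
sessions=500000
hours=19
bytes=$((56 * 1024 * 1024))

echo "$((sessions / hours)) sessions created per hour"   # 26315 sessions created per hour
echo "$((bytes / sessions)) bytes per session"           # 117 bytes per session
```

Roughly 26 thousand immortal sessions per hour: at that rate, even a 32G heap is doomed within weeks.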
Stuff learned
At this point it was just an intuition, albeit a strong one. I don't know much about Java or Gerrit internals, so I went to ask upstream developers for further assistance. But first, I had to reproduce the issue and investigate a bit more, to give as many details as possible when filing a bug report.
I copied over a small heap dump taken just a few minutes after Gerrit got restarted; it had a manageable size, making it easier to investigate. Since I am not that familiar with the Java debugging tools, I went with what I call a clickodrome interface, a UI that lets you interact solely with mouse clicks: https://visualvm.github.io/
Once the heap dump was loaded, I could easily inspect objects. Notably, the org.eclipse.jetty.server.session.Session objects had a property expiry=0, often an indication of no expiry at all. Expired sessions are cleared by Jetty via a HouseKeeper thread which inspects sessions and deletes expired ones. I confirmed it does run, every 600 seconds, but since the sessions are set to never expire, they pile up, leading to the memory leak.
On December 24th, a day before Christmas, I filed a private security issue to upstream (now public): https://bugs.chromium.org/p/gerrit/issues/detail?id=13858
After the Christmas and weekend break, upstream acknowledged the issue and I did more investigating to pinpoint its source. The sessions are created by a SessionHandler, and debug logs show dftMaxIdleSec=-1, i.e. the default maximum idle seconds is set to -1, which means that by default the sessions are created without any expiry. The Jetty debug log then gave a bit more insight:
DEBUG org.eclipse.jetty.server.session : Session xxxx is immortal && no inactivity eviction
It is immortal and is thus never picked up by the session cleaner:
DEBUG org.eclipse.jetty.server.session : org.eclipse.jetty.server.session.SessionHandler ==dftMaxIdleSec=-1 scavenging session ids [] ^^^ --- empty array
Our Gerrit instance has several plugins, and the leak could potentially come from one of them. I then booted a dummy Gerrit on my machine (java -jar gerrit-3.3.war), cloned the built-in All-Projects.git repository repeatedly, and observed objects with VisualVM. Jetty sessions with no expiry were created, which rules out the plugins and points at Gerrit itself. Upstream developer Luca Milanesio pointed out that Gerrit creates a Jetty session which is intended for plugins. I also narrowed down the leak to being triggered only by git operations made over HTTP. Eventually, by commenting out a single line of Gerrit code, I eliminated the memory leak, and upstream pointed at a change released a few versions ago that may have been the cause.
Upstream then went on to reproduce on their side, took some measurements before and after commenting the line out, and confirmed the leak: 750 bytes for each git request made over HTTP. Given the amount of traffic we receive from humans, systems and bots, it is not surprising we ended up hitting the JVM memory limit rather quickly.
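With upstream's figure of roughly 750 bytes per request and our 32G maximum heap, a rough sketch of the ceiling (in reality other allocations fill the heap too, so the limit comes much sooner):

```shell
# ~750 bytes leaked per git-over-HTTP request; 32G maximum heap.
leak_per_request=750
heap_bytes=$((32 * 1024 * 1024 * 1024))

echo "$((heap_bytes / leak_per_request)) requests until the heap is exhausted"
# 45812984 requests until the heap is exhausted
```

A few tens of millions of requests sounds like a lot, until you remember that CI alone hammers the busiest repositories over HTTP all day long.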
Eventually the fix was merged and new Gerrit versions were released. We upgraded to the new release and haven't had to restart Gerrit since. Problem solved!
Stuff learned
Thank you upstream developers Luca Milanesio and David Ostrovsky for fixing the issue!
Thank you @dancy for the added clarifications as well as typos and grammar fixes.
How’d we do in our strive for operational excellence last month? Read on to find out!
2 documented incidents. That's average for this time of year, when we usually had 1-4 incidents.
Learn about recent incidents at Incident status on Wikitech, or Preventive measures in Phabricator.
In March we made significant progress on the outstanding errors of previous months. Several of the 2020 months are finally starting to empty out. But with over 30 new tasks remaining from March itself, we did not break even, and ended up slightly higher than last month. This may be reversing two positive trends, but I hope not.
Firstly, there was a steep increase in the number of new production errors that were not resolved within the same month. This runs counter to the positive trend we started in November. The past four months typically saw 10-20 errors outlive their month of discovery, whereas this past month saw 34 of its 48 new errors remain unresolved.
Secondly, the overall number of unresolved errors increased again. This January began a downward trend for the first time in thirteen months, which continued nicely through February. But this past month we broke even and even pushed upward by one task. I hope this is just a breather and we can continue our way downward.
Month-over-month plots based on spreadsheet data.
Take a look at the workboard and look for tasks that could use your help:
Summary over recent months, per spreadsheet:
Jul 2019 (0 of 18 left) | ✅ Last two tasks resolved! | -2 |
Aug 2019 (1 of 14 left) | ⚠️ Unchanged (over one year old). | |
Oct 2019 (3 of 12 left) | ⬇️ One task resolved. | -1 |
Nov 2019 (0 of 5 left) | ✅ Last task resolved! | -1 |
Dec 2019 (0 of 9 left) | ✅ Last task resolved! | -1 |
Jan 2020 (1 of 7 left) | ⬇️ One task resolved. | -1 |
Feb 2020 (0 of 7 left) | ✅ Last task resolved! | -1 |
Mar 2020 (2 of 2 left) | ⚠️ Unchanged (over one year old). | |
Apr 2020 (5 of 14 left) | ⬇️ Four tasks resolved. | -4 |
May 2020 (5 of 14 left) | ⬇️ One task resolved. | -1 |
Jun 2020 (6 of 14 left) | ⬇️ One task resolved. | -1 |
Jul 2020 (5 of 24 issues) | ⬇️ Four tasks resolved. | -4 |
Aug 2020 (15 of 53 issues) | ⬇️ Five tasks resolved. | -5 |
Sep 2020 (7 of 33 issues) | ⬇️ One task resolved. | -1 |
Oct 2020 (22 of 69 issues) | ⬇️ Four tasks resolved. | -4 |
Nov 2020 (9 of 38 issues) | ⬇️ Two tasks resolved. | -2 |
Dec 2020 (11 of 33 issues) | ⬇️ One task resolved. | -1 |
Jan 2021 (4 of 50 issues) | ⬇️ One task resolved. | -1 |
Feb 2021 (9 of 20 issues) | ⬇️ Two tasks resolved. | -2 |
Mar 2021 (34 of 48 issues) | 34 new tasks survived and remain unresolved. | +48; -14 |
Tally | |
---|---|
138 | issues open, as of Excellence #29 (6 Mar 2021). |
-33 | issues closed, of the previous 138 open issues. |
+34 | new issues that survived March 2021. |
139 | issues open, as of today (2 Apr 2021). |
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
Footnotes:
Incident status, Wikitech.
Wikimedia incident stats by Krinkle, CodePen.
Production Excellence: Month-over-month spreadsheet and plot.
Report charts for Wikimedia-production-error project, Phabricator.
How’d we do in our strive for operational excellence last month? Read on to find out!
3 documented incidents last month, [1] which is average for the time of year. [2]
Learn about these incidents at Incident status on Wikitech, and their Preventive measures in Phabricator.
For those with NDA-restricted access, there may be additional private incident reports 🔒 available.
In February we saw a continuation of the new downward trend that began this January, which came after twelve months of continued rising. Let's make sure this trend sticks with us as we work our way through the debt, whilst also learning to have a healthy week-to-week iteration where we monitor and follow-up on any new developments such that they don't introduce lasting regressions.
The recent tally (issues filed since we started reporting in March 2019) is down to 138 unresolved errors, from 152 last month. The old backlog (pre-2019 issues) also continued its 5-month streak and is down to 148, from 160 last month. If this progress continues we'll soon have fewer "Old" issues than "Recent" issues, and possibly by the start of 2022 we may be able to report and focus only on our rotation through recent issues as hopefully we are then balancing our work such that issues reported this month are addressed mostly in the same month or otherwise later that quarter within 2-3 months. Visually that would manifest as the colored chunks having a short life on the chart with each drawn at a sharp downwards angle – instead of dragged out where it was building up an ever-taller shortcake. I do like cake, but I prefer the kind I can eat. 🍰
Month-over-month plots based on spreadsheet data. [3] [4]
Summary over recent months:
Recent tally | |
---|---|
152 | issues open, as of Excellence #28 (16 Feb 2021). |
-25 | issues closed since, of the previous 152 open issues. |
+11 | new issues that survived Feb 2021. |
138 | issues open, as of today 5 Mar 2021. |
For the on-going month of March 2021, we've got 12 new issues so far.
Take a look at the workboard and look for tasks that could use your help!
Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incident status Wikitech.
[2] Wikimedia incident stats by Krinkle, CodePen.
[3] Month-over-month, Production Excellence spreadsheet.
[4] Open tasks, Wikimedia-prod-error, Phabricator.
How’d we do in our strive for operational excellence last month? Read on to find out!
1 documented incident last month. That's the third month in a row that we are at or near zero major incidents – not bad! [1] [2]
Learn about recent incidents at Incident status on Wikitech, or Preventive measures in Phabricator.
This January saw a small recovery in our otherwise negative, upward trend. For the first time in twelve months, more reports were closed than new reports outlived their month of discovery without resolution. What happened twelve months ago? In January 2020, we also saw a small recovery during the otherwise upward trend before and after it.
Perhaps it's something about the post-December holidays that temporarily improves the quality and/or reduces the quantity — of code changes. Only time will tell if this is the start of a new positive trend, or merely a post-holiday break. [3]
While our month-to-month trend might not (yet) be improving, we do see persistent improvements in our overall backlog of pre-2019 reports. This is in part because we generally don't file new reports there, so it makes sense that it doesn't go back up, but it's still good to see downward progress every month, unlike with reports from more recent months which often see no change month-to-month (see "Outstanding errors" below, for example).
This positive trend on our "Old" backlog started in October 2020 and has consistently progressed every month since then (refer to the "Old" numbers in red on the below chart, or the same column in the spreadsheet). [3][4]
Summary over recent months:
Recent tally | |
---|---|
160 | issues open, as of Excellence #27 (4 Feb 2021). |
-15 | issues closed since, of the previous 160 open issues. |
+7 | new issues that survived January 2021. |
152 | issues open, as of today (16 Feb 2021). |
January saw +50 new production errors reported in a single month, which is an unfortunate all-time high. However, we've also done remarkably well on addressing 43 of them within a month, when the potential root cause and diagnostics data were still fresh in our minds. Well done!
For the on-going month of February, there have been 16 new issues reported so far.
Take a look at the workboard and look for tasks that could use your help!
Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incident status Wikitech.
[2] Wikimedia incident stats by Krinkle, CodePen.
[3] Month-over-month, Production Excellence spreadsheet.
[4] Open tasks, Wikimedia-prod-error, Phabricator.
How’d we do in our strive for operational excellence last month? Read on to find out!
1 documented incident in December. [1] In previous years, December typically had 4 or fewer documented incidents. [3]
Learn about recent incidents at Incident documentation on Wikitech, or Preventive measures in Phabricator.
Month-over-month plots based on spreadsheet data. [4] [2]
Take a look at the workboard and look for tasks that could use your help.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Summary over recent months:
Recent tally | |
---|---|
149 | as of Excellence #26 (15 Dec 2020). |
-11 | closed of the 149 recent issues. |
+22 | new issues survived December 2020. |
160 | as of 27 Jan 2021. |
Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incident documentation 2020, Wikitech.
[2] Open tasks, Wikimedia-prod-error, Phabricator.
[3] Wikimedia incident stats by Krinkle, CodePen.
[4] Month-over-month, Production Excellence spreadsheet.
How’d we do in our strive for operational excellence last month? Read on to find out!
Zero documented incidents in November. [1] That's the only month this year without any (publicly documented) incidents. In 2019, November was also the only such month. [3]
Learn about recent incidents at Incident documentation on Wikitech, or Preventive measures in Phabricator.
The overall increase in errors was relatively low this past month, similar to the November-December period last year.
What's new is that we can start to see a positive trend emerging in the backlogs, where we've shrunk the issue count three months in a row, from the high of 233 in October down to the 181 in the ol' backlog today.
Month-over-month plots based on spreadsheet data. [4]
Take a look at the workboard and look for tasks that could use your help.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Summary over recent months:
Recent tally | |
---|---|
142 | as of Excellence #25 (23 Oct 2020). |
-12 | closed of the 142 recent tasks. |
+19 | survived November 2020. |
149 | as of today, 15 Dec 2020. |
The on-going month of December has 19 unresolved tasks so far.
Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incident documentation 2020, Wikitech.
[2] Open tasks, Wikimedia-prod-error, Phabricator.
[3] Wikimedia incident stats, Krinkle, CodePen.
[4] Month-over-month, Production Excellence (spreadsheet).
Recently there has been a small effort on the Release-Engineering-Team to encode some of our institutional knowledge as runbooks linked from a page in the team's wiki space.
What are runbooks, you might ask? This is how they are described on the aforementioned wiki page:
This is a list of runbooks for the Wikimedia Release Engineering Team, covering step-by-step lists of what to do when things need doing, especially when things go wrong.
So runbooks are each essentially a sequence of commands intended to be pasted into a shell by a human: step-by-step instructions that help the reader accomplish an anticipated task or resolve a previously-encountered issue.
Presumably runbooks are created when someone encounters an issue, and, recognizing that it might happen again, helpfully documents the steps that were used to resolve said issue.
This all seems pretty sensible at first glance. This type of documentation can be really valuable when you're in an unexpected situation or trying to accomplish a task that you've never attempted before and just about anyone reading this probably has some experience running shell commands pasted from some online tutorials, setup instructions for a program, etc.
Despite the obvious value runbooks can provide, I've come to harbor a fairly strong aversion to the idea of encoding what are essentially shell scripts as individual commands on a wiki page. As someone whose job involves a lot of automation, I would usually much prefer a shell script, a python program, or even a "maintenance script" over a runbook.
After a lot of contemplation, I've identified a few reasons that I don't like runbooks on wiki pages:
I do realize that MediaWiki does version control. I also realize that sometimes you just can't be bothered to write and debug a robust shell script to address some rare circumstance. The cost is high and it's uncertain whether the script will be worth such an effort. In those situations a runbook might be the perfect way to contribute to collective knowledge without investing a lot of time into perfecting a script.
My favorite web comic, xkcd, has a few things to say about this subject:
"The General Problem" xkcd #974. "Automation" xkcd #1319. "Is It Worth the Time?" xkcd #1205.I've been pondering a solution to these issues for a long time. Mostly motivated by the pain I have experienced (and the mistakes I've made) while executing the biggest runbook of all on a regular basis.
Over the past couple of years I've come across some promising ideas which I think can help with the problems I've identified with runbooks. One of the most interesting is Do-nothing scripting. Dan Slimmon identifies some of the same problems that I've detailed here. He uses the term *slog* to refer to long and tedious procedures like the Wikimedia Train Deploys. The proposed solution comes in the form of a do-nothing script. You should go read that article; it's not very long. Here are a few relevant quotes:
Almost any slog can be turned into a do-nothing script. A do-nothing script is a script that encodes the instructions of a slog, encapsulating each step in a function.
...
At first glance, it might not be obvious that this script provides value. Maybe it looks like all we’ve done is make the instructions harder to read. But the value of a do-nothing script is immense:
- It’s now much less likely that you’ll lose your place and skip a step. This makes it easier to maintain focus and power through the slog.
- Each step of the procedure is now encapsulated in a function, which makes it possible to replace the text in any given step with code that performs the action automatically.
- Over time, you’ll develop a library of useful steps, which will make future automation tasks more efficient.
A do-nothing script doesn’t save your team any manual effort. It lowers the activation energy for automating tasks, which allows the team to eliminate toil over time.
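To make the pattern concrete, here is a minimal do-nothing script in the style Slimmon describes: every manual step of the slog becomes a function that merely prints its instructions, and the script walks through them one at a time. The step names and instructions below are invented for illustration; they are not from a real RelEng runbook.

```python
import sys

def check_jobs_step():
    print("1. Open the Jenkins view and confirm the update jobs are green.")

def announce_step():
    print("2. Announce the deployment window on IRC.")

def wait_for_operator():
    input("   ...press Enter once this step is done.")

STEPS = [check_jobs_step, announce_step]

def run(steps):
    # Walk the operator through each step, pausing between them so it is
    # impossible to lose your place or skip a step.
    for step in steps:
        step()
        wait_for_operator()

if __name__ == "__main__" and sys.stdin.isatty():
    run(STEPS)
```

The payoff is that any single step can later be replaced by code that performs the action automatically, without changing the overall procedure.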
I was inspired by this and I think it's a fairly clever solution to the problems identified. What if we combined the best aspects of gradual automation with the best aspects of a wiki-based runbook? Others were inspired by this as well, resulting in tools like braintree/runbook, codedown and the one I'm most interested in, rundoc.
My ideal tool would combine code and instructions in a free-form "literate programming" style. By following some simple conventions in our runbooks we can use a tool to parse and execute the embedded code blocks in a controlled manner. With a little bit of tooling we can gain many benefits:
I've found a few projects that already implement many of these ideas. Here are a few of the most relevant:
The one I'm most interested in is Rundoc. It's almost exactly the tool that I would have created. In fact, I started writing code before discovering rundoc but once I realized how closely this matched my ideal solution, I decided to abandon my effort. Instead I will add a couple of missing features to Rundoc in order to get everything that I want and hopefully I can contribute my enhancements back upstream for the benefit of others.
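The core mechanic these tools share, treating fenced, language-tagged code blocks in a Markdown runbook as executable steps, can be sketched in a few lines. This is a toy illustration only (the `FENCE` regex and `extract_blocks` helper are names I made up); a real tool like rundoc also handles tags, prompts, and output capture.

```python
import re

# Build a literal triple-backtick indirectly, so this example is itself
# easy to embed in a Markdown document without closing its own fence.
F = "`" * 3

FENCE = re.compile(rf"{F}(\w+)\n(.*?){F}", re.DOTALL)

def extract_blocks(markdown, lang="shell"):
    """Return the bodies of all fenced blocks tagged `lang`, in order."""
    return [body for tag, body in FENCE.findall(markdown) if tag == lang]

runbook = f"""# Restart the frontend
First check the current state:
{F}shell
systemctl status frontend
{F}
Then restart it:
{F}shell
systemctl restart frontend
{F}
"""
```

Given that document, `extract_blocks(runbook)` yields the two shell snippets in order, which a runner could then execute step by step, pausing in between, just like a do-nothing script.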
Demo: https://asciinema.org/a/MKyiFbsGzzizqsGgpI4Jkvxmx
Source: https://github.com/20after4/rundoc
[1]: https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Runbooks "runbooks"
[2]: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys "Train deploys"
[3]: https://blog.danslimmon.com/2019/07/15/do-nothing-scripting-the-key-to-gradual-automation/ "Do-nothing scripting: the key to gradual automation by Dan Slimmon"
[4]: https://github.com/braintree/runbook "runbook by braintree"
[5]: https://github.com/earldouglas/codedown "codedown by earldouglas"
[6]: https://github.com/eclecticiq/rundoc "rundoc by eclecticiq"
[7]: https://rich.readthedocs.io/en/latest/ "Rich python library"
How’d we do in our strive for operational excellence last month? Read on to find out!
2 documented incidents in October. [1] Historically, that's just below the median of 3 for this time of year. [3]
Learn about recent incidents at Incident documentation on Wikitech, or Preventive measures in Phabricator.
Month-over-month plots based on spreadsheet data. [5]
Take a look at the workboard and look for tasks that could use your help.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Summary over recent months:
Recent tally | |
---|---|
110 | as of Excellence #24 (23rd Oct). |
-13 | closed of the 110 recent tasks. |
+45 | survived October 2020. |
142 | as of today, 23rd Nov. |
For the on-going month of November, there are 25 new tasks so far.
Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incident documentation 2020, Wikitech
[2] Open tasks in Wikimedia-prod-error, Phabricator
[3] Wikimedia incident stats by Krinkle, CodePen
[4] Month-over-month, Production Excellence (spreadsheet)
If you're making changes to a service that is deployed to Kubernetes, it sure is annoying to have to update the helm deployment-chart values with the newest image version before you deploy. At least, that's how I felt when developing on our dockerfile-generating service, blubber.
Over the last two months we've added
And I'm excited to say that CI can now handle updating image versions for you (after your change has merged), in the form of a change to deployment-charts that you'll need to +2 in Gerrit. Here's what you need to do to get this working in your repo:
Add the following to your .pipeline/config.yaml file's publish stage:
promote: true
The above assumes the defaults, which are the same as if you had added:
promote:
  - chart: "${setup.projectShortName}"  # The project name
    environments: []                    # All environments
    version: '${.imageTag}'             # The image published in this stage
You can specify any of these values, and you can promote to multiple charts, for example:
promote: - chart: "echostore" environments: ["staging", "codfw"] - chart: "sessionstore"
The above values would promote the production image published after merging to all environments for the sessionstore service, and only the staging and codfw environments for the echostore service. You can see more examples at https://wikitech.wikimedia.org/wiki/PipelineLib/Reference#Promote
If your containerized service doesn't yet have a .pipeline/config.yaml, now is a great time to migrate it! This tutorial can help you with the basics: https://wikitech.wikimedia.org/wiki/Deployment_pipeline/Migration/Tutorial#Publishing_Docker_Images
This is just one step closer to achieving continuous delivery of our containerized services! I'm looking forward to continuing to make improvements in that area.
How’d we do in our strive for operational excellence last month? Read on to find out!
5 documented incidents. [1] Historically, that's right on average for the time of year. [3]
For more about recent incidents see Incident documentation on Wikitech, or Preventive measures in Phabricator.
Month-over-month plots based on spreadsheet data. [5]
Take a look at the workboard and look for tasks that could use your help.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Summary over recent months:
Recent tally | |
---|---|
106 | as of Excellence #23 (Sep 23rd). |
-13 | closed of the 106 recent tasks. |
+17 | survived September 2020. |
110 | as of today, Oct 23rd. |
Previously, we had 106 unresolved production errors from the recent months up to August. Since then, 13 of those were closed. But, the 17 errors surviving September raise our recent tally to 110.
The workboard overall (including errors from 2019 and earlier) holds 343 open tasks in total, an increase of +47 compared to the 296 total on Sept 23rd.
Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. – wikitech.wikimedia.org/wiki/Incident_documentation
[2] Open tasks. – phabricator.wikimedia.org/maniphest/query…
[3] Wikimedia incident stats. – codepen.io/Krinkle/full/wbYMZK
[4] Month-over-month plots. – docs.google.com/spreadsheets/d/1tRC…
How’d we do in our strive for operational excellence last month? Read on to find out!
4 documented incidents in July, and 2 documented incidents in August. [1] Historically, that's about average for this time of year. [5]
For more about recent incidents see Incident documentation on Wikitech, or Preventive measures in Phabricator.
Take a look at the workboard and look for tasks that could use your help.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Summary over recent months:
Recent tally | |
---|---|
72 | open, as of Excellence #22 (Jul 23rd). |
-16 | closed, of the previous 72 recent tasks. |
+13 | opened and survived July 2020. |
+37 | opened and survived August 2020. |
106 | open, as of today (Sep 23rd). |
Previously, we had 72 open production errors over the recent months up to June. Since then, 16 of those were closed. But, the 13 and 37 errors surviving July and August raise our recent tally to 106.
The workboard overall (including tasks from 2019 and earlier) held 192 open production errors on July 23rd. As of writing, the workboard holds 296 open tasks in total. [4] This +104 increase is largely due to the merged backlog of JavaScript client errors, which were previously untracked. Note that we backdated the majority of these JS errors under “Old”, and thus are not amongst the elevated numbers of July and August.
Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. – https://wikitech.wikimedia.org/wiki/Incident_documentation
[2] Tasks created. – https://phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – https://phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – https://phabricator.wikimedia.org/maniphest/query…
[5] Wikimedia incident stats. – https://codepen.io/Krinkle/full/wbYMZK
How’d we do in our strive for operational excellence last month? Read on to find out!
For more about recent incidents see Incident documentation, on Wikitech or Preventive measures in Phabricator.
Breakdown of new errors reported in June that are still open today:
Summary over recent months:
At the end of May the number of open production errors over recent months was 68. Of those, 10 got closed, but with 14 new tasks from June still open, the total has grown further to 72.
The workboard had 192 open tasks last month, which saw another increase, to now 203 open tasks (this includes tasks from 2019 and earlier).
Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
⛰ ATC: “Do you want to report a UFO?” Pilot: “Negative. We don't want to report.”
ATC: “Do you wish to file a report of any kind to us?” Pilot: “I wouldn't know what kind of report to file.”
ATC: “Me neither…”
Footnotes:
[1] Incidents. – https://wikitech.wikimedia.org/wiki/Incident_documentation#2020
[2] Tasks created. – https://phabricator.wikimedia.org/maniphest/query/VTpmvaJLYVL1/#R
[3] Tasks closed. – https://phabricator.wikimedia.org/maniphest/query/qn5yeURqyl3D/#R
[4] Open tasks. – https://phabricator.wikimedia.org/maniphest/query/Fw3RdXt1Sdxp/#R
In 2015 I noticed that git fetches from our most active repositories were unreasonably slow, sometimes taking up to a minute, which hindered fast development and collaboration. You can read some of the debugging details I conducted at the time in T103990. Gerrit upstream was aware of the issue and a workaround was presented, though we never implemented it.
When fetching source code from a git repository, the client and server conduct a negotiation to discover which objects have to be sent. The server sends an advertisement that lists every single reference it knows about. For a very active repository in Gerrit it means sending references for each patchset and each change ever made to the repository, or almost 200,000 references for mediawiki/core. That is a noticeable amount of data resulting in a slow fetch, especially on a slow internet connection.
Gerrit originated at Google and has full-time maintainers. In 2017 a team at Google set out to tackle the problem and proposed a new protocol to address the issue, working closely with the git maintainers while doing so. The new protocol makes git smarter during the advertisement phase, notably by filtering out references the client is not interested in. You can read Google's introduction post at https://opensource.googleblog.com/2018/05/introducing-git-protocol-version-2.html
Since June 28th 2020, our Gerrit has been upgraded and now supports git protocol version 2. But to benefit from faster fetches, your client also needs to know about the newer protocol and have it explicitly enabled. For git, you will want version 2.18 or later. Enable the new protocol by setting git configuration protocol.version to 2.
It can be done either on an on-demand basis:
git -c protocol.version=2 fetch
Or enabled in your user configuration file:
[protocol]
    version = 2
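Equivalently, rather than editing the configuration file by hand, you can set it from the command line:

```shell
# Write protocol.version = 2 to your user configuration (~/.gitconfig):
git config --global protocol.version 2

# Read it back to confirm:
git config --global protocol.version
```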
On my internet connection, fetching for mediawiki/core.git went from ~15 seconds to just 3 seconds. A noticeable difference in my day to day activity.
If you encounter any issue with the new protocol, you can file a task in our Phabricator and tag it with git-protocol-v2.
How’d we do in our strive for operational excellence last month? Read on to find out!
For more about recent incidents see Incident documentation on Wikitech, or Preventive measures in Phabricator.
Take a look at the workboard and look for tasks that could use your help.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Breakdown of recent months:
At the end of April the total of open production errors over recent months was 61. Of those, 7 got closed, but with 14 new tasks from May still open, the total has grown to 68.
The workboard had 178 open tasks in April, which saw a steep increase to now 192 open tasks (this includes June 2020 so far, and pre-2019 tasks).
Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. – https://wikitech.wikimedia.org/wiki/Incident_documentation#2020
[2] Tasks created. – https://phabricator.wikimedia.org/maniphest/query/7Z4Us2BS02Uo/#R
[3] Tasks closed. – https://phabricator.wikimedia.org/maniphest/query/FoIFMu5UO8pw/#R
[4] Open tasks. – https://phabricator.wikimedia.org/maniphest/query/Fw3RdXt1Sdxp/#R
Earlier today, the 600,000th commit was pushed to Wikimedia's Gerrit server. We thought we'd take this moment to reflect on the developer services we offer and our community of developers, be they Wikimedia staff, third party workers, or volunteers.
At Wikimedia, we currently use a self-hosted installation of Gerrit to provide code review workflow management, and code hosting and browsing. We adopted this in 2011–12, replacing Apache Subversion.
Within Gerrit, we host several thousand repositories of code (2,441 as of today). This includes MediaWiki itself, plus all the many hundreds of extensions and skins people have created for use with MediaWiki. Approximately 90% of the MediaWiki extensions we host are not used by Wikimedia, only by third parties. We also host key Wikimedia server configuration repositories like puppet or site config, build artefacts like vetted docker images for production services or local .deb build repos for software we use like etherpad-lite, ancillary software like our special database exporting orchestration tool for dumps.wikimedia.org, and dozens of other uses.
Gerrit is not just (or even primarily) a code hosting service, but a code review workflow tool. Per the Wikimedia code review policy, all MediaWiki code heading to production should go through separate development and code review for security, performance, quality, and community reasons. Reviewers are required to use their "good judgement and careful action", which is a heavy burden, because "[m]erging a change to the MediaWiki core or an extension deployed by Wikimedia is a big deal". Gerrit helps them do this, providing clear views of what is changing, supporting itemised, character-level, file-level, or commit-level feedback and revision, and allowing series of complex changes to be chained together across multiple repositories, and ensuring that forthcoming and merged changes are visible to product owners, development teams, and other interested parties.
Across all of our repositories, we average over 200 human commits a day, though activity levels vary widely. Some repositories have dozens of patches a week (MediaWiki itself gets almost 20 patches a day; puppet gets nearly 30), whereas others get a patch every few years. There are over 8,000 accounts registered with Gerrit, although activity is not distributed uniformly throughout that cohort.
To focus engineer time where it's needed, a fair amount of low-risk development work is automated. This happens in both creating patches and also, in some cases, merging them.
For example, for many years we have partnered with TranslateWiki.net's volunteer community to translate and maintain MediaWiki interfaces in hundreds of languages. Exports of translators' updates are pushed and merged automatically by one of the TWN team each day, helping our users keep a fresh, usable system whatever their preferred language.
Another key area is LibraryUpgrader, a custom tool to automatically upgrade the libraries we use for continuous integration across hundreds of repositories, allowing us to make improvements and increase standards without a single central breaking change. Indeed, the 600,000th commit was one of these automatic commits, upgrading the version of the mediawiki-codesniffer tool in the GroupsSidebar extension to the latest version, ensuring it is written following the latest Wikimedia coding conventions for PHP.
Right now, we're working on upgrading our installation of Gerrit, moving from our old version based on the 2.x branch through 2.16 to 3.1, which will mean a new user interface and other user-facing changes, as well as improvements behind the scenes. More on those changes will be coming in later posts.
Header image: A vehicle used to transport miners to and from the mine face by 'undergrounddarkride', used under CC-BY-2.0.
How are we doing on that strive for operational excellence during these unprecedented times?
For more about recent incidents and pending actionables see Wikitech and Phabricator.
Take a look at the workboard and look for tasks that could use your help.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Breakdown of recent months:
At the end of February the total of open reports over recent months was 58. Of those, 12 got closed, but with 15 new reports from March/April still open, the total is now up at 61 open reports.
The workboard overall (which includes pre-2019 tasks) has 178 tasks open. This is actually down for the first time since October: December stood at 196, January at 198, February at 199, and now April at 178. This was largely due to the Release Engineering and Core Platform teams closing out forgotten reports that have since been resolved or otherwise obsoleted.
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. – https://wikitech.wikimedia.org/wiki/Incident_documentation
[2] Tasks created. – https://phabricator.wikimedia.org/maniphest/query/HjopcKClxTfw/#R
[3] Tasks closed. – https://phabricator.wikimedia.org/maniphest/query/ts62HKYPBxod/#R
[4] Open tasks. – https://phabricator.wikimedia.org/maniphest/query/Fw3RdXt1Sdxp/#R
How’d we do in our strive for operational excellence last month? Read on to find out!
Compared to the median of 4–5 documented incidents per month over the last three years, this past month saw a fairly large number of them.
To read more about these incidents and pending actionables, check Incident documentation § 2020, or Explore Wikimedia incident stats (interactive).
Our error monitor (Logstash) received numerous reports about an “Undefined offset” error from the OATHAuth extension. This extension powers the Two-factor auth (2FA) login interface on Wikipedia.
@ItSpiderman and @Reedy investigated the problem. The error message:
PHP Notice: Undefined offset: 8 at /srv/mediawiki/extensions/OATHAuth/src/Key/TOTPKey.php:188
This error means that the code was accessing item number 8 from a list (an array), but the item does not exist. Normally, when a “2FA scratch token” is used, we remove it from a list, and save the remaining list for next time.
The code used the count() function to compute the length of the list, and used a for-loop to iterate through the list. When the code found the user’s token, it used the unset( $list[$num] ) operation to remove token $num from the list, and then saved $list for next time.
The problem with removing a list item in this way is that it leaves a “gap”. Imagine a list with 4 items, like [ 1: …, 2: …, 3: … , 4: … ]. If we unset item 2, then the remaining list will be [ 1: …, 3: …, 4: … ]. The next time we check this list, the length of the list is now 3 (so far so good!), but the for-loop will access the items as 1-2-3. The code would not know that 3 comes after 1, causing an error because item 2 does not exist. And, the code would not even look at item 4!
When a user used their first ever scratch token, everything worked fine. But from their second token onwards, the tokens could be rejected as “wrong” because the code was not able to find them.
To avoid this bug, we changed the code to use array_splice( $list, $num, 1 ) instead of unset( $list[$num] ). The important thing about array_splice is that it renumbers the items in the list, leaving no gaps.
– T244308 / https://gerrit.wikimedia.org/r/570253
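The difference between the two operations can be sketched with plain PHP arrays (the token values here are placeholders, not real scratch tokens):

```php
// unset() removes the value but keeps the remaining keys as-is,
// leaving a "gap" in the numbering.
$tokens = [ 'tok-a', 'tok-b', 'tok-c', 'tok-d' ];
unset( $tokens[1] );
// count() is now 3, but the keys are 0, 2, 3: a for-loop from 0 to 2
// hits the missing key 1 and never reaches key 3.

// array_splice() removes the value *and* renumbers the keys.
$tokens2 = [ 'tok-a', 'tok-b', 'tok-c', 'tok-d' ];
array_splice( $tokens2, 1, 1 );
// count() is 3 and the keys are 0, 1, 2: no gaps.
```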
Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Breakdown of recent months:
Last month’s total over recent months was 57 open reports. Of those, 6 got closed, but with 7 new reports from February still open, the total is now up at 58 open reports.
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.
Together, we’re getting there!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. – wikitech.wikimedia.org/wiki/Incident_documentation#2020
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…
How’d we do in our strive for operational excellence last month? Read on to find out!
To read more about these incidents and pending actionables, check Incident documentation § 2020, or Explore Wikimedia incident stats (interactive).
During the upgrade from HHVM to PHP 7.2, Wikimedia encountered several Zend engine bugs that could corrupt a PHP program at run-time. (Some of these bugs are still being worked on.) One of the bugs we fixed last month was particularly mysterious. Investigation led by @hashar and @tstarling.
MediaWiki would create an array in PHP and add a key-value pair to it. We could iterate this array, and see that our key was there. Moments later, if we tried to retrieve the key from that same array, sometimes the key would no longer exist!
After many ad-hoc debug logs, core dumps, and GDB sessions, the problem was tracked down to the string interning system of Zend PHP. String interning is a memory reduction technique. It means we only store one copy of a character sequence in RAM, even if many parts of the code use the same character sequence. For example, the words “user” and “edit” are frequently used in the MediaWiki codebase. One of those sequences is the empty string (“”), which is also used a lot in our code. This is the string we found disappearing most often from our PHP arrays. This bug affected several components, including Wikibase, the wikimedia/rdbms library, and ResourceLoader.
Tim used a hardware watchpoint in GDB, and traced the root cause to the Memcached client for PHP. The php-memcached client would “free” a string directly from the internal memory manager after doing some work. It did this even for “interned” strings that other parts of the program may still be depending on.
@jijiki and @Joe backported the upstream fix to our php-memcached package and deployed it to production. Thanks! — T232613
Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Breakdown of recent months (past two weeks not included):
There are a total of 57 reports filed in recent months that remain open. This is down from 62 last month.
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. – wikitech.wikimedia.org/wiki/Incident_documentation#2019
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…
How’d we do in our strive for operational excellence in November and December? Read on to find out!
November had zero reported incidents. Prior to this, the last month with no documented incidents was December 2017. To read about past incidents and unresolved actionables, check Incident documentation § 2019.
Explore Wikimedia incident graphs (interactive)
@dcausse investigated a flood of exceptions from SpecialSearch, which reported “Cannot consume query at offset 0 (need to go to 7296)”. This exception served as a safeguard in the parser for search queries. The code path was not meant to be reached. The root cause was narrowed down to the following regex:
/\G(?<negated>[-!](?=[\w]))?(?<word>(?:\\\\.|[!-](?!")|[^"!\pZ\pC-])+)/u
This regex looks complex, but it can actually be simplified to:
/(?:ab|c)+/
This regex still triggers the problematic behavior in PHP. It fails with a PREG_JIT_STACKLIMIT_ERROR when given a long string. Below is a reduced test case:
$ret = preg_match( '/(?:ab|c)+/', str_repeat( 'c', 8192 ) );
if ( $ret === false ) {
    print( "failed with: " . preg_last_error() );
}
In the end, the fix we applied was to split the regex into two separate ones, remove the non-capturing group with a quantifier, and loop at the PHP level instead (Gerrit change 546209).
The lesson learned here is that the code did not properly check the return value of preg_match. This is all the more important because the size allowed for the JIT stack changes between PHP versions.
For future reference, @dcausse concluded: The regex could be optimized to support more chars (~3 times more) by using atomic groups, like so /(?>ab|c)+/. — T236419
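The broader lesson about checking preg_match's return value can be sketched as follows. The wrapper function here is hypothetical, not part of MediaWiki; it only illustrates treating false as an error rather than as "no match":

```php
// Hypothetical helper: preg_match() returns 1 (match), 0 (no match),
// or false (error, e.g. PREG_JIT_STACKLIMIT_ERROR on long inputs).
// Treating false the same as "no match" silently hides such failures.
function matchOrFail( string $pattern, string $subject ): bool {
    $ret = preg_match( $pattern, $subject );
    if ( $ret === false ) {
        throw new RuntimeException(
            'preg_match failed with error code ' . preg_last_error()
        );
    }
    return $ret === 1;
}
```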
Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Or help someone that’s already started with their patch:
→ Open prod-error tasks with a Patch-For-Review
Breakdown of recent months (past two weeks not included):
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. – wikitech.wikimedia.org/wiki/Incident_documentation#2019
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…
How’d we do in our strive for operational excellence last month? Read on to find out!
There were three recorded incidents last month, which is slightly below our median of the past two years (Explore this data). To read more about these incidents, their investigations, and pending actionables, check Incident documentation § 2019.
MediaWiki uses the PSR-3 compliant Monolog library to send messages to Logstash (via rsyslog and Kafka). These messages are used to automatically detect (by quantity) when the production cluster is in an unstable state. For example, due to an increase in application errors when deploying code, or if a backend system is failing. Two distinct issues hampered the storing of these messages this month, and both affected us simultaneously.
Elasticsearch mapping limit
The Elasticsearch storage behind Logstash optimises responses to Logstash queries with an index. This index has an upper limit on how many distinct fields (or columns) it can have. When that limit is reached, messages with fields not yet in the index are discarded. Our Logstash indexes are sharded by date and source (one for “mediawiki”, one for “syslog”, and one for everything else).
This meant that an error message was only stored if it contained only fields already used by other errors stored that day, which in turn would only succeed if that day’s columns weren’t already fully taken. A seemingly random subset of error messages was thus rejected for a full day. Each day, a given kind of error got a new chance at reserving its columns, so long as it was triggered early enough.
To unblock deployment automation and monitoring of MediaWiki, an interim solution was devised. The subset of messages from “mediawiki” that deal with application errors now have their own index shard. These error reports follow a consistent structure, and contain no free-form context fields. As such, this index (hopefully) can’t reach its mapping limit or suffer message loss.
The general index mapping limit was also raised from 1000 to 2000. For now that means we’re not dropping any non-critical/debug messages. More information about the incident at T234564. The general issue with accommodating debug messages in Logstash long-term, is tracked at T180051. Thanks @matmarex, @hashar, and @herron.
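For reference, the relevant Elasticsearch setting is index.mapping.total_fields.limit. The index name below is illustrative, not our exact production configuration:

```
PUT /logstash-mediawiki-2019.10.01/_settings
{
  "index.mapping.total_fields.limit": 2000
}
```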
Crash handling
Wikimedia’s PHP configuration has a “crash handler” that kicks in if everything else fails. For example, when the memory limit or execution timeout is reached, or if some crucial part of MediaWiki fails very early on. In that case our crash handler renders a Wikimedia-branded system error page (separate from MediaWiki and its skins). It also increments a counter metric for monitoring purposes, and sends a detailed report to Logstash. In migrating the crash handler from HHVM to PHP7, one part of the puzzle was forgotten: the Logstash configuration that forwards these reports from php-fpm’s syslog channel to the one for mediawiki.
As such, our deployment automation and several Logstash dashboards were blind to a subset of potential fatal errors for a few days. Regressions during that week were instead found by manually digging through the raw feed of the php-fpm channel. As a temporary measure, Scap was updated to consider php-fpm’s channel as well in its automation that decides whether a deployment is “green”.
We’ve created new Logstash configurations that forward PHP7 crashes in a similar way as we did for HHVM in the past. Any bookmarked MW dashboards or queries you have for Logstash now provide a complete picture once again. Thanks @jijiki and @colewhite! – T234283
Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Or help someone that’s already started with their patch:
→ Open prod-error tasks with a Patch-For-Review
Breakdown of recent months (past two weeks not included):
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. – wikitech.wikimedia.org/wiki/Special:PrefixIndex?prefix=Incident…
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…
This past week marks the release of a little tool that I've been working on for a while. In fact, it's something I've wanted to build for more than a year. But before I tell you about the solution, I need to describe the problem that I set out to solve.
Production errors are tracked with the tag Wikimedia-production-error. As a member of the Release-Engineering-Team, I've spent a significant amount of time copying details from Kibana log entries and pasting into the Production Error Report form here in Phabricator. There are several of us who do this on a regular basis, including most of my team and several others as well. I don't know precisely how much time is spent on error reporting but at least a handful of people are going through this process several times each week.
This is what led to the idea for Phatality: I recognized immediately that if I could streamline the process and save even a few seconds each time, the aggregate time savings could really add up quickly.
So after considering a few ways in which the process could be automated or otherwise streamlined, I finally focused on what seemed like the most practical: build a Kibana plugin that will format the log details and send them over to Phabricator, eliminating the tedious series of copy/paste operations.
Phatality has a couple of other tricks up its sleeve, but the essence of it is just that: capture all of the pertinent details from a single log message in Kibana and send them to Phabricator all at once with the click of a button.
Clicking the [Submit] button, as seen in the above screenshot, will take you to the Phabricator Production Error form with all of the details pre-filled and ready to submit:
Now that Phatality is deployed to production and a few of us have had a chance to use it to submit error reports, I can say it was definitely a worthwhile effort. The Kibana plugin wasn't terribly difficult to write, and thanks to @fgiunchedi's help, the deployment went fairly smoothly. Phatality streamlines the reporting process, saving several clicks each time and ensuring accuracy in the details that get sent to Phabricator. In a future version of the tool I plan to add more features, such as duplicate detection to help avoid duplicate submissions.
If you use Wikimedia's Kibana to report errors in Phabricator then I encourage you to look for the Phatality tab in the log details section and save some clicks!
What other repetitive tasks are ripe for automation? I'd love to hear suggestions and ideas in the comments.
In Changes and improvements to PHPUnit testing in MediaWiki, I wrote about efforts to help speed up PHPUnit code coverage generation for local development.[0] While this improves code coverage generation time for local development, it could be better.
As the Manual:PHP unit testing/Code coverage page advises, adjusting the whitelist in the PHPUnit XML configuration can speed things up dramatically. The problem is, adjusting that file is a manual process and a little cumbersome, so I usually didn't do it. And then because code coverage generation reports were slow locally[1], I ended up not running them while working on a patch. True, you will get feedback on code coverage metrics from CI, but it would be nicer if you could quickly get this information in your local environment first.
This was the motivation to add a Composer script in MediaWiki core that will help you adjust the PHPUnit coverage whitelist quickly while you're working on a patch for an extension or skin.
You can run it with composer phpunit:coverage-edit -- extensions/$EXT_NAME, e.g. composer phpunit:coverage-edit -- extensions/GrowthExperiments.
The ComposerPhpunitXmlCoverageEdit.php script copies the phpunit.xml.dist file to phpunit.xml (not version controlled), and modifies the whitelist to add directories for that extension/skin. vendor/bin/phpunit then reads phpunit.xml instead of the phpunit.xml.dist file. Tip: Make sure "Edit configurations" in your IDE (PhpStorm in my case) is using vendor/bin/phpunit and phpunit.xml, not phpunit.xml.dist, when executing the tests.
When you want to reset your configuration, you can rm phpunit.xml and vendor/bin/phpunit will read from phpunit.xml.dist again.
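As an illustration, after running the script for GrowthExperiments, the whitelist section of the generated phpunit.xml might look roughly like this (the directory names are illustrative; the exact paths depend on the extension's layout):

```xml
<filter>
    <whitelist addUncoveredFilesFromWhitelist="true">
        <directory suffix=".php">extensions/GrowthExperiments/includes</directory>
        <directory suffix=".php">extensions/GrowthExperiments/maintenance</directory>
    </whitelist>
</filter>
```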
Further improvements to the script could include:
Thanks to @Mainframe98 and @Krinkle for review of the patch and to @AnneT for reviewing this post. Happy hacking!
[0] [[ https://gerrit.wikimedia.org/r/c/mediawiki/core/+/520459 | One patch changed <whitelist addUncoveredFilesFromWhitelist="true"> to false ]] to help speed up PHPUnit code coverage generation, the [[ https://gerrit.wikimedia.org/r/c/integration/config/+/521190 | second patch flipped the flag back to true in CI ]] for generating complete coverage reports.
[1] For GrowthExperiments, generating coverage reports without a customized whitelist takes ~17 seconds. With a custom whitelist, it takes ~1 second. While 17 seconds is arguably not a lot of time, the near-instant feedback with a customized whitelist means one is less likely to face interruptions to their flow or concentration while working on a patch.
How’d we do in our strive for operational excellence last month? Read on to find out!
There were five recorded incidents last month, equal to the median for this and last year. – Explore this data.
To read more about these incidents, their investigations, and pending actionables, check Incident documentation § 2019.
This month saw three major upgrades across the MediaWiki stack.
The client-side switch to toggle between HHVM and PHP 7.2 saw its final push — from the 50% it was at previously, to 100% of page view sessions on 17 September. The switch further solidified on 24 September when static MediaWiki traffic followed suit (e.g. API and ResourceLoader). Thanks @jijiki and @Joe for the final push. – More details at T219150 and T176370.
The RFC to discontinue basic compatibility for the IE6 and IE7 browsers entered Last Call on 18 September. It was approved on 2 Oct (T232563). Thanks to @Volker_E for leading the sprint to optimise our CSS payloads by removing now-redundant style rules for IE6-7 compat. – More at T234582.
With HHVM behind us, our Composer configuration no longer needs to be compatible with a “PHP 5.6 like” run-time. Support for the real PHP 5.6 was dropped over 2 years ago, and the HHVM engine supports PHP 7 features. But the HHVM engine identifies itself as “PHP 5.6.999-hhvm”. As such, Composer refused to install PHPUnit 6 (which requires PHP 7.0+), and could only install PHPUnit 4 under HHVM (as for PHP 5.6). Our unit tests have had to remain compatible with both PHPUnit 4 and PHPUnit 6 simultaneously.
Now that we’re fully on PHP 7.2+, our Composer configuration effectively drops PHP 5.6, 7.0 and 7.1 all at once. This means that we no longer run PHPUnit tests on multiple PHPUnit versions (PHPUnit 6 only). The upgrade to PHPUnit 8 (PHP 7.2+) is also unlocked! Thanks @MaxSem, @Jdforrester-WMF and @Daimona for leading this transition. – T192167
Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Or help someone that’s already started with their patch:
→ Open prod-error tasks with a Patch-For-Review
Breakdown of recent months (past two weeks not included):
Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. –
wikitech.wikimedia.org/wiki/Special:PrefixIndex?prefix=Incident…
[2] Tasks created. –
phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. –
phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. –
phabricator.wikimedia.org/maniphest/query…
How’d we do in our strive for operational excellence in August? Read on to find out!
The number of recorded incidents in August, at three, was below average for the year so far. However, in previous years (2017-2018), August also had 2-3 incidents. – Explore this data.
To read more about these incidents, their investigations, and pending actionables, check Incident documentation § 2019.
Reports from Logstash indicated that some user requests were aborted by a fatal PHP error from the MessageCache class. The user would be shown a generic system error page. The affected requests didn’t seem to have anything obvious in common, however. This made it difficult to diagnose.
MessageCache is responsible for fetching interface messages, such as the localised word “Edit” on the edit button. It calls a “load()” function and then tries to access the loaded information. However, sometimes the load function would claim to have finished its work, and yet the information was not there.
When the load function initialises all the messages for a particular language, it keeps track of this, so as to not do the same work a second time. From whichever angle I looked at this code, no obvious mistakes stood out. A deeper investigation revealed that two unrelated changes (made more than a year apart) each broke one assumption that was safe to break on its own. But put together, this seemingly impossible problem emerges. Check out T208897#5373846 for the details of the investigation.
Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Or help someone that’s already started with their patch:
→ Open prod-error tasks with a Patch-For-Review
Breakdown of recent months (past two weeks not included):
Thank you to @aaron, @Catrope, @Daimona, @dbarratt, @Jdforrester-WMF, @kostajh, @pmiazga, @Tarrow, @zeljkofilipin, and everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. – wikitech.wikimedia.org/wiki/Special:PrefixIndex?prefix=Incident…
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…
Building off the work done at the Prague Hackathon (T216260), we're happy to announce some significant changes and improvements to the PHP testing tools included with MediaWiki.
You can now download MediaWiki, run composer install, and then composer phpunit:unit to run core's unit test suite (T89432).
You can now use the plain PHPUnit entrypoint at vendor/bin/phpunit instead of the MediaWiki maintenance class which wraps PHPUnit (tests/phpunit/phpunit.php).
Both the unit tests and integration tests can be executed with the standard phpunit entrypoint (vendor/bin/phpunit) or if you prefer, with the composer scripts defined in composer.json (e.g. composer phpunit:unit). We accomplished this by writing a new bootstrap.php file (the old one which the maintenance class uses was moved to tests/phpunit/bootstrap.maintenance.php) which executes the minimal amount of code necessary to make core, extension and skin classes discoverable by test classes.
Integration tests should be placed in tests/phpunit/integration, while unit tests go in tests/phpunit/unit; these are discoverable by the new test suites (T87781). It sounds obvious now to write this, but a nice side effect is that by organizing tests into these directories it's immediately clear to authors and reviewers what type of test one is looking at.
A new base test case, MediaWikiUnitTestCase, has been introduced with a minimal amount of boilerplate: the @covers validator, checks that globals are disabled and that the tests are in the proper directory, and the default PHPUnit 4 and 6 compatibility layer. MediaWikiTestCase has been renamed to MediaWikiIntegrationTestCase for clarity.
A significant portion of core's unit tests have been ported to use MediaWikiUnitTestCase, approximately 50% of the total. We have also worked on porting extension tests to the unit/integration directories. @Ladsgroup wrote a helpful script to assist with automating the identification and moving of unit tests, see P8702. Migrating tests from MediaWikiIntegrationTestCase to MediaWikiUnitTestCase makes them faster.
Note that unit tests in CI are still run with the PHPUnit maintenance class (tests/phpunit/phpunit.php), so when reviewing unit test patches please execute them locally with vendor/bin/phpunit /path/to/tests/phpunit/unit or composer phpunit -- /path/to/tests/phpunit/unit.
The PHPUnit configuration file now resides at the root of the repository, and is called phpunit.xml.dist. (As an aside, you can copy this to phpunit.xml and make local changes, as that file is git-ignored, although you should not need to do that.) We made a modification (T192078) to the PHPUnit configuration inside MediaWiki to speed up code coverage generation. This makes it feasible to have a split window in your IDE (e.g. PhpStorm), run "Debug with coverage", and see the results in your editor fairly quickly after running the tests.
Things we are working on:
Help is wanted in all areas of the above! We can be found in the #wikimedia-codehealth channel and via the phab issues linked in this post.
The above work has been done and supported by Máté (@TK-999), Amir (@Ladsgroup), Kosta (@kostajh), James (@Jdforrester-WMF), Timo (@Krinkle), Leszek (@WMDE-leszek), Kunal (@Legoktm), Daniel (@daniel), Michael Große (@Michael), Adam (@awight), Antoine (@hashar), JR (@Jrbranaa) and Greg (@greg) along with several others. Thank you!
Thanks for reading, and happy testing!
Amir, Kosta, & Máté
How’re we doing on that strive for operational excellence? Read this first anniversary edition to find out!
The number of recorded incidents over the past month, at five, is equal to the median number of incidents per month (2016-2019). – Explore this data.
To read more about these incidents, their investigations, and pending actionables, check Incident documentation § 2019.
Exactly one year ago this periodical started to provide regular insights on production stability. The idea was to shorten the feedback cycle between deployment of code that leads to fatal errors and the discovery of those errors. This allows more people to find reports earlier, which (hopefully) prevents them from sneaking into a growing pile of “normal” errors.
576 reports were created between 15 July 2018 and 31 July 2019 (tagged Wikimedia-prod-error).
425 reports got closed over that same time period.
Read the first issue in story format, or the initial e-mail.
Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Or help someone who already started with their patch:
→ Open prod-error tasks with a Patch-For-Review
Breakdown of recent months (past two weeks not included):
Thank you to @aaron, @Anomie, @ArielGlenn, @Catrope, @cscott, @Daimona, @dbarratt, @dcausse, @EBernhardson, @Jdforrester-WMF, @jeena, @MarcoAurelio, @SBisson, @Tchanders, @Tgr, @tstarling, @Urbanecm; and everyone else who helped by finding, investigating, or resolving error reports in Wikimedia production. Thanks!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. – wikitech.wikimedia.org/wiki/Special:PrefixIndex?prefix=Incident…
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…
How’d we do in our strive for operational excellence last month? Read on to find out!
The number of incidents in June was high compared to previous years. At 11 incidents, this is higher than this year’s median (5), the 2018 median (4), and the 2017 median (5). It is also higher than any June in the last 4 years. – More data at CodePen.
To read more about these incidents, their investigations, and pending actionables, check Incident documentation § 2019.
There are currently 204 open Wikimedia-prod-error reports (up from 186 in April, and 201 in May). [4]
Hereby a shoutout to the Wikidata and Core Platform teams, at WMDE and WMF respectively. They both recently established a rotating subteam that focuses on incidental work, such as maintenance and other work that might otherwise hinder feature development.
I expect this to improve efficiency by avoiding context switches between feature and incidental work. The rotational aspect should distribute the work more evenly among team members (avoiding burnout). And it may increase exposure to other teams and lesser-known areas of our code, which provides opportunities for personal growth and helps retain institutional knowledge.
Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the month in which they were first observed.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error
Or help someone who already started with their patch:
→ Open prod-error tasks with a Patch-For-Review
Breakdown of recent months (past two weeks not included):
By steward and software component, the unresolved issues that survived June:
Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production. Including: @Anomie, @brion, @Catrope, @cscott, @daniel, @dcausse, @DerFussi, @Ebe123, @fgiunchedi, @Jdforrester-WMF, @kostajh, @Legoktm, @Lucas_Werkmeister_WMDE, @matmarex, @matthiasmullie, @Michael, @Nikerabbit, @SBisson, @Smalyshev, @Tchanders, @Tgr, @Tpt, @Umherirrender, and @Urbanecm.
Thanks!
Until next time,
– Timo Tijhof
Footnotes:
How’d we do in our strive for operational excellence last month? Read on to find out!
The number of incidents in May of this year was comparable to previous years (6 in May 2019, 2 in May 2018, 5 in May 2017), and previous months (6 in May, 8 in April, 8 in March) – comparisons at CodePen.
To read more about these incidents, their investigations, and pending actionables, check wikitech.wikimedia.org/wiki/Incident_documentation#2019.
As of writing, there are 201 open Wikimedia-prod-error tasks (up from 186 last month). [4]
Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the month in which they were first observed.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error
Or help someone that’s already started with their patch:
→ Open prod-error tasks with a Patch-For-Review
Breakdown of recent months (past two weeks not included):
By steward and software component, unresolved issues from April and May:
Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production.
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. –
wikitech.wikimedia.org/wiki/Special:PrefixIndex…
[2] Tasks created. –
phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. –
phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. –
phabricator.wikimedia.org/maniphest/query…
After many months of discussion, work and consultation across teams and departments[0], and with much gratitude and appreciation for the hard work and patience of @thcipriani and @hashar, the Code-Health-Metrics group is pleased to announce the introduction of the code health pipeline. The pipeline is currently in beta and enabled for GrowthExperiments, soon to be followed by Notifications, PageTriage, and StructuredDiscussions. (If you'd like to enable the pipeline for an extension you maintain or contribute to, please reach out to us via the comments on this post.)
The Code-Health-Metrics group has been working to define a set of common code health metrics. Our current understanding is that code health factors comprise: simplicity, readability, testability, and buildability. Beyond analyzing a given patch set for these factors, we also want to have a historical view of code as it evolves over time. We want to be able to see which areas of code lack test coverage, where refactoring a class due to excessive complexity might be called for, and where possible bugs exist.
After talking through some options, we settled on a proof-of-concept to integrate Wikimedia's Gerrit patch sets with SonarQube as the hub for analyzing and displaying metrics on our code[1]. SonarQube is a Java project that analyzes code according to a set of rules. SonarQube has a concept of a "Quality Gate", which can be defined organization-wide or overridden on a per-project basis. The default Quality Gate says that of the code added in a patch set, over 80% must be covered by tests, less than 3% may consist of duplicated lines, and the maintainability, reliability and security ratings should be graded as an A. If code passes these criteria then we say it has passed the quality gate; otherwise it has failed.
Here's an example of a patch that failed the quality gate:
If you click through to the report, you can see that it failed because the patch introduced an unused local variable (code smell), so the maintainability score for that patch was graded as a C.
For projects that have been opted in to the code health pipeline, submitting a new patch or commenting with "check codehealth" will result in the following actions:
If you click the link, you'll be able to view the analysis in SonarQube. From there you can also view the code of a project and see which lines are covered by tests, which lines have issues, etc.
Also, when a patch merges, the mwext-codehealth-master-non-voting job executes which will update the default view of a project in SonarQube with the latest code coverage and code metrics.[3]
We would like to enable the code health pipeline for more projects, and eventually we would like to use it for core. One challenge with core is that it currently takes ~2 hours to generate the PHPUnit coverage report. We also want to gather feedback from the developer community on false positives and unhelpful rules. We have tried to start with a minimal set of rules that we think everyone could agree with but are happy to adjust based on developer feedback[2]. Our current list of rules can be seen in this quality profile.
If you'll be at the Hackathon, we will be presenting on the code health pipeline and SonarQube at the Code health and quality metrics in Wikimedia continuous integration session on Friday at 3 PM. We look forward to your feedback!
Kosta, for the Code-Health-Metrics group
[0] More about the Code Health Metrics group: https://www.mediawiki.org/wiki/Code_Health_Group/projects/Code_Health_Metrics, currently comprised of Guillaume Lederrey (R), Jean-Rene Branaa (A), Kosta Harlan (R), Kunal Mehta (C), Piotr Miazga (C), Željko Filipin (R). Thank you also to @daniel for feedback and review of rules in SonarQube.
[1] While SonarQube is an open source project, we currently use the hosted version at sonarcloud.io. We plan to eventually migrate to our own self-hosted SonarQube instance, so we have full ownership of tools and data.
[2] You can add a topic here https://www.mediawiki.org/wiki/Talk:Code_Health_Group/projects/Code_Health_Metrics
[3] You might have also noticed a post-merge job over the last few months, wmf-sonar-scanner-change. This job did not incorporate code coverage, but it did analyze most of our extensions and MediaWiki core, and as a result there is a set of project data and issues that might be of interest to you. The Issues view in SonarQube might be interesting, for example, as a starting point for new developers who want to contribute to a project and want to make some small fixes.
Writing blog posts is neither my job nor something that I enjoy, thus I am late with the Quibble updates. The last one, Blog Post: Quibble in summer, was written in September 2018 and I forgot to publish it until now. You might want to read it first to get a glimpse of some nice changes that were implemented last summer.
I guess personal changes that happened in October and the traditional northern hemisphere winter hibernation kind of explain the delay (see note [ 1 ]). Now that spring is finally here ({{NPOV}}), it is time for another update.
Quibble went from 0.0.26 to 0.0.30, which I cut just before starting this post. I wanted to highlight a few changes from an overall small change log:
The first incarnation of Quibble did not have much thought put into it with regard to speed. The main goal at the time was simply to gather all the complicated logic from CI shell scripts, Jenkins job shell snippets, and Python or JavaScript scripts into one single command. That in turn made it easier to reproduce a build, but with a serious limitation: commands are just run serially, which is far from optimal.
Quibble now runs the lint commands in parallel for both extensions/skins and mediawiki/core. Internally, it forks composer test and npm test in parallel, which slightly speeds up the time it takes for the linting commands to complete.
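That fork-and-wait pattern can be sketched in plain shell. This is an illustration, not Quibble's actual code: `run_parallel` is a hypothetical helper, and `true`/`false` stand in for the real `composer test` and `npm test`:

```shell
#!/bin/sh
# Run two commands concurrently; fail if either of them fails.
run_parallel() {
    "$1" & p1=$!
    "$2" & p2=$!
    status=0
    wait "$p1" || status=1
    wait "$p2" || status=1
    return "$status"
}

# Stand-ins for `composer test` and `npm test`:
run_parallel true true  && echo "lint ok"
run_parallel true false || echo "lint failed"
```

Both children always get reaped via wait, so a failure in one command never leaves the other orphaned.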
Another annoyance when testing multiple repositories together is that preparing the git repositories could take several minutes. An example is an extension depending on several other extensions, or the gated wmf-quibble-* jobs which run tests for several Wikimedia-deployed extensions. Even when using a local cache of git repositories (--git-cache), the serially run git commands take a while. Quibble 0.0.30 learned --git-parallel to run the git commands in parallel. An example speed-up using the git cache, several repositories, and a DSL connection:
git-parallel | Duration
---|---
16 | 30 seconds
1 | 50 seconds
The option defaults to 1, which retains the exact same behavior / code path as before. I invite you to try --git-parallel=8, for example, and draw your own conclusions. Wikimedia CI will be updated once Quibble 0.0.30 is deployed.
The parallelism work was done by myself, @hashar, and is partly tracked in T211701.
Some parts of the documentation referred to Wikimedia CI containers that were no longer suitable for running tests due to refactoring. The documentation has thus been updated to use the proper containers: docker-registry.wikimedia.org/releng/quibble-stretch-php72 or docker-registry.wikimedia.org/releng/quibble-stretch-hhvm. -- @hashar
In August, Wikidata developers used Quibble to reproduce a test failure and took the extra step of capturing their session and documenting how to reproduce it. Thank you @Pablo-WMDE for leading this and @Tarrow, @Addshore, @Michael, @Ladsgroup for the reviews - T200991.
You can read the documentation online at:
Note: as of this writing, the CI git servers are NOT publicly reachable (git://contint1001.wikimedia.org and git://contint2001.wikimedia.org).
Some extensions or skins might have submodules, but we never caught errors when processing them failed and just kept going. That later causes tests to fail in non-obvious ways and caused several people to lose time recently. T198980
The reason is that Quibble simply borrowed a legacy shell script to handle submodules, and that script has been broken since its first introduction in 2014. It relied on the find command, which still exits 0 even with -exec /bin/false. Although /bin/false exits with code 1, that simply causes find to consider the -exec predicate to be false; find stops processing further predicates for that file, but does not treat it as an error.
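This is easy to verify from any shell (a minimal demonstration; the /tmp/find-demo directory name is arbitrary):

```shell
# find exits 0 even though the -exec'd command failed for every file:
# a non-zero -exec result merely makes the predicate false, it is not
# an error as far as find is concerned.
mkdir -p /tmp/find-demo
touch /tmp/find-demo/a /tmp/find-demo/b
find /tmp/find-demo -type f -exec /bin/false \;
echo "find exited with: $?"
```

Running this prints `find exited with: 0`, which is exactly why the legacy script silently swallowed submodule failures.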
The logic has been ported to pure Python and now properly aborts when git submodule fails. That also drops the requirement to have the find command available, which might help on Windows. -- @hashar
The configuration injected by Quibble into LocalSettings.php is now a single file, whereas previously it was made of several small PHP files glued together by shelling out to php. The inline comments have been improved. -- @Krinkle
The MediaWiki installer uses a slightly stronger password (testwikijenkinspass) to accommodate a security hardening in MediaWiki core itself. -- @Reedy T204569
The Gerrit URL to clone the canonical git repository from has been updated to catch up with a change in Gerrit. Updated r/p to simply /r. -- @Legoktm T218844
PHPUnit generates JUnit test results in the log directory, intended to be captured and interpreted by CI. -- @hashar T207841
Footnotes:
[ 1 ] Seasons are location-based and a cultural agreement; they are quite interesting in their own right. They are reversed between the Northern and Southern hemispheres, do not exist at the equator, while in India they define six seasons. Thus when I refer to a winter hibernation, it really just reflects my own biased point of view.
[ 2 ] Parallelism is fun; I can never manage to write that word without mixing up the number of r's or l's for some reason. As a side note, my favorite sport to watch is parallel bars (enwiki).
The working group to consider future CI tooling for Wikimedia has finished and produced a report. The report is at https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/CI_Futures_WG/Report and the short summary is that the release engineering team should do prototype implementations of Argo, GitLab CI/CD, and Zuul v3.
For a few weeks, a CI job had PHPUnit tests abruptly ending with:
returned non-zero exit status -11
The connoisseur [ 1 ] would have recognized that a negative exit status indicates the process exited due to a signal. On Linux, 11 is the value of the SIGSEGV signal, which is usually sent by the kernel to a process as a result of an invalid memory reference. The default behavior is to terminate the process (man 7 signal) and to generate a core dump file (I will come back to that later).
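The convention is easy to check from a shell: a process terminated by a signal has no ordinary exit code, so the shell reports 128 plus the signal number, while tools such as Python's subprocess module report the negative signal number instead, hence -11:

```shell
# Spawn a shell that sends SIGSEGV (signal 11) to itself; the parent
# shell then reports the termination as 128 + 11 = 139.
sh -c 'kill -SEGV $$'
echo "exit status: $?"
```

This prints `exit status: 139`, the shell-side spelling of the `-11` seen in the CI logs.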
But why? Some PHP code ended up triggering a code path in HHVM that would eventually try to read outside of its memory range, or some similar low level fault. The kernel knows that the process completely misbehaved and thus, well, terminates it. Problem solved, you never want your program to misbehave when the kernel is in charge.
The job had recently been switched to use a new container in order to benefit from more recent libraries and to match the OS distribution used by the Wikimedia production systems. My immediate recommendation was to roll back to the previous known state, but eventually I let the task linger and got absorbed by other tasks (such as updating MediaWiki on the infrastructure).
Last week, the job suddenly began to fail constantly. We prevent code from being merged when a test fails, so the code stays in a quarantine zone (Gerrit) and cannot be shipped. A whole team (the Language-Team) could not ship code for one of their flagship projects (ContentTranslation). That in turn prevents end users from benefiting from new features they are eager for. The issue had to be acted on and became an unbreak now! kind of task. And so I set off on my journey.
returned non-zero exit status -11 is a good enough error message. A process in a Docker container is really just an isolated process, still managed by the host kernel. The first thing I did was look at the kernel syslog facility on our instances, which yields:
kernel: [7943146.540511] php[14610]: segfault at 7f1b16ffad13 ip 00007f1b64787c5e sp 00007f1b53d19d30 error 4 in libpthread-2.24.so[7f1b64780000+18000]
php there is just HHVM invoked via a php symbolic link. The message hints at libpthread, which is where the fault occurred. But we need a stack trace to better determine the problem, and ideally a reproduction case.
Thus, what I am really looking for is the core dump file I alluded to earlier. The file is generated by the kernel and contains an image of the process memory at the time of the failure. Given the full copy of the program instructions, the instructions it was running at that time, and all the memory segments, a debugger can reconstruct a human readable state of the failure. That is a backtrace, and is what we rely on to find faulty code and fix bugs.
The core file was not generated, or the error message would have stated it had core dumped, i.e. that the kernel generated the core dump file. Our default configuration is to not generate any core file, but usually one can adjust that from the shell with ulimit -c XXX, where XXX is the maximum size a core file may occupy (in kilobytes, to prevent filling the disk). Docker being just a fancy way to start a process, it has a setting to adjust the limit. The docker run inline help states:
--ulimit ulimit Ulimit options (default [])
That is about as helpful as it gets; eventually the option to set is --ulimit core=2147483648, i.e. up to 2 gigabytes. I updated the CI jobs and instructed them to capture a file named core, the default file name. After a few runs, although I could confirm failures, no files got captured. Why not?
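As a quick refresher on the shell side of that knob (a sketch; in bash the ulimit -c size is expressed in kilobyte blocks, while the docker flag takes bytes):

```shell
# Print the current soft limit for core file size ("0" means
# no core files are written, which was our default).
ulimit -S -c
# Explicitly disable core files for children of this shell:
ulimit -S -c 0
ulimit -S -c
# For a container, the equivalent knob is passed at start time, e.g.:
#   docker run --ulimit core=2147483648 ...
```

Lowering the soft limit is always permitted; raising it back above the hard limit is not, which is why the limit is best set when the container is started.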
Our machines do not use core as the default filename. It can be found in the kernel configuration:
/proc/sys/kernel/core_pattern:
/var/tmp/core/core.%h.%e.%p.%t
The %-specifiers are expanded by the kernel when writing the file: %h is the hostname, %e the executable name, %p the process ID and %t a timestamp (see man 5 core).
I thus went on the hosts looking for such files. There were none.
Or maybe I mean None or NaN.
Nada, rien.
The void.
The next step was obvious: try to reproduce it! I ran a Docker container executing a basic while loop, and from the host I sent the SIGSEGV signal to the process. The host still had no core file. But surprise: it was in the container. Although the kernel handles the core dump from the host, it is not namespace-aware when it comes time to resolve the path. My quest was nearing its end: I simply mounted a host directory into the container at the expected place:
mkdir /tmp/coredumps
docker run --volume /tmp/coredumps:/var/tmp/core ....
After a few builds, I had harvested enough core files. The investigation is then very straightforward:
$ gdb /usr/bin/hhvm /coredump/core.606eb29eab46.php.2353.1552570410
Core was generated by `php tests/phpunit/phpunit.php --debug-tests --testsuite extensions --exclude-gr'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f557214ac5e in __pthread_create_2_1 (newthread=newthread@entry=0x7f55614b9e18, attr=attr@entry=0x7f5552aa62f8, start_routine=start_routine@entry=0x7f556f461c20 <timer_sigev_thread>, arg=<optimized out>) at pthread_create.c:813
813     pthread_create.c: No such file or directory.
[Current thread is 1 (Thread 0x7f55614be3c0 (LWP 2354))]
(gdb) bt
#0  0x00007f557214ac5e in __pthread_create_2_1 (newthread=newthread@entry=0x7f55614b9e18, attr=attr@entry=0x7f5552aa62f8, start_routine=start_routine@entry=0x7f556f461c20 <timer_sigev_thread>, arg=<optimized out>) at pthread_create.c:813
#1  0x00007f556f461bb2 in timer_helper_thread (arg=<optimized out>) at ../sysdeps/unix/sysv/linux/timer_routines.c:120
#2  0x00007f557214a494 in start_thread (arg=0x7f55614be3c0) at pthread_create.c:456
#3  0x00007f556aeebacf in __libc_ifunc_impl_list (name=<optimized out>, array=0x7f55614be3c0, max=<optimized out>) at ../sysdeps/x86_64/multiarch/ifunc-impl-list.c:387
#4  0x0000000000000000 in ?? ()
Which, as @Anomie kindly pointed out, is an issue solved in libc6. Once the container was rebuilt to apply the package update, the fault disappeared.
One can now expect new changes to appear in ContentTranslation.
[ 1 ] ''connoisseur'', from obsolete French, from a verb meaning "to know" https://en.wiktionary.org/wiki/connoisseur . I guess the English language forgot to apply updates in due time, and cannot make such a change now for fear of breaking backward compatibility or locution habits.
The task has all the technical details and log leading to solving the issue: T216689: Merge blocker: quibble-vendor-mysql-hhvm-docker in gate fails for most merges (exit status -11)
(Some light copyedits to above -- Brennen Bearnes)
How’d we do in our strive for operational excellence last month? Read on to find out!
The number of incidents in April was relatively high at 8. Both compared to this year (4 in January, 7 in February, 8 in March), and compared to last year (4 in April 2018).
To read more about these incidents, their investigations, and conclusions; check wikitech.wikimedia.org/wiki/Incident_documentation#2019.
As of writing, there are 186 open Wikimedia-prod-error issues (up from 177 last month). [4]
Following the report of a PHP error that happened when saving edits to certain pages, Tim Starling investigated. The investigation motivated a big commit that brings this class into the modern era. I think this change serves as a good overview of what’s changed in MediaWiki over the last 10 years, and demonstrates our current best practices.
Take a look at Gerrit change 502678 / T220563.
Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the week in which they were first observed.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error
Or help someone that’s already started with their patch:
→ Open prod-error tasks with a Patch-For-Review
Breakdown of recent months (past two weeks not included):
By steward and software component, issues left from March and April:
Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production. Including: @aaron, @ArielGlenn, @Daimona, @dcausse, @EBernhardson, @Jdforrester-WMF, @Joe, @KartikMistry, @Ladsgroup, @Lucas_Werkmeister_WMDE, @MaxSem, @MusikAnimal, @Mvolz, @Niharika, @Nikerabbit, @Pchelolo, @pmiazga, @Reedy, @SBisson, @tstarling, and @Umherirrender.
Thanks!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents reports by month and year. –
codepen.io/Krinkle/…
[2] Tasks created. –
phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. –
phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. –
phabricator.wikimedia.org/maniphest/query…
How’d we do in our strive for operational excellence last month? Read on to find out!
The number of incidents this month was slightly above average compared to earlier this year (7 in February, 4 in January), and this time last year (4 in March 2018, 7 in February 2018).
To read more about these incidents, their investigations, and conclusions, check wikitech.wikimedia.org/wiki/Incident_documentation#2019-03.
There are currently 177 open Wikimedia-prod-error issues, similar to last month. [4]
💡 Ideas: To suggest an investigation to highlight in a future edition, feel free to contact me by e-mail, or private message on IRC.
Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the week in which they were first observed.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Or help someone that’s already started with their patch:
→ Open prod-error tasks with a Patch-For-Review
Breakdown of recent months (past two weeks not included):
By steward and software component, for issues remaining from February and March:
Thanks to @aaron, @Anomie, @Arlolra, @Daimona, @hashar, @Jdforrester-WMF, @kostajh, @matmarex, @MaxSem, @Niedzielski, @Nikerabbit, @Petar.petkovic, @santhosh, @ssastry, @Umherirrender, @WMDE-leszek, @zeljkofilipin, and everyone else who helped last month by reporting, investigating, or patching errors found in production!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. – wikitech.wikimedia.org/wiki/Special:PrefixIndex/Incident_documentation/201903 …
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query …
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query …
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query …
The working group to consider future tooling for continuous integration is making progress (see the previous blog post J148 for more information). We're looking at and evaluating alternatives, and learning of new needs within WMF.
If you have CI needs that are not covered by building from git in a Linux container, we would like to hear from you. For example, building iOS applications is difficult without a Mac/OS X build worker, so we're looking into what we can do to provide that. What else is needed?
We're currently aiming to make CI much more "self-serve", so that as much as possible can be done by developers themselves, without having to go through the Release Engineering team.
Our list of candidates includes systems that are not open source or are "open core" (open source, but with optional proprietary parts). We will be self-hosting, and open source is going to be a hard requirement. "Open core" may be an acceptable compromise for a system that is otherwise very good. We want to look at all alternatives, however, so that we know what's out there and what's possible.
We track our work in Phabricator, ticket T217325.
The Release Engineering team has started a working group to discuss and consider our future continuous integration tooling. Please help!
The RelEng team is working with SRE to build a continuous delivery and deployment pipeline, as well as changing production to run things in containers under Kubernetes. We aim to improve the process of making changes to software behind our various sites by making it take less effort, happen faster, be less risky, and as automated as possible. The developers will have a better development experience, be more empowered, and more productive.
Wikimedia has had a CI system for many years now, but it is based on versions of tools that are reaching the end of their useful life. Those tools need to be upgraded, and this will probably require further changes due to how the new versions function. This is a good point at which to consider what tools and functionality we need and want.
The working group is tasked with considering the needs and wants, evaluating the available options, and making a recommendation on what to use in the future. The deadline is March 25. The work is being documented at https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/CI_Futures_WG and we're currently collecting requirements and candidates to evaluate.
We would welcome any feedback on those! Via IRC (#wikimedia-pipeline), on the talk page of the working group's wiki page above, or as a comment to this blog post.
How’d we do in our strive for operational excellence? Read on to find out!
There are in total 177 open Wikimedia-prod-error tasks today. (188 in Feb, 172 in Jan, and 165 in Dec.)
There’s been an increase in how many application errors are reported each week. And, we’ve also managed to mostly keep up with those each week, so that’s great!
But, it does appear that most weeks we accumulated one or two unresolved errors, which is starting to add up. I believe this is mainly because they were reported a day after the branch went out. That is, if the same issues had been reported 24 hours earlier in a given week, then they might’ve blocked the train as a regression.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error/
Below is a breakdown of unresolved prod errors since last quarter. (I’ve omitted the last three weeks.)
By month:
By steward and software component:
Previously, a link to Special:Contributions could pass invalid options to a part of MediaWiki that doesn’t allow invalid options. Why would anything allow invalid options? Let’s find out.
Think about software as an onion. Software tends to have an outer layer where everything is allowed. If this layer finds illegal user input, it has to respond somehow. For example, by informing the user. In this outer layer, illegal input is not a problem in the software. It is a normal thing to see as we interact with the user. This outer layer responds directly to a user, is translated, and can do things like “view recent changes”, “view user contributions” or “rename a page”.
Internally, such action is divided into many smaller tasks (or functions). For example, a function might be “get talk namespace for given subject namespace”. This would answer “Talk:” to “(Article)”, and “Wikipedia_talk:” to “Wikipedia:”. When searching for edits on My Contributions with “Associated namespaces” ticked, this function is used. It is also used by Move Page if renaming a page together with its talk page. And it’s used on Recent Changes and View History, for all those little “talk” links next to each page title and username.
If one of your edits is for a page that has no discussion namespace, what should MediaWiki do? Show no edits? Skip that edit and tell the user “1 edit was hidden”? Show normally, but without a talk link? That decision is made by the outer layer for a feature, when it catches the internal exception. Alternatively, it can sometimes avoid an exception by asking a different question first – a question that cannot fail. Such as “Does namespace X have a talk space?”, instead of “What is the talk space for X?”.
When a program doesn’t catch or avoid an exception, a fatal error occurs. Thanks to @D3r1ck01 for fixing this fatal error. – T150324
Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production. Including: @aaron, @Addshore, @alaa_wmde, @Amorymeltzer, @Anomie, @D3r1ck01, @Daimona, @daniel, @hashar, @hoo, @jcrespo, @KaMan, @Mainframe98, @Marostegui, @matej_suchanek, @Ottomata, @Pchelolo, @Reedy, @revi, @Smalyshev, @Tarrow, @Tgr, @thcipriani, @Umherirrender, and @Volker_E.
Thanks!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. — wikitech.wikimedia.org/wiki/Special:AllPages…
[2] Tasks created. — phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. — phabricator.wikimedia.org/maniphest/query…
🍏 He got me invested in some kind of.. fruit company.
How’d we do in our strive for operational excellence last month? Read on to find out!
Xiplus reported that renaming a File page on zh.wikipedia.org led to a fatal database exception. Andre Klapper identified the stack trace from the logs, and Brad (@Anomie) investigated.
The file rename failed because the File page did not have a media file associated with it (such a move is not currently allowed in MediaWiki). But while handling this error, the code caused a different error. The impact was that the user wasn't informed about why the move failed. Instead, they received a generic error page about a fatal database exception.
@Tgr fixed the code a few hours later, and it was deployed by Roan later that same day.
Thanks! — T213168
During a routine audit of Logstash dashboards, I found a DBPerformance warning. The warning indicated that the limit of 0 for “master connections” was violated. That's a cryptic way of saying it found code in MediaWiki that uses a database master connection on a regular page view.
MediaWiki can have many replica database servers, but there can be only one master database at any given moment. To reduce chances of overload, delaying edits, or network congestion; we make sure to use replicas whenever possible. We usually involve the master only when source data is being changed, or is about to be changed. For example, when editing a page, or saving changes.
As the vast majority of traffic is page views, we have lower thresholds for latency and dependency on page views. In particular, page views may (in the future) be routed to secondary data centres that don’t even have a master DB.
@Tchanders from the Anti-Harassment team investigated the issue, found the culprit, and fixed it in time for the next MediaWiki train. Thanks! — T214735
@Tacsipacsi and @Evad37 both independently reported the same TemplateData issue. TemplateData powers the template insertion dialog in VisualEditor. It wasn't working for some templates after we deployed the 1.33-wmf.13 branch.
The error was “Argument 1 passed to ApiResult::setIndexedTagName() must be an instance of array, null given”. This means there was code that called a function with the wrong parameter. For example, the variable name may've been misspelled, or it may've been the wrong variable, or (in this case) the variable didn't exist. In such a case, PHP implicitly assumes “null”.
Bartosz (@matmarex) found the culprit. The week before, I made a change to TemplateData that changed the “template parameter order” feature to be optional. This allows users to decide whether VisualEditor should force an order for the parameters in the wikitext. It turned out I forgot to update one of the references to this variable, which still assumed it was always present.
Brad (Anomie) fixed it later that week, and it was deployed the next day. Thanks! — T213953
Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the week in which they were first observed.
→ phabricator.wikimedia.org/tag/wikimedia-production-error
There are currently 188 open Wikimedia-prod-error tasks as of 12 February 2019. (We’ve had a slight increase since November; 165 in December, 172 in January.)
For this month’s edition, I’d like to draw attention to a few older issues that are still reproducible:
Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production. Including: @A2093064, @Anomie, @Daimona, @Gilles, @He7d3r, @Jdforrester-WMF, @matmarex, @mmodell, @Nikerabbit, @Catrope, @Tchanders, @Tgr, and @thiemowmde.
Thanks!
Until next time,
— Timo Tijhof
👢There's a snake in my boot. Reach for the sky!
Footnotes:
[1] Incidents. — wikitech.wikimedia.org/wiki/Special:AllPages…
[2] Tasks closed. — phabricator.wikimedia.org/maniphest/query…
[3] Tasks created. — phabricator.wikimedia.org/maniphest/query…
Finding reviewers for a change is often a challenge, especially for a newcomer or folks proposing changes to projects they are not familiar with. Since January 16th, 2019, Gerrit automatically adds reviewers on your behalf based on who last changed the code you are affecting.
Antoine "@hashar" Musso explains what led us to enable that feature and how to configure it to fit your project. He also offers tips on how to seek out more reviewers, based on years of experience.
Having reviewers added automatically when uploading a new patch is the subject of task T91190, opened almost four years ago (March 2015). I had declined the task since we already have the Reviewer bot (see the section below), but @Tgr found a plugin for Gerrit which analyzes the code history with git blame and uses that to determine potential reviewers for a change. It took us a while to add that particular Gerrit plugin, and the first version we installed was not compatible with our Gerrit version. The plugin was upgraded yesterday (Jan 16th) and is working fine (T101131).
Let's have a look at the functionality the plugin provides, and how it can be configured per repository. I will then offer a refresher of how one can search for reviewers based on git history.
The Gerrit plugin looks at the affected code using git blame and extracts the top three past authors, who are then added as reviewers to the change on your behalf. The added reviewers will thus receive a notification showing that you have asked them for a code review.
The configuration is done on a per-project basis and inherits from the parent project. Without any tweaks, your project inherits the configuration from All-Projects. If you are a project owner, you can adjust the configuration. As an example, here is the configuration for operations/mediawiki-config, which shows the inherited values and an exception to not process a file named InitialiseSettings.php:
The three settings are described in the documentation for the plugin:
plugin.reviewers-by-blame.maxReviewers
The maximum number of reviewers that should be added to a change by this plugin. By default 3.

plugin.reviewers-by-blame.ignoreFileRegEx
Ignore files where the filename matches the given regular expression when computing the reviewers. If empty or not set, no files are ignored. By default not set.

plugin.reviewers-by-blame.ignoreSubjectRegEx
Ignore commits where the subject of the commit messages matches the given regular expression. If empty or not set, no commits are ignored. By default not set.
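Putting the settings together, a project owner could place something like the following in the project's Gerrit configuration (the project.config file on the refs/meta/config branch). The regular expression values here are illustrative, not taken from any real project:

```
[plugin "reviewers-by-blame"]
    maxReviewers = 3
    ignoreFileRegEx = InitialiseSettings.php
    ignoreSubjectRegEx = ^Revert
```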
By making past authors aware of a change to code they previously altered, I believe you will get more reviews and hopefully get your changes approved faster.
Previously we had other methods to add reviewers, one opt-in based and the others cumbersome manual steps. They should be used to complement the Gerrit reviewers-by-blame plugin, and I give an overview of each of them in the following sections.
The original system from Gerrit lets you watch projects, similar to a user watch list on MediaWiki. In Gerrit preferences, one can get notified for new changes, patchsets, comments... Simply indicate a repository, optionally a search query and you will receive email notifications for matching events.
The attached image is my watched projects configuration; I thus receive notifications for any changes made to the integration/config repository, as well as for changes in mediawiki/core which affect either composer.json or one of the Wikimedia deployment branches for that repo.
One drawback is that we cannot watch a whole hierarchy of projects, such as mediawiki and all its descendants, which would be helpful to watch our deployment branch. It is still useful when you are the primary maintainer of a repository, since you can keep track of all activity for the repository.
The reviewer bot was written by Merlijn van Deen (@valhallasw); it is similar to the Gerrit watched projects feature, with some major benefits:
One registers reviewers on a single wiki page: https://www.mediawiki.org/wiki/Git/Reviewers.
Each repository filter is a wikitext section (eg: === mediawiki/core ===) followed by a wikitext template and a file filter using Python fnmatch. Some examples:
Listen to any changes that touch i18n:
== Listen to repository groups ==
=== * ===
* {{Gerrit-reviewer|JohnDoe|file_regexp=<nowiki>i18n</nowiki>}}
Listen to MediaWiki core search related code:
=== mediawiki/core ===
* {{Gerrit-reviewer|JaneDoe|file_regexp=<nowiki>^includes/search/</nowiki>}}
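Since the bot's file filters are described as using Python fnmatch, here is a minimal illustration of how such shell-style patterns behave (the paths are made up):

```python
from fnmatch import fnmatch

# fnmatch uses shell-style wildcards, not regular expressions:
# * matches any run of characters, ? a single character.
assert fnmatch("includes/search/SearchEngine.php", "includes/search/*")
assert fnmatch("i18n/en.json", "i18n/*.json")
assert not fnmatch("tests/SearchTest.php", "includes/*")
print("all patterns matched as expected")
```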
The system works great, given maintainers remember to register on the page and that the files are not moved around. The bot is not that well known though and most repositories do not have any reviewers listed.
A source of reviewers is the git history: one can easily retrieve a list of past authors, who should be good candidates to review code. I typically use git shortlog --summary --no-merges for that (--no-merges filters out the merge commits crafted by Gerrit when a change is submitted). Example for the MediaWiki job queue system:
$ git shortlog --no-merges --summary --since "one year ago" includes/jobqueue/ | sort -n | tail -n4
     3  Petr Pchelko
     4  Brad Jorsch
     4  Umherirrender
    16  Aaron Schulz
Which gives me four candidates that acted on that directory over the past year.
When a patch is merged, Gerrit records in git the votes and the canonical URL of the change. They are available in git notes under refs/notes/review; once the notes are fetched, they can be shown by git show or git log by passing --show-notes=review. For each commit, after the commit message, the notes get displayed and show votes among other metadata:
$ git fetch origin refs/notes/review:refs/notes/review
$ git log --no-merges --show-notes=review -n1
commit e1d2c92ac69b6537866c742d8e9006f98d0e82e8
Author: Gergő Tisza <tgr.huwiki@gmail.com>
Date:   Wed Jan 16 18:14:52 2019 -0800

    Fix error reporting in MovePage

    Bug: T210739
    Change-Id: I8f6c9647ee949b33fd4daeae6aed6b94bb1988aa

Notes (review):
    Code-Review+2: Jforrester <jforrester@wikimedia.org>
    Verified+2: jenkins-bot
    Submitted-by: jenkins-bot
    Submitted-at: Thu, 17 Jan 2019 05:02:23 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/484825
    Project: mediawiki/core
    Branch: refs/heads/master
And I can then get the list of authors that previously voted Code-Review +2 for a given path. Using the previous example of includes/jobqueue/ over a year, the list is slightly different:
$ git log --show-notes=review --since "1 year ago" includes/jobqueue/ | grep 'Code-Review+2:' | sort | uniq -c | sort -n | tail -n5
      2 Code-Review+2: Umherirrender <umherirrender_de.wp@web.de>
      3 Code-Review+2: Jforrester <jforrester@wikimedia.org>
      3 Code-Review+2: Mobrovac <mobrovac@wikimedia.org>
      9 Code-Review+2: Aaron Schulz <aschulz@wikimedia.org>
     18 Code-Review+2: Krinkle <krinklemail@gmail.com>
User Krinkle has approved a lot of patches, even though he doesn't show up in the list of authors obtained by the previous means (inspecting git history).
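The same tally can be sketched in Python by parsing the notes output. This is an illustrative helper, not part of any existing tooling:

```python
import re
from collections import Counter

def count_approvers(log_text, top_n=5):
    """Tally Code-Review+2 votes found in the output of
    `git log --show-notes=review`."""
    votes = re.findall(r"Code-Review\+2: (.+)", log_text)
    return Counter(v.strip() for v in votes).most_common(top_n)

# A made-up fragment of log output:
sample = """
Notes (review):
    Code-Review+2: Jforrester <jforrester@wikimedia.org>
Notes (review):
    Code-Review+2: Krinkle <krinklemail@gmail.com>
Notes (review):
    Code-Review+2: Krinkle <krinklemail@gmail.com>
"""
print(count_approvers(sample))
# [('Krinkle <krinklemail@gmail.com>', 2), ('Jforrester <jforrester@wikimedia.org>', 1)]
```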
The Gerrit reviewers-by-blame plugin acts automatically, which offers a good chance your newly uploaded patch will get reviewers added out of the box. For finer tweaking, one should register as a reviewer on https://www.mediawiki.org/wiki/Git/Reviewers, which benefits everyone. The last course of action, mining the git log and review history, is meant to complement the others.
For any remarks, support, or concerns, reach out on the Freenode IRC channel #wikimedia-releng or file a task in Phabricator.
Thank you @thcipriani for the proofreading and English fixes.
Inside a broad Code Health project there is a small Code Health Metrics group. We meet weekly and discuss how code health could be improved by metrics. Each member has only a few hours each week to work on this, so our projects are small.
In our discussions, we have agreed on a few principles. Some of them are:
The goal of the project is to provide fast and actionable feedback on code health metrics. Since our time for this project is limited, we've decided to make a spike (T207046). The spike focuses on:
All of the above tasks are already completed, except for the last one. In parallel to finishing the spike, we are also working on expanding the scope to more repositories, languages and metrics. At the moment, the spike works for several Java repositories.
After some investigation, the tool we selected is SonarQube. The tool does everything we need, and more. In this post I'll only mention one feature. We have decided not to host SonarQube ourselves at the moment; we are using the hosted solution, SonarCloud. You can see our current dashboard at the wmftest organization at SonarCloud.
As mentioned in the principles, in order to make the metrics actionable, we've decided to focus only on new code, ignoring existing code for now. That means that when you make a change to a repository with a lot of code, you are not overwhelmed with all the metrics (and problems) the tool has found. Instead, the tool focuses just on the code you have written. So, for example, if a small patch you have submitted to a big repository does not introduce new problems, the tool says so. If the patch introduces new problems (like decreased branch coverage), the tool lets you know.
Members of the Code Health Metrics group have reminded me multiple times that I have to mention SonarLint, an IDE extension. I don't use it myself, since it doesn't support my favorite editor.
A good example is at the wmftest organization at SonarCloud: Elasticsearch extra plugins has failed the quality gate.
Opening the Elasticsearch extra plugins project, you see that the failure is related to test coverage (less than 80%).
Click the warning and you get more details: Coverage on New Code 0.0%.
Click the ExtraCorePlugin.java file. New lines have yellow background. It's easy to see that there are lines that are marked red (meaning no coverage) but it's also easy to see which new lines (yellow background) have no coverage (red sidebar).
We have planned to present what we have so far during the Wikimedia Foundation All Hands. To prepare for that, we created this blog post and presented at a 5 Minute Demo and at the Testival Meetup.
I would like to thank all members of the Code Health Metrics Working group for help writing this post and especially to Guillaume Lederrey and Kosta Harlan.
Q: Sonar-what?!
A: SonarQube is the tool. SonarCloud is the hosted version of the tool. SonarLint is an IDE extension.
Q: When can I use this on my project?
A: Soon. Probably when T207046 is resolved. If there are no blockers, in a few weeks.
Q: Why are we using SonarCloud instead of hosting SonarQube ourselves?
A: We did not want to invest time in hosting it ourselves until we're sure the tool is the right choice for us.
How’d we do in our strive for operational excellence last month? Read on to find out!
Terminology:
For December, I haven’t prepared any stories or taken interviews. Instead, I’ve got a lightning round of errors in various areas that were found and fixed this past month.
MarcoAurelio reported that Special:Contributions failed to load for certain user names on meta.wikimedia.org (PHP Fatal error, due to a faulty database record). Brad Jorsch investigated and found a relation to database maintenance from March 2018. He corrected the faulty records, which resolved the problem. Thanks! — T210985
The newly created Cantonese Wiktionary (yue.wiktionary.org) was encountering errors from the Siteinfo API. We found this was due to invalid site configuration. Urbanecm patched the issue, and also created a new unit test for wmf-config that will prevent this issue from happening on other wikis in the future. Thanks! — T211529
After deploying the 1.33.0-wmf.8 train to all wikis, we found a regression in the HTTP library for MediaWiki. When MediaWiki requested an HTTP resource from another service, and this resource was unavailable, then MediaWiki failed to correctly determine the HTTP status code of that error. Which then caused another error! This happened, for example, when Special:Collection was unable to reach the PediaPress.com backend in some cases. Patched by Bill Pirkle. Thanks! — T212005
When the 1.33.0-wmf.9 train reached the canary phase on Tue 18 December (aka group0 [4]), Željko spotted a new fatal error in the logs. The fatal originated in the Kartographer extension and would have affected various users of the MediaWiki API. Patched the same day by Michael Holloway, reviewed by James Forrester, and deployed by Željko. Thanks! — T212218
Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the week in which they were first observed.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error
November's theme will continue for now, as I imagine lots of you were on vacation during that time! I’d like to draw attention to a subset of PHP fatal errors. Specifically, those that are publicly exposed (e.g. don’t need elevated user rights) and emit an HTTP 500 error code.
Public user requests resulting in fatals can (and have) caused alerts to fire that notify SRE of wikis potentially being less available or down.
Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production. Including @MarcoAurelio, @Anomie, @Urbanecm, @BPirkle, @zeljkofilipin, @Mholloway, @Esanders, @Jdforrester-WMF, and @hashar.
Until next time,
— Timo Tijhof
Footnotes:
[1] Incidents. — wikitech.wikimedia.org/wiki/Special:AllPages...
[2] Tasks closed. — phabricator.wikimedia.org/maniphest/query...
[3] Tasks opened. — phabricator.wikimedia.org/maniphest/query...
[4] What is group0? — wikitech.wikimedia.org/wiki/Deployments/One_week#Three_groups
How’d we do in our strive for operational excellence last month? Read on to find out!
Terminology:
With that behind us... Let’s celebrate this month’s highlights!
Quiddity reported that he was unable to disable a spam account, due to a fatal exception. Andre Klapper used the Exception ID to find the stack trace in the logs. The trace revealed that a table was missing in Wikitech’s database.
The MediaWiki software was recently expanded with a “Partial blocking” ability. [4] This involved introducing a new database table that stores block metadata differently. This software update was deployed to Wikitech, but this new table was not created.
@Marostegui (Database administrator) quickly applied the schema patches that create the missing table. Thanks Manuel, Andre, and Quiddity; Teamwork!
– T209674
It had been known for years, [5] that users are unable to delete or restore pages with more than a few hundred revisions. Attempts to do so could fail, with a fatal “DBTransactionSizeError” exception. This error indicates that the change is too big or too slow. Such changes risk replication lag, and may impact the stability of the infrastructure.
The database structure used by MediaWiki for page archives dates back to 2003 (over 15 years ago). I'll spare you the details, but it depends on database interactions that are inherently slow when applied to systems as big as Wikipedia! RFC T20493 intends to modernise this structure for the long-term.
Then along came @BPirkle. Bill joined the WMF Core platform team earlier this year. He took on the challenge of making page deletion work for any size page, today.
Previously, page deletion happened in a single step. This simple approach had the benefit of either succeeding in its entirety, or safely rolling back like nothing happened. It also meant that the database protected us against conflicting changes. In August, Bill started a two-month effort that carefully split the logic for “delete a page” into smaller steps that each are safe and quick. It now uses our JobQueue to schedule and run these steps, without the user waiting for it.
– T198176
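The general idea of that refactoring, splitting one huge operation into small queueable steps, can be sketched like this (a toy model for illustration, not the actual MediaWiki code):

```python
def make_batches(revision_ids, batch_size=100):
    """Split a large list of revision IDs into small, independently
    processable chunks, as a job queue might consume them one at a time."""
    return [revision_ids[i:i + batch_size]
            for i in range(0, len(revision_ids), batch_size)]

# A page with 950 revisions becomes 10 quick jobs instead of
# one slow transaction; the last job holds the remaining 50.
jobs = make_batches(list(range(950)), batch_size=100)
print(len(jobs), len(jobs[-1]))  # 10 50
```

Each small batch commits quickly, which is what keeps replication lag bounded.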
Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the week in which they were first observed.
→ https://phabricator.wikimedia.org/tag/wikimedia-production-error
I’d like to draw attention to a subset of PHP fatal errors. Specifically, those that are publicly exposed (e.g. don’t require elevated user rights) and use an HTTP 500 status code.
Public user requests resulting in fatals can (and have) caused alerts to fire that notify SRE of wikis potentially being less available or down.
Thank you to everyone who helped by reporting or investigating problems in Wikimedia production; and for implementing or reviewing their solutions. Including: @tstarling, @thiemowmde, @thcipriani, @Tgr, @Steinsplitter, @Quiddity, @pmiazga, @Nikerabbit, @Mvolz, @Lucas_Werkmeister_WMDE, @kostajh, @jrbs, @JJMC89, @Jdforrester-WMF, @hashar, @Gilles, @Daimona, @Ciencia_Al_Poder, @Catrope, @BPirkle, @Barkeep49, @Anomie, and @Aklapper.
Thanks!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. – wikitech.wikimedia.org/wiki/Special:AllPages...
[2] Tasks closed. – phabricator.wikimedia.org/maniphest/query...
[3] Tasks opened. – phabricator.wikimedia.org/maniphest/query...
[4] Partial blocks. – meta.wikimedia.org/wiki/Community_health_initiative
[5] Bug report about page deletion, 2007. – T13402
The Release Engineering team wants to continually improve the quality of our software over time. One of the ways in which we hoped to do that this year is by creating more useful Selenium smoke tests. (From now on, test will be used instead of Selenium test.) This blog post is about how we determined where the tests should focus and the relative priority.
At first, I thought this would be a trivial task. A few hours of work. A few days at most. A week or two if I've completely underestimated it. A couple of months later, I know I have completely underestimated it.
Things I needed to do:
In general:
For the last year:
This was a relatively simple task. The best source of information is the Developers/Maintainers page.
This was also easy. The Selenium/Node.js page has a list of repositories that have tests in Node.js. I already had all repositories with Node.js and Ruby tests on my machine, so a quick search for webdriverio (Node.js) and mediawiki_selenium (Ruby) found all the tests. In order to be really sure I've found all repositories with tests, I've cloned all repositories from Gerrit.
$ ack --json webdriverio
extensions/Echo/package.json
27:    "webdriverio": "4.12.0"
...
$ ack --type-add=lock:ext:lock --lock mediawiki_selenium
skins/MinervaNeue/Gemfile.lock
42:    mediawiki_selenium (1.7.3)
...
To make extra sure I have not missed any repositories, I've used MediaWiki code search (mediawiki_selenium, webdriverio) and GitHub search (org:wikimedia extension:lock mediawiki_selenium, org:wikimedia extension:json webdriverio)
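As a rough local equivalent of those searches, one could scan checked-out repositories for a webdriverio dependency with a short script. This is a hypothetical helper written for illustration, not a tool we actually used:

```python
import json
from pathlib import Path

def repos_with_webdriverio(root):
    """Find checked-out repositories whose package.json declares
    webdriverio, mirroring the ack search above."""
    found = set()
    for pkg in Path(root).rglob("package.json"):
        if "node_modules" in pkg.parts:  # skip installed dependencies
            continue
        try:
            data = json.loads(pkg.read_text())
        except (OSError, ValueError):
            continue
        deps = {**data.get("dependencies", {}),
                **data.get("devDependencies", {})}
        if "webdriverio" in deps:
            found.add(str(pkg.parent))
    return sorted(found)
```

The same idea applies to the Ruby side by grepping Gemfile.lock files for mediawiki_selenium.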
This is the list.
Repository | Language |
---|---|
mediawiki/core | JavaScript |
mediawiki/extensions/AdvancedSearch | JavaScript |
mediawiki/extensions/CentralAuth | Ruby |
mediawiki/extensions/CentralNotice | Ruby |
mediawiki/extensions/CirrusSearch | JavaScript |
mediawiki/extensions/Cite | JavaScript |
mediawiki/extensions/Echo | JavaScript |
mediawiki/extensions/ElectronPdfService | JavaScript |
mediawiki/extensions/GettingStarted | Ruby |
mediawiki/extensions/Math | JavaScript |
mediawiki/extensions/MobileFrontend | Ruby |
mediawiki/extensions/MultimediaViewer | Ruby |
mediawiki/extensions/Newsletter | JavaScript |
mediawiki/extensions/ORES | JavaScript |
mediawiki/extensions/Popups | JavaScript |
mediawiki/extensions/QuickSurveys | Ruby |
mediawiki/extensions/RelatedArticles | JavaScript |
mediawiki/extensions/RevisionSlider | Ruby |
mediawiki/extensions/TwoColConflict | JavaScript, Ruby |
mediawiki/extensions/Wikibase | JavaScript, Ruby |
mediawiki/extensions/WikibaseLexeme | JavaScript, Ruby |
mediawiki/extensions/WikimediaEvents | PHP |
mediawiki/skins/MinervaNeue | Ruby |
phab-deployment | JavaScript |
wikimedia/community-tech-tools | Ruby |
wikimedia/portals/deploy | JavaScript |
After reviewing several tools, I found that we already use Bitergia for various metrics. There is even a nice list of the top 50 repositories by number of commits. The tool even supports limiting the report to a date range. Exactly what I needed.
Bitergia > Last 90 days > Absolute > From 2017-11-01 00:00:00.000 > To 2018-10-31 23:59:59.999 > Go > Git > Overview > Repositories (raw data: P7776, direct link).
This is the top 50 list (excludes empty commits and bots).
Repository | Commits |
---|---|
mediawiki/extensions | 11300 |
operations/puppet | 7988 |
mediawiki/core | 4590 |
operations/mediawiki-config | 4005 |
integration/config | 1652 |
operations/software/librenms | 1169 |
pywikibot/core | 927 |
mediawiki/extensions/Wikibase | 806 |
apps/android/wikipedia | 789 |
mediawiki/services/parsoid | 700 |
mediawiki/extensions/VisualEditor | 692 |
operations/dns | 653 |
VisualEditor/VisualEditor | 599 |
mediawiki/skins | 570 |
mediawiki/extensions/MobileFrontend | 504 |
mediawiki/extensions/ContentTranslation | 491 |
translatewiki | 486 |
oojs/ui | 469 |
wikimedia/fundraising/crm | 457 |
mediawiki/extensions/BlueSpiceFoundation | 414 |
mediawiki/extensions/CirrusSearch | 357 |
mediawiki/extensions/AbuseFilter | 306 |
phabricator/phabricator | 302 |
mediawiki/services/restbase | 290 |
mediawiki/extensions/Flow | 232 |
mediawiki/extensions/Echo | 223 |
mediawiki/vagrant | 221 |
mediawiki/extensions/Popups | 184 |
mediawiki/extensions/Translate | 182 |
mediawiki/extensions/DonationInterface | 180 |
analytics/refinery | 178 |
mediawiki/extensions/PageTriage | 177 |
mediawiki/extensions/Cargo | 176 |
mediawiki/tools/codesniffer | 156 |
mediawiki/extensions/TimedMediaHandler | 152 |
mediawiki/extensions/UniversalLanguageSelector | 142 |
mediawiki/vendor | 140 |
mediawiki/extensions/SocialProfile | 139 |
analytics/refinery/source | 138 |
operations/software | 137 |
mediawiki/services/restbase/deploy | 136 |
operations/debs/pybal | 123 |
mediawiki/extensions/CentralAuth | 116 |
mediawiki/tools/release | 116 |
mediawiki/services/cxserver | 112 |
mediawiki/extensions/BlueSpiceExtensions | 110 |
mediawiki/extensions/WikimediaEvents | 110 |
labs/private | 108 |
operations/debs/python-kafka | 104 |
labs/tools/heritage | 96 |
I got similar results by running git rev-list for all repositories (script, results: P7834).
This proved to be the most time consuming task.
I have started by reviewing existing incident documentation. Take a look at a few incidents. Can you tell which incident report is connected to which repository? I couldn't. (If you can, please let me know. I need your help.)
Incident reports are a wall of text. It was really hard for me to connect an incident report to a repository. An incident report has a title and text, example: 20180724-Train. Text has several sections, including Actionables. Text contains links to Gerrit patches and Phabricator tasks. (From now on, I'll use patches instead of Gerrit patches and tasks instead of Phabricator tasks.)
A patch belongs to a repository. Wikitext [[gerrit:448103]] is patch mediawiki/extensions/Wikibase/+/448103, so the repository is mediawiki/extensions/Wikibase. That is the strongest link between an incident and a repository.
A task usually has patches associated with it. Wikitext [[phab:T181315]] is task T181315. Gerrit search bug:T181315 finds many connected patches, many of them in operations/puppet and one in mediawiki/vagrant. That is a useful, but not a strong, link between an incident and a repository. Some tasks have several related patches, so it provides a lot of data.
A task also usually has several tags. Most of them are not useful in this context, but tags that are components (and not for example milestones or tags) could be useful, if the component can be linked to a repository. It is also not a strong link between an incident and a repository, and it usually does not provide a lot of data.
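The kind of link extraction described above can be sketched with a couple of regular expressions. This is an illustrative helper, not code from the actual Incident Documentation tool:

```python
import re

def extract_links(wikitext):
    """Pull Gerrit change numbers and Phabricator task IDs out of
    incident-report wikitext such as [[gerrit:448103]] or [[phab:T181315]]."""
    patches = re.findall(r"\[\[gerrit:(\d+)\]\]", wikitext)
    tasks = re.findall(r"\[\[phab:(T\d+)\]\]", wikitext)
    return patches, tasks

report = "Actionables: [[gerrit:448103]] merged, follow-up in [[phab:T181315]]."
print(extract_links(report))  # (['448103'], ['T181315'])
```

A change number can then be resolved to its repository via the Gerrit API, and a task to its patches via a bug:Txxxxx search.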
At the end, I wrote a tool with an imaginative name, Incident Documentation. The tool currently collects data from patches and tasks in the Actionables section of each incident report. It does not collect data from task components. That is tracked as issue #5.
After reviewing the Actionables section of each incident report, with its related patches and tasks, here are the results. Please note this table only connects incident reports and repositories. It does not show how many patches from a repository are connected to an incident report. That is tracked as issue #11.
Repository | Incidents |
---|---|
operations/puppet | 22 |
mediawiki/core | 6 |
operations/mediawiki-config | 4 |
mediawiki/extensions/Wikibase | 4 |
wikidata/query/rdf | 2 |
operations/debs/pybal | 2 |
mediawiki/extensions/ORES | 2 |
integration/config | 2 |
wikidata/query/blazegraph | 1 |
operations/software | 1 |
operations/dns | 1 |
mediawiki/vagrant | 1 |
mediawiki/tools/release | 1 |
mediawiki/services/ores/deploy | 1 |
mediawiki/services/eventstreams | 1 |
mediawiki/extensions/WikibaseQualityConstraints | 1 |
mediawiki/extensions/PropertySuggester | 1 |
mediawiki/extensions/PageTriage | 1 |
mediawiki/extensions/Cognate | 1 |
mediawiki/extensions/Babel | 1 |
maps/tilerator/deploy | 1 |
maps/kartotherian/deploy | 1 |
integration/jenkins | 1 |
eventlogging | 1 |
analytics/refinery/source | 1 |
analytics/refinery | 1 |
All-Projects | 1 |
This table is sorted by the amount of change. The only column that needs explanation is Selected. It shows if a test makes sense for the repository, taking into account all available data. Repositories without maintainers and with existing tests are excluded.
Repository | Change | Stewards | Coverage | Incidents | Selected |
---|---|---|---|---|---|
mediawiki/extensions | 11300 | ||||
operations/puppet | 7988 | SRE | 22 | ||
mediawiki/core | 4590 | Core Platform | JavaScript | 6 | |
operations/mediawiki-config | 4005 | Release Engineering | 4 | ||
integration/config | 1652 | Release Engineering | 2 | ||
operations/software/librenms | 1169 | SRE | |||
pywikibot/core | 927 | ||||
mediawiki/extensions/Wikibase | 806 | WMDE | JavaScript, Ruby | 4 | |
apps/android/wikipedia | 789 | ||||
mediawiki/services/parsoid | 700 | Parsing | |||
mediawiki/extensions/VisualEditor | 692 | Editing | ✅ | ||
operations/dns | 653 | SRE | 1 | ||
VisualEditor/VisualEditor | 599 | Editing | |||
mediawiki/skins | 570 | Reading | |||
mediawiki/extensions/MobileFrontend | 504 | Reading | Ruby | ||
mediawiki/extensions/ContentTranslation | 491 | Language engineering | ✅ | ||
translatewiki | 486 | ||||
oojs/ui | 469 | ||||
wikimedia/fundraising/crm | 457 | Fundraising tech | |||
mediawiki/extensions/BlueSpiceFoundation | 414 | ||||
mediawiki/extensions/CirrusSearch | 357 | Search Platform | JavaScript | ||
mediawiki/extensions/AbuseFilter | 306 | Contributors | ✅ | ||
phabricator/phabricator | 302 | Release Engineering | ✅ | ||
mediawiki/services/restbase | 290 | Core Platform | |||
mediawiki/extensions/Flow | 232 | Growth | ✅ | ||
mediawiki/extensions/Echo | 223 | Growth | JavaScript | ||
mediawiki/vagrant | 221 | Release Engineering | 1 | ||
mediawiki/extensions/Popups | 184 | Reading | JavaScript | ||
mediawiki/extensions/Translate | 182 | Language engineering | ✅ | ||
mediawiki/extensions/DonationInterface | 180 | Fundraising tech | ✅ | ||
analytics/refinery | 178 | Analytics | 1 | ||
mediawiki/extensions/PageTriage | 177 | Growth | 1 | ✅ | |
mediawiki/extensions/Cargo | 176 | ||||
mediawiki/tools/codesniffer | 156 | ||||
mediawiki/extensions/TimedMediaHandler | 152 | Reading | ✅ | ||
mediawiki/extensions/UniversalLanguageSelector | 142 | Language engineering | ✅ | ||
mediawiki/vendor | 140 | ||||
mediawiki/extensions/SocialProfile | 139 | ||||
analytics/refinery/source | 138 | Analytics | 1 | ||
operations/software | 137 | SRE | 1 | ||
mediawiki/services/restbase/deploy | 136 | Core Platform | |||
operations/debs/pybal | 123 | SRE | 2 | ||
mediawiki/extensions/CentralAuth | 116 | Ruby | |||
mediawiki/tools/release | 116 | 1 | |||
mediawiki/services/cxserver | 112 | ||||
mediawiki/extensions/BlueSpiceExtensions | 110 | ||||
mediawiki/extensions/WikimediaEvents | 110 | PHP | |||
labs/private | 108 | ||||
operations/debs/python-kafka | 104 | SRE | |||
labs/tools/heritage | 96 | ||||
Since some of the repositories connected to incidents are not in the top 50 Bitergia report, I've used git rev-list to sort them. Numbers are different because Bitergia excludes empty commits and bots (script, results: P7834).
Repository | Change | Stewards | Coverage | Incidents | Selected |
---|---|---|---|---|---|
mediawiki/extensions/WikibaseQualityConstraints | 910 | WMDE | 1 | ✅ | |
mediawiki/extensions/ORES | 364 | Growth | JavaScript | 2 | |
wikidata/query/rdf | 204 | WMDE | 2 | ||
mediawiki/extensions/Babel | 146 | Editing | 1 | ✅ | |
mediawiki/services/ores/deploy | 84 | Growth | 1 | ||
maps/kartotherian/deploy | 80 | 1 | |||
mediawiki/extensions/PropertySuggester | 67 | WMDE | 1 | ✅ | |
maps/tilerator/deploy | 61 | 1 | |||
mediawiki/extensions/Cognate | 47 | WMDE | 1 | ✅ | |
All-Projects | 37 | 1 | |||
eventlogging | 26 | 1 | |||
integration/jenkins | 19 | Release Engineering | 1 | ||
mediawiki/services/eventstreams | 16 | 1 | |||
wikidata/query/blazegraph | 10 | WMDE | 1 | ||
Change column uses Bitergia numbers. Numbers in italic are from git rev-list.
Repository | Change | Stewards | Coverage | Incidents | Selected |
---|---|---|---|---|---|
mediawiki/extensions/VisualEditor | 692 | Editing | ✅ | ||
mediawiki/extensions/ContentTranslation | 491 | Language engineering | ✅ | ||
mediawiki/extensions/AbuseFilter | 306 | Contributors | ✅ | ||
phabricator/phabricator | 302 | Release Engineering | ✅ | ||
mediawiki/extensions/Flow | 232 | Growth | ✅ | ||
mediawiki/extensions/Translate | 182 | Language engineering | ✅ | ||
mediawiki/extensions/DonationInterface | 180 | Fundraising tech | ✅ | ||
mediawiki/extensions/PageTriage | 177 | Growth | 1 | ✅ | |
mediawiki/extensions/TimedMediaHandler | 152 | Reading | ✅ | ||
mediawiki/extensions/UniversalLanguageSelector | 142 | Language engineering | ✅ | ||
mediawiki/extensions/WikibaseQualityConstraints | 910 | WMDE | 1 | ✅ | |
mediawiki/extensions/Babel | 146 | Editing | 1 | ✅ | |
mediawiki/extensions/PropertySuggester | 67 | WMDE | 1 | ✅ | |
mediawiki/extensions/Cognate | 47 | WMDE | 1 | ✅ | |
The same table grouped by stewards.
Repository | Change | Stewards | Coverage | Incidents | Selected |
---|---|---|---|---|---|
mediawiki/extensions/VisualEditor | 692 | Editing | ✅ | ||
mediawiki/extensions/Babel | 146 | Editing | 1 | ✅ | |
mediawiki/extensions/ContentTranslation | 491 | Language engineering | ✅ | ||
mediawiki/extensions/Translate | 182 | Language engineering | ✅ | ||
mediawiki/extensions/UniversalLanguageSelector | 142 | Language engineering | ✅ | ||
mediawiki/extensions/AbuseFilter | 306 | Contributors | ✅ | ||
phabricator/phabricator | 302 | Release Engineering | ✅ | ||
mediawiki/extensions/Flow | 232 | Growth | ✅ | ||
mediawiki/extensions/PageTriage | 177 | Growth | 1 | ✅ | |
mediawiki/extensions/DonationInterface | 180 | Fundraising tech | ✅ | ||
mediawiki/extensions/TimedMediaHandler | 152 | Reading | ✅ | ||
mediawiki/extensions/WikibaseQualityConstraints | 910 | WMDE | 1 | ✅ | |
mediawiki/extensions/PropertySuggester | 67 | WMDE | 1 | ✅ | |
mediawiki/extensions/Cognate | 47 | WMDE | 1 | ✅ | |
We decided to implement some basic e2e test scenarios which would only run in production, both after someone deploys a change and a few times a day, to cover situations where someone changes a server directly.
Next steps:
Incident Documentation tool improvements:
Halloween is a full two weeks behind us here in the United States, but it's still on my mind. It happens to be my favorite holiday, and I receive it both gleefully and somberly.
Some of the more obvious and delightful ways I appreciate Halloween include: busting out my giant spider to hang in the front yard; getting messy with gory and gaudy decorations; scaring neighborhood children; stuffing candy in my face. What's not to like about all that, really?
But there are more deeply felt reasons to appreciate Halloween, reasons that aren't often fully internalized or even discussed. Rooted in its pagan Celtic traditions and echoed by similar traditions worldwide, like Día de los Muertos of Mexico and Obon of Japan, Halloween asks us, for a night, to put away our timidness about living and dying. It asks us to turn toward the growing darkness of winter, turn toward the ones we've lost, turn toward the decay of our own bodies, and honor these very real experiences as equal partners to the light, birth, and growth embodied by our everyday expectations. More precisely it asks us to turn toward these often difficult aspects of life not with hesitation or fear but with strength, jubilation, a sense of humor. It is this brave posture of Halloween's traditions that I appreciate so very much.
So Halloween is over and I'm looking back. What does that have to do with anything here at WMF and in Phabricator no less? Well, I want to take you into another dark and ominous cauldron of our experience that most would rather just forget about.
I want to show you some Continuous Integration build metrics for the month of October!
Will we see darkness? Oh yes. Will we see decay? Surely. Was that an awkward transition to the real subject of this post? Yep! Sorry, but I just had to have a thematic introduction, and brace yourself with a sigh because the theme will continue.
You see this past October, Release Engineering battled a HORDE OF ZOMBIE CONTAINERS! And we'll be seeing in our metrics proof that this horde was, for longer than anyone wishes zombies to ever hang around, chowing down on the brains of our CI.
Before I get to the zombies, let's look briefly at a big picture view of last month's build durations... Let's also get just a bit more serious.
What are we looking at? We're looking at statistics for build durations. The above chart plots the daily 75th, 95th, and 98th percentiles of successful build durations during the month of October as well as the number of job configuration changes made within the same range of time.
These data points were chosen for a few reasons.
First, percentiles are used instead of daily means to better represent what the vast majority of users experience when they're waiting on CI[1]. This excludes outliers, build durations that occur only about 2 percent of the time, not because they're unimportant to us, but because setting them aside temporarily allows us to find patterns of most common use and issues that might otherwise be obfuscated by the extra noise of extraordinarily long builds.
Next, three percentiles were chosen so that we might look for patterns among both faster builds and the longer running ones. Practically this means we can measure the effects of our changes on the chosen percentiles independently, and if we make changes to improve the build durations of jobs that typically perform closer to one percentile, we can measure the effect discretely while also making sure performance at other percentiles has not regressed.
Finally, job configuration changes are plotted alongside daily duration percentiles to help find indications of whether our changes to integration/config during October had an impact on overall build performance. Of course, measuring the exact impact of these changes is quite a bit more difficult and requires the build data used to populate this chart to be classified and analyzed much further—as we'll see later—but having the extra information there is an important first step.
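For the curious, here's roughly how such daily percentiles are computed (the durations here are made up; the real data lives in the spreadsheet linked at the end[3]):

```python
import numpy as np

# Hypothetical build durations (in minutes) for one day's successful builds.
durations = np.array([2.1, 3.4, 3.9, 5.3, 6.0, 7.2, 9.8, 12.7, 17.6, 26.5])

# The three daily percentiles plotted in the chart.
p75, p95, p98 = np.percentile(durations, [75, 95, 98])
print(p75, p95, p98)
```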
So what can we see in this chart? Well, let's start with that very conspicuous dip smack dab in the middle.
And for background, another short thematic interlude:
Back in June, @thcipriani of Release Engineering was waiting on a particularly long build to complete—it was a "dark and stormy night" or something, *sighs and rolls eyes*—and during his investigation on the labs instance that was running the build, he noticed a curious thing: There was a Docker container just chugging away running a build that had started more than 6 hours prior, a build that was thought to have been canceled and reaped by Jenkins, a build that should have been long dead but was sitting there very much undead and seemingly loving its long and private binge before the terminal specter of a meat-space man had so rudely interrupted.
"It's a zombie container," @thcipriani (probably) muttered as he felt his way backward on outstretched fingertips (ctrl-ccccc), logged out, and filed task T198517 to which @hashar soon replied and offered a rational but disturbing explanation.
I'm not going to explain the why in its entirety but you can read more about it in the comments of an associated task, T176747, and the links posted therein. I will, however, briefly explain what I mean by "zombie container."
A zombie container, for the sake of this post, is not strictly a zombie process in the POSIX sense. Rather, it is a container whose build's main process is still running even after Jenkins has told it to stop. It both takes up valuable host resources (CPU, memory, or disk space) and is invisible to anyone looking only at the monitoring interfaces of Gerrit, Zuul, or Jenkins.
We didn't see much evidence of these zombie containers having enough impact on the overall system to demand dropping other priorities—and to be perfectly honest, I half assumed that Tyler's account had simply been due to madness after ingesting a bad batch of homebrew honey mead—but the data shows that they continued to lurk and that they may have even proliferated under the generally increasing load on CI. By early October, these zombie containers were wreaking absolute havoc—compounded by the way our CI system deals with chains of dependent builds and superseding patchsets—and it was clear that hunting them down should be a priority.
Task T198517 was claimed and conquered, and to the dismay of zombie containers across CI:
Looking again at that dip in the percentiles chart, a few things are clear.
First, there's a noticeable drop among all three daily duration percentiles. Second, there also seems to be a decrease in both the variance of each day's percentile average expressed by the plotted error bars—remember that our percentile precision demands we average multiple values for each percentile/day—and the day-to-day differences in plotted percentiles after the dip. And lastly, the dip strongly coincides with the job configuration changes that were made to resolve T198517.
WE. DID. IT. WE'VE FREED CI FROM THOSE DREADED ZOMBIE CONTAINERS! THEY ARE TRULY (UN)^2-DEAD AGAIN SO LET'S DITCH THESE BORING CHARTS AND CELEBRA...
Say what? Oh. Right. I guess we didn't adequately measure exactly how much of an improvement in duration there was pre- and post-T198517, or whether there was any unnoticed or unanticipated regression. Let's pause that celebration and look a little deeper.
So how does one get a bigger picture of overall CI build durations before and after a change? Or of categories within any real and highly heterogeneous performance data for that matter? I did not have a good answer to this question, so I went searching and I found a lovely blog post on analyzing DNS performance across various geo-distributed servers[2]. It's a great read really, and talks about a specific statistical tool that seemed like it might be useful in our case: The logarithmic percentile histogram.
"I like the way you talk..." Yes, it's a fancy name, but it's pretty simple when broken down... backwards, because, well, English.
A histogram shows the distribution of one quantitative variable in a dataset, in our case build duration, across various 'buckets'. A percentile histogram buckets values for the variable of the histogram by its percentiles, and a logarithmic percentile histogram plots the distribution of values across percentile buckets on a logarithmic scale.
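In code, the bucketing looks roughly like this (synthetic lognormal durations stand in for the real build data, and the semilogx call in the comment is just the usual matplotlib way to get a logarithmic x-axis):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for build durations (minutes); real data is in the sheet.
durations = rng.lognormal(mean=1.5, sigma=0.8, size=5000)

# High-to-low ranking as used in the post: p1 holds the slowest builds,
# p100 the fastest.
ranks = np.array([0.2, 1, 2, 10, 25, 50, 75, 95, 100])
buckets = np.percentile(durations, 100 - ranks)

# Plotting ranks on a logarithmic x-axis (e.g. plt.semilogx(ranks, buckets))
# brings the slowest percentiles into focus.
for r, b in zip(ranks, buckets):
    print(f"p{r:g}: {b:.1f} min")
```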
I think it's a bit easier to show than to describe, so here's our plot of build duration percentiles before and after T198517 was resolved, represented as a histogram on a logarithmic scale.
First, note that while we ranked build durations low to high in our other chart, this one presents a high-to-low ranking, meaning that longer durations (slower builds) are ranked within lower percentiles and shorter durations (faster builds) are ranked in higher percentiles. This better fits the logarithmic scale, and more importantly it brings the lowest percentiles (the slowest durations) into focus, letting us see where the biggest gains were made by resolving the zombie container issue.
Also valuable about this representation is that it shows all percentiles, not just the three we saw earlier in the chart of daily calculations. It shows that gains were made consistently across the board, with no notable regressions among the percentile ranks where it would matter—there is a small section of the plot where post-T198517 durations are slightly higher (slower), but it falls among the percentiles for the very fastest builds, where the absolute differences are very small and perhaps not even statistically significant.
Looking at the percentage gains annotated parenthetically in the plot, we can see major gains at the 0.2, 1, 2, 10, 25, and 50th percentiles. Here they are as a table.
percentile | duration w/ zombies (minutes) | w/o zombies (minutes) | gain from killing zombies |
---|---|---|---|
p0.2 | 43.3 | 39.3 | -9.2% |
p1 | 34.0 | 26.5 | -22.2% |
p2 | 27.7 | 22.2 | -19.7% |
p10 | 17.6 | 12.7 | -27.9% |
p25 | 11.0 | 7.2 | -34.4% |
p50 | 5.3 | 3.4 | -36.9% |
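If you want to check my arithmetic, here's the gain calculation run over the rounded minutes in the table above; note the plot's parenthetical values were computed from unrounded data, so the last decimal can differ slightly:

```python
# Durations in minutes at each percentile, from the table above.
with_zombies = {"p0.2": 43.3, "p1": 34.0, "p2": 27.7, "p10": 17.6, "p25": 11.0, "p50": 5.3}
without_zombies = {"p0.2": 39.3, "p1": 26.5, "p2": 22.2, "p10": 12.7, "p25": 7.2, "p50": 3.4}

# Percentage change relative to the zombie-infested baseline.
gains = {
    p: round((without_zombies[p] - with_zombies[p]) / with_zombies[p] * 100, 1)
    for p in with_zombies
}
print(gains)
```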
So there it is quite plain, a CI world with and without zombie containers, and builds running upwards of 37% faster without those zombies chomping away at our brains! It's demonstrably a better world without them, I'd say, but you be the judge; we all have different tastes. 8D
Now celebrate or don't celebrate accordingly!
Oh and please have at the data[3] yourself if you're interested in it. Better yet, find all the ways I screwed up and let me know! It was all done in a giant Google Sheet—that might crash your browser—because, well, I don't know R! (Side note: someone please teach me how to use R.)
[1] https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/
[2] https://blog.apnic.net/2017/11/24/dns-performance-metrics-logarithmic-percentile-histogram/
[3] https://docs.google.com/spreadsheets/d/1-HLTy8Z4OqatLnufFEszbqkS141MBXJNEPZQScDD1hQ/edit#gid=1462593305
Thanks to @thcipriani and @greg for their review of this post!
//"DOCKER ZOMBIE" is a derivative of https://linux.pictures/projects/dark-docker-picture-in-playing-cards-style and shared under the same idgaf license as original https://linux.pictures/about. It was inspired by but not expressly derived from a different work by drewdomkus https://flickr.com/photos/drewdomkus/3146756158//
This survey will help the Release Engineering team measure developer satisfaction and determine where to invest resources. The topics covered will include the following:
We are soliciting feedback from all Wikimedia developers, including staff, third-party contributors, and volunteer developers. The survey will be open for 2 weeks, closing on November 14th.
This survey will be conducted via a third-party service, which may subject it to additional terms. For more information on privacy and data-handling, see the survey privacy statement.
To participate in this survey, please start here: Developer Satisfaction Survey.
Mukunda Modell
How’d we do in our pursuit of operational excellence last month? Read on to find out!
October had a relatively high number of incidents – compared to prior months and compared to the same month last year (details).
Terminology:
I’ve highlighted a few of last month’s resolved tasks below.
Fixed by volunteer @Mh-3110 (Mahuton).
The Thanks functionality for MediaWiki (created in 2013) wasn’t working in some cases. This problem was first reported in April, with four more reports since then. Mahuton investigated together with @SBisson. They found that the issue was specific to talk pages with structured discussions.
It turned out to be caused by an outdated array access key in SpecialThanks.php. Once the key was adjusted, the functionality was restored to its former glory. The error had existed for about eight months, since internal refactoring in March for T186920 changed the internal array.
This was Mahuton’s first Gerrit contribution. Thank you @Mh-3110, and welcome!
– T191442 / https://gerrit.wikimedia.org/r/461189
Fixed by volunteer @D3r1ck01 (Derick Alangi).
Administrators use the Special:DeletedContributions page to search for edits that are hidden from public view. When an admin typed a space at the end of their search, the MediaWiki application would throw a fatal exception. The user would see a generic error page, suggesting that the website may be unavailable.
Derick went in and updated the input handler to automatically correct these inputs for the user.
– T187619
Accessing the private link for ContentTranslation when logged out isn't meant to work, but the code didn't account for this fact. When users attempted to open such a URL while not logged in, the ContentTranslation code performed an invalid operation. This caused a fatal error from the MediaWiki application, and the user would see a system error page without further details.
This could happen when opening the link from your bookmarks before logging in, after restarting the browser, or after clearing your cookies.
Fixed by @santhosh (Santhosh Thottingal, WMF Language Engineering team).
– T205433
Thank you to everyone who helped by reporting or investigating problems in Wikimedia production; and for devising, coding or reviewing the corrective measures. Including: @Addshore, @Aklapper, @Anomie, @ArielGlenn, @Catrope, @D3r1ck01, @Daimona, @Fomafix, @Ladsgroup, @Legoktm, @MSantos, @Mainframe98, @Melos, @Mh-3110, @SBisson, @Tgr, @Umherirrender, @Vort, @aaron, @aezell, @cscott, @dcausse, @jcrespo, @kostajh, @matmarex, @mmodell, @mobrovac, @santhosh, @thcipriani, and @thiemowmde.
Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the week in which they were first observed.
https://phabricator.wikimedia.org/tag/wikimedia-production-error
Thanks!
Until next time,
– Timo Tijhof
Footnotes:
[1] Incidents. – wikitech.wikimedia.org/wiki/Special:AllPages...
[2] Tasks closed. – phabricator.wikimedia.org/maniphest/query...
[3] Tasks opened. – phabricator.wikimedia.org/maniphest/query...
How’d we do in our pursuit of operational excellence last month? Read on to find out!
Frequent:
Other:
This is an oldie: (Well..., it's an oldie where I come from... 🎸)
Terminology:
The combined volume of infrequent non-fatal errors is high. This limits our ability to automatically detect whether a deployment caused problems. The “public GET” risks in particular can cause (and have caused) alerts to fire that notify Operations of wikis potentially being down. Such exceptions must not be publicly exposed.
With that behind us... Let’s celebrate this month’s highlights!
Tyler Cipriani (Release Engineering) reported an error in Quiz. Wikiversity uses Quiz for interactive learning. Editors define quizzes in the source text (wikitext). The Quiz program processes this text, creates checkboxes with labels, and sends it to a user. When the sending part failed, "Error: Undefined index" appeared in the logs. @Umherirrender investigated.
A line in the source text can define a question, an answer, or nothing at all. The code that creates checkboxes needs to decide between "something" and "nothing". It used a PHP "if" statement for this, which coerces a value to either True or False. The answers to a quiz can be any text, which means PHP first transforms the text into one of True or False, and in doing so values like "0" became False. The code therefore thought "0" was not an answer. The code responsible for sending checkboxes did not have this problem, so when it tried to access the checkbox to send, it did not exist. Hence, "Error: Undefined index".
Umherirrender fixed the problem by using a strict comparison. A strict comparison doesn't transform the value first; it only compares.
– T196684
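For the curious, here's a hypothetical Python analogue of the bug; in PHP it's the string "0" that coerces to False, while in Python the number 0 does, but the class of bug (and the strict-check fix) is the same:

```python
def has_answer_buggy(value):
    # Loose check, like PHP's plain `if`: coerces the value to a boolean,
    # so a falsy-but-present answer (0, "", ...) is treated as "nothing".
    return bool(value)

def has_answer_strict(value):
    # Strict check: only the absence of a value means "nothing".
    return value is not None

print(has_answer_buggy(0), has_answer_strict(0))  # False True
```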
Kosta Harlan (from Audiences's Growth team) investigated a warning for PageTriage. This extension provides the New Pages Feed tool on the English Wikipedia. Each page in the feed has metadata, usually calculated when an editor creates a page. Sometimes, this is not available. Then, it must be calculated on-demand, when a user triages pages. So far, so good. The information was then saved to the database for re-use by other triagers. This last part caused the serious performance warning: "Unexpected database writes".
Database changes must not happen on page views. The database has many replicas for reading, but only one "master" for all writing. We avoid using the master during page views to make our systems independent. This is a key design principle for MediaWiki performance. [5] It lets a secondary data centre build pages without connecting to the primary (which can be far away).
Kosta addressed the warning by improving the code that saves the calculated information. Instead of saving it immediately, an instruction is now sent via a job queue, after the page view is ready. This job queue then calculates and saves the information to the master database. The master synchronises it to replicas, and then page views can use it.
– T199699 / https://gerrit.wikimedia.org/r/455870
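Here's a hypothetical Python sketch of that deferred-write pattern; MediaWiki's real job queue is a PHP subsystem, so all names here are made up:

```python
import queue
import threading

job_queue = queue.Queue()   # stand-in for MediaWiki's job queue
master_db = {}              # stand-in for the single writable master

def compute_metadata(page_id):
    # Hypothetical on-demand calculation.
    return {"page": page_id, "reviewed": False}

def on_page_view(page_id):
    # Read path: may compute, but must never write to the master here.
    metadata = master_db.get(page_id)
    if metadata is None:
        metadata = compute_metadata(page_id)
        job_queue.put(page_id)  # defer the write until after the page view
    return metadata

def job_runner():
    # Runs outside any page view; the only code path that writes.
    while True:
        page_id = job_queue.get()
        master_db[page_id] = compute_metadata(page_id)
        job_queue.task_done()

threading.Thread(target=job_runner, daemon=True).start()
```

The page view stays read-only, and the queue worker is free to talk to the master on its own schedule.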
After developers submit code to Gerrit, they eagerly await the result from Jenkins, an automated test runner. It sometimes incorrectly reported a problem with the MergeHistory feature. The code assumed that the tests would finish by "tomorrow".
It might be safe to assume our tests will not take a whole day to finish. Unfortunately, the PHP utility "strtotime" does not interpret "tomorrow" as "this time tomorrow". Instead, it means "the start of tomorrow", in other words, the next strike of midnight! The tests use UTC as the neutral timezone.
Every day, in the 15 minutes before 5 PM in San Francisco (which is midnight UTC), code submitted for review could have mysteriously failing tests.
– Continue at https://gerrit.wikimedia.org/r/452873
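Python's datetime makes the off-by-a-day easy to see; the times here are made up, and PHP's strtotime("tomorrow") behaves like the second value below:

```python
from datetime import datetime, timedelta, timezone

# A test running 15 minutes before midnight UTC (5 PM in San Francisco).
now = datetime(2018, 9, 1, 23, 45, tzinfo=timezone.utc)

# What the code meant: "this time tomorrow", a full day away.
this_time_tomorrow = now + timedelta(days=1)

# What strtotime("tomorrow") returns: the start of tomorrow,
# i.e. the next strike of midnight, only 15 minutes away.
start_of_tomorrow = (now + timedelta(days=1)).replace(
    hour=0, minute=0, second=0, microsecond=0
)

print(this_time_tomorrow - now)  # 1 day, 0:00:00
print(start_of_tomorrow - now)   # 0:15:00
```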
In August, developers started to notice rare and mysterious failures from Jenkins. No obvious cause or solution was known at that time.
Later that month, Dan Duvall (Release Engineering team) started exploring ways to run our tests faster. Before, we had many small virtual servers, where each server ran only one test at a time. The idea: have a smaller group of much larger virtual servers, where each server could run many tests at the same time. We hope that during busier times this will better share the resources between tests and, during less busy times, allow a single test to use more resources.
As implementation of this idea began, the mysterious test failures became commonplace. "No space left on device" was a common error: the test servers had their hard disks full. This was surprising, as the new (larger) servers seemed to have enough space to accommodate the number of tests they ran at the same time. Together with Antoine Musso and Tyler Cipriani, Dan identified and resolved two problems:
Thank you to everyone who has helped report, investigate, or resolve production errors this past month. Including:
Tpt
Ankry
Daimona
Legoktm
Volker_E
Pchelolo
Dan Duvall
Gilles Dubuc
Daniel Kinzler
Umherirrender
Greg Grossmeier
Gergő Tisza (Tgr)
Sam Reed (Reedy)
Giuseppe Lavagetto
Brad Jorsch (Anomie)
Tim Starling (tstarling)
Kosta Harlan (kostajh)
Jaime Crespo (jcrespo)
Antoine Musso (hashar)
Roan Kattouw (Catrope)
Adam WMDE (Addshore)
Stephane Bisson (SBisson)
Niklas Laxström (Nikerabbit)
Thiemo Kreuz (thiemowmde)
Subramanya Sastry (ssastry)
This, that and the other (TTO)
Manuel Aróstegui (Marostegui)
Bartosz Dziewoński (matmarex)
James D. Forrester (Jdforrester-WMF)
Thanks!
Until next time,
– Timo Tijhof
Further reading:
Footnotes:
[1] Incidents. – https://wikitech.wikimedia.org/wiki/Special:AllPages?from=Incident+documentation%2F20180809&to=Incident+documentation%2F20180922&namespace=0
[2] Tasks closed. – https://phabricator.wikimedia.org/maniphest/query/wOuWkMNsZheu/#R
[3] Tasks opened. – https://phabricator.wikimedia.org/maniphest/query/6HpdI76rfuDg/#R
[4] Quiz on Wikiversity. – https://en.wikiversity.org/wiki/How_things_work_college_course/Conceptual_physics_wikiquizzes/Velocity_and_acceleration
[5] Operate multiple datacenters. – https://www.mediawiki.org/wiki/Requests_for_comment/Master-slave_datacenter_strategy_for_MediaWiki
Note: this post was published on 03/28 but was originally written in September 2018, after Quibble 0.0.26, and had never been published until now.
The last update about Quibble is from June 1st (Blog Post: Quibble in May); this post covers the progress made over the summer.
Since the last update, Quibble version went from 0.0.17 to 0.0.26:
For --commands, one passes shell snippets such as: --commands 'echo starting' 'phpunit' 'echo done'. A future version of Quibble will accept only a single snippet per flag, though the flag can be repeated; in other words, in the future one would use: --command 'echo starting' --command 'phpunit' --command 'echo done'.
The MediaWiki PHPUnit test suite to use is determined based on ZUUL_PROJECT. --phpunit-testsuite lets one explicitly set it; a use case is running extension tests for a change made to mediawiki/core to ensure it does not break extensions (ZUUL_PROJECT=mediawiki/core quibble --phpunit-testsuite=extensions mediawiki/extensions/BoilerPlate). On Wikimedia CI these are the wmf-quibble-* jobs.
You can get great speed up by using a tmpfs for the database. Create a tmpfs and then pass --db-dir to make use of it. With a Docker container one would do: docker run --tmpfs /workspace/db:size=320M quibble:latest --db-dir=/workspace/db.
In the future, I would like Quibble to be faster: it currently runs the commands serially, and could be sped up by parallelizing at least some of the test commands (edit: done in 0.0.29).
Changelog for 0.0.17 to 0.0.26
This blog post will describe a bit about how we are utilizing the "Task Types" feature in Phabricator to facilitate better tracking of work and to streamline workflows with custom fields. Additionally, I will be soliciting feedback about use-cases which could take further advantage of this feature.
Task Types are a relatively new feature in Phabricator which allow tasks to be created with extra information fields that are unique to tasks of a given type. For example, Release tasks have a release date and release version which are not relevant for other types of tasks.
Another task type that has been recently introduced is the deadline type. Deadlines include a single extra field Due Date which is displayed at the top of the task view as well as on workboard cards.
Deadline | Release |
---|---|
Task types have the potential to streamline workflows and support the use of Phabricator for collecting structured data.
One proposed use of task types is for collecting specific information in bug reports and feature requests. Bug reports, for example, might ask for OS or Browser version in separate fields to aid in sorting and searching through reports.
Another potential use-case which is currently being developed is a security issue task type. This will allow the security team to add fields relevant to security issues without cluttering the task form used by everyone for other types of tasks.
Custom forms can be created which hide irrelevant fields and generally streamline the process of submitting a task for a given workflow or for a team's specific use-case. This is a great feature in Phabricator and we have made extensive use of it for various purposes. The drawback to custom forms is that they are generally only useful for submitting tasks. Once a task is created, editing takes place on the normal "generic" task edit form.
It's now possible to assign a type to a form, and to configure forms so that whenever you edit a Security task you always see the Edit Security Task form. Thanks to typed forms, we can now add custom fields which are always visible when editing one type of task but hidden when editing other types.
Security Issue Form | Standard Form |
---|---|
Your feedback will be helpful in shaping the types of tasks and forms available in Phabricator. In order to best meet the needs of everyone who uses Phabricator, I'd love to hear your input on what forms and fields would be most useful for your needs. Describe a workflow or a use-case that you think would be well served by custom fields. You can comment here or on the task: T93499: Add support for task types (subtypes)
It has been a while since the last mediawiki_selenium release! 💎
I have just released version 1.8.1. 🚀
Notable changes:
I would like to thank several contributors that have improved the gem since the last release: @hashar, @Rammanojpotla, @demon and @thiemowmde! 👏
[Quibble] is the new test runner for MediaWiki (see the intro Blog Post: Introducing Quibble). This post is to give an update of what happened during May 2018.
Željko Filipin wrote a blog post Blog Post: Run Selenium tests using Quibble and Docker.
Since the last update, Quibble version went from 0.0.11 to 0.0.17:
The documentation could use tutorials for various use cases. It is in integration/quibble.git in the doc/source directory. You should be able to generate it by simply running:
tox -e doc
Then open doc/build/index.html in your web browser.
Any support or question you might have are most welcome as a Phabricator task against Quibble.
I have migrated MediaWiki and a lot of extensions to use the Quibble jobs. There are still 229 MediaWiki extensions left to migrate. A test report is built daily by Jenkins:
https://integration.wikimedia.org/ci/job/integration-config-qa/lastCompletedBuild/testReport/
The "test_mediawiki_repos_use_quibble" tests represent extensions not yet migrated. T183512 is the huge tracking task.
Make MediaWiki tests pass with Postgres!
T195807: Fix failing MediaWiki core tests on Postgres database backend
Huge thanks to Kunal Mehta, Timo Tijhof, Adam Wight, Željko Filipin and Stephen Niedzielski.
That is all for May 2018.
References
[Quibble]
https://lists.wikimedia.org/pipermail/wikitech-l/2018-April/089812.html
[Presentation]
https://commons.wikimedia.org/wiki/File:20180519-QuibblePres.pdf
[Last update]
https://lists.wikimedia.org/pipermail/wikitech-l/2018-April/089858.html
One particularly interesting topic discussed during the Hackathon Technical Debt session (T194934) was that of the contagious aspect of technical debt. Although this makes sense in hindsight, it's not something that I had really given much thought to previously.
The basic premise is that existing technical debt can have a contagious effect on other areas of code. One aspect of this is developers new to the MediaWiki code base may use existing code as a pattern for new code development. If that code has technical debt, the technical debt could get replicated in other areas of code.
This can be overcome with both education about desired patterns as well as sharing the technical debt state of existing code. It's not clear how best to accomplish the later, but perhaps it's as simple as a comment in the code, once it's been identified and is being tracked in Phabricator.
Another aspect of the contagion effect (perhaps more of a compound effect), is the result of maintaining code with existing technical debt. As bugs are fixed or minor features added, those changes can, in effect, result in a spreading of the technical debt. Of course this doesn't always need to be the case, but it can be, if one is not careful.
I'd like to get your thoughts on this topic and your past experiences working with and around technical debt.
Thoughts/Questions:
Dependencies are Git Python 3, and Docker Community Edition (CE).
First, the general setup.
$ git clone https://gerrit.wikimedia.org/r/p/integration/quibble
...
$ cd quibble/
$ python3 -m pip install -e .
...
$ docker pull docker-registry.wikimedia.org/releng/quibble-stretch:latest
... (2m 26s)
The simplest, and slowest, way to run Quibble.
$ docker run -it --rm \
  docker-registry.wikimedia.org/releng/quibble-stretch:latest
... (12m 54s)
Speed things up by using local repositories.
$ mkdir -p ref/mediawiki/skins
$ git clone --bare https://gerrit.wikimedia.org/r/mediawiki/core ref/mediawiki/core.git
... (3m 40s)
$ git clone --bare https://gerrit.wikimedia.org/r/mediawiki/vendor ref/mediawiki/vendor.git
...
$ git clone --bare https://gerrit.wikimedia.org/r/mediawiki/skins/Vector ref/mediawiki/skins/Vector.git
...
$ mkdir cache
$ chmod 777 cache
$ mkdir -p log
$ chmod 777 log
$ mkdir -p src
$ chmod 777 src
$ docker run -it --rm \
  -v "$(pwd)"/cache:/cache \
  -v "$(pwd)"/log:/workspace/log \
  -v "$(pwd)"/ref:/srv/git:ro \
  -v "$(pwd)"/src:/workspace/src \
  docker-registry.wikimedia.org/releng/quibble-stretch:latest
... (18m 0s)
The second run of everything, just to see if things get faster.
$ docker run -it --rm \
  -v "$(pwd)"/cache:/cache \
  -v "$(pwd)"/log:/workspace/log \
  -v "$(pwd)"/ref:/srv/git:ro \
  -v "$(pwd)"/src:/workspace/src \
  docker-registry.wikimedia.org/releng/quibble-stretch:latest
... (16m 50s)
If you get this error message
A LocalSettings.php file has been detected. To upgrade this installation, please run update.php instead
just remove the file
$ rm src/LocalSettings.php
Speed things up by skipping Zuul and not installing dependencies.
$ docker run -it --rm \
  -v "$(pwd)"/cache:/cache \
  -v "$(pwd)"/log:/workspace/log \
  -v "$(pwd)"/ref:/srv/git:ro \
  -v "$(pwd)"/src:/workspace/src \
  docker-registry.wikimedia.org/releng/quibble-stretch:latest \
  --skip-zuul --skip-deps
... (6m 17s)
Speed things up by just running Selenium tests.
$ docker run -it --rm \
  -v "$(pwd)"/cache:/cache \
  -v "$(pwd)"/log:/workspace/log \
  -v "$(pwd)"/ref:/srv/git:ro \
  -v "$(pwd)"/src:/workspace/src \
  docker-registry.wikimedia.org/releng/quibble-stretch:latest \
  --skip-zuul --skip-deps --run selenium
... (1m 19s)
Running all tests for MediaWiki and matching what CI/Jenkins runs has been a constant challenge for everyone, myself included. Today I am introducing Quibble, a Python script that clones MediaWiki, sets it up, and runs test commands.
It is a follow up to the Vienna Hackathon in 2017, where we had a lot of discussion about making the CI jobs reproducible on a local machine and unifying the logic in a single place. Today, I have added a few jobs to mediawiki/core.
An immediate advantage is that they run in Docker containers and will start running as soon as an execution slot is available. That will be faster than the old jobs (suffixed with -jessie), which had to wait for a virtual machine to be made available.
A second advantage, is one can exactly reproduce the build on a local computer and even hack code for a fix up.
The setup guide is available from the source repository (integration/quibble.git):
https://gerrit.wikimedia.org/g/integration/quibble/
The minimal example would be:
git clone https://gerrit.wikimedia.org/r/p/integration/quibble
cd quibble
python3 -m pip install -e .
quibble
A few more details are available in this post on the QA list:
https://lists.wikimedia.org/pipermail/qa/2018-April/002699.html
Please give it a try and send issues, support requests to Phabricator Quibble project.
It will eventually be used for all MediaWiki extensions and skins as well.
I have been working on the project with more or less focus on it since 2015. Maybe the easiest way to follow the project is by taking a look at a few epic tasks:
T182421: Q3 Selenium framework improvements will come to an end in a few days, so last week a few of us had a meeting to discuss the project.
Conclusions:
What could have gone better:
Things to do:
Meeting notes are available at 20180320 Selenium Retrospective.
Image by Paul Friel - Meerkat II, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=24567063
This is a digest of the updates from several weeks of changelogs which are published upstream. This is an incomplete list, as I've cherry-picked just the changes which I think will be of significant interest to end-users of Wikimedia's Phabricator. Please see the upstream changelogs for a detailed overview of everything that's changed recently.
https://secure.phabricator.com/T13025 The bulk editor (previously sometimes called the "batch editor") has been rebuilt on top of modern infrastructure (EditEngine) and a number of bugs have been fixed.
You can now modify the set of objects being edited from the editor screen, and a wider range of fields (including "points" and some custom fields) are supported. The bulk editor should also handle edits of workboard columns with large numbers of items more gracefully.
Bulk edits can now be made silently (suppressing notifications, feed stories, and email) with bin/bulk make-silent. The need to run a command-line tool is a little clumsy and is likely to become easier in a future version of Phabricator; the restriction exists because the ability to act silently could help an attacker who compromised an account avoid discovery for an extended period of time.
Edits which were made silently show an icon in the timeline view to make it easier to identify them.
Herald now supports formally defining webhooks. You can configure webhooks in "firehose" mode (so they receive all events) or use Herald rules to call them when certain conditions are met.
Several users have requested a way to differentiate notifications triggered by an @mention from the deluge of regular task subscription notification emails. This feature should provide a very good solution. See T150766 for one such request.
Mail now supports "mail stamps" to make it easier to use client rules to route or flag mail. Stamps are pieces of standardized metadata attached to mail in a machine-parseable format, like "FRAGILE" or "RETURN TO SENDER" might be stamped on a package.
By default, stamps are available in the X-Phabricator-Stamps header. You can also enable them in the mail body by changing the Settings → Email Format → Send Stamps setting. This may be useful if you use a client like Gmail which cannot act on mail headers.
Stamps provide more comprehensive information about object and change state than was previously available, and you can now highlight important mail which has stamps like mention(@alice) or reviewer(@alice).
See https://secure.phabricator.com/T13069 for additional discussion and plans for this feature.
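As a rough sketch of how client-side routing on stamps could work (the X-Phabricator-Stamps header name comes from the text above; the message and stamp values here are purely illustrative):

```python
import email

# A hypothetical raw Phabricator notification; only the stamps header matters.
raw = (
    "From: phabricator@example.org\n"
    "X-Phabricator-Stamps: actor(@bob) mention(@alice) reviewer(@alice)\n"
    "Subject: D123: Fix the thing\n"
    "\n"
    "body\n"
)

msg = email.message_from_string(raw)
# Stamps are space-separated machine-parseable tokens.
stamps = set(msg.get("X-Phabricator-Stamps", "").split())

# Route mail that @mentions alice to a high-priority folder.
folder = "Priority" if "mention(@alice)" in stamps else "Phabricator"
print(folder)  # → Priority
```

A mail filter (Sieve, procmail, or Gmail rules on the in-body copy) would express the same match against the header text.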
You can now Mute Notifications for any object which supports subscriptions. This action is available in the right-hand column under Subscribe. Muting notifications for an object stops you from receiving mail from that object, except for mail triggered by Send me an email rules in Herald.
This feature is "on probation" and may be removed in the future if it proves more confusing than useful.
See https://secure.phabricator.com/T13068 for some discussion.
Maniphest now explicitly tracks a closed date (and closing actor) for tasks. This data will be built retroactively by a migration during the upgrade. This will take a little while if you have a lot of tasks (see "Migrations" below).
The Maniphest search UI can now order by close date and filter tasks closed between particular dates or closed by certain users. The maniphest.search API has similar support, and returns this data in result sets. This data is also now available via Export Data.
For closed tasks, the main task list view now shows a checkmark icon and the close date. For open tasks, the view retains the old behavior (no icon, modified date).
Herald rules can now Require secure mail. You can use this action to prevent discussion of sensitive objects (like security bugfixes) from being transmitted via email.
To use this feature, you'll generally write a Herald rule like this:
Global Rule for Revisions
When: [ Projects ][ include ][ Security Fix ]
Take actions: [ Require secure mail ]

Users will still be notified that the corresponding object has been updated, but will have to follow a link in the mail to view details over HTTPS.
This may be useful if you use mailing lists with wide distributions or model sophisticated attackers as threats.
Note that this action is currently not stateful: the rule must keep matching every update to keep the object under wraps. This may change in the future. This flag may also support continuing to send mail content if GPG is configured in some future release.
I expect that we will utilize this feature to improve the secrecy of critical security bugs which are kept private until a security patch has been released.
- Slightly reduced the level of bleeding/explosions on the Maniphest burnup chart.
- Added date range filtering to activity logs, pull logs, and push logs.
- Push logs are now more human readable.
- "Assign to" should now work properly in the bulk editor.
- Fixed an issue with comment actions that affect numeric fields like "Points" in Maniphest.
- maniphest.edit should now accept null to unassign a task, as suggested by the documentation.
- GitLFS over SSH no longer fatals on a bad getUser() call.
- Commits and revisions can now mark that they revert one another with "Reverts <commit|revision>", and reverting or reverted changes are shown more clearly in the timeline.
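The maniphest.edit fix in the list above means an "owner" transaction with a null value unassigns a task. As a hedged sketch of the payload a Conduit API client would send (the task ID and endpoint details are placeholders; a real request also needs an api.token):

```python
import json

# Hypothetical maniphest.edit payload: a null "owner" value unassigns the task,
# per the fix noted above. T123 is a placeholder task identifier.
payload = {
    "objectIdentifier": "T123",
    "transactions": [
        {"type": "owner", "value": None},  # null → unassign
    ],
}

body = json.dumps(payload)
print(body)
# A real client would POST this to the instance's /api/maniphest.edit endpoint.
```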
This is your friendly but final warning that we are replacing Selenium tests written in Ruby with tests in Node.js. There will be no more reminders. The Ruby stack will no longer be maintained. For more information see T139740 and T173488.
Extensive documentation is available at mediawiki.org. If you need help with the migration, I am available for pairing and code review (zfilipin in Gerrit, zeljkof in #wikimedia-releng).
To see how to write a test, watch the Selenium tests in Node.js tech talk (J78).
Who: Željko Filipin, Engineer (Contractor) from Release Engineering team. That's me! 👋
What: Selenium tests in Node.js. We will write a new simple test for a MediaWiki extension. An example: https://www.mediawiki.org/wiki/Selenium/Node.js/Write
When: Tuesday, October 31, 16:00 UTC (E766).
Where: The internet! The event will be streamed and recorded. Details coming soon.
Why: We are deprecating the Ruby Selenium framework (T173488).
See you there!
Recording: YouTube, Commons (coming soon)
Originally an email sent on September 25 2017 to qa, engineering and wikitech-l mailing lists.
This is your friendly but penultimate warning that we are replacing Selenium tests written in Ruby with tests in Node.js. There will be only one more reminder, in October. In the meantime, only critical problems will be resolved in the Ruby stack. After October we will no longer maintain it.
You can follow task T139740 or Release Engineering blog for more information.
Extensive documentation is available at mediawiki.org. If you need help with the migration, I am available for pairing and code review (zfilipin in Gerrit, zeljkof in #wikimedia-releng).
Originally an email sent on August 23 2017 to qa, engineering and wikitech-l mailing lists.
As announced in April, we are replacing Selenium tests written in Ruby with tests in Node.js. Now is the last responsible moment to make the move. There will be two more reminders, in September and October. In the meantime, only critical problems will be resolved in the Ruby stack. After October we will no longer maintain it. You can follow task T139740 for more information. Extensive documentation is available at mediawiki.org. If you need help with the migration, I am available for pairing and code review (zfilipin in Gerrit, zeljkof in #wikimedia-releng).
Originally an email sent on April 3 2017 to qa, engineering and wikitech-l mailing lists.
You can now write Selenium tests in Node.js! Learn more about it at https://www.mediawiki.org/wiki/Selenium/Node.js
Five years ago we introduced browser tests using Selenium and a Ruby based stack. It has worked great for some teams, and not so great for others. Last year we talked to people from several teams and ran a survey. The outcome is a preference toward using a language developers are familiar with: JavaScript/Node.js.
After several months of research and development, we are proud to announce support for writing tests in Node.js. We have decided to use WebdriverIO. It is already available in MediaWiki core and supports running tests for extensions.
You can give it a try in MediaWiki-Vagrant:
vagrant up
vagrant ssh
sudo apt-get install chromedriver
export PATH=$PATH:/usr/lib/chromium
cd /vagrant/mediawiki
xvfb-run npm run selenium
Extensive details are available on the landing page: https://www.mediawiki.org/wiki/Selenium/Node.js
We plan to replace the majority of Selenium tests written in Ruby with tests in Node.js in the next 6 months. We cannot force anybody to rewrite existing tests, but we will offer documentation and pairing sessions for teams that need help. After 6 months, teams that want to continue using the Ruby framework will be able to do so, but without support from the Release Engineering team.
I have submitted a skill share session for Wikimedia Hackathon 2017 in Vienna. If you would like to pair on Selenium tests in person, that would be a great time.
The list of short term actions is in task T139740.
I would like to thank several people for reviews, advice and code: Jean-Rene Branaa, Dan Duvall, Antoine Musso, Jon Robson, Timo Tijhof. (Names are sorted alphabetically by last name. Apologies to people I have forgotten.)
I just finished deploying an update to Phabricator which includes a simple but rather useful feature:
T116515: Enable embedding of media from Wikimedia Commons
You can now embed videos from Wikimedia Commons into any Task, Comment or Post. Just paste the Commons URL to embed the standard Commons player in an iframe. For example, this URL:
https://commons.wikimedia.org/wiki/File:Saving_and_sharing_search_queries_in_Phabricator.webm
Produces this embedded video:
In T135327, the WMF Technical Collaboration team collected a list of Phabricator bugs and feature requests from the Wikimedia Developer Community. After identifying the most promising requests from the community, these were presented to Phacility (the organization that builds and maintains Phabricator) for sponsored prioritization.
I am very pleased to report that we are already seeing the benefits of this initiative. Several sponsored improvements have landed on https://phabricator.wikimedia.org/ over the past few weeks. For an overview of what's landed recently, read on!
The following tasks are now resolved:
Notice that three of those have task numbers lower than 2000. Those long-standing tasks date from the first months of WMF's Phabricator evaluation and RFC period. When those tasks were originally filed, Phabricator was just a test install running in WMF Labs. For me, it's especially satisfying to close so many long-standing issues that have affected many of us for more than a year.
Several more issues were identified for sponsorship which are still awaiting a complete solution. Some of these are at least partially fixed and some are still pending. You can find out more details by reading the comments on each task linked below.
Besides the sponsored features and bug fixes, there are several other recent improvements which are worth mentioning.
This very helpful feature displays a graphical representation of a task's Parents and Subtasks.
Initially there was an issue with this feature that made tasks with many relationships unable to load. This was exacerbated by the historical use of "tracking tasks" in the Wikimedia Bugzilla context. Thankfully after a quick patch from @epriestley (the primary author of Phabricator) and lots of help and testing from @Danny_B and @Paladox, @mmodell was able to deploy a fix for the issue a little over 24 hours after it was discovered.
Here's to yet more fruitful collaborations with upstream Phabricator!
Starting Thursday May 12th, 13:00 PDT (20:00 GMT), we will be having the first weekly Code Review office hours on freenode IRC in the #wikimedia-codereview channel.
Event details: E179: Code Review Office Hours
Background: T128371: Set up Code Review office hours
Thanks to everyone who's been helping to organize this. We welcome people submitting their patches for review, as well as reviewers who can spare a few minutes to provide feedback and hopefully merge some patches!
If you can't make it during the scheduled time period then please feel free to suggest other times that would be better for you. I intend to set up one or two other weekly time slots, at least one of which should be at a time that's more convenient for people in Europe and Asia.
Looking forward to seeing you in #wikimedia-codereview
Not a lot has changed for Wikimedia's instance of Phabricator over the past few months. That's because a lot has been happening behind the scenes, as well as upstream at Phacility. Members of the Release-Engineering-Team and Team-Practices group have been working since December 2015 to integrate various upstream changes; however, nothing was released to our production instance because so many important features were in progress and not yet fully usable. Additionally, we had to figure out exactly how these features would fit with the specific needs of our project, and test a lot of functionality to be sure that we would not break anyone's workflows.
So our Phabricator instance has been relatively unchanged since November of last year. This all changed last Wednesday night (Thursday February 18th, 01:00 UTC) when we unleashed several months of changes into production. If you use phabricator.wikimedia.org regularly then you have probably already noticed some of the more obvious improvements.
A whole lot of hard work went into this release. Thankfully, everyone's hard work seems to have paid off, as we only encountered a couple of relatively small issues, which were fixed quickly afterward.
This post is to fill everyone in about what's changed and what you can expect from some of the exciting new functionality that has been added with this release.
It's now possible to customize individual project pages to meet the needs of each type of project or the needs of specific teams.
Projects can now be nested. There are two new types of projects in Phabricator and they could prove to be really useful for organizing all of the things. Sub-projects are just like regular projects, but nested inside of an existing project. Milestones are a special type of sub-project that can be used to represent a sprint or a software release. There are a few somewhat complex rules about how project membership, policies and tasks are affected by sub-projects. There is detailed coverage in the Phabricator Projects Documentation and we have attempted to explain some of the implications here:
Previously this functionality was provided by a custom field and the rPHSP phabricator-Sprint extension.
This couldn't have happened without everyone's help <3
Specifically I'd like to thank: