Doing the needful: occasional updates from the #release-engineering-team
https://phabricator.wikimedia.org/phame/blog/feed/1/

Investigate a PHP segmentation fault
hashar (Antoine Musso), 2023-07-28
https://phabricator.wikimedia.org/phame/post/view/306/

Summary


The Beta-Cluster-Infrastructure is a farm of wikis we use for experimentation and integration testing. It is updated continuously: new code is deployed every ten minutes, and the databases are updated every hour by running MediaWiki's maintenance/update.php. The scheduling and running are driven by Jenkins jobs whose statuses can be seen on the Beta view:

image.png (192×928 px, 33 KB)

On top of that, Jenkins emits notification messages to IRC as long as one of the update jobs keeps failing. One of them started failing on July 25th, and this is how the alarm looked to me (times are for France, UTC+2):

image.png (92×941 px, 37 KB)

(wmf-insecte is the Jenkins bot; insecte is French for bug, the animal, and the wmf- prefix identifies it as a Wikimedia Foundation robot).

Clicking on the link gives the output of the update script, which eventually fails with:

+ /usr/local/bin/mwscript update.php --wiki=wikifunctionswiki --quick --skip-config-validation
20:31:09 ...wikilambda_zlanguages table already exists.
20:31:09 ...have wlzl_label_primary field in wikilambda_zobject_labels table.
20:31:09 ...have wlzl_return_type field in wikilambda_zobject_labels table.
20:31:09 /usr/local/bin/mwscript: line 27:  1822 Segmentation fault      sudo -u "$MEDIAWIKI_WEB_USER" $PHP "$MEDIAWIKI_DEPLOYMENT_DIR_DIR_USE/multiversion/MWScript.php" "$@"

The important bit is Segmentation fault, which indicates the program (php) made an invalid memory access and was rightfully killed by the Linux kernel. Looking at the instance's kernel messages via dmesg -T:

[Mon Jul 24 23:33:55 2023] php[28392]: segfault at 7ffe374f5db8 ip 00007f8dc59fc807 sp 00007ffe374f5da0 error 6 in libpcre2-8.so.0.7.1[7f8dc59b9000+5d000]
[Mon Jul 24 23:33:55 2023] Code: ff ff 31 ed e9 74 fb ff ff 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 41 54 55 48 89 d5 53 44 89 c3 48 81 ec 98 52 00 00 <48> 89 7c 24 18 4c 8b a4 24 d0 52 00 00 48 89 74 24 10 48 89 4c 24
[Mon Jul 24 23:33:55 2023] Core dump to |/usr/lib/systemd/systemd-coredump 28392 33 33 11 1690242166 0 php pipe failed

With that data, I had enough for the most urgent step: filing a task (T342769) that can be used as an audit trail and reference for the future. It is the single most important step I take whenever I debug an issue: if I have to stop due to time constraints or lack of technical ability, others can step in and continue. It also provides a historical record that can be looked up later, and indeed this specific problem had already been investigated and fully documented a couple of years ago. For PHP segmentation faults, we even have a dedicated project, php-segfault.

With the task filed, I continued the investigation. The previous successful build had:

19:30:18 ...have wlzl_label_primary field in wikilambda_zobject_labels table.
19:30:18 ...have wlzl_return_type field in wikilambda_zobject_labels table.
19:30:18 	❌ Unable to make a page for Z7138: The provided content's label clashes with Object 'Z10138' for the label in 'Z1002'.
19:30:18 	❌ Unable to make a page for Z7139: The provided content's label clashes with Object 'Z10139' for the label in 'Z1002'.
19:30:18 	❌ Unable to make a page for Z7140: The provided content's label clashes with Object 'Z10140' for the label in 'Z1002'.
19:30:18 ...site_stats is populated...done.

The successful build started at 19:20 UTC and the failing one finished at 20:30 UTC, which gives us a short time window to investigate. Since the failure seemed to happen after updating the WikiLambda MediaWiki extension, I went to inspect the few commits that were merged in that window. I took advantage of Gerrit recording review actions as git notes, notably the exact time a change got submitted and subsequently merged. The process:

Clone the suspect repository:

git clone https://gerrit.wikimedia.org/r/extensions/WikiLambda
cd WikiLambda

Fetch the Gerrit review notes:

git fetch origin refs/notes/review:refs/notes/review
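
As a convenience (my own aside, not part of the original steps), the notes refspec can also be added to the clone's configuration so that subsequent git fetch invocations keep the review notes up to date:

git config --add remote.origin.fetch 'refs/notes/review:refs/notes/review'
git fetch origin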

The review notes can be shown below the commit by passing --notes=review to git log or git show; here is an example for the current HEAD of the repository:

$ git show -q --notes=review
commit c7f8071647a1aeb2cef6b9310ccbf3a87af2755b (HEAD -> master, origin/master, origin/HEAD)
Author: Genoveva Galarza <ggalarzaheredero@wikimedia.org>
Date:   Thu Jul 27 00:34:03 2023 +0200

    Initialize blank function when redirecting to FunctionEditor from DefaultView
    
    Bug: T342802
    Change-Id: I09d3400db21983ac3176a0bc325dcfe2ddf23238

Notes (review):
    Verified+1: SonarQube Bot <kharlan+sonarqubebot@wikimedia.org>
    Verified+2: jenkins-bot
    Code-Review+2: Jforrester <jforrester@wikimedia.org>
    Submitted-by: jenkins-bot
    Submitted-at: Wed, 26 Jul 2023 22:47:59 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/942026
    Project: mediawiki/extensions/WikiLambda
    Branch: refs/heads/master

This shows the change was approved by Jforrester and entered the repository on Wed, 26 Jul 2023 22:47:59 UTC. To find the commits in the suspect range, I asked git log to list:

  • anything with a commit date on that day (the commit date is not necessarily accurate, but in this case it is a good enough approximation)
  • from oldest to newest
  • sorted by topology order (aka in the order the commit entered the repository rather than based on the commit date)
  • show the review notes to get the Submitted-at field

I can then scroll to the commits whose Submitted-at falls in the 19:20 UTC - 20:30 UTC window. I have trimmed the output below to keep the full review notes only for the first commit:

$ git log --oneline --since=2023/07/25 --reverse --notes=review --no-merges --topo-order
<scroll>
653ea81a Handle oldid url param to view a particular revision
Notes (review):
    Verified+1: SonarQube Bot <kharlan+sonarqubebot@wikimedia.org>
    Verified+2: jenkins-bot
    Code-Review+2: Jforrester <jforrester@wikimedia.org>
    Submitted-by: jenkins-bot
    Submitted-at: Tue, 25 Jul 2023 19:26:53 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941482
    Project: mediawiki/extensions/WikiLambda
    Branch: refs/heads/master

fe4b0446 AUTHORS: Update for July 2023
Notes (review):
    Submitted-at: Tue, 25 Jul 2023 19:49:43 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941507

73fcb4a4 Update function-schemata sub-module to HEAD (1c01f22)
Notes (review):
    Submitted-at: Tue, 25 Jul 2023 19:59:23 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941384

598f5fcc PageRenderingHandler: Don't make 'read' selected if we're on the edit tab
Notes (review):
    Submitted-at: Tue, 25 Jul 2023 20:16:05 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/941456

Or, presented in the Phabricator task in a more human-friendly way:

The commit Update function-schemata sub-module to HEAD (1c01f22) has a short log of the changes it introduces:

  • New changes:
  • abc4aa6 definitions: Add Z1908/bug-bugi and Z1909/bug-lant ZNaturalLanguages
  • 0f1941e definitions: Add Z1910/piu ZNaturalLanguage
  • 1c01f22 definitions: Re-label all objects to drop the 'Z' per Amin

Since the update script fails in WikiLambda, I reached out to its developers so they could investigate their code and maybe find what triggers the issue.

On the PHP side we need a stack trace. That can be obtained by configuring the Linux kernel to take a dump of the program before terminating it and store it on disk. It did not quite work due to a configuration issue on the machine, and on the first attempt we also forgot to ask bash to allow core dump generation (ulimit -c unlimited). Remembering a past debugging session, I instead ran the command directly under the GNU debugger: gdb.
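
For reference, had the machine been configured properly, getting an on-disk core dump would have looked roughly like this (a sketch, assuming systemd-coredump handles kernel.core_pattern as the dmesg output above suggests):

# Allow the current shell (and the processes it spawns) to write core dumps
ulimit -c unlimited
# Confirm where the kernel pipes core dumps
cat /proc/sys/kernel/core_pattern
# After the crash, list the collected dumps and open the relevant one in gdb
coredumpctl list php
coredumpctl gdb php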

There are a few preliminary steps to debug the PHP program. First, one needs to install the debug symbols, which let the debugger map binary addresses back to lines of the original source code. Since the error mentions libpcre2, I also installed its debugging symbols:

$ sudo apt-get -y install php7.4-common-dbgsym php7.4-cli-dbgsym libpcre2-dbg
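
(On a stock Debian system the -dbgsym packages live in the separate debian-debug archive, so an entry along these lines may be needed first; the suite name is an assumption, and on this host the packages were evidently already reachable since the install succeeded:)

echo 'deb http://deb.debian.org/debian-debug/ bullseye-debug main' | sudo tee /etc/apt/sources.list.d/debug.list
sudo apt-get update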

I then used gdb to start a debugging session:

sudo -s -u www-data gdb --args /usr/bin/php /srv/mediawiki-staging/multiversion/MWScript.php update.php --wiki=wikifunctionswiki --quick --skip-config-validation
gdb>

Then ask gdb to start the program by entering run at the prompt. After several minutes, it caught the segmentation fault:

gdb> run
<output>
<output freeze for several minutes while update.php is doing something>

Thread 1 "php" received signal SIGSEGV, Segmentation fault.
0x00007ffff789e807 in pcre2_match_8 (code=0x555555ce1fb0, 
    subject=subject@entry=0x7fffcb410a98 "Z1002", length=length@entry=5, 
    start_offset=start_offset@entry=0, options=0, 
    match_data=match_data@entry=0x555555b023e0, mcontext=0x555555ad5870)
    at src/pcre2_match.c:6001
6001	src/pcre2_match.c: No such file or directory.

I could not find a package providing the src/pcre2_match.c source file, but that was not needed after all.

To retrieve the stack trace, enter bt at the gdb prompt:

gdb> bt
#0  0x00007ffff789e807 in pcre2_match_8 (code=0x555555ce1fb0, 
    subject=subject@entry=0x7fffcb410a98 "Z1002", length=length@entry=5, 
    start_offset=start_offset@entry=0, options=0, 
    match_data=match_data@entry=0x555555b023e0, mcontext=0x555555ad5870)
    at src/pcre2_match.c:6001
#1  0x00005555556a3b24 in php_pcre_match_impl (pce=0x7fffe83685a0, 
    subject_str=0x7fffcb410a80, return_value=0x7fffcb44b220, subpats=0x0, global=0, 
    use_flags=<optimized out>, flags=0, start_offset=0) at ./ext/pcre/php_pcre.c:1300
#2  0x00005555556a493b in php_do_pcre_match (execute_data=0x7fffcb44b710, 
    return_value=0x7fffcb44b220, global=0) at ./ext/pcre/php_pcre.c:1149
#3  0x00007ffff216a3cb in tideways_xhprof_execute_internal ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#4  0x000055555587ddee in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at ./Zend/zend_vm_execute.h:1732
#5  execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#6  0x00007ffff2169c89 in tideways_xhprof_execute_ex ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#7  0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at ./Zend/zend_vm_execute.h:1714
#8  execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#9  0x00007ffff2169c89 in tideways_xhprof_execute_ex ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#10 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at ./Zend/zend_vm_execute.h:1714
#11 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#12 0x00007ffff2169c89 in tideways_xhprof_execute_ex ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#13 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at ./Zend/zend_vm_execute.h:1714
#14 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#15 0x00007ffff2169c89 in tideways_xhprof_execute_ex ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#16 0x000055555587c63c in ZEND_DO_FCALL_SPEC_RETVAL_UNUSED_HANDLER ()
    at ./Zend/zend_vm_execute.h:1602
#17 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53535
#18 0x00007ffff2169c89 in tideways_xhprof_execute_ex ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#19 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at ./Zend/zend_vm_execute.h:1714
#20 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#21 0x00007ffff2169c89 in tideways_xhprof_execute_ex ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#22 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at ./Zend/zend_vm_execute.h:1714
#23 execute_ex (ex=0x555555ce1fb0) at ./Zend/zend_vm_execute.h:53539
#24 0x00007ffff2169c89 in tideways_xhprof_execute_ex ()
   from /usr/lib/php/20190902/tideways_xhprof.so
#25 0x000055555587de4b in ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER ()
    at ./Zend/zend_vm_execute.h:...
[... backtrace truncated; the remaining frames repeat the same pattern ...]

That is not very helpful. Thankfully the PHP project provides a set of gdb macros which let one map the low-level C frames to the PHP code being executed. They are provided in the source repository as /.gdbinit, and one should use the version matching the PHP branch being debugged. Since we run PHP 7.4, I used the version from the latest 7.4 release (7.4.30 at the time of this writing): https://raw.githubusercontent.com/php/php-src/php-7.4.30/.gdbinit
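
Fetching the file is a one-liner (curl is assumed to be available; the destination path is just an example):

curl -sSL -o /home/hashar/gdbinit https://raw.githubusercontent.com/php/php-src/php-7.4.30/.gdbinit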

With the file saved in your home directory (ex: /home/hashar/gdbinit), ask gdb to import it with, for example, source /home/hashar/gdbinit:

(gdb) source /home/hashar/gdbinit

This provides a few new commands to show PHP Zend values and to generate a very helpful stack trace (zbacktrace):

(gdb) zbacktrace
[0x7fffcb44b710] preg_match("\7^Z[1-9]\d*$\7u", "Z1002") [internal function]
[0x7fffcb44aba0] Opis\JsonSchema\Validator->validateString(reference, reference, array(0)[0x7fffcb44ac10], array(7)[0x7fffcb44ac20], object[0x7fffcb44ac30], object[0x7fffcb44ac40], object[0x7fffcb44ac50]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:1219 
[0x7fffcb44a760] Opis\JsonSchema\Validator->validateProperties(reference, reference, array(0)[0x7fffcb44a7d0], array(7)[0x7fffcb44a7e0], object[0x7fffcb44a7f0], object[0x7fffcb44a800], object[0x7fffcb44a810], NULL) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:943 
[0x7fffcb44a4c0] Opis\JsonSchema\Validator->validateKeywords(reference, reference, array(0)[0x7fffcb44a530], array(7)[0x7fffcb44a540], object[0x7fffcb44a550], object[0x7fffcb44a560], object[0x7fffcb44a570]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:519 
[0x7fffcb44a310] Opis\JsonSchema\Validator->validateSchema(reference, reference, array(0)[0x7fffcb44a380], array(7)[0x7fffcb44a390], object[0x7fffcb44a3a0], object[0x7fffcb44a3b0], object[0x7fffcb44a3c0]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:332 
[0x7fffcb449350] Opis\JsonSchema\Validator->validateConditionals(reference, reference, array(0)[0x7fffcb4493c0], array(7)[0x7fffcb4493d0], object[0x7fffcb4493e0], object[0x7fffcb4493f0], object[0x7fffcb449400]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:703 
[0x7fffcb4490b0] Opis\JsonSchema\Validator->validateKeywords(reference, reference, array(0)[0x7fffcb449120], array(7)[0x7fffcb449130], object[0x7fffcb449140], object[0x7fffcb449150], object[0x7fffcb449160]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:523 
[0x7fffcb448f00] Opis\JsonSchema\Validator->validateSchema(reference, reference, array(0)[0x7fffcb448f70], array(7)[0x7fffcb448f80], object[0x7fffcb448f90], object[0x7fffcb448fa0], object[0x7fffcb448fb0]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:332 
<loop>

The stack trace shows the code entered an infinite loop while validating a JSON schema, up to the point where it got stopped.

The arguments can be further inspected with printzv, passing it an object reference as argument. Take this line as an example:

[0x7fffcb44aba0] Opis\JsonSchema\Validator->validateString(reference, reference, array(0)[0x7fffcb44ac10], array(7)[0x7fffcb44ac20], object[0x7fffcb44ac30], object[0x7fffcb44ac40], object[0x7fffcb44ac50]) /srv/mediawiki-staging/php-master/vendor/opis/json-schema/src/Validator.php:1219
(gdb) printzv 0x7fffcb44ac10
[0x7fffcb44ac10] (refcount=2) array:     Hash(0)[0x5555559d7f00]: {
}
(gdb) printzv 0x7fffcb44ac20
[0x7fffcb44ac20] (refcount=21) array:     Packed(7)[0x7fffcb486118]: {
      [0] 0 => [0x7fffcb445748] (refcount=17) string: Z2K2
      [1] 1 => [0x7fffcb445768] (refcount=18) string: Z4K2
      [2] 2 => [0x7fffcb445788] long: 1
      [3] 3 => [0x7fffcb4457a8] (refcount=15) string: Z3K3
      [4] 4 => [0x7fffcb4457c8] (refcount=10) string: Z12K1
      [5] 5 => [0x7fffcb4457e8] long: 1
      [6] 6 => [0x7fffcb445808] (refcount=6) string: Z11K1
}
(gdb) printzv 0x7fffcb44ac30
[0x7fffcb44ac30] (refcount=22) object(Opis\JsonSchema\Schema) #485450 {
id => [0x7fffcb40f508] (refcount=3) string: /Z6#
draft => [0x7fffcb40f518] (refcount=1) string: 07
internal => [0x7fffcb40f528] (refcount=1) reference: [0x7fffcb6704e8] (refcount=1) array:     Hash(1)[0x7fffcb4110e0]: {
      [0] "/Z6#" => [0x7fffcb71d280] (refcount=1) object(stdClass) #480576
}
(gdb) printzv 0x7fffcb44ac40
[0x7fffcb44ac40] (refcount=5) object(stdClass) #483827
Properties     Hash(1)[0x7fffcb6aa2a0]: {
      [0] "pattern" => [0x7fffcb67e3c0] (refcount=1) string: ^Z[1-9]\d*$
}
(gdb) printzv 0x7fffcb44ac50
[0x7fffcb44ac50] (refcount=5) object(Opis\JsonSchema\ValidationResult) #486348 {
maxErrors => [0x7fffcb4393e8] long: 1
errors => [0x7fffcb4393f8] (refcount=2) array:     Hash(0)[0x5555559d7f00]: {
}
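
As a quick sanity check (my own aside, not part of the original investigation), the extracted pattern and subject can be tried in isolation; they behave perfectly well, which points at the recursion depth of the validator rather than the regular expression itself:

php -r 'var_dump(preg_match("/^Z[1-9]\d*$/u", "Z1002"));'
# int(1): the pattern matches fine when called from a shallow stack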

Extracting the parameters was enough for the WikiLambda developers to find the immediate root cause: they removed some definitions which triggered the infinite loop and manually ran a script to reload the data in the database. Eventually the Jenkins job managed to update the wiki database:

16:30:26 <wmf-insecte> Project beta-update-databases-eqiad build #69029: FIXED in 10 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/69029/

One problem solved!


CI: Get notified immediately when a job fails
kostajh (Kosta Harlan), 2023-03-07
https://phabricator.wikimedia.org/phame/post/view/302/

If you've submitted patches for MediaWiki core, skins or extensions, you've seen this output in Gerrit:

image.png (498×1 px, 253 KB)

That is a list of links to each job's console output for a patch that failed verification.

You can see a job that failed at 1m 54s, but jenkins-bot does not post a comment on the patch until all jobs have completed. That means you won't get email/IRC notifications for test failures on your patch until the longest-running job completes; in this case, after 14m 57s.[0] ⏳⏱️

With all due respect to xkcd/303... wouldn't it be nice to get notified as soon as a failure occurs, so you can fix your patch earlier to avoid context switching, or losing time during a backport window?

IMHO, yes, and, now it's possible!

⚙️ Get started

Commit a quibble.yaml file (documentation, example patch) to your MediaWiki project[1]:

earlywarning:
    should_vote: 1
    should_comment: 1

The next time that there is a test failure[2] in your repository, you will see a comment from the Early warning bot and a Verified: -1 vote.

Here's an example of how that might look in practice:

image.png (1×1 px, 303 KB)
(Yes, the formatting needs some work still, patches welcome!)

So, the bot announces 2 minutes after the patch is updated that there's a problem, with the output of the failed command. The full report from jenkins-bot arrives 14 minutes later.

📚 Further reading

For details on how this works, please see the documentation for the Early warning bot. Your feedback and contributions are very welcome on T323750: Provide early feedback when a patch has job failures (feel free to tag T323750 with patches adding quibble.yaml to your project.)

🙌🏻 Thank you

Cheers,
Kosta

[0] An alternative for getting real-time progress is to watch Zuul TV:

image.png (878×618 px, 118 KB)
There is also the excellent work in T214068: Display Zuul status of jobs for a change on Gerrit UI, but this does not generate email/IRC notifications or set a verification label.
[1] This will work for MediaWiki core, extensions, skins; in theory any CI job using Quibble could use it, though.
[2] Some jobs, like mwext-phan, won't report back early because they are not yet run via Quibble.

Shrinking H2 database files
hashar (Antoine Musso), 2022-12-16
https://phabricator.wikimedia.org/phame/post/view/300/

Our code review system Gerrit has several caches, the largest ones being backed by disk. The disk caches offload memory usage and persist the data between restarts. Gerrit being a Java application, the caches are stored in H2 database files, and I recently had to find out how to connect to them in order to inspect their content and reduce their size.

In short: java -Dh2.maxCompactTime=15000 ... would cause the H2 driver to compact the database upon disconnection.

Context

During an upgrade, the Gerrit installation filled up the system root partition entirely (incident report for Gerrit 3.5 upgrade). The reason was two caches occupying 9G and 11G out of the 40G system partition. Those caches hold the differences made to files by patchsets and are stored in two files:

File in /var/lib/gerrit2/review_site/cache/   Size (MB)
git_file_diff.h2.db                           8376
gerrit_file_diff.h2.db                        11597

An easy fix would have been to stop the service, delete all caches, restart the service and let the application refill the cold caches. That is a short-term solution though: long term, what if the issue lies in the application and we have to do the same all over again in a few weeks? The large discrepancy also triggered my curiosity, and I had to know the exact root cause to find a definitive fix. There started my debugging journey.

They are all empty?

Looking at the caches through the application shows they are way smaller, at around 150 MB:

ssh -p 29418 gerrit.wikimedia.org gerrit show-caches
  Name                          |Entries              |  AvgGet |Hit Ratio|
                                |   Mem   Disk   Space|         |Mem  Disk|
--------------------------------+---------------------+---------+---------+
D gerrit_file_diff              | 24562 150654 157.36m|  14.9ms | 72%  44%|
D git_file_diff                 | 12998 143329 158.06m|  14.8ms |  3%  14%|
                                               ^^^^^^^

One could assume some overhead, but there is no reason for metadata to occupy a hundred times more space than the actual data it describes, especially given each cached item is a file diff, which is more than a few bytes. To retrieve the files locally I compressed them with gzip and they shrank to a mere 32 MB! That is a strong indication those files are filled mostly with empty data, which suggests the database layer never reclaims no-longer-used blocks. Reclaiming is known as compacting in H2 or vacuuming in SQLite.
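
For the curious, the check itself is trivial (a sketch; gzip -k keeps the original file around):

gzip -k git_file_diff.h2.db
ls -lh git_file_diff.h2.db git_file_diff.h2.db.gz
# a huge compression ratio strongly hints at mostly empty blocks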

Connecting

Once I retrieved the files, I tried to connect to them using the H2 database jar and kept making mistake after mistake due to my complete lack of knowledge on that front:

Version matters

At first I tried with the latest version, h2-2.1.214.jar, and it did not find any data. I eventually found out the underlying storage system has changed compared to version 1.3.176 used by Gerrit. I thus had to use the older version, which can be retrieved from the Gerrit.war package.

File parameter which is not a file

I then wanted a SQL dump of the database to inspect it, using the Script Java class (java -cp h2-1.3.176.jar org.h2.tools.Script). It requires a -url option, a JDBC URI containing the database name. Intuitively I gave the full file name:

java -cp h2-1.3.176.jar org.h2.tools.Script -url jdbc:h2:git_file_diff.h2.db

It returned instantly and generated the dump:

backup.sql
CREATE USER IF NOT EXISTS "" SALT '' HASH '' ADMIN;

Essentially an empty file. Looking at the files on disk, it had created a git_file_diff.h2.db.h2.db file of 24 kB. Lesson learned: the .h2.db suffix must be removed from the URI. I was then able to create the dump using:

java -cp h2-1.3.176.jar org.h2.tools.Script -url jdbc:h2:git_file_diff

Which resulted in a properly sized backup.sql.

Web based admin

I altered the SQL to make it fit SQLite in order to load it in SqliteBrowser (a graphical interface which is very convenient for inspecting such databases). Then I found that invoking the jar directly starts a background process attached to the database and opens my web browser on a web UI: java -jar h2-1.3.176.jar -url jdbc:h2:git_file_diff:

h2_web_ui.png (470×810 px, 92 KB)

That is very convenient for inspecting the file. The caches are key-value stores with a column keeping track of the size of each record. Summing that column is how gerrit show-caches finds out the size of the caches (roughly 150 MB for the two diff caches).

Compacting solutions

The H2 Database feature page mentions empty space is meant to be re-used, which is not the case as seen above. The documentation states that when the database connection is closed, the database is compacted for up to 200 milliseconds. Gerrit establishes the connection on start up and keeps it open until it is shut down, at which point the compaction occurs. That is not frequent enough, and the small delay is apparently not sufficient to compact our huge databases. To run a full compaction, several methods are possible:

SHUTDOWN COMPACT: this requests an explicit compaction and terminates the connection. The documentation implies it is not subject to the time limit. It would have required a change in the Gerrit Java code to issue the command.

org.h2.samples.Compact script: H2 ships an org.h2.samples.Compact sample to manually compact a given database. It would need some instrumentation to trigger it against each file after Gerrit is shut down, possibly as a systemd service ExecStopPost hook iterating through each file.

JDBC URL parameter MAX_COMPACT_TIME: the 200 milliseconds can be bumped by adding the parameter to the JDBC connection URL (separated by a semicolon ;). Again, it would require a change in the Gerrit Java code to modify the way it connects.
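
To illustrate that last option with the standalone tools used earlier, the setting would be appended to the connection URL like this (a sketch; the Gerrit code builds its own URL elsewhere):

java -cp h2-1.3.176.jar org.h2.tools.Script -url 'jdbc:h2:git_file_diff;MAX_COMPACT_TIME=15000'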

The beauty of open source is that I could access the database source code. It is hosted at https://github.com/h2database/h2database in the version-1.3 tag, which holds a subdirectory for each sub-version. When looking up a setting, the database driver uses the following piece of code (licensed under the Mozilla Public License Version 2.0 or the Eclipse Public License 1.0):

version-1.3.176/h2/src/main/org/h2/engine/SettingsBase.java
    /**
     * Get the setting for the given key.
     *
     * @param key the key
     * @param defaultValue the default value
     * @return the setting
     */
    protected String get(String key, String defaultValue) {
        StringBuilder buff = new StringBuilder("h2.");
        boolean nextUpper = false;
        for (char c : key.toCharArray()) {
            if (c == '_') {
                nextUpper = true;
            } else {
                // Character.toUpperCase / toLowerCase ignores the locale
                buff.append(nextUpper ? Character.toUpperCase(c) : Character.toLowerCase(c));
                nextUpper = false;
            }
        }
        String sysProperty = buff.toString();
        String v = settings.get(key);
        if (v == null) {
            v = Utils.getProperty(sysProperty, defaultValue);
            settings.put(key, v);
        }
        return v;
    }

When retrieving the setting MAX_COMPACT_TIME, it forges a camel-case version of the setting name prefixed with h2., which gives h2.maxCompactTime, then looks it up in the JVM properties and, if set, picks its value.

Raising the compaction time limit to 15 seconds is thus just a matter of passing -Dh2.maxCompactTime=15000 to java.
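
In Gerrit's case that property can be carried by the JVM options in gerrit.config; a sketch of what it would look like (the site path follows the cache path above, and our actual fix was applied through Puppet):

/var/lib/gerrit2/review_site/etc/gerrit.config (excerpt)
[container]
    javaOptions = -Dh2.maxCompactTime=15000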

Applying and resolution

Change 7f6215e039 in our Puppet repository applies the fix and summarizes the above. Once it was applied, I restarted Gerrit once to have the setting taken into account, and restarted it a second time to have it disconnect from the databases with the setting in place. The results are unequivocal. Here are the largest gains:

File                       Before   After
approvals.h2.db            610M     313M
gerrit_file_diff.h2.db     12G      527M
git_file_diff.h2.db        8.2G     532M
git_modified_files.h2.db   899M     149M
git_tags.h2.db             1.1M     32K
modified_files.h2.db       905M     208M
oauth_tokens.h2.db         1.1M     32K
pure_revert.h2.db          1.1M     32K

The gerrit_file_diff and git_file_diff caches went from 12G and 8.2G respectively to roughly 0.5G each, which addresses the issue.

Conclusion

Setting the Java property -Dh2.maxCompactTime=15000 was a straightforward fix which does not require any change to the application code. It also guarantees the databases will keep being compacted each time Gerrit is restarted, so the issue that led to a longer maintenance window than expected should not reappear.

Happy end of year 2022!


scap backport Makes Deployments Easy
jeena (Jeena Huneidi), 2022-09-26
https://phabricator.wikimedia.org/phame/post/view/297/

Mediawiki developers, have you ever thought, “I wish I could deploy my own code for Mediawiki”? Now you can! More deploys! More fun!

Next time you want to get some code deployed, why not try scap backport?

One Command To Deploy

scap backport is one command that will +2 your patch, deploy to mwdebug and wait for your approval, and finally sync to all servers. You only need to provide the change number or gerrit url of your change.

You can run scap backport on patches that have already merged, or re-run scap backport if you decided to cancel in the middle of a run. scap backport can also handle multiple patches at a time. After all the patches have been merged, they’ll be deployed all together. scap backport will confirm that your patches are deployable before merging, and double check no extra patches have sneaked into your deployment.

One Command To Revert

And if your code didn’t work out, don’t worry, there’s scap backport --revert, which will create a revert patch, send it to Gerrit, and run all steps of scap backport to revert your work. You’re offered the choice to give a reason for the revert, which will show up in the commit message. Just be aware that you'll need to wait for tests to run and your code to merge before it gets synced, so in an emergency this might not be the best option.

Extra Information

You can also list available backports or reverts using the --list flag!

If you'd like some guidance on deploying backports, please sign up here to join us for backport training, which happens once a week on Thursday during the UTC late backport window!

Scap Backport In Action

ezgif.com-gif-maker(1).gif (450×800 px, 2 MB)

Compare to Manual Steps

For comparison, the previous way to backport would require the user to enter the following commands on the deployment host:

cd /srv/mediawiki-staging/php-<version>
git status
git fetch
git log -p HEAD..@{u}
git rebase

Then, if there were changes to an extension: git submodule update [extensions|skins]/<name>
Then, log in to mwdebug and run scap pull
Then, back on the deployment host: scap sync-file php-<version>/<path to file> 'Backport: [[gerrit:<change no>|<subject> (<bug no>)]]' for each changed file

Example Usage

List backports
scap backport --list

Backport change(s)
scap backport 1234
scap backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1234
scap backport 1234 5678

Merge but do not sync
scap backport --stop-before-sync 1234

List revertable changes
scap backport --revert --list

Revert change(s)
scap backport --revert 1234
scap backport --revert 1234 5678

That's all for now, and happy backporting!

Production Excellence #46: July & August 2022
Krinkle (Timo Tijhof), 2022-09-08
https://phabricator.wikimedia.org/phame/post/view/296/

How are we doing in our strive for operational excellence? Read on to find out!

Incidents

7 documented incidents in July, and 4 in August (Incident graphs). Read more about past incidents at Incident status on Wikitech.

2022-07-03 shellbox
Impact: For 16 minutes, edits and previews for pages with Score musical notes were slow or unavailable.

2022-07-10 thumbor
Impact: For several days, Thumbor p75 service response times gradually regressed by several seconds.

2022-07-11 FrontendUnavailable cache text
Impact: For 5 minutes, the MediaWiki API cluster in eqiad responded with higher latencies or errors.

2022-07-11 Shellbox and parsoid saturation
Impact: For 13 minutes, the mobileapps service was serving HTTP 503 errors to clients.

2022-07-12 codfw A5 power cycle
Impact: No observed public-facing impact. Internal clean up took some work, e.g. for Ganeti VMs.

2022-07-13 eqsin bandwidth
Impact: For 20 minutes, there was a small increase in error responses for thumbnails served from the Eqsin data center (Singapore).

2022-07-20 eqiad network
Impact: For 10-15 minutes, a portion of wiki traffic from Eqiad-served regions was lost (about 1M uncached requests). For ~30 minutes, Phabricator was unable to access its database.

2022-08-10 cassandra disk space
Impact: During planned downtime, other hosts ran out of space due to accumulating logs. No external impact.

2022-08-10 confd all hosts
Impact: No external impact.

2022-08-16 Beta Cluster 502
Impact: For 7 hours, all Beta Cluster sites were unavailable.

2022-08-16 x2 database replication
Impact: For 36 minutes, errors were noticeable for some editors. Saving edits was unaffected.

proderr-incidents 2022-08.png (800×1 px, 107 KB)


Incident follow-up

Recently completed incident follow-up:

Replace certificate on elastic09 in Beta Cluster
Brian (@bking, WMF Search) noticed during an incident review that an internal server used an expired cert and renewed it in accordance with a documented process.

Localisation cache must be purged after train deploy
@Tchanders (WMF AHT) filed this in 2020 after a recurring issue with stale interface labels. Work led by Ahmon (@dancy, WMF RelEng).

Remember to review and schedule Incident Follow-up work in Phabricator! These are preventive measures and tech debt mitigations written down after an incident is concluded.

Highlight from the "Oldest incident follow-up" query:

  • T83729 Fix monitoring of poolcounter service.

Trends

The month of July saw 22 new production errors of which 9 are still open today. In August we encountered 29 new production errors of which 10 remain open today and have carried over to September.

Take a look at the Wikimedia-production-error workboard and look for tasks that could use your help.

💡 Did you know?

To zoom in and find your team's error reports, use the appropriate "Filter" link in the sidebar of the workboard.

proderr-unified 2022-08.png (1×1 px, 110 KB)

For the month-over-month numbers, refer to the spreadsheet data.


Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

Production Excellence #45: June 2022
Krinkle (Timo Tijhof), 2022-07-30
https://phabricator.wikimedia.org/phame/post/view/292/

How are we doing in our strive for operational excellence? Read on to find out!

Incidents

There were 6 incidents in June this year. That's double the median of three per month, over the past two years (Incident graphs).

2022-06-01 cloudelastic
Impact: For 41 days, Cloudelastic was missing search results about files from commons.wikimedia.org.

2022-06-10 overload varnish haproxy
Impact: For 3 minutes, wiki traffic was disrupted in multiple regions for cached and logged-in responses.

2022-06-12 appserver latency
Impact: For 30 minutes, wiki backends were intermittently slow or unresponsive, affecting a portion of logged-in requests and uncached page views.

2022-06-16 MariaDB password
Impact: For 2 hours, a current production database password was publicly known. Other measures ensured that no data could be compromised (e.g. firewalls and selective IP grants).

2022-06-21 asw-a2-codfw power
Impact: For 11 minutes, one of the Codfw server racks lost network connectivity. Among the affected servers was an LVS host. Another LVS host in Codfw automatically took over its load balancing responsibility for wiki traffic. During the transition, there was a brief increase in latency for regions served by Codfw (Mexico, and parts of US/Canada).

2022-06-30 asw-a4-codfw power
Impact: For 18 minutes, servers in the A4-codfw rack lost network connectivity. Little to no external impact.

proderr-incidents 2022-06.png (800×1 px, 139 KB)


Incident follow-up

Recently completed incident follow-up:

Audit database usage of GlobalBlocking extension
Filed by Amir (@Ladsgroup) in May following an outage due to db load from GlobalBlocking. Amir reduced the extensions' DB load by 10%, through avoiding checks for edit traffic from WMCS and Toolforge. And he implemented stats for monitoring GlobalBlocking DB queries going forward.

Reduce Lilypond shellouts from VisualEditor
Filed by Reuven (@RLazarus) and Kunal (@Legoktm) after a shellbox incident. Ed (@Esanders) and Sammy (@TheresNoTime) improved the Score extension's VisualEditor plugin to increase its debounce duration.

Remember to review and schedule Incident Follow-up work in Phabricator! These are preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status on Wikitech.


Trends

In June and July (which is almost over), we reported 27 and 25 new production errors respectively. Of these 52 new issues, 27 have been closed in the weeks since, and 25 remain unresolved and will carry over to August.

We also addressed 25 stagnant problems that we carried over from previous months, thus the workboard overall remains at exactly 299 unresolved production errors.

Take a look at the Wikimedia-production-error workboard and look for tasks that could use your help.

💡 Did you know?

To zoom in and find your team's error reports, use the appropriate "Filter" link in the sidebar of the workboard.

For the month-over-month numbers, refer to the spreadsheet data.

proderr-unified 2022-06.png (1×1 px, 111 KB)


Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

"Mr. Vice President. No numbers, no bubbles."
🔴🟠🟡🟢🔵🟣

GitLab-a-thon!
brennen (Brennen Bearnes), 2022-05-31
https://phabricator.wikimedia.org/phame/post/view/288/

Release Engineering's "GitLab-a-thon" sprint for May 10th-24th (roughly) focused on the mechanics of migrating a Wikimedia service to GitLab, setting up a CI pipeline, building container images from that service, and publishing images to the Wikimedia registry. We selected the Blubber project as a good candidate for experimentation.

We evaluated build mechanisms including GitLab's suggested docker-in-docker, Kaniko, Podman, and BuildKit.

We ultimately landed on BuildKit as the least constraining for future options, and the most in line with features we'd like to offer.

We explored a range of options for building and publishing, including variations on:

  • Building on runners provisioned on a DigitalOcean Kubernetes cluster and importing to the production registry from some trusted location (contint, for example) by way of a shim.
  • Building on trusted runners and publishing to the GitLab Container Registry, then importing to the production registry by way of a shim.
  • Building on trusted runners and publishing directly from there to the prod registry, authenticated against GitLab by way of JWT.

We eventually landed on the latter, and work is well underway on implementation: T308501: Authenticate trusted runners for registry access against GitLab using temporary JSON Web Token

Other work included implementing CI for Blubber on GitLab (T307534), improvements to user-facing documentation (T307535, T307538), enforcing the allowlist for container images in GitLab CI (T291978), experimentation with the GitLab Container Registry (T307537), and extensive discussions with ServiceOps on GitLab infrastructure.

Production Excellence #44: May 2022
Krinkle (Timo Tijhof), 2022-06-16
https://phabricator.wikimedia.org/phame/post/view/285/

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

By golly, we've had quite the month! 10 documented incidents, which is more than three times the two-year median of 3. The last time we experienced ten or more incidents in one month, was June 2019 when we had eleven (Incident graphs, Excellence monthly of June 2019).

I'd like to draw your attention to something positive. As you read the below, take note of incidents that did not impact public services, and did not have lasting impact or data loss. For example, the Apache incident benefited from PyBal's automatic health-based depooling. The deployment server incident recovered without loss thanks to Bacula. The Etcd incident impact was limited by serving stale data. And, the Hadoop incident recovered by resuming from Kafka right where it left off.

proderr-incidents 2022-05.png (800×1 px, 135 KB)

2022-05-01 etcd
Impact: For 2 hours, Conftool could not sync Etcd data between our core data centers. Puppet and some other internal services were unavailable or out of sync. The issue was isolated, with no impact on public services.

2022-05-02 deployment server
Impact: For 4 hours, we could not update or deploy MediaWiki and other services, due to corruption on the active deployment server. No impact on public services.

2022-05-05 site outage
Impact: For 20 minutes, all wikis were unreachable for logged-in users and non-cached pages. This was due to a GlobalBlocks schema change causing significant slowdown in a frequent database query.

2022-05-09 Codfw confctl
Impact: For 5 minutes, all web traffic routed to Codfw received error responses. This affected central USA and South America (local time after midnight). The cause was human error and lack of CLI parameter validation.

2022-05-09 exim-bdat-errors
Impact: Over five days, about 14,000 incoming emails from Gmail users to wikimedia.org were rejected and returned to sender.

2022-05-21 varnish cache busting
Impact: For 2 minutes, all wikis and services behind our CDN were unavailable to all users.

2022-05-24 failed Apache restart
Impact: For 35 minutes, numerous internal services that use Apache on the backend were down. This included Kibana (logstash) and Matomo (piwik). For 20 of those minutes, there was also reduced MediaWiki server capacity, but no measurable end-user impact for wiki traffic.

2022-05-25 de.wikipedia.org
Impact: For 6 minutes, a portion of logged-in users and non-cached pages experienced a slower response or an error. This was due to increased load on one of the databases.

2022-05-26 m1 database hardware
Impact: For 12 minutes, internal services hosted on the m1 database (e.g. Etherpad) were unavailable or at reduced capacity.

2022-05-31 Analytics Hadoop failure
Impact: For 1 hour, all HDFS writes and reads were failing. After recovery, ingestion from Kafka resumed and caught up. No data loss or other lasting impact on the Data Lake.


Incident follow-up

Recently completed incident follow-up:

Invalid confctl selector should either error out or select nothing
Filed by Amir (@Ladsgroup) after the confctl incident this past month. Giuseppe (@Joe) implemented CLI parameter validation to prevent human error from causing a similar outage in the future.

Backup opensearch dashboards data
Filed back in 2019 by Filippo (@fgiunchedi). The OpenSearch homepage dashboard (at logstash.wikimedia.org) was accidentally deleted last month. Bryan (@bd808) tracked down its content and re-created it. Cole (@colewhite) and Jaime (@jcrespo) worked out a strategy and set up automated backups going forward.

Remember to review and schedule Incident Follow-up work in Phabricator! These are preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status on Wikitech.

💡 Did you know? The form on the Incident status page now includes a date, to more easily create backdated reports.

Trends

In May we discovered 28 new production errors, of which 20 remain unresolved and have come with us to June.

Last month the workboard totalled 292 tasks still open from prior months. Since the last edition, we completed 11 tasks from previous months, gained 11 additional errors from May (some of May was counted in last month), and have 7 fresh errors in the current month of June. As of today, the workboard houses 299 open production error tasks (spreadsheet, phab report).

Take a look at the workboard and look for tasks that could use your help.
View Workboard

proderr-unified 2022-05.png (1×1 px, 191 KB)


Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

Production Excellence #43: April 2022
Krinkle (Timo Tijhof), 2022-05-12
https://phabricator.wikimedia.org/phame/post/view/284/

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

Last month we experienced 2 (public) incidents. This is below the three-year median of 3 incidents a month (Incident graphs).

2022-04-06 esams network
Impact: For 30 minutes, wikis were slow or unreachable for a portion of clients to the Esams data center. Esams is one of two DCs primarily serving Europe, Middle East, and Africa.

2022-04-26 cr2-eqord down
Impact: No external impact. Internally, for 2 hours we were unable to access our Eqord routers by any means. This was due to a fiber cut on a redundant link to Eqiad, which then coincided with planned vendor maintenance on the links to Ulsfo and Eqiad. See also Network design.

proderr-incidents 2022-04.png (800×1 px, 127 KB)


Incident follow-up

Remember to review and schedule Incident Follow-up work in Phabricator, which are preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status on Wikitech.

Recently resolved incident follow-up:

Reduce mysql grants for wikiadmin scripts
Filed in 2020 after the wikidata drop-table incident (details). Carried out over the last six months by Amir @Ladsgroup (SRE Data Persistence).

Improve reliability of Toolforge k8s cron jobs and Re-enable CronJobControllerV2
Filed earlier this week after a Toolforge incident and carried out by Taavi @Majavah.


Trends

During the month of April we reported 27 new production errors. Of these new errors, we resolved 14, and the remaining 13 are still open and have carried over to May.

Last month, the workboard totalled 298 unresolved error reports. Of these older reports that carried over from previous months, 16 were resolved. Most of these were reports from before 2019.

The new total, including some tasks for the current month of May, is 292. A slight decrease! (spreadsheet).

Take a look at the workboard and look for tasks that could use your help.

View Workboard

proderr-unified 2022-04.png (1×1 px, 116 KB)


Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

In a fair fight, I'd kill you!
— Well, that's not much incentive for me to fight fair then, is it?

Production Excellence #42: March 2022
Krinkle (Timo Tijhof), 2022-04-21
https://phabricator.wikimedia.org/phame/post/view/283/

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

We've had quite the month, with 8 documented incidents. That's more than double the two-year median of three a month (Incident graphs).

2022-03-01 ulsfo network
Impact: For 20 minutes, clients normally routed to Ulsfo were unable to reach our projects. This includes New Zealand, parts of Canada, and the United States west coast.

2022-03-04 esams availability banner sampling
Impact: For 1.5 hours, all wikis were largely unreachable from Europe (via Esams), with more limited impact across the globe via other data centers as well.

2022-03-06 wdqs-categories
Impact: For 1.5 hours, some requests to the public Wikidata Query Service API were sporadically blocked.

2022-03-10 site availability
Impact: For 12 min, all wikis were unreachable to logged-in users, and to unregistered users trying to access uncached content.

2022-03-27 api
Impact: For ~4 hours, in three segments of 1-2 hours each over two days, there were higher levels of failed or slow MediaWiki API requests.

2022-03-27 wdqs outage
Impact: For 30 minutes, all WDQS queries failed due to an internal deadlock.

2022-03-29 network
Impact: For approximately 5 minutes, Wikipedia and other Wikimedia sites were slow or inaccessible for many users, mostly in Europe/Africa/Asia. (Details not public at this time.)

2022-03-31 api errors
Impact: For 22 minutes, API server and app server availability were slightly decreased (~0.1% errors, all for s7-hosted wikis such as Spanish Wikipedia), and the latency of API servers was elevated as well.

proderr-incidents 2022-03.png (800×1 px, 107 KB)


Incident follow-up

Remember to review and schedule Incident Follow-up (Sustainability) in Phabricator, which are preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status on Wikitech. Some recently completed sustainability work:

Add linecard diversity to router-to-router interconnect at Codfw
Filed by Chris @CDanis (SRE Infra) in 2020 after an incident where all hosts in the Codfw data center lost connectivity at once. Completed by Arzhel @ayounsi and Cathal @cmooney (SRE Infra), and @Papaul (DC Ops), including in Esams where the same issue existed.

Expand parser tests to cover language conversion variants in table-of-contents output
Suggested and carried out by @cscott (Parsoid) after reviewing an incident in November. The TOC on wikis that rely on the LanguageConverter service (such as Chinese Wikipedia) was no longer localized.

Fix unquoted URL parameters in Icinga health checks
Suggested by Riccardo @Volans (SRE Infra) in response to an early warning signal for TLS certificate expiry. He realized that automated checks for a related cluster were still claiming to be in good health, when they in fact should have been firing a similar warning. Carried out by Filippo @fgiunchedi and Daniel @Dzahn.

Provide automation to quickly show replication status when primary is down
Filed in April by Jaime (SRE Data Persistence), carried out by John @jbond and Amir @Ladsgroup.


Trends

Since the last edition, we resolved 24 of the 301 unresolved errors that carried over from previous months.

In March, we created 54 new production errors. That's quite high compared to the twenty-odd reports we find most months. Of these, 17 remain open today a month later.

In the month of April, so far, we reported 20 new errors of which also 17 remain open today.

The production error workboard once again adds up to exactly 298 open tasks (spreadsheet).

Take a look at the workboard and look for tasks that could use your help.

View Workboard

proderr-unified 2022-03.png (1×1 px, 113 KB)


Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

Becky: What do you want?
Gilbert: I want Momma to take aerobics classes. I want Ellen to grow up. I want a new brain for Arnie.
Becky: — What do you want for you? Just for you?
Gilbert: I want to be a good person.

What We Learned from Trainsperiment Week
thcipriani (Tyler Cipriani), 2022-04-20
https://phabricator.wikimedia.org/phame/post/view/281/

Developers should own the process of putting their code into production. They should decide when to deploy, monitor their deployment, and make decisions about rollback.

But that’s not how we work at Wikimedia today, and we on Release Engineering aren’t sure how to get there, so we’ve decided to experiment.

Typically a deployment takes us a full week to complete—the week of March 21st, 2022, we deployed MediaWiki four times.

We called that week 🚂🧪Trainsperiment Week.

📻 Deployment frequency

MediaWiki's mainline branch is changing constantly, but we deploy MediaWiki weekly (kind of). We keep stats that measure how far our main branch is from production.

deployment-fidelity.png (427×627 px, 18 KB)

The trainsperiment changed our deployment frequency, which affected all the other metrics, too. Faster deployment means smaller batch size, and shorter change lead time.

deployment-fidelity-cd.png (427×627 px, 13 KB)

📦 Change lead time

The number that we knew would change during trainsperiment week was change lead time—the time from merge to deploy. If I merge a change, then a minute later I deploy it, that change’s lead time is one minute.

This chart shows the average lead time of all patches in a given train:

image2273.png (877×1 px, 136 KB)

The chart below compares a typical week (1.38.0-wmf.1) to trainsperiment week (1.38.0-wmf.2, wmf.3, and wmf.4). Each dot is a change in a particular version—fewer dots mean fewer changes.

During trainsperiment week, we deployed faster. Each deployment was smaller, and the lead time of each patch in a release was shorter.

Annotated Trainsperiment Week (1).png (448×925 px, 63 KB)

Here’s the same data on a logarithmic scale. During trainsperiment week there were only a few hours between trains, so the lead time could be measured in hours, not days!

leadtime-log.png (665×1 px, 50 KB)

📝 Survey Feedback

At the end of the week, we asked for feedback via the Wikitech-l mailing list. We collected comments from the mediawiki.org talk page and the summaries of candid conversations.

👍 Satisfaction

A small number of people took the time to respond to the survey—20 people answered our questions.

Almost everyone who took the survey seemed satisfied with communication. Most were satisfied with the experiment overall.

There were concerns on the talk page and in the survey responses about testing. Testing felt time-crunched, and everyone was worried about the time pressure on our Quality and Test Engineering Team (QTE).

🚂🧪Trainsperiment Satsifaction (2).png (512×826 px, 21 KB)

🌚 Impact

Less than half of our respondents felt that the Trainsperiment positively impacted their work, with one respondent strongly disagreeing that there was a positive impact.

Most people were neutral about the impact of this experiment on their work.

The person who felt that there was a negative impact was concerned about the lack of time allotted for testing—they urged us to rethink testing if we wanted to try this again.

🚂🧪Trainsperiment Agreement (1).png (514×839 px, 23 KB)

💌 Comments

The survey contained free-form prompts for feedback. Below is a smattering of representative responses. Most of the comments below are amalgamations and simplifications, but the reactions in quotes are verbatim.

What should RelEng have done differently?

  • Automated alerts: emails whenever there’s a deploy or the train is blocked

What would you need to change if we did this every week?

  • No time to find and fix regressions means the QA process would need to change somehow
  • More transparency around when train rolls out and a clearer blocking process
  • Translations
  • “my mental model.”

Other Feedback

  • “With less time between groups, breakage will reach all wikis very quickly”
  • “Often Tuesdays are currently used to deploy bug fixes that are hard to test locally […] we would need to revisit many of our workflows”
  • “This, at least on paper, will help devs”
  • “This was a pure win, IMO.”

🗣️ Conversations

We talked individually to people who had concerns about the experiment on Slack and IRC, in meetings, in the survey feedback, and on the talk page.

People were concerned about shortening the time for review. This is understandable given that we shortened a 168-hour process to a 12-hour process. 

Our QA process takes time. Our overburdened principal engineers take time to review code going live on a weekly basis. Due to some esoteric details, even our CI system gives us more confidence given more time—it was possible that MediaWiki could have broken compatibility with an extension without alerting anyone.

We have come to rely on the weekly cadence to make a careful release, and a faster process would mean rethinking our process pipeline to production.

🎀 Release Engineering's Feedback

The weekly train hides a lot of technical debt—it’s a giant feature flag and the missing testing environment rolled into one. It goes out every week (mostly), and Release Engineering spends about 20% of its time monitoring the release.

During trainsperiment week, we spent 100% of our time deploying—that’s not sustainable for our team.

We surfaced process pain points with this experiment, which was a success. We added to the already overlarge burdens of our principal engineers and quality engineers, which was a failure.

But this isn’t the end of the experiments. We endeavor to bring developers and production closer together—preferably with us standing back a healthy distance. If you’d like to help us get there—get in touch.


Thanks to @kchapman, @brennen, and @Krinkle for reading earlier drafts of this post and offering their feedback.

A Trainsperiments Week Reflectionhttps://phabricator.wikimedia.org/phame/post/view/278/dduvall (Dan Duvall)2022-04-01T02:29:46+00:002022-04-08T13:59:27+00:00

Over here in the Release-Engineering-Team, Train Deployment is usually a rotating duty. We've written about it before, so I won't go into the exact process, but I want to tell you something new about it.

It's awful, incredibly stressful, and a bit lonely.

And last week we ran an experiment where we endeavored to perform the full train cycle four times in a single week... What is wrong with us? (Okay. I need to own this. It was technically my idea.) So what is wrong with me? Why did I wish this on my team? Why did everyone agree to it?

First I think it's important to portray (and perhaps with a little more color) how terrible running the train can be.

How it usually feels to run a Train Deployment and why

Here's a little chugga-choo with a captain and a crew. Would the llama like a ride? Llama Llama tries to hide.

―Llama Llama, Llama Llama Misses Mama

At the outset of many a week I have wondered why, when the kids are safely in childcare and I'm finally in a quiet house well fed and preparing a nice hot shower to not frantically use but actually enjoy, my shoulder is cramping and there's a strange buzzing ballooning in my abdomen.

Am I getting sick? Did I forget something? This should be nice. Why can't I have nice things? Why... Oh. Yes. Right. I'm on train this week.

Train begins in the body before it terrorizes the mind, and I'm not the only one who feels that way.

A week of periodic drudgery which at any moment threatens to tip into the realm of waking nightmare.

―Stoic yet Hapless Conductor

Aptly put. The nightmare is anything from a tiny visual regression to taking some of the largest sites on the Internet down completely.

Giving a presentation but you have no idea what the slides are.

―Bravely Befuddled Conductor

Yes. There's no visibility into what we are deploying. It's a week's worth of changes, other teams' changes, changes from teams with different workflows and development cycles, all touching hundreds of different codebases. The changes have gone through review, they've been hammered by automated tests, and yet we are still too far removed from them to understand what might happen when they're exposed to real world conditions.

It's like throwing a penny into a well, a well of snakes, bureaucratic snakes that hate pennies, and they start shouting at you to fill out oddly specific sounding forms of which you have none.

―Lost Soul been 'round these parts

Kafkaesque.

When under the stress and threat of the aforementioned nightmare, it's difficult to think straight. But we have to. We have to parse and investigate intricate stack traces, run git blame on the deployment server, navigate our bug reporting forms, and try to recall which teams are responsible for which parts of the aggregate MediaWiki codebase we've put together, a codebase that is highly specific to WMF's production installation and really only comes together long after changes merge to the main branches of the constituent codebases.

We have to exercise clear judgement and make decisive calls about whether to roll back partially (previous group) or completely (all groups to previous version). We may have to halt everything and start hollering in IRC, Slack channels, and mailing lists to get the signal to the right folks (wonderful and gracious folks) that no more code changes will be deployed until what we're seeing is dealt with. We have to play the bad guys and gals to get the train back on track.

Trainsperiments Week and what was different about it

Study after study shows that having a good support network constitutes the single most powerful protection against becoming traumatized. Safety and terror are incompatible. When we are terrified, nothing calms us down like a reassuring voice or the firm embrace of someone we trust.

―Bessel Van Der Kolk, M.D., The Body Keeps the Score

Four trains in a single week and everyone in Release Engineering is onboard. What could possibly be better about that?

Well, there is safety in numbers, as they say, and not in some Darwinistic way where most of us will be picked off by the train demons and the others will somehow take solace in their incidental fitness, but in a way where we are mutually trusting, supportive, and feeling collectively resourced enough to do the needful with aplomb.

So we set up video meetings for all scheduled deployment windows and had synchronous handoffs between our European colleagues and our North American ones. We welcomed folks from other teams into our deployments to show them the good, the bad, and the ugly of how their code gets its final send-off 'round the bend and into the setting hot fusion reaction that is production. We found and fixed longstanding and mysterious bugs in our tooling. We deployed four full trains in a single week.

And it felt markedly different.

One of those barn raising projects you read about where everybody pushes the walls up en masse.

―Our Stoic Now Softened but Still Sardonic Conductor

Yes! Lonely and unwitnessed work is de facto drudgery. Toiling safely together we have a greater chance at staving off the stress and really feeling the accomplishment.

Giving a presentation with your friends and everyone contributes one slide.

―Our No Longer Befuddled but Simply Brave Conductor

Many hands make light work!

It was like throwing a handful of pennies into a well, a well of snakes, still bureaucratic and shouty, oh hey but my friends are here and they remind me these are just stack traces, words on a screen, and my friends happen to be great at filling out forms.

―Our Once Lost Now Found Conductor

When no one person is overwhelmed or unsafe, we all think and act more clearly.

The hidden takeaways of Trainsperiment Week

So how should what we've learned during our Trainsperiment Week inform our future deployment strategies and processes? How should train deployments change?

The known hypothesis we wanted to test by performing this experiment was in essence:

  1. More frequent deployments will result in fewer changes being deployed each time.
  2. Fewer changes on average means the deployment is less likely to fail. The deployment is safer.
  3. A safer deployment can be performed more frequently. (Positive feedback loop to #1.)
  4. Overall we will: move faster; break less.

I don't know if we've proved that yet, but we got an inkling that yes, the smaller subsequent deployments of the week did seem to go more smoothly. One week, however, even a week with four deployment cycles, is not a large enough sample to say definitively whether running the train more frequently will result in safer, more frequent deployments with fewer failures.

What was not apparent until we did our retrospective, however, is that it simply felt easier to do deployments together. It was still a kind of drudgery, but it was not abjectly terrible.

My personal takeaway is that a conductor who feels resourced and safe is the basis for all other improvements to the deployment process, and I want conductors to not only have tooling that works reliably with actionable logging at their disposal, but to feel a sense of community there with them when they're pushing the buttons. I want them to feel that the hard calls of whether or not to halt everything and rollback are not just their calls but shared in the moment among numerous people with intimate knowledge of the overall MediaWiki software ecosystem.

Tooling—particularly around error reporting and escalation—is a barrier to entry for sure. Once we've made sufficient improvements there, we need to get that tooling into other people's hands and show them that this process does not have to be so terrifying. And I think we're on the right track here with increased frequency and smaller sets of changes, but we can't lose sight of the human/social element and the foundational basis of safety.

More than anything else, I want wider participation in the train deployment process by engineers in the entire organization along with volunteers.


Thanks to @thcipriani for reading my drafts and unblocking me from myself a number of times. Thanks to @jeena and @brennen for the inspirational analogies.

GitLab: Rethinking how we handle access controlhttps://phabricator.wikimedia.org/phame/post/view/273/brennen (Brennen Bearnes)2022-03-04T22:44:08+00:002022-03-10T00:42:03+00:00

I'll start with a bit of general administrivia. First, our migration of Wikimedia code review & CI to GitLab continues, and we're mindful that people could use regular updates on progress. Second, I need to think through some stuff about the project, and doing that in writing is helpful for all involved. I'm going to try writing occasional blog entries here for both purposes.

Now on to the main topic of this post: Access control for groups and projects on the Wikimedia GitLab instance.

The tl;dr: We've been modeling access to things on GitLab by using groups under /people to contain individual users and then granting those groups access to things under /repos. This has been tricky to explain and doesn't work as well at a technical level as we'd hoped, so we're mostly scrapping the distinction, and moving control of project access to individual memberships in groups under /repos. This should be easier to think about, simpler to manage, and seems like it will suit our needs better. Read on for the nitty-gritty detail.

(Thanks to @Dzahn, @Majavah, @bd808, @AntiCompositeNumber, and @thcipriani for helping me think through the issues underlying this post.)

Background

During the GitLab consultation, when we were working on building up a model of how we'd use GitLab for Wikimedia projects, we wrote up a draft policy for managing users and their access to projects.

GitLab supports Groups. GitLab groups are similar to GitHub's concept of organizations, although the specifics differ. Groups can contain:

  • Other, nested groups
  • Individual projects (repositories & metadata)
  • Users as members; members of other groups can be invited to a group
    • A user who is a member of a top-level group is also a member of every group it contains

We've since changed the original draft policy in some small ways - in particular, we decided to move most projects into a top-level /repos group in order to offer shared CI runners (see T292094). You can read the policy we landed on at the latest revision of GitLab/Policy on mediawiki.org.

The basic idea was that we would separate groups out into:

  1. Sub-groups of /repos: Namespaces for projects, split up by functional area of code
  2. Sub-groups of /people: Namespaces for individual users, split up by organizational units like:
    • Volunteer group
    • Teams at organizations such as the WMF, WMDE, etc.

Groups in /people could then be given access to projects under /repos.

Our hope was that this would let us decouple the management of groups of humans from the individual projects they work on, and ease onboarding for new contributors. A new member of the WMF Release Engineering team, for example, could be added to a single group and then have access to all the things they need to do their job.

We intended for most /people groups to be owned by their members, who would in turn have ownership-level access to their projects under /repos, allowing for contributors to a project to manage access and invite new contributors.

As a concrete example:
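(The group names below are hypothetical, for illustration only.)

  /people/wmf/release-engineering    a group whose members are individual user accounts
  /repos/releng                      a group of projects, e.g. /repos/releng/some-tool
  /repos/releng/some-tool            an individual project (repository and metadata)

The /people group would be invited as a member of /repos/releng with a suitable role, so adding one person to /people/wmf/release-engineering would grant them access to everything under /repos/releng.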

Problems with this scheme

I've been proceeding under this plan as people request the creation of GitLab project groups, but there turn out to be some problems.

First, it doesn't seem like permission inheritance for nested groups with other groups as members works the way you'd expect & hope: See T300939 - "GitLab group permissions are not inherited by sub-groups for groups of users invited to the parent repo".

Second, users have concerns about equity of access and tight coupling of things like employment with a specific organization to project access. We didn't have any intention of modeling any group of users as second-class citizens within this scheme, but it seems to create the impression of one all the same. It's also striking that the set of projects people work on just isn't that cleanly mapped to any particular organizational structure. Once you've been a technical contributor for a while, you've almost certainly collected responsibilities that no org chart reflects accurately.

Finally, and maybe most importantly, this is a complicated way to do things. People have a hard time thinking about it, and it requires a lot of explanation. That seems bad for an abstraction that we'd like to be basically self-serve for most users.

Proposed solution

Mostly, my plan is to use groups closer to how they seem to be designed:

  1. Sub-groups of /repos will contain both individual contributor memberships and projects
  2. Except in occasional one-off cases, access should be granted at the level of a containing group rather than at the level of individual projects, so as to avoid micromanaging access to many projects.
  3. We'll keep /people in mind as a potential solution for some problems (for example, it might be a good tool for synchronizing groups of users from LDAP and granting access to certain projects on that basis), but not rely on it for anything at the moment.
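To make the contrast concrete, here is a sketch of the layout under this plan (again, hypothetical names):

  /repos/releng                      contains projects and individual contributors as direct members
  /repos/releng/some-tool            inherits its access from /repos/releng

Adding a new contributor then means adding their account directly to /repos/releng with an appropriate role (GitLab's Developer or Maintainer, say), rather than routing access through a separate /people group.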

There are some unanswered questions here, but I plan to redraft the policy doc, move existing project layouts to this scheme, and start creating new project groups on this basis in the coming week or so.

My main philosophical takeaway here is that I work with a bunch of anarchists, and it's always best to plan accordingly.

Originally, one of our goals for this migration was avoiding a repeat of the weird, nested morass that is our current set of Gerrit permissions. While it would be a good idea to keep the structure of things on GitLab flatter and easier to think about, I'm no longer that worried about it. Some of the complexity is inherent to any large set of projects and contributors; some of it just reflects a long-lived technical culture that's emergent and largely self-governing, tendencies that nearly always resist well-intentioned efforts to rationalize and map structure to things like official organizational layout.

Diving Into Our Deployment Datahttps://phabricator.wikimedia.org/phame/post/view/272/thcipriani (Tyler Cipriani)2022-02-15T21:17:21+00:002022-02-25T22:11:38+00:00

If you’ve ever experienced the pride of seeing your name on MediaWiki's contributor list, you've been involved in our deployment process (whether you knew it or not).

The Wikimedia deployment process — 🚂🌈 The Train — pushed over 13,000 developer changes to production in 2021. That's more than a change per hour for every single hour of the year—24 hours per day, seven days per week!

Trainbows_Not_Painbows1.svg.png (351×640 px, 33 KB)

As you deploy more software to production, you may begin to wonder: is anything I've been working on going to be deployed this week? What's the status of production? Where can I find data about any of this?

🤔 Current train info

Bryan Davis (@bd808) created the versions toolforge tool in 2017. The versions tool is a dashboard showing the current status of Wikimedia's more than 900 wikis.

Other places to find info about the current deployment:

📈 Past train data

There's an aphorism in management: you can't manage what you can't measure. For years the train chugged along steadily, but it's only recently that we've begun to collect data on its chuggings.

The train stats project started in early 2021 and contains train data going back to March 2016.

Now we're able to talk about our deployments informed by the data. Release-Engineering-Team partnered with Research late last year to explore the data we have.

We're able to see metrics like Lead time and Cycle time:

We measured product delivery lead time as the time it takes to go from code committed to code running in production.

– Accelerate (pg. 14, 15)

leadtime.png (637×1 px, 162 KB)

Our lead time — the time to go from commit in mainline to production — is always less than a week. In the scatter plots above, we can see some evidence of work-life balance: not many patches land two days before deployment — that's the weekend!

For the software delivery process, the most important global metric is cycle time. This is the time between deciding that a feature needs to be implemented and having that feature released to users.

– Continuous Delivery (pg 138)

cycle-time.png (637×1 px, 48 KB)

Our cycle time — the time between a patch requesting code review and its deployment — varies. Some trains have massive outliers. In the chart above, for example, you can see one train that had a patch that was five years old!

It is now possible to see what we on Release Engineering had long suspected: the number of patches for each train has slowly been ticking up over time:

patches-per-train.png (489×1 px, 78 KB)

Also shown above: as the number of patches continues to rise, the number of comments per patch — that is, code-review comments per patch — has dropped.

The data also show that the average number of lines of code per patch is trending slightly upward:

loc-per-patch.png (485×1 px, 55 KB)

🔥 Train derailment

The train-stats repo has data on blockers and delays. Most trains have a small number of blockers and deploy without fanfare. Other trains are plagued by problems that explode into an endless number of blockers — cascading into a series of psychological torments, haunting deployers like the train-equivalent of ringwraiths. Trainwraiths, let’s say.

trainwraith.jpg (332×600 px, 91 KB)

The shape of the histogram of this data shows that blockers per train follows a power law — most trains have a few blockers:

blockers-per-train.png (681×787 px, 15 KB)

Surprisingly, most of our blockers happen before we even start a train. These are bugs from the previous week that we couldn't justify halting everything to fix, but that need to be fixed before we lay down more code on top.

blockers-by-group.png (426×1 px, 17 KB)

The data also let us correlate train characteristics with failure signals. Here we see that the number of patches (“patches”) per train (trending ↑) positively correlates with blockers, and lines of code review (“loc_per_train_bug”) per patch (trending ↓) negatively correlates with blockers — more patches and less code review are both correlated with more blockers:

correlation.png (716×713 px, 82 KB)

Contrast this with Facebook's view of train risk. In a 2016 paper entitled "Development and Deployment at Facebook," Facebook's researchers documented how their Release Engineering team quantified deployment risk:

Inputs affecting the amount of oversight exercised over new code are the size of the change and the amount of discussion about it during code reviews; higher levels for either of these indicate higher risk.
– Development and Deployment at Facebook (emphasis added)

In other words, to Facebook, more code, and more discussion about code, means riskier code. Our preliminary data seem to only partially support this: more code is riskier, but more discussion seems to lessen our risk.

🧭 Explore on your own

This train data is open for anyone to explore. You can download the sqlite database that contains all train data from our gitlab repo, or play with it live on our datasette install.
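If you download the database, a quick way to get oriented before writing queries is to dump the schema from the command line (the file name below is a placeholder; the sqlite3 meta-commands themselves are standard):

# Placeholder file name; check the repository for the actual one.
sqlite3 train-stats.db '.tables'    # list the tables in the train data
sqlite3 train-stats.db '.schema'    # show the columns each table provides

From there, any SQL you write locally can also be run interactively against the datasette install linked above.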

There are a few Jupyter notebooks that explore the data:

An audacious dream for the future of this data is to build a model to quantify exactly how risky a patchset is. We keep data on everything from bugs to rollbacks. Perhaps in future a model will help us roll out code faster and safer.


Thanks to @Miriam, @bd808, and @brennen for reading early drafts of this post: it'd be wronger without their input 💖.

Production Excellence #41: February 2022https://phabricator.wikimedia.org/phame/post/view/267/Krinkle (Timo Tijhof)2022-03-15T00:59:04+00:002022-03-20T12:49:25+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

3 documented incidents last month.

2022-02-01 ulsfo network
Impact: For 3 minutes, clients served by the ulsfo POP were not able to contribute or display un-cached pages.

2022-02-22 wdqs updater codfw
Impact: For 2 hours, WDQS updates failed to be processed. Most bots and tools were unable to edit Wikidata during this time.

2022-02-22 vrts
Impact: For 12 hours, incoming emails to a specific recently created VRTS queue were not processed, with senders receiving a bounce with an SMTP 550 error.

proderr-incidents 2022-02.png (800×1 px, 122 KB)

Figure from Incident graphs.


Incident follow-up

Remember to review and schedule Incident Follow-up work in Phabricator; these are preventive measures and tech debt mitigations written down after an incident is concluded. Read about past incidents at Incident status on Wikitech.

Recently conducted incident follow-up:

Create a dashboard for Prometheus metrics about health of Prometheus itself.
Pitched by CDanis after an April 2019 incident, carried out by Filippo (@fgiunchedi).

Improve wording around AbuseFilter messages about throttling functionality.
Originally filed in 2018. This came up last month during an incident where the wording may've led to a misunderstanding. Now resolved by @Daimona.

Exclude restart procedure from automated Elasticsearch provisioning.
There can be too much automation! Filed after an incident last September. Fixed by @RKemper.


Outstanding errors

Take a look at the workboard and look for tasks that could use your help.

View Workboard

I skip breakdowns most months as each breakdown has its flaws. However, I hear people find them useful, so I'll try to do them from time to time with my noted caveats. The last breakdown was in the December edition, which focussed on throughput during a typical month. It's important to recognise that neither high nor low throughput is per se good or bad. It's good when issues are detected, reported, and triaged correctly. It's also good if a team's components are stable and don't produce any errors. A report may be found to be invalid or a duplicate, which is sometimes only determined a few weeks later.

The "after six months" breakdown below takes more of that into consideration by looking at what's still on the table after six months (tasks up to Sept 2021). This may be considered "fairer" in some sense, although it has the drawback of suffering from hindsight bias, and possibly of not highlighting the current or most urgent areas.

WMF Product:

  • Anti Harassment Tools (3): 1 MW Blocks, 2 SecurePoll.
  • Community Tech (0).
  • Design Systems (1): 1 WVUI.
  • Editing Team (15): 14 VisualEditor, 1 OOUI.
  • Growth Team (13): 11 Flow, 1 GrowthExperiments, 1 MW Recent changes.
  • Language Team (6): 4 ContentTranslation, 1 CX-server, 1 Translate extension.
  • Parsoid Team (9): 8 Parsoid, 1 ParserFunctions extension.
  • Product Infrastructure: 2 JsonConfig, 1 Kartographer, 1 WikimediaEvents.
  • Reading Web (0).
  • Structured Data (4): 2 MW Uploading, 1 WikibaseMediaInfo, 1 3D extension.

WMF Tech:

  • Data Engineering: 1 EventLogging.
  • Fundraising Tech: 1 CentralNotice.
  • Performance: 1 Rdbms.
  • Platform MediaWiki Team (19): 4 MW-Page-data, 1 MW-REST-API, 1 MW-Action-API, 1 MW-Snapshots, 1 MW-ContentHandler, 1 MW-JobQueue, 1 MW-libs-RequestTimeout, 9 Other.
  • Search Platform: 1 MW-Search.
  • SRE Service Operations: 1 Other.

WMDE:

  • WMDE-Wikidata (7): 5 Wikibase, 2 Lexeme.
  • WMDE-TechWish: 1 FileImporter.

Other:

  • Missing steward (7): 2 Graph, 2 LiquidThreads, 2 TimedMediaHandler, 1 MW Special-Contributions-page.
  • Individually maintained (2): 1 WikimediaIncubator, 1 Score extension.

Trends

In February, we reported 25 new production errors. Of those, 13 have since been resolved, and 12 remain open as of today (two weeks into the following month). We also resolved 22 errors that remained open from previous months. The overall workboard has grown slightly to a total of 301 outstanding error reports.

proderr-unified 2022-02.png (1×1 px, 105 KB)

For the month-over-month numbers, refer to the spreadsheet data.


Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

Production Excellence #40: January 2022https://phabricator.wikimedia.org/phame/post/view/266/Krinkle (Timo Tijhof)2022-02-04T04:32:13+00:002022-02-04T16:21:02+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

There were no incidents this January. Phew! Remember to review and schedule Incident Follow-up work in Phabricator. These are preventive measures and tech debt mitigations written down after an incident is concluded. Read about past incidents at Incident status on Wikitech.

proderr-incidents 2022-01.png (800×1 px, 166 KB)


Trends

During 2021, I compared us to the median of 4 incidents per month, as measured over the two years prior (2019-2020).

I'm glad to announce our median has lowered to 3 per month over the past two years (2020-2021). For more plots and numbers about our incident documentation, refer to Incident stats.

Since the previous edition, we resolved 17 tasks from previous months. In January, there were 45 new error reports, of which 28 were resolved within the same month; the remaining 17 have carried over to February.

With precisely 17 tasks both closed and added, the workboard remains at the exact total of 298 open tasks, for the third month in a row. That's quite the coincidence.

Take a look at the workboard and look for tasks that could use your help.

View Workboard

Figure 1: Unresolved error reports by month.

For the month-over-month numbers, refer to the spreadsheet data.


Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

It could mean that that point in time contains some cosmic significance… as if it were the temporal junction point of the entire space-time continuum… Or it could just be an amazing coincidence.

Production Excellence #39: December 2021https://phabricator.wikimedia.org/phame/post/view/265/Krinkle (Timo Tijhof)2022-01-17T22:16:19+00:002022-01-19T04:59:52+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

One documented incident last month:

2021-12-03 mx
Impact: A portion of outgoing email from wikimedia.org was delivered with a delay of up to 24 hours. This affected staff Gmail and Znuny/Phabricator notifications. No mail was lost; it was eventually delivered.

proderr-incidents 2021-12.png (840×2 px, 154 KB)

Image from Incident graphs.


Incident follow-up

Remember to review and schedule Incident Follow-up work in Phabricator. These are preventive measures and tech debt mitigations written down after an incident. Read about past incidents at Incident status on Wikitech.

Recently resolved incident follow-up:

Create paging alert for high MX queues.
Filed in December after the mail delivery incident, resolved later that month by Keith (Herron).

Limit db execution time of expensive MW special pages.
Filed in December after various incidents due to high DB/appserver load, carried out by Amir (Ladsgroup).


Trends

In December we reported 22 new errors, of which 5 have since been resolved; 17 remain open and have carried over to January. Of the 298 issues previously carried over, we also resolved 17, so the workboard still adds up to 298 in total.

In previous editions, we sometimes looked at the breakdown of tasks that remained unresolved. This time, I'd like to draw attention to the throughput and distribution of tasks that did get resolved.

Production errors resolved in the month of December, by team and component (query):

  • Community-Tech (2): GlobalPreferences (1), CodeMirror (1).
  • DBA: DjVuHandler (1).
  • Editing-team: DiscussionTools (1).
  • Fundraising Tech: CentralNotice (1).
  • Growth-Team (8): GrowthExperiments (6), Image-Suggestions (1), StructuredDiscussions (1).
  • Language-Team: UniversalLanguageSelector (1).
  • Parsoid (1).
  • Product-Infrastructure: TemplateStyles (1).
  • Readers-Web (2).
  • Structured-Data (2).
  • Wikidata team: Wikidata-Page-Banner (1).
  • Missing steward (1): MediaWiki-Logevents (T289806: Thanks @Umherirrender!).

Figure 1: Unresolved error reports by month.

For the month-over-month numbers, refer to the spreadsheet data.


Outstanding errors

Take a look at the workboard and look for tasks that could use your help.
View Workboard

Oldest unresolved errors:

  • (June 2020) WikibaseClient: RuntimeException in wblistentityusage API. T254334
  • (June 2020) WikibaseClient: Deadlock in EntityUsageTable::addUsages method. T255706

Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

💡 Did you know:

To find your team's error reports, use the appropriate "Filter" link in the sidebar of the workboard.

Production Excellence #38: November 2021https://phabricator.wikimedia.org/phame/post/view/261/Krinkle (Timo Tijhof)2021-12-12T01:34:21+00:002021-12-13T21:48:59+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

6 documented incidents last month. That's above the two-year and five-year median of 4 per month (per Incident graphs).

2021-11-04 large file upload timeouts
Impact: For 9 months, editors were unable to upload large files (e.g. to Commons). Editors would receive generic error messages, typically after a timeout. In retrospect, a dozen distinct production errors had been reported and regularly observed that were related and provided different clues; however, most of these remained untriaged and uninvestigated for months. This may be related to the affected components having no active code steward.

2021-11-05 TOC language converter
Impact: For 6 hours, wikis experienced a blank or missing table of contents on many pages. For up to 3 days prior, wikis that have multiple language variants (such as Chinese Wikipedia) displayed the table of contents in an incorrect or inconsistent language variant (which may not be understandable to some readers).

2021-11-10 cirrussearch commonsfile outage
Impact: For ~2.5 hours, the Search results page was unavailable on many wikis (except English Wikipedia). On Wikimedia Commons the search suggestions feature was unresponsive as well.

2021-11-18 codfw ipv6 network
Impact: For 8 minutes, the Codfw cluster experienced partial loss of IPv6 connectivity for upload.wikimedia.org. This did not affect availability of the service because the "Happy Eyeballs" algorithm ensures browsers (and other clients) automatically fall back to IPv4. The Codfw cluster generally serves Mexico and parts of the US and Canada. The upload.wikimedia.org service serves photos and other media/document files, such as those displayed in Wikipedia articles.

2021-11-23 core network routing
Impact: For about 12 minutes, Eqiad was unable to reach hosts in other data centers via public IP addresses. This was due to a BGP routing error. There was no impact on end-user traffic, and impact on internal traffic was limited (only Icinga alerts themselves) because internal traffic generally uses local IP subnets which we currently route with OSPF instead of BGP.

2021-11-25 eventgate-main outage
Impact: For about 3 minutes, eventgate-main was down. This resulted in 25,000 MediaWiki backend errors due to inability to queue new jobs. About 1000 user-facing web requests failed (HTTP 500 Error). Event production briefly dropped from ~3000 per second to 0 per second.

proderr-incidents 2021-11.png (800×2 px, 172 KB)


Incident follow-up

Remember to review and schedule Incident Follow-up work in Phabricator; these are preventive measures and tech debt mitigations written down after an incident is concluded. Read more about past incidents at Incident status on Wikitech.

Recently resolved incident follow-up:

Disable DPL on wikis that aren't using it.
Filed after a July 2021 incident, done by Amir (Ladsgroup) and Kunal (Legoktm).

Create easy access to MySQL ports for faster incident response and maintenance.
Filed in Sep 2021, and carried out by Stevie (Kormat).

Create paging alert for primary DB hosts.
Filed after a Sept 2019 incident, done by Stevie (Kormat).


Trends

November saw 27 new production error reports of which 14 were resolved, and 13 remain open and carry over to the next month.

Of the 301 errors still open from previous months, 16 were resolved. Together with the 13 carried over from November that brings the workboard to 298 unresolved tasks.

Unresolved error reports by month.

For the month-over-month numbers, refer to the spreadsheet data.


Outstanding errors
💡 Did you know:

To find your team's error reports, use the appropriate "Filter" link in the sidebar of the workboard.

View Workboard

Issues carried over from recent months:

Apr 2021: 9 of 42 issues left.
May 2021: 16 of 54 issues left.
Jun 2021: 9 of 26 issues left.
Jul 2021: 11 of 31 issues left.
Aug 2021: 10 of 46 issues left.
Sep 2021: 10 of 24 issues left.
Oct 2021: 20 of 49 issues left.
Nov 2021: 13 of 27 new issues are carried forward.

Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

Production Excellence #37: October 2021https://phabricator.wikimedia.org/phame/post/view/260/Krinkle (Timo Tijhof)2021-11-05T02:05:31+00:002021-11-05T02:05:31+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

There were 4 documented incidents last month. This is about average compared to the past five years (per Incident graphs).

2021-10-08 network provider
Impact: For up to an hour, some regions experienced a partial connectivity outage. This primarily affected the US East Coast for ~13 minutes, and Russia for 1 hour. It was caused by a routing problem with one of several redundant network providers.

2021-10-22 eqiad networking
Impact: For ~40 minutes clients that are normally geographically routed to Eqiad experienced connection or timeout errors. We lost about 7K req/s during this time. After initial recovery, Eqiad was ready and repooled in ~10 minutes.

2021-10-25 s3 db replica
Impact: For ~30 minutes, MediaWiki backends were slower than usual. For ~12 hours, many wiki replicas were stale for Wikimedia Cloud Services such as Toolforge.

2021-10-29 graphite
Impact: During a server upgrade, historical data was lost for a subset of Graphite metrics. Some were recovered via the redundant server, but others were lost because the redundant server had also been upgraded since then and had lost some data in a similar fashion.

Remember to review and schedule Incident Follow-up work in Phabricator; these are preventive measures and tech debt mitigations written down after an incident is concluded. Read about past incidents at Incident status on Wikitech.

proderr-incidents 2021-10.png (840×2 px, 182 KB)


Trends
Norwegian blue 🐦

298 bugs were up on the board.
We solved 20 of those over the past thirty days.

How many might now be left unexplored?
We also added new bugs to our database.

Half those bugs are pining for their fjord.
The other 23 carry on, with their dossiers.

All in all, 301 bugs up on the board.

In October, 49 new tasks were reported as production errors. Of these, we resolved 26, and 23 remain unresolved and carry forward to the next month.

Previously, the production error workboard held an accumulated total of 298 still-open error reports. We resolved 20 of those. Together with the 23 new errors carried over from October, this brings us to 301 unresolved errors on the board.

Figure 1: Unresolved error reports by month.

For the month-over-month numbers, refer to the spreadsheet data.


Outstanding errors

Take a look at the workboard and look for tasks that could use your help.

View Workboard

Issues carried over from recent months:

Apr 2021: 9 of 42 issues left.
May 2021: 16 of 54 issues left.
Jun 2021: 9 of 26 issues left.
Jul 2021: 12 of 31 issues left.
Aug 2021: 12 of 46 issues left.
Sep 2021: 11 of 24 issues left.
Oct 2021: 23 of 49 new issues are carried forward.

Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

Production Excellence #36: September 2021https://phabricator.wikimedia.org/phame/post/view/259/Krinkle (Timo Tijhof)2021-10-21T23:31:26+00:002021-10-22T15:11:01+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

We've had quite an eventful month, with 8 documented incidents in September. That's the highest since last year (Feb 2020) and one of the three worst months of the last five years.

  • 2021-09-01 partial Parsoid outage
    • Impact: For 9 hours, 10% of Parsoid requests to parse/save pages were failing on all wikis. Little to no end-user impact, apart from minor effects, thanks to RESTBase retries.
  • 2021-09-04 appserver latency
    • Impact: For 37 minutes, MW backends were slow, with 2% of requests receiving errors. This affected all wikis: logged-in users, bots/API queries, and some page views from unregistered users (e.g. pages that were recently edited or had expired from the CDN cache).
  • 2021-09-06 Wikifeeds
    • Impact: For 3 days, the Wikifeeds API failed ~1% of requests (e.g. 5 of 500 req/s).
  • 2021-09-12 Esams upload
    • Impact: For 20 minutes, images were unavailable for people in Europe, affecting all wikis.
  • 2021-09-13 CirrusSearch restart
    • Impact: For ~2 hours, search was unavailable on Wikipedia from all regions. Search suggestions were missing or slow, and the search results page errored with "Try again later".
  • 2021-09-18 appserver latency
    • Impact: For ~10 minutes, MW backends were slow or unavailable for all wikis.
  • 2021-09-26 appserver latency
    • Impact: For ~15 minutes, MW backends were slow or unavailable for all wikis.
  • 2021-09-29 eqiad kubernetes
    • Impact: For 2 minutes, MW backends were affected by a Kubernetes issue (via Kask sessionstore). 1500 edit attempts failed (8% of POSTs), and logged-in pageviews were slowed down, often taking several seconds.

Remember to review and schedule Incident Follow-up work in Phabricator; these are preventive measures and tech debt mitigations written down after an incident is concluded.

proderr-incidents 2021-09.png (830×1 px, 170 KB)

Image from Incident graphs.


Trends

The month of September saw 24 new production error reports of which 11 have since been resolved, and today, three to six weeks later, 13 remain open and have thus carried over to the next month. This is about average, although it makes it no less sad that we continue to introduce (and carry over) more errors than we rectify in the same time frame.

On the other hand, last month we did have a healthy focus on some of the older reports. The workboard stood at 301 unresolved errors last month. Of those, 16 were resolved. With the 13 new errors from September, this reduces the total slightly, to 298 open tasks.

Unresolved error reports, stacked by month.

For the month-over-month numbers, refer to the spreadsheet data.


Did you know
  • 💡 The default "system error" page now includes a request ID. T291192
  • 💡 To zoom in and find your team's error reports, use the appropriate "Filter" link in the sidebar of the workboard.

Outstanding errors

Take a look at the workboard and look for tasks that could use your help.

View Workboard

Summary over recent months:

Jan 2021 (50 issues): 3 left. Unchanged.
Feb 2021 (20 issues): 5 > 4 left.
Mar 2021 (48 issues): 10 > 9 left.
Apr 2021 (42 issues): 17 > 10 left.
May 2021 (54 issues): 20 > 17 left.
Jun 2021 (26 issues): 10 > 9 left.
Jul 2021 (31 issues): 12 left. Unchanged.
Aug 2021 (46 issues): 17 > 12 left.
Sep 2021 (24 issues): 13 unresolved issues remaining.

Tally
301 issues open, as of Excellence #35 (August 2021).
-16 issues closed, of the previous 301 open issues.
+13 new issues that survived September 2021.
298 issues open, as of today (19 Oct 2021).

Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

Benchmarking MediaWiki with PHPBenchhttps://phabricator.wikimedia.org/phame/post/view/257/kostajh (Kosta Harlan)2021-10-28T13:54:06+00:002022-03-30T12:14:03+00:00

This post gives a quick introduction to a benchmarking tool, phpbench, ready for you to experiment with in core and skins/extensions.[1]

What is phpbench?

From their documentation:

PHPBench is a benchmark runner for PHP analogous to PHPUnit but for performance rather than correctness.

In other words, while a PHPUnit test will tell you if your code behaves a certain way given a certain set of inputs, a PHPBench benchmark only cares how long that same piece of code takes to execute.

The tooling and boilerplate will be familiar to you if you've used PHPUnit. There's a command-line runner at vendor/bin/phpbench, benchmarks are discoverable by default in tests/Benchmark, a configuration file (benchmark.json) allows for setting defaults across all benchmarks, and the benchmark test classes and tests look pretty similar to PHPUnit tests.

Here's an example test for the Html::openElement() function:

namespace MediaWiki\Tests\Benchmark;

class HtmlBench {

	/**
	 * @Assert("mode(variant.time.avg) < 85 microseconds +/- 10%")
	 */
	public function benchHtmlOpenElement() {
		\Html::openElement( 'a', [ 'class' => 'foo' ] );
	}
}

So, taking it line by line:

  • class HtmlBench (placed in tests/Benchmark/includes/HtmlBench.php) – the class where you can define the benchmarks for methods in a class. It would make sense to create a single benchmark class for a single class under test, just like with PHPUnit.
  • public function benchHtmlOpenElement() {} – method names that begin with bench will be executed by phpbench; other methods can be used for set-up / teardown work. The contents of the method are benchmarked, so any set-up / teardown work should be done elsewhere.
  • @Assert("mode(variant.time.avg) < 85 microseconds +/- 10%") – we define a phpbench assertion that the average execution time will be less than 85 microseconds, with a tolerance of +/- 10%.

If we run the test with composer phpbench, we will see that the test passes. One thing to be careful with, though, is adding assertions that are too strict – you would not want a patch to fail CI because the assertion on execution time was not flexible enough (more on this later on).
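Concretely, a run of just this benchmark looks something like the following (the same paths and report are used in the next section; the direct phpbench invocation is equivalent to the composer script):

composer phpbench -- tests/Benchmark/includes/HtmlBench.php --report=aggregate
# or, without the composer wrapper:
vendor/bin/phpbench run tests/Benchmark/includes/HtmlBench.php --config=tests/Benchmark/phpbench.json --report=aggregate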

Measuring performance while developing

One neat feature in PHPBench is the ability to tag current results and compare with another run. Looking at the HtmlBench benchmark test from above, for example, we can compare the work done in rMW5deb6a2a4546: Html::openElement() micro-optimisations to get before and after comparisons of the performance changes.

Here's a benchmark of e82c5e52d50a9afd67045f984dc3fb84e2daef44, the commit before the performance improvements added to Html::openElement() in rMW5deb6a2a4546: Html::openElement() micro-optimisations:

❯ git checkout -b html-before-optimizations e82c5e52d50a9afd67045f984dc3fb84e2daef44 # get the old HTML::openElement code before optimizations
❯ git review -x 727429 # get the core patch which introduces phpbench support
❯ composer phpbench -- tests/Benchmark/includes/HtmlBench.php --tag=original

And the output [2]:

image.png (842×2 px, 176 KB)

Note that we've used --tag=original to store the results. Now we can check out the newer code, and use --ref=original to compare with the baseline:

❯ git checkout -b html-after-optimizations 5deb6a2a4546318d1fa94ad8c3fa54e9eb8fc67c # get the new HTML::openElement code with optimizations
❯ git review -x 727429 # get the core patch which introduces phpbench support
❯ composer phpbench -- tests/Benchmark/includes/HtmlBench.php --ref=original --report=aggregate

And the output [3]:

image.png (798×2 px, 177 KB)

We can see that the execution time roughly halved, from 18 microseconds to 8 microseconds. (For understanding the other columns in the report, it's best to read through the Quick Start guide for phpbench.) PHPBench can also provide an error exit code if the performance decreased. One way that PHPBench might fit into our testing stack would be to have a job similar to Fresnel, where a non-voting comment on a patch alerts developers whether the PHPBench performance decreased in the patch.
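As a rough sketch of that idea (this is not an existing CI job, just an illustration of using the exit code without blocking a patch):

# Hypothetical non-voting CI step: report assertion failures but do not fail the build.
if ! composer phpbench -- tests/Benchmark --report=aggregate; then
    echo "phpbench assertions failed (non-voting); see the report above"
fi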

Testing with extensions

A slightly more complex example is available in GrowthExperiments (patch). That patch makes use of setUp/tearDown methods to prepopulate the database entries needed for the code being benchmarked:

/**
 * @BeforeMethods ("setUpLinkRecommendation")
 * @AfterMethods ("tearDownLinkRecommendation")
 * @Assert("mode(variant.time.avg) < 20000 microseconds +/- 10%")
 */
public function benchFilter() {
	$this->linkRecommendationFilter->filter( $this->tasks );
}

The setUpLinkRecommendation and tearDownLinkRecommendation methods have access to MediaWikiServices, and generally you can do similar things you'd do in an integration test to set up and tear down the environment. This test is towards the opposite end of the spectrum from the core test discussed above, which looks at Html::openElement(); here, the goal is to look at a higher level function that involves database queries and interacting with MediaWiki services.

What's next

You can experiment with the tooling and see if it is useful to you. Some open questions:

  • do we want to use phpbench? or are the scripts in maintenance/benchmarks already sufficient for our benchmarking needs?
  • we already have benchmarking tools in maintenance/benchmarks that extend a Benchmarker class; would it make sense to convert these to use phpbench?
  • what are sensible defaults for "revs" and "iterations" as well as retry thresholds?
  • do we want to run phpbench assertions in CI?
    • if yes, do we want assertions using absolute times (e.g. "this function should take less than 20 ms") or relative assertions ("patch code is within +/- 10% of old code")?
    • if yes, do we want to aggregate reports over time, so we can see trends for the code we benchmark?
    • should we disable phpbench as part of the standard set of tests run by Quibble, and only have it run as a non-voting job like Fresnel?

Looking forward to your feedback! [4]


[1] thank you, @hashar, for working with me to include this in Quibble and roll out to CI to help with evaluation!

[2]

> phpbench run --config=tests/Benchmark/phpbench.json --report=aggregate 'tests/Benchmark/includes/HtmlBench.php' '--tag=original'
PHPBench (1.1.2) running benchmarks...
with configuration file: /Users/kostajh/src/mediawiki/w/tests/Benchmark/phpbench.json
with PHP version 7.4.24, xdebug ✔, opcache ❌

\MediaWiki\Tests\Benchmark\HtmlBench

    benchHtmlOpenElement....................R1 I1 ✔ Mo18.514μs (±1.94%)

Subjects: 1, Assertions: 1, Failures: 0, Errors: 0
Storing results ... OK
Run: 1346543289c75373e513cc3b11fbf5215d8fb6d0
+-----------+----------------------+-----+------+-----+----------+----------+--------+
| benchmark | subject              | set | revs | its | mem_peak | mode     | rstdev |
+-----------+----------------------+-----+------+-----+----------+----------+--------+
| HtmlBench | benchHtmlOpenElement |     | 50   | 5   | 2.782mb  | 18.514μs | ±1.94% |
+-----------+----------------------+-----+------+-----+----------+----------+--------+

[3]

> phpbench run --config=tests/Benchmark/phpbench.json --report=aggregate 'tests/Benchmark/includes/HtmlBench.php' '--ref=original' '--report=aggregate'
PHPBench (1.1.2) running benchmarks...
with configuration file: /Users/kostajh/src/mediawiki/w/tests/Benchmark/phpbench.json
with PHP version 7.4.24, xdebug ✔, opcache ❌
comparing [actual vs. original]

\MediaWiki\Tests\Benchmark\HtmlBench

    benchHtmlOpenElement....................R5 I4 ✔ [Mo8.194μs vs. Mo18.514μs] -55.74% (±0.50%)

Subjects: 1, Assertions: 1, Failures: 0, Errors: 0
+-----------+----------------------+-----+------+-----+---------------+-----------------+----------------+
| benchmark | subject              | set | revs | its | mem_peak      | mode            | rstdev         |
+-----------+----------------------+-----+------+-----+---------------+-----------------+----------------+
| HtmlBench | benchHtmlOpenElement |     | 50   | 5   | 2.782mb 0.00% | 8.194μs -55.74% | ±0.50% -74.03% |
+-----------+----------------------+-----+------+-----+---------------+-----------------+----------------+

[4] Thanks to @zeljkofilipin for reviewing a draft of this post.

How we deploy codehttps://phabricator.wikimedia.org/phame/post/view/253/thcipriani (Tyler Cipriani)2021-09-27T18:44:05+00:002022-04-13T01:13:23+00:00

I broke Wikipedia and then I fixed it badge

Last week I spoke to a few of my Wikimedia Foundation (WMF) colleagues about how we deploy code—I completely botched it. I got too complex too fast. It only hit me later—to explain deployments, I need to start with a lie.

M. Jagadesh Kumar explains:

Every day, I am faced with the dilemma of explaining some complex phenomena [...] To realize my goal, I tell "lies to students."

This idea comes from Terry Pratchett's "lies-to-children" — a false statement that leads to a more accurate explanation. Asymptotically approaching truth via approximation.

Every section of this post is a subtle lie, but approximately correct.

Release Train

The first lie I need to tell is that we deploy code once a week.

Every Thursday, Release-Engineering-Team deploys a MediaWiki release to all 978 wikis. The "release branch" is 198 different branches—one branch each for mediawiki/core, mediawiki/vendor, 188 MediaWiki extensions, and eight skins—that get bundled up via git submodule.
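A sketch of one way to inspect such a bundle locally (the branch name is illustrative; real release branches follow the wmf/<version>-wmf.<week> pattern):

# Illustrative branch name; substitute the current week's release branch.
git clone --branch wmf/1.38.0-wmf.10 https://gerrit.wikimedia.org/r/mediawiki/core
cd core
git submodule update --init        # checks out each bundled extension and skin at its pinned commit
git submodule status | wc -l       # roughly the number of bundled components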

Progressive rollout

The next lie gets a bit closer to the truth: we don't deploy on Thursday; we deploy Tuesday through Thursday.

The cleverly named TrainBranchBot creates a weekly train branch at 2 am UTC every Tuesday.

Release train process

Progressive rollouts give users time to spot bugs. We have an experienced user-base—as Risker attested on the Wikitech-l mailing list:

It's not always possible for even the best developer and the best testing systems to catch an issue that will be spotted by a hands-on user, several of whom are much more familiar with the purpose, expected outcomes and change impact on extensions than the people who have written them or QA'd them.

Bugs

Now I'm nearing the complete truth: we deploy every day except for Fridays.

Brace yourself: we don't write perfect software. When we find serious bugs, they block the release train — we will not progress from Group1 to Group2 (for example) until we fix the blocking issue. We fix the blocking issue by backporting a patch to the release branch. If there's a bug in this release, we patch that bug in our mainline branch, then git cherry-pick that patch onto our release branch and deploy that code.
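A simplified sketch of that backport flow, with placeholder branch and commit names:

git fetch origin
git checkout -b my-backport origin/wmf/1.38.0-wmf.10   # hypothetical current release branch
git cherry-pick -x abc1234                             # the fix already merged on the mainline branch
# The resulting patch is pushed for review and deployed during a backport window.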

We deploy backports three times a day during backport deployment windows.  In addition to backports, developers may opt to deploy new configuration or enable/disable features in the backport deployment windows.

Release engineers train others to deploy backports twice a week.

Emergencies

We deploy on Fridays when there are major issues. Examples of major issues are:

  • Security issues
  • Data loss or corruption
  • Availability of service
  • Preventing abuse
  • Major loss of functionality/visible breakage

We avoid deploying on Fridays because we have a small team of people to respond to incidents. We want those people to be away from computers on the weekends (if they want to be), not responding to emergencies.

Non-MediaWiki code

There are 42 microservices on Kubernetes deployed via helm. And there are 64 microservices running on bare metal. The service owners deploy those microservices outside of the train process.
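For the Kubernetes services, a deployment is conceptually a Helm release upgrade. A generic sketch (WMF's actual tooling wraps this, so the names and values file here are placeholders):

helm upgrade --install my-service ./charts/my-service \
  --namespace my-service \
  -f values-production.yaml    # the new image tag to roll out lives in the values file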

We coordinate deployments on our deployment calendar wiki page.

The whole truth

We progressively deploy a large bundle of MediaWiki patches (between 150 and 950) every week. There are 12 backport windows a week where developers can add new features, fix bugs, or deploy new configurations. There are microservices deployed by developers at their own pace.

Important Resources:

More resources:


Thanks to @brennen, @greg, @KSiebert, @Risker, and @VPuffetMichel for reading early drafts of this post. The feedback was very helpful. Stay tuned for "How we deploy code: Part II."

Production Excellence #35: August 2021https://phabricator.wikimedia.org/phame/post/view/248/Krinkle (Timo Tijhof)2021-09-08T03:53:18+00:002021-10-20T23:01:24+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

Zero documented incidents last month. Isn't that something!

Learn about past incidents at Incident status on Wikitech. Remember to review and schedule Incident Follow-up in Phabricator; these are preventive measures and other action items to learn from.

proderr-incidents 2021-08.png (834×2 px, 288 KB)

Image from Incident graphs.


Trends

In August we resolved 18 of the 156 reports that carried over from previous months, and reported 46 new failures in production. Of the new ones, 17 remain unresolved as of writing and will carry over to next month.

The number of new error reports in August was fairly high at 46, compared to 31 reports in July and 26 reports in June.

Unresolved error reports, stacked by month

The backlog of "Old" issues saw no progress this past month and remained constant at 146 open error reports.

Total open production error tasks, by month.

Unified graph:

proderr-unified 2021-08.png (1×1 px, 78 KB)

💡 Did you know:

You can zoom in to your team's error reports by using the appropriate "Filter" link in the sidebar of our shared workboard.

Take a look at the workboard and look for tasks that could use your help.

View Workboard


Progress

Last few months in review:

Jan 2021 (50 issues): 3 left.
Feb 2021 (20 issues): 6 > 5 left.
Mar 2021 (48 issues): 13 > 10 left.
Apr 2021 (42 issues): 18 > 17 left.
May 2021 (54 issues): 22 > 20 left.
Jun 2021 (26 issues): 11 > 10 left.
Jul 2021 (31 issues): 16 > 12 left.
Aug 2021 (46 issues): + 17 new unresolved issues.

Tally:

156 issues open, as of Excellence #34 (July 2021).
-18 issues closed, of the previously open issues.
+17 new issues that survived August 2021.
155 issues open, as of today (3 Sep 2021).

For more month-over-month numbers refer to the spreadsheet.


Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

Production Excellence #34: July 2021https://phabricator.wikimedia.org/phame/post/view/247/Krinkle (Timo Tijhof)2021-08-19T03:49:57+00:002021-08-21T12:13:43+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

3 documented incidents last month. That's at the median for the past twelve months, and slightly below the median of 4 over the past five years (Incident stats).

proderr-incidents 2021-07.png (796×1 px, 125 KB)

Learn about past incidents at Incident status on Wikitech. Remember to review and schedule Incident Follow-up in Phabricator; these are preventive measures and other action items filed after an incident.


Trends

Last month the workboard held 154 non-old unresolved error reports. Over the past thirty days, the collective efforts of our volunteers and engineering teams have closed 14 of those.

In the month of July we've also introduced or discovered thirty-one new error reports (that's an average of one production regression every day!). Of those new error reports, fifteen were resolved and 16 remain unresolved. The workboard now tallies up to 156 tasks.

Take a look at the workboard and look for tasks that could use your help.

View Workboard

proderr-monthly 2021-08-18.png (1×1 px, 97 KB)

Over on the backlog, we're continuing to ploddingly present progress on production problems from phantoms of christmases past.

proderr-totals 2021-08-18.png (900×1 px, 96 KB)

For more month-over-month numbers refer to the spreadsheet data.


Outstanding errors

Below are various older issues that may have fallen by the wayside, taken from somewhat-random stab-in-the-dark queries.

Oldest unresolved errors that are still reproducible (Phab query):

  • Reported in 2015: Unable to view history of protected Flow board (StructuredDiscussions, Growth team), T118502.
  • Reported in 2016: Error when deleting a heading next to a table (VisualEditor, Editing team), T140871.

Stalled error reports (Phab query):

  • Stalled Mar 2021: Constraints check for Q142 France times out (Wikidata, WMDE), T212282.

Oldest error with a patch for review (Phab query):

  • Reported in 2016: Maps broken during 2nd live preview (Maps, Product Infra), T151524.
  • Reported in 2018: Corrupt connection for cross-wiki db query (Platform team), T193565.

Jan 2021 (3 of 50 issues left): ⚠️ Unchanged. Have a look-see!
Feb 2021 (6 of 20 issues left): ⚠️ Unchanged. Take a gander!
Mar 2021 (13 of 48 issues left): ⚠️ Unchanged. Check it out!
Apr 2021 (18 of 42 issues left): -1
May 2021 (22 of 54 issues left): -3
June 2021 (11 of 26 issues left): -4
July 2021 (16 of 31 issues left): +31; -15

Tally
154 issues open, as of Excellence #33 (June 2021).
-14 issues closed, of the previous 154 open issues.
+16 new issues that survived July 2021.
156 issues open, as of today.

Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

Shrinking the tasks backloghttps://phabricator.wikimedia.org/phame/post/view/244/hashar (Antoine Musso)2021-07-02T15:05:04+00:002021-07-14T18:29:30+00:00

The release engineering team triages tasks flagged Release-Engineering-Team on a weekly basis. It is an all-hands-on-deck, one-hour meeting in which we pick tasks one by one and figure out what to do with them. We started with more than a hundred of them and are now down to just a dozen or so, most of them filed since the last meeting.

I have been doing those routine triages for the projects I closely manage, often on Friday afternoons. I have recently started being a bit more serious about it and even allocated a couple of weeks entirely dedicated to acting on the backlog. This post summarizes some of my discoveries and will hopefully inspire the reader to tackle their own backlogs and technical debt, so that in the end we will have improved our ecosystem.

Finding tasks

Tasks you have filed

I keep filing tasks rather than taking notes or writing emails. I find the Phabricator interface convenient since it lets me flag a task with whatever labels I want (Technical-Debt, Documentation, MediaWiki-General) and subscribe individuals or even a whole team. It is great. With time those tasks pile up and it is easy to forget old ones, so they have to be revisited from time to time. It is as easy as searching for any open tasks I have filed and ordering them by creation date:

AuthorsCurrent viewer
StatusesOpen Stalled
Group ByNone
Order ByCreation (oldest First)

https://phabricator.wikimedia.org/maniphest/query/Wws2E0C7IaFd/#R

The first bug in the list is the oldest you have created and most probably deserves to be acted on. From there, pick the tasks one by one.
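If you prefer the terminal, roughly the same query can be sent through the Conduit API with arcanist. This is only a sketch from memory: the constraint and order names should be double-checked against the Conduit console, and the author PHID is a placeholder for your own.

echo '{
  "constraints": {"authorPHIDs": ["PHID-USER-xxxxxxxxxxxx"], "statuses": ["open"]},
  "order": "oldest"
}' | arc call-conduit -- maniphest.search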

Some will surely be obsolete since they have been acted on or the underlying infrastructure has entirely changed. An example of a six-year-old task I declined is T100099; it followed a meeting about deploying MediaWiki services to Beta-Cluster-Infrastructure. The task had been partially achieved for a few services (notably Parsoid) and was left open since we never moved all services to the same system. Nowadays developers deploy a Docker image and restart the Docker container. The notes are obsolete and the task thus has no purpose anymore.

T149924 came from deploying static web assets using git directly to /srv. However, the partition also hosted dynamically generated content such as all the content from https://doc.wikimedia.org/ , https://integration.wikimedia.org/ or state from a CI daemon. The issue is problematic when we reimage the server, especially during OS upgrades which we do every two years, and the task history reflects that:

  • Filed in 2016 after an OS upgrade
  • The part affecting https://integration.wikimedia.org/ was partially addressed in 2018 as part of an OS upgrade
  • In 2020 we had yet another OS upgrade and this time I decided to complete the task

I completed it because that task showed up in my list of oldest bugs; it kept showing up whenever I did the triage and that was an incentive to get it gone. We are in much better shape now: the services have been decoupled onto different machines and the static assets are deployed using our deployment tool, Scap.

Check your projects

Besides your team projects, you surely have pet side projects or legacy tags you might want to revisit. They can be found by searching for projects you are a member of (assuming you made yourself a member): https://phabricator.wikimedia.org/project/query/JS0zmX.yalpI/#R

For example, I introduced Doxygen to generate the MediaWiki PHP documentation and git-review to assist interactions with Gerrit, for which bugs are tracked in a column of the Gerrit project, and I am probably the only one actively acting on those tasks.

You can again list the tasks filed against each project sorted by creation date, and since you are a member of the project you will most probably be able to act on those old tasks.

One of the oldest tasks I had was T48148, which is about hiding CI or robot comments from Gerrit changes. The task was filed in 2013; I found the upstream proposed solution back in 2019 and, well, *cough*, forgot about it. Since I encountered the task during a triage, I went to tackle it, and in short the required code boils down to adding a single line to the CI configuration:

 gerrit:
   verified: 2
+  tag: autogenerated:ci

That took almost 9 months, since I was not actively triaging old tasks.

Technical debt

Just like we have the generic Documentation tag for any task relating to documentation, we have Technical-Debt to mark a task as requiring extra effort to bring us to modernity. When triaging your own or your projects' tasks, you can flag them as technical debt to easily find them later on.

Some tasks can immediately be filed as technical debt. That was the case with T141324, which is about sending logs of the Gerrit code review system to logstash and thus making them easier to dig through or discover. Sounds simple? Well, not quite.

The story is a bit complicated, but in short: Gerrit is a Java application and our team does not necessarily have much experience with it, and the state of Java logging is a bit unclear (Gerrit uses log4j). Luckily we had some support from actual Java developers and managed to get some logs ingested; though the fields were not properly formatted, it was progress.

After I got assigned as the primary maintainer of our Gerrit setup, I definitely needed proper logging. When we upgraded Gerrit to 3.2, the library we used to format the logs as JSON was no longer provided by upstream, forcing us to maintain a fork of Gerrit just for that purpose.

Luckily upstream has made improvements and I found out it supports JSON logging out of the box, while our logging infrastructure learned to ingest JSON logs. We even got as far as supporting Elastic Common Schema to use predefined field names.

That task was technical debt for 5 years, but since I kept seeing it I kept being reminded of it and eventually managed to address it.

Some tasks cannot be acted on because they depend on an upstream change that might be delayed for various reasons. A massive issue we had encountered since at least 2015 was slowness when doing a git fetch from our busiest repository. I previously blogged about it (Blog Post: Faster source code fetches thanks to git protocol version 2) and Google addressed it by proposing version 2 of the git protocol. It was one of the incentives for us to upgrade Gerrit, and as soon as we upgraded I made a point to test the fix and make it well known to our developers (do use protocol.version=2 in your .gitconfig).
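For anyone who wants to turn it on locally, the setting is a one-liner with the stock git client:

# Enable git wire protocol version 2 for every repository on this machine
git config --global protocol.version 2

# Or try it on a single fetch without touching your configuration
git -c protocol.version=2 fetch origin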

Grooming pleasure

When processing old tasks, you can find it hard to tackle the ones that need focus for a few days if not weeks, as in the example above. But there are also a bunch of little annoying tasks that are surprisingly easy to solve and give an immediate reward. The positive feedback loop will get you in the mood to find more easy tasks and thus reduce your backlog. A few more examples:

T221510, filed in 2019 and addressed two years later, was requesting to expose a machine-readable test coverage report. The file was there (clover.xml); it was simply not exposed on the web page. A simple <a href="clover.xml">clover.xml</a> link was the only thing that was required.

My favorite tasks are obviously the ones that have already been solved and are just pending the paperwork to mark them resolved. T138653 was for a user unable to log in to Gerrit due to a duplicate account; 3 years after it had been filed, the user reported he was able to log in properly and I marked it resolved one hour later. I guess that user was grooming their old tasks as well.

And finally, some old tasks might not be worth fixing. We are probably too kind with those and should be stricter in declining very old tasks. An example is T63733: the MediaWiki source code is deployed to the Wikimedia production cluster under a directory named php-<version>. Surely the php- prefix does not offer any meaningful information. However, since it is hardcoded in various places and would require moving files around on the whole fleet of servers, it would be a bit challenging and definitely a risky change. Should we drop that useless prefix? For sure. Is it worth facing an outage and possibly multiple degraded services? Definitely not, and I have thus declined it.

Production Excellence #33: June 2021https://phabricator.wikimedia.org/phame/post/view/240/Krinkle (Timo Tijhof)2021-07-14T03:34:25+00:002021-07-14T14:05:33+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

3 documented incidents. That's lower than June in the previous five years where the month saw 5-9 incidents. I've added a new panel ⭐️ to the Incident statistics tool. This one plots monthly statistics on top of previous years, to more easily compare them:

proderr-incidents 2021-06.png (381×730 px, 75 KB)

Learn more from the Incident documents on Wikitech, and remember to review and schedule Incident Follow-up in Phabricator, which are preventive measures and other action items filed after an incident.


Trends

In June, work on production errors appears to have stagnated a bit. Or more precisely, the work only resulted in relatively few tasks being resolved. 15 of the 26 new tasks are still open as of writing.

Of the tasks from previous months, only 11 were resolved, leaving most columns unchanged. See the table further down for a more detailed breakdown and links to Phabricator queries for the tasks in question.

With the 15 remaining new tasks, and the 11 tasks resolved from our backlog, this raises the chart from 150 to 154 tasks.

Take a look at the workboard and look for tasks that could use your help.

View Workboard

Unresolved error reports, stacked by month.

Total open production error tasks, by month.

Month-over-month plots based on spreadsheet data.


Outstanding errors

Summary over recent months:

Jan 2020 (1 of 7 left): ⚠️ Unchanged (over one year old).
Mar 2020 (2 of 2 left): ⚠️ Unchanged (over one year old).
Apr 2020 (4 of 14 left): ⚠️ Unchanged (over one year old).
May 2020 (5 of 14 left): ⚠️ Unchanged (over one year old).
Jun 2020 (5 of 14 left): ⚠️ Unchanged (over one year old).
Jul 2020 (4 of 24 issues): ⚠️ Unchanged (over one year old).
Aug 2020 (11 of 53 issues): ⬇️ One task resolved. (-1)
Sep 2020 (7 of 33 issues): ⚠️ Unchanged (over one year old).
Oct 2020 (19 of 69 issues): ⚠️ Unchanged (over one year old).
Nov 2020 (8 of 38 issues): ⚠️ Unchanged (over one year old).
Dec 2020 (7 of 33 issues): ⚠️ Unchanged (over one year old).
Jan 2021 (3 of 50 issues): ⚠️ Unchanged (over one year old).
Feb 2021 (6 of 20 issues): ⬇️ One task resolved. (-1)
Mar 2021 (13 of 48 issues): ⬇️ One task resolved. (-1)
Apr 2021 (19 of 42 issues): ⬇️ Four tasks resolved. (-4)
May 2021 (25 of 54 issues): ⬇️ Four tasks resolved. (-4)
June 2021 (15 of 26 issues): 📌 26 new issues, of which 11 were closed. (+26, -11)

Tally
150 issues open, as of Excellence #32 (May 2021).
-11 issues closed, of the previous 150 open issues.
+15 new issues that survived June 2021.
154 issues open as of yesterday.

Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof

🕳 O'Neill: We've done this!
Dr Jackson: We do this every day.
O'Neill: I'm not talking about briefings in general, Daniel, I'm talking about this briefing; I'm talking about this day.
Teal'c: Col. O'Neill is correct. Events do appear to be repeating themselves.

Production Excellence #32: May 2021https://phabricator.wikimedia.org/phame/post/view/236/Krinkle (Timo Tijhof)2021-06-21T01:31:27+00:002021-06-21T17:47:50+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

Zero incidents recorded in the past month. Yay! That's only five months after November 2020, the last month without documented incidents (Incident stats).

Remember to review Preventive measures in Phabricator, which are action items filed after an incident.


Trends

In May, we unfortunately saw a repeat of the worrying pattern we saw in April, but with higher numbers. We found 54 new errors. This is the most new errors in a single month, since the Excellence monthly began three years ago in 2018. About half of these (29 of 54) remain unresolved as of writing, two weeks into the following month.

Unresolved error reports, stacked by month.

Total open production error tasks, by month.

Month-over-month plots based on spreadsheet data.


New errors in May

Below is a snapshot of just the 54 new issues found last month, listed by their code steward.

Be mindful that the reporting of errors is not itself a negative point per se. I think it should be celebrated when teams have good telemetry, detect their issues early, and address them within their development cycle. It might be more worrisome when teams lack telemetry or time to find such issues, or can't keep up with the pace at which issues are found.

Anti Harassment Tools: None.
Community Tech: None.
Editing Team: +2, -1. Cite (T283755); OOUI (T282176).
Growth Team: +17, -4. Add-Link (T281960); GrowthExperiments (T281525 T281703 T283546 T283638 T283924); Echo (T282446); Recent-changes (T282047 T282726); StructuredDiscussions (T281521 T281523 T281782 T281784 T282069 T282146 T282599 T282605).
Language Team: +1. Translate extension (T283828).
Parsing Team: +1. Parsoid (T281932).
Reading Web: None.
Structured Data: None.
Product Infra Team: +1. WikimediaEvents (T282580).
Analytics: None.
Performance Team: None.
Platform Engineering: +16, -11. MediaWiki-API (T282122); MediaWiki-General (T282173); MediaWiki-Page-derived-data (T281714 T281802 T282180 T283282), MediaWiki-Revision-backend (T282145 T282723 T282825 T283170); MediaWiki-User-management (T283167); MW Expedition (T281526 T281981 T282038 T282181 T283196).
Search Platform: +3, -2. CirrusSearch (T282036 T282207); GeoData (T282735).
WMDE TechWish: +2, -1. Revision-Slider (T282067); VisualEditor Template dialog (T283511).
WMDE Wikidata: +3, -1. Wikibase (T282534 T283198 T283862).
No owner: +7, -6. CentralAuth (T282834 T283635); Change-tagging (T283098 T283099); MapSources (T282833); MediaWiki-Page-information (T283751); Other (T283252).

Outstanding errors

Take a look at the workboard and look for tasks that could use your help.

View Workboard

Summary over recent months:

Aug 2019 (0 of 14 left): ✅ Last task resolved! (-1)
Jan 2020 (1 of 7 left): ⚠️ Unchanged (over one year old).
Mar 2020 (2 of 2 left): ⚠️ Unchanged (over one year old).
Apr 2020 (4 of 14 left): ⬇️ One task resolved. (-1)
May 2020 (5 of 14 left): ⚠️ Unchanged (over one year old).
Jun 2020 (5 of 14 left): ⚠️ Unchanged (over one year old).
Jul 2020 (4 of 24 issues): ⏸ —
Aug 2020 (12 of 53 issues): ⬇️ One task resolved. (-1)
Sep 2020 (7 of 33 issues): ⏸ —
Oct 2020 (19 of 69 issues): ⬇️ One task resolved. (-1)
Nov 2020 (8 of 38 issues): ⬇️ One task resolved. (-1)
Dec 2020 (7 of 33 issues): ⏸ —
Jan 2021 (3 of 50 issues): ⏸ —
Feb 2021 (7 of 20 issues): ⬇️ One task resolved. (-1)
Mar 2021 (14 of 48 issues): ⬇️ Four tasks resolved. (-4)
Apr 2021 (23 of 42 issues): ⬇️ Two tasks resolved. (-2)
May 2021 (29 of 54 issues): 54 new issues found, of which 29 remain open. (+54; -25)

Tally
133 issues open, as of Excellence #31 (12 May 2021).
-12 issues closed, of the previous 133 open issues.
+29 new issues that survived May 2021.
150 issues open, as of today (12 June 2021).

Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


Footnotes:
Incident status, Wikitech.
Wikimedia incident stats by Krinkle, CodePen.
Production error data (spreadsheet and plots).
Phabricator report charts for Wikimedia-production-error project.

Production Excellence #31: April 2021https://phabricator.wikimedia.org/phame/post/view/235/Krinkle (Timo Tijhof)2021-05-13T03:49:23+00:002021-06-12T17:39:06+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

6 documented incidents. That's above the historical average of 3–4 per month.

Learn about recent incidents at Incident status on Wikitech, or Preventive measures in Phabricator.


Trends

In April, we saw a continuation of the healthy trend that started this January — a trend where the back of the line is moving forward at least as quickly as the front of the line. We did take a little breather in March where we almost broke even, but otherwise the trend is going well.

Last month we bade farewell to the production errors we found in July 2019. This month we cleared out the column for October 2019.

One point of concern is that we did encounter a high number of new production errors — errors that we failed to catch during development, code review, continuous integration, beta testing, or pre-deployment checks. Where we used to discover about a dozen of those a month, we found 42 during this month. As of writing, 17 of the 42 April-discovered errors have been resolved.

The "Old" column (generally tracking pre-2019 tasks) grew for the first time in six months. This increase can largely be attributed to improved telemetry of client-side errors uncovering issues in under-resourced products, such as the old Kaltura video player.

Unresolved error reports, stacked by month.

Total open production error tasks, by month.

Month-over-month plots based on spreadsheet data.


Outstanding errors

View Workboard

Summary over recent months, per spreadsheet:

Aug 2019 (1 of 14 left): ⚠️ Unchanged (over one year old).
Oct 2019 (0 of 12 left): ✅ Last three tasks resolved! (-3)
Jan 2020 (1 of 7 left): ⚠️ Unchanged (over one year old).
Mar 2020 (2 of 2 left): ⚠️ Unchanged (over one year old).
Apr 2020 (5 of 14 left): ⚠️ Unchanged (over one year old).
May 2020 (5 of 14 left): ⏸ —
Jun 2020 (5 of 14 left): ⬇️ One task resolved. (-1)
Jul 2020 (4 of 24 issues): ⬇️ One task resolved. (-1)
Aug 2020 (13 of 53 issues): ⬇️ Two tasks resolved. (-2)
Sep 2020 (7 of 33 issues): ⏸ —
Oct 2020 (20 of 69 issues): ⬇️ Two tasks resolved. (-2)
Nov 2020 (9 of 38 issues): ⏸ —
Dec 2020 (7 of 33 issues): ⬇️ Four tasks resolved. (-4)
Jan 2021 (3 of 50 issues): ⬇️ One task resolved. (-1)
Feb 2021 (8 of 20 issues): ⬇️ One task resolved. (-1)
Mar 2021 (18 of 48 issues): ⬇️ Sixteen tasks resolved. (-16)
Apr 2021 (25 of 42 issues): 42 new issues found, of which 25 remained open. (+42; -17)

Tally
139 issues open, as of Excellence #30 (March 2021).
-31 issues closed, of the previously open issues.
+25 new issues that survived April 2021.
133 issues open, as of today (12 May 2021).

Take a look at the workboard and look for tasks that could use your help:

View Workboard


Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in production!

Until next time,

– Timo Tijhof


🎥 McMurphy: That nurse, man... she, uh, she ain't honest.
Doctor: Ah now, look. Miss Ratched is one of the finest nurses we've got in this institution.
McMurphy: Ha! Well […] She likes a rigged game, know what I mean?

Tracking memory issue in a Java applicationhttps://phabricator.wikimedia.org/phame/post/view/232/hashar (Antoine Musso)2021-03-12T09:38:31+00:002021-04-02T13:01:35+00:00

One of the critical pieces of our infrastructure is Gerrit. It hosts most of our git repositories and is the primary code review interface. Gerrit is written in the Java programming language and runs in the Java Virtual Machine (JVM). For a couple of years we had been struggling with memory issues which eventually led to an unresponsive service and unattended restarts. The symptoms were the usual ones: application responses slowing and degrading until server-side errors rendered the service unusable. Eventually the JVM would terminate with:

java.lang.OutOfMemoryError: Java heap space

This post is my journey toward identifying the root cause and having it fixed by the upstream developers. Given I barely knew anything about Java, and much less about its ecosystem and tooling, I learned more than a few things along the way and felt it was worth sharing.

Prior work

The first meaningful task was in June 2019 (T225166), which over several months led us to:

  • replace the aging underlying hardware
  • tune the memory garbage collector and switch to the G1 garbage collector
  • raise the amount of memory allocated to the JVM (the heap)
  • upgrade the Debian operating system by two major releases (Jessie → Stretch → Buster)
  • conduct a major upgrade of Gerrit (June 2020, Gerrit 2.15 → 3.2)
  • move bots crawling the repositories to a replica
  • fix a lack of caching in a MediaWiki extension querying Gerrit more than it should have

All of those were sane operations that are part of any application life-cycle, and some were meant to address other issues. Raising the maximum heap size (20G to 32G) definitely reduced the frequency of crashes.

Still, we had memory filling up over and over. The graph below shows the memory usage from September 2019 to September 2020. The increase of maximum heap usage in October 2020 is the JVM heap being raised from 20G to 32G. Each of the "little green hills" corresponds to memory filling up until we either restarted Gerrit or the JVM crashed unattended:

Gerrit_3months_usedMemory.png (499×1 px, 21 KB)

Zooming in on a week, it is clear that the memory almost entirely filled up until we had to restart:

gerrit_used_memory.png (499×1 px, 24 KB)

This had to stop: complaints about Gerrit being unresponsive, SRE having to respond to java.lang.OutOfMemoryError: Java heap space, or us having to "proactively" restart before a weekend. None of this was good practice. Back and fresh from vacation, I filed a new task T263008 in September 2020 and started to tackle the problem in my spare time. Would I be able to find my way in an ecosystem totally unknown to me?

Challenge accepted!

Stuff learned

  • Routine maintenance is definitely needed
  • Don't expect things to magically resolve; commit to thoroughly identifying the root cause instead of hoping.

Looking at memory

Since the JVM runs out of memory, let's look at memory allocation. The JDK provides several utilities to interact with a running JVM, be it to attach a debugger, write a copy of the whole heap, or send admin commands to the JVM.

jmap lets one take a full capture of the memory used by a Java virtual machine. It has to run as the same user as the application (we use the Unix username gerrit2) and, when multiple JDKs are installed, one has to make sure to invoke the jmap provided by the Java version running the targeted JVM.

Dumping the memory is then simply:

sudo -u gerrit2 /usr/lib/jvm/java-8-openjdk-amd64/bin/jmap \
  -dump:live,format=b,file=/var/lib/gerrit-202009170755.hprof <pid of java process here>

It takes a few minutes depending on the number of objects. The resulting .hprof file is a binary format, which can be interpreted by various tools.

jhat, a Java heap analyzer, is provided by the JDK alongside jmap. I ran it with tracking of object allocations disabled (-stack false) as well as references to objects (-refs false), since even with 64G of RAM and 32 cores it took a few hours and eventually crashed. That is due to the insane amount of live objects. On the server I thus ran:

/usr/lib/jvm/java-8-openjdk-amd64/bin/jhat -stack false -refs false gerrit-202009170755.hprof

It spawns a web service which I can reach from my machine over ssh using some port redirection, then open a web browser for it:

ssh  -C -L 8080:ip6-localhost:7000 gerrit1001.wikimedia.org &
xdg-open http://ip6-localhost:8080/

Instance Counts for All Classes (excluding native types)

2237744 instances of class org.eclipse.jgit.lib.ObjectId
2128766 instances of class org.eclipse.jgit.lib.ObjectIdRef$PeeledNonTag
735294 instances of class org.eclipse.jetty.util.thread.Locker
735294 instances of class org.eclipse.jetty.util.thread.Locker$Lock
735283 instances of class org.eclipse.jetty.server.session.Session
...

Another view shows 3.5G of byte arrays.

I got pointed to https://heaphero.io/, however the file is too large to upload and it contains sensitive information (credentials, users' personal information) which we cannot share with a third party.

Nothing was really conclusive at this point; the heap dump had been taken shortly after a restart and Gerrit was not in trouble.

Eventually I found that JavaMelody has a view providing the exact same information without all the trouble of figuring out the proper set of parameters for jmap, jhat and ssh. Just browse to the monitoring page and:

gerrit_javamelody_heaphisto.png (517×990 px, 131 KB)

Stuff learned

  • jmap to issue commands to the JVM, including taking a heap dump
  • jhat to run analysis, with some options required to make it workable
  • Use JavaMelody instead

JVM handling of out of memory error

An idea was to take a heap dump whenever the JVM encounters an out of memory error. That can be turned on by passing the extended option HeapDumpOnOutOfMemoryError to the JVM and specifying where the dump will be written to with HeapDumpPath:

java \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/srv/gerrit \
  -jar gerrit.war ...

And surely next time it ran out of memory:

Nov 07 13:43:35 gerrit2001 java[30197]: java.lang.OutOfMemoryError: Java heap space
Nov 07 13:43:35 gerrit2001 java[30197]: Dumping heap to /srv/gerrit/java_pid30197.hprof ...
Nov 07 13:47:02 gerrit2001 java[30197]: Heap dump file created [35616147146 bytes in 206.962 secs]

Which results in a 34GB dump file, which was not convenient for a full analysis. Even with 16G of heap for the analysis and a couple of hours of CPU churning, it was not of any help.

And at this point the JVM is still around, the java process is still there and thus systemd does not restart the service for us even though we have instructed it to do so:

/lib/systemd/system/gerrit.service
[Service]
ExecStart=java -jar gerrit.war
Restart=always
RestartSec=2s

That led to our Gerrit replica being down for a whole weekend with no alarm whatsoever (T267517). I imagine the reason for the JVM not exiting on an OutOfMemoryError is to let one investigate the cause. Just like the heap dump, the behavior can be configured via the ExitOnOutOfMemoryError extended option:

java -XX:+ExitOnOutOfMemoryError

Next time the JVM will exit causing systemd to notice the service went away and so it will happily restart it again.
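Putting the options together, the relevant part of the invocation ends up looking like the sketch below; the heap size and dump path are illustrative rather than a copy of our exact production unit:

java \
  -Xmx32g \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/srv/gerrit \
  -XX:+ExitOnOutOfMemoryError \
  -jar gerrit.war ...

With the exit option in place, the java process actually goes away on an OutOfMemoryError, so the Restart=always stanza in the systemd unit can finally do its job.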

Stuff learned

  • Automatic heap dumping with the JVM for future analysis
  • Be sure to have the JVM exit when running out of memory so systemd will restart the service
  • A process can be up while still not serving its purpose

Side track to jgit cache

When I filed the task, I suspected enabling git protocol version 2 (J199) on CI might have been the root cause. That eventually led me to look at how Gerrit caches git operations. Being a Java application, it does not use the regular git command but a pure Java implementation, jgit, a project started by the same author as Gerrit (Shawn Pearce).

To speed up operations, jgit keeps git objects in memory, with various tuning settings. You can read more about it at T263008#6601490, but in the end it was of no use for this problem. @thcipriani would later point out that the jgit cache does not grow past its limit:

Screenshot-2020-12-21-13:09:10.png (749×1 px, 152 KB)

The investigation was not a good lead, but it did prompt us to get a better view of what is going on in the jgit cache. To do so we would need to expose historical metrics of the cache status.

Stuff learned

  • jgit has in-memory caches to hold frequently accessed repositories and objects in JVM memory, speeding up access to them.

Metrics collection

We always had trouble determining whether our jgit cache was properly sized and tuned it randomly with little information. Eventually I found out that Gerrit does have a wide range of metrics available which are described at https://gerrit.wikimedia.org/r/Documentation/metrics.html . I always wondered how we could access them without having to write a plugin.

The first step was to add the metrics-reporter-jmx plugin. It registers all the metrics with JMX, a Java system to manage resources. That is then exposed by JavaMelody and at least lets us browse the metrics:

gerrit_jgit_cache_metrics.png (329×422 px, 34 KB)

I had long had a task to get those metrics exposed (T184086) but never had a strong enough incentive to work on it. The idea was to expose those metrics to the Prometheus monitoring system, which would scrape them and make them available in Grafana. They can be exposed using the metrics-reporter-prometheus plugin. There is some configuration required to create an authentication token that lets Prometheus scrape the metrics, and it is then all set and collected.
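As a quick sanity check that the endpoint answers before wiring up the Prometheus scrape job, something along these lines works; the token is a placeholder and the exact path may differ depending on the plugin version and configuration:

curl -s -H "Authorization: Bearer <prometheus-bearer-token>" \
  https://gerrit.example.org/plugins/metrics-reporter-prometheus/metrics | head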

In Grafana, discovering which metrics are of interest might be daunting. For the jgit cache there are only a few metrics we are interested in, and crafting a basic dashboard for them is simple enough. But since we now collect all those metrics, we really should have dashboards for anything that could be of interest to us.

While browsing the Gerrit upstream repositories, I found an unadvertised repository: gerrit/gerrit-monitoring. The project aims at deploying to Kubernetes a monitoring stack for Gerrit composed of Grafana, Loki, Prometheus and Promtail. While browsing the code, I found out they already had a Grafana template which I could import into our Grafana instance with a few small modifications.

During the Gerrit Virtual Summit I raised that as a potentially interesting project for the whole community, and sure enough, a few days later upstream started advertising it more widely.

In the end we have a few useful Grafana dashboards; the ones imported from the gerrit-monitoring repo are suffixed with (upstream): https://grafana.wikimedia.org/dashboards/f/5AnaHr2Mk/gerrit

And I crafted one dedicated to jgit cache: https://grafana.wikimedia.org/d/8YPId9hGz/jgit-block-cache

Stuff learned

  • Prometheus scraping system with auth token
  • Querying Prometheus metrics in Grafana and its vector selection mechanism
  • Other Gerrit administrators had already created visualizations
  • Mentioning our reuse prompted upstream to further advertise their solution, which hopefully has led to more adoption.

Despair

After a couple of months, there was no good lead. The issue had been around for a while, in a programming language I don't know, with assisting tooling completely alien to me. I even found jcmd to issue commands to the JVM, such as dumping a class histogram, the same view provided by JavaMelody:

$ sudo -u gerrit2 jcmd 2347 GC.class_histogram
num     #instances         #bytes  class name
----------------------------------------------
   5:      10042773     1205132760  org.eclipse.jetty.server.session.SessionData
   8:      10042773      883764024  org.eclipse.jetty.server.session.Session
  11:      10042773      482053104  org.eclipse.jetty.server.session.Session$SessionInactivityTimer$1
  13:      10042779      321368928  org.eclipse.jetty.util.thread.Locker
  14:      10042773      321368736  org.eclipse.jetty.server.session.Session$SessionInactivityTimer
  17:      10042779      241026696  org.eclipse.jetty.util.thread.Locker$Lock

That is quite handy when already in a terminal; it saves a few clicks to switch to a browser, head to JavaMelody and find the link.

But it is the last week of work of the year.

Christmas is in two days.

Kids are messing up all around the home office since we are under lockdown.

Despair.

Out of rage I just stalled the task, shamelessly hoping for the Java 11 and Gerrit 3.3 upgrades to solve this. Much like we had hoped the system would be fixed by upgrading.

Wait..

1 million?

ONE MILLION ??

TEN TO THE POWER OF SIX ???

WHY IS THERE A MILLION HTTP SESSIONS HELD IN GERRIT !!!!!!?11??!!??

10042773  org.eclipse.jetty.server.session.SessionData

There. Right there. It had been there since the start. In plain sight. And sure enough, 19 hours later Gerrit had created 500k sessions for 56 MBytes of memory. It was slowly but surely leaking memory.

Stuff learned

  • Everything clears up once one has found the root cause

When upstream saves you

At this point it was just an intuition, albeit a strong one. I don't know much about Java or Gerrit internals, so I went to ask the upstream developers for further assistance. But first, I had to reproduce the issue and investigate a bit more to give as many details as possible when filing a bug report.

Reproduction

I copied a small heap dump I had taken just a few minutes after Gerrit got restarted; it had a manageable size, making it easier to investigate. Since I am not that familiar with the Java debugging tools, I went with what I call a clickodrome interface, a UI that lets you interact solely with mouse clicks: https://visualvm.github.io/

Once the heap dump was loaded, I could easily access objects. Notably, the org.eclipse.jetty.server.session.Session objects had a property expiry=0, often an indication of no expiry at all. Expired sessions are cleared by Jetty via a HouseKeeper thread which inspects sessions and deletes expired ones. I confirmed it does run every 600 seconds, but since the sessions are set to never expire, they pile up, leading to the memory leak.

On December 24th, a day before Christmas, I filed a private security issue to upstream (now public): https://bugs.chromium.org/p/gerrit/issues/detail?id=13858

After the Christmas and weekend break upstream acknowledged and I did more investigating to pinpoint the source of the issue. The sessions are created by a SessionHandler and debug logs show: dftMaxIdleSec=-1 or Default maximum idle seconds set to -1, which means that by default the sessions are created without any expiry. The Jetty debug log then gave a bit more insight:

DEBUG org.eclipse.jetty.server.session : Session xxxx is immortal && no inactivity eviction

It is immortal and is thus never picked up by the session cleaner:

DEBUG org.eclipse.jetty.server.session : org.eclipse.jetty.server.session.SessionHandler
==dftMaxIdleSec=-1 scavenging session ids []
                                          ^^^ --- empty array

Our Gerrit instance has several plugins and the leak could potentially come from one of them. I then booted a dummy Gerrit on my machine (java -jar gerrit-3.3.war), cloned the built-in All-Projects.git repository repeatedly, and observed objects with VisualVM. Jetty sessions with no expiry were created, which rules out plugins and points at Gerrit itself. Upstream developer Luca Milanesio pointed out that Gerrit creates a Jetty session which is intended for plugins. I also narrowed down the leak to only be triggered by git operations made over HTTP. Eventually, by commenting out a single line of Gerrit code, I eliminated the memory leak, and upstream pointed at a change released a few versions ago that may have been the cause.
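For reference, the reproduction was nothing more sophisticated than hammering the HTTP clone endpoint in a loop while watching the session count in VisualVM. A sketch of the idea, with the URL, port and iteration count being illustrative values for a local test instance rather than our production setup:

# Repeatedly clone the built-in repository over HTTP and throw the copy away.
# On the affected versions, each request leaves behind a never-expiring Jetty session.
for i in $(seq 1 500); do
    git clone --quiet http://localhost:8080/All-Projects /tmp/leak-repro && rm -rf /tmp/leak-repro
done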

Upstream then went on to reproduce on their side, took some measurements before and after commenting the line out, and confirmed the leak (750 bytes for each git request made over HTTP). Given the amount of traffic we receive from humans, systems and bots, it is not surprising we ended up hitting the JVM memory limit rather quickly.

Eventually the fix landed and new Gerrit versions were released. We upgraded to the new release and haven't restarted Gerrit since. Problem solved!

Stuff learned

  • Even with no knowledge of a programming language, if you can build and run it, you can still debug using print statements or the universal optimization operator: //.
  • Quickly acknowledge upstream hints, ideas and recommendations. Even if it is to dismiss one of their leads.
  • Write a report, this blog.

Thank you upstream developers Luca Milanesio and David Ostrovsky for fixing the issue!

Thank you @dancy for the added clarifications as well as typos and grammar fixes.


Production Excellence #30: March 2021https://phabricator.wikimedia.org/phame/post/view/229/Krinkle (Timo Tijhof)2021-04-03T00:20:25+00:002021-05-12T22:57:28+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

Incidents

2 documented incidents. That's average for this time of year, when we usually had 1-4 incidents.

Learn about recent incidents at Incident status on Wikitech, or Preventive measures in Phabricator.


Trends

In March we made significant progress on the outstanding errors of previous months. Several of the 2020 months are finally starting to empty out. But with over 30 new tasks from March itself remaining, we did not break even, and ended up slightly higher than last month. This could be reversing two positive trends, but I hope not.

Firstly, there was a steep increase in the number of new production errors that were not resolved within the same month. This is counter the positive trend we started in November. The past four months typically saw 10-20 errors outlive their month of discovery, and this past month saw 34 of its 48 new errors remain unresolved.

Secondly, we saw the overall number of unresolved errors increase again. This January began a downward trend for the first time in thirteen months, which continued nicely through February. But, this past month we broke even and even pushed upward by one task. I hope this is just a breather and we can continue our way downward.

Unresolved error reports, stacked by month.

Total open production error tasks, by month.

Month-over-month plots based on spreadsheet data.


Outstanding errors

Take a look at the workboard and look for tasks that could use your help:

View Workboard

Summary over recent months, per spreadsheet:

Jul 2019 (0 of 18 left): ✅ Last two tasks resolved! (-2)
Aug 2019 (1 of 14 left): ⚠️ Unchanged (over one year old).
Oct 2019 (3 of 12 left): ⬇️ One task resolved. (-1)
Nov 2019 (0 of 5 left): ✅ Last task resolved! (-1)
Dec 2019 (0 of 9 left): ✅ Last task resolved! (-1)
Jan 2020 (1 of 7 left): ⬇️ One task resolved. (-1)
Feb 2020 (0 of 7 left): ✅ Last task resolved! (-1)
Mar 2020 (2 of 2 left): ⚠️ Unchanged (over one year old).
Apr 2020 (5 of 14 left): ⬇️ Four tasks resolved. (-4)
May 2020 (5 of 14 left): ⬇️ One task resolved. (-1)
Jun 2020 (6 of 14 left): ⬇️ One task resolved. (-1)
Jul 2020 (5 of 24 issues): ⬇️ Four tasks resolved. (-4)
Aug 2020 (15 of 53 issues): ⬇️ Five tasks resolved. (-5)
Sep 2020 (7 of 33 issues): ⬇️ One task resolved. (-1)
Oct 2020 (22 of 69 issues): ⬇️ Four tasks resolved. (-4)
Nov 2020 (9 of 38 issues): ⬇️ Two tasks resolved. (-2)
Dec 2020 (11 of 33 issues): ⬇️ One task resolved. (-1)
Jan 2021 (4 of 50 issues): ⬇️ One task resolved. (-1)
Feb 2021 (9 of 20 issues): ⬇️ Two tasks resolved. (-2)
Mar 2021 (34 of 48 issues): 34 new tasks survived and remain unresolved. (+48; -14)

Tally
138 issues open, as of Excellence #29 (6 Mar 2021).
-33 issues closed, of the previous 138 open issues.
+34 new issues that survived March 2021.
139 issues open, as of today (2 Apr 2021).

Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


Footnotes:

Incident status, Wikitech.
Wikimedia incident stats by Krinkle, CodePen.
Production Excellence: Month-over-month spreadsheet and plot.
Report charts for Wikimedia-production-error project, Phabricator.

Production Excellence #29: February 2021https://phabricator.wikimedia.org/phame/post/view/228/Krinkle (Timo Tijhof)2021-03-06T01:03:09+00:002021-03-06T01:03:09+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

📈 Incidents

3 documented incidents last month, [1] which is average for the time of year. [2]

Learn about these incidents at Incident status on Wikitech, and their Preventive measures in Phabricator.

For those with NDA-restricted access, there may be additional private incident reports 🔒 available.

💡 Did you know: Our Incident reports have switched to using the ISO date format in their titles and listings, for improved readability and editability (esp. when publishing on a later date). So long 20210221, and hello 2021-02-21!

📊 Trends

In February we saw a continuation of the new downward trend that began this January, which came after twelve months of continued rising. Let's make sure this trend sticks with us as we work our way through the debt, whilst also learning to have a healthy week-to-week iteration where we monitor and follow up on any new developments such that they don't introduce lasting regressions.

The recent tally (issues filed since we started reporting in March 2019) is down to 138 unresolved errors, from 152 last month. The old backlog (pre-2019 issues) also continued its 5-month streak and is down to 148, from 160 last month. If this progress continues we'll soon have fewer "Old" issues than "Recent" issues. Possibly, by the start of 2022, we may be able to report and focus only on our rotation through recent issues, as by then we would hopefully be balancing our work such that issues reported this month are addressed mostly in the same month, or otherwise later that quarter within 2-3 months. Visually that would manifest as the colored chunks having a short life on the chart, with each drawn at a sharp downwards angle – instead of dragged out where it builds up an ever-taller shortcake. I do like cake, but I prefer the kind I can eat. 🍰

Month-over-month plots based on spreadsheet data. [3] [4]

Unresolved error reports stacked by recent month
Total open production error tasks, by month

📖 Outstanding errors

Summary over recent months:

  • ⚠️ July 2019 (2 of 18 issues left): no change.
  • ⚠️ August 2019 (1 of 14 issues): no change.
  • ⚠️ October 2019 (4 of 12 issues): no change.
  • ⚠️ November 2019 (1 of 5 issues): no change.
  • ⚠️ December 2019 (1 of 9 issues): One task resolved (-1).
  • ⚠️ January 2020 (2 of 7 issues): no change.
  • ⚠️ February 2020 (1 of 7 issues): no change.
  • ⚠️ March 2020 (2 of 2 issues): no change.
  • April 2020 (9 of 14 issues left): no change.
  • May 2020 (6 of 14 issues left): no change.
  • June 2020 (7 of 14 issues left): no change.
  • July 2020 (9 of 24 new issues): no change.
  • August 2020 (20 of 53 new issues): Two tasks resolved (-2).
  • September 2020 (9 of 33 new issues): Five tasks resolved (-5).
  • October 2020 (26 of 69 new issues): Five tasks resolved (-5).
  • November 2020 (11 of 38 new issues): Three tasks resolved (-3).
  • December 2020 (12 of 33 new issues): Seven tasks resolved (-7).
  • January 2021 (5 of 50 new issues): Two tasks resolved (-2).
  • February 2021: 11 of 20 new issues survived the month and remained unresolved (+20; -9)

Recent tally
152 issues open, as of Excellence #28 (16 Feb 2021).
-25 issues closed since, of the previous 152 open issues.
+11 new issues that survived Feb 2021.
138 issues open, as of today 5 Mar 2021.

For the on-going month of March 2021, we've got 12 new issues so far.

Take a look at the workboard and look for tasks that could use your help!

View Workboard


🎉 Thanks!

Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


Footnotes:

[1] Incident status Wikitech.
[2] Wikimedia incident stats by Krinkle, CodePen.
[3] Month-over-month, Production Excellence spreadsheet.
[4] Open tasks, Wikimedia-prod-error, Phabricator.

Production Excellence #28: January 2021https://phabricator.wikimedia.org/phame/post/view/227/Krinkle (Timo Tijhof)2021-02-19T06:45:02+00:002021-03-05T23:57:55+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

📈 Incidents

1 documented incident last month. That's the third month in a row that we are at or near zero major incidents – not bad! [1] [2]

Learn about recent incidents at Incident status on Wikitech, or Preventive measures in Phabricator.

💡 Did you know: Our Incident status page provides a green-yellow status reflection over the past ten days, with a link to the most recent incident doc if there was any during that time.

📊 Trends

This January saw a small recovery in our otherwise negative upward trend. For the first time in twelve months, more reports were closed than new reports survived the month unresolved. What happened twelve months ago? In January 2020, we also saw a small recovery during the otherwise upward trend before and after it.

Perhaps it's something about the post-December holidays that temporarily improves the quality, or reduces the quantity, of code changes. Only time will tell if this is the start of a new positive trend, or merely a post-holiday break. [3]

Unresolved error reports stacked by recent month

While our month-to-month trend might not (yet) be improving, we do see persistent improvements in our overall backlog of pre-2019 reports. This is in part because we generally don't file new reports there, so it makes sense that it doesn't go back up, but it's still good to see downward progress every month, unlike with reports from more recent months which often see no change month-to-month (see "Outstanding errors" below, for example).

This positive trend on our "Old" backlog started in October 2020 and has consistently progressed every month since then (refer to the "Old" numbers in red on the below chart, or the same column in the spreadsheet). [3][4]

Total open production error tasks, by month


📖 Outstanding errors

Summary over recent months:

  • ⚠️ July 2019 (2 of 18 issues left): no change.
  • ⚠️ August 2019 (1 of 14 issues): no change.
  • ✅ September 2019 (0 of 12 issues): Last two tasks were resolved (-2).
  • ⚠️ October 2019 (4 of 12 issues): One task resolved (-1).
  • ⚠️ November 2019 (1 of 5 issues): no change.
  • ⚠️ December 2019 (2 of 9 issues), Two tasks resolved (-2).
  • ⚠️ January 2020 (2 of 7 issues), no change.
  • ⚠️ February 2020 (1 of 7 issues left), One task resolved (-1).
  • March 2020 (2 of 2 issues left), no change.
  • April 2020 (9 of 14 issues left): no change.
  • May 2020 (6 of 14 issues left): One task resolved (-1).
  • June 2020 (7 of 14 issues left): no change.
  • July 2020 (9 of 24 new issues): no change.
  • August 2020 (22 of 53 new issues): One task resolved (-1).
  • September 2020 (13 of 33 new issues): One task resolved (-1).
  • October 2020 (31 of 69 new issues): Four tasks fixed (-4).
  • November 2020 (14 of 38 new issues): no change.
  • December 2020 (19 of 33 new issues) Three tasks resolved (-3)
  • January 2021: 7 of 50 new issues survived the month and remained unresolved (+50; -43)

Recent tally
160 issues open, as of Excellence #27 (4 Feb 2021).
-15 issues closed since, of the previous 160 open issues.
+7 new issues that survived January 2021.
152 issues open, as of today (16 Feb 2021).

January saw +50 new production errors reported in a single month, which is an unfortunate all-time high. However, we've also done remarkably well on addressing 43 of them within a month, when the potential root cause and diagnostics data were still fresh in our minds. Well done!

For the on-going month of February, there have been 16 new issues reported so far.

Take a look at the workboard and look for tasks that could use your help!

View Workboard


🎉 Thanks!

Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


Footnotes:

[1] Incident status Wikitech.
[2] Wikimedia incident stats by Krinkle, CodePen.
[3] Month-over-month, Production Excellence spreadsheet.
[4] Open tasks, Wikimedia-prod-error, Phabricator.

Production Excellence #27: December 2020https://phabricator.wikimedia.org/phame/post/view/219/Krinkle (Timo Tijhof)2021-02-04T05:46:03+00:002021-02-04T18:35:07+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

📈 Incidents

1 documented incident in December. [1] In previous years, December typically had 4 or fewer documented incidents. [3]

Learn about recent incidents at Incident documentation on Wikitech, or Preventive measures in Phabricator.


📊 Trends

Unresolved error reports stacked by recent month
Total open production error tasks, by month

Month-over-month plots based on spreadsheet data. [4] [2]


📖 Outstanding errors

Take a look at the workboard and look for tasks that could use your help.
https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Summary over recent months:

  • ⚠️ July 2019 (2 of 18 issues left): no change.
  • ⚠️ August 2019 (1 of 14 issues): no change.
  • ⚠️ September 2019 (2 of 12 issues): One task resolved (-1).
  • ⚠️ October 2019 (5 of 12 issues): no change.
  • ⚠️ November 2019 (1 of 5 issues): no change.
  • ⚠️ December 2019 (4 of 9 issues), no change.
  • ⚠️ January 2020 (2 of 7 issues), no change.
  • February 2020 (2 of 7 issues left), no change.
  • March 2020 (2 of 2 issues left), no change.
  • April 2020 (9 of 14 issues left): no change.
  • May 2020 (7 of 14 issues left): no change.
  • June 2020 (7 of 14 issues left): no change.
  • July 2020 (9 of 24 new issues): no change.
  • August 2020 (23 of 53 new issues): no change.
  • September 2020 (13 of 33 new issues): One task resolved (-1).
  • October 2020 (35 of 69 new issues): Four issues fixed (-4).
  • November 2020 (14 of 38 new issues): Five issues fixed (-5).
  • December 2020: 22 of 33 new issues survived the month and remained unresolved (+33; -11)

Recent tally
149 as of Excellence #26 (15 Dec 2020).
-11 closed of the 149 recent issues.
+22 new issues survived December 2020.
160 as of 27 Jan 2021.

🎉 Thanks!

Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


Footnotes:

[1] Incident documentation 2020, Wikitech.
[2] Open tasks, Wikimedia-prod-error, Phabricator.
[3] Wikimedia incident stats by Krinkle, CodePen.
[4] Month-over-month, Production Excellence spreadsheet.

Production Excellence #26: November 2020https://phabricator.wikimedia.org/phame/post/view/218/Krinkle (Timo Tijhof)2020-12-15T21:49:56+00:002021-02-04T18:34:14+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

📈 Incidents

Zero documented incidents in November. [1] That's the only month this year without any (publicly documented) incidents. In 2019, November was also the only such month. [3]

Learn about recent incidents at Incident documentation on Wikitech, or Preventive measures in Phabricator.


📊 Trends

Unresolved error reports stacked by recent month
Total open production error tasks, by month

The overall increase in errors was relatively low this past month, similar to the November-December period last year.

What's new is that we can start to see a positive trend emerging in the backlog, where we've shrunk the issue count three months in a row, from the high of 233 in October down to the 181 we have in the ol' backlog today.

Month-over-month plots based on spreadsheet data. [4]


📖 Outstanding errors

Take a look at the workboard and look for tasks that could use your help.
https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Summary over recent months:

  • ⚠️ July 2019 (2 of 18 tasks): One task closed (-1).
  • ⚠️ August 2019 (1 of 14 tasks): no change.
  • ⚠️ September 2019 (3 of 12 tasks): no change.
  • ⚠️ October 2019 (5 of 12 tasks): no change.
  • ⚠️ November 2019 (1 of 5 tasks): no change.
  • ⚠️ December 2019 (3 of 9 tasks left), no change.
  • January 2020 (3 of 7 tasks left), One task closed (-1).
  • February (2 of 7 tasks left), no change.
  • March (2 of 2 tasks left), no change.
  • April (9 of 14 tasks left): no change.
  • May (7 of 14 tasks left): no change.
  • June (7 of 14 tasks left): no change.
  • July 2020 (9 of 24 new tasks): no change.
  • August 2020 (23 of 53 new tasks): Three tasks closed (-3).
  • September 2020 (14 of 33 new tasks): One task closed (-1).
  • October 2020 (39 of 69 new tasks): Six tasks closed (-6).
  • November 2020: 19 of 38 new tasks survived the month and remain open today (+38; -19)

Recent tally
142 as of Excellence #25 (23 Oct 2020).
-12 closed of the 142 recent tasks.
+19 survived November 2020.
149 as of today, 15 Dec 2020.

The ongoing month of December has 19 unresolved tasks so far.


🎉 Thanks!

Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


❝   The plot "thickens" as they say. Why, by the way? Is it a soup metaphor? ❞

Footnotes:

[1] Incident documentation 2020, Wikitech.
[2] Open tasks, Wikimedia-prod-error, Phabricator.
[3] Wikimedia incident stats, Krinkle, CodePen.
[4] Month-over-month, Production Excellence (spreadsheet).

Runnable runbookshttps://phabricator.wikimedia.org/phame/post/view/217/mmodell (Mukunda Modell)2020-12-11T23:51:17+00:002023-09-03T05:36:30+00:00

Recently there has been a small effort on the Release-Engineering-Team to encode some of our institutional knowledge as runbooks linked from a page in the team's wiki space.

What are runbooks, you might ask? This is how they are described on the aforementioned wiki page:

This is a list of runbooks for the Wikimedia Release Engineering Team, covering step-by-step lists of what to do when things need doing, especially when things go wrong.

So runbooks are each essentially a sequence of commands intended to be pasted into a shell by a human: step-by-step instructions meant to help the reader accomplish an anticipated task or resolve a previously-encountered issue.

Presumably runbooks are created when someone encounters an issue, and, recognizing that it might happen again, helpfully documents the steps that were used to resolve said issue.

This all seems pretty sensible at first glance. This type of documentation can be really valuable when you're in an unexpected situation or trying to accomplish a task that you've never attempted before, and just about anyone reading this probably has some experience running shell commands pasted from online tutorials, setup instructions for a program, etc.

Despite the obvious value runbooks can provide, I've come to harbor a fairly strong aversion to the idea of encoding what are essentially shell scripts as individual commands on a wiki page. As someone whose job involves a lot of automation, I would usually much prefer a shell script, a python program, or even a "maintenance script" over a runbook.

After a lot of contemplation, I've identified a few reasons that I don't like runbooks on wiki pages:

  • Runbooks are tedious and prone to human errors.
    • It's easy to lose track of where you are in the process.
    • It's easy to accidentally skip a step.
    • It's easy to make typos.
  • A script can be code reviewed and version controlled in git.
  • A script can validate its arguments, which helps to catch typos.
  • I think that command line terminal input is more like code than it is prose. I am more comfortable editing code in my usual text editor as opposed to editing in a web browser. The wikitext editor is sufficient for basic text editing, and the visual editor is quite nice for rich text editing, but neither is ideal for editing code.

I do realize that MediaWiki provides version control for wiki pages. I also realize that sometimes you just can't be bothered to write and debug a robust shell script to address some rare circumstance. The cost is high and it's uncertain whether the script will be worth such an effort. In those situations a runbook might be the perfect way to contribute to collective knowledge without investing a lot of time into perfecting a script.

My favorite web comic, xkcd, has a few things to say about this subject:

the_general_problem.png (230×550 px, 23 KB)
"The General Problem" xkcd #974.

automation_2x.png (817×807 px, 55 KB)
"Automation" xkcd #1319.

is_it_worth_the_time_2x.png (927×1 px, 125 KB)
"Is It Worth the Time?" xkcd #1205.

Potential Solutions

I've been pondering a solution to these issues for a long time, mostly motivated by the pain I have experienced (and the mistakes I've made) while executing the biggest runbook of all on a regular basis.

Over the past couple of years I've come across some promising ideas which I think can help with the problems I've identified with runbooks. One of the most interesting is Do-nothing scripting. Dan Slimmon identifies some of the same problems that I've detailed here. He uses the term *slog* to refer to long and tedious procedures like the Wikimedia Train Deploys. The proposed solution comes in the form of a do-nothing script. You should go read that article; it's not very long. Here are a few relevant quotes:

Almost any slog can be turned into a do-nothing script. A do-nothing script is a script that encodes the instructions of a slog, encapsulating each step in a function.

...

At first glance, it might not be obvious that this script provides value. Maybe it looks like all we’ve done is make the instructions harder to read. But the value of a do-nothing script is immense:

  • It’s now much less likely that you’ll lose your place and skip a step. This makes it easier to maintain focus and power through the slog.
  • Each step of the procedure is now encapsulated in a function, which makes it possible to replace the text in any given step with code that performs the action automatically.
  • Over time, you’ll develop a library of useful steps, which will make future automation tasks more efficient.

A do-nothing script doesn’t save your team any manual effort. It lowers the activation energy for automating tasks, which allows the team to eliminate toil over time.
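
To make that concrete, here's a minimal do-nothing script sketched in PHP (the steps and commands below are made up purely for illustration, not taken from any of our actual runbooks):

<?php
// Do-nothing script sketch: each step only prints instructions and waits
// for the operator. The steps below are hypothetical placeholders.

function waitForOperator() {
    echo "Press Enter when this step is done...\n";
    fgets( STDIN );
}

function stepCheckFreeSpace() {
    echo "1. On the target host, run `df -h /srv` and confirm there is enough free space.\n";
    waitForOperator();
}

function stepRestartService() {
    echo "2. On the target host, run `sudo systemctl restart example-service`.\n";
    waitForOperator();
}

stepCheckFreeSpace();
stepRestartService();
echo "All steps complete.\n";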

I was inspired by this and I think it's a fairly clever solution to the problems identified. What if we combined the best aspects of gradual automation with the best aspects of a wiki-based runbook? Others were inspired by this as well, resulting in tools like braintree/runbook, codedown and the one I'm most interested in, rundoc.

Runnable Runbooks

My ideal tool would combine code and instructions in a free-form "literate programming" style. By following some simple conventions in our runbooks we can use a tool to parse and execute the embedded code blocks in a controlled manner. With a little bit of tooling we can gain many benefits:

  • The tooling will keep track of the steps to execute, ensuring that no steps are missed.
  • Ensure that errors aren't missed by carefully checking / logging the result of each step.
  • We could also provide a mechanism for inputting the values of any variables / arguments and validate the format of user input.
  • With flexible control flow management we can even allow resuming from anywhere in the middle of a runbook after an aborted run.
  • Manual steps can just consist of a block of prose that gets displayed to the operator. With embedded markup we can format the instructions nicely and render them in the terminal using Rich. [7] Once the operator confirms that the step is complete, the workflow moves on to the next step.
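
As a very rough sketch of what such tooling might look like (the step names, commands, and progress file path are all hypothetical), consider:

<?php
// Rough sketch of a step runner: runs each step in order, aborts on the
// first failure, and records progress so an aborted run can be resumed.
// Step names, commands, and the progress file are made up for illustration.
$steps = [
    'check-disk'  => 'df -h /',
    'show-kernel' => 'uname -a',
];
$stateFile = '/tmp/runbook-progress.json';

$done = is_file( $stateFile )
    ? ( json_decode( file_get_contents( $stateFile ), true ) ?: [] )
    : [];

foreach ( $steps as $name => $command ) {
    if ( !empty( $done[$name] ) ) {
        echo "Skipping already-completed step: $name\n";
        continue;
    }
    echo "Running step '$name': $command\n";
    passthru( $command, $exitCode );
    if ( $exitCode !== 0 ) {
        echo "Step '$name' failed with exit code $exitCode; fix and re-run to resume.\n";
        exit( 1 );
    }
    $done[$name] = true;
    file_put_contents( $stateFile, json_encode( $done ) );
}
echo "All steps completed; removing progress file.\n";
unlink( $stateFile );

Presumably a tool like rundoc layers the prose rendering, input handling, and flexible control flow described above on top of a basic loop like this.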

Prior Art

I've found a few projects that already implement many of these ideas, the most relevant being braintree/runbook, codedown, and rundoc.

The one I'm most interested in is Rundoc. It's almost exactly the tool that I would have created. In fact, I started writing code before discovering rundoc, but once I realized how closely it matched my ideal solution, I decided to abandon my effort. Instead, I will add a couple of missing features to Rundoc to get everything that I want, and hopefully I can contribute my enhancements back upstream for the benefit of others.

Demo: https://asciinema.org/a/MKyiFbsGzzizqsGgpI4Jkvxmx
Source: https://github.com/20after4/rundoc

References

[1]: https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Runbooks "runbooks"
[2]: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys "Train deploys"
[3]: https://blog.danslimmon.com/2019/07/15/do-nothing-scripting-the-key-to-gradual-automation/ "Do-nothing scripting: the key to gradual automation by Dan Slimmon"
[4]: https://github.com/braintree/runbook "runbook by braintree"
[5]: https://github.com/earldouglas/codedown "codedown by earldouglas"
[6]: https://github.com/eclecticiq/rundoc "rundoc by eclecticiq"
[7]: https://rich.readthedocs.io/en/latest/ "Rich python library"

Production Excellence #25: October 2020https://phabricator.wikimedia.org/phame/post/view/213/Krinkle (Timo Tijhof)2020-11-24T05:13:15+00:002020-11-24T05:50:14+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

📈 Incidents

2 documented incidents in October. [1] Historically, that's just below the median of 3 for this time of year. [3]

Learn about recent incidents at Incident documentation on Wikitech, or Preventive measures in Phabricator.


📊 Trends

Unresolved error reports stacked by recent month
Total open production error tasks, by month

Month-over-month plots based on spreadsheet data. [4]


📖 Outstanding errors

Take a look at the workboard and look for tasks that could use your help.
https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Summary over recent months:

  • ⚠️ July 2019 (3 of 18 tasks): One task closed.
  • ⚠️ August 2019 (1 of 14 tasks): no change.
  • ⚠️ September 2019 (3 of 12 tasks): no change.
  • ⚠️ October 2019 (5 of 12 tasks): One task closed.
  • ⚠️ November 2019 (1 of 5 tasks): Two tasks closed.
  • December (3 of 9 tasks left), no change.
  • January 2020 (4 of 7 tasks left), no change.
  • February (2 of 7 tasks left), no change.
  • March (2 of 2 tasks left), no change.
  • April (9 of 14 tasks left): One task closed.
  • May (7 of 14 tasks left): no change.
  • June (7 of 14 tasks left): no change.
  • July 2020 (9 of 24 new tasks): One task closed.
  • August 2020 (26 of 53 new tasks): Five tasks closed.
  • September 2020 (15 of 33 new tasks): Two tasks closed.
  • October 2020: 45 of 69 new tasks survived the month of October and remain open today.
Recent tally
110 as of Excellence #24 (23rd Oct).
-13 closed of the 110 recent tasks.
+45 survived October 2020.
142 as of today, 23rd Nov.

For the ongoing month of November, there are 25 new tasks so far.


🎉 Thanks!

Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


 👤  Howard Salomon:

❝   Problem is when they arrest you, you get put on the justice train, and the train has no brain. ❞  

Footnotes:

[1] Incident documentation 2020, Wikitech
[2] Open tasks in Wikimedia-prod-error, Phabricator
[3] Wikimedia incident stats by Krinkle, CodePen
[4] Month-over-month, Production Excellence (spreadsheet)

CI now updates your deployment-chartshttps://phabricator.wikimedia.org/phame/post/view/208/jeena (Jeena Huneidi)2020-09-24T17:34:47+00:002020-11-17T23:46:14+00:00

If you're making changes to a service that is deployed to Kubernetes, it sure is annoying to have to update the helm deployment-chart values with the newest image version before you deploy. At least, that's how I felt when developing on our dockerfile-generating service, blubber.

Over the last two months we've added

And I'm excited to say that CI can now handle updating image versions for you (after your change has merged), in the form of a change to deployment-charts that you'll need to +2 in Gerrit. Here's what you need to do to get this working in your repo:

Add the following to your .pipeline/config.yaml file's publish stage:

promote: true

The above assumes the defaults, which are the same as if you had added:

promote:
  - chart: "${setup.projectShortName}"  # The project name
    environments: []                    # All environments
    version: '${.imageTag}'             # The image published in this stage

You can specify any of these values, and you can promote to multiple charts, for example:

promote:
  - chart: "echostore"
    environments: ["staging", "codfw"]
  - chart: "sessionstore"

The above values would promote the production image published after merging to all environments for the sessionstore service, and only the staging and codfw environments for the echostore service. You can see more examples at https://wikitech.wikimedia.org/wiki/PipelineLib/Reference#Promote

If your containerized service doesn't yet have a .pipeline/config.yaml, now is a great time to migrate it! This tutorial can help you with the basics: https://wikitech.wikimedia.org/wiki/Deployment_pipeline/Migration/Tutorial#Publishing_Docker_Images

This is just one step closer to achieving continuous delivery of our containerized services! I'm looking forward to continuing to make improvements in that area.

Production Excellence #24: September 2020https://phabricator.wikimedia.org/phame/post/view/205/Krinkle (Timo Tijhof)2020-10-23T23:51:43+00:002020-10-23T23:59:26+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

📈 Incidents

5 documented incidents. [1] Historically, that's right on average for the time of year. [3]

For more about recent incidents see Incident documentation on Wikitech, or Preventive measures in Phabricator.


📊 Trends

month-stack-2020-10-23.png (1×1 px, 113 KB)
total-stack-2020-10-23.png (941×1 px, 97 KB)

Month-over-month plots based on spreadsheet data. [4]


📖 Outstanding errors

Take a look at the workboard and look for tasks that could use your help.
https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Summary over recent months:

  • ⚠️ July 2019 (4 of 18 tasks left): no change.
  • ⚠️ August 2019 (1 of 14 tasks left): no change.
  • ⚠️ September 2019 (3 of 12 tasks left): no change.
  • ⚠️ October 2019 (6 of 12 tasks left), no change.
  • November (3 of 5 tasks left): no change.
  • December (3 of 9 tasks left), no change.
  • January 2020 (4 of 7 tasks left), One task closed.
  • February (2 of 7 tasks left), no change.
  • March (2 of 2 tasks left), no change.
  • April (10 of 14 tasks left): no change.
  • May (7 of 14 tasks left): no change.
  • June (7 of 14 tasks left): Three tasks closed.
  • July 2020 (10 of 24 new tasks): Three tasks closed.
  • August 2020 (31 of 53 new tasks): Six tasks closed.
  • September 2020: 17 of 33 new tasks survived the month of September and remain open today.
Recent tally
106 as of Excellence #23 (Sep 23rd).
-13 closed of the 106 recent tasks.
+17 survived September 2020.
110 as of today, Oct 23rd.

Previously, we had 106 unresolved production errors from the recent months up to August. Since then, 13 of those were closed. But, the 17 errors surviving September raise our recent tally to 110.

The workboard overall (including errors from 2019 and earlier) holds 343 open tasks in total, an increase of +47 compared to the 296 total on Sept 23rd.


🎉 Thanks!

Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


🕵️‍♀️ Holmes: “So, she pulled five bullets out of you?”

     Shinwell: ”That's right.”
     Holmes: “I too have been shot five times. But, uh..., separate occasions.”
     Shinwell: “That's... great.”

Footnotes:

[1] Incidents. – wikitech.wikimedia.org/wiki/Incident_documentation
[2] Open tasks. – phabricator.wikimedia.org/maniphest/query…
[3] Wikimedia incident stats. – codepen.io/Krinkle/full/wbYMZK
[4] Month-over-month plots. – docs.google.com/spreadsheets/d/1tRC…

Production Excellence #23: July & August 2020https://phabricator.wikimedia.org/phame/post/view/204/Krinkle (Timo Tijhof)2020-09-23T18:10:33+00:002020-09-23T18:10:33+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

📈   Incidents

4 documented incidents in July, and 2 documented incidents in August. [1] Historically, that's on average for this time of year. [5]

For more about recent incidents see Incident documentation on Wikitech, or Preventive measures in Phabricator.


📊   Trends

Take a look at the workboard and look for tasks that could use your help.
https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Summary over recent months:

  • ⚠️ July 2019 (4 of 18 tasks left): One task closed.
  • ⚠️ August 2019 (1 of 14 tasks left): no change.
  • ⚠️ September 2019 (3 of 12 tasks left): Two tasks closed.
  • October (6 of 12 tasks left), no change.
  • November (3 of 5 tasks left): no change.
  • December (3 of 9 tasks left), Two tasks closed.
  • January 2020 (5 of 7 tasks left), no change.
  • February (2 of 7 tasks left), Two tasks closed.
  • March (2 of 2 tasks left), no change.
  • April (10 of 14 tasks left): One task closed.
  • May (7 of 14 tasks left): Four tasks closed.
  • June (10 of 14 tasks left): Four tasks closed.
  • July 2020: 13 of 24 new tasks survived the month of July and remain open today.
  • August 2020: 37 of 53 new tasks survived the month of August and remain open today.
Recent tally
72 open, as of Excellence #22 (Jul 23rd).
-16 closed, of the previous 72 recent tasks.
+13 opened and survived July 2020.
+37 opened and survived August 2020.
106 open, as of today (Sep 23rd).

Previously, we had 72 open production errors over the recent months up to June. Since then, 16 of those were closed. But, the 13 and 37 errors surviving July and August raise our recent tally to 106.

The workboard overall (including tasks from 2019 and earlier) held 192 open production errors on July 23rd. As of writing, the workboard holds 296 open tasks in total. [4] This +104 increase is largely due to the merged backlog of JavaScript client errors, which were previously untracked. Note that we backdated the majority of these JS errors under “Old”, so they are not amongst the elevated numbers of July and August.


🎉   Thanks!

Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


👊🍺 Tyler: “You know man, it could be worse! […]” Narrator: “[but] I was close... to being complete.”

Tyler: “Martha's polishing the brass on the Titanic. It's all going down, man. […] Evolve! Let the chips fall where they may.”
Narrator: “What!?” Tyler: “The things you own..., they end up owning you.”

Footnotes:
[1] Incidents. – https://wikitech.wikimedia.org/wiki/Incident_documentation
[2] Tasks created. – https://phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – https://phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – https://phabricator.wikimedia.org/maniphest/query…
[5] Wikimedia incident stats. – https://codepen.io/Krinkle/full/wbYMZK

Production Excellence #22: June 2020https://phabricator.wikimedia.org/phame/post/view/203/Krinkle (Timo Tijhof)2020-07-23T03:25:50+00:002020-07-28T16:54:55+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

📈 Month in review
  • 4 documented incidents in June. [1]
  • 37 new production errors were filed and 27 were closed. [2] [3]
  • 72 recent production errors still open (up from 68).
  • 203 total Wikimedia-prod-error tasks currently open (up from 192). [4]

For more about recent incidents see Incident documentation, on Wikitech or Preventive measures in Phabricator.


📖 Outstanding errors

Breakdown of new errors reported in June that are still open today:

  1. (Needs owner) / Newsletter extension: Unexpected locking SELECT query. T253926
  2. (Needs owner) / FlaggedRevs extension: Unable to submit review of page due to bad fr_page_id record. T256296
  3. Editing team / MassMessage extension: Delivery fails due to system user conflict. T171003
  4. Parsing team / Parsoid: Pagebundle data unavailable due to a bad UTF-8 string. T236866
  5. Growth team / Recent changes: Update for ActiveUsers data failing due to deadlock. T255059
  6. Growth team / GrowthExperiments: Issue with question display on personal homepage. T255616
  7. Language team / Translate extension: Update jobs fail due to invalid function call. T255669
  8. Language team / ContentTranslation: Save action fails due to duplicate insert query. T256230
  9. Core Platform team / Content handling: Incompatible content type during content merge/stash. T255700
  10. Core Platform team / Monolog: API usage logs and error logs sometimes missing due to socket failure. T255578
  11. Search Platform team / WikibaseCirrus: Elevated error levels from EntitySearchElastic warnings. T255658
  12. Wikidata / API: Generator query fails due to invalid API result format. T254334
  13. Wikidata / API: EntityData query emits warning about bad RDF. T255054
  14. Wikidata / Repo: Entity relation update jobs fail due to deadlock. T255706

📊 Trends
Take a look at the workboard and look for tasks that could use your help.

Summary over recent months:

  • July 2019 (5 of 18 tasks left): Two tasks closed.
  • August (1 of 14 tasks left): Another task closed, only one remaining! 🚀
  • September (5 of 12 tasks left): Two tasks closed.
  • October (6 of 12 tasks left), no change.
  • November (3 of 5 tasks left): Another task closed.
  • December (5 of 9 tasks left), no change.
  • January 2020 (5 of 7 tasks left), no change.
  • February (4 of 7 tasks left), no change.
  • March (2 of 2 tasks left), no change.
  • April (11 of 14 tasks left): Three tasks closed.
  • May (11 of 14 tasks left): Three tasks closed.
  • June: 14 new tasks survived the month of June. ⚠️

At the end of May the number of open production errors over recent months was 68. Of those, 10 got closed, but with 14 new tasks from June still open, the total has grown further to 72.

The workboard had 192 open tasks last month, which saw another increase, to now 203 open tasks (this includes tasks from 2019 and earlier).


🎉 Thanks!

Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


ATC: “Do you want to report a UFO?” Pilot: “Negative. We don't want to report.”
   ATC: “Do you wish to file a report of any kind to us?” Pilot: “I wouldn't know what kind of report to file.”
  ATC: “Me neither…”

Footnotes:
[1] Incidents. – https://wikitech.wikimedia.org/wiki/Incident_documentation#2020
[2] Tasks created. – https://phabricator.wikimedia.org/maniphest/query/VTpmvaJLYVL1/#R
[3] Tasks closed. – https://phabricator.wikimedia.org/maniphest/query/qn5yeURqyl3D/#R
[4] Open tasks. – https://phabricator.wikimedia.org/maniphest/query/Fw3RdXt1Sdxp/#R

Faster source code fetches thanks to git protocol version 2https://phabricator.wikimedia.org/phame/post/view/199/hashar (Antoine Musso)2020-07-06T10:57:02+00:002020-10-29T10:21:40+00:00

In 2015 I noticed that git fetches from our most active repositories were unreasonably slow, sometimes taking up to a minute, which hindered fast development and collaboration. You can read some of the debugging I conducted at the time in T103990. Gerrit upstream was aware of the issue and a workaround was presented, though we never implemented it.

When fetching source code from a git repository, the client and server conduct a negotiation to discover which objects have to be sent. The server sends an advertisement that lists every single reference it knows about. For a very active repository in Gerrit, this means sending references for each patchset and each change ever made to the repository, or almost 200,000 references for mediawiki/core. That is a noticeable amount of data, resulting in a slow fetch, especially on a slow internet connection.

Gerrit originated at Google and has full-time maintainers. In 2017 a team at Google set out to tackle the problem and proposed a new protocol to address the issue, working closely with the git maintainers while doing so. The new protocol makes git smarter during the advertisement phase, notably by filtering out references the client is not interested in. You can read Google's introduction post at https://opensource.googleblog.com/2018/05/introducing-git-protocol-version-2.html

Since June 28th 2020, our Gerrit has been upgraded and now supports git protocol version 2. But to benefit from faster fetches, your client also needs to know about the newer protocol and have it explicitly enabled. For git, you will want version 2.18 or later. Enable the new protocol by setting git configuration protocol.version to 2.

It can be done either on demand:

git -c protocol.version=2 fetch

Or enabled in your user configuration file:

$HOME/.gitconfig
[protocol]
    version = 2

On my internet connection, fetching mediawiki/core.git went from ~15 seconds to just 3 seconds. A noticeable difference in my day-to-day activity.

If you encounter any issue with the new protocol, you can file a task in our Phabricator and tag it with git-protocol-v2.

Production Excellence #21: May 2020https://phabricator.wikimedia.org/phame/post/view/198/Krinkle (Timo Tijhof)2020-06-24T20:00:31+00:002020-06-24T20:00:31+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

📊   Month in numbers
  • 5 documented incidents in May. [1]
  • 28 new production error tasks filed in May. [2] [3]
  • 68 recent production errors currently open (up from 61).
  • 193 currently open Wikimedia-prod-error tasks (up from 178). [4]

For more about recent incidents see Incident documentation on Wikitech, or Preventive measures in Phabricator.


📉   Outstanding reports

Take a look at the workboard and look for tasks that could use your help.
→  https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Breakdown of recent months:

  • July 2019: One task closed, 7 of 18 tasks left. ⚠️
  • August: 2 of 14 tasks left (unchanged).
  • September: 7 of 12 tasks left (unchanged).
  • October: 4 of 12 tasks left (unchanged).
  • November: 4 of 5 tasks left (unchanged).
  • December: 4 of 9 tasks left (unchanged).
  • January 2020: 5 of 7 tasks left (unchanged).
  • February: Two tasks closed, 4 of 7 tasks left. ⚠️
  • March: 2 of 2 tasks left (unchanged).
  • April: 14 of 14 tasks left (unchanged).
  • May: 14 new tasks survived the month of May.

At the end of April the total of open production errors over recent months was 61. Of those, 7 got closed, but with 14 new tasks from May still open, the total has grown to 68.

The workboard had 178 open tasks in April, which saw a steep increase to now 192 open tasks (this includes June 2020 so far, and pre-2019 tasks).


🎉   Thanks!

Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


Footnotes:
[1] Incidents. – https://wikitech.wikimedia.org/wiki/Incident_documentation#2020
[2] Tasks created. – https://phabricator.wikimedia.org/maniphest/query/7Z4Us2BS02Uo/#R
[3] Tasks closed. – https://phabricator.wikimedia.org/maniphest/query/FoIFMu5UO8pw/#R
[4] Open tasks. – https://phabricator.wikimedia.org/maniphest/query/Fw3RdXt1Sdxp/#R

Celebrating 600,000 commits for Wikimediahttps://phabricator.wikimedia.org/phame/post/view/197/Jdforrester-WMF (James D. Forrester)2020-05-29T22:47:22+00:002020-07-10T11:27:00+00:00

Earlier today, the 600,000th commit was pushed to Wikimedia's Gerrit server. We thought we'd take this moment to reflect on the developer services we offer and our community of developers, be they Wikimedia staff, third party workers, or volunteers.

At Wikimedia, we currently use a self-hosted installation of Gerrit to provide code review workflow management, and code hosting and browsing. We adopted this in 2011–12, replacing Apache Subversion.

Within Gerrit, we host several thousand repositories of code (2,441 as of today). This includes MediaWiki itself, plus all the many hundreds of extensions and skins people have created for use with MediaWiki. Approximately 90% of the MediaWiki extensions we host are not used by Wikimedia, only by third parties. We also host key Wikimedia server configuration repositories like puppet or site config, build artefacts like vetted docker images for production services or local .deb build repos for software we use like etherpad-lite, ancillary software like our special database exporting orchestration tool for dumps.wikimedia.org, and dozens of other uses.

Gerrit is not just (or even primarily) a code hosting service, but a code review workflow tool. Per the Wikimedia code review policy, all MediaWiki code heading to production should go through separate development and code review for security, performance, quality, and community reasons. Reviewers are required to use their "good judgement and careful action", which is a heavy burden, because "[m]erging a change to the MediaWiki core or an extension deployed by Wikimedia is a big deal". Gerrit helps them do this, providing clear views of what is changing, supporting itemised, character-level, file-level, or commit-level feedback and revision, and allowing series of complex changes to be chained together across multiple repositories, and ensuring that forthcoming and merged changes are visible to product owners, development teams, and other interested parties.

Across all of our repositories, we average over 200 human commits a day, though activity levels vary widely. Some repositories have dozens of patches a week (MediaWiki itself gets almost 20 patches a day; puppet gets nearly 30), whereas others get a patch every few years. There are over 8,000 accounts registered with Gerrit, although activity is not distributed uniformly throughout that cohort.

To focus engineer time where it's needed, a fair amount of low-risk development work is automated. This happens in both creating patches and also, in some cases, merging them.

For example, for many years we have partnered with TranslateWiki.net's volunteer community to translate and maintain MediaWiki interfaces in hundreds of languages. Exports of translators' updates are pushed and merged automatically by one of the TWN team each day, helping our users keep a fresh, usable system whatever their preferred language.

Another key area is LibraryUpgrader, a custom tool to automatically upgrade the libraries we use for continuous integration across hundreds of repositories, allowing us to make improvements and increase standards without a single central breaking change. Indeed, the 600,000th commit was one of these automatic commits, upgrading the version of the mediawiki-codesniffer tool in the GroupsSidebar extension to the latest version, ensuring it is written following the latest Wikimedia coding conventions for PHP.

Right now, we're working on upgrading our installation of Gerrit, moving from our old version based on the 2.x branch through 2.16 to 3.1, which will mean a new user interface and other user-facing changes, as well as improvements behind the scenes. More on those changes will be coming in later posts.


Header image: A vehicle used to transport miners to and from the mine face by 'undergrounddarkride', used under CC-BY-2.0.

Production Excellence #20: April 2020https://phabricator.wikimedia.org/phame/post/view/193/Krinkle (Timo Tijhof)2020-05-14T16:10:41+00:002020-05-25T16:23:11+00:00

How are we doing on that strive for operational excellence during these unprecedented times?

📊  Numbers for March and April
  • 3 documented incidents. [1]
  • 60 new Wikimedia-prod-error reports. [2]
  • 58 Wikimedia-prod-error reports closed. [3]
  • 178 currently open Wikimedia-prod-error reports in total. [4]

For more about recent incidents and pending actionables see Wikitech and Phabricator.


📉  Outstanding reports

Take a look at the workboard and look for tasks that could use your help.

→  https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Breakdown of recent months:

  • April 2019: Two reports closed, 2 of 14 left.
  • May: (All clear!)
  • June: 4 of 11 left (unchanged). ⚠️
  • July: 8 of 18 left (unchanged).
  • August: 2 of 14 reports left (unchanged).
  • September: 7 of 12 left (unchanged).
  • October: Two reports closed, 4 of 12 left.
  • November: One report closed, 4 of 5 left.
  • December: Two reports closed, 4 of 9 left.
  • January 2020: One report closed, 5 of 7 reports left.
  • February: One report closed, 6 of 7 reports left.
  • March: 2 new reports survived the month of March.
  • April: 13 new reports survived the month of April.

At the end of February the total of open reports over recent months was 58. Of those, 12 got closed, but with 15 new reports from March/April still open, the total is now up at 61 open reports.

The workboard overall (which includes pre-2019 tasks) has 178 tasks open. This is actually down a bit for the first time since October: December was at 196, January at 198, February at 199, and now April is at 178. This was largely due to the Release Engineering and Core Platform teams closing out forgotten reports that have since been resolved or otherwise obsoleted.

💡 Tip: Verifying existing tasks is a good way to (re)familiarise yourself with Kibana. For example: Does the error still occur in the last 30 days? Does it only happen on a certain wiki? What do the URLs or stack traces have in common?

🎉  Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


Footnotes:
[1] Incidents. – https://wikitech.wikimedia.org/wiki/Incident_documentation
[2] Tasks created. – https://phabricator.wikimedia.org/maniphest/query/HjopcKClxTfw/#R
[3] Tasks closed. – https://phabricator.wikimedia.org/maniphest/query/ts62HKYPBxod/#R
[4] Open tasks. – https://phabricator.wikimedia.org/maniphest/query/Fw3RdXt1Sdxp/#R

Production Excellence #19: February 2020https://phabricator.wikimedia.org/phame/post/view/192/Krinkle (Timo Tijhof)2020-03-24T21:40:10+00:002020-03-25T13:46:41+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

📊  Month in numbers
  • 8 documented incidents. [1]
  • 27 new Wikimedia-prod-error reports. [2]
  • 26 Wikimedia-prod-error reports closed. [3]
  • 199 currently open Wikimedia-prod-error reports in total. [4]

With a median of 4–5 documented incidents per month (over the last three years), there were a fairly large number of them this past month.

To read more about these incidents and pending actionables; check Incident documentation § 2020, or Explore Wikimedia incident stats (interactive).


📖  Unset vs array splice

Our error monitor (Logstash) received numerous reports about an “Undefined offset” error from the OATHAuth extension. This extension powers the Two-factor auth (2FA) login interface on Wikipedia.

@ItSpiderman and @Reedy investigated the problem. The error message:

PHP Notice: Undefined offset: 8
at /srv/mediawiki/extensions/OATHAuth/src/Key/TOTPKey.php:188

This error means that the code was accessing item number 8 from a list (an array), but the item does not exist. Normally, when a “2FA scratch token” is used, we remove it from a list, and save the remaining list for next time.

The code used the count() function to compute the length of the list, and used a for-loop to iterate through the list. When the code found the user’s token, it used the unset( $list[$num] ) operation to remove token $num from the list, and then save $list for next time.

The problem with removing a list item in this way is that it leaves a “gap”. Imagine a list with 4 items, like [ 1: …, 2: …, 3: … , 4: … ]. If we unset item 2, then the remaining list will be [ 1: …, 3: …, 4: … ]. The next time we check this list, the length of the list is now 3 (so far so good!), but the for-loop will access the items as 1-2-3. The code would not know that 3 comes after 1, causing an error because item 2 does not exist. And, the code would not even look at item 4!

When a user used their first ever scratch token, everything worked fine. But from their second token onwards, the tokens could be rejected as “wrong” because the code was not able to find them.

To avoid this bug, we changed the code to use array_splice( $list, $num, 1 ) instead of unset( $list[$num] ). The important thing about array_splice is that it renumbers the items in the list, leaving no gaps.

T244308 / https://gerrit.wikimedia.org/r/570253
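
To illustrate the difference with a throwaway array (not the real scratch-token data):

<?php
// Illustration of the "gap" left by unset() versus the renumbering done by
// array_splice(). The array below is a made-up example, not real 2FA data.
$tokens = [ 'aaa', 'bbb', 'ccc', 'ddd' ];

$byUnset = $tokens;
unset( $byUnset[1] );                  // remaining keys: 0, 2, 3 (a gap at 1)
print_r( array_keys( $byUnset ) );

$bySplice = $tokens;
array_splice( $bySplice, 1, 1 );       // removes one item and renumbers
print_r( array_keys( $bySplice ) );    // remaining keys: 0, 1, 2

// A count()-based for-loop over the unset() version hits the gap:
for ( $i = 0; $i < count( $byUnset ); $i++ ) {
    // $i = 1 emits "Undefined offset: 1", and key 3 is never visited.
    echo $byUnset[$i] . "\n";
}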


📉  Outstanding reports

Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.

https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Breakdown of recent months:

  • March: 3 of 10 reports left (unchanged). ⚠️
  • April: 4 of 14 left (unchanged).
  • May: (All clear!)
  • June: 4 of 11 left (unchanged).
  • July: 8 of 18 left (unchanged).
  • August: Two reports closed! 2 of 14 reports left.
  • September: One report closed, 7 of 12 left.
  • October: Two reports closed, 6 of 12 left.
  • November: 5 of 5 left (unchanged).
  • December: 6 of 9 left (unchanged).
  • January: One report closed, 6 of 7 reports left.
  • February: 7 new reports survived the month of February.

Last month’s total over recent months was 57 open reports. Of those, 6 got closed, but with 7 new reports from February still open, the total is now up at 58 open reports.


🎉  Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.

Together, we’re getting there!

Until next time,

– Timo Tijhof


Footnotes:
[1] Incidents. – wikitech.wikimedia.org/wiki/Incident_documentation#2020
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…

Production Excellence #18: January 2020https://phabricator.wikimedia.org/phame/post/view/180/Krinkle (Timo Tijhof)2020-02-28T19:39:20+00:002020-03-24T22:08:17+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

📊  Month in numbers
  • 3 documented incidents. [1]
  • 26 new Wikimedia-prod-error reports. [2]
  • 26 Wikimedia-prod-error reports closed. [3]
  • 198 currently open Wikimedia-prod-error reports in total. [4]

To read more about these incidents and pending actionables; check Incident documentation § 2020, or Explore Wikimedia incident stats (interactive).


📖  Paradoxical array key

During the upgrade from HHVM to PHP 7.2, Wikimedia encountered several Zend engine bugs that could corrupt a PHP program at run-time. (Some of these bugs are still being worked on.) One of the bugs we fixed last month was particularly mysterious. Investigation led by @hashar and @tstarling.

MediaWiki would create an array in PHP and add a key-value pair to it. We could iterate this array, and see that our key was there. Moments later, if we tried to retrieve the key from that same array, sometimes the key would no longer exist!

After many ad-hoc debug logs, core dumps, and GDB sessions, the problem was tracked down to the string interning system of Zend PHP. String interning is a memory reduction technique. It means we only store one copy of a character sequence in RAM, even if many parts of the code use the same character sequence. For example, the words “user” and “edit” are frequently used in the MediaWiki codebase. One of those sequences is the empty string (“”), which is also used a lot in our code. This is the string we found disappearing most often from our PHP arrays. This bug affected several components, including Wikibase, the wikimedia/rdbms library, and ResourceLoader.

Tim used a hardware watchpoint in GDB, and traced the root cause to the Memcached client for PHP. The php-memcached client would “free” a string directly from the internal memory manager after doing some work. It did this even for “interned” strings that other parts of the program may still be depending on.

@jijiki and @Joe backported the upstream fix to our php-memcached package and deployed it to production. Thanks! — T232613


📉  Outstanding reports

Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.

https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Breakdown of recent months (past two weeks not included):

  • March: 3 of 10 reports left (unchanged). ⚠️
  • April: Two reports closed, 4 of 14 left.
  • May: (All clear!)
  • June: Two reports closed. 4 of 11 left.
  • July: Four reports closed, 8 of 18 left.
  • August: 4 of 14 reports left (unchanged).
  • September: One report closed, 8 of 12 left.
  • October: 8 of 12 left (unchanged).
  • November: 5 of 5 left (unchanged).
  • December: Three reports closed, 6 of 9 left.
  • January: 7 new reports survived the month of January.

There are a total of 57 reports filed in recent months that remain open. This is down from 62 last month.


🎉  Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


Footnotes:
[1] Incidents. – wikitech.wikimedia.org/wiki/Incident_documentation#2019
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…

Production Excellence #17: December 2019https://phabricator.wikimedia.org/phame/post/view/179/Krinkle (Timo Tijhof)2020-01-10T02:51:24+00:002020-07-23T03:09:36+00:00

How’d we do in our strive for operational excellence in November and December? Read on to find out!

📊 Month in numbers
  • 0 documented incidents in November, 5 incidents in December. [1]
  • 17 new Wikimedia-prod-error reports. [2]
  • 23 Wikimedia-prod-error reports closed. [3]
  • 190 currently open Wikimedia-prod-error reports in total. [4]

November had zero reported incidents. Prior to this, the last month with no documented incidents was December 2017. To read about past incidents and unresolved actionables; check Incident documentation § 2019.

Explore Wikimedia incident graphs (interactive)

cap.png (654×1 px, 33 KB)


📖 Many dots, do not a query make!

@dcausse investigated a flood of exceptions from SpecialSearch, which reported “Cannot consume query at offset 0 (need to go to 7296)”. This exception served as a safeguard in the parser for search queries. The code path was not meant to be reached. The root cause was narrowed down to the following regex:

/\G(?<negated>[-!](?=[\w]))?(?<word>(?:\\\\.|[!-](?!")|[^"!\pZ\pC-])+)/u

This regex looks complex, but it can actually be simplified to:

/(?:ab|c)+/

This regex still triggers the problematic behavior in PHP. It fails with a PREG_JIT_STACKLIMIT_ERROR when given a long string. Below is a reduced test case:

$ret = preg_match( '/(?:ab|c)+/', str_repeat( 'c', 8192 ) );
if ( $ret === false ) {
    print( "failed with: " . preg_last_error() );
}
  • Fails when given 1365 contiguous c on PHP 7.0.
  • Fails with 2731 characters on PHP 7.2, PHP 7.1, and PHP 7.0.13.
  • Fails with 8192 characters on PHP 7.3. (Might be due to php-src@bb2f1a6).

In the end, the fix we applied was to split the regex into two separate ones, remove the non-capturing group with a quantifier, and loop at the PHP level (Gerrit change 546209).

The lesson learned here is that the code did not properly check the return value of preg_match; this is even more important because the size allowed for the JIT stack changes between PHP versions.

For future reference, @dcausse concluded: The regex could be optimized to support more chars (~3 times more) by using atomic groups, like so /(?>ab|c)+/. — T236419
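
As a small illustration of that lesson (a sketch only, not the actual fix from change 546209): always check the return value, and report preg_last_error() when it is false.

<?php
// Minimal sketch: check preg_match()'s return value, since the PCRE JIT
// stack limit differs between PHP versions and failures are silent otherwise.
$subject = str_repeat( 'c', 8192 );

// Atomic-group variant mentioned above; per the note, it should support
// roughly 3x longer inputs than the non-atomic /(?:ab|c)+/.
$ret = preg_match( '/(?>ab|c)+/', $subject, $matches );

if ( $ret === false ) {
    // e.g. PREG_JIT_STACKLIMIT_ERROR or PREG_BACKTRACK_LIMIT_ERROR
    throw new RuntimeException( 'Regex failed, preg_last_error: ' . preg_last_error() );
}
echo $ret === 1 ? 'matched ' . strlen( $matches[0] ) . " chars\n" : "no match\n";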


📉 Outstanding reports

Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.

https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Or help someone that’s already started with their patch:

→ Open prod-error tasks with a Patch-For-Review

Breakdown of recent months (past two weeks not included):

  • March: 3 of 10 reports left. (unchanged). ⚠️
  • April: Three reports closed, 6 of 14 left.
  • May: (All clear!)
  • June: Three reports closed. 6 of 11 left (unchanged). ⚠️
  • July: One report closed, 12 of 18 left.
  • August: Two reports closed, 4 of 14 left.
  • September: One report closed, with 9 of 12 left.
  • October: Four reports closed, 8 of 12 left.
  • November: 5 new reports survived the month of November.
  • December: 9 new reports survived the month of December.

🎉 Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production.

Until next time,

– Timo Tijhof


Footnotes:
[1] Incidents. – wikitech.wikimedia.org/wiki/Incident_documentation#2019
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…

Production Excellence #16: October 2019https://phabricator.wikimedia.org/phame/post/view/178/Krinkle (Timo Tijhof)2019-11-08T05:57:12+00:002020-07-23T03:09:58+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

📊 Month in numbers
  • 3 documented incidents. [1]
  • 33 new Wikimedia-prod-error reports. [2]
  • 30 Wikimedia-prod-error reports closed. [3]
  • 207 currently open Wikimedia-prod-error reports in total. [4]

There were three recorded incidents last month, which is slightly below our median of the past two years (Explore this data). To read more about these incidents, their investigations, and pending actionables; check Incident documentation § 2019.


📖 To Log or not To Log

MediaWiki uses the PSR-3 compliant Monolog library to send messages to Logstash (via rsyslog and Kafka). These messages are used to automatically detect (by quantity) when the production cluster is in an unstable state. For example, due to an increase in application errors when deploying code, or if a backend system is failing. Two distinct issues hampered the storing of these messages this month, and both affected us simultaneously.

Elasticsearch mapping limit

The Elasticsearch storage behind Logstash optimises responses to Logstash queries with an index. This index has an upper limit to how many distinct fields (or columns) it can have. When that limit is reached, messages with fields not yet in the index are discarded. Our Logstash indexes are sharded by date and source (one for “mediawiki”, one for “syslog”, and one for everything else).

This meant that error messages were stored only if they contained fields already used by other errors stored that day, which in turn would only succeed if that day’s columns weren’t already fully taken. A seemingly random subset of error messages was then rejected for a full day. Each day that subset got a new chance at reserving its columns, so long as the specific kind of error was triggered early enough.

To unblock deployment automation and monitoring of MediaWiki, an interim solution was devised. The subset of messages from “mediawiki” that deal with application errors now have their own index shard. These error reports follow a consistent structure, and contain no free-form context fields. As such, this index (hopefully) can’t reach its mapping limit or suffer message loss.

The general index mapping limit was also raised from 1000 to 2000. For now that means we’re not dropping any non-critical/debug messages. More information about the incident at T234564. The general issue with accommodating debug messages in Logstash long-term, is tracked at T180051. Thanks @matmarex, @hashar, and @herron.

Crash handling

Wikimedia’s PHP configuration has a “crash handler” that kicks in if everything else fails. For example, when the memory limit or execution timeout is reached, or if some crucial part of MediaWiki fails very early on. In that case our crash handler renders a Wikimedia-branded system error page (separate from MediaWiki and its skins). It also increments a counter metric for monitoring purposes, and sends a detailed report to Logstash. When the crash handler was migrated from HHVM to PHP7, one part of the puzzle was forgotten: the Logstash configuration that forwards these reports from php-fpm’s syslog channel to the one for mediawiki.

As such, our deployment automation and several Logstash dashboards were blind to a subset of potential fatal errors for a few days. Regressions during that week were instead found by manually digging through the raw feed of the php-fpm channel. As a temporary measure, Scap was updated to consider the php-fpm channel as well in its automation that decides whether a deployment is “green”.

We’ve created new Logstash configurations that forward PHP7 crashes in a similar way as we did for HHVM in the past. Bookmarked MW dashboards/queries you have for Logstash now provide a complete picture once again. Thanks @jijiki and @colewhite! – T234283


📉 Outstanding reports

Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.

https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Or help someone that’s already started with their patch:
Open prod-error tasks with a Patch-For-Review

Breakdown of recent months (past two weeks not included):

  • March: 1 report fixed. (3 of 10 reports left).
  • April: 8 of 14 reports left (unchanged). ⚠️
  • May: (All clear!)
  • June: 9 of 11 reports left (unchanged). ⚠️
  • July: 13 of 18 reports left (unchanged).
  • August: 2 reports were fixed! (6 of 14 reports left).
  • September: 2 reports were fixed! (10 of 12 new reports left).
  • October: 12 new reports survived the month of October.

🎉 Thanks!

Thank you, to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


🌴“Gotta love crab. In time, too. I couldn't take much more of those coconuts. Coconut milk is a natural laxative. That's something Gilligan never told us.

Footnotes:
[1] Incidents. – wikitech.wikimedia.org/wiki/Special:PrefixIndex?prefix=Incident…
[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…
[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…
[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…

Introducing Phatalityhttps://phabricator.wikimedia.org/phame/post/view/177/mmodell (Mukunda Modell)2019-10-07T00:36:27+00:002019-10-18T13:39:06+00:00

Introduction

This past week marks the release of a little tool that I've been working on for a while. In fact, it's something I've wanted to build for more than a year. But before I tell you about the solution, I need to describe the problem that I set out to solve.

Problem

Production errors are tracked with the tag Wikimedia-production-error. As a member of the Release-Engineering-Team, I've spent a significant amount of time copying details from Kibana log entries and pasting them into the Production Error Report form here in Phabricator. There are several of us who do this on a regular basis, including most of my team and a few others as well. I don't know precisely how much time is spent on error reporting, but at least a handful of people are going through this process several times each week.

This is what led to the idea for rPHAT Phatality: I recognized immediately that if I could streamline the process and save even a few seconds each time, the aggregate time savings could really add up quickly.

Solution

So after considering a few ways in which the process could be automated or otherwise streamlined, I finally focused on what seemed like the most practical: build a Kibana plugin that will format the log details and send them over to Phabricator, eliminating the tedious series of copy/paste operations.

Phatality has a couple of other tricks up its sleeve, but the essence of it is just that: capture all of the pertinent details from a single log message in Kibana and send them to Phabricator all at once with the click of a button in Kibana.

Phatality screenshot showing the submit and search buttons

Clicking the [Submit] button, as seen in the above screenshot, will take you to the Phabricator Production Error form with all of the details pre-filled and ready to submit:

Screenshot from 2019-10-06 14-05-09.png (742×990 px, 81 KB)

Conclusion

Now that Phatality is deployed to production and a few of us have had a chance to use it to submit error reports, I can say that I definitely think it was a worthwhile effort. The Kibana plugin wasn't terribly difficult to write, and thanks to @fgiunchedi's help, the deployment went fairly smoothly. Phatality definitely streamlines the reporting process, saving several clicks each time and ensuring accuracy in the details that get sent to Phabricator. In a future version of the tool I plan to add more features such as duplicate detection to help avoid duplicate submissions.

If you use Wikimedia's Kibana to report errors in Phabricator then I encourage you to look for the Phatality tab in the log details section and save some clicks!

What other repetitive tasks are ripe for automation? I'd love to hear suggestions and ideas in the comments.

Integrating code coverage metrics with your development workflowhttps://phabricator.wikimedia.org/phame/post/view/174/kostajh (Kosta Harlan)2019-10-09T10:04:00+00:002019-11-23T14:01:50+00:00

In Changes and improvements to PHPUnit testing in MediaWiki, I wrote about efforts to help speed up PHPUnit code coverage generation for local development.[0] While that work improves coverage generation time, it could be better.

As the Manual:PHP unit testing/Code coverage page advises, adjusting the whitelist in the PHPUnit XML configuration can speed things up dramatically. The problem is, adjusting that file is a manual process and a little cumbersome, so I usually didn't do it. And then because code coverage generation reports were slow locally[1], I ended up not running them while working on a patch. True, you will get feedback on code coverage metrics from CI, but it would be nicer if you could quickly get this information in your local environment first.

This was the motivation to add a Composer script in MediaWiki core that will help you adjust the PHPUnit coverage whitelist quickly while you're working on a patch for an extension or skin.

You can run it with composer phpunit:coverage-edit -- extensions/$EXT_NAME, e.g. composer phpunit:coverage-edit -- extensions/GrowthExperiments.

The ComposerPhpunitXmlCoverageEdit.php script copies the phpunit.xml.dist file to phpunit.xml (not version controlled), and modifies the whitelist to add directories for that extension/skin. vendor/bin/phpunit then reads phpunit.xml instead of the phpunit.xml.dist file. Tip: Make sure "Edit configurations" in your IDE (PhpStorm in my case) is using vendor/bin/phpunit and phpunit.xml, not phpunit.xml.dist, when executing the tests.

generating phpunit.xml and running code coverage in phpstorm

When you want to reset your configuration, you can rm phpunit.xml and vendor/bin/phpunit will read from phpunit.xml.dist again.

Further improvements to the script could include:

  • Reading the extension.json file to determine which directories to add to the whitelist, rather than using a hardcoded list (T235029)
  • Allow passing arbitrary directories/filenames, e.g. for working with subsections of core or of a larger extension (T235030)
  • Adding a flag for flipping the addUncoveredFilesFromWhitelist property, so that phpunit-suite-edit.py in the integration/config repo could be removed in favor of the Composer script (T235031)

Thanks to @Mainframe98 and @Krinkle for review of the patch and to @AnneT for reviewing this post. Happy hacking!


[0] [[ https://gerrit.wikimedia.org/r/c/mediawiki/core/+/520459 | One patch changed <whitelist addUncoveredFilesFromWhitelist="true"> to false ]] to help speed up PHPUnit code coverage generation, the [[ https://gerrit.wikimedia.org/r/c/integration/config/+/521190 | second patch flipped the flag back to true in CI ]] for generating complete coverage reports.
[1] For GrowthExperiments, generating coverage reports without a customized whitelist takes ~17 seconds. With a custom whitelist, it takes ~1 second. While 17 seconds is arguably not a lot of time, the near-instant feedback with a customized whitelist means one is less likely to face interruptions to their flow or concentration while working on a patch.

Production Excellence #15: September 2019https://phabricator.wikimedia.org/phame/post/view/173/Krinkle (Timo Tijhof)2019-10-24T23:25:57+00:002020-04-03T16:16:21+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

📊 Month in numbers
  • 5 documented incidents. [1]
  • 22 new errors reported. [2]
  • 31 error reports closed. [3]
  • 213 currently open Wikimedia-prod-error reports in total. [4]

There were five recorded incidents last month, equal to the median for this and last year. – Explore this data.

To read more about these incidents, their investigations, and pending actionables; check Incident documentation § 2019.


*️⃣ A Tale of Three Great Upgrades

This month saw three major upgrades across the MediaWiki stack.

Migrate from HHVM to PHP 7.2

The client-side switch to toggle between HHVM and PHP 7.2 saw its final push — from the 50% it was at previously, to 100% of page view sessions on 17 September. The switch further solidified on 24 September when static MediaWiki traffic followed suit (e.g. API and ResourceLoader). Thanks @jijiki and @Joe for the final push. – More details at T219150 and T176370.

Drop support for IE6 and IE7

The RFC to discontinue basic compatibility for the IE6 and IE7 browsers entered Last Call on 18 September. It was approved on 2 Oct (T232563). Thanks to @Volker_E for leading the sprint to optimise our CSS payloads by removing now-redundant style rules for IE6-7 compat. – More at T234582.

Transition from PHPUnit 4/6 to PHPUnit 8

With HHVM behind us, our Composer configuration no longer needs to be compatible with a “PHP 5.6 like” run-time. Support for the real PHP 5.6 was dropped over 2 years ago, and the HHVM engine supports PHP 7 features. But, the HHVM engine identifies as “PHP 5.6.999-hhvm”. As such, Composer refused to install PHPUnit 6 (which requires PHP 7.0+). Instead, Composer could only install PHPUnit 4 under HHVM (as for PHP 5.6). Our unit tests have had to remain compatible with both PHPUnit 4 and PHPUnit 6 simultaneously.

Now that we’re fully on PHP 7.2+, our Composer configuration effectively drops PHP 5.6, 7.0 and 7.1 all at once. This means that we no longer run PHPUnit tests on multiple PHPUnit versions (PHPUnit 6 only). The upgrade to PHPUnit 8 (PHP 7.2+) is also unlocked! Thanks @MaxSem, @Jdforrester-WMF and @Daimona for leading this transition. – T192167


📉 Outstanding reports

Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.

https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Or help someone that’s already started with their patch:
Open prod-error tasks with a Patch-For-Review

Breakdown of recent months (past two weeks not included):

  • February: 1 report was closed. (1 / 5 reports left).
  • March: 4 / 10 reports left (unchanged).
  • April: 8 / 14 reports left (unchanged). ⚠️
  • May: The last 4 reports were resolved. Done! ❇️
  • June: 9 of 11 reports left (unchanged). ⚠️
  • July: 4 reports were fixed! (13 / 18 reports left).
  • August: 6 reports were fixed! (8 / 14 reports left).
  • September: 12 new reports survived the month of September.

🎉 Thanks!

Thank you, to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


📖“I'm not crazy about reality, but it's still the only place to get a decent meal.

Footnotes:

[1] Incidents. –
wikitech.wikimedia.org/wiki/Special:PrefixIndex?prefix=Incident…

[2] Tasks created. –
phabricator.wikimedia.org/maniphest/query…

[3] Tasks closed. –
phabricator.wikimedia.org/maniphest/query…

[4] Open tasks. –
phabricator.wikimedia.org/maniphest/query…

Production Excellence #14: August 2019https://phabricator.wikimedia.org/phame/post/view/172/Krinkle (Timo Tijhof)2019-10-03T04:27:16+00:002020-04-03T16:20:49+00:00

How’d we do in our drive for operational excellence in August? Read on to find out!

📊 Month in numbers
  • 3 documented incidents. [1]
  • 42 new Wikimedia-prod-error reports. [2]
  • 31 Wikimedia-prod-error reports closed. [3]
  • 210 currently open Wikimedia-prod-error reports in total. [4]

The number of recorded incidents in August, at three, was below average for the year so far. However, in previous years (2017-2018), August also had 2-3 incidents. – Explore this data.

To read more about these incidents, their investigations, and pending actionables; check Incident documentation § 2019.


*️⃣ When you have eliminated the impossible...

Reports from Logstash indicated that some user requests were aborted by a fatal PHP error from the MessageCache class. The user would be shown a generic system error page. The affected requests didn’t seem to have anything obvious in common, however. This made it difficult to diagnose.

MessageCache is responsible for fetching interface messages, such as the localised word “Edit” on the edit button. It calls a “load()” function and then tries to access the loaded information. However, sometimes the load function would claim to have finished its work, yet the information was not there.

When the load function initialises all the messages for a particular language, it keeps track of this, so as to not do the same work a second time. From whichever angle I looked at this code, no obvious mistakes stood out. A deeper investigation revealed that two unrelated changes (made more than a year apart) each broke one assumption that was safe to break on its own. Put together, however, this seemingly impossible problem emerged. Check out T208897#5373846 for the details of the investigation.


📉 Outstanding reports

Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.

https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Or help someone that’s already started with their patch:
Open prod-error tasks with a Patch-For-Review

Breakdown of recent months (past two weeks not included):

  • January: 1 report left (unchanged). ⚠️
  • February: 2 reports left (unchanged). ⚠️
  • March: 4 reports left (unchanged). ⚠️
  • April: 2 reports got fixed! (8 of 14 reports left). ❇️
  • May: 4 of 10 reports left (unchanged).
  • June: 1 report got fixed! (8 of 11 reports left). ❇️
  • July: 2 reports got fixed (17 of 18 reports left).
  • August: 14 new reports remain unsolved.
  • September: 11 new reports remain unsolved.

🎉 Thanks!

Thank you to @aaron, @Catrope, @Daimona, @dbarratt, @Jdforrester-WMF, @kostajh, @pmiazga, @Tarrow, @zeljkofilipin, and everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


🎭“I think you should call it Seb's because no one will come to a place called Chicken on a Stick.”

Footnotes:

[1] Incidents. – wikitech.wikimedia.org/wiki/Special:PrefixIndex?prefix=Incident…

[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…

[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…

[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…

Changes and improvements to PHPUnit testing in MediaWikihttps://phabricator.wikimedia.org/phame/post/view/169/kostajh (Kosta Harlan)2019-07-16T04:13:53+00:002020-11-25T10:32:58+00:00

Building off the work done at the Prague Hackathon (T216260), we're happy to announce some significant changes and improvements to the PHP testing tools included with MediaWiki.

PHP unit tests can now be run statically, without installing MediaWiki

You can now download MediaWiki, run composer install, and then composer phpunit:unit to run core's unit test suite (T89432).
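As a quick reference, that flow looks like the following; the clone URL is the usual anonymous Gerrit one and is shown purely as an illustration:

# Fetch MediaWiki core, install dependencies, and run the unit suite (no wiki install needed)
git clone https://gerrit.wikimedia.org/r/mediawiki/core.git mediawiki
cd mediawiki
composer install
composer phpunit:unit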

The standard PHPUnit entrypoint can be used, instead of the PHPUnit Maintenance class

You can now use the plain PHPUnit entrypoint at vendor/bin/phpunit instead of the MediaWiki maintenance class which wraps PHPUnit (tests/phpunit/phpunit.php).

Both the unit tests and integration tests can be executed with the standard phpunit entrypoint (vendor/bin/phpunit) or, if you prefer, with the composer scripts defined in composer.json (e.g. composer phpunit:unit). We accomplished this by writing a new bootstrap.php file (the old one, which the maintenance class uses, was moved to tests/phpunit/bootstrap.maintenance.php) which executes the minimal amount of code necessary to make core, extension and skin classes discoverable by test classes.
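For example, from the root of a mediawiki/core checkout, the plain entrypoint can be pointed at either suite. The integration test path below is a hypothetical file, used only to show the shape of the command, and integration tests still expect an installed wiki:

# Run the whole unit suite through the plain PHPUnit entrypoint
vendor/bin/phpunit tests/phpunit/unit

# Run a single (hypothetical) integration test file the same way
vendor/bin/phpunit tests/phpunit/integration/includes/SomethingTest.php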

Tests should be placed in tests/phpunit/{integration,unit}

Integration tests should be placed in tests/phpunit/integration while unit tests go in tests/phpunit/unit; both are discoverable by the new test suites (T87781). It sounds obvious now to write this, but a nice side effect is that by organizing tests into these directories it is immediately clear to authors and reviewers what type of test one is looking at.

Introducing MediaWikiUnitTestCase

A new base test case, MediaWikiUnitTestCase, has been introduced with a minimal amount of boilerplate: a @covers validator, ensuring globals are disabled, checking that the tests are in the proper directory, and the default PHPUnit 4 and 6 compatibility layer. MediaWikiTestCase has been renamed to MediaWikiIntegrationTestCase for clarity.

Please migrate tests to be unit tests where appropriate

A significant portion of core's unit tests have been ported to use MediaWikiUnitTestCase, approximately 50% of the total. We have also worked on porting extension tests to the unit/integration directories. @Ladsgroup wrote a helpful script to assist with automating the identification and moving of unit tests, see P8702. Migrating tests from MediaWikiIntegrationTestCase to MediaWikiUnitTestCase makes them faster.

Note that unit tests in CI are still run with the PHPUnit maintenance class (tests/phpunit/phpunit.php), so when reviewing unit test patches please execute them locally with vendor/bin/phpunit /path/to/tests/phpunit/unit or composer phpunit -- /path/to/tests/phpunit/unit.

Generating code coverage is now faster

The PHPUnit configuration file now resides at the root of the repository, and is called phpunit.xml.dist. (As an aside, you can copy this to phpunit.xml and make local changes, as that file is git-ignored, although you should not need to do that.) We made a modification (T192078) to the PHPUnit configuration inside MediaWiki to speed up code coverage generation. This makes it feasible to have a split window in your IDE (e.g. PhpStorm), run "Debug with coverage", and see the results in your editor fairly quickly after running the tests.
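For instance, something along these lines produces an HTML coverage report scoped to a single extension's unit tests; the paths are illustrative, and a coverage driver such as Xdebug needs to be enabled:

# Optional: create a git-ignored local copy of the configuration to tweak
cp phpunit.xml.dist phpunit.xml

# Generate an HTML coverage report for one extension's unit tests
vendor/bin/phpunit --coverage-html coverage/ extensions/GrowthExperiments/tests/phpunit/unit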

Debug coverage in PhpStorm

What is next?

Things we are working on:

  • Porting core tests to integration/unit.
  • Porting extension tests to integration/unit.
  • Removing legacy test suites or ensuring they can be run in a different way (passing the directory name, for example).
  • Switching CI to use the new entrypoint for unit tests, then for unit and integration tests.

Help is wanted in all areas of the above! We can be found in the #wikimedia-codehealth channel and via the phab issues linked in this post.

Credits

The above work has been done and supported by Máté (@TK-999), Amir (@Ladsgroup), Kosta (@kostajh), James (@Jdforrester-WMF), Timo (@Krinkle), Leszek (@WMDE-leszek), Kunal (@Legoktm), Daniel (@daniel), Michael Große (@Michael), Adam (@awight), Antoine (@hashar), JR (@Jrbranaa) and Greg (@greg) along with several others. Thank you!

Thanks for reading, and happy testing!

Amir, Kosta, & Máté

Production Excellence #13: July 2019https://phabricator.wikimedia.org/phame/post/view/164/Krinkle (Timo Tijhof)2019-08-30T20:08:00+00:002020-04-03T16:30:28+00:00

How’re we doing on that drive for operational excellence? Read this first anniversary edition to find out!

📊 Month in numbers
  • 5 documented incidents. [1]
  • 53 new Wikimedia-prod-error reports. [2]
  • 44 closed Wikimedia-prod-error reports. [3]
  • 218 currently open Wikimedia-prod-error reports in total. [4]

The number of recorded incidents over the past month, at five, is equal to the median number of incidents per month (2016-2019). – Explore this data.

To read more about these incidents, their investigations, and pending actionables; check Incident documentation § 2019.


📖 One year of Excellent adventures!

Exactly one year ago this periodical started to provide regular insights on production stability. The idea was to shorten the feedback cycle between deployment of code that leads to fatal errors and the discovery of those errors. This allows more people to find reports earlier, which (hopefully) prevents them from sneaking into a growing pile of “normal” errors.

576 reports were created between 15 July 2018 and 31 July 2019 (tagged Wikimedia-prod-error).
425 reports got closed over that same time period.

Read the first issue in story format, or the initial e-mail.


📉 Outstanding reports

Take a look at the workboard and look for tasks that might need your help. The workboard lists error reports, grouped by the month in which they were first observed.

https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Or help someone who already started with their patch:
Open prod-error tasks with a Patch-For-Review

Breakdown of recent months (past two weeks not included):

  • November: 1 report left (unchanged). ⚠️
  • December: 3 reports left (unchanged). ⚠️
  • January: 1 report left (unchanged). ⚠️
  • February: 2 reports left (unchanged). ⚠️
  • March: 4 reports left (unchanged). ⚠️
  • April: 10 of 14 reports left (unchanged). ⚠️
  • May: 2 reports got fixed! (4 of 10 reports left). ❇️
  • June: 2 reports got fixed! (9 of 11 reports left). ❇️
  • July: 18 new reports from last month remain unsolved.

🎉 Thanks!

Thank you to @aaron, @Anomie, @ArielGlenn, @Catrope, @cscott, @Daimona, @dbarratt, @dcausse, @EBernhardson, @Jdforrester-WMF, @jeena, @MarcoAurelio, @SBisson, @Tchanders, @Tgr, @tstarling, @Urbanecm; and everyone else who helped by finding, investigating, or resolving error reports in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


Quote: 🎙 “Unlike money, hope is for all: for the rich as well as for the poor.”

Footnotes:

[1] Incidents. – wikitech.wikimedia.org/wiki/Special:PrefixIndex?prefix=Incident…

[2] Tasks created. – phabricator.wikimedia.org/maniphest/query…

[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query…

[4] Open tasks. – phabricator.wikimedia.org/maniphest/query…

Production Excellence #12: June 2019https://phabricator.wikimedia.org/phame/post/view/163/Krinkle (Timo Tijhof)2019-07-31T18:44:42+00:002020-04-03T16:29:38+00:00

How’d we do in our drive for operational excellence last month? Read on to find out!

📊 Month in numbers
  • 11 documented incidents. ⚠️ [1]
  • 39 new Wikimedia-prod-error reports. [2]
  • 25 Wikimedia-prod-error reports closed. [3]

The number of incidents in June was high compared to previous years. At 11 incidents, this is higher than this year’s median (5), the 2018 median (4), and the 2017 median (5). It is also higher than any month of June in the last 4 years. – More data at CodePen.

To read more about these incidents, their investigations, and pending actionables; check Incident documentation § 2019.

There are currently 204 open Wikimedia-prod-error reports (up from 186 in April, and 201 in May). [4]


📖 [Op-ed] Integrated maintenance cost

A shoutout to the Wikidata and Core Platform teams, at WMDE and WMF respectively. Both recently established a rotating subteam that focuses on incidental work, such as maintenance and other work that might otherwise hinder feature development.

I expect this to improve efficiency by avoiding context switches between feature work and incidental work. The rotational aspect should distribute the work more evenly among team members (avoiding burnout). And it may increase exposure to other teams and to lesser-known areas of our code, which provides opportunities for personal growth and helps retain institutional knowledge.


📉 Current problems

Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the month in which they were first observed.

https://phabricator.wikimedia.org/tag/wikimedia-production-error

Or help someone who already started with their patch:
Open prod-error tasks with a Patch-For-Review

Breakdown of recent months (past two weeks not included):

  • November: 1 issue got fixed! (1 issue left).
  • December: 3 issues left (unchanged). ⚠️
  • January: 1 issue left (unchanged). ⚠️
  • February: 2 issues left (unchanged). ⚠️
  • March: 4 issues left (unchanged). ⚠️
  • April: 2 issues got fixed! (10 of the 14 issues that survived April remain open). ❇️
  • May: 4 issues got fixed! (6 of the 10 issues that survived May are left). ❇️
  • June: 11 new issues from last month remain unresolved.

By steward and software component, the unresolved issues that survived June:

  • CPT / MW Auth (PHP fatal): T228717
  • CPT / MW Actor (DB contention): T227739
  • CPT or Multimedia / Thumb handler (MultiCurl error): T225197
  • Multimedia / File metadata (PHP error): T226751
  • Wikidata / Commons page view (PHP fatal): T227360
  • Wikidata / Jobrunner (PHP memory fatal): T227450
  • Wikidata / Jobrunner (Trx error): T225098
  • Product-Infra / ReadingList API (PHP fatal): T226593
  • (Unknown?) / Special:ConfirmEmail (PHP fatal): T226337
  • (Unknown?) / Page renaming (DB timeout): T226898
  • (Unknown?) / Page renaming (Bad revision fatal): T225366
💡Ideas: To suggest something to investigate or highlight in a future edition, contact me by e-mail or private IRC message.

🎉 Thanks!

Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production. Including: @Anomie, @brion, @Catrope, @cscott, @daniel, @dcausse, @DerFussi, @Ebe123, @fgiunchedi, @Jdforrester-WMF, @kostajh, @Legoktm, @Lucas_Werkmeister_WMDE, @matmarex, @matthiasmullie, @Michael, @Nikerabbit, @SBisson, @Smalyshev, @Tchanders, @Tgr, @Tpt, @Umherirrender, and @Urbanecm.

Thanks!

Until next time,

– Timo Tijhof

🔮“These are his marbles...” “Ha! He really did lose his marbles, didn't he?” “Yeah, he lost them good.”

Footnotes:

  1. Incidents. – wikitech.wikimedia.org/wiki/Special:PrefixIndex…
  2. Tasks created. – phabricator.wikimedia.org/maniphest/query…
  3. Tasks closed. – phabricator.wikimedia.org/maniphest/query…
  4. Open tasks. – phabricator.wikimedia.org/maniphest/query…
Production Excellence #11: May 2019https://phabricator.wikimedia.org/phame/post/view/162/Krinkle (Timo Tijhof)2019-07-01T18:56:32+00:002020-10-04T22:05:06+00:00

How’d we do in our drive for operational excellence last month? Read on to find out!

📊 Month in numbers
  • 6 documented incidents. [1]
  • 41 new Wikimedia-prod-error tasks created. [2]
  • 36 Wikimedia-prod-error tasks closed. [3]

The number of incidents in May of this year was comparable to previous years (6 in May 2019, 2 in May 2018, 5 in May 2017), and previous months (6 in May, 8 in April, 8 in March) – comparisons at CodePen.

To read more about these incidents, their investigations, and pending actionables; check wikitech.wikimedia.org/wiki/Incident_documentation#2019.

As of writing, there are 201 open Wikimedia-prod-error tasks (up from 186 last month). [4]


📉 Current problems

Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the month in which they were first observed.

https://phabricator.wikimedia.org/tag/wikimedia-production-error

Or help someone that’s already started with their patch:
Open prod-error tasks with a Patch-For-Review

Breakdown of recent months (past two weeks not included):

  • November: 2 issues left (unchanged).
  • December: 1 issue got fixed. 3 issues left (down from 4).
  • January: 1 issue left (unchanged).
  • February: 2 issues left (unchanged).
  • March: 1 issue got fixed. 4 issues remaining (down from 5).
  • April: 2 issues got fixed. 12 issues remain unresolved (down from 14).
  • May: 10 new issues found last month survived the month of May, and remain unresolved.

By steward and software component, unresolved issues from April and May:

  • Wikidata / Lexeme (API query fatal): T223995
  • Wikidata / WikibaseRepo (API Fatal hasSlot): T225104
  • Wikidata / WikibaseRepo (Diff link fatal): T224270
  • Wikidata / WikibaseRepo (Edit undo fatal): T224030
  • Growth / Echo (Notification storage): T217079
  • Growth / Flow (Topic link fatal): T224098
  • Growth / Page deletion (File pages): T222691
  • Multimedia or CPT / API (Image info fatal): T221812
  • CPT / PHP7 refactoring (File descriptions): T223728
  • CPT / Title refactor (Block log fatal): T224811
  • CPT / Title refactor (Pageview fatals): T224814
  • (Unstewarded) Page renaming: T223175, T205675
💡Ideas: To suggest an investigation to write about in a future edition, contact me by e-mail, or private message on IRC.

🎉 Thanks!

Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production.

Until next time,

– Timo Tijhof

🎙“It’s not too shabby, is it?”

Footnotes:

[1] Incidents. –
wikitech.wikimedia.org/wiki/Special:PrefixIndex…

[2] Tasks created. –
phabricator.wikimedia.org/maniphest/query…

[3] Tasks closed. –
phabricator.wikimedia.org/maniphest/query…

[4] Open tasks. –
phabricator.wikimedia.org/maniphest/query…

Introducing the codehealth pipeline betahttps://phabricator.wikimedia.org/phame/post/view/160/kostajh (Kosta Harlan)2019-05-14T20:29:35+00:002019-06-12T02:54:51+00:00

After many months of discussion, work and consultation across teams and departments[0], and with much gratitude and appreciation for the hard work and patience of @thcipriani and @hashar, the Code-Health-Metrics group is pleased to announce the introduction of the code health pipeline. The pipeline is currently in beta and enabled for GrowthExperiments, soon to be followed by Notifications, PageTriage, and StructuredDiscussions. (If you'd like to enable the pipeline for an extension you maintain or contribute to, please reach out to us via the comments on this post.)

What are we trying to do?

The Code-Health-Metrics group has been working to define a set of common code health metrics. The code health factors we currently consider are: simplicity, readability, testability, and buildability. Beyond analyzing a given patch set for these factors, we also want to have a historical view of code as it evolves over time. We want to be able to see which areas of code lack test coverage, where refactoring a class due to excessive complexity might be called for, and where possible bugs exist.

After talking through some options, we settled on a proof-of-concept to integrate Wikimedia's gerrit patch sets with SonarQube as the hub for analyzing and displaying metrics on our code[1]. SonarQube is a Java project that analyzes code according to a set of rules. SonarQube has a concept of a "Quality Gate", which can be defined organization-wide or overridden on a per-project basis. The default Quality Gate says that of the code added in a patch set, over 80% must be covered by tests, less than 3% may consist of duplicated lines, and the maintainability, reliability and security ratings should each be graded as an A. If code passes these criteria then we say it has passed the quality gate; otherwise it has failed.

Here's an example of a patch that failed the quality gate:

screenshot of sonarqube quality gate

If you click through to the report, you can see that it failed because the patch introduced an unused local variable (code smell), so the maintainability score for that patch was graded as a C.

How does it integrate with gerrit?

For projects that have been opted in to the code health pipeline, submitting a new patch or commenting with "check codehealth" will result in the following actions:

  1. The mwext-codehealth-patch job checks out the patchset and installs MediaWiki
  2. PHPUnit is run and a code coverage report is generated
  3. npm test:unit is run, which may generate a code coverage report if the package.json file is configured to do so
  4. The sonar-scanner binary runs, which sends 1) the code, 2) the PHP code coverage, and 3) the JavaScript code coverage to SonarQube
  5. After SonarQube is done analyzing the code and coverage reports, the pipeline reports whether the quality gate passed or failed; a failure does not prevent the patch from being merged. (A rough sketch of these steps follows below.)
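In shell form, the testing and analysis steps amount to roughly the following. This is a loose sketch, not the job's actual configuration: the project key, report paths and property names are illustrative assumptions (the coverage property is the one documented for the SonarPHP analyzer):

# (Step 1, checking out the patch and installing MediaWiki, is handled by the job itself.)

# 2. Run PHPUnit and produce a coverage report (clover XML is one format SonarQube can ingest)
composer phpunit -- --coverage-clover coverage/php-coverage.xml

# 3. Run the front-end unit tests, which may emit JavaScript coverage if package.json is set up for it
npm run test:unit

# 4. Send the code and the coverage reports to SonarQube
sonar-scanner \
  -Dsonar.projectKey=mediawiki-extensions-GrowthExperiments \
  -Dsonar.php.coverage.reportPaths=coverage/php-coverage.xml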

pipeline screenshot

If you click the link, you'll be able to view the analysis in SonarQube. From there you can also view the code of a project and see which lines are covered by tests, which lines have issues, etc.

Also, when a patch merges, the mwext-codehealth-master-non-voting job executes, which updates the default view of a project in SonarQube with the latest code coverage and code metrics.[3]

What's next?

We would like to enable the code health pipeline for more projects, and eventually we would like to use it for core. One challenge with core is that it currently takes ~2 hours to generate the PHPUnit coverage report. We also want to gather feedback from the developer community on false positives and unhelpful rules. We have tried to start with a minimal set of rules that we think everyone could agree with but are happy to adjust based on developer feedback[2]. Our current list of rules can be seen in this quality profile.

If you'll be at the Hackathon, we will be presenting on the code health pipeline and SonarQube at the Code health and quality metrics in Wikimedia continuous integration session on Friday at 3 PM. We look forward to your feedback!

Kosta, for the Code-Health-Metrics group


[0] More about the Code Health Metrics group: https://www.mediawiki.org/wiki/Code_Health_Group/projects/Code_Health_Metrics, currently comprised of Guillaume Lederrey (R), Jean-Rene Branaa (A), Kosta Harlan (R), Kunal Mehta (C), Piotr Miazga (C), Željko Filipin (R). Thank you also to @daniel for feedback and review of rules in SonarQube.
[1] While SonarQube is an open source project, we currently use the hosted version at sonarcloud.io. We plan to eventually migrate to our own self-hosted SonarQube instance, so we have full ownership of tools and data.
[2] You can add a topic here https://www.mediawiki.org/wiki/Talk:Code_Health_Group/projects/Code_Health_Metrics
[3] You might have also noticed a post-merge job over the last few months, wmf-sonar-scanner-change. This job did not incorporate code coverage, but it did analyze most of our extensions and MediaWiki core, and as a result there is a set of project data and issues that might be of interest to you. The Issues view in SonarQube might be interesting, for example, as a starting point for new developers who want to contribute to a project and want to make some small fixes.

Quibble hibernated, it is time to flourishhttps://phabricator.wikimedia.org/phame/post/view/155/hashar (Antoine Musso)2019-03-28T11:48:29+00:002019-03-29T11:01:09+00:00

Writing blog posts is neither my job nor something that I enjoy, so I am late with the Quibble updates. The last one, Blog Post: Quibble in summer, was written in September 2018 and I forgot to publish it until now. You might want to read it first to get a glance at some nice changes that got implemented last summer.

I guess personal changes that happened in October and the traditional northern hemisphere winter hibernation kind of explain the delay (see note [ 1 ]). Now that spring is finally here ({{NPOV}}), it is time for another update.

Quibble went from 0.0.26 to 0.0.30, which I cut just before starting this post. I wanted to highlight a few changes from an overall small change log:

  • Use stronger password in Quibble related browser tests - T204569
  • Parallelize ext/skin linter
  • Parallelize mediawiki/core linter
  • PHPUnit generates Junit results - T207841
  • readme: how to reproduce a CI build - T200991
  • doc: quibble-stretch no more has php
  • mediawiki.d: Avoid vars that look like core or wmf names
  • Drop /p from Gerrit clone URL - T218844
  • Support to clone repositories in parallel - T211701
  • Properly abort when git submodule processing fails - T198980
  • mediawiki.d: Improve docs about dev settings and combine env sections
  • mediawiki.d: Merge into one file

Parallelism [ 2 ]

The first incarnation of Quibble did not have much thought put into it with regard to speed. The main goal at the time was simply to gather all the complicated logic from CI shell scripts, Jenkins job shell snippets, and Python or JavaScript scripts into one single command. That in turn made it easier to reproduce a build, but with a serious limitation: commands were just run serially, which is far from optimal.

Quibble now runs the lint commands in parallel for both extensions/skins and mediawiki/core. Internally, it forks and runs composer test and npm test in parallel, which slightly speeds up the time it takes for the linting commands to complete.

Another annoyance is that when testing multiple repositories together, preparing the git repositories could take several minutes. An example is an extension depending on several other extensions, or the gated wmf-quibble-* jobs which run tests for several Wikimedia-deployed extensions. Even when using a local cache of git repositories (--git-cache), the serially run git commands take a while. Quibble 0.0.30 learned --git-parallel to run the git commands in parallel. An example speed-up using the git cache, several repositories and a DSL connection:

git-parallel | Duration
1            | 630 seconds
             | 150 seconds

The option defaults to 1, which retains the exact same behavior and code path as before. I invite you to try --git-parallel=8 for example and draw your own conclusions. Wikimedia CI will be updated once Quibble 0.0.30 is deployed.
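For example, a hedged invocation against a local clone cache, where the cache path and the project name are placeholders:

# Clone or update the required repositories with 8 parallel workers
quibble --git-cache /srv/git --git-parallel=8 mediawiki/extensions/Wikibase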

Parallelism was added by myself, @hashar, and is partly tracked in T211701.

Documentation

Some parts of the documentation referred to Wikimedia CI containers that were no longer suitable for running tests due to refactoring. The documentation has thus been updated to use the proper containers: docker-registry.wikimedia.org/releng/quibble-stretch-php72 or docker-registry.wikimedia.org/releng/quibble-stretch-hhvm. -- @hashar

In August, Wikidata developers used Quibble to reproduce a test failure and they did the extra step to capture their session and document how to reproduce it. Thank you @Pablo-WMDE for leading this and @Tarrow, @Addshore, @Michael, @Ladsgroup for the reviews - T200991.

You can read the documentation online at:

Note: as of this writing, the CI git servers are NOT publicly reachable (git://contint1001.wikimedia.org and git://contint2001.wikimedia.org).

Submodule failures

Some extensions or skins might have submodules; however, we never caught errors when processing them failed and just kept going. That later causes tests to fail in non-obvious ways and caused several people to lose time recently. T198980

The reason is that Quibble simply borrowed a legacy shell script to handle submodules, and that script has been broken since its first introduction in 2014. It relied on the find command, which still exits 0 even with -exec /bin/false. Although the /bin/false exit code is 1, that simply causes find to consider the -exec predicate to be false; find then stops processing further predicates for that file, but does not treat it as an error.
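The behaviour is easy to check from any shell:

# find exits 0 even though the command run by -exec failed for every file
find . -maxdepth 1 -exec /bin/false \; ; echo "find exit code: $?"
# => find exit code: 0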

The logic has been ported to pure Python and now properly aborts when git submodule fails. That also drops the requirement to have the find command available, which might help on Windows. -- @hashar

Miscellaneous tweaks

The configuration injected by Quibble into LocalSettings.php is now a single file, where it previously was made of several small PHP files glued together by shelling out to php. The inline comments have been improved. -- @Krinkle

The MediaWiki installer now uses a slightly stronger password (testwikijenkinspass) to accommodate a security hardening in MediaWiki core itself. -- @Reedy T204569

The Gerrit URL to clone the canonical git repository from has been updated to catch up with a change in Gerrit: the path went from r/p to simply /r. -- @Legoktm T218844

PHPUnit generates JUnit test results in the log directory, intended to be captured and interpreted by CI. -- @hashar T207841

NOTE: those changes have not all been deployed to Wikimedia CI as of March 28th 2019 but should be next week.

footnotes

[ 1 ] Seasons are location-based and a cultural agreement; they are quite interesting in their own right. They are reversed in the Northern and Southern hemispheres, do not exist at the equator, while in India six seasons are defined. Thus when I refer to a winter hibernation, it really just reflects my own biased point of view.

[ 2 ] Parallelism is fun; I can never manage to write that word without mixing up the number of r's or l's for some reason. As a side note, my favorite sport to watch is parallel bars (enwiki).

CI working group report, with recommendations of new tools to tryhttps://phabricator.wikimedia.org/phame/post/view/153/LarsWirzenius (Lars Wirzenius)2019-03-25T18:29:49+00:002019-03-29T18:49:57+00:00

The working group to consider future CI tooling for Wikimedia has finished and produced a report. The report is at https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/CI_Futures_WG/Report and the short summary is that the release engineering team should do prototype implementations of Argo, GitLab CI/CD, and Zuul v3.

Help my CI job fails with exit status -11https://phabricator.wikimedia.org/phame/post/view/152/hashar (Antoine Musso)2019-03-21T09:52:59+00:002022-09-01T13:16:09+00:00

For a few weeks, a CI job had PHPUnit tests abruptly ending with:

returned non-zero exit status -11

The connoisseur [ 1 ] would have recognized that the negative exit status indicates the process exited due to a signal. On Linux, 11 is the value for the SIGSEGV signal, which is usually sent by the kernel to the process as a result of an invalid memory access. The default behavior is to terminate the process (man 7 signal) and to generate a core dump file (I will come to that later).
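As a quick illustration of that mapping: the -11 form is how Python's subprocess module (used by our CI tooling) reports a process killed by a signal, while a POSIX shell reports the same death as 128 + 11 = 139:

# A child killed by SIGSEGV has no normal exit code; the shell encodes the signal instead
sh -c 'kill -SEGV $$'; echo "exit status seen by the shell: $?"
# => Segmentation fault
# => exit status seen by the shell: 139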

But why? Some PHP code ended up triggering a code path in HHVM that would eventually try to read outside of its memory range, or some similar low-level fault. The kernel knows that the process completely misbehaved and thus, well, terminates it. Problem solved: you never want your program to misbehave when the kernel is in charge.

The job had recently been switched to use a new container in order to benefit from more recent libraries and to match the OS distribution used by the Wikimedia production systems. My immediate recommendation was to roll back to the previous known-good state, but eventually I let the task sit and was absorbed by other work (such as updating MediaWiki on the infrastructure).

Last week, the job suddenly began to fail constantly. We prevent code from being merged when a test fails, and thus the code stays in a quarantine zone (Gerrit) and cannot be shipped. A whole team (the Language-Team) could not ship code for one of their flagship projects (ContentTranslation). That in turn prevents end users from benefiting from new features they are eager for. The issue had to be acted on and became an unbreak now! kind of task. And so I went on my journey.

returned non-zero exit status -11: that is a good enough error message to start from. A process in a Docker container is really just an isolated process and is still managed by the host kernel. The first thing I did was to look at the kernel syslog facility on our instances, which yields:

kernel: [7943146.540511] php[14610]:
  segfault at 7f1b16ffad13 ip 00007f1b64787c5e sp 00007f1b53d19d30
     error 4 in libpthread-2.24.so[7f1b64780000+18000]

php here is just HHVM invoked via a php symbolic link. The message hints at libpthread, which is where the fault occurred. But we need a stacktrace to better determine the problem, and ideally a reproduction case.

Thus, what I am really looking for is the core dump file I alluded to earlier. The file is generated by the kernel and contains an image of the process memory at the time of the failure. Given the full copy of the program instructions, the instructions it was running at that time, and all the memory segments, a debugger can reconstruct a human readable state of the failure. That is a backtrace, and is what we rely on to find faulty code and fix bugs.

The core file was not generated, or the error message would have stated it had core dumped, i.e. that the kernel generated the core dump file. Our default configuration is to not generate any core file, but usually one can adjust that from the shell with ulimit -c XXX, where XXX is the maximum size a core file may occupy (in kilobytes, in order to prevent filling the disk). Docker being just a fancy way to start a process, it has a setting to adjust the limit. The docker run inline help states:

--ulimit ulimit Ulimit options (default [])

The help is about as unhelpful as it gets; eventually the option to set turns out to be --ulimit core=2147483648, i.e. up to 2 gigabytes. I updated the CI jobs and instructed them to capture a file named core, the default file name. After a few runs, although I could confirm failures, no files got captured. Why not?

Our machines do not use core as the default filename. It can be found in the kernel configuration:

/proc/sys/kernel/core_pattern:
/var/tmp/core/core.%h.%e.%p.%t

I thus went on the hosts looking for such files. There were none.

Or maybe I mean None or NaN.

Nada, rien.

The void.

The next step was obvious: try to reproduce it! I ran a Docker container doing a basic while loop, and from the host I sent the SIGSEGV signal to the process. The host still had no core file. But surprise: it was in the container. Although the kernel handles the fault from the host, it is not namespace-aware when it comes time to resolve the core_pattern path. My quest would soon end: I simply mounted a host directory into the container at the expected place:

mkdir /tmp/coredumps
docker run --volume /tmp/coredumps:/var/tmp/core ....
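Putting the ulimit and the volume mount together, the invocation ends up looking roughly like this; the image is one of the CI containers mentioned earlier, and the trailing arguments are elided just as above:

docker run \
  --ulimit core=2147483648 \
  --volume /tmp/coredumps:/var/tmp/core \
  docker-registry.wikimedia.org/releng/quibble-stretch-hhvm ....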

After a few builds, I had harvested enough core files. The investigation is then very straightforward:

$ gdb /usr/bin/hhvm /coredump/core.606eb29eab46.php.2353.1552570410
Core was generated by `php tests/phpunit/phpunit.php --debug-tests --testsuite extensions --exclude-gr'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f557214ac5e in __pthread_create_2_1 (newthread=newthread@entry=0x7f55614b9e18, attr=attr@entry=0x7f5552aa62f8, 
    start_routine=start_routine@entry=0x7f556f461c20 <timer_sigev_thread>, arg=<optimized out>) at pthread_create.c:813
813	pthread_create.c: No such file or directory.
[Current thread is 1 (Thread 0x7f55614be3c0 (LWP 2354))]

(gdb) bt
#0  0x00007f557214ac5e in __pthread_create_2_1 (newthread=newthread@entry=0x7f55614b9e18, attr=attr@entry=0x7f5552aa62f8, 
    start_routine=start_routine@entry=0x7f556f461c20 <timer_sigev_thread>, arg=<optimized out>) at pthread_create.c:813
#1  0x00007f556f461bb2 in timer_helper_thread (arg=<optimized out>) at ../sysdeps/unix/sysv/linux/timer_routines.c:120
#2  0x00007f557214a494 in start_thread (arg=0x7f55614be3c0) at pthread_create.c:456
#3  0x00007f556aeebacf in __libc_ifunc_impl_list (name=<optimized out>, array=0x7f55614be3c0, max=<optimized out>)
    at ../sysdeps/x86_64/multiarch/ifunc-impl-list.c:387
#4  0x0000000000000000 in ?? ()

As @Anomie kindly pointed out, this is an issue solved in libc6. Once the container was rebuilt to apply the package update, the fault disappeared.

One can now expect new changes to land in ContentTranslation again.


[ 1 ] ''connoisseur'', from obsolete French, means "to know" (https://en.wiktionary.org/wiki/connoisseur). I guess the English language forgot to apply the update in due time and cannot make such a change now for fear of breaking backward compatibility or locution habits.

The task has all the technical details and log leading to solving the issue: T216689: Merge blocker: quibble-vendor-mysql-hhvm-docker in gate fails for most merges (exit status -11)

(Some light copyedits to above -- Brennen Bearnes)

Production Excellence #10: April 2019https://phabricator.wikimedia.org/phame/post/view/151/Krinkle (Timo Tijhof)2019-05-31T19:21:08+00:002020-04-03T16:27:51+00:00

How’d we do in our drive for operational excellence last month? Read on to find out!

  • Month in numbers.
  • Highlighted stories.
  • Current problems.
📊 Month in numbers
  • 8 documented incidents. [1]
  • 30 new Wikimedia-prod-error tasks created. [2]
  • 31 Wikimedia-prod-error tasks closed. [3]

The number of incidents in April was relatively high at 8, both compared to earlier this year (4 in January, 7 in February, 8 in March) and compared to last year (4 in April 2018).

To read more about these incidents, their investigations, and conclusions; check wikitech.wikimedia.org/wiki/Incident_documentation#2019.

As of writing, there are 186 open Wikimedia-prod-error issues (up from 177 last month). [4]

📖 Rehabilitation of MediaWiki-DateFormatter

Following the report of a PHP error that happened when saving edits to certain pages, Tim Starling investigated. The investigation motivated a big commit that brings this class into the modern era. I think this change serves as a good overview of what’s changed in MediaWiki over the last 10 years, and demonstrates our current best practices.

Take a look at Gerrit change 502678 / T220563.

📉 Current problems

Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the week in which they were first observed.

https://phabricator.wikimedia.org/tag/wikimedia-production-error

Or help someone that’s already started with their patch:
Open prod-error tasks with a Patch-For-Review

Breakdown of recent months (past two weeks not included):

  • November: 2 issues left (unchanged).
  • December: 4 issues left (unchanged).
  • January: 1 issue got fixed. One last issue remaining (down from 2).
  • February: 2 issues were fixed. Another 3 issues remaining (down from 5).
  • March: 5 issues were fixed. Another 5 issues remaining (down from 10).
  • April: 14 new issues were found last month that remain unresolved.

By steward and software component, issues left from March and April:

  • Anti-Harassment / User blocking: T222170
  • CPT / Revision-backend (Save redirect pages): T220353
  • CPT / Revision-backend (Import a page): T219702
  • CPT / Revision-backend (Export pages for dumps): T220160
  • Growth / Watchlist: T220245
  • Growth / Page deletion (Restore an archived page): T219816
  • Growth / Page deletion (File pages): T222691
  • Growth / Echo (Job execution): T217079
  • Multimedia / File management (Upload mime error): T223728
  • Performance / Deferred-Updates: T221577
  • Search Platform / CirrusSearch (Job execution): T222921
  • (Unstewarded) / Page renaming: T223175, T221763, T221595

🎉 Thanks!

Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production. Including: @aaron, @ArielGlenn, @Daimona, @dcausse, @EBernhardson, @Jdforrester-WMF, @Joe, @KartikMistry, @Ladsgroup, @Lucas_Werkmeister_WMDE, @MaxSem, @MusikAnimal, @Mvolz, @Niharika, @Nikerabbit, @Pchelolo, @pmiazga, @Reedy, @SBisson, @tstarling, and @Umherirrender.

Thanks!

Until next time,

– Timo Tijhof

🏴‍☠️ “One good deed is not enough to save a man.” “Though it seems enough to condemn him?” “Indeed…”

Footnotes:

[1] Incidents reports by month and year. –
codepen.io/Krinkle/…

[2] Tasks created. –
phabricator.wikimedia.org/maniphest/query…

[3] Tasks closed. –
phabricator.wikimedia.org/maniphest/query…

[4] Open tasks. –
phabricator.wikimedia.org/maniphest/query…

Production Excellence #9: March 2019https://phabricator.wikimedia.org/phame/post/view/150/Krinkle (Timo Tijhof)2019-04-21T18:51:31+00:002020-04-03T16:26:51+00:00

How’d we do in our drive for operational excellence last month? Read on to find out!

📊 Month in numbers
  • 8 documented incidents. [1]
  • 31 new Wikimedia-prod-error issues reported. [2]
  • 28 Wikimedia-prod-error issues closed. [3]

The number of incidents this month was slightly above average compared to earlier this year (7 in February, 4 in January), and this time last year (4 in March 2018, 7 in February 2018).

To read more about these incidents, their investigations, and conclusions, check wikitech.wikimedia.org/wiki/Incident_documentation#2019-03.

There are currently 177 open Wikimedia-prod-error issues, similar to last month. [4]

💡 Ideas: To suggest an investigation to highlight in a future edition, feel free to contact me by e-mail, or private message on IRC.

📉 Current problems

Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the week in which they were first observed.

https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Or help someone that’s already started with their patch:
Open prod-error tasks with a Patch-For-Review

Breakdown of recent months (past two weeks not included):

  • September: Done! The last two issues were resolved.
  • October: Done! The last issue was resolved.
  • November: 2 issues left (from 1.33-wmf.2). 1 issue was fixed.
  • December: 4 issues left (from 1.33-wmf.9). 1 issue was fixed.
  • January: 2 issues left (1.33-wmf.13 – 14). 1 issue was fixed.
  • February: 5 issues (1.33-wmf.16 – 19).
  • March: 10 new issues (1.33-wmf.20 – 23).

By steward and software component, for issues remaining from February and March:


🎉 Thanks!

Thanks to @aaron, @Anomie, @Arlolra, @Daimona, @hashar, @Jdforrester-WMF, @kostajh, @matmarex, @MaxSem, @Niedzielski, @Nikerabbit, @Petar.petkovic, @santhosh, @ssastry, @Umherirrender, @WMDE-leszek, @zeljkofilipin, and everyone else who helped last month by reporting, investigating, or patching errors found in production!

Until next time,

– Timo Tijhof

🦅 “This isn’t flying. This is falling… with style!”

Footnotes:

[1] Incidents. – wikitech.wikimedia.org/wiki/Special:PrefixIndex/Incident_documentation/201903 …

[2] Tasks created. – phabricator.wikimedia.org/maniphest/query …

[3] Tasks closed. – phabricator.wikimedia.org/maniphest/query …

[4] Open tasks. – phabricator.wikimedia.org/maniphest/query …

Work progresses on CI tool evaluationhttps://phabricator.wikimedia.org/phame/post/view/149/LarsWirzenius (Lars Wirzenius)2019-03-08T16:59:04+00:002019-03-14T15:13:28+00:00

The working group to consider future tooling for continuous integration is making progress (see previous blog post J148 for more information). We're looking at and evaluating alternatives and learning of new needs within WMF.

If you have CI needs that are not covered by building from git in a Linux container, we would like to hear from you. For example, building iOS applications is difficult without a Mac/OS X build worker, so we're looking into what we can do to provide that. What else is needed?

We're currently aiming to make CI much more "self-serve" so that as much as possible can be done by developers themselves, without having to go via or through the Release Engineering team.

Our list of candidates includes systems that are not open source or are "open core" (open source, but with optional proprietary parts). We will be self-hosting, and open source is going to be a hard requirement. "Open core" may be an acceptable compromise for a system that is otherwise very good. We want to look at all alternatives, however, so that we know what's out there and what's possible.

We track our work in Phabricator, ticket T217325.

Choosing tools for continuous integrationhttps://phabricator.wikimedia.org/phame/post/view/148/LarsWirzenius (Lars Wirzenius)2019-02-28T18:27:09+00:002019-03-07T00:36:29+00:00

The Release Engineering team has started a working group to discuss and consider our future continuous integration tooling. Please help!

The RelEng team is working with SRE to build a continuous delivery and deployment pipeline, as well as changing production to run things in containers under Kubernetes. We aim to improve the process of making changes to the software behind our various sites by making it take less effort, happen faster, be less risky, and be as automated as possible. The developers will have a better development experience, be more empowered, and be more productive.

Wikimedia has had a CI system for many years now, but it is based on versions of tools that are reaching the end of their useful life. Those tools need to be upgraded, and this will probably require further changes due to how the new versions function. This is a good point to consider what tools and functionality we need and want.

The working group is tasked with considering the needs and wants, evaluating the available options, and making a recommendation of what to use in the future. The deadline is March 25. The work is being documented at https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/CI_Futures_WG and we're currently collecting requirements and candidates to evaluate.

We would welcome any feedback on those! Via IRC (#wikimedia-pipeline), on the talk page of the working group's wiki page above, or as a comment to this blog post.

Production Excellence #8: February 2019https://phabricator.wikimedia.org/phame/post/view/141/Krinkle (Timo Tijhof)2019-03-21T19:11:32+00:002020-04-03T16:24:44+00:00

How’d we do in our drive for operational excellence? Read on to find out!

📊 Month in numbers
  • 7 documented incidents. [1]
  • 30 new Wikimedia-prod-error tasks created. [2] (17 new in Jan, and 18 in Dec.)
  • 27 Wikimedia-prod-error tasks closed. [3] (16 closed in Jan, and 20 in Dec.)

There are in total 177 open Wikimedia-prod-error tasks today. (188 in Feb, 172 in Jan, and 165 in Dec.)

📉 Current problems

There’s been an increase in how many application errors are reported each week. And, we’ve also managed to mostly keep up with those each week, so that’s great!

But, it does appear that most weeks we accumulated one or two unresolved errors, which is starting to add up. I believe this is mainly because they were reported a day after the branch went out. That is, if the same issues had been reported 24 hours earlier in a given week, then they might’ve blocked the train as a regression.

https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Below is a breakdown of unresolved prod errors since last quarter. (I’ve omitted the last three weeks.)

By month:

  • February: 5 reports (1.33-wmf.16, 1.33-wmf.17, 1.33-wmf.18).
  • January: 3 reports (1.33-wmf.13, 1.33-wmf.14).
  • December 2018: 5 reports (1.33-wmf.9).
  • November 2018: 3 reports (1.33-wmf.2).
  • October 2018: 1 report (1.32-wmf.26).
  • September 2018: 2 reports (1.32-wmf.20).

By steward and software component:

📖 Fixed exposed fatal error on Special:Contributions

Previously, a link to Special:Contributions could pass invalid options to a part of MediaWiki that doesn’t allow invalid options. Why would anything allow invalid options? Let’s find out.

Think about software as an onion. Software tends to have an outer layer where everything is allowed. If this layer finds illegal user input, it has to respond somehow. For example, by informing the user. In this outer layer, illegal input is not a problem in the software. It is a normal thing to see as we interact with the user. This outer layer responds directly to a user, is translated, and can do things like “view recent changes”, “view user contributions” or “rename a page”.

Internally, such action is divided into many smaller tasks (or functions). For example, a function might be “get talk namespace for given subject namespace”. This would answer “Talk:” to “(Article)”, and “Wikipedia_talk:” to “Wikipedia:”. When searching for edits on My Contributions with “Associated namespaces” ticked, this function is used. It is also used by Move Page if renaming a page together with its talk page. And it’s used on Recent Changes and View History, for all those little “talk” links next to each page title and username.

If one of your edits is for a page that has no discussion namespace, what should MediaWiki do? Show no edits? Skip that edit and tell the user “1 edit was hidden”? Show normally, but without a talk link? That decision is made by the outer layer for a feature, when it catches the internal exception. Alternatively, it can sometimes avoid an exception by asking a different question first – a question that cannot fail. Such as “Does namespace X have a talk space?”, instead of “What is the talk space for X?”.

When a program doesn’t catch or avoid an exception, a fatal error occurs. Thanks to @D3r1ck01 for fixing this fatal error. – T150324

💡 ProTip: If your Jenkins build is failing and you suspect it’s unrelated to the project itself, be sure to report it to Phabricator under “Shared Build Failure”.
🎉 Thanks!

Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production. Including: @aaron, @Addshore, @alaa_wmde, @Amorymeltzer, @Anomie @D3r1ck01 @Daimona @daniel @hashar @hoo, @jcrespo, @KaMan, @Mainframe98, @Marostegui, @matej_suchanek, @Ottomata, @Pchelolo, @Reedy, @revi, @Smalyshev, @Tarrow, @Tgr, @thcipriani, @Umherirrender, and @Volker_E.

Thanks!

Until next time,

– Timo Tijhof


Footnotes:

[1] Incidents. — wikitech.wikimedia.org/wiki/Special:AllPages…

[2] Tasks created. — phabricator.wikimedia.org/maniphest/query…

[3] Tasks closed. — phabricator.wikimedia.org/maniphest/query…


🍏 He got me invested in some kind of.. fruit company.

Production Excellence #7: January 2019https://phabricator.wikimedia.org/phame/post/view/140/Krinkle (Timo Tijhof)2019-02-13T03:53:12+00:002020-04-03T16:23:33+00:00

How’d we do in our drive for operational excellence last month? Read on to find out!

📊 Month in numbers
  • 4 documented incidents in January 2019. [1]
  • 16 Wikimedia-prod-error tasks closed. [2]
  • 17 Wikimedia-prod-error tasks created. [3]

📖 Unable to move certain file pages

Xiplus reported that renaming a File page on zh.wikipedia.org led to a fatal database exception. Andre Klapper identified the stack trace from the logs, and Brad (@Anomie) investigated.

The File renaming failed because the File page did not have a media file associated with it (such a move is not currently allowed in MediaWiki). But while handling this error, the code caused a different error. The impact was that the user didn't get informed about why the move failed. Instead, they received a generic error page about a fatal database exception.

@Tgr fixed the code a few hours later, and it was deployed by Roan later that same day.
Thanks! — T213168

📖 DBPerformance regression detected and fixed

During a routine audit of Logstash dashboards, I found a DBPerformance warning. The warning indicated that the limit of 0 for “master connections” was violated. That's a cryptic way of saying it found code in MediaWiki that uses a database master connection on a regular page view.

MediaWiki can have many replica database servers, but there can be only one master database at any given moment. To reduce the chances of overload, delayed edits, or network congestion, we make sure to use replicas whenever possible. We usually involve the master only when source data is being changed, or is about to be changed. For example, when editing a page, or saving changes.

As the vast majority of traffic is page views, we apply stricter thresholds for latency and database dependencies on page views. In particular, page views may (in the future) be routed to secondary data centres that don’t even have a master DB.

@Tchanders from the Anti-Harassment team investigated the issue, found the culprit, and fixed it in time for the next MediaWiki train. Thanks! — T214735

📖 TemplateData missing in action

@Tacsipacsi and @Evad37 both independently reported the same TemplateData issue. TemplateData powers the template insertion dialog in VisualEditor. It wasn't working for some templates after we deployed the 1.33-wmf.13 branch.

The error was “Argument 1 passed to ApiResult::setIndexedTagName() must be an instance of array, null given”. This means there was code that calls a function with the wrong parameter. For example, the variable name may’ve been misspelled, or it may’ve been the wrong variable, or (in this case) the variable didn’t exist. In such a case, PHP implicitly assumes “null”.

Bartosz (@matmarex) found the culprit. The week before, I made a change to TemplateData that changed the “template parameter order” feature to be optional. This allows users to decide whether VisualEditor should force an order for the parameters in the wikitext. It turned out I forgot to update one of the references to this variable, which still assumed it was always present.

Brad (Anomie) fixed it later that week, and it was deployed the next day. Thanks! — T213953

📈 Current problems

Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the week in which they were first observed.

phabricator.wikimedia.org/tag/wikimedia-production-error

There are currently 188 open Wikimedia-prod-error tasks as of 12 February 2019. (We’ve had a slight increase since November; 165 in December, 172 in January.)

For this month’s edition, I’d like to draw attention to a few older issues that are still reproducible:

  • [2013; Collection extension] Special:Book fatal error for blocked users. T56179
  • [2013; CentralNotice] Fatal error when placeholder key contains a space. T58105
  • [2014; LQT] Fatal error when attempting to view certain threads. T61791
  • [2015; MassMessage] Warning about Invalid message parameters. T93110
  • [2015; Wikibase] Warning “UnresolvedRedirectException” for some pages on Wikidata (and Commons). T93273
💡 Terminology:

A “Fatal error” (or uncaught exception) prevents a user action. For example — a page might display “MWException: Unknown class NotificationCount.”, instead of the article content.
A “Warning” (or non-fatal, or PHP error) lets the program continue and display a mostly complete page regardless. This may cause corrupt, incorrect, or incomplete information to be shown. For example — a user may receive a notification that says “You have (null) new messages”.


🎉 Thanks!

Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production. Including: A2093064‚ @Anomie, @Daimona @Gilles, @He7d3r, @Jdforrester-WMF, @matmarex, @mmodell, @Nikerabbit, @Catrope, @Tchanders, @Tgr, and @thiemowmde.

Thanks!

Until next time,

— Timo Tijhof

👢There's a snake in my boot. Reach for the sky!


Footnotes:

[1] Incidents. — wikitech.wikimedia.org/wiki/Special:AllPages…

[2] Tasks closed. — phabricator.wikimedia.org/maniphest/query…

[3] Tasks created. — phabricator.wikimedia.org/maniphest/query…

Gerrit now automatically adds reviewershttps://phabricator.wikimedia.org/phame/post/view/139/hashar (Antoine Musso)2019-01-17T16:53:56+00:002021-03-05T10:19:18+00:00
WARNING: 2021-03-05: the reviewers-by-blame Gerrit plugin was disabled after it was announced by this blog post. It turns out the author of a change is not necessarily an adequate reviewer suggestion in our context, and some people were being added as reviewers to a whole lot more code than they would expect. The post still has some worthwhile information on how one can find reviewers.

Finding reviewers for a change is often a challenge, especially for a newcomer or folks proposing changes to projects they are not familiar with. Since January 16th, 2019, Gerrit automatically adds reviewers on your behalf based on who last changed the code you are affecting.

Antoine "@hashar" Musso exposes what lead us to enable that feature and how to configure it to fit your project. He will offers tip as to how to seek more reviewers based on years of experience.


When uploading a new patch, reviewers should be added automatically; that is the subject of task T91190, opened almost four years ago (March 2015). I had declined the task since we already have the Reviewer bot (see section below), but @Tgr found a plugin for Gerrit which analyzes the code history with git blame and uses that to determine potential reviewers for a change. It took us a while to add that particular Gerrit plugin, and the first version we installed was not compatible with our Gerrit version. The plugin was upgraded yesterday (January 16th) and is working fine (T101131).

Let's have a look at the functionality the plugin provides, and how it can be configured per repository. I will then offer a refresher of how one can search for reviewers based on git history.

Reviewers by blame plugin

NOTE: the reviewers by blame plugin was removed the day after this announcement blog post was published. This section thus does not apply to the Wikimedia Gerrit instance anymore. It is left here for historical reasons.

The Gerrit plugin looks at the affected code using git blame and extracts the top three past authors, who are then added as reviewers to the change on your behalf. Added reviewers will thus receive a notification showing you have asked them for code review.

The configuration is done on a per-project basis and inherits from the parent project. Without any tweaks, your project inherits the configuration from All-Projects. If you are a project owner, you can adjust the configuration. As an example, here is the configuration for operations/mediawiki-config, which shows inherited values and an exception to not process a file named InitialiseSettings.php:

mwconfig-reviewers-by-blame-config.png (136×542 px, 16 KB)

The three settings are described in the documentation for the plugin:

plugin.reviewers-by-blame.maxReviewers
The maximum number of reviewers that should be added to a change by this plugin.
By default 3.

plugin.reviewers-by-blame.ignoreFileRegEx
Ignore files where the filename matches the given regular expression when computing the reviewers. If empty or not set, no files are ignored.
By default not set.

plugin.reviewers-by-blame.ignoreSubjectRegEx
Ignore commits where the subject of the commit messages matches the given regular expression. If empty or not set, no commits are ignored.
By default not set.
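
If you prefer the command line to the project settings screen shown above, the same values can be edited in the project.config file on the project's refs/meta/config branch. A minimal sketch, assuming you are a project owner; the values below are examples, not the actual operations/mediawiki-config settings:

$ git fetch origin refs/meta/config:refs/remotes/origin/meta/config
$ git checkout meta/config
$ git config -f project.config plugin.reviewers-by-blame.maxReviewers 2
$ git config -f project.config plugin.reviewers-by-blame.ignoreFileRegEx 'InitialiseSettings\.php'
$ git commit -a -m "Tune reviewers-by-blame settings"
$ git push origin HEAD:refs/meta/config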

By making past authors aware of a change to code they previously altered, I believe you will get more reviews and hopefully get your changes approved faster.

Previously we had other methods to add reviewers: one opt-in based, the others cumbersome manual steps. They should be used to complement the Gerrit reviewers by blame plugin, and I am giving an overview of each of them in the following sections.

Gerrit watchlist

gerrit-watched-projects.png (493×1 px, 72 KB)

The original system from Gerrit lets you watch projects, similar to a user watch list on MediaWiki. In Gerrit preferences, one can get notified for new changes, patchsets, comments... Simply indicate a repository, optionally a search query and you will receive email notifications for matching events.

The attached image is my watched projects configuration: I thus receive notifications for any changes made to the integration/config repository, as well as for changes in mediawiki/core which affect either composer.json or one of the Wikimedia deployment branches for that repo.
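
For reference, the "Only If" field of a watched project takes a regular Gerrit change search query. A hypothetical filter along the lines described above (the branch name is only an example):

file:composer.json OR branch:wmf/1.33.0-wmf.8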

One drawback is that we cannot watch a whole hierarchy of projects, such as mediawiki and all its descendants, which would be helpful to watch our deployment branches. It is still useful when you are the primary maintainer of a repository, since you can keep track of all activity for the repository.

Reviewer bot

The reviewer bot was written by Merlijn van Deen (@valhallasw). It is similar to the Gerrit watched projects feature, with some major benefits:

  • the watcher is added as a reviewer, so the author knows you were notified
  • it supports watching a hierarchy of projects (eg: mediawiki/*)
  • the file/branch filtering might be easier to grasp compared to Gerrit search queries
  • the watchers are stored in a central place which is public to anyone, making it easy to add others as reviewers.

One registers reviewers on a single wiki page: https://www.mediawiki.org/wiki/Git/Reviewers.

Each repository filter is a wikitext section (eg: === mediawiki/core ===) followed by a wikitext template and a file filter using Python fnmatch. Some examples:

Listen to any changes that touch i18n:

== Listen to repository groups ==
=== * ===
* {{Gerrit-reviewer|JohnDoe|file_regexp=<nowiki>i18n</nowiki>}}

Listen to MediaWiki core search related code:

=== mediawiki/core ===
* {{Gerrit-reviewer|JaneDoe|file_regexp=<nowiki>^includes/search/</nowiki>}}

The system works great, given maintainers remember to register on the page and that the files are not moved around. The bot is not that well known though and most repositories do not have any reviewers listed.

Inspecting git history

A source of reviewers is the git history: one can easily retrieve a list of past authors, who should be good candidates to review the code. I typically use git shortlog --summary --no-merges for that (--no-merges filters out the merge commits crafted by Gerrit when a change is submitted). Example for the MediaWiki job queue system:

$ git shortlog --no-merges --summary --since "one year ago" includes/jobqueue/|sort -n|tail -n4
     3	Petr Pchelko
     4	Brad Jorsch
     4	Umherirrender
    16	Aaron Schulz

That gives me four candidates who acted on that directory over the past year.

Past reviewers from git notes

When a patch is merged, Gerrit records in git a trace of the votes and the canonical URL of the change. These are available as git notes under refs/notes/review. Once the notes are fetched, they can be shown by git show or git log by passing --show-notes=review: for each commit, after the commit message, the notes are displayed and show the votes among other metadata:

$ git fetch origin refs/notes/review:refs/notes/review
$ git log --no-merges --show-notes=review -n1
commit e1d2c92ac69b6537866c742d8e9006f98d0e82e8
Author: Gergő Tisza <tgr.huwiki@gmail.com>
Date:   Wed Jan 16 18:14:52 2019 -0800

    Fix error reporting in MovePage
    
    Bug: T210739
    Change-Id: I8f6c9647ee949b33fd4daeae6aed6b94bb1988aa

Notes (review):
    Code-Review+2: Jforrester <jforrester@wikimedia.org>
    Verified+2: jenkins-bot
    Submitted-by: jenkins-bot
    Submitted-at: Thu, 17 Jan 2019 05:02:23 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/484825
    Project: mediawiki/core
    Branch: refs/heads/master

And I can then get the list of authors that previously voted Code-Review +2 for a given path. Using the previous example of includes/jobqueue/ over a year, the list is slightly different:

$ git log --show-notes=review --since "1 year ago" includes/jobqueue/|grep 'Code-Review+2:'|sort|uniq -c|sort -n|tail -n5
      2     Code-Review+2: Umherirrender <umherirrender_de.wp@web.de>
      3     Code-Review+2: Jforrester <jforrester@wikimedia.org>
      3     Code-Review+2: Mobrovac <mobrovac@wikimedia.org>
      9     Code-Review+2: Aaron Schulz <aschulz@wikimedia.org>
     18     Code-Review+2: Krinkle <krinklemail@gmail.com>

User Krinkle has approved a lot of patches, even though he doesn't show up in the list of authors obtained by the previous means (inspecting the git history).

Conclusion

The Gerrit reviewers by blame plugin acts automatically, which offers a good chance your newly uploaded patch will get reviewers added out of the box. For finer tweaking, one should register as a reviewer on https://www.mediawiki.org/wiki/Git/Reviewers, which benefits everyone. The last course of action, inspecting the git history and review notes, is meant to complement the other two.

For any remarks, support, or concerns, reach out on the Freenode IRC channel #wikimedia-releng or file a task in Phabricator.

Thank you @thcipriani for the proofreading and English fixes.

Code Health Metrics and SonarQubehttps://phabricator.wikimedia.org/phame/post/view/133/zeljkofilipin (Željko Filipin)2019-01-10T14:54:38+00:002019-01-15T11:42:06+00:00

Code Health

Inside a broad Code Health project there is a small Code Health Metrics group. We meet weekly and discuss how code health could be improved by metrics. Each member has only a few hours each week to work on this, so our projects are small.

In our discussions, we have agreed on a few principles. Some of them are:

  • Metrics are about improving the process as much as improving the code.
  • Focus on new code, not existing code.
  • Humans are smarter than tools.

The goal of the project is to provide fast and actionable feedback on code health metrics. Since our time for this project is limited, we've decided to make a spike (T207046). The spike focuses on:

  • one repository,
  • one language,
  • one metric,
  • one tool,
  • one feedback mechanism.

All of the above tasks are already completed, except for the last one. In parallel to finishing the spike, we are also working on expanding the scope to more repositories, languages and metrics. At the moment, the spike works for several Java repositories.

SonarQube

After some investigation, the tool we have selected is SonarQube. The tool does everything we need, and more. In this post I'll only mention one feature. We have decided not to host SonarQube ourselves at the moment. We are using a hosted solution, SonarCloud. You can see our current dashboard in the wmftest organization at SonarCloud.
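
For the curious, an analysis is typically triggered from CI by pointing the scanner at SonarCloud with a handful of properties. A minimal sketch; the project key, source paths and token below are placeholders, not our actual setup:

$ sonar-scanner \
    -Dsonar.host.url=https://sonarcloud.io \
    -Dsonar.organization=wmftest \
    -Dsonar.projectKey=example:elasticsearch-extra-plugins \
    -Dsonar.login="$SONAR_TOKEN" \
    -Dsonar.sources=src/main/java \
    -Dsonar.java.binaries=target/classes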

As mentioned in the principles, in order to make the metrics actionable, we've decided to focus only on new code, ignoring existing code for now. That means that when you make a change to a repository with a lot of code, you are not overwhelmed with all the metrics (and problems) the tool has found. Instead, the tool focuses just on the code you have written. So, for example, if a small patch you have submitted to a big repository does not introduce new problems, the tool says so. If the patch introduces new problems (like decreased branch coverage) the tool lets you know.

Members of the Code Health Metrics group have reminded me multiple times that I have to mention SonarLint, an IDE extension. I don't use it myself, since it doesn't support my favorite editor.

Example

A good example is in the wmftest organization at SonarCloud. Elasticsearch extra plugins has a failed quality gate.

wmftest.png (821×1 px, 173 KB)

Opening the Elasticsearch extra plugins project, you see that the failure is related to test coverage (less than 80%).

extra-parent.png (821×1 px, 181 KB)

Click the warning and you get more details: Coverage on New Code 0.0%.

new-coverage.png (821×1 px, 175 KB)

Click the ExtraCorePlugin.java file. New lines have a yellow background. It's easy to see which lines are marked red (meaning no coverage), and in particular which new lines (yellow background) have no coverage (red sidebar).

ExtraCorePlugin.png (793×1 px, 259 KB)

Talks

We have planned to present what we have so far during the Wikimedia Foundation All Hands. To prepare for that, we've created this blog post and presented at a 5 Minute Demo and at the Testival Meetup.

I would like to thank all members of the Code Health Metrics Working group for their help writing this post, and especially Guillaume Lederrey and Kosta Harlan.

FAQ

Q: Sonar-what?!
A: SonarQube is the tool. SonarCloud is the hosted version of the tool. SonarLint is an IDE extension.

Q: When can I use this on my project?
A: Soon. Probably when T207046 is resolved. If there are no blockers, in a few weeks.

Q: Why are we using SonarCloud instead of hosting SonarQube ourselves?
A: We did not want to invest time in hosting it ourselves until we're sure the tool is the right choice for us.

Production Excellence #6: December 2018https://phabricator.wikimedia.org/phame/post/view/130/Krinkle (Timo Tijhof)2019-01-22T02:54:23+00:002020-04-03T16:18:09+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

  • Month in numbers.
  • Lightning round.
  • Current problems.

📊 Month in numbers

  • 4 documented incidents. [1]
  • 20 Wikimedia-prod-error tasks closed. [2]
  • 18 Wikimedia-prod-error tasks created. [3]
  • 172 currently open Wikimedia-prod-error tasks (as of 16 January 2019).

Terminology:

  • An Exception (or fatal) prevents a user action. For example, a page would display “Exception: Unable to render page”, instead of the article content.
  • An Error (or non-fatal, warning) can produce pages that are technically unaware of a problem, but may show corrupt, incorrect, or incomplete information. For example — a user may receive a notification that says “You have (null) new messages”.

For December, I haven’t prepared any stories or taken interviews. Instead, I’ve got a lightning round of errors in various areas that were found and fixed this past month.

⚡️ Contributions view fixed

MarcoAurelio reported that Special:Contributions failed to load for certain user names on meta.wikimedia.org (PHP Fatal error, due to a faulty database record). Brad Jorsch investigated and found a relation to database maintenance from March 2018. He corrected the faulty records, which resolved the problem. Thanks! — T210985

⚡️ Undefined talk space now defined

The newly created Cantonese Wiktionary (yue.wiktionary.org) was encountering errors from the Siteinfo API. We found this was due to invalid site configuration. Urbanecm patched the issue, and also created a new unit test for wmf-config that will prevent this issue from happening on other wikis in the future. Thanks! — T211529

⚡️ The undefined error status... error

After deploying the 1.33.0-wmf.8 train to all wikis, we found a regression in the HTTP library for MediaWiki. When MediaWiki requested an HTTP resource from another service, and this resource was unavailable, then MediaWiki failed to correctly determine the HTTP status code of that error. Which then caused another error! This happened, for example, when Special:Collection was unable to reach the PediaPress.com backend in some cases. Patched by Bill Pirkle. Thanks! — T212005

⚡️ Fatal error: Call to undefined function in Kartographer API

When the 1.33.0-wmf.9 train reached the canary phase on Tue 18 December (aka group0 [4]), Željko spotted a new fatal error in the logs. The fatal originated in the Kartographer extension and would have affected various users of the MediaWiki API. Patched the same day by Michael Holloway, reviewed by James Forrester, and deployed by Željko. Thanks! — T212218

📉 Current problems

Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the week in which they were first observed.

→ https://phabricator.wikimedia.org/tag/wikimedia-production-error

November's theme will continue for now, as I imagine lots of you were on vacation during that time! I’d like to draw attention to a subset of PHP fatal errors. Specifically, those that are publicly exposed (e.g. don’t need elevated user rights) and emit an HTTP 500 error code.

  1. Wikibase: Clicking “undo” for certain revisions fatals with a PatcherException. — T97146
  2. Flow: Unable to view certain talk pages due to workflow InvalidDataException. — T70526
  3. Translate: Certain Special:Translate urls fatal. — T204833
  4. MediaWiki (Special-pages): Special:DoubleRedirects unavailable on tt.wikipedia.org. — T204800
  5. MediaWiki (Parser): Parse API exposes fatal content model error. — T206253
  6. CentralNotice: Certain Special:CentralNoticeBanners urls fatal. — T149240
  7. PageViewInfo: Certain “mostviewed” API queries fail. — T208691

Public user requests resulting in fatals can (and have) caused alerts to fire that notify SRE of wikis potentially being less available or down.

💡 ProTip:

Use “Report Error” on https://phabricator.wikimedia.org/tag/wikimedia-production-error/ to create a task with a helpful template. This template is also available as “Report Application Error”, from the “Create Task” dropdown menu, on any task creation form.

🎉 Thanks!

Thank you to everyone who has helped by reporting, investigating, or resolving problems in Wikimedia production. Including @MarcoAurelio, @Anomie, @Urbanecm, @BPirkle, @zeljkofilipin, @Mholloway, @Esanders, @Jdforrester-WMF, and @hashar.

Until next time,

— Timo Tijhof


Footnotes:

[1] Incidents. — wikitech.wikimedia.org/wiki/Special:AllPages...

[2] Tasks closed. — phabricator.wikimedia.org/maniphest/query...

[3] Tasks opened. — phabricator.wikimedia.org/maniphest/query...

[4] What is group0? — wikitech.wikimedia.org/wiki/Deployments/One_week#Three_groups

Production Excellence #5: November 2018https://phabricator.wikimedia.org/phame/post/view/129/Krinkle (Timo Tijhof)2018-12-12T04:40:26+00:002020-04-03T16:17:55+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

  • Month in numbers.
  • Highlighted stories.
  • Current problems.

📊 Month in numbers

  • 4 documented incidents in November 2018. [1]
  • 42 Wikimedia-prod-error tasks closed in November 2018. [2]
  • 36 Wikimedia-prod-error tasks created in November 2018. [3]
  • 165 currently open Wikimedia-prod-error tasks (as of 12 December 2018).

Terminology:

  • An Exception (or fatal) causes user actions to be prevented. For example, a page would display "Exception: Unable to render page", instead of the article content.
  • An Error (or non-fatal, or warning) can produce page views that are technically unaware of a problem, but may show corrupt, incorrect, or incomplete information. Examples – an article would display the code word “null” instead of the actual content, a user looking for Vegetables may be taken to an article about Vegetarians, or a user may receive a notification that says “You have (null) new messages.”

With that behind us... Let’s celebrate this month’s highlights!

*️⃣ Fatal DB exception at wikitech.wikimedia.org

Quiddity reported that he was unable to disable a spam account, due to a fatal exception. Andre Klapper used the Exception ID to find the stack trace in the logs. The trace revealed that a table was missing in Wikitech’s database.

The MediaWiki software was recently expanded with a “Partial blocking” ability. [4] This involved introducing a new database table that stores block metadata differently. This software update was deployed to Wikitech, but this new table was not created.

@Marostegui (Database administrator) quickly applied the schema patches that create the missing table. Thanks Manuel, Andre, and Quiddity; Teamwork!

T209674

*️⃣ Big-page Deletion Unleashed!

It had been known for years, [5] that users are unable to delete or restore pages with more than a few hundred revisions. Attempts to do so could fail, with a fatal “DBTransactionSizeError” exception. This error indicates that the change is too big or too slow. Such changes risk replication lag, and may impact the stability of the infrastructure.

The database structure used by MediaWiki for page archives dates back to 2003 (over 15 years ago). I'll spare you the details, but it depends on database interactions that are inherently slow when applied to systems as big as Wikipedia! RFC T20493 intends to modernise this structure for the long-term.

Then along came @BPirkle. Bill joined the WMF Core platform team earlier this year. He took on the challenge of making page deletion work for any size page, today.

Previously, page deletion happened in a single step. This simple approach had the benefit of either succeeding in its entirety, or safely rolling back like nothing happened. It also meant that the database protected us against conflicting changes. In August, Bill started a two-month effort that carefully split the logic for “delete a page” into smaller steps that each are safe and quick. It now uses our JobQueue to schedule and run these steps, without the user waiting for it.

T198176

📉 Current problems

Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the week in which they were first observed.

→ https://phabricator.wikimedia.org/tag/wikimedia-production-error

I’d like to draw attention to a subset of PHP fatal errors. Specifically, those that are publicly exposed (e.g. don’t require elevated user rights) and use an HTTP 500 status code.

  1. CentralNotice: Some Special:CentralNoticeBanners urls fatal. – T149240
  2. Flow: Unable to view certain talk pages due to workflow InvalidDataException. – T70526
  3. JsonConfig: Unable to diff certain “.map” pages on Commons. – T203063
  4. MediaWiki (Parser): Parse API exposes fatal content model error. – T206253
  5. MediaWiki (Special-pages): Special:DoubleRedirects unavailable on ttwiki. – T204800
  6. MobileFrontend: Some Special:MobileDiff urls fatal. – T156293
  7. ProofreadPage: Unable to edit certain pages on Wikisource. – T176196
  8. Translate: Some Special:Translate urls fatal. – T204833
  9. Wikibase: Clicking “undo” for some revisions fatals with a PatcherException. – T97146

Public user requests resulting in fatals can (and have) caused alerts to fire that notify SRE of wikis potentially being less available or down.

💡 ProTip:

Cross-reference one workboard with another via Open Tasks → Advanced Filter, and enter Tag(s) to apply as a filter.

🎉 Thank you

Thank you to everyone who helped by reporting or investigating problems in Wikimedia production; and for implementing or reviewing their solutions. Including: @tstarling, @thiemowmde, @thcipriani, @Tgr, @Steinsplitter, @Quiddity, @pmiazga, @Nikerabbit, @Mvolz, @Lucas_Werkmeister_WMDE, @kostajh, @jrbs, @JJMC89, @Jdforrester-WMF, @hashar, @Gilles, @Daimona, @Ciencia_Al_Poder, @Catrope, @BPirkle, @Barkeep49, @Anomie, and @Aklapper.

Thanks!

Until next time,

– Timo Tijhof


Footnotes:

[1] Incidents. – wikitech.wikimedia.org/wiki/Special:AllPages...
[2] Tasks closed. – phabricator.wikimedia.org/maniphest/query...
[3] Tasks opened. – phabricator.wikimedia.org/maniphest/query...
[4] Partial blocks. – meta.wikimedia.org/wiki/Community_health_initiative
[5] Bug report about page deletion, 2007. – T13402

Incident Documentation: An Unexpected Journeyhttps://phabricator.wikimedia.org/phame/post/view/128/zeljkofilipin (Željko Filipin)2018-11-22T18:06:07+00:002019-01-25T11:41:28+00:00

Introduction

The Release Engineering team wants to continually improve the quality of our software over time. One of the ways in which we hoped to do that this year is by creating more useful Selenium smoke tests. (From now on, test will be used instead of Selenium test.) This blog post is about how we determined where the tests should focus and the relative priority.

At first, I thought this would be a trivial task. A few hours of work. A few days at most. A week or two if I've completely underestimated it. A couple of months later, I know I have completely underestimated it.

Things I needed to do:

  • Define prioritization scheme.
  • Prioritize target repositories.

Define Prioritization Scheme

In general:

  • Does a repository have stewards? (Do the stewards want tests?)
  • Does a repository have existing tests?

For the last year:

  • How much change happened in a repository? Simply put: more change can lead to more risk.
  • How many incidents is a repository connected to? We wanted to make sure we didn't miss any obvious problematic areas.

Coverage Change Incidents Stewards.png (559×945 px, 25 KB)

Does a Repository Have Stewards?

This was a relatively simple task. The best source of information is the Developers/Maintainers page.

Does a Repository Have Existing Tests?

This was also easy. The Selenium/Node.js page has a list of repositories that have tests in Node.js. I already had all repositories with Node.js and Ruby tests on my machine, so a quick search for webdriverio (Node.js) and mediawiki_selenium (Ruby) found all the tests. In order to be really sure I've found all repositories with tests, I've cloned all repositories from Gerrit.

$ ack --json webdriverio
extensions/Echo/package.json
27:        "webdriverio": "4.12.0"
...
$ ack --type-add=lock:ext:lock --lock mediawiki_selenium
skins/MinervaNeue/Gemfile.lock
42:    mediawiki_selenium (1.7.3)
...

To make extra sure I have not missed any repositories, I've used MediaWiki code search (mediawiki_selenium, webdriverio) and GitHub search (org:wikimedia extension:lock mediawiki_selenium, org:wikimedia extension:json webdriverio).

This is the list.

Repository | Language
mediawiki/core | JavaScript
mediawiki/extensions/AdvancedSearch | JavaScript
mediawiki/extensions/CentralAuth | Ruby
mediawiki/extensions/CentralNotice | Ruby
mediawiki/extensions/CirrusSearch | JavaScript
mediawiki/extensions/Cite | JavaScript
mediawiki/extensions/Echo | JavaScript
mediawiki/extensions/ElectronPdfService | JavaScript
mediawiki/extensions/GettingStarted | Ruby
mediawiki/extensions/Math | JavaScript
mediawiki/extensions/MobileFrontend | Ruby
mediawiki/extensions/MultimediaViewer | Ruby
mediawiki/extensions/Newsletter | JavaScript
mediawiki/extensions/ORES | JavaScript
mediawiki/extensions/Popups | JavaScript
mediawiki/extensions/QuickSurveys | Ruby
mediawiki/extensions/RelatedArticles | JavaScript
mediawiki/extensions/RevisionSlider | Ruby
mediawiki/extensions/TwoColConflict | JavaScript, Ruby
mediawiki/extensions/Wikibase | JavaScript, Ruby
mediawiki/extensions/WikibaseLexeme | JavaScript, Ruby
mediawiki/extensions/WikimediaEvents | PHP
mediawiki/skins/MinervaNeue | Ruby
phab-deployment | JavaScript
wikimedia/community-tech-tools | Ruby
wikimedia/portals/deploy | JavaScript

How Much Change Did Happen for a Repository?

After reviewing several tools, I've found that we already use Bitergia for various metrics. There is even a nice list of top 50 repositories by the number of commits. The tool even supports limiting the report from a date to a date. Exactly what I needed.

Bitergia > Last 90 days > Absolute > From 2017-11-01 00:00:00.000 > To 2018-10-31 23:59:59.999 > Go > Git > Overview > Repositories (raw data: P7776, direct link).

This is the top 50 list (excludes empty commits and bots).

Repository | Commits
mediawiki/extensions | 11300
operations/puppet | 7988
mediawiki/core | 4590
operations/mediawiki-config | 4005
integration/config | 1652
operations/software/librenms | 1169
pywikibot/core | 927
mediawiki/extensions/Wikibase | 806
apps/android/wikipedia | 789
mediawiki/services/parsoid | 700
mediawiki/extensions/VisualEditor | 692
operations/dns | 653
VisualEditor/VisualEditor | 599
mediawiki/skins | 570
mediawiki/extensions/MobileFrontend | 504
mediawiki/extensions/ContentTranslation | 491
translatewiki | 486
oojs/ui | 469
wikimedia/fundraising/crm | 457
mediawiki/extensions/BlueSpiceFoundation | 414
mediawiki/extensions/CirrusSearch | 357
mediawiki/extensions/AbuseFilter | 306
phabricator/phabricator | 302
mediawiki/services/restbase | 290
mediawiki/extensions/Flow | 232
mediawiki/extensions/Echo | 223
mediawiki/vagrant | 221
mediawiki/extensions/Popups | 184
mediawiki/extensions/Translate | 182
mediawiki/extensions/DonationInterface | 180
analytics/refinery | 178
mediawiki/extensions/PageTriage | 177
mediawiki/extensions/Cargo | 176
mediawiki/tools/codesniffer | 156
mediawiki/extensions/TimedMediaHandler | 152
mediawiki/extensions/UniversalLanguageSelector | 142
mediawiki/vendor | 140
mediawiki/extensions/SocialProfile | 139
analytics/refinery/source | 138
operations/software | 137
mediawiki/services/restbase/deploy | 136
operations/debs/pybal | 123
mediawiki/extensions/CentralAuth | 116
mediawiki/tools/release | 116
mediawiki/services/cxserver | 112
mediawiki/extensions/BlueSpiceExtensions | 110
mediawiki/extensions/WikimediaEvents | 110
labs/private | 108
operations/debs/python-kafka | 104
labs/tools/heritage | 96

I've got similar results with running git rev-list for all repositories (script, results: P7834).
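
Counting commits in a time window boils down to a single git command per repository. A rough sketch of the idea (the actual script is linked above as P7834 and may count slightly differently):

$ git rev-list --count --since=2017-11-01 --until=2018-10-31 HEAD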

How Many Incidents Is a Repository Connected To?

This proved to be the most time consuming task.

I have started by reviewing existing incident documentation. Take a look at a few incidents. Can you tell which incident report is connected to which repository? I couldn't. (If you can, please let me know. I need your help.)

Incident reports are a wall of text. It was really hard for me to connect an incident report to a repository. An incident report has a title and text, example: 20180724-Train. Text has several sections, including Actionables. Text contains links to Gerrit patches and Phabricator tasks. (From now on, I'll use patches instead of Gerrit patches and tasks instead of Phabricator tasks.)

A patch belongs to a repository. The wikitext [[gerrit:448103]] is patch mediawiki/extensions/Wikibase/+/448103, so the repository is mediawiki/extensions/Wikibase. That is the strongest link between an incident and a repository.

A task usually has patches associated with it. The wikitext [[phab:T181315]] is task T181315. The Gerrit search bug:T181315 finds many connected patches, many of them in operations/puppet and one in mediawiki/vagrant. That is a useful, but not a strong, link between an incident and a repository. Some tasks have several related patches, so it provides a lot of data.
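
That task-to-patches hop can also be automated. A rough sketch using the Gerrit SSH query interface, assuming you have an SSH account on the Wikimedia Gerrit and jq installed (the bug: operator is the same one used in the web search above):

$ ssh -p 29418 USERNAME@gerrit.wikimedia.org gerrit query --format=JSON bug:T181315 \
    | jq -r '.project // empty' | sort | uniq -c | sort -n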

A task also usually has several tags. Most of them are not useful in this context, but tags that are components (and not, for example, milestones or tags) could be useful, if the component can be linked to a repository. It is also not a strong link between an incident and a repository, and it usually does not provide a lot of data.

In the end, I wrote a tool with an imaginative name: Incident Documentation. The tool currently collects data from the patches and tasks in the Actionables section of the incident report. It does not collect data from task components. That is tracked as issue #5.
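
The core idea is simple enough to sketch in a couple of lines of shell: fetch the incident page's wikitext and list the gerrit: and phab: links it contains. The page title below is only an example, and the real tool does quite a bit more:

$ curl -s 'https://wikitech.wikimedia.org/w/index.php?title=Incident_documentation/20180724-Train&action=raw' \
    | grep -oE '\[\[(gerrit|phab):[^]|]+' | sort | uniq -c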

Incident Review 2017-11-01 to 2018-10-31

After reviewing the Actionables section for each incident report, and the related patches and tasks, here are the results. Please note this table only connects incident reports and repositories. It does not show how many patches from a repository are connected to an incident report. That is tracked as issue #11.

Repository | Incidents
operations/puppet | 22
mediawiki/core | 6
operations/mediawiki-config | 4
mediawiki/extensions/Wikibase | 4
wikidata/query/rdf | 2
operations/debs/pybal | 2
mediawiki/extensions/ORES | 2
integration/config | 2
wikidata/query/blazegraph | 1
operations/software | 1
operations/dns | 1
mediawiki/vagrant | 1
mediawiki/tools/release | 1
mediawiki/services/ores/deploy | 1
mediawiki/services/eventstreams | 1
mediawiki/extensions/WikibaseQualityConstraints | 1
mediawiki/extensions/PropertySuggester | 1
mediawiki/extensions/PageTriage | 1
mediawiki/extensions/Cognate | 1
mediawiki/extensions/Babel | 1
maps/tilerator/deploy | 1
maps/kartotherian/deploy | 1
integration/jenkins | 1
eventlogging | 1
analytics/refinery/source | 1
analytics/refinery | 1
All-Projects | 1

Selecting Repositories

This table is sorted by the amount of change. The only column that needs explanation is Selected. It shows if a test makes sense for the repository, taking into account all available data. Repositories without maintainers and with existing tests are excluded.

Repository | Change | Stewards | Coverage | Incidents | Selected
mediawiki/extensions | 11300 | | | |
operations/puppet | 7988 | SRE | | 22 |
mediawiki/core | 4590 | Core Platform | JavaScript | 6 |
operations/mediawiki-config | 4005 | Release Engineering | | 4 |
integration/config | 1652 | Release Engineering | | 2 |
operations/software/librenms | 1169 | SRE | | |
pywikibot/core | 927 | | | |
mediawiki/extensions/Wikibase | 806 | WMDE | JavaScript, Ruby | 4 |
apps/android/wikipedia | 789 | | | |
mediawiki/services/parsoid | 700 | Parsing | | |
mediawiki/extensions/VisualEditor | 692 | Editing | | |
operations/dns | 653 | SRE | | 1 |
VisualEditor/VisualEditor | 599 | Editing | | |
mediawiki/skins | 570 | Reading | | |
mediawiki/extensions/MobileFrontend | 504 | Reading | Ruby | |
mediawiki/extensions/ContentTranslation | 491 | Language engineering | | |
translatewiki | 486 | | | |
oojs/ui | 469 | | | |
wikimedia/fundraising/crm | 457 | Fundraising tech | | |
mediawiki/extensions/BlueSpiceFoundation | 414 | | | |
mediawiki/extensions/CirrusSearch | 357 | Search Platform | JavaScript | |
mediawiki/extensions/AbuseFilter | 306 | Contributors | | |
phabricator/phabricator | 302 | Release Engineering | | |
mediawiki/services/restbase | 290 | Core Platform | | |
mediawiki/extensions/Flow | 232 | Growth | | |
mediawiki/extensions/Echo | 223 | Growth | JavaScript | |
mediawiki/vagrant | 221 | Release Engineering | | 1 |
mediawiki/extensions/Popups | 184 | Reading | JavaScript | |
mediawiki/extensions/Translate | 182 | Language engineering | | |
mediawiki/extensions/DonationInterface | 180 | Fundraising tech | | |
analytics/refinery | 178 | Analytics | | 1 |
mediawiki/extensions/PageTriage | 177 | Growth | | 1 |
mediawiki/extensions/Cargo | 176 | | | |
mediawiki/tools/codesniffer | 156 | | | |
mediawiki/extensions/TimedMediaHandler | 152 | Reading | | |
mediawiki/extensions/UniversalLanguageSelector | 142 | Language engineering | | |
mediawiki/vendor | 140 | | | |
mediawiki/extensions/SocialProfile | 139 | | | |
analytics/refinery/source | 138 | Analytics | | 1 |
operations/software | 137 | SRE | | 1 |
mediawiki/services/restbase/deploy | 136 | Core Platform | | |
operations/debs/pybal | 123 | SRE | | 2 |
mediawiki/extensions/CentralAuth | 116 | | Ruby | |
mediawiki/tools/release | 116 | | | 1 |
mediawiki/services/cxserver | 112 | | | |
mediawiki/extensions/BlueSpiceExtensions | 110 | | | |
mediawiki/extensions/WikimediaEvents | 110 | | PHP | |
labs/private | 108 | | | |
operations/debs/python-kafka | 104 | SRE | | |
labs/tools/heritage | 96 | | | |

Since some of the repositories connected to incidents are not in the top 50 Bitergia report, I've used git rev-list to sort them. Numbers are different because Bitergia excludes empty commits and bots (script, results: P7834).

Repository | Change | Stewards | Coverage | Incidents | Selected
mediawiki/extensions/WikibaseQualityConstraints | 910 | WMDE | | 1 |
mediawiki/extensions/ORES | 364 | Growth | JavaScript | 2 |
wikidata/query/rdf | 204 | WMDE | | 2 |
mediawiki/extensions/Babel | 146 | Editing | | 1 |
mediawiki/services/ores/deploy | 84 | Growth | | 1 |
maps/kartotherian/deploy | 80 | | | 1 |
mediawiki/extensions/PropertySuggester | 67 | WMDE | | 1 |
maps/tilerator/deploy | 61 | | | 1 |
mediawiki/extensions/Cognate | 47 | WMDE | | 1 |
All-Projects | 37 | | | 1 |
eventlogging | 26 | | | 1 |
integration/jenkins | 19 | Release Engineering | | 1 |
mediawiki/services/eventstreams | 16 | | | 1 |
wikidata/query/blazegraph | 10 | WMDE | | 1 |

Prioritize Repositories

The Change column uses Bitergia numbers; numbers marked with an asterisk (*) are from git rev-list.

Repository | Change | Stewards | Coverage | Incidents | Selected
mediawiki/extensions/VisualEditor | 692 | Editing | | |
mediawiki/extensions/ContentTranslation | 491 | Language engineering | | |
mediawiki/extensions/AbuseFilter | 306 | Contributors | | |
phabricator/phabricator | 302 | Release Engineering | | |
mediawiki/extensions/Flow | 232 | Growth | | |
mediawiki/extensions/Translate | 182 | Language engineering | | |
mediawiki/extensions/DonationInterface | 180 | Fundraising tech | | |
mediawiki/extensions/PageTriage | 177 | Growth | | 1 |
mediawiki/extensions/TimedMediaHandler | 152 | Reading | | |
mediawiki/extensions/UniversalLanguageSelector | 142 | Language engineering | | |
mediawiki/extensions/WikibaseQualityConstraints | 910* | WMDE | | 1 |
mediawiki/extensions/Babel | 146* | Editing | | 1 |
mediawiki/extensions/PropertySuggester | 67* | WMDE | | 1 |
mediawiki/extensions/Cognate | 47* | WMDE | | 1 |

The same table grouped by stewards.

Repository | Change | Stewards | Coverage | Incidents | Selected
mediawiki/extensions/VisualEditor | 692 | Editing | | |
mediawiki/extensions/Babel | 146* | Editing | | 1 |
mediawiki/extensions/ContentTranslation | 491 | Language engineering | | |
mediawiki/extensions/Translate | 182 | Language engineering | | |
mediawiki/extensions/UniversalLanguageSelector | 142 | Language engineering | | |
mediawiki/extensions/AbuseFilter | 306 | Contributors | | |
phabricator/phabricator | 302 | Release Engineering | | |
mediawiki/extensions/Flow | 232 | Growth | | |
mediawiki/extensions/PageTriage | 177 | Growth | | 1 |
mediawiki/extensions/DonationInterface | 180 | Fundraising tech | | |
mediawiki/extensions/TimedMediaHandler | 152 | Reading | | |
mediawiki/extensions/WikibaseQualityConstraints | 910* | WMDE | | 1 |
mediawiki/extensions/PropertySuggester | 67* | WMDE | | 1 |
mediawiki/extensions/Cognate | 47* | WMDE | | 1 |

Conclusions

  • There are some repositories that do not fit the Selenium/end-to-end testing model (eg: operations/puppet or operations/mediawiki-config) but could benefit from other testing mechanisms or deployment practices.
  • A test could prevent an outage if it runs:
    • Every time a patch is uploaded to Gerrit. That way it could find a problem during development. That is already done for repositories that have tests.
    • After deployment. That way it could find a problem that was not found during development. In the ideal case, deployment would be made to a test server in production, and a test would run targeting the test server. If it fails, further deployment would be cancelled. This is not yet done.
  • Automattic runs tests targeting WordPress.com production:

We decided to implement some basic e2e test scenarios which would only run in production – both after someone deploys a change and a few times a day to cover situations where someone makes some changes to a server or something.

Next steps:

  • I will contact owners of selected repositories (see Prioritize Repositories section) and offer help in creating the first test.
  • I will add results from the Incident Documentation tool to incident reports as a new Related Repositories section. The section will link to the tool and explain how it got the data. It will also ask for edits if the data is not correct.
  • I will reach out to people who created (or edited) incident reports and ask them to populate the Related Repositories section. This might have mixed results. For best results, the section will already be populated with the data from the Incident Documentation tool.
  • I will add Related Repositories section to the incident report template.

Incident Documentation tool improvements:

  • There are several ways to link from a wiki page to a patch or task. The tool for now only supports [[gerrit:]] and [[phab:]]. Tracked as issue #6.
  • Gerrit patches and Phabricator tasks from the Actionables section do not provide enough data. The entire incident report should be used. I limited it at first because I was collecting data manually (and Actionables looked like the most important part of the incident report), and later because of #6. Tracked as issue #4.
  • Find Gerrit repository from task component. Tracked as issue #5.
  • A table with the number of patches from each repository would be helpful. Tracked as issue #11.
  • A report with folder/file names from a repository that are mentioned the most. Especially useful for big repositories like operations/puppet and mediawiki/core. Tracked as issue #12.
Bring in 'da noise, bring in defunct. It's a zombie party!https://phabricator.wikimedia.org/phame/post/view/127/dduvall (Dan Duvall)2018-11-16T19:22:51+00:002023-02-07T22:01:01+00:00

Halloween is a full two weeks behind us here in the United States, but it's still on my mind. It happens to be my favorite holiday, and I receive it both gleefully and somberly.

Some of the more obvious and delightful ways I appreciate Halloween include: busting out my giant spider to hang in the front yard; getting messy with gory and gaudy decorations; scaring neighborhood children; stuffing candy in my face. What's not to like about all that, really?

But there are more deeply felt reasons to appreciate Halloween, reasons that aren't often fully internalized or even discussed. Rooted in its pagan Celtic traditions and echoed by similar traditions worldwide, like Día de los Muertos of Mexico and Obon of Japan, Halloween asks us, for a night, to put away our timidness about living and dying. It asks us to turn toward the growing darkness of winter, turn toward the ones we've lost, turn toward the decay of our own bodies, and honor these very real experiences as equal partners to the light, birth, and growth embodied by our everyday expectations. More precisely it asks us to turn toward these often difficult aspects of life not with hesitation or fear but with strength, jubilation, a sense of humor. It is this brave posture of Halloween's traditions that I appreciate so very much.

So Halloween is over and I'm looking back. What does that have to do with anything here at WMF and in Phabricator no less? Well, I want to take you into another dark and ominous cauldron of our experience that most would rather just forget about.

I want to show you some Continuous Integration build metrics for the month of October!

Will we see darkness? Oh yes. Will we see decay? Surely. Was that an awkward transition to the real subject of this post? Yep! Sorry, but I just had to have a thematic introduction, and brace yourself with a sigh because the theme will continue.

DOCKER WHALE – BRIIIIIINE!

You see this past October, Release Engineering battled a HORDE OF ZOMBIE CONTAINERS! And we'll be seeing in our metrics proof that this horde was, for longer than anyone wishes zombies to ever hang around, chowing down on the brains of our CI.

Before I get to the zombies, let's look briefly at a big picture view of last month's build durations... Let's also get just a bit more serious.

Daily 75th, 95th, and 98th percentiles for successful build durations – October 2018

What are we looking at? We're looking at statistics for build durations. The above chart plots the daily 75th, 95th, and 98th percentiles of successful build durations during the month of October as well as the number of job configuration changes made within the same range of time.

These data points were chosen for a few reasons.

First, percentiles are used over daily means to better represent what the vast majority of users experience when they're waiting on CI[1]. It excludes outliers, build durations that occur only about 2 percent of the time, not because they're unimportant to us, but because setting them aside temporarily allows us to find patterns of most common use and issues that might otherwise be obfuscated by the extra noise of extraordinarily long builds.

Next, three percentiles were chosen so that we might look for patterns among both faster builds and the longer running ones. Practically this means we can measure the effects of our changes on the chosen percentiles independently, and if we make changes to improve the build durations of jobs that typically perform closer to one percentile, we can measure the effect discretely while also making sure performance at other percentiles has not regressed.

Finally, job configuration changes are plotted alongside daily duration percentiles to help find indications of whether our changes to integration/config during October had an impact on overall build performance. Of course, measuring the exact impact of these changes is quite a bit more difficult and requires the build data used to populate this chart to be classified and analyzed much further—as we'll see later—but having the extra information there is an important first step.
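
As an aside, if you want to reproduce this kind of calculation on your own build data, each daily percentile boils down to a sort and an index lookup. A rough sketch (not the spreadsheet analysis behind the charts), assuming a file durations.txt with one successful build duration in seconds per line:

$ sort -n durations.txt | awk '
    { d[NR] = $1 }
    END {
      split("0.75 0.95 0.98", q, " ")
      for (k = 1; k <= 3; k++) {
        i = int(q[k] * NR); if (i < 1) i = 1
        printf "p%g: %s seconds\n", q[k] * 100, d[i]
      }
    }'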

So what can we see in this chart? Well, let's start with that very conspicuous dip smack dab in the middle.

Daily 75th, 95th, and 98th percentiles for successful build durations – dip around 10/14

And for background, another short thematic interlude:

Back in June, @thcipriani of Release Engineering was waiting on a particularly long build to complete—it was a "dark and stormy night" or something, *sighs and rolls eyes*—and during his investigation on the labs instance that was running the build, he noticed a curious thing: There was a Docker container just chugging away running a build that had started more than 6 hours prior, a build that had thought to be canceled and reaped by Jenkins, a build that should have been long dead but was sitting there very much undead and seemingly loving its long and private binge before the terminal specter of a meat-space man had so rudely interrupted.

"It's a zombie container," @thcipriani (probably) muttered as he felt his way backward on outstretched fingertips (ctrl-ccccc), logged out, and filed task T198517 to which @hashar soon replied and offered a rational but disturbing explanation.

I'm not going to explain the why in its entirety but you can read more about it in the comments of an associated task, T176747, and the links posted therein. I will, however, briefly explain what I mean by "zombie container."

A zombie container for the sake of this post is not strictly a zombie process in the POSIX sense, but means that a build's main process is still running, even after Jenkins has told it to stop. It is both taking up some amount of valuable host resources (CPU, memory, or disk space), and is invisible to anyone looking only at the monitoring interfaces of Gerrit, Zuul, or Jenkins.

We didn't see much evidence of these zombie containers having enough impact on the overall system to demand dropping other priorities—and to be perfectly honest, I half assumed that Tyler's account had simply been due to madness after ingesting a bad batch of homebrew honey mead—but the data shows that they continued to lurk and that they may have even proliferated under the generally increasing load on CI. By early October, these zombie containers were wreaking absolute havoc—compounded by the way our CI system deals with chains of dependent builds and superseding patchsets—and it was clear that hunting them down should be a priority.

Task T198517 was claimed and conquered, and to the dismay of zombie containers across CI:

Two integration/config patches were deployed to fix the issue. The first refactored all Docker based jobs to invoke docker run via a common builder. The second adds to the common docker-run builder the --init option which ensures a PID 1 within the container that will properly reap child processes and forward signals, and --label options which tag the running containers with the job name and build number; it also implements an additional safety measure, a docker-reap-containers post-build script that kills any running containers that could be errantly running at the end of the build (using the added labels to filter for only the build's containers).
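
To make the labelling-and-reaping idea concrete, here is a rough sketch; it is illustrative only, not the actual integration/config builder code. JOB_NAME and BUILD_NUMBER are the standard Jenkins environment variables, and the image name is a placeholder:

$ docker run --init --rm \
    --label "jenkins_job=${JOB_NAME}" --label "jenkins_build=${BUILD_NUMBER}" \
    docker-registry.example.org/ci-image:latest run-tests

# Post-build safety net: kill anything from this build that is still running.
$ docker ps -q \
    --filter "label=jenkins_job=${JOB_NAME}" \
    --filter "label=jenkins_build=${BUILD_NUMBER}" \
    | xargs -r docker kill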

Between the deployed fix and periodically running a manual process to kill off long-running containers that were started prior to the fix being deployed, I think we may be out of the woods for now.

Looking again at that dip in the percentiles chart, a few things are clear.

Daily 75th, 95th, and 98th percentiles for successful build durations – dip around 10/14

First, there's a noticeable drop among all three daily duration percentiles. Second, there also seems to be a decrease in both the variance of each day's percentile average expressed by the plotted error bars—remember that our percentile precision demands we average multiple values for each percentile/day—and the day-to-day differences in plotted percentiles after the dip. And lastly, the dip strongly coincides with the job configuration changes that were made to resolve T198517.

WE. DID. IT. WE'VE FREED CI FROM THOSE DREADED ZOMBIE CONTAINERS! THEY ARE TRULY (UN)^2-DEAD AGAIN SO LET'S DITCH THESE BORING CHARTS AND CELEBRA...

Say what? Oh. Right. I guess we didn't adequately measure exactly how much of an improvement in duration there was pre-and-post T198517 and whether or not there was unnoticed/unanticipated regression. Let's pause on that celebration and look a little deeper.

So how does one get a bigger picture of overall CI build durations before and after a change? Or of categories within any real and highly heterogeneous performance data for that matter? I did not have a good answer to this question, so I went searching and I found a lovely blog post on analyzing DNS performance across various geo-distributed servers[2]. It's a great read really, and talks about a specific statistical tool that seemed like it might be useful in our case: The logarithmic percentile histogram.

"I like the way you talk..." Yes, it's a fancy name, but it's pretty simple when broken down... backwards, because, well, English.

A histogram shows the distribution of one quantitative variable in a dataset, in our case build duration, across various 'buckets'. A percentile histogram buckets values for the variable of the histogram by its percentiles, and a logarithmic percentile histogram plots the distribution of values across percentile buckets on a logarithmic scale.

I think it's a bit easier to show than to describe, so here's our plot of build duration percentiles before and after T198517 was resolved, represented as a histogram on a logarithmic scale.

High-to-low percentiles before and after the zombie container issue was resolved

First, note that while we ranked build durations low to high in our other chart, this one presents a high-to-low ranking, meaning that longer durations (slower builds) are ranked within lower percentiles and shorter durations (faster builds) are ranked in higher percentiles. This better fits the logarithmic scale, and more importantly it brings the lowest percentiles (the slowest durations) into focus, letting us see where the biggest gains were made by resolving the zombie container issue.

Also valuable about this representation is the fact that it shows all percentiles, not just the three that we saw earlier in the chart of daily calculations, which shows us that gains were made consistently across the board and there are no notable regressions among the percentile ranks where it would matter—there is a small section of the plot that shows percentiles of post-T198517 durations being slighter higher (slower), but this is among some of the percentiles for the very fastest of builds where the absolute values of differences are very small and perhaps not even statistically significant.

Looking at the percentage gains annotated parenthetically in the plot, we can see major gains at the 0.2, 1, 2, 10, 25, and 50th percentiles. Here they are as a table.

percentile | duration w/ zombies | w/o zombies | gain from killing zombies
p0.2 | 43.3 minutes | 39.3 minutes | -9.2%
p1 | 34.0 | 26.5 | -22.2%
p2 | 27.7 | 22.2 | -19.7%
p10 | 17.6 | 12.7 | -27.9%
p25 | 11.0 | 7.2 | -34.4%
p50 | 5.3 | 3.4 | -36.9%

So there it is quite plain, a CI world with and without zombie containers, and builds running upwards of 37% faster without those zombies chomping away at our brains! It's demonstrably a better world without them I'd say, but you be the judge; We all have different tastes. 8D

Now celebrate or don't celebrate accordingly!

Oh and please have at the data[3] yourself if you're interested in it. Better yet, find all the ways I screwed up and let me know! It was all done in a giant Google Sheet—that might crash your browser—because, well, I don't know R! (Side note: someone please teach me how to use R.)

References

[1] https://www.dynatrace.com/news/blog/why-averages-suck-and-percentiles-are-great/
[2] https://blog.apnic.net/2017/11/24/dns-performance-metrics-logarithmic-percentile-histogram/
[3] https://docs.google.com/spreadsheets/d/1-HLTy8Z4OqatLnufFEszbqkS141MBXJNEPZQScDD1hQ/edit#gid=1462593305

Credits

Thanks to @thcipriani and @greg for their review of this post!

//"DOCKER ZOMBIE" is a derivative of https://linux.pictures/projects/dark-docker-picture-in-playing-cards-style and shared under the same idgaf license as original https://linux.pictures/about. It was inspired by but not expressly derived from a different work by drewdomkus https://flickr.com/photos/drewdomkus/3146756158//

Wikimedia Release Engineering's 1st Annual Developer Satisfaction Surveyhttps://phabricator.wikimedia.org/phame/post/view/126/zeljkofilipin (Željko Filipin)2018-11-07T16:02:28+00:002018-12-15T20:02:47+00:00
NOTE: The survey is now closed

This survey will help the Release Engineering team measure developer satisfaction and determine where to invest resources. The topics covered will include the following:

  • Local Development Environment
  • Beta Cluster / Staging Environment
  • Testing / CI
  • Code Review
  • Deployments
  • Production Systems
  • Development and Productivity Tools
  • Developer Documentation
  • General Feedback

We are soliciting feedback from all Wikimedia developers, including Staff, 3rd party contributors and volunteer developers. The survey will be open for 2 weeks, closing on November 14th.

This survey will be conducted via a third-party service, which may subject it to additional terms. For more information on privacy and data-handling, see the survey privacy statement.

To participate in this survey, please start here: Developer Satisfaction Survey.

Mukunda Modell

Production Excellence #4: October 2018https://phabricator.wikimedia.org/phame/post/view/125/Krinkle (Timo Tijhof)2018-11-28T17:47:20+00:002020-03-24T22:06:23+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

  • Month in numbers.
  • Highlighted stories.
  • Current problems.

📊 Month in numbers

  • 7 documented incidents from 24 September to 31 October. [1]
  • 79 Wikimedia-prod-error tasks closed from 24 September to 31 October. [2]
  • 69 Wikimedia-prod-error tasks created from 24 September to 31 October. [3]
  • 175 currently open Wikimedia-prod-error tasks (as of 25 November 2018).

October had a relatively high number of incidents – compared to prior months and compared to the same month last year (details).

Terminology:

  • An Exception (or fatal) causes user actions to be prevented. For example, a page would display "Exception: Unable to render page", instead of the article content.
  • A Warning (or non-fatal, or error) can produce page views that are technically unaware of a problem, but may show corrupt, incorrect, or incomplete information. Examples – an article would display the code word “null” instead of the actual content, a user looking for Vegetables may be taken to an article about Vegetarians, a user may receive a notification that says “You have (null) new messages.”

I’ve highlighted a few of last month’s resolved tasks below.

📖 Send your thanks for talk contributions

Fixed by volunteer @Mh-3110 (Mahuton).

The Thanks functionality for MediaWiki (created in 2013) wasn’t working in some cases. This problem was first reported in April, with four more reports since then. Mahuton investigated together with @SBisson. They found that the issue was specific to talk pages with structured discussions.

It turned out to be caused by an outdated array access key in SpecialThanks.php. Once adjusted, the functionality was restored to its former glory. The error existed for about eight months, since internal refactoring in March for T186920 changed the internal array.

This was Mahuton’s first Gerrit contribution. Thank you @Mh-3110, and welcome!

T191442 / https://gerrit.wikimedia.org/r/461189

📖 One space led to Fatal exception

Fixed by volunteer @D3r1ck01 (Derick Alangi).

Administrators use the Special:DeletedContributions page to search for edits that are hidden from public view. When an admin typed a space at the end of their search, the MediaWiki application would throw a fatal exception. The user would see a generic error page, suggesting that the website may be unavailable.

Derick went in and updated the input handler to automatically correct these inputs for the user.

T187619

📖 Fatal exception from translation draft access

Accessing the private link for ContentTranslation when logged out isn’t meant to work. But the code didn’t account for this fact. When users attempted to open such a URL while not logged in, the ContentTranslation code performed an invalid operation. This caused a fatal error from the MediaWiki application. The user would see a system error page without further details.

This could happen when opening the link from your bookmarks before logging in, or after restarting the browser, or after clearing one’s cookies.

Fixed by @santhosh (Santhosh Thottingal, WMF Language Engineering team).

T205433

🎉 Thanks!

Thank you to everyone who helped by reporting or investigating problems in Wikimedia production; and for devising, coding or reviewing the corrective measures. Including: @Addshore, @Aklapper, @Anomie, @ArielGlenn, @Catrope, @D3r1ck01, @Daimona, @Fomafix, @Ladsgroup, @Legoktm, @MSantos, @Mainframe98, @Melos, @Mh-3110, @SBisson, @Tgr, @Umherirrender, @Vort, @aaron, @aezell, @cscott, @dcausse, @jcrespo, @kostajh, @matmarex, @mmodell, @mobrovac, @santhosh, @thcipriani, and @thiemowmde.

📉 Current problems

Take a look at the workboard and look for tasks that might need your help. The workboard lists known issues, grouped by the week in which they were first observed.

https://phabricator.wikimedia.org/tag/wikimedia-production-error

💡 ProTip:

Cross-reference one workboard with another via Open Tasks → Advanced Filter, and enter Tag(s) to apply as a filter.

Thanks!

Until next time,
– Timo Tijhof


Footnotes:

[1] Incidents. – wikitech.wikimedia.org/wiki/Special:AllPages...
[2] Tasks closed. – phabricator.wikimedia.org/maniphest/query...
[3] Tasks opened. – phabricator.wikimedia.org/maniphest/query...

Production Excellence #3: September 2018https://phabricator.wikimedia.org/phame/post/view/119/Krinkle (Timo Tijhof)2018-09-25T18:41:42+00:002020-03-24T22:06:14+00:00

How’d we do in our strive for operational excellence last month? Read on to find out!

Month in numbers

  • 1 documented incident since August 9. [1]
  • 113 Wikimedia-prod-error tasks closed since August 9. [2]
  • 99 Wikimedia-prod-error tasks created since August 9. [3]

Current problems

Frequent:

  • [MediaWiki-Logging] Exception from Special:Log (public GET). – T201411
  • [Graph] Warning "data error" from ApiGraph in gzdecode. – T184128
  • [RemexHtml] Exception "backtrack_limit exhausted" from search index jobs. – T201184

Other:

  • [MediaWiki-Redirects] Exception from NS_MEDIA redirect (public GET). – T203942

This is an oldie: (Well..., it's an oldie where I come from... 🎸)

  • [FlaggedRevs] Exception from Special:ProblemChanges (since 2011). – T176232

Terminology:

  • An Exception (or fatal error) causes user actions to be aborted. For example, a page would display "Exception: Unable to render page", instead of the article content.
  • A Warning (or non-fatal error) can produce page views that are technically unaware of a problem, but may show corrupt or incomplete information. For example, an article would display the word "null" instead of the actual content. Or, a user may be told "You have null new messages."

The combined volume of infrequent non-fatal errors is high. This limits our ability to automatically detect whether a deployment caused problems. The “public GET” risks in particular can cause (and have caused) alerts to fire that notify Operations of wikis potentially being down. Such exceptions must not be publicly exposed.

With that behind us... Let’s celebrate this month’s highlights!

📖 Quiz defect – "0" is not nothing!

Tyler Cipriani (Release Engineering) reported an error in Quiz. Wikiversity uses Quiz for interactive learning. [4] Editors define quizzes in the source text (wikitext). The Quiz program processes this text, creates checkboxes with labels, and sends it to a user. When the sending part failed, "Error: Undefined index" appeared in the logs. @Umherirrender investigated.

A line in the source text can define a question, an answer, or nothing at all. The code that creates checkboxes needs to decide between "something" and "nothing". It used the PHP "if" statement for this, which compares a value to True and False. The answers to a quiz can be any text, so PHP first transforms the text to one of True or False. In doing so, values like "0" became False. This meant the code thought "0" was not an answer, and no checkbox was created for it. The code responsible for sending checkboxes did not have this problem, so when it tried to access the checkbox to send, it did not exist. Hence, "Error: Undefined index".

Umherirrender fixed the problem by using a strict comparison. A strict comparison doesn't transform a value first; it only compares.
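The difference is easy to reproduce outside of Quiz. A minimal PHP sketch, using an empty string to stand in for "nothing" (the real Quiz code is more involved):

<?php
$answer = '0';

// Loose truthiness: the string "0" is converted to False, so an answer
// consisting only of "0" looks like "nothing".
if ( $answer ) {
    echo "loose: answer detected\n"; // never reached for "0"
}

// Strict comparison: no conversion, just a comparison against the value
// that genuinely means "nothing".
if ( $answer !== '' ) {
    echo "strict: answer detected\n"; // reached: "0" is a real answer
}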

T196684

📖 PageTriage enters JobQueue for better performance

Kosta Harlan (from the Audiences department's Growth team) investigated a warning for PageTriage. This extension provides the New Pages Feed tool on the English Wikipedia. Each page in the feed has metadata, usually calculated when an editor creates a page. Sometimes this is not available. Then it must be calculated on demand, when a user triages pages. So far, so good. The information was then saved to the database for re-use by other triagers. This last part caused the serious performance warning: "Unexpected database writes".

Database changes must not happen on page views. The database has many replicas for reading, but only one "master" for all writing. We avoid using the master during page views to make our systems independent. This is a key design principle for MediaWiki performance. [5] It lets a secondary data centre build pages without connecting to the primary (which can be far away).

Kosta addressed the warning by improving the code that saves the calculated information. Instead of saving it immediately, an instruction is now sent via a job queue, after the page view is ready. This job queue then calculates and saves the information to the master database. The master synchronises it to replicas, and then page views can use it.
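The general pattern looks roughly like the sketch below. The job name and parameters are made up for illustration (the real change is linked underneath), but JobQueueGroup and JobSpecification are MediaWiki's building blocks for this kind of deferral:

<?php
// Sketch only: defer the metadata write to the job queue instead of
// writing to the master database during a page view.
// $title is assumed to be the Title object of the page being triaged;
// 'compileArticleMetadata' and its parameter are hypothetical names.
$job = new JobSpecification(
    'compileArticleMetadata',
    [ 'pageId' => $title->getArticleID() ],
    [],
    $title
);

// lazyPush() enqueues the job after the current request has been served,
// so the page view itself never writes to the master database.
JobQueueGroup::singleton()->lazyPush( $job );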

T199699 / https://gerrit.wikimedia.org/r/455870

📖 Tomorrow may be sooner than you think

After developers submit code to Gerrit, they eagerly await the result from Jenkins, an automated test runner. It sometimes incorrectly reported a problem with the MergeHistory feature. The code assumed that the tests would finish by "tomorrow".

It might be safe to assume our tests will not take one day to finish. Unfortunately, the programming utility "strtotime" does not interpret "tomorrow" as "this time tomorrow". Instead, it means "the start of tomorrow". In other words, the next strike of midnight! The tests use UTC as the neutral timezone.

Every day, in the 15 minutes before 5 PM in San Francisco (which is midnight UTC), code submitted to Code Review could have mysteriously failing tests.
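The behaviour is easy to see in isolation; a quick PHP illustration (output values are examples only):

<?php
// strtotime( 'tomorrow' ) means "the start of tomorrow", i.e. the next
// midnight, not "this time tomorrow".
$midnight = strtotime( 'tomorrow' );

// '+1 day' keeps the current time of day, which is closer to what a
// "finish by tomorrow" expiry usually intends.
$sameTimeTomorrow = strtotime( '+1 day' );

echo date( 'Y-m-d H:i', $midnight ), "\n";         // e.g. 2018-09-26 00:00
echo date( 'Y-m-d H:i', $sameTimeTomorrow ), "\n"; // e.g. 2018-09-26 17:45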

– Continue at https://gerrit.wikimedia.org/r/452873

📖 Continuous Whac-A-Mole

In August, developers started to notice rare and mysterious failures from Jenkins. No obvious cause or solution was known at that time.

Later that month, Dan Duvall (Release Engineering team) started exploring ways to run our tests faster. Before, we had many small virtual servers, where each server ran only one test at a time. The idea: have a smaller group of much larger virtual servers where each server could run many tests at the same time. We hope that during busier times this will better share the resources between tests. And, during less busy times, allow a single test to use more resources.

As implementation of this idea began, the mysterious test failures became commonplace. "No space left on device" was a common error. The test servers had their hard disks full. This was surprising: the new (larger) servers seemed to have enough space to accommodate the number of tests they ran at the same time. Together with Antoine Musso and Tyler Cipriani, Dan identified and resolved two problems:

  1. Some automated tests did not clean up after themselves.
  2. The test-templates were stored on the "root disk" (the hard drive for the operating system), instead of the hard drive with space reserved for tests. This root disk is quite small, and is the same size on small servers and large servers.

T202160 / T202457

🎉 Thanks!

Thank you to everyone who has helped report, investigate, or resolve production errors this past month. Including:

Tpt
Ankry
Daimona
Legoktm
Volker_E
Pchelolo
Dan Duvall
Gilles Dubuc
Daniel Kinzler
Umherirrender
Greg Grossmeier
Gergő Tisza (Tgr)
Sam Reed (Reedy)
Giuseppe Lavagetto
Brad Jorsch (Anomie)
Tim Starling (tstarling)
Kosta Harlan (kostajh)
Jaime Crespo (jcrespo)
Antoine Musso (hashar)
Roan Kattouw (Catrope)
Adam WMDE (Addshore)
Stephane Bisson (SBisson)
Niklas Laxström (Nikerabbit)
Thiemo Kreuz (thiemowmde)
Subramanya Sastry (ssastry)
This, that and the other (TTO)
Manuel Aróstegui (Marostegui)
Bartosz Dziewoński (matmarex)
James D. Forrester (Jdforrester-WMF)

Thanks!

Until next time,

– Timo Tijhof


Footnotes:

[1] Incidents. – https://wikitech.wikimedia.org/wiki/Special:AllPages?from=Incident+documentation%2F20180809&to=Incident+documentation%2F20180922&namespace=0
[2] Tasks closed. – https://phabricator.wikimedia.org/maniphest/query/wOuWkMNsZheu/#R
[3] Tasks opened. – https://phabricator.wikimedia.org/maniphest/query/6HpdI76rfuDg/#R
[4] Quiz on Wikiversity. – https://en.wikiversity.org/wiki/How_things_work_college_course/Conceptual_physics_wikiquizzes/Velocity_and_acceleration
[5] Operate multiple datacenters. – https://www.mediawiki.org/wiki/Requests_for_comment/Master-slave_datacenter_strategy_for_MediaWiki

Quibble in summerhttps://phabricator.wikimedia.org/phame/post/view/118/hashar (Antoine Musso)2019-03-28T10:42:04+00:002019-03-28T10:47:34+00:00

Note: this post was published on 03/28 but was originally written in September 2018, after Quibble 0.0.26, and had been sitting unpublished since then.


The last update about Quibble was from June 1st (Blog Post: Quibble in May); this post covers the progress made over the summer.

Since the last update, the Quibble version has gone from 0.0.17 to 0.0.26.

For --commands, one passes them as shell snippets such as: --commands 'echo starting' 'phpunit' 'echo done'. A future version of Quibble will make it accept only a single argument, though the option can be repeated. In other terms, in the future one would have to use: --command 'echo starting' --command 'phpunit' --command 'echo done'.

The MediaWiki PHPUnit test suite to use is determined based on ZUUL_PROJECT. --phpunit-testsuite lets one set it explicitly; a use case is to run extension tests for a change made to mediawiki/core and ensure it does not break extensions (ZUUL_PROJECT=mediawiki/core quibble --phpunit-testsuite=extensions mediawiki/extensions/BoilerPlate). On Wikimedia CI these are the wmf-quibble-* jobs.

You can get a great speed-up by using a tmpfs for the database. Create a tmpfs and then pass --db-dir to make use of it. With a Docker container one would do: docker run --tmpfs /workspace/db:size=320M quibble:latest --db-dir=/workspace/db.

In the future, I would like Quibble to be faster: it runs the commands serially and could be made faster by parallelizing at least some of the test commands (edit: done in 0.0.29).


Changelog for 0.0.17 to 0.0.26

  • T196013 MediaWiki configuration injected by Quibble is now prepended at the start of LocalSettings.php, which makes the configuration snippets available to wfLoadExtension() / wfLoadSkin().
  • T197687 - Fix Chrome autoplay policy which prevented QUnit tests from running for Wikispeech https://goo.gl/xX8pDD
  • T198171 - In Chrome, do not rate limit history.pushState(), which prevented some QUnit tests from passing since they overflowed the limit.
  • T195918
    • Enhance inline help for --run and --skip by grouping them in a stages argument group.
    • New --skip=all to skip all tests
  • T195084 T195918 - Support running any command inside the Quibble environment by using --commands (see above). They are run with a web server exposed (T203178).
  • T22471 T196347 - Run rebuildLocalisationCache after update.php; this fixes locking issues when doing the first page request, where multiple requests were racing over generating the localization cache.
  • T200017 - Allow overriding the PHPUnit testsuite to run.
  • Do not spawn a WebServer when running PHPUnit tests; it is only needed for QUnit and Selenium tests.
  • Add a link to https://doc.wikimedia.org/quibble/ in the README.rst.
  • T192132 - Quibble is now licensed under Apache 2.0
  • T202710 - Xvfb no longer listens on a Unix socket.
  • T200991 - Passing --dump-db-postrun will dump the content of the database to the log directory (--log-dir). Thanks @Pablo-WMDE
  • Add support for Zuul cloner --branch and --project-branch, used to test MediaWiki-extensions-DonationInterface master branch against MediaWiki release branches.
  • The environment variable TMPDIR set by Quibble is no longer hardcoded to /tmp; it now follows the logic of Python's tempfile.gettempdir().
  • When running under Docker, default the log directory to be under the workspace instead of /log.
  • Allow specifying database data directory with --db-dir (default is the temporary directory based on environment variable).
An introduction to Task Types in Phabricatorhttps://phabricator.wikimedia.org/phame/post/view/116/mmodell (Mukunda Modell)2018-09-20T17:22:36+00:002018-09-24T13:21:21+00:00

This blog post will describe a bit about how we are utilizing the "Task Types" feature in Phabricator to facilitate better tracking of work and to streamline workflows with custom fields. Additionally, I will be soliciting feedback about use-cases which could take further advantage of this feature.

Introducing Task Types

Task Types are a relatively new feature in Phabricator which allow tasks to be created with extra information fields that are unique to tasks of a given type. For example, Release tasks have a release date and release version which are not relevant for other types of tasks.

Another task type that has been recently introduced is the deadline type. Deadlines include a single extra field Due Date which is displayed at the top of the task view as well as on workboard cards.

Example: Typed Tasks

Deadline:
Screenshot from 2018-09-20 06-45-44.png (341×657 px, 29 KB)
Release:
Screenshot from 2018-09-20 06-50-14.png (494×1 px, 50 KB)

More Uses for Task Types

Task types have the potential to streamline workflows and support the use of Phabricator for collecting structured data.

Bug reports and Feature Requests

One proposed use of task types is for collecting specific information in bug reports and feature requests. Bug reports, for example, might ask for OS or Browser version in separate fields to aid in sorting and searching through reports.

Security Issues

Another potential use-case which is currently being developed is a security issue task type. This will allow the security team to add fields relevant to security issues without cluttering the task form used by everyone for other types of tasks.

The Relationship Between Custom Forms and Custom Types

Custom forms can be created which hide irrelevant fields and generally streamline the process of submitting a task for a given workflow or for a team's specific use-case. This is a great feature in Phabricator and we have made extensive use of it for various purposes. The drawback to custom forms is that they are generally only useful for submitting tasks. Once a task is created, editing takes place on the normal "generic" task edit form.

Enter: Typed forms

It's now possible to assign a type to a form. Forms can be configured so that whenever you edit a Security task, you always see the Edit Security Task form. Thanks to typed forms, we can now add custom fields which are always visible when editing one type of task but hidden when editing other types.

Example: Custom Forms
Security Issue Form:
Screenshot from 2018-09-20 08-00-29.png (940×770 px, 75 KB)
Standard Form:
Screenshot from 2018-09-20 08-03-43.png (696×696 px, 48 KB)

Soliciting Feedback

Your feedback will be helpful in shaping the types of tasks and forms available in Phabricator. In order to best meet the needs of everyone who uses Phabricator, I'd love to hear your input on what forms and fields would be most useful for your needs. Describe a workflow or a use-case that you think would be well served by custom fields. You can comment here or on the task: T93499: Add support for task types (subtypes)

mediawiki_selenium 1.8.1 Ruby Gem Releasedhttps://phabricator.wikimedia.org/phame/post/view/108/zeljkofilipin (Željko Filipin)2018-06-14T15:05:31+00:002018-09-04T17:49:57+00:00

It has been a while since the last mediawiki_selenium release! 💎

I have just released version 1.8.1. 🚀

Notable changes:

  • Required Ruby version is 2.x
  • Upgrade selenium-webdriver to 3.2
  • Integration tests use Chrome instead of PhantomJS
  • Added license to readme file
  • Documented Sauce Labs usage in readme file
  • Updated Special:Preferences/reset page

I would like to thank several contributors that have improved the gem since the last release: @hashar, @Rammanojpotla, @demon and @thiemowmde! 👏

Quibble in Mayhttps://phabricator.wikimedia.org/phame/post/view/107/hashar (Antoine Musso)2018-06-01T20:36:22+00:002018-06-06T10:17:44+00:00

[Quibble] is the new test runner for MediaWiki (see the intro Blog Post: Introducing Quibble). This post gives an update on what happened during May 2018.

Updates

Željko Filipin wrote a blog post Blog Post: Run Selenium tests using Quibble and Docker.

Since the last update, the Quibble version went from 0.0.11 to 0.0.17:

  • Use Sphinx to generate documentation and publish it online https://doc.wikimedia.org/quibble/ - T193164 [Antoine & Željko]
  • Composer timeout bumped to 900 seconds. PHP CodeSniffer against the entirety of mediawiki/core takes a while under HHVM. [Kunal Mehta]
  • Process git submodules in extensions and skins - T130966 [Antoine]
  • HHVM now serves .svg files with Content-Type: image/svg+xml - T195634 [Antoine]
  • Support for postgres as a database backend. You will need postgres and pg_virtualenv installed, then pass --db=postgres. - T39602 [Kunal Mehta]
  • Option --skip to skip one or more test commands. [Kunal Mehta]
  • Properly pass environment variables to all setup and test commands. Notably, MW_INSTALL_PATH and MW_LOG_DIR were missing, which caused some extensions to fail. The Jenkins job now properly captures all logs [Antoine]

How you can help

Documentation

The documentation can use tutorials for various use cases. It is in integration/quibble.git in the doc/source directory. You should be able to generate it by simply running:

tox -e doc
<your web browser> doc/build/index.html

Any support requests or questions you might have are most welcome as a Phabricator task against Quibble.

Migrate CI

I have migrated MediaWiki and a lot of extensions to use the Quibble jobs. There are still 229 MediaWiki extensions not migrated yet. A test report is built daily by Jenkins:

https://integration.wikimedia.org/ci/job/integration-config-qa/lastCompletedBuild/testReport/

Tests "test_mediawiki_repos_use_quibble" represent extension not migrated yet. T183512 is the huge tracking task.

Postgres

Make MediaWiki tests pass with Postgres!

T195807: Fix failing MediaWiki core tests on Postgres database backend

Thank you

Huge thanks to Kunal Mehta, Timo Tijhof, Adam Wight, Željko Filipin and Stephen Niedzielski.

That is all for May 2018.

References

[Quibble]
https://lists.wikimedia.org/pipermail/wikitech-l/2018-April/089812.html
[Presentation]
https://commons.wikimedia.org/wiki/File:20180519-QuibblePres.pdf
[Last update]
https://lists.wikimedia.org/pipermail/wikitech-l/2018-April/089858.html

Technical Debt - The Contagion Effecthttps://phabricator.wikimedia.org/phame/post/view/106/Jrbranaa (Jean-Rene Branaa)2018-05-24T23:16:51+00:002018-07-22T14:18:06+00:00

One particularly interesting topic discussed during the Hackathon Technical Debt session (T194934) was that of the contagious aspect of technical debt. Although this makes sense in hindsight, it's not something that I had really given much thought to previously.

The basic premise is that existing technical debt can have a contagious effect on other areas of code. One aspect of this is that developers new to the MediaWiki code base may use existing code as a pattern for new code development. If that code has technical debt, the technical debt could get replicated in other areas of code.

This can be overcome with both education about desired patterns as well as sharing the technical debt state of existing code. It's not clear how best to accomplish the latter, but perhaps it's as simple as a comment in the code, once it's been identified and is being tracked in Phabricator.

Another aspect of the contagion effect (perhaps more of a compound effect) is the result of maintaining code with existing technical debt. As bugs are fixed or minor features added, those changes can, in effect, result in a spreading of the technical debt. Of course this doesn't always need to be the case, but it can be if one is not careful.

I'd like to get your thoughts on this topic and your past experiences working with and around technical debt.

Thoughts/Questions:

  • Are some areas of code more contagious than others?
  • What are some ways to mark technical debt as such?
  • What do you do when you need to work on code with significant technical debt?
Run Selenium tests using Quibble and Dockerhttps://phabricator.wikimedia.org/phame/post/view/100/zeljkofilipin (Željko Filipin)2018-05-02T13:46:33+00:002020-03-04T14:35:43+00:00

Dependencies are Git, Python 3, and Docker Community Edition (CE).

First, the general setup.

$ git clone https://gerrit.wikimedia.org/r/p/integration/quibble
...
       
$ cd quibble/

$ python3 -m pip install -e .
...

$ docker pull docker-registry.wikimedia.org/releng/quibble-stretch:latest
...
(2m 26s)

The simplest, and slowest, way to run Quibble.

$ docker run -it --rm \
 docker-registry.wikimedia.org/releng/quibble-stretch:latest
...
(12m 54s)

Speed things up by using local repositories.

$ mkdir -p ref/mediawiki/skins

$ git clone --bare https://gerrit.wikimedia.org/r/mediawiki/core ref/mediawiki/core.git
...
(3m 40s)

$ git clone --bare https://gerrit.wikimedia.org/r/mediawiki/vendor ref/mediawiki/vendor.git
...

$ git clone --bare https://gerrit.wikimedia.org/r/mediawiki/skins/Vector ref/mediawiki/skins/Vector.git
...

$ mkdir cache
$ chmod 777 cache

$ mkdir -p log
$ chmod 777 log

$ mkdir -p src
$ chmod 777 src

$ docker run -it --rm \
  -v "$(pwd)"/cache:/cache \
  -v "$(pwd)"/log:/workspace/log \
  -v "$(pwd)"/ref:/srv/git:ro \
  -v "$(pwd)"/src:/workspace/src \
  docker-registry.wikimedia.org/releng/quibble-stretch:latest
...
(18m 0s)

The second run of everything, just to see if things get faster.

$ docker run -it --rm \
  -v "$(pwd)"/cache:/cache \
  -v "$(pwd)"/log:/workspace/log \
  -v "$(pwd)"/ref:/srv/git:ro \
  -v "$(pwd)"/src:/workspace/src \
  docker-registry.wikimedia.org/releng/quibble-stretch:latest
...
(16m 50s)

If you get this error message

A LocalSettings.php file has been detected. To upgrade this installation, please run update.php instead

just remove the file

$ rm src/LocalSettings.php

Speed things up by skipping Zuul and not installing dependencies.

$ docker run -it --rm \
  -v "$(pwd)"/cache:/cache \
  -v "$(pwd)"/log:/workspace/log \
  -v "$(pwd)"/ref:/srv/git:ro \
  -v "$(pwd)"/src:/workspace/src \
  docker-registry.wikimedia.org/releng/quibble-stretch:latest --skip-zuul --skip-deps
...
(6m 17s)

Speed things up by just running Selenium tests.

$ docker run -it --rm \
  -v "$(pwd)"/cache:/cache \
  -v "$(pwd)"/log:/workspace/log \
  -v "$(pwd)"/ref:/srv/git:ro \
  -v "$(pwd)"/src:/workspace/src \
  docker-registry.wikimedia.org/releng/quibble-stretch:latest --skip-zuul --skip-deps --run selenium
...
(1m 19s)
Introducing Quibblehttps://phabricator.wikimedia.org/phame/post/view/99/hashar (Antoine Musso)2018-04-30T09:09:00+00:002018-05-30T21:12:47+00:00

Running all tests for MediaWiki and matching what CI/Jenkins is running has been a constant challenge for everyone, myself included. Today I am introducing Quibble, a Python script that clones MediaWiki, sets it up and runs test commands.

It is a follow up to the Vienna Hackathon in 2017. We had a lot of discussion about making the CI jobs reproducible on a local machine and unifying the logic in a single place. Today, I have added a few jobs to mediawiki/core.

An immediate advantage is that they run in Docker containers and will start running as soon as an execution slot is available. That will be faster than the old jobs (suffixed with -jessie) that had to wait for a virtual machine to be made available.

A second advantage is that one can exactly reproduce the build on a local computer and even hack on the code for a fix-up.

The setup guide is available from the source repository (integration/quibble.git):
https://gerrit.wikimedia.org/g/integration/quibble/

The minimal example would be:

git clone https://gerrit.wikimedia.org/r/p/integration/quibble
cd quibble
python3 -m pip install -e .
quibble

A few more details are available in this post on the QA list:
https://lists.wikimedia.org/pipermail/qa/2018-April/002699.html

Please give it a try and send issues and support requests to the Phabricator Quibble project.

It will eventually be used for all MediaWiki extensions and skins as well.

Selenium tests in Node.js project retrospectivehttps://phabricator.wikimedia.org/phame/post/view/88/zeljkofilipin (Željko Filipin)2018-03-26T14:28:12+00:002018-05-15T23:35:31+00:00

I have been working on the project, with more or less focus, since 2015. Maybe the easiest way to follow the project is by taking a look at a few epic tasks:

T182421: Q3 Selenium framework improvements will come to an end in a few days, so last week a few of us had a meeting to discuss the project.

Conclusions:

  • The new Node.js Selenium framework is simpler and easier to use than the previous Ruby framework.

What could have gone better:

  • A lot of effort is required to port large test suites. Some teams were able to do it, some teams were not.
  • It was not clear that both Ruby and Node.js frameworks could coexist.
  • It was not clear that Mocha is recommended, but not mandatory. It is still possible to write Cucumber tests.
  • Some features of the Ruby framework are not available in the Node.js framework, like multi-user login.
  • Node.js's built-in assertion library sometimes doesn't provide useful error messages. Chai is a good alternative.
  • It would be better if a meeting like this happened at the beginning of the project, and several times during the project.

Things to do:

Meeting notes are available at 20180320 Selenium Retrospective.


Image by Paul Friel - Meerkat II, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=24567063

Phabricator Updates for February 2018https://phabricator.wikimedia.org/phame/post/view/85/mmodell (Mukunda Modell)2018-02-15T07:55:48+00:002018-02-23T15:48:34+00:00

This is a digest of the updates from several weeks of changelogs which are published upstream. This is an incomplete list as I've cherry-picked just the changes which I think will be of significant interest to end-users of Wikimedia's phabricator. Please see the upstream changelogs for a detailed overview of everything that's changed recently.

General

Bulk Editor

https://secure.phabricator.com/T13025 The bulk editor (previously sometimes called the "batch editor") has been rebuilt on top of modern infrastructure (EditEngine) and a number of bugs have been fixed.

You can now modify the set of objects being edited from the editor screen, and a wider range of fields (including "points" and some custom fields) are supported. The bulk editor should also handle edits of workboard columns with large numbers of items more gracefully.

Bulk edits can now be made silently (suppressing notifications, feed stories, and email) with bin/bulk make-silent. The need to run a command-line tool is a little clumsy and is likely to become easier in a future version of Phabricator, but the restriction is deliberate: the ability to act silently could help an attacker who compromised an account avoid discovery for an extended period of time.

Edits which were made silently show an icon in the timeline view to make it easier to identify them.

Webhooks

Herald now supports formally defining webhooks. You can configure webhooks in "firehose" mode (so they receive all events) or use Herald rules to call them when certain conditions are met.

Mail Stamps

Several users have requested a way to differentiate notifications triggered by an @mention from the deluge of regular task subscription notification emails. This feature should provide a very good solution. See T150766 for one such request.

Mail now supports "mail stamps" to make it easier to use client rules to route or flag mail. Stamps are pieces of standardized metadata attached to mail in a machine-parseable format, like "FRAGILE" or "RETURN TO SENDER" might be stamped on a package.

By default, stamps are available in the X-Phabricator-Stamps header. You can also enable them in the mail body by changing the Settings → Email Format → Send Stamps setting. This may be useful if you use a client like Gmail which cannot act on mail headers.

Stamps provide more comprehensive information about object and change state than was previously available, and you can now highlight important mail which has stamps like mention(@alice) or reviewer(@alice).

See https://secure.phabricator.com/T13069 for additional discussion and plans for this feature.

Mute

You can now Mute Notifications for any object which supports subscriptions. This action is available in the right-hand column under Subscribe. Muting notifications for an object stops you from receiving mail from that object, except for mail triggered by Send me an email rules in Herald.

This feature is "on probation" and may be removed in the future if it proves more confusing than useful.

See https://secure.phabricator.com/T13068 for some discussion.

Task Close Date

Maniphest now explicitly tracks a closed date (and closing actor) for tasks. This data will be built retroactively by a migration during the upgrade. This will take a little while if you have a lot of tasks (see "Migrations" below).

The Maniphest search UI can now order by close date and filter tasks closed between particular dates or closed by certain users. The maniphest.search API has similar support, and returns this data in result sets. This data is also now available via Export Data.

For closed tasks, the main task list view now shows a checkmark icon and the close date. For open tasks, the view retains the old behavior (no icon, modified date).

Require secure mail

Herald rules can now Require secure mail. You can use this action to prevent discussion of sensitive objects (like security bugfixes) from being transmitted via email.

To use this feature, you'll generally write a Herald rule like this:

Global Rule for Revisions
When: 
[ Projects ][ include ][ Security Fix ]
Take actions:
[ Require secure mail ]

Users will still be notified that the corresponding object has been updated, but will have to follow a link in the mail to view details over HTTPS.

This may be useful if you use mailing lists with wide distributions or model sophisticated attackers as threats.

Note that this action is currently not stateful: the rule must keep matching every update to keep the object under wraps. This may change in the future. This flag may also support continuing to send mail content if GPG is configured in some future release.

I expect that we will utilize this feature to improve the secrecy of critical security bugs which are kept private until a security patch has been released.

Minor
  • Slightly reduced the level of bleeding/explosions on the Maniphest burnup chart.
  • Added date range filtering to activity logs, pull logs, and push logs.
  • Push logs are now more human readable.
  • "Assign to" should now work properly in the bulk editor.
  • Fixed an issue with comment actions that affect numeric fields like "Points" in Maniphest.
  • maniphest.edit should now accept null to unassign a task, as suggested by the documentation.
  • GitLFS over SSH no longer fatals on a bad getUser() call.
  • Commits and revisions may now declare that they revert one another via Reverts <commit|revision>, and reverting or reverted changes are shown more clearly in the timeline.
Selenium Ruby framework deprecatedhttps://phabricator.wikimedia.org/phame/post/view/79/zeljkofilipin (Željko Filipin)2017-10-30T13:44:09+00:002020-03-09T09:14:42+00:00

This is your friendly but final warning that we are replacing Selenium tests written in Ruby with tests in Node.js. There will be no more reminders. The Ruby stack will no longer be maintained. For more information see T139740 and T173488.

Extensive documentation is available at mediawiki.org. If you need help with the migration, I am available for pairing and code review (zfilipin in Gerrit, zeljkof in #wikimedia-releng).

To see how to write a test, watch the Selenium tests in Node.js tech talk (J78).

Tech talk: Selenium tests in Node.jshttps://phabricator.wikimedia.org/phame/post/view/78/zeljkofilipin (Željko Filipin)2017-10-27T12:04:18+00:002017-11-07T11:32:31+00:00

Who 👨‍💻

Željko Filipin, Engineer (Contractor) from Release Engineering team. That's me! 👋

What 📆

Selenium tests in Node.js. We will write a new simple test for a MediaWiki extension. An example: https://www.mediawiki.org/wiki/Selenium/Node.js/Write

When ⏳

Tuesday, October 31, 16:00 UTC (E766).

Where 🌍

The internet! The event will be streamed and recorded. Details coming soon.

Why 💻

We are deprecating Ruby Selenium framework (T173488).

See you there!

Video 🎥

Youtube, Commons (coming soon)

Selenium Ruby framework deprecation (September)https://phabricator.wikimedia.org/phame/post/view/75/zeljkofilipin (Željko Filipin)2017-09-25T15:27:52+00:002017-09-25T15:41:26+00:00

Originally an email sent on September 25 2017 to qa, engineering and wikitech-l mailing lists.

This is your friendly but penultimate warning that we are replacing Selenium tests written in Ruby with tests in Node.js. There will be only one more reminder, in October. In the meantime, only critical problems will be resolved in the Ruby stack. After October we will no longer maintain it.

You can follow task T139740 or Release Engineering blog for more information.

Extensive documentation is available at mediawiki.org. If you need help with the migration, I am available for pairing and code review (zfilipin in Gerrit, zeljkof in #wikimedia-releng).

Selenium Ruby framework deprecationhttps://phabricator.wikimedia.org/phame/post/view/74/zeljkofilipin (Željko Filipin)2017-09-25T15:14:04+00:002017-09-25T15:14:04+00:00

Originally an email sent on August 23 2017 to qa, engineering and wikitech-l mailing lists.

As announced in April, we are replacing Selenium tests written in Ruby with tests in Node.js. Now is the last responsible moment to make the move. There will be two more reminders, in September and October. In the meantime, only critical problems will be resolved in the Ruby stack. After October we will no longer maintain it. You can follow task T139740 for more information. Extensive documentation is available at mediawiki.org. If you need help with the migration, I am available for pairing and code review (zfilipin in Gerrit, zeljkof in #wikimedia-releng).

Selenium tests in Node.jshttps://phabricator.wikimedia.org/phame/post/view/73/zeljkofilipin (Željko Filipin)2017-09-25T14:57:49+00:002017-09-25T15:43:33+00:00

Originally an email sent on April 3 2017 to qa, engineering and wikitech-l mailing lists.

TL;DR

You can now write Selenium tests in Node.js! Learn more about it at https://www.mediawiki.org/wiki/Selenium/Node.js

Introduction

Five years ago we introduced browser tests using Selenium and a Ruby based stack. It has worked great for some teams, and not so great for others. Last year we talked to people from several teams and ran a survey. The outcome is a preference toward using a language developers are familiar with: JavaScript/Node.js.

After several months of research and development, we are proud to announce support for writing tests in Node.js. We have decided to use WebdriverIO. It is already available in MediaWiki core and supports running tests for extensions.

You can give it a try in MediaWiki-Vagrant:

vagrant up
vagrant ssh
sudo apt-get install chromedriver
export PATH=$PATH:/usr/lib/chromium
cd /vagrant/mediawiki
xvfb-run npm run selenium

Documentation

Extensive details are available on the landing page: https://www.mediawiki.org/wiki/Selenium/Node.js

Future

We plan to replace the majority of Selenium tests written in Ruby with tests in Node.js in the next 6 months. We can not force anybody to rewrite existing tests, but we will offer documentation and pairing sessions for teams that need help. After 6 months, teams that want to continue using the Ruby framework will be able to do so, but without support from the Release Engineering team.

I have submitted a skill share session for Wikimedia Hackathon 2017 in Vienna. If you would like to pair on Selenium tests in person, that would be a great time.

The list of short term actions is in task T139740.

Thanks

I would like to thank several people for reviews, advice and code: Jean-Rene Branaa, Dan Duvall, Antoine Musso, Jon Robson, Timo Tijhof. (Names are sorted alphabetically by last name. Apologies to people I have forgotten.)

New feature: Embed videos from Commons into Phabricator markuphttps://phabricator.wikimedia.org/phame/post/view/18/mmodell (Mukunda Modell)2017-06-01T23:49:27+00:002017-06-14T04:36:13+00:00

I just finished deploying an update to Phabricator which includes a simple but rather useful feature:

T116515: Enable embedding of media from Wikimedia Commons

You can now embed videos from Wikimedia Commons into any Task, Comment or Post. Just paste the Commons URL to embed the standard Commons player in an iframe. For example, this URL:

https://commons.wikimedia.org/wiki/File:Saving_and_sharing_search_queries_in_Phabricator.webm

Produces this embedded video:

Sponsored Phabricator Improvementshttps://phabricator.wikimedia.org/phame/post/view/9/mmodell (Mukunda Modell)2016-07-27T10:44:53+00:002021-06-05T15:46:47+00:00

In T135327, the WMF Technical Collaboration team collected a list of Phabricator bugs and feature requests from the Wikimedia Developer Community. After identifying the most promising requests from the community, these were presented to Phacility (the organization that builds and maintains Phabricator) for sponsored prioritization.

I am very pleased to report that we are already seeing the benefits of this initiative. Several sponsored improvements have landed on https://phabricator.wikimedia.org/ over the past few weeks. For an overview of what's landed recently, read on!

Fixed

The following tasks are now resolved:

Notice three of those have task numbers lower than 2000. Those long-standing tasks date from the first months of WMF's Phabricator evaluation and RFC period. When those tasks were originally filed, Phabricator was just a test install running in WMF Labs. For me, it's especially satisfying to close so many long-standing issues that have affected many of us for more than a year.

Work in Progress

Several more issues were identified for sponsorship which are still awaiting a complete solution. Some of these are at least partially fixed and some are still pending. You can find out more details by reading the comments on each task linked below.

Other recent changes

Besides the sponsored features and bug fixes, there are several other recent improvements which are worth mentioning.

Milestones now include Next / Previous navigation

Recurring calendar events also gained next / previous navigation

New feature for Maniphest tasks: dependency graph

This very helpful feature displays a graphical representation of a task's Parents and Subtasks.

Example screenshot of the Phabricator Task Graph (194×329 px, 14 KB)

Initially there was an issue with this feature that made tasks with many relationships unable to load. This was exacerbated by the historical use of "tracking tasks" in the Wikimedia Bugzilla context. Thankfully after a quick patch from @epriestley (the primary author of Phabricator) and lots of help and testing from @Danny_B and @Paladox, @mmodell was able to deploy a fix for the issue a little over 24 hours after it was discovered.

Here's to yet more fruitful collaborations with upstream Phabricator!

Code Review Office Hourshttps://phabricator.wikimedia.org/phame/post/view/5/mmodell (Mukunda Modell)2016-05-09T21:50:08+00:002016-05-15T09:59:10+00:00

Starting Thursday May 12th, 13:00 PDT ( 20:00 GMT ) we will be having the first weekly Code Review office hours on freenode IRC in the #wikimedia-codereview channel.

Event details: E179: Code Review Office Hours
Background: T128371: Set up Code Review office hours

Thanks to everyone who's been helping to organize this. We would welcome people to submit your patches for review as well as reviewers who can spare a few minutes to provide feedback and hopefully merge some patches!

If you can't make it during the scheduled time period then please feel free to suggest other times that would be better for you. I intend to set up one or two other weekly time slots, at least one of which should be at a time that's more convenient for people in Europe and Asia.

Looking forward to seeing you in #wikimedia-codereview

What's new: Lots of improvements on phabricator.wikimedia.orghttps://phabricator.wikimedia.org/phame/post/view/1/mmodell (Mukunda Modell)2016-02-23T00:23:37+00:002016-03-20T12:32:36+00:00

Not a lot has changed for Wikimedia's instance of Phabricator over the past few months. That's because a lot has been happening behind the scenes, as well as upstream at Phacility. Members of the Release-Engineering-Team and Team-Practices group have been working since December 2015 to integrate various upstream changes; however, nothing was released to our production instance because there were so many important features that were in progress and not yet fully usable. Additionally, we had to figure out exactly how these features would fit with the specific needs of our project and test a lot of functionality to be sure that we would not break anyone's workflows.

So our Phabricator instance has been relatively unchanged since November of last year. This all changed last Wednesday night (Thursday February 18th, 01:00 UTC) when we unleashed several months of changes into production. If you use phabricator.wikimedia.org regularly then you have probably already noticed some of the more obvious improvements.

A whole lot of hard work went into this release. Thankfully, everyone's hard work seems to have paid off, as we only encountered a couple of relatively small issues which were fixed quickly afterwards.

This post is to fill everyone in about what's changed and what you can expect from some of the exciting new functionality that has been added with this release.

Custom Forms

  • Some likely use cases include:
    • Custom markup at the top of forms (T115017)
    • Pre-filling information in fields
    • Hiding certain fields (T120903)
    • Bug reporting and template tasks can be entered more easily (T91538)
  • A great deal of caution is required when using this new functionality.
    • Form creation is limited to admins because it is currently too easy to accidentally override existing forms when someone creates a new form without fully understanding the subtleties of the new system
    • @mmodell can answer questions about what is possible.
    • Anyone with a use-case for a custom form can request that one be set up by a Phabricator admin. We have not established a formal process for this yet.

Customizable Project Pages

It's now possible to customize individual project pages to meet the needs of each type of project or the needs of specific teams.

  • Custom links can be added to the navigation menu. This is great for prominently linking to a project wiki page or other URLs that are relevant to a project.
  • The default page that is shown when visiting a project can be configured. For some projects, it makes more sense to go directly to the workboard, for others, the project details page is more appropriate.
  • We can disable the workboard entirely for certain projects (useful for 'tag' type projects)
  • There is an API for developing custom panels to be placed on project pages or as part of the navigation menus. These are new and unstable, but it seems like a promising way for us to extend Phabricator with new functionality in the future.

Milestones & Sub-Projects

Projects can now be nested. There are two new types of projects in Phabricator and they could prove to be really useful for organizing all of the things. Sub-projects are just like regular projects, but nested inside of an existing project. Milestones are a special type of sub-project that can be used to represent a sprint or a software release. There are a few somewhat complex rules about how project membership, policies and tasks are affected by sub-projects. There is detailed coverage in the Phabricator Projects Documentation and we have attempted to explain some of the implications here:

Comparison of Sub-projects vs. Milestones

  • Sub-projects have members, milestones do not.
  • Parent-projects' members are the union of all sub-projects' members. When adding the first sub-project to a parent, all existing members get moved to the subproject.
  • Tasks can only exist in a single milestone, but can exist in multiple sub-projects.
  • Milestones exist as columns within the parent project's workboard, sub-projects have their own workboard.

Sub-projects in detail

  • Projects can have sub-projects. A subproject behaves like a regular project, and moving a task between a project and sub-project is the same as moving a task between two unrelated projects, except:
    • Filtering by project matches all Sub-Project tasks.
    • Moving a task from a project to a sub-project does auto-remove the parent project.
  • It's very easy to navigate from viewing a sub-project to viewing a project, via the breadcrumb trail (one click, always in the same place, always present; and then a page reload).
  • It's possible, and maybe easier than searching, but not trivial, to navigate from projects to sub-projects. You have to click on sub-projects in the menu, wait for page reload, see the list of projects, identify the one you want, click on it, and wait for page reload.
  • Sub-projects often appear in the UI as Project > sub-project, but they appear in name completion as Sub-project, so if you name your sub-project "bugs", it will be really confusing in completion.
    • Hopefully we will get this fixed so that completion shows the parent project.
  • A task can belong to two different sub-projects within the same project.

Milestones are also regular projects, except:

  • They can be a child of a project or sub-project, but can't be a child of another milestone.
  • Milestones also appear as columns in their parent project, and so tasks in a project can be moved to milestones via drag and drop.
  • A task can't belong to both a project and to a milestone in that project; if it's in the milestone, adding the milestone's parent project to it removes the milestone (but, possible bug, in the UI it still appears in the Milestone's column).
  • Milestone names are not directly available in autocomplete. Instead, you see the parent (sub)project, followed by the Milestone name in parenthesis.
  • You can't assign a new task to a project and to a milestone in that project in one action; it takes several full steps.
  • There's some UI for auto-numbering milestones in sequence.

Story points is now built in to Phabricator

Previously this functionality was provided by a custom field and rPHSP phabricator-Sprint

  • All tasks will show a story point field by default
    • A custom form could be created to restrict this per project
  • All numeric story points have been transitioned to the new field, the old story points field is now disabled.

Other new features and bugs fixed

  • Auto-completion of usernames and projects in all markup fields & comments. (T876)
  • Non-members can watch projects(T77228)
  • The "Security" field on tasks is now deprecated. Use the "Report security issue" form instead of submitting a regular task with "security" set to "Software security bug."
  • It's now possible to make multiple changes to a task from the comment form instead of using the advanced edit form or submitting multiple times.
  • Marking a task as resolved no longer re-assigns it (T84833)

Thanks to everyone who helped out testing this release

This couldn't have happened without everyone's help <3

Specifically I'd like to thank: