Page MenuHomePhabricator

Cannot vote on votewiki
Closed, ResolvedPublic

Description

Probably related to T209800: Cannot edit elections on votewiki SecurePoll. It just times out with a "Service Temporarily Unavailable" error.

UBN since enwiki ArbCom elections began at 00:00 UTC.

Note: voting was ok for a previous election that ended on 14 November 2018 - that election did not have encryption (GPG) enabled

Steps to reproduce

  1. Have an account that is eligible to vote in this election (we can add exemptions if needed for testing purposes, just tag @jrbs and/or @Jalexander)
  2. Navigate to https://en.wikipedia.org/wiki/Special:SecurePoll/vote/750 (a test election for the purposes of this task)
  3. Click "Go to the voting server" (this transfers your credentials to check against the eligibility parameters)
  4. Click "Submit vote"

Reproduce from eval.php --wiki votewiki

$context = new SecurePoll_Context;
$e = $context->getElection( 730 );
$c = $e->getCrypt();
$stat = $c->encrypt( 'test' );

Event Timeline

jrbs triaged this task as Unbreak Now! priority.Nov 19 2018, 12:06 AM
jrbs created this task.
Restricted Application added subscribers: Liuxinyu970226, TerraCodes, Aklapper. · View Herald TranscriptNov 19 2018, 12:06 AM
jrbs updated the task description. (Show Details)Nov 19 2018, 12:07 AM

More context at T209656

Huji added a comment.Nov 19 2018, 12:15 AM

Server logs?

jrbs added a subscriber: Bawolff.EditedNov 19 2018, 12:19 AM

Server logs?

I don't have them to hand but I think they're similar to those @Bawolff found T209656#4752391

(i.e. FastCGI problems. Working on it with James)

Not sure if this is enough but what I was seeing in logstash. I have a feeling there are other log issues that aren't appear in there (at least with my current search)

{
  "_index": "logstash-2018.11.19",
  "_type": "apache2",
  "_id": "AWcpURup0aq-rJxtfUAP",
  "_version": 1,
  "_score": null,
  "_source": {
    "severity": 3,
    "level": "ERROR",
    "channel": "proxy_fcgi",
    "pid": "3128",
    "program": "apache2",
    "thread": "140047785346816",
    "message": "AH01067: Failed to read FastCGI header",
    "type": "apache2",
    "priority": 187,
    "logsource": "mw1266",
    "clientport": "35651",
    "normalized_message": "AH01067: Failed to read FastCGI header",
    "tags": [
      "syslog",
      "es",
      "normalized_message_untrimmed"
    ],
    "referrer": "https://vote.wikimedia.org/wiki/Special:SecurePoll/vote/710",
    "@timestamp": "2018-11-19T00:14:00.000Z",
    "clientip": "10.64.32.69",
    "@version": "1",
    "host": "mw1266",
    "facility": 23,
    "severity_label": "Error",
    "timestamp": "Nov 19 00:14:00",
    "facility_label": "local7"
  },
  "fields": {
    "@timestamp": [
      1542586440000
    ]
  },
  "sort": [
    1542586440000
  ]
}
{
  "_index": "logstash-2018.11.19",
  "_type": "apache2",
  "_id": "AWcpURup0aq-rJxtfUAQ",
  "_version": 1,
  "_score": null,
  "_source": {
    "severity": 3,
    "level": "ERROR",
    "channel": "proxy_fcgi",
    "pid": "3128",
    "program": "apache2",
    "thread": "140047785346816",
    "message": "(104)Connection reset by peer: AH01075: Error dispatching request to : ",
    "type": "apache2",
    "priority": 187,
    "logsource": "mw1266",
    "clientport": "35651",
    "normalized_message": "(104)Connection reset by peer: AH01075: Error dispatching request to : ",
    "tags": [
      "syslog",
      "es",
      "normalized_message_untrimmed"
    ],
    "referrer": "https://vote.wikimedia.org/wiki/Special:SecurePoll/vote/710",
    "@timestamp": "2018-11-19T00:14:00.000Z",
    "clientip": "10.64.32.69",
    "@version": "1",
    "host": "mw1266",
    "facility": 23,
    "severity_label": "Error",
    "timestamp": "Nov 19 00:14:00",
    "facility_label": "local7"
  },
  "fields": {
    "@timestamp": [
      1542586440000
    ]
  },
  "sort": [
    1542586440000
  ]
}
Huji added a comment.Nov 19 2018, 1:53 AM

I had a feeling it is related to FastCGI as well (just like the other two mentioned above). Sadly, I have no knowledge of FastCGI troubleshooting. I will quietly monitor this thread. Positively surprised that this did not happen during T207560.

4nn1l2 added a subscriber: 4nn1l2.Nov 19 2018, 2:00 AM

I had a feeling it is related to FastCGI as well (just like the other two mentioned above). Sadly, I have no knowledge of FastCGI troubleshooting. I will quietly monitor this thread. Positively surprised that this did not happen during T207560.

Its probably not FastCGI itself, it is probably HHVM exploding and then failing to communicate properly to FastCGI (which causes fastcgi to log an error). Or at least that would be my guess

jrbs added a comment.Nov 19 2018, 2:34 AM

I'm currently trying to set up a test election for folks to reproduce. I screwed it up the first time around on enwiki so just trying to sort that out.

jrbs updated the task description. (Show Details)Nov 19 2018, 2:44 AM
jrbs updated the task description. (Show Details)

As an aside, telling x-wikimedia-debug to send me to a php7 seemed to make it work, so definitely seems hhvm related.

So looking in the logs, it seems like a log event is generated for importing the key into gpg, but there is no log event for actually encrypting the voting record (The next step after importing the key). This makes me wonder if its an issue with shelling out to gpg.

Further proof of this:

bawolff@mwmaint1002:/srv/mediawiki/php-1.33.0-wmf.4/maintenance$ mwscript eval.php votewiki 
> $context = new SecurePoll_Context;
> $e = $context->getElection( 730 );
> $c = $e->getCrypt();
> $stat = $c->encrypt( 'test' );
[process hangs]
Bawolff updated the task description. (Show Details)Nov 19 2018, 3:55 AM
revi added a subscriber: revi.Nov 19 2018, 4:01 AM
Bawolff updated the task description. (Show Details)Nov 19 2018, 4:44 AM
jrbs updated the task description. (Show Details)Nov 19 2018, 4:46 AM

As an aside, telling x-wikimedia-debug to send me to a php7 seemed to make it work, so definitely seems hhvm related.

So looking in the logs, it seems like a log event is generated for importing the key into gpg, but there is no log event for actually encrypting the voting record (The next step after importing the key). This makes me wonder if its an issue with shelling out to gpg.
Further proof of this:

bawolff@mwmaint1002:/srv/mediawiki/php-1.33.0-wmf.4/maintenance$ mwscript eval.php votewiki 
> $context = new SecurePoll_Context;
> $e = $context->getElection( 730 );
> $c = $e->getCrypt();
> $stat = $c->encrypt( 'test' );
[process hangs]

FTR forcing to PHP7 with debug also fixes the edit screen issue. Not sure if that reaches out to GPG (I don't think so? but wouldn't shock me) or if that just means the gpg hang issue is semi unrelated

Joe added a subscriber: Joe.Nov 19 2018, 8:56 AM

So looking in the logs, it seems like a log event is generated for importing the key into gpg, but there is no log event for actually encrypting the voting record (The next step after importing the key). This makes me wonder if its an issue with shelling out to gpg.
Further proof of this:

bawolff@mwmaint1002:/srv/mediawiki/php-1.33.0-wmf.4/maintenance$ mwscript eval.php votewiki 
> $context = new SecurePoll_Context;
> $e = $context->getElection( 730 );
> $c = $e->getCrypt();
> $stat = $c->encrypt( 'test' );
[process hangs]

I can confirm that if you force the use of PHP7 (PHP=/usr/bin/php7.0 when running this script, encryption works and the $stat variable shows the correct gpg encrypted message. So the issue seems to be HHVM specific. I'm off today so I guess someone else can look into this, probably strace(1) on the above script can give some hints on what goes on with HHVM.

jrbs added a comment.Nov 19 2018, 9:06 AM

Worth noting here - coordinators agreed to push voting back on the elections by 24 hours (i.e. 00:00 UTC on November 20).

fsero added a subscriber: fsero.Nov 19 2018, 10:45 AM

i took a quick look with strace and it seems is hanging launching this

/bin/bash /srv/mediawiki/php-1.33.0-wmf.4/includes/shell/limit.sh 'gpg' --homedir '/tmp/securepoll-AAAAA' --trust-model always --batch --yes  --import '/tmp/securepoll-AAAAAA/key' 2>&1 MW_INCLUDE_STDERR=;MW_CPU_LIMIT=50; MW_CGROUP='/sys/fs/cgroup/memory/mediawiki/job'; MW_MEM_LIMIT=1048576; MW_FILE_SIZE_LIMIT=524288; MW_WALL_CLOCK_LIMIT=180; MW_USE_LOG_PIPE=yes

still don't know why

Joe added a comment.Nov 19 2018, 10:46 AM

I just succesfully obtained an encrypted message by running the same script and disabling the light_process feature of HHVM.

$ PHP="/usr/bin/hhvm -d hhvm.server.light_process_count=0" mwscript eval.php votewiki 
> $context = new SecurePoll_Context;
> $e = $context->getElection( 730 );
>  $c = $e->getCrypt();
> $stat = $c->encrypt( 'test' );
>

I think this happens because we set

hhvm.server.light_process_file_prefix = /var/tmp/hhvm

which means processes launched from HHVM via the light process mechanism cannot reach /tmp/securepoll$id

Joe added a comment.Nov 19 2018, 10:54 AM

So, redefining $wgSecurePollTempDir doesn't do the trick:

$ mwscript eval.php votewiki 
> $wgSecurePollTempDir="/var/tmp/hhvm";

> $context = new SecurePoll_Context;

> $e = $context->getElection( 730 );

> $c = $e->getCrypt();

> $stat = $c->encrypt( 'test' );

> var_dump($stat->value);
NULL
> var_dump($stat->errors[0]['message']);
string(22) "securepoll-no-gpg-home"
Joe added a comment.Nov 19 2018, 11:22 AM

I tried to debug further what the problem is, given I'm not inclined to disable site-wide the use of lightprocesses (although now it should be less of an issue than it used to be).

Running the above script with strace I see

[pid 247353] access("/var/tmp/hhvm/securepoll-6e57aea6a109b306953f61da67b90a7e9b1979e8/private-keys-v1.d/<redacted>.key", R_OK) = -1 ENOENT (No such file or directory)
[pid 247353] write(8, "ERR 67108881 No secret key <GPG "..., 38) = 38

after which the process (probably gpg itself) dies, and the parent keeps waiting for 10 seconds indefinitely (which happens, according to limit.sh, because some child of the shell never returned).

Joe added a comment.Nov 19 2018, 12:22 PM

Some more information before I get away on a vacation day:

looking at strace of HHVM's execution that hangs, I see the gpg-agent process reading what follows:

read(7, "\1\0\0\0\0\2\0\0\0\0\0\0 \0\0\0gnupg_spawn_agen"..., 272) = 48
pselect6(8, [3 4 5 6 7], NULL, NULL, NULL, {[], 8}) = 1 (in [7])
read(7, "\1\0\0\0\0\2\0\0\0\0\0\0000\0\0\0.#lk0x000055d38b"..., 272) = 64
pselect6(8, [3 4 5 6 7], NULL, NULL, NULL, {[], 8}) = 1 (in [7])
read(7, "\1\0\0\0\0\2\0\0\0\0\0\0000\0\0\0.#lk0x000055d38b"..., 272) = 64
pselect6(8, [3 4 5 6 7], NULL, NULL, NULL, {[], 8}) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=251764, si_uid=33} ---
rt_sigreturn({mask=[HUP INT USR1 USR2 TERM CONT]}) = -1 EINTR (Interrupted system call)
pselect6(8, [3 4 5 6 7], NULL, NULL, NULL, {[], 8}) = ? ERESTARTNOHAND (To be restarted if no handler)

and that's it, while when I disable light processes I get

pselect6(8, [3 4 5 6 7], NULL, NULL, NULL, {[], 8}) = 1 (in [7])
read(7, "\1\0\0\0\0\2\0\0\0\0\0\0 \0\0\0gnupg_spawn_agen"..., 272) = 48
pselect6(8, [3 4 5 6 7], NULL, NULL, NULL, {[], 8}) = 1 (in [7])
read(7, "\1\0\0\0\0\2\0\0\0\0\0\0000\0\0\0.#lk0x0000558ce8"..., 272) = 64
pselect6(8, [3 4 5 6 7], NULL, NULL, NULL, {[], 8}) = 1 (in [7])
read(7, "\1\0\0\0\0\2\0\0\0\0\0\0000\0\0\0.#lk0x0000558ce8"..., 272) = 64
pselect6(8, [3 4 5 6 7], NULL, NULL, NULL, {[], 8}) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=252495, si_uid=33} ---
rt_sigreturn({mask=[HUP INT USR1 USR2 TERM CONT]}) = -1 EINTR (Interrupted system call)
pselect6(8, [3 4 5 6 7], NULL, NULL, NULL, {[], 8}) = 1 (in [7])
read(7, "\1\0\0\0\0\2\0\0\0\0\0\0000\0\0\0.#lk0x000055d5bb"..., 272) = 64
pselect6(8, [3 4 5 6 7], NULL, NULL, NULL, {[], 8}) = 1 (in [7])
read(7, "\1\0\0\0\0\2\0\0\0\0\0\0 \0\0\0pubring.kbx.lock"..., 272) = 48
...

and many more lines, showing that gpg-agent just stops being called at some point by gpg itself that must fail somehow. I wasn't able to find that, but the straces for the two runs are avaiable at mwmaint:/root/T209802/strace.{lightprocess,nolightprocess} if someone else wants to take a further look.

For now, the only quick fix I found is to set in the hhvm settings light_process_count to 0; that shouldn't be as bad as it used to be (we don't shellout to tidy anymore), but we don't really know what performance hit that could cause anyways. If a decision is made to go down that route, I advise to first try it on the canaries, and then on the whole cluster after some evaluation.

Installing the package gnupg1 and using

$wgSecurePollGPGCommand = 'gpg1';

causes this to be fixed, I have tested it on mwmaint1002. Apparently lightprocess hangs waiting for gpg-agent to exit, which gpg2 autostarts whenever a private key operation is done.

Change 474679 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] In SecurePoll use gpg1 to avoid gpg-agent autostart

https://gerrit.wikimedia.org/r/474679

Change 474681 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Install gpg 1 on app servers for SecurePoll extension

https://gerrit.wikimedia.org/r/474681

Change 474679 merged by jenkins-bot:
[operations/mediawiki-config@master] In SecurePoll use gpg1 to avoid gpg-agent autostart

https://gerrit.wikimedia.org/r/474679

Mentioned in SAL (#wikimedia-operations) [2018-11-19T13:03:09Z] <pmiazga@deploy1001> Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:474679]|In SecurePoll use gpg1 to avoid gpg-agent autostart (T209802)]] (duration: 00m 48s)

Change 474681 merged by Muehlenhoff:
[operations/puppet@production] Install gpg 1 on app servers for SecurePoll extension

https://gerrit.wikimedia.org/r/474681

MoritzMuehlenhoff lowered the priority of this task from Unbreak Now! to Normal.Nov 19 2018, 1:16 PM

Installing the package gnupg1 and using

$wgSecurePollGPGCommand = 'gpg1';

causes this to be fixed, I have tested it on mwmaint1002. Apparently lightprocess hangs waiting for gpg-agent to exit, which gpg2 autostarts whenever a private key operation is done.

Piotr deployed Tim's patch to switch $wgSegurePollPGPCommand as part of European Mid-day SWAT and gpg1 is now installed on all app servers via Puppet. This unbreaks voting, so I'm lowering the priority to "Normal". Not sure if we should explicity track gnupg2 support with a custom task, as this is specific to HHVM, which we're moving away from? We can also simply close the bug for good.

I just submitted a vote to poll 750 successfully.

jrbs added a comment.Nov 19 2018, 7:35 PM

Installing the package gnupg1 and using

$wgSecurePollGPGCommand = 'gpg1';

causes this to be fixed, I have tested it on mwmaint1002. Apparently lightprocess hangs waiting for gpg-agent to exit, which gpg2 autostarts whenever a private key operation is done.

Piotr deployed Tim's patch to switch $wgSegurePollPGPCommand as part of European Mid-day SWAT and gpg1 is now installed on all app servers via Puppet. This unbreaks voting, so I'm lowering the priority to "Normal". Not sure if we should explicity track gnupg2 support with a custom task, as this is specific to HHVM, which we're moving away from? We can also simply close the bug for good.

Thanks so much, everyone - greatly appreciated on a tight schedule! I'm happy to call this resolved and to direct discussion around gnupg2 support to a new task just to keep things clean?

One thing that was confusing me was why timeout in limit.sh wasnt killing the process eventually. But after reading docs i guess that is because it didnt have a -k flag

jijiki added a subscriber: jijiki.Nov 19 2018, 10:03 PM
tstarling closed this task as Resolved.Nov 19 2018, 10:31 PM
tstarling claimed this task.

One thing that was confusing me was why timeout in limit.sh wasnt killing the process eventually. But after reading docs i guess that is because it didnt have a -k flag

The original gpg process, timeout and the sh -c run by HHVM all exit promptly, within a second. But HHVM does not properly wait on the sh -c process, leaving it as a zombie. I didn't trace the HHVM parent, so this is not confirmed, but presumably it is polling the limit.sh monitor process, waiting for it to exit. The monitor process is in an infinite sleep loop waiting for gpg-agent to exit so it can clean up the cgroup. So there is a bug in the interaction between limit.sh and HHVM LightProcess -- the monitor process is supposed to run in the background while PHP execution continues, but instead HHVM waits for it.

The workaround previously proposed, disabling LightProcess, would have left gpg-agent and the limit.sh monitor running forever in the background while HHVM continues normally. Not an ideal situation since presumably every SecurePoll encryption request would start its own gpg-agent.