Termbox SSR broken on Test Wikidata (since k8s migration? unclear)
Closed, Resolved (Public)

Description

Steps to reproduce:

Expected:
A server-side rendered termbox (non-interactive).

Actual:
No termbox.

There are some messages in Logstash that are probably related, for instance:

message: Wikibase\View\Termbox\Renderer\TermboxRemoteRenderer: Problem requesting from the remote server
errormessage: Request failed with status 0. Usually this means network failure or timeout

Event Timeline

Probably worth mentioning that it’s also broken on Test Wikidata (example item), which is configured somewhat differently:

InitialiseSettings.php
'wmgWikibaseSSRTermboxServerUrl' => [
	'wikidatawiki' => 'http://localhost:6008/termbox',
	'testwikidatawiki' => 'http://termbox-test.staging.svc.eqiad.wmnet:3031/termbox',
],

(Though localhost:6008 might just be a proxy for another termbox svc, not sure.)

(Later edit: at the time I wrote this comment, I thought it was also broken on www.wikidata.org. In fact only Test Wikidata is broken.)

Yes, localhost:6008 is pointing to termbox.discovery.wmnet:4004 in production.

The problem doesn't seem to be in termbox itself, since we could both fetch the data from the service without issues. So the issue doesn't seem to be related to the switch to using MediaWiki on k8s as a backend for termbox.

Sorry, I’m an idiot and couldn’t read the page properly. The termbox SSR is actually working fine on Wikidata – all of this is server-rendered termbox:

[screenshot: image.png]

It’s only broken on Test Wikidata – this is what a truly missing termbox looks like:
[screenshot: image.png]

Lucas_Werkmeister_WMDE renamed this task from "Termbox SSR broken since k8s migration" to "Termbox SSR broken on Test Wikidata (since k8s migration? unclear)". Aug 24 2023, 11:32 AM
Lucas_Werkmeister_WMDE updated the task description.

Looks like Test Wikidata (which is mw-on-k8s) can’t talk to the Termbox SSR (@Joe says in IRC it’s missing an egress rule):

lucaswerkmeister-wmde@deploy1002 ~ $ sudo mw-debug-repl testwikidatawiki
Finding a mw-debug pod in eqiad...
Now running shell.php for testwikidatawiki inside pod/mw-debug.eqiad.pinkunicorn-8477b6d89d-8r4bc...
Psy Shell v0.11.10 (PHP 7.4.33 — cli) by Justin Hileman
> $rf = mws()->getHttpRequestFactory()
= MediaWiki\Http\HttpRequestFactory {#3812}

> $req = $rf->create( 'http://termbox-test.staging.svc.eqiad.wmnet:3031/termbox?entity=Q229877&revision=630197&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ229877&preferredLanguages=en', [ 'method' => 'GET' ], 'Lucas Werkmeister (WMDE) manual testing' )
= GuzzleHttpRequest {#3961}

> $res = $req->execute()
= Status {#3986
    +cleanCallback: false,
    +value: & 0,
    +success: & [],
    +successCount: & 0,
    +failCount: & 0,
  }

> $res->getErrors()
= [
    [
      "type" => "error",
      "message" => "http-curl-error",
      "params" => [
        "Failed to connect to termbox-test.staging.svc.eqiad.wmnet port 3031: Connection timed out",
      ],
    ],
    [
      "type" => "error",
      "message" => "http-bad-status",
      "params" => [
        "0",
        "Error",
      ],
    ],
  ]
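
For anyone repeating this check outside the PHP REPL, here is a minimal shell sketch of the same idea. It assumes curl is available wherever you run it, which this task does not confirm. The helper name classify_http_code is hypothetical; it only interprets the status code curl prints, where 000 means no HTTP response was received at all (the "status 0" case in the Logstash message).

```shell
#!/bin/sh
# Hypothetical helper: turn curl's %{http_code} output into a short diagnosis.
# curl prints 000 when it never received an HTTP response
# (connection timeout, connection refused, DNS failure, ...).
classify_http_code() {
  case "$1" in
    000) echo "no response (network failure or timeout)" ;;
    2??) echo "ok" ;;
    *)   echo "http error $1" ;;
  esac
}

# Usage sketch (needs network access to the service, so left commented out):
# url='http://termbox-test.staging.svc.eqiad.wmnet:3031/termbox?entity=Q229877&revision=630197&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ229877&preferredLanguages=en'
# code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url")
# classify_http_code "$code"
```

A result of "no response" would be consistent with a blocked egress path, but it does not prove it on its own, since a service that is down produces the same code.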

Change 952191 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Add egress rules for termbox-test

https://gerrit.wikimedia.org/r/952191

Change 952191 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Add egress rules for termbox-test

https://gerrit.wikimedia.org/r/952191

Mentioned in SAL (#wikimedia-operations) [2023-08-24T12:16:38Z] <cgoubert@deploy1002> Started scap: Redeploying mw-on-k8s - T344904

Mentioned in SAL (#wikimedia-operations) [2023-08-24T12:18:46Z] <cgoubert@deploy1002> Finished scap: Redeploying mw-on-k8s - T344904 (duration: 02m 07s)

Change 952203 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-debug: Use global mw egress

https://gerrit.wikimedia.org/r/952203

Change 952203 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: Copy global mw egress

https://gerrit.wikimedia.org/r/952203

Now working for testwikidatawiki on all mw-on-k8s deployments including mw-debug

cgoubert@deploy1002:/srv/deployment-charts/helmfile.d/services/mw-debug$ sudo mw-debug-repl testwikidatawiki
Finding a mw-debug pod in eqiad...
Now running shell.php for testwikidatawiki inside pod/mw-debug.eqiad.pinkunicorn-85f6f8c9dd-4dvxm...
Psy Shell v0.11.10 (PHP 7.4.33 — cli) by Justin Hileman
> $rf = mws()->getHttpRequestFactory()
= MediaWiki\Http\HttpRequestFactory {#3812}

> $req = $rf->create( 'http://termbox-test.staging.svc.eqiad.wmnet:3031/termbox?entity=Q229877&revision=630197&language=en&editLink=%2Fwiki%2FSpecial%3ASetLabelDescriptionAliases%2FQ229877&preferredLanguages=en', [ 'method' => 'GET' ], 'claime WMF manual testing' )
= GuzzleHttpRequest {#6330}

> $res = $req->execute()
= Status {#6294
    +cleanCallback: false,
    +value: & 200,
    +success: & [],
    +successCount: & 0,
    +failCount: & 0,
  }

> $res->getErrors()
= []

Clement_Goubert claimed this task.

Resolving, feel free to reopen if there are still any issues.

Still not working, I’m afraid – https://test.m.wikidata.org/wiki/Q469 still doesn’t have an SSR termbox after purging, and there are new logstash messages identical (I think) to the ones from before.

(Although at least the host changed, from mw-web.eqiad.main-f96c4cfb-r5lkw to mw-web.eqiad.main-85b4fff6db-njm27, indicating that a new replicaset was rolled out.)

Apparently the network policy is pretty old and I don’t see 10.192.0.195 in it, is that correct?

lucaswerkmeister-wmde@deploy1002 ~ $ kube_env mw-web eqiad; kubectl get networkpolicy mediawiki-main
NAME             POD-SELECTOR                 AGE
mediawiki-main   app=mediawiki,release=main   169d
lucaswerkmeister-wmde@deploy1002 ~ $ kubectl describe networkpolicy mediawiki-main | grep -Fc 10.192.0.195
0

(It does contain some of the other IPs seen in _mediawiki-common_/global.yaml, such as 10.192.48.59, but of course it’s still possible I’m looking at the wrong thing entirely.)

Hm, in kube_env mw-debug eqiad, the sole network policy is also 169d old, but does contain 10.192.0.195. So maybe the age of the networkpolicy isn’t a deciding factor (I guess it can get patched / updated without having its age reset), but still, the problem could be that the IP address didn’t make it into the mw-web network policy.
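
The comparison between the two environments could be scripted roughly as below. This is only a sketch: policy_has_ip is a hypothetical helper, and the loop assumes that kube_env can be called repeatedly in one shell and that describing every network policy in the namespace is acceptable. The helper uses grep -Fw rather than plain -F so that a shorter address such as 10.192.0.19 does not count as a match inside 10.192.0.195.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl describe networkpolicy` output on stdin
# and reports whether the given IP appears in it as a whole word.
policy_has_ip() {
  if grep -Fwq -- "$1"; then
    echo "present"
  else
    echo "missing"
  fi
}

# Usage sketch (needs a deployment host with kube_env and kubectl, so left
# commented out; the environment names are the ones mentioned in this task):
# for env in mw-web mw-debug; do
#   kube_env "$env" eqiad
#   printf '%s: ' "$env"
#   kubectl describe networkpolicy | policy_has_ip 10.192.0.195
# done
```

If one environment reports "missing" while another reports "present", that points at a deployment that has not yet picked up the new egress rule.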

Apparently @Joe fixed mw-web (it wasn’t deployed earlier), now it’s working \o/