
Termbox Error Logging Should Differentiate between RemoteRenderer and Service timeouts.
Closed, Resolved | Public | 2 Estimated Story Points

Description

While investigating T255410 we concluded that we should make it clearer to people investigating issues whether a timeout was due to:
A) a failure to get a response from the termbox Kubernetes service in time,
or
B) a response from termbox saying it was unable to contact Special:EntityData in time.

We intend that situation A should remain an ERROR from the MediaWiki TermboxRemoteRenderer, but that situation B should become only a NOTICE, because an ERROR will already have been emitted by the Kubernetes service.

Conceptually the RemoteRenderer (the Wikibase service calling the SSR service on Kubernetes) does the correct thing and falls back to the client-side rendered version.
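
The intended split between the two cases, together with the fallback, could be sketched roughly as follows. This is only an illustrative Python sketch, not the PHP code in TermboxRemoteRenderer: `SsrOutcome`, `log_level_for`, `render_or_fallback` and the placeholder fallback markup are all hypothetical names invented for the example, and the level names are plain strings in the PSR-3 style.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder for whatever markup triggers client-side rendering (assumption).
CLIENT_SIDE_FALLBACK = "<!-- client-side rendered termbox -->"


@dataclass
class SsrOutcome:
    """Hypothetical result of one call to the termbox SSR service."""
    timed_out: bool = False        # case A: no response from the SSR service in time
    upstream_failed: bool = False  # case B: SSR answered, but could not reach Special:EntityData
    html: Optional[str] = None     # rendered markup on success


def log_level_for(outcome: SsrOutcome) -> Optional[str]:
    """Return the level name to log at, or None if nothing needs logging."""
    if outcome.timed_out:
        return "error"   # case A: no other component has logged this failure
    if outcome.upstream_failed:
        return "notice"  # case B: the SSR service has already logged an error
    return None


def render_or_fallback(outcome: SsrOutcome) -> str:
    """Use server-rendered markup when present, else fall back to client-side rendering."""
    return outcome.html if outcome.html is not None else CLIENT_SIDE_FALLBACK
```

In both failure cases the user still gets a termbox, because the fallback does not depend on which log level was chosen.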

Notes:

  • Error logging currently happens in both Wikibase and the termbox SSR service. Normally at most one error should be logged for a timeout (either from termbox or from Wikibase, but not both).

Questions/Details requested:

  • link to "failure to get a response from the termbox Kubernetes service in time" error in logstash

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-mediawiki-2020.11.20/mediawiki?id=AXXkYGPvKLIuIzERk2pj&_g=h@1251ff0

Acceptance Criteria

  • Issues with Termbox not being able to contact, or get a response in time from, Special:EntityData (or the MediaWiki API in general) are logged by MediaWiki at the NOTICE log level.

Event Timeline

During the team discussion it was pointed out that it should be made possible to track which requests from the Termbox SSR service to MediaWiki relate to which request from MediaWiki to the Termbox SSR service.
Currently the requests, and their errors, can only be matched by having happened in the same time frame.
This change is out of scope for this task.
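
For illustration only, matching by a shared identifier rather than by time frame might look like the sketch below. It assumes log entries from both sides are available as dictionaries and carry a common field, here called `reqId`; that field name and the sample entries are hypothetical and not taken from the actual logs.

```python
from collections import defaultdict


def group_by_request_id(entries):
    """Group log entries from MediaWiki and the SSR service by a shared request ID.

    Entries without the (hypothetical) 'reqId' field end up under the key None.
    """
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.get("reqId")].append(entry)
    return dict(grouped)


# Made-up sample entries, purely to show the shape of the result.
sample = [
    {"source": "mediawiki", "reqId": "abc", "level": "notice"},
    {"source": "termbox-ssr", "reqId": "abc", "level": "error"},
    {"source": "termbox-ssr", "reqId": "def", "level": "error"},
]
related = group_by_request_id(sample)["abc"]  # both entries for request "abc"
```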

Task inspection notes:

The RemoteRenderer in PHP is TermboxRemoteRenderer.php; it logs “case A” (timeout) in line 61 and “case B” (SSR service got error from MediaWiki) in line 63.

We also agreed that all kinds of errors in the SSR service should already be logged – the termbox SSR service can instruct the service runner to log an error, or the service runner will itself log an error if it can’t even start the service properly – but in any case, in PHP land we can assume that an error we get was already logged, and we can safely downgrade to “notice” all the time. (Edit: therefore we decided to set the story points to 2 after all, even though we had previously estimated this around 5 – we are now all convinced that changing the word “error” to “notice” is all that’s needed, except maybe updating a test case.)

T268640: Log termbox SSR errors with MediaWiki request ID

Change 643506 had a related patch set uploaded (by Tonina Zhelyazkova; owner: Tonina Zhelyazkova):
[mediawiki/extensions/Wikibase@master] TermboxRemoteRenderer: Change log level from error to notice

https://gerrit.wikimedia.org/r/643506

Change 643506 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] TermboxRemoteRenderer: Change log level from error to notice

https://gerrit.wikimedia.org/r/643506

I have failed to find any notices recorded in logstash (errors before the change also seem to have been pretty rare), but no errors are reported either. I conclude this task is done. Thanks!