Fix termbox ssr on beta
Closed, Resolved · Public

Description

Termbox SSR was reported to be down on Mon Sep 30 2019 11:43:05 GMT+0200. Judging by the commit history, it seems unlikely that newly deployed code broke it. It may have run out of disk space.
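
A minimal sketch of how the disk-space suspicion could be checked on the instance, assuming Python 3 is available there. The function names and the 95% threshold are hypothetical; the path to inspect is also an assumption (Docker's default data root on Linux is /var/lib/docker, but the instance may be configured differently).

```python
import shutil


def disk_usage_percent(total: int, used: int) -> float:
    """Return used space as a percentage of total space."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def is_nearly_full(path: str = "/", threshold_percent: float = 95.0) -> bool:
    """Report whether the filesystem holding `path` is at or above the threshold.

    The 95% default is an illustrative assumption, not a value from the task.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return disk_usage_percent(usage.total, usage.used) >= threshold_percent


if __name__ == "__main__":
    # Check the root filesystem; pass another path to inspect a different mount.
    print("nearly full:", is_nearly_full("/"))
```

If this reports a nearly full disk, old images and stopped containers would be the first place to look; if not, the logs of the service itself are the better lead.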

The service is deployed on a labs instance.
Docs: https://gerrit.wikimedia.org/r/plugins/gitiles/wikibase/termbox/+/refs/heads/master/infrastructure/README.md

Event Timeline

Thanks @Ladsgroup for taking care of this. Is there a commit/patch somewhere that encodes the fix in our history?

Unfortunately not. It was just a manual restart of the Docker service, plus a restart of our own service.

Sorry to be a buzzkill. It has been down again since 13:15 today. I have a feeling that the manual restart only fixed the issue temporarily and that Tom's suspicion that it has run out of disk space was correct.

For the record, there's a bot that reports outages to the Termbox channel on Mattermost.

Thanks @Jakob_WMDE, good to know about the bot reporting to the Termbox channel.

However, the bug report says "Termbox Beta SSR", and the link points to ssr-termbox.wmflabs.org. Does that mean only Wikidata beta is broken as a result?

When I test production now, even after purging the cache, things still work (though I'm not 100% sure whether SSR being down should break things).

Hey @alaa_wmde, I'm not sure I understand your question correctly. Yes, this is about beta only and is in no way related to production Wikidata or the Termbox SSR instances used in production. This task has always been only about beta, right? ssr-termbox.wmflabs.org is the address of our continuously deployed beta SSR service that we maintain ourselves, i.e. it's separate from the production instances that are managed through Kubernetes.

Aside from poking ssr-termbox.wmflabs.org directly, you'll see that visiting https://wikidata.beta.wmflabs.org/wiki/Q11?useformat=mobile with JavaScript disabled doesn't look quite right. That's a good indicator that the SSR service is broken.
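
Poking the service directly could look roughly like the sketch below. It is only an illustration: the host comes from this thread, but requesting the root path is an assumption (the service may only answer meaningfully on a specific endpoint), and the function names and status categories are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Optional

# Host taken from the thread; using the root path is an assumption.
SSR_URL = "https://ssr-termbox.wmflabs.org/"


def classify_status(status: Optional[int]) -> str:
    """Map an HTTP status code (or None for no response) to a rough health label."""
    if status is None:
        return "unreachable"
    if 200 <= status < 400:
        return "up"
    if 400 <= status < 500:
        return "responding"  # server is alive but rejected this particular request
    return "down"


def probe(url: str = SSR_URL, timeout: float = 5.0) -> str:
    """Request the URL once and classify the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions but still carry a status code.
        return classify_status(err.code)
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return "unreachable"


if __name__ == "__main__":
    print(probe())
```

A "responding" result on the root path would still suggest the container booted, whereas "down" or "unreachable" matches the kind of outage discussed here.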

Looking at the logs, it doesn't seem to be the same issue. Right now, termbox can't boot up because of this error:

Oct 14 20:15:06 wikidata-misc docker[2945]: systemd_termbox
Oct 14 20:15:06 wikidata-misc systemd[1]: termbox.service: Unit entered failed state.
Oct 14 20:15:06 wikidata-misc systemd[1]: termbox.service: Failed with result 'exit-code'.
Oct 14 20:30:00 wikidata-misc systemd[1]: Starting a systemd service for termbox server side rendering...
Oct 14 20:30:00 wikidata-misc updater.sh[3429]: Already have latest Termbox image
Oct 14 20:30:00 wikidata-misc systemd[1]: Started a systemd service for termbox server side rendering.
Oct 14 20:30:02 wikidata-misc docker[3442]: {"name":"wikibase-termbox","hostname":"0c06df2b588b","pid":1,"level":60,"err":{"message":"","name":"Error","stack":"Error: Cannot find module '@wmde/vuex-helpers/dist/
Oct 14 20:30:02 wikidata-misc systemd[1]: termbox.service: Main process exited, code=exited, status=1/FAILURE

(output of ladsgroup@wikidata-misc:~$ sudo journalctl -u termbox)
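
To pull such startup failures out of a longer journal, something like the following sketch might help. It assumes the JSON payload follows bunyan-style numeric levels, where 60 means "fatal"; the field layout above resembles that format, but that is an inference rather than something stated in this task. The function names are hypothetical, and the sample line is a made-up, complete stand-in with a placeholder module name, since the real line above is truncated.

```python
import json
from typing import Iterable, List, Optional

FATAL = 60  # assumption: bunyan-style levels (50 = error, 60 = fatal)


def extract_json_payload(line: str) -> Optional[dict]:
    """Return the JSON object embedded in a journal line, or None if absent or invalid."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        payload = json.loads(line[start:])
    except json.JSONDecodeError:
        return None  # e.g. a line truncated by the terminal, like the one above
    return payload if isinstance(payload, dict) else None


def fatal_messages(lines: Iterable[str]) -> List[str]:
    """Collect the stack (or message) of every entry logged at fatal level or above."""
    results = []
    for line in lines:
        payload = extract_json_payload(line)
        if payload is None:
            continue
        level = payload.get("level")
        if not isinstance(level, int) or level < FATAL:
            continue
        err = payload.get("err")
        if not isinstance(err, dict):
            err = {}
        results.append(err.get("stack") or err.get("message") or str(payload.get("msg", "")))
    return results


# Hypothetical sample: the second line is NOT the real log entry.
SAMPLE = [
    "Oct 14 20:30:00 wikidata-misc updater.sh[3429]: Already have latest Termbox image",
    'Oct 14 20:30:02 wikidata-misc docker[3442]: {"name":"wikibase-termbox","level":60,'
    '"err":{"message":"","name":"Error","stack":"Error: Cannot find module example-module"}}',
]

if __name__ == "__main__":
    for message in fatal_messages(SAMPLE):
        print(message)
```

Because journalctl shortens long lines to the terminal width when paging, the real entry would have to be captured untruncated (for example with the `--no-pager` option) before it could be parsed this way.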