
Follow up on possible server failure/intrusion
Closed, Resolved | Public | 32 Estimated Story Points

Description

Per this incident report

  • Roll back wikimedia02 to pre-problem backup
  • Update wikimedia02 to 16.04 (at least)
  • Update wikimedia01 to 16.04 (at least)

Also consider updating wikimedia03 through wikimedia05 to 18.04

When done, the incident report should be closed and T194157 (Set up maintenance routine for servers and websites) should be updated with the relevant info.

Event Timeline

Lokal_Profil renamed this task from Follow up on possible server failiure/introsion to Follow up on possible server failure/intrusion. Aug 21 2018, 3:53 PM
Lokal_Profil created this task.

Started the update (16.04) of wikimedia02. To avoid doing it over ssh (per recommendations) I did it through the Bahnhof interface (which requires Pale Moon on a Windows machine to run the Java applet).

The update failed, and the shitty interface allows neither scrolling back in the history nor copy-pasting, so the reason is hard to discover (something about uncommitted changes in `/etc/` flashed by).

The failure seems to have been recoverable though, and the server is now back up on 14.04.

Dug up /var/log/dist-upgrade (over ssh) and found:

apt.log

2018-08-21 15:53:24,829 DEBUG failed to SystemUnLock() (E:Not locked) 
2018-08-21 15:53:29,825 ERROR not handled exception:
SystemError: E:Problem executing scripts DPkg::Pre-Invoke 'if [ -x /usr/bin/etckeeper ]; then etckeeper pre-install; fi', E:Sub-process returned an error code

screenlog.0

** etckeeper detected uncommitted changes in /etc prior to apt run
** Aborting apt run. Manually commit and restart.

Error in function: 

SystemError: E:Problem executing scripts DPkg::Pre-Invoke 'if [ -x /usr/bin/etckeeper ]; then etckeeper pre-install; fi', E:Sub-process returned an error code
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/problem_report.py", line 416, in add_to_existing
    self.write(f)
  File "/usr/lib/python3/dist-packages/problem_report.py", line 369, in write
    block = f.read(1048576)
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Original exception was:
SystemError: E:Problem executing scripts DPkg::Pre-Invoke 'if [ -x /usr/bin/etckeeper ]; then etckeeper pre-install; fi', E:Sub-process returned an error code
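
For anyone retracing this, the relevant lines can be pulled out of the upgrade logs without scrolling through them. A minimal sketch, assuming the logs live in /var/log/dist-upgrade as above; the function name `show_upgrade_errors` and the scratch directory are made up for the demo and are not part of any tool.

```shell
# Hypothetical helper: print log lines that mention an error or an aborted run.
show_upgrade_errors() {
    log_dir="$1"
    # -r: search every file in the directory, -n: show line numbers,
    # -E: extended regex so '|' means "or".
    grep -rnE 'ERROR|Aborting' "$log_dir"
}

# Demo on a scratch directory so nothing under /var/log is needed.
demo_dir="$(mktemp -d)"
printf '%s\n' \
    '2018-08-21 15:53:24,829 DEBUG failed to SystemUnLock() (E:Not locked)' \
    '2018-08-21 15:53:29,825 ERROR not handled exception:' > "$demo_dir/apt.log"
printf '%s\n' \
    '** Aborting apt run. Manually commit and restart.' > "$demo_dir/screenlog.0"

upgrade_errors="$(show_upgrade_errors "$demo_dir")"
printf '%s\n' "$upgrade_errors"
```

On the server itself this would presumably be called as `show_upgrade_errors /var/log/dist-upgrade`.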

That is it for trying to get this to work today

This could be done by @Sebastian_Berlin-WMSE as well, depending on availability and patience.

To get the web console at Bahnhof to work:

  1. Install Pale Moon 32-bit on a Windows machine/partition
  2. Update to the latest Java
  3. Go to the Java config settings and whitelist https://ecs.bahnhof.se in the Security tab

It turns out that the problem was AVOID_COMMIT_BEFORE_INSTALL being on in /etc/etckeeper/etckeeper.conf. Commenting out that line solved the issue. Reference: https://serverfault.com/a/809335.
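
A rough sketch of that change, in case it has to be repeated on another server. It assumes the setting appears as `AVOID_COMMIT_BEFORE_INSTALL=1` and that GNU sed is available (as on Ubuntu); the function name `disable_avoid_commit` and the scratch file are made up for the demo.

```shell
# Hypothetical helper: comment out the active AVOID_COMMIT_BEFORE_INSTALL line.
disable_avoid_commit() {
    conf="$1"
    # Prefix the active setting with '#'. Lines that are already commented
    # out start with '#', so the anchored pattern leaves them alone.
    sed -i 's/^[[:space:]]*AVOID_COMMIT_BEFORE_INSTALL=/#&/' "$conf"
}

# Demo on a scratch copy so it can be tried without root.
demo_conf="$(mktemp)"
printf '%s\n' 'VCS="git"' 'AVOID_COMMIT_BEFORE_INSTALL=1' > "$demo_conf"
disable_avoid_commit "$demo_conf"
cat "$demo_conf"
```

On the real server the target would be /etc/etckeeper/etckeeper.conf, edited with sudo. The other way out is the one the log message itself suggests: commit the pending changes in /etc manually and restart the upgrade.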

wikimedia01 and wikimedia02 are now upgraded to Ubuntu 18.04. Both also got their disk space increased to 12 GB.

WARNING: After getting everything up again, Drupal is apparently using PHP 7.2 and not 5.6. Some of the steps in this and the next comment may not actually have done anything.

The site wasn't accessible after the update. After quite some experimenting and digging around in Drupal forums, it's now up again. There are still problems logging in as admin. Here are the steps followed to get it working:

  1. Change PHP version to 5.6 (was 7.2 after the Ubuntu upgrade) as per these instructions.
  2. Install PHP-FPM
    1. sudo a2enconf php5.6-fpm
    2. In /etc/php/5.6/fpm/pool.d/www.conf change listen = /var/run/php5-fpm.sock to listen = 127.0.0.1:10000 (ref)
  3. Comment out the line ini_set('session.save_handler', 'user'); in settings.php
    1. Initially, there was an error concerning cache_get(). Redefining it in settings.php helped find the underlying problem (ref).
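
The `listen` change in step 2 can be sketched as a one-line sed edit. This assumes GNU sed and that the pool file has a single active `listen = ...` line; the function name `set_fpm_listen` and the scratch file are made up for the demo. Given the warning above about PHP 7.2 actually being in use, treat this as a record of what was tried rather than a confirmed fix.

```shell
# Hypothetical helper: point the PHP-FPM pool at a TCP address
# instead of a Unix socket.
set_fpm_listen() {
    pool_conf="$1"
    listen_addr="$2"
    # Replace the whole active 'listen = ...' line. Related settings such
    # as 'listen.owner' do not match because of the '.' after 'listen'.
    sed -i "s|^listen[[:space:]]*=.*|listen = $listen_addr|" "$pool_conf"
}

# Demo on a scratch copy of a pool file.
demo_pool="$(mktemp)"
printf '%s\n' '[www]' \
    'listen = /var/run/php5-fpm.sock' \
    'listen.owner = www-data' > "$demo_pool"
set_fpm_listen "$demo_pool" '127.0.0.1:10000'
cat "$demo_pool"
```

On the server the file would be /etc/php/5.6/fpm/pool.d/www.conf, and the FPM service presumably needs a restart before the new address takes effect.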

Did the following to enable admin pages again:

  1. Uncomment and change in php.ini: opcache.enable=1
  2. Comment out lines concerning APC in settings.php
  3. Move all unused wmse_* modules out of sites/all/modules/wmse_modules/; they were loaded during bootstrap and caused errors
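
Step 1 above as a sketch, assuming GNU sed and that the setting is present in php.ini as a `;`-commented line. Which php.ini was edited is not recorded here, so the path is left as a parameter; the function name `enable_opcache` and the scratch file are made up for the demo.

```shell
# Hypothetical helper: uncomment opcache.enable and set it to 1.
enable_opcache() {
    php_ini="$1"
    # Matches the setting whether or not it is commented out with ';'.
    # 'opcache.enable_cli' is not touched: '_cli' breaks the match before '='.
    sed -i 's/^;*[[:space:]]*opcache\.enable[[:space:]]*=.*/opcache.enable=1/' "$php_ini"
}

# Demo on a scratch file.
demo_ini="$(mktemp)"
printf '%s\n' '[opcache]' ';opcache.enable=0' ';opcache.enable_cli=0' > "$demo_ini"
enable_opcache "$demo_ini"
cat "$demo_ini"
```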
Lokal_Profil claimed this task.

I've updated the on-wiki incident report and am now resolving this one.