XMLRCs is not functioning
Closed, Resolved, Public

Description

XMLRCs is a service in the Huggle Cloud VPS project that provides wm-bot with its RecentChanges feeds. It seems to have gone down this morning: wm-bot is failing to connect to it, which renders the RecentChanges module useless. Somebody with access to the Huggle project needs to restart XMLRCs.

Event Timeline

@Addshore and @Petrb are the admins for the huggle Cloud VPS project.

Mentioned in SAL (#wikimedia-cloud) [2021-11-13T01:54:16Z] <bd808> sudo su - xmlrcs; ./xmlrcsd -d after seeing no running xmlrcsd (T295487)

I found some sketchy docs at https://wikitech.wikimedia.org/wiki/XmlRcs#Maintainer_info which led me to try that command.

/opt/xmlrcs/nohup.out
Traceback (most recent call last):
  File "./es2r.py", line 16, in <module>
    rs.set("es2r.pid", int(os.getpid()))
  File "/usr/lib/python2.7/dist-packages/redis/client.py", line 1072, in set
    return self.execute_command('SET', *pieces)
  File "/usr/lib/python2.7/dist-packages/redis/client.py", line 573, in execute_command
    return self.parse_response(connection, command_name, **options)
  File "/usr/lib/python2.7/dist-packages/redis/client.py", line 585, in parse_response
    response = connection.read_response()
  File "/usr/lib/python2.7/dist-packages/redis/connection.py", line 582, in read_response
    raise response
redis.exceptions.ResponseError: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

/var/log/redis/redis-server.log
11418:M 13 Nov 03:02:37.048 * 1 changes in 900 seconds. Saving...
11418:M 13 Nov 03:02:37.048 # Can't save in background: fork: Cannot allocate memory
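
The two logs above fit together: the background save failed, and Redis then refused writes with MISCONF. A minimal sketch of how that state could be detected, assuming a dict shaped like the persistence section of the Redis INFO output (the field rdb_last_bgsave_status is real; the function name and the simplified rule are illustrative only and ignore whether any save points are configured):

```python
def redis_writes_blocked(persistence_info, stop_writes_on_bgsave_error=True):
    """Return True if Redis would likely reject writes with MISCONF.

    persistence_info: dict shaped like the 'persistence' section of INFO,
    for example what redis-py returns from Redis().info("persistence").
    stop_writes_on_bgsave_error: mirrors the Redis config option of the
    same name (assumed enabled here).
    """
    last_save_failed = persistence_info.get("rdb_last_bgsave_status") == "err"
    return bool(stop_writes_on_bgsave_error and last_save_failed)


# Hypothetical usage against a live server (requires the redis package):
#   import redis
#   info = redis.Redis().info("persistence")
#   print(redis_writes_blocked(info))

if __name__ == "__main__":
    print(redis_writes_blocked({"rdb_last_bgsave_status": "err"}))
    print(redis_writes_blocked({"rdb_last_bgsave_status": "ok"}))
```

This only reports the symptom. It does not address why fork could not allocate memory on the instance.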

Mentioned in SAL (#wikimedia-cloud) [2021-11-13T03:08:12Z] <bd808> Rebooting xmlrcs.huggle.eqiad1.wikimedia.cloud (T295487)

Perryprog claimed this task.

From @bd808: "[03:07:21] bd808: I think what happened is that the redis queue filled up because the consuming script was down. I'm going to reboot the instance and then check back to see what has to be manually started."

After this, everything depending on xmlrcs was back and running again, so this looks to be resolved.

The services DO NOT start on boot, so after rebooting someone needs to do something like:

$ ssh xmlrcs.huggle.eqiad1.wikimedia.cloud
$ sudo su - xmlrcs
$ cd /opt/xmlrcs
$ ./xmlrcsd -d
$ nohup ./start &

It appears to be down again. @Petrb or @bd808, is either of you able to poke it awake again?

And we're back!

Since the shutdown of the stretch images, xmlrcs seems to have been down. From some poking around by bd808, it looks like there's an xmlrcs2.huggle.eqiad1.wikimedia.cloud instance, but it's not yet ready for deployment. Re-opening, since more work will likely be needed than just restarting a nohup script.

Mentioned in SAL (#wikimedia-cloud) [2022-07-22T22:55:43Z] <bd808> Joined project to help with T295487

Mentioned in SAL (#wikimedia-cloud) [2022-07-22T22:58:21Z] <bd808> Started stretch VM xmlrcs.huggle.eqiad1.wikimedia.cloud to restore service to Huggle end-users (T295487)

I did these same things again after bringing the shutdown instance back online. @Perryprog reports on IRC that the service is working again.

Phuzion subscribed.

XMLRCS has been down since at least 8:53PM EST on Tuesday, February 7.

I would like to offer to be an additional point of contact for this service, as I think having an additional person trained on fixing this system when it crashes would be useful.

I used the information at https://wikitech.wikimedia.org/wiki/XmlRcs#Maintainer_info to restart the processes on the xmlrcs2.huggle.eqiad1.wikimedia.cloud instance. Connecting with telnet rc.huggle.wmcloud.org 8822 and watching enwiki changes is now working for me.
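
The manual telnet check above could be scripted. The following is a rough sketch, not an official health check: it only tests whether a TCP connection to the host and port quoted above can be opened. The function name and the timeout value are assumptions made for illustration.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False


if __name__ == "__main__":
    # Host and port are the ones used in the telnet command above.
    print(port_is_open("rc.huggle.wmcloud.org", 8822))
```

An open port does not prove that changes are actually flowing, so a check like this would catch "down" but not "broken".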

I would suggest that you create a task similar to T321331: Grant TheresNoTime membership in the Huggle Cloud VPS project and then do your best to get the attention of @Petrb or @Addshore, who are the current project maintainers. I wish you luck in the wars to come. ;)

Yoshi24517 subscribed.

XMLRCS is down again; Huggle keeps switching its feed to IRC.

XMLRCS is back up now and running!

I restarted things after being pinged on IRC.

[18:53]  <  phuzion> Hey bd808, could you re-kick xmlrcs for us please?
[18:54]  <    bd808> *sigh* yeah
[18:54]  <  phuzion> Thanks, sorry for the ping.
[18:58]  <    bd808> phuzion: it seems to be more broken than down. I'm trying to figure out what the problem is...
[18:58]  <  phuzion> Alright thanks for the update.
[18:59]  <    bd808> I think it's working again now. I was just not patient enough for it to attach to the real data feed.
[19:02]  <  phuzion> bd808: Yep, it seems to be working. Is there any chance we can get some sort of monitoring on this? Is there a wmfcloud nagios instance or something that we can get that added to?
[19:03]  <    bd808> phuzion: since there are 0 active maintainers I don't know what use active monitoring would be
[19:03]  <    bd808> I'm not going to sign up to get paged for petan's broken app
[19:04]  <  phuzion> I've been trying to get access to the instance for a bit, at least so I could restart xmlrcs when it crashes.
[19:04]  <    bd808> The main problem is that the app is written badly and requires human intervention for minor issues like server reboots and dns failures
[19:05]  <  phuzion> Yeah.

Mentioned in SAL (#wikimedia-cloud) [2023-02-17T19:08:13Z] <bd808> Restarted xmlrcs per IRC ping (T295487)

XMLRCS died again, redis empty for at least 10 seconds.

Fixed: someone started that nohup job while the old one was still running. I cleared all running ./es2r.py processes and it fixed itself.
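
One way to guard against a second copy being started while the old one is still running is an exclusive lock file taken at startup. This is a sketch under assumptions, not what es2r.py does today: the function name and the lock path are hypothetical, and it relies on fcntl.flock, so it is Unix-only.

```python
import fcntl
import os


def acquire_single_instance_lock(lock_path):
    """Try to take an exclusive, non-blocking lock on lock_path.

    Returns the open file object on success (keep it open for the life of
    the process), or None if another holder already has the lock.
    """
    handle = open(lock_path, "a+")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Another open file description already holds the lock.
        handle.close()
        return None
    # Record our PID in the lock file for whoever inspects it later.
    handle.seek(0)
    handle.truncate()
    handle.write(str(os.getpid()))
    handle.flush()
    return handle


if __name__ == "__main__":
    # Hypothetical lock path; pick one the service user can write to.
    lock = acquire_single_instance_lock("/tmp/es2r.lock")
    if lock is None:
        raise SystemExit("another instance appears to be running")
    print("lock acquired, continuing startup")
```

The kernel releases the lock when the process exits, even after a crash, so a stale file does not block the next start.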

Keep in mind that ./es2r.py is a poor-quality Python script that keeps hanging randomly for unknown reasons. It's based on the reference script provided by the original authors of the event stream. Making this script reliable would be a permanent fix for this problem.
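
Until the cause of the hangs is known, a stall detector could at least notice them. The sketch below is an assumption about how such a check might look, not part of the existing script: the class name is hypothetical, and the 10 second threshold is borrowed from the "redis empty for at least 10 seconds" symptom reported above.

```python
import time


class StallDetector:
    """Track when the last event arrived and report a stall after max_idle seconds."""

    def __init__(self, max_idle, clock=time.monotonic):
        self.max_idle = max_idle
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_event = clock()

    def record_event(self):
        """Call this each time the consumer receives an event."""
        self.last_event = self.clock()

    def is_stalled(self):
        """Return True if no event has been recorded for more than max_idle seconds."""
        return self.clock() - self.last_event > self.max_idle


if __name__ == "__main__":
    detector = StallDetector(max_idle=10)
    detector.record_event()
    print(detector.is_stalled())
```

A consumer loop would call record_event() for every event it receives, and a separate timer or supervisor would check is_stalled() and restart the process when it returns True.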