XMLRCs is not functioning
Closed, Resolved · Public

Assigned To

Authored By

	MacFan4000
	Nov 10 2021, 4:10 PM

Description

XMLRCs is a service in the Huggle Cloud VPS project that provides the RecentChanges feeds used by wm-bot. It seems to have gone down this morning: wm-bot is failing to connect to it, which renders the RecentChanges module useless. Somebody with access to the Huggle project needs to restart XMLRCs.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Petrb	T306076 Cloud VPS "huggle" project Stretch deprecation
		Resolved		bd808	T295487 XMLRCs is not functioning

Event Timeline

MacFan4000 created this task. Nov 10 2021, 4:10 PM

Restricted Application added a project: User-MacFan4000. Nov 10 2021, 4:10 PM

@Addshore and @Petrb are the admins for the huggle Cloud VPS project.

Perryprog subscribed. Nov 11 2021, 12:32 AM

RhinosF1 subscribed. Nov 12 2021, 4:18 PM

Mentioned in SAL (#wikimedia-cloud) [2021-11-13T01:54:16Z] <bd808> sudo su - xmlrcs; ./xmlrcsd -d after seeing no running xmlrcsd (T295487)

In T295487#7501291, @Stashbot wrote:

Mentioned in SAL (#wikimedia-cloud) [2021-11-13T01:54:16Z] <bd808> sudo su - xmlrcs; ./xmlrcsd -d after seeing no running xmlrcsd (T295487)

I found some sketchy docs at https://wikitech.wikimedia.org/wiki/XmlRcs#Maintainer_info which led me to try that command.

/opt/xmlrcs/nohup.out

Traceback (most recent call last):
  File "./es2r.py", line 16, in <module>
    rs.set("es2r.pid", int(os.getpid()))
  File "/usr/lib/python2.7/dist-packages/redis/client.py", line 1072, in set
    return self.execute_command('SET', *pieces)
  File "/usr/lib/python2.7/dist-packages/redis/client.py", line 573, in execute_command
    return self.parse_response(connection, command_name, **options)
  File "/usr/lib/python2.7/dist-packages/redis/client.py", line 585, in parse_response
    response = connection.read_response()
  File "/usr/lib/python2.7/dist-packages/redis/connection.py", line 582, in read_response
    raise response
redis.exceptions.ResponseError: MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

/var/log/redis/redis-server.log

11418:M 13 Nov 03:02:37.048 * 1 changes in 900 seconds. Saving...
11418:M 13 Nov 03:02:37.048 # Can't save in background: fork: Cannot allocate memory
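The MISCONF error above is what Redis returns to writers while its last background RDB save has failed and `stop-writes-on-bgsave-error` is enabled. As a rough sketch (not something that exists on the xmlrcs instance), the same condition can be spotted from the `rdb_last_bgsave_status` field of `redis-cli INFO persistence`. The function names below are hypothetical.

```python
import subprocess


def bgsave_is_failing(info_text: str) -> bool:
    """Return True if Redis INFO text reports that the last RDB save failed."""
    for line in info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "rdb_last_bgsave_status":
            return value != "ok"
    # Field not present: nothing to report.
    return False


def read_persistence_info() -> str:
    """Hypothetical helper: fetch the persistence section from a local Redis."""
    result = subprocess.run(
        ["redis-cli", "INFO", "persistence"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `bgsave_is_failing(read_persistence_info())` on the host would then tell you whether writers such as es2r.py are likely to hit MISCONF. It does not explain why the fork failed.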

Mentioned in SAL (#wikimedia-cloud) [2021-11-13T03:08:12Z] <bd808> Rebooting xmlrcs.huggle.eqiad1.wikimedia.cloud (T295487)

From @bd808: "[03:07:21] bd808: I think what happened is that the redis queue filled up because the consuming script was down. I'm going to reboot the instance and then check back to see what has to be manually started."

After this, everything depending on xmlrcs was back and running again, so this looks to be resolved.

MacFan4000 reassigned this task from Perryprog to bd808. Nov 13 2021, 3:18 AM

Restricted Application added a project: User-bd808. Nov 13 2021, 3:18 AM

The services DO NOT start on boot, so after rebooting someone needs to do something like:

$ ssh xmlrcs.huggle.eqiad1.wikimedia.cloud
$ sudo su - xmlrcs
$ cd /opt/xmlrcs
$ ./xmlrcsd -d
$ nohup ./start &

It appears to be down again. @Petrb or @bd808, is either of you able to poke it awake again?

Fixed by @Petrb.

And we're back!

Since the shutdown of the Stretch images, xmlrcs seems to have been down. From some poking around by bd808, it looks like there's xmlrcs2.huggle.eqiad1.wikimedia.cloud, but it's not yet ready for deployment. Re-opening, since more work will likely be needed than just restarting a nohup script.

Mentioned in SAL (#wikimedia-cloud) [2022-07-22T22:55:43Z] <bd808> Joined project to help with T295487

Mentioned in SAL (#wikimedia-cloud) [2022-07-22T22:58:21Z] <bd808> Started stretch VM xmlrcs.huggle.eqiad1.wikimedia.cloud to restore service to Huggle end-users (T295487)

In T295487#7501357, @bd808 wrote:
The services DO NOT start on boot, so after rebooting someone needs to do something like:
$ ssh xmlrcs.huggle.eqiad1.wikimedia.cloud
$ sudo su - xmlrcs
$ cd /opt/xmlrcs
$ ./xmlrcsd -d
$ nohup ./start &

I did these same things again after bringing the shutdown instance back online. @Perryprog reports on IRC that the service is working again.

Closing this again, but will post some notes on T306076: Cloud VPS "huggle" project Stretch deprecation for follow up.

bd808 mentioned this in T306076: Cloud VPS "huggle" project Stretch deprecation. Jul 22 2022, 11:15 PM

TheresNoTime mentioned this in T321331: Grant TheresNoTime membership in the Huggle Cloud VPS project. Oct 20 2022, 4:58 PM

Urbanecm mentioned this in T326050: wm-bot’s recentchanges module does not work. Jan 1 2023, 4:13 AM

XMLRCS has been down since at least 8:53PM EST on Tuesday, February 7.

I would like to offer to be an additional point of contact for this service, as I think having an additional person trained on fixing this system when it crashes would be useful.

In T295487#8606675, @Phuzion wrote:

XMLRCS has been down since at least 8:53PM EST on Tuesday, February 7.

I used the information at https://wikitech.wikimedia.org/wiki/XmlRcs#Maintainer_info to restart the processes on the xmlrcs2.huggle.eqiad1.wikimedia.cloud instance. Connecting with telnet rc.huggle.wmcloud.org 8822 and watching enwiki changes is now working for me.
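For anyone who wants to script the telnet check, here is a minimal sketch of a TCP reachability probe. It only confirms that something accepts connections on the port. It does not confirm that changes are actually flowing, so a service that is up but broken would still pass. The function name is made up for illustration.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Usage would be something like `port_is_open("rc.huggle.wmcloud.org", 8822)`, using the host and port from the telnet command above.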

I would like to offer to be an additional point of contact for this service, as I think having an additional person trained on fixing this system when it crashes would be useful.

I would suggest that you create a task similar to T321331: Grant TheresNoTime membership in the Huggle Cloud VPS project and then do your best to get the attention of @Petrb or @Addshore, who are the current project maintainers. I wish you luck in the wars to come. ;)

XMLRCS is down again; Huggle keeps switching its feed to IRC.

XMLRCS is back up now and running!

In T295487#8626148, @Yoshi24517 wrote:

XMLRCS is back up now and running!

I restarted things after being pinged on IRC.

[18:53]  <  phuzion> Hey bd808, could you re-kick xmlrcs for us please?
[18:54]  <    bd808> *sigh* yeah
[18:54]  <  phuzion> Thanks, sorry for the ping.
[18:58]  <    bd808> phuzion: it seems to be more broken than down. I'm trying to figure out what the problem is...
[18:58]  <  phuzion> Alright thanks for the update.
[18:59]  <    bd808> I think it's working again now. I was just not patient enough for it to attach to the real data feed.
[19:02]  <  phuzion> bd808: Yep, it seems to be working. Is there any chance we can get some sort of monitoring on this? Is there a wmfcloud nagios instance or something that we can get that added to?
[19:03]  <    bd808> phuzion: since there are 0 active maintainers I don't know what use active monitoring would be
[19:03]  <    bd808> I'm not going to sign up to get paged for petan's broken app
[19:04]  <  phuzion> I've been trying to get access to the instance for a bit, at least so I could restart xmlrcs when it crashes.
[19:04]  <    bd808> The main problem is that the app is written badly and requires human intervention for minor issues like server reboots and dns failures
[19:05]  <  phuzion> Yeah.

Mentioned in SAL (#wikimedia-cloud) [2023-02-17T19:08:13Z] <bd808> Restarted xmlrcs per IRC ping (T295487)

XMLRCS died again, redis empty for at least 10 seconds.

Fixed: someone started that nohup job while the old one was still running. I cleared all running ./es2r.py processes and it fixed itself.
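One common way to stop a second copy from starting while the first is still running is an exclusive lock file taken at startup. The sketch below illustrates that idea with `fcntl.flock` (Unix only). It is not how es2r.py or the start script currently work, and the lock path shown in the usage note is an assumption.

```python
import fcntl


def acquire_single_instance_lock(path: str):
    """Return an open file holding an exclusive lock, or None if already held.

    The caller must keep the returned file object open for as long as the
    process runs; the lock is released when the file is closed or the
    process exits.
    """
    fh = open(path, "a+")
    try:
        # LOCK_NB makes this fail immediately instead of waiting.
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        fh.close()
        return None
    return fh
```

A script could call `acquire_single_instance_lock("/tmp/es2r.lock")` (hypothetical path) first and exit if it gets `None`, so a stray second `nohup ./start &` would not leave two consumers running.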

Keep in mind that ./es2r.py is a terrible-quality Python script that keeps hanging randomly for unknown reasons. It is based on the reference script provided by the original authors of the event stream. Making this script reliable would be a permanent fix for this problem.
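Since the cause of the hangs is unknown, one generic mitigation is to detect a stalled feed and reconnect or exit so a supervisor can restart the script. The sketch below shows only the detection part. The class name and the idle threshold are hypothetical, and nothing here describes the current es2r.py code.

```python
import time


class StallDetector:
    """Track the time of the last received event and report long silences."""

    def __init__(self, max_idle_seconds: float, clock=time.monotonic):
        self._max_idle = max_idle_seconds
        self._clock = clock  # injectable so the logic can be tested
        self._last_event = clock()

    def record_event(self) -> None:
        """Call whenever an event arrives from the stream."""
        self._last_event = self._clock()

    def is_stalled(self) -> bool:
        """True if no event has been recorded for longer than the limit."""
        return self._clock() - self._last_event > self._max_idle
```

The consuming loop would call `record_event()` for each change it receives, and a timer or separate thread would check `is_stalled()` and trigger a reconnect when it returns True.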

Petrb closed this task as Resolved. Feb 22 2023, 8:23 PM
