
Restore IRC alerts for beta-scap-eqiad job
Closed, Resolved (Public)

Description

When the beta-scap-eqiad job fails, it means that something is wrong with the beta cluster, scap, or the mainline branch of MediaWiki. A failing beta-scap-eqiad usually means beta is out of date, which hurts our ability to test mainline prior to the train.

This job used to alert in #wikimedia-releng but now it doesn't. The logs still say:

09:08:43 IRC notifier plugin: Sending notification to: #wikimedia-releng

We should make sure this works.

Event Timeline

hashar added a subscriber: hashar.

The bot connects via the IRC notification plugin, which can be configured at https://integration.wikimedia.org/ci/configure . Its name is wmf-insecte (insecte is French for a bug). A whois on IRC shows it has been connected since December 7th and was last active 40 minutes ago, so I guess it does send notifications.

The global config shows it is configured to join #wikimedia-releng, though it is not there. IIRC it only joins a channel once it has to send a message there, so I guess it has sent nothing yet :-\

The job https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ shows no configuration history changes, maybe because they got purged automatically after X days. The job is configured to emit an alert for failed and fixed builds.

There was a build failing a few hours ago, from the console log:

00:03:29.404 Build step 'Execute shell' marked build as failure
00:03:29.547 IRC notifier plugin: Sending notification to: #wikimedia-releng

Jenkins lets us capture logging messages, and a while ago I set up a few loggers specifically for the IRC plugin at https://integration.wikimedia.org/ci/log/ :

[wmf-insecte, #wikimedia-analytics, Cannot send to nick/channel]
Dec 22, 2020 4:19:06 PM WARNING hudson.plugins.ircbot.v2.PircListener onServerResponse

Dec 22, 2020 8:00:03 PM WARNING hudson.plugins.ircbot.v2.PircListener onServerResponse
IRC server responded error 404 Message:
[wmf-insecte, #wikimedia-releng, Cannot send to nick/channel]

etc

Anything WARNING or above is also logged in a flat file on the server:

/var/log/jenkins/jenkins.log
...
Jan  4 17:17:52 contint2001 jenkins[26345]: WARNING: [hudson.plugins.ircbot.v2.PircListener onServerResponse] IRC server responded error 404 Message:
Jan  4 17:30:10 contint2001 jenkins[26345]: WARNING: [hudson.plugins.ircbot.v2.PircListener onServerResponse] IRC server responded error 404 Message:
Jan  4 17:30:11 contint2001 jenkins[26345]: WARNING: [hudson.plugins.ircbot.v2.PircListener onServerResponse] IRC server responded error 404 Message:
Jan  4 19:56:52 contint2001 jenkins[26345]: WARNING: [hudson.plugins.ircbot.v2.PircListener onServerResponse] IRC server responded error 404 Message:
Jan  4 19:56:52 contint2001 jenkins[26345]: WARNING: [hudson.plugins.ircbot.v2.PircListener onServerResponse] IRC server responded error 404 Message:
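
To get a feel for how long and how often this has been happening, the warnings can be tallied per day. This is only a minimal sketch that assumes the syslog-style line format shown in the excerpt above; the function name `count_irc_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches the line format seen in the excerpt above, e.g.
# "Jan  4 17:17:52 contint2001 jenkins[26345]: WARNING: [...] IRC server responded error 404 Message:"
LINE_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2}) \d{2}:\d{2}:\d{2} "
    r"\S+ jenkins\[\d+\]: WARNING: "
    r"\[hudson\.plugins\.ircbot\.v2\.PircListener onServerResponse\] "
    r"IRC server responded error (?P<code>\d{3})"
)


def count_irc_errors(lines):
    """Return a Counter keyed by (month, day, IRC error code)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match["month"], int(match["day"]), match["code"])] += 1
    return counts
```

It takes any iterable of lines, so an open file handle on /var/log/jenkins/jenkins.log can be passed directly (reading that file probably needs elevated rights on the host).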

This has been going on for a while. Besides "Cannot send to nick/channel" there is not much information. Maybe our bot got banned, which raises the question of whether it is authenticated.

An irc whois gives me:

[21:53:09] * [wmf-insecte] (~PircBotx@contint2001.wikimedia.org): PircBotX 2.0.1, a fork of PircBot, the Java IRC bo

So it is apparently not authenticated, even though the plugin configuration does have credentials set:

Nickname: wmf-insecte
Login: PircBotx
Password: <concealed>
NickServ Password: <concealed>

The bot account on Freenode is mw-jenkinsbot:

/query nickserv info wmf-insecte
[21:52:26] <hashar> info wmf-insecte
[21:52:26] -NickServ- Information on wmf-insecte (account mw-jenkinsbot):
[21:52:26] -NickServ- Registered : Apr 18 20:55:45 2014 (6y 37w 3d ago)
[21:52:26] -NickServ- User reg.  : Sep 08 01:22:15 2011 (9y 17w 2d ago)
[21:52:26] -NickServ- Last addr  : ~PircBotx@contint2001.wikimedia.org
[21:52:26] -NickServ- Last seen  : Oct 21 18:37:05 2020 (10w 5d 2h ago)
[21:52:26] -NickServ- Flags      : HideMail
[21:52:26] -NickServ- *** End of Info ***

Almost 10 years old :]

The config file /var/lib/jenkins/hudson.plugins.ircbot.IrcPublisher.xml was last touched on Oct 21 19:14 2020. The configuration history plugin gives us an audit trail of configuration changes, which can be browsed at https://integration.wikimedia.org/ci/jobConfigHistory/?filter=system and more specifically at https://integration.wikimedia.org/ci/jobConfigHistory/history?name=hudson.plugins.ircbot.IrcPublisher

Probably an unrelated config change triggered by the Jenkins plugin itself as the form was being saved.

Mentioned in SAL (#wikimedia-releng) [2021-01-04T21:08:31Z] <hasharAway> Change Jenkins IRC login to mw-jenkinsbot # T271122

I got the encrypted password from the Jenkins plugin XML configuration file; it can then be decrypted via the script console at https://integration.wikimedia.org/ci/script:

println(new String(com.cloudbees.plugins.credentials.SecretBytes.fromString("{}").getPlainData(), "ASCII"))
                                 Add encrypted password between curly braces ^^

I have set the password again in https://integration.wikimedia.org/ci/configure just to make sure it is entirely valid.

I must have attempted authentication too many times, because eventually the server yields:

437 * wmf-insecte :Nick/channel is temporarily unavailable

The 437 error is emitted by the server when there are too many nickname changes or connections.
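
For reading these raw replies, here is a small sketch that splits a server line into its numeric and text. The names for 404 and 437 come from RFC 2812, and those for 900 and 903 from the IRCv3 SASL specification; the helper `parse_numeric` itself is made up for illustration and is not something the plugin provides.

```python
# Only the numerics that show up in this task.
NUMERICS = {
    "404": "ERR_CANNOTSENDTOCHAN",
    "437": "ERR_UNAVAILRESOURCE",
    "900": "RPL_LOGGEDIN",
    "903": "RPL_SASLSUCCESS",
}


def parse_numeric(line):
    """Split a raw IRC reply into (numeric, name, params, trailing text)."""
    line = line.strip()
    if line.startswith(":"):
        # Drop the optional ":server.name" prefix.
        _, _, line = line.partition(" ")
    # Everything after the first " :" is the free-text trailing parameter.
    head, _, trailing = line.partition(" :")
    code, *params = head.split()
    return code, NUMERICS.get(code, "unknown"), params, trailing
```

Feeding it the 437 line above gives the numeric, its symbolic name, the parameters `*` and `wmf-insecte`, and the human-readable message.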

I have also enabled SASL ( https://freenode.net/kb/answer/sasl ), which lets one authenticate with the server while connecting, rather than through NickServ afterwards. Judging by the debug log at https://integration.wikimedia.org/ci/log/IRC%20IM%20plugin/ , it seems to be working:

Jan 04, 2021 10:18:27 PM INFO org.pircbotx.InputParser handleLine
:hitchcock.freenode.net CAP * ACK :sasl 
Jan 04, 2021 10:18:27 PM INFO org.pircbotx.output.OutputRaw rawLineNow
AUTHENTICATE PLAIN
Jan 04, 2021 10:18:28 PM INFO org.pircbotx.InputParser handleLine
AUTHENTICATE +
Jan 04, 2021 10:18:28 PM INFO org.pircbotx.output.OutputRaw rawLineNow
AUTHENTICATE <XXXXXXXXXXXXXXXXXXXXXXXXX>
Jan 04, 2021 10:18:29 PM INFO org.pircbotx.InputParser handleLine
:hitchcock.freenode.net 900 * *!mw-jenkins@contint2001.wikimedia.org mw-jenkinsbot :You are now logged in as mw-jenkinsbot.
Jan 04, 2021 10:18:29 PM INFO org.pircbotx.output.OutputRaw rawLineNow
CAP END
Jan 04, 2021 10:18:29 PM INFO org.pircbotx.InputParser handleLine
:hitchcock.freenode.net 903 * :SASL authentication successful

But the connection eventually fails due to a timeout:

Jan 04, 2021 10:18:56 PM INFO org.pircbotx.InputParser handleLine
ERROR :Closing Link: contint2001.wikimedia.org (Connection timed out)
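
As a side note on the redacted AUTHENTICATE line in the exchange above: with SASL PLAIN (RFC 4616) the argument is just the base64 encoding of authzid, account name and password joined by NUL bytes, so it is encoded rather than encrypted and should stay out of pasted logs. A minimal sketch, with hypothetical helper names and placeholder credentials:

```python
import base64


def sasl_plain_payload(account, password, authzid=""):
    """Build the base64 argument sent with AUTHENTICATE for SASL PLAIN."""
    raw = b"\0".join(part.encode("utf-8") for part in (authzid, account, password))
    return base64.b64encode(raw).decode("ascii")


def decode_sasl_plain(payload):
    """Inverse of the above: returns (authzid, account, password)."""
    authzid, account, password = base64.b64decode(payload).split(b"\0")
    return authzid.decode("utf-8"), account.decode("utf-8"), password.decode("utf-8")
```

Anyone who can read the debug log could run the equivalent of `decode_sasl_plain` on an unredacted line and recover the password.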

Eventually I disabled SASL again, which brings us back to 437 * wmf-insecte :Nick/channel is temporarily unavailable. I have thus disabled the IRC notifications entirely for the time being via https://integration.wikimedia.org/ci/configure

Will try again tomorrow I guess.

Mentioned in SAL (#wikimedia-releng) [2021-01-04T22:30:47Z] <hasharAway> IRC notifications from Jenkins / wmf-insecte disabled for now due to T271122

I checked with the IRC ops and they said there were no recent bans that would've matched wmf-insecte!~PircBotx@contint2001.wikimedia.org to prevent the nick from speaking. They also suggested connecting directly to freenode from that server (with telnet or openssl, etc.) to debug whether it's a network issue or something with Jenkins.
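
If telnet or openssl are not handy on the host, roughly the same check can be done from Python. This is only a sketch: the function names, the default nick and the choice of port 6697 (the usual IRC-over-TLS port) are assumptions, and the server hostname has to be supplied by the caller.

```python
import socket
import ssl


def registration_lines(nick, user, realname="connectivity test"):
    """Raw lines a client sends to register; no NickServ or SASL involved."""
    return [f"NICK {nick}", f"USER {user} 0 * :{realname}"]


def probe_irc(host, port=6697, nick="probe-nick", user="probe",
              timeout=30.0, max_lines=20):
    """Connect over TLS, register, and return the first lines the server sends."""
    context = ssl.create_default_context()
    received = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            for line in registration_lines(nick, user):
                tls.sendall((line + "\r\n").encode("utf-8"))
            reader = tls.makefile("r", encoding="utf-8", errors="replace",
                                  newline="\r\n")
            try:
                for _ in range(max_lines):
                    line = reader.readline()
                    if not line:
                        break  # server closed the connection
                    received.append(line.rstrip("\r\n"))
            except socket.timeout:
                pass  # server went quiet; return what we have so far
    return received
```

Calling `probe_irc()` with the network's hostname from contint2001 and from another machine, then comparing the returned lines, should show whether the timeout is specific to that server or to the Jenkins plugin.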

@Legoktm thank you for the check!

I think there might be an issue in how the plugin saves the password in its configuration file. When the configuration is saved a second time, the form contains the already encrypted password, which seems to be encrypted again and saved as is.

In /var/lib/jenkins/hudson.plugins.ircbot.IrcPublisher.xml a second save results in password values that are much longer.
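
The suspected failure mode can be shown with a toy model. To be clear, this does not use Jenkins' real encryption: base64 wrapped in curly braces is only a stand-in for "a reversible transformation whose output is longer than its input", and all names and the password are placeholders.

```python
import base64


def toy_encrypt(value):
    """Stand-in for the real encryption (NOT what Jenkins does internally)."""
    return "{" + base64.b64encode(value.encode("utf-8")).decode("ascii") + "}"


def toy_decrypt(value):
    """Undo exactly one layer of toy_encrypt."""
    return base64.b64decode(value[1:-1]).decode("utf-8")


password = "placeholder-password"
first_save = toy_encrypt(password)
# Second save: the form hands back the already encrypted value,
# and it gets encrypted once more instead of being stored as is.
second_save = toy_encrypt(first_save)

# The stored value grows with every save ...
assert len(second_save) > len(first_save)
# ... and a single decryption now yields the old ciphertext, not the password,
# so the bot would try to log in with the wrong credential.
assert toy_decrypt(second_save) == first_save
assert toy_decrypt(toy_decrypt(second_save)) == password
```

That would match both symptoms seen here: longer values in the XML file after a second save, and authentication failing afterwards.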

At least I got it authenticated (with SASL) and it managed to emit an alarm:

[09:41:13] <mw-jenkinsbot> Project hashar-irc build #8: SUCCESS in 0.14 sec: https://integration.wikimedia.org/ci/job/hashar-irc/8/

But I guess next time the config is saved, we will be blocked again.

#wikimedia-releng is +r ("block unidentified": prevents users who are not identified to services from joining the channel), so the bot couldn't join the channel.

    -NickServ- Information on wmf-insecte (account mw-jenkinsbot):
    -NickServ- Registered : Apr 18 20:55:45 2014 (6y 37w 4d ago)
    -NickServ- User reg.  : Sep 08 01:22:15 2011 (9y 17w 3d ago)
    -NickServ- Last addr  : ~mw-jenkin@contint2001.wikimedia.org
>>> -NickServ- Last seen  : Oct 21 18:37:05 2020 (10w 5d 14h ago)
    -NickServ- User seen  : now
    -NickServ- Flags      : HideMail
    -NickServ- *** End of Info ***

@Peachey88 indeed, because the passwords end up being scrambled when saving the configuration a second time. The bot cannot authenticate as a result. It is authenticated properly right now and uses the nick mw-jenkinsbot.

I guess the complete fix will be to upgrade the plugin but it requires a newer version of Jenkins which hasn't been released yet :-\

thcipriani changed the task status from Open to Stalled. Jan 5 2021, 6:29 PM

I guess the complete fix will be to upgrade the plugin but it requires a newer version of Jenkins which hasn't been released yet :-\

bah, well, marking as stalled pending new Jenkins release.

#wikimedia-releng is +r ("block unidentified": prevents users who are not identified to services from joining the channel), so the bot couldn't join the channel.

Should we set -r for now then until the bot can properly auth?

08:35:29 <mw-jenkinsbot> Project beta-scap-eqiad build #334134: STILL FAILING in 1 min 7 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/334134/
08:45:28 <mw-jenkinsbot> Project beta-scap-eqiad build #334135: STILL FAILING in 1 min 6 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/334135/
08:55:28 <mw-jenkinsbot> Project beta-scap-eqiad build #334136: STILL FAILING in 1 min 7 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/334136/
09:05:31 <mw-jenkinsbot> Project beta-scap-eqiad build #334137: STILL FAILING in 1 min 7 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/334137/
09:15:33 <mw-jenkinsbot> Project beta-scap-eqiad build #334138: STILL FAILING in 1 min 9 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/334138/
09:25:38 <mw-jenkinsbot> Project beta-scap-eqiad build #334139: STILL FAILING in 1 min 13 sec: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/334139/

Looks fixed to me?

#wikimedia-releng is +r ("block unidentified": prevents users who are not identified to services from joining the channel), so the bot couldn't join the channel.

Should we set -r for now then until the bot can properly auth?

Kind of, but then we expose the channel to various random spammers. Another solution is to drop authentication for the bot, but IIRC that exposes it to rate limiting and/or breaches the Freenode terms of service. Anyway, the real issue is that the bot password ends up doubly encrypted by Jenkins core and/or the plugin :-\

@Jdforrester-WMF the issue is still present on our Jenkins instance. I am assuming it is fixed when using the latest non-LTS version of Jenkins and the latest version of the plugin, so this is pending the release of the next Jenkins LTS.

Or maybe one can look at rolling back the plugin to a previous version. It is a bit time consuming though.

Mentioned in SAL (#wikimedia-operations) [2021-05-17T09:43:14Z] <hashar> Restarted CI Jenkins to update the instant-messaging and ircbot plugins # T271122

I looked into it a bit over the weekend, but following my notes above I was not able to retrieve the password :-\

I have taken a copy of the file /var/lib/jenkins/hudson.plugins.ircbot.IrcPublisher.xml-2.33-T271122

I have upgraded the plugins:

hashar claimed this task.

I could not retrieve the password, but after upgrading the plugins it works just fine when using:

println( hudson.util.Secret.decrypt( "{....}") )

I am assuming it was some weird issue in the plugin, since everything seems to work fine after the upgrade. The bot is logged in to the account mw-jenkinsbot and uses the nickname wmf-insecte.

I have stored the password in the release engineering secret store.

And to confirm, the bot did notify about a build failure:

[11:20:11] <wmf-insecte> Project beta-update-databases-eqiad build #50435: STILL FAILING in 1 min 53 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/50435/

There is still a SEVERE log entry when parsing the IRCv3 AWAY notification. That has already been addressed upstream in the IRC library (pircbotx), but the plugin still has to be updated. Filed as T283009.