OTRS spam classification methods and systems
Open, Normal, Public

Description

Disclaimer: I am not yet familiar with the working of WMF mail ops, who's doing what. Feel free to share the info I may need to go on.

It is an ongoing problem that mail in non-English OTRS languages gets higher spam scores than normal, seemingly because the Bayesian scoring is English-based. IMNSHO it is not a good direction to try to hack around this with OTRS filtering, since that would require the system to dissect the final score into its components, or other horrors. I do not know how much the incoming mail has in common with other parts of WMF email, but I suggest that OTRS should get separate SA scoring, or at least a separate Bayes DB.

If that's the case (which I hope it is) then the bayesian db needs serious training with non-English non-spam email. I am obviously interested in Hungarian training but others mentioned that the problem is not unique to hu.

There is an OTRS Secret Agent™ only discussion embryo on https://otrs-wiki.wikimedia.org/wiki/Administrator_requests#OTRS_spam_filtering but it basically requires someone who works with the OTRS incoming MTA, the SA it uses and its config. (Some smaller problems were noticed in the scoring, too, or at least a few questions were raised.)

You may use my (limited) resources if you need them; I have been administering mail servers for the last few decades, among other things.

Event Timeline

grin created this task. Sep 29 2016, 8:26 AM
Restricted Application added a subscriber: Aklapper. Sep 29 2016, 8:26 AM
Restricted Application added subscribers: TerraCodes, Matthewrbowker, Rjd0060. Sep 29 2016, 8:33 AM
pajz added a subscriber: pajz. (Edited) Oct 6 2016, 10:59 AM

Now, I can't say anything definite given the relevant servers are operated by the WMF, so I suppose only they'd be able to provide perfectly up-to-date information, but let me just add that the key Spamassassin configuration choices were made when we upgraded to OTRS 3. Essentially, there is a Bayes DB for mail routed to our OTRS instance. It is trained daily with email from our queues (Junk for spam, ordinary queues for ham) using a nightly-running export script in OTRS. See also https://wikitech.wikimedia.org/wiki/OTRS. Hence it's not like non-English mail isn't considered; however, naturally, it's outweighed by the much greater amount of English-language mail. This means we're basically feeding the Bayes DB with a very small amount of, say, Hungarian email. You would indeed expect this to generate non-ideal scoring results when it comes to queues for smaller languages.
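
As a rough illustration of the nightly training described above, the sketch below only builds the `sa-learn` command lines that such an export step might run: `--spam` for the Junk queue and `--ham` for everything else. The export directory layout, the queue names and the helper name `sa_learn_command` are assumptions made for this example; the actual export script is not shown in this task.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: only the "Junk" queue is treated as spam, as described in the comment above.
SPAM_QUEUES = {"Junk"}


def sa_learn_command(queue: str, export_dir: Path) -> list[str]:
    """Build (but do not run) an sa-learn invocation for one exported queue.

    Assumes each queue was exported to its own directory of message files.
    """
    label = "--spam" if queue in SPAM_QUEUES else "--ham"
    return ["sa-learn", label, str(export_dir / queue)]


if __name__ == "__main__":
    # Hypothetical export location and queue names, for illustration only.
    for queue in ("Junk", "info-hu"):
        print(" ".join(sa_learn_command(queue, Path("/tmp/otrs-export"))))
```

With a layout like this, the ratio of Hungarian ham to English ham fed into the DB is simply the ratio of messages in the exported directories, which is where the imbalance comes from.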

That said, I'm not sure there's a viable solution to this disparity. Off the top of my head, I would say that perhaps it's possible to exclude emails sent to certain smaller-language email addresses (say, info-hu@wikimedia.org) from Bayes scoring altogether and, accordingly, tell the OTRS export script not to export such emails to Spamassassin. (The current work-around in place for some queues instead doesn't involve altering the Bayes scores but merely tells OTRS to create a given ticket with a given To address in a given queue, and in doing so to ignore all Spamassassin headers.) That way these emails could still benefit from scores based on other, non-Bayes Spamassassin rules.

grin added a comment. Oct 7 2016, 2:23 PM

Now, I can't say anything definite given the relevant servers are operated by the WMF, so I suppose only they'd be able to provide perfectly up-to-date information,

I definitely welcome WMF ops shining some light on the architecture and possibilities. (Thanks for the architecture link; I have read it and it encouraged me to contact the WMF people and get involved.)

Essentially, there is a Bayes DB for mail routed to our OTRS instance. It is trained daily with email from our queues (Junk for spam, ordinary queues for ham) using a nightly-running export script in OTRS.

Thanks, I'll spend some time checking whether all ham gets into the DB as it should.

Hence it's not like non-English mail isn't considered; however, naturally, it's outweighed by the much greater amount of English-language mail. This means we're basically feeding the Bayes DB with a very small amount of, say, Hungarian email. You would indeed expect this to generate non-ideal scoring results when it comes to queues for smaller languages.

Indeed, or even larger ones as I've heard.

That said, I'm not sure there's a viable solution to this disparity. Off the top of my head, I would say that perhaps it's possible to exclude emails sent to certain smaller-language email addresses (say, info-hu@wikimedia.org) from Bayes scoring altogether and, accordingly, tell the OTRS export script not to export such emails to Spamassassin. (The current work-around in place for some queues instead doesn't involve altering the Bayes scores but merely tells OTRS to create a given ticket with a given To address in a given queue, and in doing so to ignore all Spamassassin headers.) That way these emails could still benefit from scores based on other, non-Bayes Spamassassin rules.

I can think of several approaches.

For example, the scoring decision is made in Exim, and it is easy to use separate SA instances for different patterns of local parts. This could be an ideal solution: a separate Bayes DB for each language. (A problem arises with late training, since an email may be moved between languages within OTRS, but this isn't impossible to solve; I just haven't thought it through yet.)
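
The local-part idea can be sketched roughly as follows. This is not Exim configuration, just the mapping logic in Python: a recipient local part such as `info-hu` is mapped to a per-language SA user (and hence a per-language Bayes DB), with everything else falling back to a shared default. The `otrs-<lang>` naming, the language list and the function name `bayes_user_for` are hypothetical.

```python
import re

# Assumption: language queues use addresses of the form <name>-<lang>, e.g. info-hu.
LANG_SUFFIX = re.compile(r"^[a-z]+-([a-z]{2,3})$")

# Hypothetical set of languages that get their own Bayes DB.
SEPARATE_LANGS = {"hu", "de", "fr"}


def bayes_user_for(localpart: str, default: str = "otrs") -> str:
    """Return the SA user name whose Bayes DB should score this recipient."""
    match = LANG_SUFFIX.match(localpart.lower())
    if match and match.group(1) in SEPARATE_LANGS:
        return f"otrs-{match.group(1)}"
    return default


if __name__ == "__main__":
    for lp in ("info-hu", "info", "permissions-commons"):
        print(lp, "->", bayes_user_for(lp))
```

The same mapping would have to be applied at training time, which is where the late-training problem mentioned above shows up: a ticket moved to another language queue should be learned into that queue's DB, not the one it was originally scored with.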

I see no point in actually skipping Bayes once we have already separated the SA instances.

The simplistic solution is to lower the scores of the Bayes rules, which would reduce their weight in the total score.
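
To make the effect of lowering the Bayes scores concrete, here is a small arithmetic sketch. The rule scores below are made-up illustrative numbers, not the values configured on the WMF servers, and `total_score` is a hypothetical helper: it sums the scores of the rules that hit and scales only the `BAYES_*` ones by a weight.

```python
from __future__ import annotations


def total_score(hits: dict[str, float], bayes_weight: float = 1.0) -> float:
    """Sum rule scores, scaling only BAYES_* rules by bayes_weight."""
    return sum(
        score * bayes_weight if name.startswith("BAYES_") else score
        for name, score in hits.items()
    )


if __name__ == "__main__":
    # Illustrative scores only (assumed for the example).
    hits = {"BAYES_99": 3.5, "HTML_MESSAGE": 0.5, "SOME_OTHER_RULE": 1.0}
    print(total_score(hits))                    # Bayes at full weight
    print(total_score(hits, bayes_weight=0.5))  # Bayes at half weight
```

With these example numbers, a legitimate Hungarian mail that is wrongly tagged by Bayes drops from 5.0 to 3.25 when the Bayes weight is halved. The trade-off is that real spam caught mainly by Bayes is scored lower too.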

I started to write about using BAYES_ tags in the header for filtering but it turned out even more hackish than I would find acceptable. :-)
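
For reference, the header-based hack would amount to something like the sketch below: pulling the `BAYES_*` rule names out of the `tests=` field of an `X-Spam-Status` header value. It assumes the usual `tests=RULE_A,RULE_B ...` layout, possibly folded across lines; the function name `bayes_tags` is hypothetical, and this only illustrates why the approach gets hackish, it is not a recommendation.

```python
from __future__ import annotations

import re


def bayes_tags(x_spam_status: str) -> list[str]:
    """Return the BAYES_* rule names listed in an X-Spam-Status header value."""
    # Undo header folding inside the comma-separated rule list.
    flat = re.sub(r",\s+", ",", x_spam_status)
    match = re.search(r"tests=(\S*)", flat)
    if not match:
        return []
    return [name for name in match.group(1).split(",") if name.startswith("BAYES_")]


if __name__ == "__main__":
    header = (
        "Yes, score=5.2 required=5.0 tests=BAYES_99,HTML_MESSAGE,\n"
        "\tRCVD_IN_DNSWL_MED autolearn=no"
    )
    print(bayes_tags(header))
```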

elukey triaged this task as Normal priority. Oct 20 2016, 1:32 PM

It looks like the anti-spam mechanism is not working as expected. Almost all (90%) of the OTRS emails I currently get are spam emails, to the point that I don't feel like contributing any longer.

A lot of them even have negative spam scores due to https://wiki.apache.org/spamassassin/Rules/RCVD_IN_DNSWL_MED.

akosiaris moved this task from Incoming to Backlog on the OTRS board. Oct 11 2017, 11:31 AM