Merge wikitext checkers
Closed, Resolved · Public

Description

@Yaron_Koren suggested on Hangouts that the wikitext checkers be merged in so that the lack of one kind of wikitext can compensate for another.

Event Timeline

polybuildr raised the priority of this task to Needs Triage.
polybuildr updated the task description.
polybuildr added a subscriber: polybuildr.
polybuildr set Security to None.
polybuildr updated the task description.
polybuildr added a subscriber: Yaron_Koren.
polybuildr added a subscriber: jan.

Change 221736 had a related patch set uploaded (by Polybuildr):
Merge different wikitext checkers into one

https://gerrit.wikimedia.org/r/221736

Concerns I have:

  1. Customizability is reduced. An admin can no longer choose to skip the checks on headings, templates, or internal links, or to give one of them a higher or lower weight. That could of course be reimplemented, but then should the checkers be merged in the first place? Answering that probably requires investigating what kinds of wikis exist. I currently know of only two: Wikipedia-like wikis and wiki databases. On a Wikipedia-like wiki, pages normally include headings, templates, and internal links, so their absence points to a likelihood of spam. On a wiki database, a page may contain only a template call or two, and in that case it would be a bad idea to check for the others. There are probably other kinds of wikis I'm not aware of that would need to check different things. The earlier customization let a sysop make that call; this change removes it (see the sketch below).
  2. The overall value returned by the checkers is lower. InternalLinksChecker, HeadingsChecker and TemplatesChecker each returned a high value for a page with zero instances of the corresponding wikitext, so the average across them ended up quite high, and the terms used (High, Very high, etc.) reflected that. For example, a page with no headings, no templates, and no internal links used to contribute three near-maximal scores to the average; the merged checker contributes that signal only once. This will require a re-evaluation of those terms and their respective values, and probably a change in the submitted commit to correct the values.

1 is a major concern, 2 is just something that should be looked at before merging.
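
To make concern 1 concrete, here is roughly the kind of knob that would be lost. This is a purely hypothetical sketch: $wgSmiteSpamCheckerWeights is not an existing SmiteSpam setting; it only illustrates what sysop-level tuning could look like.

```php
// Hypothetical configuration, for illustration of concern 1 only --
// not an existing SmiteSpam variable.
// Wikipedia-like wiki: all three wikitext signals are meaningful.
$wgSmiteSpamCheckerWeights = [
	'headings'      => 1.0,
	'templates'     => 1.0,
	'internalLinks' => 1.0,
];
// Wiki database: pages are often just a template call or two, so a
// sysop might silence the heading and internal-link checks entirely.
$wgSmiteSpamCheckerWeights = [
	'headings'      => 0.0,
	'templates'     => 1.0,
	'internalLinks' => 0.0,
];
```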

@Yaron_Koren, @jan, thoughts?

As you may know, I'm not that big a fan of customizability. In this case, Wikipedia-like sites can have simple template calls, and "wiki databases" can have WP-like articles - and I think the average spam page looks significantly different from both in some key ways.

True, the average spam page looks different from both. However, wouldn't you agree that having extra information about what kind of pages normally exist in a wiki makes it much easier to find a potential spam page?

That might be true; I haven't seen any evidence to support it.

I still think there's a very good reason to keep these separate (or at least to come up with another way to handle this issue), but since we're basically testing on a single wiki right now, let's optimize for that one. Once we're ready with a good, stable version of the extension, we could maybe test it on other wikis. So I'll wait for a +1 from @jan, and merge it after that.

@polybuildr Do you have an idea of how to use the information about what kind of wiki a site is to find spam? I think most spammers use similar spam pages on every wiki, because they normally do not think about the wiki's purpose.

About your change: the wikitext checker does not seem to work for me. I created a page with the content "gr 77rg 498o7g843g79tp4" and one heading. Your code should report that page as spam, because there is only one heading... When I add one link, the page is reported, but that should be the link checker, not the wikitext checker... The problem is that I cannot find any bug in your code :-(

@jan, about the change: there are two reasons that's happening. One is https://gerrit.wikimedia.org/r/#/c/219624/ and the other is line 70 of SmiteSpamAnalyzer.php: pages that have no external links and are small get ignored.

As for the other thing: I agree that spammers use similar spam independent of the wiki. My argument is that if we know what kind of wiki it is, it is easier to find spam pages by looking for deviations from the average page on that wiki.

"$wgSmiteSpamIgnorePagesWithNoExternalLinks = false;" seems not to help me...

@jan, try making your pages longer than 500 characters? Or change [[ https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FSmiteSpam/9b0103651873b4e3ab2e7a775290873287a57f98/includes%2FSmiteSpamAnalyzer.php#L71 | line 71 of SmiteSpamAnalyzer.php ]]. Any page that has no external links and is shorter than 500 characters is ignored.
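
For anyone following along, the skip described above amounts to something like this. It is a minimal sketch of the condition, not the extension's actual code; $page and its accessors are hypothetical names used for illustration.

```php
// Sketch of the filter around line 71 of SmiteSpamAnalyzer.php, as
// described above. The accessor names are hypothetical.
foreach ( $pages as $page ) {
	$noExternalLinks = $page->getExternalLinkCount() === 0; // hypothetical accessor
	$isSmall = strlen( $page->getText() ) < 500;            // hypothetical accessor
	if ( $noExternalLinks && $isSmall ) {
		continue; // skipped before any checker, including the merged one, runs
	}
	// ... run the checkers on $page ...
}
```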

@jan, added a config option in https://gerrit.wikimedia.org/r/#/c/222277/ and rebased on top of that. https://gerrit.wikimedia.org/r/#/c/221736/ should now work for you after setting $wgSmiteSpamIgnoreSmallPages = false.
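
For reference, the test setup would then need something like the following in LocalSettings.php. Only the two variable names already mentioned in this thread are used; everything else about the setup is assumed.

```php
// In LocalSettings.php, after loading SmiteSpam:
$wgSmiteSpamIgnoreSmallPages = false; // also analyze pages under the size threshold
$wgSmiteSpamIgnorePagesWithNoExternalLinks = false; // also analyze pages lacking external links
```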

Oh, I did not see that small pages are ignored. That was my fault... I have added my +1 now, too :-)

Change 221736 merged by jenkins-bot:
Merge different wikitext checkers into one

https://gerrit.wikimedia.org/r/221736