Abstract CopyPatrol code to not assume everything is English Wikipedia
Closed, Resolved | Public | 8 Estimated Story Points

Description

Right now CopyPatrol assumes it is always interacting with the English Wikipedia. We should abstract any code related to communication with the wiki database and allow it to be configurable.

Event Timeline

kaldari set the point value for this task to 8. (Sep 12 2016, 7:01 PM)
kaldari moved this task from New & TBD Tickets to Up Next (June 3-21) on the Community-Tech board.

MusikBot has been updated to work on French Wikipedia: https://github.com/MusikAnimal/MusikBot/commit/44b6f7d43b136f6281a2aa7de794847f0e110d68

Things left relevant to MusikBot:

  • Create the new copyright_diffs table for frwiki. I currently have it set in the bot to be copyright_diffs_frwiki but it can be called anything
  • WikiProjects on frwiki are linked like Projet:Médecine (they have their own namespace). We'll need to modify our code to strip out Projet: like we do with WikiProject_
  • MusikBot already has another task running on frwiki, but maybe with the advent of the CopyPatrol task we should get the bot flag to get around any potential API limitations. Right now the bot seems to be doing fine without the flag
  • Once everything is in place I'll add ruby copypatrol.rb --project fr.wikipedia to the cronjob script
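As a rough illustration of the last point, here is a minimal sketch of how a `--project` flag might be parsed and mapped to a language code. The use of Ruby's standard `optparse`, the `parse_project` helper, and the `en.wikipedia` default are all assumptions for illustration, not taken from MusikBot's actual code.

```ruby
require 'optparse'

# Hypothetical helper: parse ARGV-style arguments and derive the language
# code from a project name such as "fr.wikipedia".
def parse_project(argv)
  options = { project: 'en.wikipedia' } # assumed default, not from the thread
  OptionParser.new do |opts|
    opts.on('--project PROJECT', 'Wiki to patrol, e.g. fr.wikipedia') do |value|
      options[:project] = value
    end
  end.parse(argv)
  # "fr.wikipedia" -> "fr"; assumes the language code is the first segment
  options[:lang] = options[:project].split('.').first
  options
end
```

With this sketch, `parse_project(['--project', 'fr.wikipedia'])` yields a hash whose `:lang` is `'fr'`.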

We have a problem. Eranbot puts all records in copyright_diffs, using the lang field to indicate which wiki the record belongs to. We also have to modify the Community Tech Bot script so that it does not mark such records as false positives, which is what it is doing right now.

Ah, I see. I'll update MusikBot accordingly. If we change EnwikiDao to only return records where lang = 'en' then the auto-review logic by Community Tech bot should be OK, right?

I haven't looked at the script in Community Tech bot that reviews records but as part of the abstraction step I changed EnwikiDao into a generic Dao.
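The lang-scoped lookup discussed above could look roughly like the following sketch. The `DiffQuery` class and its method names are hypothetical; only the copyright_diffs table and its lang column come from this thread. It builds SQL with a placeholder and bind values rather than executing anything.

```ruby
# Hypothetical query builder; not the actual CopyPatrol Dao.
class DiffQuery
  def initialize(lang)
    @lang = lang
  end

  # Returns [sql, binds] so the caller can hand them to a prepared statement.
  def records_sql
    ['SELECT * FROM copyright_diffs WHERE lang = ?', [@lang]]
  end

  # In-memory equivalent: keep only records that belong to this wiki,
  # so a reviewer bot never touches another wiki's rows.
  def filter(records)
    records.select { |record| record[:lang] == @lang }
  end
end
```

Passing the language in at construction time means any bot or page built on such a class only ever sees rows for its own wiki.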

Niharika added subscribers: Samwilson, Niharika.

The initial code I wrote up for this is at https://github.com/wikimedia/CopyPatrol/pull/31
@Samwilson - It'll be great if you want to take it on. I'm still having a really messed up install. :/

A few notes about WikiProjects, which you may already be aware of:

  • We'll want some config object to specify what prefix to use when searching for WikiProjects. On enwiki this is Wikipedia:WikiProject , and on French Wikipedia WikiProjects have their own namespace, so just Projet:
  • WikiProjects are currently stored in the wikiprojects table with the prefix WikiProject_, which we remove. This won't interfere with frwiki since there is no prefix (the namespace is not included), but thinking ahead we should probably only store the name of the WikiProject itself. That would require removing WikiProject_ from all values in wikiprojects, and then updating MusikBot accordingly.
  • Similar to how you'd SELECT records by lang in the copyright_diffs table, you'd use wp_lang for the wikiprojects table.
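A minimal sketch of the config object suggested in these notes, assuming a plain hash keyed by language code. The names `WIKIPROJECT_PREFIXES`, `wikiproject_name`, and `wikiprojects_sql` are hypothetical; the prefixes, the wikiprojects table, and the wp_lang column are the ones mentioned in this thread.

```ruby
# Assumed per-wiki search prefixes (hypothetical structure).
WIKIPROJECT_PREFIXES = {
  'en' => 'Wikipedia:WikiProject ',
  'fr' => 'Projet:'
}.freeze

# Strip the wiki-specific prefix so only the WikiProject's own name is stored.
def wikiproject_name(title, lang)
  title.delete_prefix(WIKIPROJECT_PREFIXES.fetch(lang))
end

# Returns [sql, binds] for selecting one wiki's rows via wp_lang.
def wikiprojects_sql(lang)
  ['SELECT * FROM wikiprojects WHERE wp_lang = ?', [lang]]
end
```

Storing only the bare name, as the second note proposes, keeps the stored values free of any language-specific prefix.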

It seems that French Wikipedia doesn't have a Draft NS. Should we just hide that filter option? (I think so.) For future expansion, is it worth checking for the existence of a draft namespace and showing or hiding the option accordingly? (I think probably not.)

https://fr.wikipedia.org/wiki/Aide:Brouillon suggests that draft articles should be created under one's user page as /Brouillon1, /Brouillon2, etc.

Maybe just add a config flag for it so we can hide/unhide it based on whether wiki has draft NS?

90% of drafts are Utilisateur:Name/Brouillon, so you can check if the page is in the User NS and contains "brouillon" in the title.

@Samwilson: I don't think it's going to be feasible to support custom draft set-ups for every wiki. My suggestion would be to have a config variable for each language project that is set to true if the wiki hosts drafts at namespace 118. If that's the case, show the filtering option, otherwise, suppress all draft related code.
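A sketch of that config-variable approach, assuming a simple per-language hash. The names `HAS_DRAFT_NAMESPACE` and `show_draft_filter?` are hypothetical, and the true/false values only reflect what is said in this thread about enwiki and frwiki.

```ruby
DRAFT_NAMESPACE_ID = 118

# Assumed per-wiki flag: true only where drafts are hosted in namespace 118.
HAS_DRAFT_NAMESPACE = {
  'en' => true,
  'fr' => false
}.freeze

# Wikis without an entry are treated as having no draft namespace.
def show_draft_filter?(lang)
  HAS_DRAFT_NAMESPACE.fetch(lang, false)
end
```

Defaulting unknown wikis to `false` means a newly added language simply gets no drafts option until someone opts it in.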

The option to show only Drafts was in the interest of the Articles for Creation program on enwiki (see T139542). I'm not sure if frwiki has something similar, but if not, we shouldn't feel like we should go out of our way to add this functionality.

Perhaps you are talking about this page. But I am not sure that differentiating between articles and drafts is necessary.

Or is it worth generalising it to allow searching with any particular namespace? Replace the 'drafts' checkbox with a dropdown that lists all namespaces, I mean. We could cache that list (per language).

@Samwilson: Since most wikis don't even have a namespace for drafts, I'm not sure it would be that useful. Most wikis are only going to be worried about plagiarism in namespace 0, and a few will want to check drafts as well, but I don't imagine it would have much utility for other namespaces.

It now will only show the drafts checkbox if there are drafts to be searched. This avoids any language-specific code.
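One way to read "if there are drafts to be searched" is a check on the records themselves, as in the sketch below. This is an interpretation, not the code from the PR: the `drafts_present?` helper and the `:page_ns` field name are hypothetical, and 118 is the draft namespace ID mentioned earlier in this thread.

```ruby
DRAFT_NS = 118 # draft namespace ID, as mentioned earlier in this thread

# records: array of hashes with a :page_ns key (hypothetical field name).
# True if at least one record sits in the draft namespace.
def drafts_present?(records)
  records.any? { |record| record[:page_ns] == DRAFT_NS }
end
```

A view could then render the drafts checkbox only when `drafts_present?` returns true for the current wiki's records, without any per-language setting.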

PR is at https://github.com/wikimedia/CopyPatrol/pull/32 (by the way, should I squash this all into one commit?)

Up to you. It does seem like a large number of commits, so squashing them into one might be cleaner.

Okay, all squashed and repushed as PR #34.

The multiwiki branch is now deployed to the staging site: https://tools.wmflabs.org/plagiabot/