
Write new CopyPatrol backend to replace Plagiabot
Closed, Resolved · Public

Description

Plagiabot (which runs as User:EranBot) was written in Python 2 and is the bot responsible for feeding data to CopyPatrol. As of October 16, 2021, it appeared to go down because it was still using the legacy method of obtaining tokens from the MediaWiki API. A fix (rPWBCc0cf17) is available in pywikibot 6.6.1, but that release supports only Python 3. Python 2 has passed its EOL anyway and will eventually not be supported on Toolforge at all.
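
For context, a minimal sketch of the current way to request an edit (CSRF) token, via `action=query&meta=tokens`. This is not the pywikibot fix itself; the endpoint URL and the helper names `csrf_token_request_url` and `extract_csrf_token` are illustrative assumptions, and the HTTP request is left out.

```python
from urllib.parse import urlencode

# Example endpoint only; any MediaWiki api.php URL has the same shape.
API_URL = "https://en.wikipedia.org/w/api.php"


def csrf_token_request_url(api_url: str = API_URL) -> str:
    """Build the URL for the current token request (meta=tokens)."""
    params = {
        "action": "query",
        "meta": "tokens",
        "type": "csrf",
        "format": "json",
    }
    return f"{api_url}?{urlencode(params)}"


def extract_csrf_token(response: dict) -> str:
    """Pull the token out of a decoded JSON response body."""
    return response["query"]["tokens"]["csrftoken"]
```

A logged-out request returns the placeholder token `+\`; a real bot would send the request through an authenticated session.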

This task is to track writing a new CopyPatrol backend to replace Plagiabot.

  • The newly rewritten bot should live on the copypatrol tool rather than eranbot, a decision made at T306888#7893572.

Backend using Python3 and XML-RPC API:

  • The new code should have configurable database credentials for tools-db, and not rely on replica.my.cnf as the current code does (this is because we will still need to use eranbot's database credentials).
  • split recent-changes checking from checking for iThenticate updates to increase stability
  • new database schema to accommodate the above and not store iThenticate findings in a blob that needs to be parsed with regex
  • database migration script
  • tests
    • unit tests
    • integration tests
    • CI
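
One way the configurable-credentials item could look, as a rough sketch rather than the actual implementation: read the tools-db credentials from the bot's own INI-style config instead of replica.my.cnf. The section name `database`, the key names, and the helper `load_db_credentials` are all hypothetical.

```python
import configparser
from dataclasses import dataclass


@dataclass(frozen=True)
class DbCredentials:
    host: str
    database: str
    user: str
    password: str


def load_db_credentials(config_text: str, section: str = "database") -> DbCredentials:
    """Parse credentials from INI text (hypothetical layout)."""
    # interpolation=None so a '%' in a password is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    db = parser[section]
    return DbCredentials(
        host=db["host"],
        database=db["name"],
        user=db["user"],
        password=db["password"],
    )
```

Because the credentials come from the tool's own config, the code can run under the copypatrol tool while still connecting with another tool's database user.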

Updated backend using REST API:

  • new database schema
  • database migration script
  • tests
    • unit tests
    • integration tests
    • CI

Planned schema summary:
> describe diffs;
+------------------+------------------+------+-----+---------+----------------+
| Field            | Type             | Null | Key | Default | Extra          |
+------------------+------------------+------+-----+---------+----------------+
| diff_id          | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| project          | varbinary(20)    | NO   | MUL | NULL    |                |
| lang             | varbinary(20)    | NO   |     | NULL    |                |
| page_namespace   | int(11)          | NO   |     | NULL    |                |
| page_title       | varbinary(255)   | NO   |     | NULL    |                |
| rev_id           | int(10) unsigned | NO   |     | NULL    |                |
| rev_parent_id    | int(10) unsigned | NO   |     | NULL    |                |
| rev_timestamp    | binary(14)       | NO   |     | NULL    |                |
| rev_user_text    | varbinary(255)   | NO   |     | NULL    |                |
| submission_id    | varbinary(36)    | YES  | UNI | NULL    |                |
| status           | tinyint(4)       | NO   | MUL | NULL    |                |
| status_timestamp | binary(14)       | YES  |     | NULL    |                |
| status_user_text | varbinary(255)   | YES  |     | NULL    |                |
+------------------+------------------+------+-----+---------+----------------+

> describe report_sources;
+---------------+------------------+------+-----+---------+----------------+
| Field         | Type             | Null | Key | Default | Extra          |
+---------------+------------------+------+-----+---------+----------------+
| source_id     | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| submission_id | varbinary(36)    | NO   | MUL | NULL    |                |
| description   | blob             | NO   |     | NULL    |                |
| url           | blob             | YES  |     | NULL    |                |
| percent       | float unsigned   | NO   |     | NULL    |                |
+---------------+------------------+------+-----+---------+----------------+
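
The `binary(14)` timestamp columns above look like MediaWiki-style `YYYYMMDDHHMMSS` values; that is an assumption, as is UTC storage. Under that assumption, a small sketch of converting to and from Python datetimes (the helper names are illustrative):

```python
from datetime import datetime, timezone

_FORMAT = "%Y%m%d%H%M%S"  # 14 characters, assumed MediaWiki-style


def to_db_timestamp(dt: datetime) -> bytes:
    """Convert an aware datetime to a 14-byte UTC timestamp."""
    return dt.astimezone(timezone.utc).strftime(_FORMAT).encode("ascii")


def from_db_timestamp(raw: bytes) -> datetime:
    """Convert a 14-byte timestamp back to an aware UTC datetime."""
    return datetime.strptime(raw.decode("ascii"), _FORMAT).replace(tzinfo=timezone.utc)
```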

Event Timeline

Restricted Application added subscribers: Cyberpower678, Aklapper.

IABot ≠ Plagiabot. @Cyberpower678 you may wish to tweak H247.

Where is the repository for this script?

JJMC89 renamed this task from "CopyPatrol: port Plagiabot to Python 3" to "Write new CopyPatrol backend to replace Plagiabot". Jul 16 2022, 4:06 PM
JJMC89 changed the task status from Open to In Progress.
JJMC89 claimed this task.
JJMC89 updated the task description.
JJMC89 updated the task description.

@JJMC89 Wanted to let you know that Turnitin informed us there's a more modern REST version of the iThenticate API that supposedly is coming out soon-ish. This shouldn't change most of your code too much, but should make responses easier to work with. We meet with them in early August and I can get more details then. For now, I guess just keep that in mind, though I suspect the old XML-RPC based API will still continue to function anyways.

Thanks again for taking this on! As you know I don't have much Python expertise but I'm here if you need me for anything :)

@MusikAnimal Did you get any information about the REST API?

> @MusikAnimal Did you get any information about the REST API?

I did just recently, actually, and have been meaning to reach out to you! Right now the new developer portal is provisioned. I have given them your GitHub username (which is what they use for username creation) as well as your email, so keep an eye out for an email with instructions. Thank you again for taking on this rewrite. You rock! If you have any questions about the API specifically, I'm sure the contact who emails you can answer them, but don't hesitate to ping me either. I haven't personally dabbled with the new API yet but what I did see seemed well-documented.

Thanks, @MusikAnimal. I have access to the developer portal now.

The API documented there (Turnitin Core API) doesn't appear to be for the same service as the XML-RPC iThenticate API that we are currently using.

Even if it is, I don't have the necessary credentials for it.

@JJMC89 Apologies for the long wait (I was away at a team off-site). I have been given administrative access to the new iThenticate sandbox. I created an account for you and you should be receiving an email soon. I see you were CC'd on the email I got, so I assume you have all the other info you need? In particular the guide for iThenticate v2 can be found at https://developers.turnitin.com/turnitin-core-api/information-for-ithenticate-integrators

Let me know if there are any issues, and as always thank you for your time!

JJMC89 changed the task status from Stalled to In Progress. Nov 4 2022, 5:40 AM
JJMC89 updated the task description.

@MusikAnimal, some questions before I finalize this for you to update the front end:

MariaDB [s51306__copyright_p]> select distinct status from copyright_diffs;
+---------------+
| status        |
+---------------+
| null          |
| NULL          |
| false         |
| fixed         |
| falsepositive |
|               |
| 0             |
+---------------+
7 rows in set (0.782 sec)
  1. In the migration, I am currently mapping fixed to 2 (page fixed), false and falsepositive to 1 (no action needed), and everything else to 0 (pending review). Given the age of entries ending up as pending review, I don't think that is completely correct. Please confirm how each value should be mapped.
  2. The planned database tables use a mix of binary and char. Would this be an issue for the front end?
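
The mapping in question 1 can be sketched as below. This is only an illustration of the rule as stated, not the migration script; the names `Status` and `normalize_status` are hypothetical.

```python
from enum import IntEnum
from typing import Optional


class Status(IntEnum):
    PENDING = 0    # pending review
    NO_ACTION = 1  # no action needed
    FIXED = 2      # page fixed


# Legacy string values with a known meaning.
_LEGACY_MAP = {
    "fixed": Status.FIXED,
    "false": Status.NO_ACTION,
    "falsepositive": Status.NO_ACTION,
}


def normalize_status(legacy: Optional[str]) -> Status:
    """Map a legacy status value to the new integer status.

    SQL NULL (None here) and every unrecognized string, such as
    'null', '' or '0', fall back to pending review.
    """
    if legacy is None:
        return Status.PENDING
    return _LEGACY_MAP.get(legacy, Status.PENDING)
```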

I partially migrated the current s51306__copyright_p to s52615__copypatrol_migrate_test_01_p if you want to review it.

> @MusikAnimal, some questions before I finalize this for you to update the front end:

Apologies again for the delay in my response!

> 1. In the migration, I am currently mapping fixed to 2 (page fixed), false and falsepositive to 1 (no action needed), and everything else to 0 (pending review).

Fantastic. There's no reason for these to be stored as strings! Should this be a TINYINT instead of INT(11)?

> Given the age of entries ending up as pending review, I don't think that is completely correct. Please confirm how each value should be mapped.

NULL are the ones waiting for review, false is no action needed, and fixed is page fixed. I've manually gone through and cleaned up all the other values. They all seemed quite old, so I don't think there's much risk of them being introduced. It doesn't hurt to have your normalization though, just in case.

> The planned database tables use a mix of binary and char. Would this be an issue for the front end?

I don't think so, but I'm not sure! If it is, I'm sure we can fix it in the app accordingly, or change the schema again if we need to. Don't let it stop you from moving forward :) Thank you, as always!

> Fantastic. There's no reason for these to be stored as strings! Should this be a TINYINT instead of INT(11)?

Yes

> NULL are the ones waiting for review, false is no action needed, and fixed is page fixed. I've manually gone through and cleaned up all the other values. They all seemed quite old, so I don't think there's much risk of them being introduced. It doesn't hurt to have your normalization though, just in case.

Thanks

> I don't think so, but I'm not sure! If it is, I'm sure we can fix it in the app accordingly, or change the schema again if we need to. Don't let it stop you from moving forward :) Thank you, as always!

I've updated the schema to make everything binary.

New test migration at s52615__copypatrol_migrate_test_02_p.

repo

Let's track updating the front end separately.