[[ https://github.com/valhallasw/plagiabot | Plagiabot ]] (which runs as [[ https://en.wikipedia.org/wiki/User:EranBot | User:EranBot ]]) is the bot responsible for feeding data to [[ https://copypatrol.toolforge.org | CopyPatrol ]], and it was written in Python 2. As of October 16, 2021, it appears to have gone down because it was still using the [[ https://www.mediawiki.org/wiki/MediaWiki_1.37/Deprecation_of_legacy_API_token_parameters | legacy method of obtaining tokens ]] in the MediaWiki API. A fix (rPWBCc0cf17) is available in pywikibot 6.6.1, but that release only supports Python 3. Python 2 has passed its EOL anyway, and eventually will not be supported on Toolforge at all.
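For context, the non-legacy way to obtain a token is a `action=query&meta=tokens` request. The following is only a rough sketch of what such a request and its response handling look like. The helper names `token_query_url` and `extract_token` are made up for illustration, and this is not how pywikibot implements it internally.

```python
from urllib.parse import urlencode

# Assumption: English Wikipedia's API endpoint, used here purely as an example.
API_URL = "https://en.wikipedia.org/w/api.php"


def token_query_url(token_type="csrf", api_url=API_URL):
    """Build the URL for a token request via action=query&meta=tokens."""
    params = {
        "action": "query",
        "meta": "tokens",
        "type": token_type,
        "format": "json",
    }
    return f"{api_url}?{urlencode(params)}"


def extract_token(response, token_type="csrf"):
    """Pull the token out of an already-decoded JSON response."""
    return response["query"]["tokens"][f"{token_type}token"]
```

A real client would send this request within a logged-in session and pass the returned token along with the write request.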
This task is to track writing a new CopyPatrol backend to replace Plagiabot.
[ ] The newly rewritten bot should live on the `copypatrol` tool rather than `eranbot`, a decision made at T306888#7893572.
Backend using Python 3 and XML-RPC API:
[x] The new code should have configurable database credentials for tools-db, and not rely on replica.my.cnf as the current code does ~~(this is because we will still need to use `eranbot`'s database credentials)~~.
[x] split recent changes checking from checking for iThenticate updates to increase stability
[x] new database schema to accommodate the above and not store iThenticate findings in a blob that needs to be parsed with regex
[x] database migration script
[ ] tests
[x] unit tests
[x] integration tests
[ ] ~~CI~~
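As an illustration of the configurable database credentials item above, here is a minimal sketch that reads credentials from an INI-style config instead of replica.my.cnf. The section name, key names, host, and the `load_db_settings` helper are all hypothetical; the actual code may use a different format entirely.

```python
import configparser

# Hypothetical config contents; every value here is a placeholder.
EXAMPLE_CONFIG = """
[database]
host = tools-db.example
user = example_user
password = example%password
name = example_db
"""


def load_db_settings(text):
    """Parse database settings from INI-style text into a plain dict."""
    # interpolation=None keeps '%' characters in passwords intact.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["database"]
    return {
        "host": section["host"],
        "user": section["user"],
        "password": section["password"],
        "database": section["name"],
    }
```

Keeping the credentials in a separate file like this would let the tool point at any tools-db account without code changes.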
Updated backend using REST API:
[ ] new database schema
[ ] database migration script
[ ] tests
[ ] unit tests
[ ] integration tests
[ ] CI
```lang=sql,name=planned schema summary for XML-RPC API
> describe diffs;
+------------------+------------------+------+-----+---------+----------------+
| Field            | Type             | Null | Key | Default | Extra          |
+------------------+------------------+------+-----+---------+----------------+
| diff_id          | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| project          | varchar(20)      | NO   | MUL | NULL    |                |
| lang             | varchar(20)      | NO   |     | NULL    |                |
| page_namespace   | int(11)          | NO   |     | NULL    |                |
| page_title       | varchar(255)     | NO   |     | NULL    |                |
| rev_id           | int(10) unsigned | NO   |     | NULL    |                |
| rev_timestamp    | char(14)         | NO   |     | NULL    |                |
| document_id      | int(10) unsigned | YES  | MUL | NULL    |                |
| report_id        | int(10) unsigned | YES  | MUL | NULL    |                |
| status           | tinyint(4)       | NO   |     | NULL    |                |
| status_timestamp | char(14)         | NO   |     | NULL    |                |
| status_user      | varchar(255)     | YES  |     | NULL    |                |
+------------------+------------------+------+-----+---------+----------------+
> describe report_urls;
+------------+------------------+------+-----+---------+-------+
| Field      | Type             | Null | Key | Default | Extra |
+------------+------------------+------+-----+---------+-------+
| report_id  | int(10) unsigned | NO   | MUL | NULL    |       |
| url        | text             | NO   |     | NULL    |       |
| percent    | int(10) unsigned | NO   |     | NULL    |       |
| word_count | int(10) unsigned | NO   |     | NULL    |       |
+------------+------------------+------+-----+---------+-------+
```
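To show how the two tables above could replace regex parsing of a blob, here is a small sketch using an in-memory SQLite database. The column types are only SQLite approximations of the MariaDB types, the join on `report_id` is inferred from the column names, the `sources_for_revision` helper is hypothetical, and all inserted rows are made-up sample data.

```python
import sqlite3

# SQLite approximation of the planned schema (types simplified).
SCHEMA = """
CREATE TABLE diffs (
    diff_id INTEGER PRIMARY KEY AUTOINCREMENT,
    project TEXT NOT NULL,
    lang TEXT NOT NULL,
    page_namespace INTEGER NOT NULL,
    page_title TEXT NOT NULL,
    rev_id INTEGER NOT NULL,
    rev_timestamp TEXT NOT NULL,
    document_id INTEGER,
    report_id INTEGER,
    status INTEGER NOT NULL,
    status_timestamp TEXT NOT NULL,
    status_user TEXT
);
CREATE TABLE report_urls (
    report_id INTEGER NOT NULL,
    url TEXT NOT NULL,
    percent INTEGER NOT NULL,
    word_count INTEGER NOT NULL
);
"""


def sources_for_revision(conn, rev_id):
    """Return (url, percent, word_count) rows for a revision, best match first."""
    rows = conn.execute(
        "SELECT r.url, r.percent, r.word_count "
        "FROM diffs d JOIN report_urls r ON r.report_id = d.report_id "
        "WHERE d.rev_id = ? ORDER BY r.percent DESC",
        (rev_id,),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Sample data only; the status value 0 is a placeholder with no assumed meaning.
conn.execute(
    "INSERT INTO diffs (project, lang, page_namespace, page_title, rev_id, "
    "rev_timestamp, report_id, status, status_timestamp) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    ("wikipedia", "en", 0, "Example", 12345, "20211016000000", 1, 0, "20211016000000"),
)
conn.executemany(
    "INSERT INTO report_urls VALUES (?, ?, ?, ?)",
    [(1, "https://example.org/a", 40, 120), (1, "https://example.org/b", 75, 300)],
)
```

With one row per matched URL, the sources for a diff come back from a plain join rather than from parsing stored report text.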