Plagiabot (which runs as User:EranBot) was written in Python 2 and is the bot responsible for feeding data to CopyPatrol. It appears to have gone down on October 16, 2021, because it was still using the legacy method of obtaining tokens in the MediaWiki API. A fix (rPWBCc0cf17) is available in pywikibot 6.6.1, but that release only supports Python 3. Python 2 has passed its EOL anyway, and eventually will not be supported on Toolforge at all.
This task is to track writing a new CopyPatrol backend to replace Plagiabot.
- The newly rewritten bot should live on the copypatrol tool rather than eranbot, a decision made at T306888#7893572.
Backend using Python 3 and XML-RPC API:
- The new code should have configurable database credentials for tools-db, and not rely on replica.my.cnf as the current code does (this is because we will still need to use eranbot's database credentials).
- split recent changes checking from checking for iThenticate updates to increase stability
- new database schema to accommodate the above and not store iThenticate findings in a blob that needs to be parsed with regex
- database migration script
- tests
  - unit tests
  - integration tests
- CI
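
As an illustration of the configurable-credentials item in the list above, here is a minimal sketch that reads tools-db credentials from an explicitly supplied INI section instead of relying on replica.my.cnf. The section name `copypatrol-db`, the key names, the `DbCredentials` class, and every value in the example config are hypothetical placeholders, not the names used by the actual code.

```python
import configparser
from dataclasses import dataclass


@dataclass(frozen=True)
class DbCredentials:
    """Connection settings for the tool database (hypothetical structure)."""
    host: str
    user: str
    password: str
    database: str


def load_db_credentials(text: str, section: str = "copypatrol-db") -> DbCredentials:
    """Parse credentials from INI-formatted text.

    Interpolation is disabled so that passwords containing '%' are read verbatim.
    A missing section or key raises KeyError rather than falling back silently.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    cfg = parser[section]
    return DbCredentials(
        host=cfg["host"],
        user=cfg["user"],
        password=cfg["password"],
        database=cfg["database"],
    )


# Placeholder values only; none of these are real credentials or hostnames.
EXAMPLE_CONFIG = """
[copypatrol-db]
host = tools-db.example
user = s00000
password = not-a-real-password
database = s00000__example_p
"""

creds = load_db_credentials(EXAMPLE_CONFIG)
print(creds.user, creds.database)
```

Keeping the credentials in their own section means the bot can run under the copypatrol tool while still connecting with a different account's database credentials.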
Updated backend using REST API:
- new database schema
- database migration script
- tests
  - unit tests
  - integration tests
- CI
> describe diffs;
+------------------+------------------+------+-----+---------+----------------+
| Field            | Type             | Null | Key | Default | Extra          |
+------------------+------------------+------+-----+---------+----------------+
| diff_id          | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| project          | varbinary(20)    | NO   | MUL | NULL    |                |
| lang             | varbinary(20)    | NO   |     | NULL    |                |
| page_namespace   | int(11)          | NO   |     | NULL    |                |
| page_title       | varbinary(255)   | NO   |     | NULL    |                |
| rev_id           | int(10) unsigned | NO   |     | NULL    |                |
| rev_parent_id    | int(10) unsigned | NO   |     | NULL    |                |
| rev_timestamp    | binary(14)       | NO   |     | NULL    |                |
| rev_user_text    | varbinary(255)   | NO   |     | NULL    |                |
| submission_id    | varbinary(36)    | YES  | UNI | NULL    |                |
| status           | tinyint(4)       | NO   | MUL | NULL    |                |
| status_timestamp | binary(14)       | YES  |     | NULL    |                |
| status_user_text | varbinary(255)   | YES  |     | NULL    |                |
+------------------+------------------+------+-----+---------+----------------+

> describe report_sources;
+---------------+------------------+------+-----+---------+----------------+
| Field         | Type             | Null | Key | Default | Extra          |
+---------------+------------------+------+-----+---------+----------------+
| source_id     | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| submission_id | varbinary(36)    | NO   | MUL | NULL    |                |
| description   | blob             | NO   |     | NULL    |                |
| url           | blob             | YES  |     | NULL    |                |
| percent       | float unsigned   | NO   |     | NULL    |                |
+---------------+------------------+------+-----+---------+----------------+
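
To show how the two tables relate, here is a minimal Python sketch that mirrors the columns above as dataclasses and links sources to a diff through `submission_id`, so that no regex parsing of a blob is needed. The class names, the `sources_for` helper, the meaning of the `status` value, and all sample data are hypothetical. It also assumes the `binary(14)` timestamps hold 14-character `YYYYMMDDHHMMSS` strings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Diff:
    """One row of the diffs table (field names follow the columns above)."""
    diff_id: int
    project: str
    lang: str
    page_namespace: int
    page_title: str
    rev_id: int
    rev_parent_id: int
    rev_timestamp: str              # assumed YYYYMMDDHHMMSS
    rev_user_text: str
    submission_id: Optional[str]    # nullable and unique in the schema
    status: int                     # status codes are not defined in this task
    status_timestamp: Optional[str] = None
    status_user_text: Optional[str] = None


@dataclass
class ReportSource:
    """One row of the report_sources table."""
    source_id: int
    submission_id: str
    description: str
    url: Optional[str]
    percent: float


def sources_for(diff: Diff, sources: List[ReportSource]) -> List[ReportSource]:
    """Return the sources matching a diff, highest match percentage first."""
    if diff.submission_id is None:
        return []  # no submission recorded yet, so there is nothing to join on
    matched = [s for s in sources if s.submission_id == diff.submission_id]
    return sorted(matched, key=lambda s: s.percent, reverse=True)


# Entirely made-up sample rows for illustration.
diff = Diff(1, "wikipedia", "en", 0, "Example", 100, 99,
            "20220501000000", "ExampleUser", "sub-1", 0)
sources = [
    ReportSource(1, "sub-1", "Example source A", "https://example.org/a", 12.5),
    ReportSource(2, "sub-1", "Example source B", None, 40.0),
    ReportSource(3, "sub-2", "Unrelated source", None, 90.0),
]
print([s.source_id for s in sources_for(diff, sources)])
```

In the database itself the same lookup would be a join between `diffs` and `report_sources` on `submission_id`.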