Expand EranBot database to store WikiProjects and write cron job to update it regularly, also use this new data
Closed, Resolved | Public | 3 Estimated Story Points

Description

Add a new column to the EranBot database so that we can store WikiProjects in it. Write a cron job to update it regularly (just for the new articles since the last run). Then use this new data for the WikiProject filtering.

Event Timeline

DannyH triaged this task as Medium priority. Jun 30 2016, 5:17 PM
DannyH set the point value for this task to 3.
DannyH moved this task from Needs Discussion to Up Next on the Community-Tech board.

Per my email, this would be far easier for us to handle in CopyPatrol than to modify Eranbot's code. Eranbot doesn't even use the table but scrapes the talk page templates, which is far from ideal or efficient. We should check if the page has associated WikiProjects and if not, add them to the DB.
We should also have some sort of cron-job script which periodically updates the wikiprojects associated with pages (which have not yet been reviewed, most likely) since they might change frequently.
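
A rough sketch of what such a periodic update could look like, using an in-memory SQLite database as a stand-in for the real one. Only the table names `copyright_diffs` and `wikiprojects` come from this task; the column names (`page_title`, `status`, `wp_page_title`, `wp_project`), the use of `status IS NULL` to mean "not yet reviewed", and the `fetch_projects` callable are assumptions made for illustration.

```python
import sqlite3


def refresh_wikiprojects(conn, fetch_projects):
    """Re-sync WikiProjects for every record that is still unreviewed.

    fetch_projects is a hypothetical callable: page title -> list of
    WikiProject names. How it gets them is out of scope for this sketch.
    """
    cur = conn.cursor()
    # Assumption: a NULL status marks a record nobody has reviewed yet.
    cur.execute("SELECT DISTINCT page_title FROM copyright_diffs WHERE status IS NULL")
    titles = [row[0] for row in cur.fetchall()]
    for title in titles:
        projects = fetch_projects(title)
        # Replace rather than append, so WikiProjects that were removed
        # from the page disappear from the table as well.
        cur.execute("DELETE FROM wikiprojects WHERE wp_page_title = ?", (title,))
        cur.executemany(
            "INSERT INTO wikiprojects (wp_page_title, wp_project) VALUES (?, ?)",
            [(title, project) for project in projects],
        )
    conn.commit()
    return len(titles)


def demo_refresh():
    """Build a tiny stand-in database and run one refresh pass over it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE copyright_diffs (id INTEGER PRIMARY KEY, page_title TEXT, status TEXT);
        CREATE TABLE wikiprojects (wp_page_title TEXT, wp_project TEXT);
        INSERT INTO copyright_diffs (page_title, status) VALUES
            ('Aspirin', NULL), ('Old_page', 'fixed');
        INSERT INTO wikiprojects VALUES ('Aspirin', 'Stale project');
    """)
    lookup = {"Aspirin": ["Medicine", "Chemistry"]}
    refreshed = refresh_wikiprojects(conn, lambda title: lookup.get(title, []))
    rows = conn.execute(
        "SELECT wp_page_title, wp_project FROM wikiprojects ORDER BY wp_project"
    ).fetchall()
    return refreshed, rows
```

Only unreviewed records are touched, so reviewed ones keep whatever WikiProjects they had when they were handled.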

I thought about doing it in CopyPatrol too since we can easily update the copyright_diffs table, but the issue there, if I'm understanding correctly, is you'd need to load the CopyPatrol interface to store the WikiProjects. It'd also only do it one page at a time. So if no one used CopyPatrol over the 4th of July, then on the 5th when I go to look for WikiProject Medicine, none of the medicine-related records from the 4th will have been tagged, and they will not show up. We could do some sort of cron job, but it seems storing the data where the records are initially created would be ideal, and certainly the most performant.

WikiProjects do change, but I think this would only be a concern for us with new articles that haven't been assigned a WikiProject, and new articles suffering from copyright violations are often deleted entirely.

We definitely need to store the WikiProjects in the copyright_diffs table. I guess the question is just how and when to populate it. Having Eranbot populate it initially seems like a good idea since Eranbot is already retrieving this data. Thus there would be no extra cost. Niharika's point about the WikiProjects changing is a valid problem, but perhaps we should wait until we can use PageAssessments to handle that.

kaldari renamed this task from "Expand EranBot to store WikiProjects and usernames in database and return info by API" to "Expand EranBot database to store WikiProjects and write cron job to update it regularly, also use this new data". Jul 19 2016, 5:25 PM
kaldari edited projects, added Community-Tech-Sprint; removed Community-Tech.
kaldari updated the task description.
kaldari removed the point value for this task.
DannyH set the point value for this task to 3. Jul 19 2016, 5:28 PM

My quick Ruby script: https://github.com/MusikAnimal/MusikBot/blob/master/tasks/copypatrol_wikiprojects.rb

Tried it out and it works. Connect to the s51306__copyright_p database and try SELECT * FROM wikiprojects. Everything seems to match up with the WPX database so far, so we can start using it if we want; I'm just not sure how often the WPX script goes through recent changes. Earwig said a new table is being deployed (today, I believe) that might have more complete info, and they're working on updating the script to go through recent changes more often. We can experiment with that, but either way we'll need this cron job to put the data where we can access it.
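
For the filtering side, a minimal sketch of how the new table might be joined against the records, again with an in-memory SQLite database standing in for s51306__copyright_p. The table names `wikiprojects` and `copyright_diffs` are from this task; the column names (`page_title`, `wp_page_title`, `wp_project`) are hypothetical and the real schema may differ.

```python
import sqlite3


def diffs_for_project(conn, project):
    """Return (id, page_title) for records whose page is tagged with `project`."""
    sql = """
        SELECT d.id, d.page_title
        FROM copyright_diffs AS d
        JOIN wikiprojects AS w ON w.wp_page_title = d.page_title
        WHERE w.wp_project = ?
        ORDER BY d.id
    """
    # Parameter binding keeps the WikiProject name out of the SQL string.
    return conn.execute(sql, (project,)).fetchall()


def demo_filter():
    """Build a tiny stand-in database and filter it by one WikiProject."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE copyright_diffs (id INTEGER PRIMARY KEY, page_title TEXT);
        CREATE TABLE wikiprojects (wp_page_title TEXT, wp_project TEXT);
        INSERT INTO copyright_diffs (page_title) VALUES
            ('Aspirin'), ('Eiffel_Tower'), ('Ibuprofen');
        INSERT INTO wikiprojects VALUES
            ('Aspirin', 'Medicine'),
            ('Aspirin', 'Chemistry'),
            ('Ibuprofen', 'Medicine'),
            ('Eiffel_Tower', 'France');
    """)
    return diffs_for_project(conn, "Medicine")
```

Records whose pages have no row in `wikiprojects` never match the join, which is why the table has to be populated by a job rather than on page load.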

I'm a bit nervous about this depending on a bot, but considering that it should only be temporary, I think it's OK.