Add interface to filter by WikiProject to CopyPatrol
Closed, Resolved · Public · 8 Estimated Story Points

Description

In the filters box at the top of the interface -- see T132352

There's an input field that allows the user to filter by WikiProject.

  • The default state (empty WikiProject filters) is to show all open cases.
  • The user can start typing in the input field; this opens an autocomplete list of existing WikiProjects.
  • Choosing a WikiProject adds a bubble with the project name to the filters.
  • Multiple WikiProjects can be in the filter at the same time; this loads cases that are either in one WikiProject or the other, or both.
  • Clicking on an X for a WikiProject bubble deletes it from the list.
  • Clicking Submit refreshes the list showing only the items that belong to the chosen WikiProjects.
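
The "either one WikiProject or the other, or both" rule above is a set union over the selected projects. A minimal sketch of that behaviour, assuming each case carries a list of WikiProject names (the `filter_cases` helper and the record shape are hypothetical, not taken from the tool's code):

```python
def filter_cases(cases, selected_projects):
    """Return the cases that belong to at least one selected WikiProject."""
    selected = set(selected_projects)
    if not selected:
        # Empty filter: the default state shows all open cases.
        return list(cases)
    return [case for case in cases if selected & set(case["wikiprojects"])]


# Hypothetical sample records, for illustration only.
cases = [
    {"title": "A", "wikiprojects": ["Film"]},
    {"title": "B", "wikiprojects": ["Medicine", "Film"]},
    {"title": "C", "wikiprojects": ["Biography"]},
]
both = filter_cases(cases, ["Film", "Medicine"])
```

Here `both` contains cases A and B: A matches only Film, and B matches both projects.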

Additionally: The WikiProject bubbles in the item listings are clickable. When the user clicks on a WP bubble, it opens a new tab, where the filter is set to All open cases, with the selected WikiProject active.

ddd.jpg (348×1 px, 49 KB)

copy patrol filters 2 - wikiprojects selected.jpg (194×1 px, 32 KB)

Note: We talked about whether the autocomplete list should include all WikiProjects, or only the WikiProjects appearing on currently open cases. We decided that it should include all of them, because it would feel broken if you tried typing a WikiProject name and it didn't show up in the autocomplete. On the other hand, if it does appear and then gives zero results, you have a complete understanding of what happened.

Event Timeline

DannyH set the point value for this task to 8.
DannyH moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.
bd808 renamed this task from "Add interface to sort by WikiProject to tool labs interface for Plagiabot" to "Add interface to filter by WikiProject to tool labs interface for Plagiabot". May 19 2016, 5:29 PM
Niharika renamed this task from "Add interface to filter by WikiProject to tool labs interface for Plagiabot" to "Add interface to filter by WikiProject to CopyPatrol". May 19 2016, 5:44 PM

Note you can probably steal some code from Pageviews Analysis, this looks just like the Select2 library we're using there.

Yep, we took inspiration from that while designing this. :)

Got the live search working fine, but if I'm understanding this correctly, the actual SQL querying of copyright diffs and WikiProjects involves two different databases – on two different servers. I don't think there's a graceful and performant way to do these queries without the databases at least being on the same server. Ideally, EranBot would query the WikiProject database and store it along with the other copyright diff information. That would mean we'd only need to query a single table to do the WikiProject filtering.

Any alternatives you can think of? Could we assist Eran in updating his bot? I don't think it would be difficult to implement, if we knew Python =P

@MusikAnimal: What database is the WikiProject information coming from?

Replied in the email thread, will repeat here: We're getting the WikiProjects from the s52475__wpx_p database on labsdb1004.eqiad.wmnet, which is populated by some magical bot that's part of WikiProject X. I assume EranBot is getting the data from the same place.

The query Eran recommended:

SELECT cv.page_title, page_ns, tl_title
FROM s51306__copyright_p.copyright_diffs cv
INNER JOIN enwiki_p.page p
  ON p.page_namespace = (cv.page_ns + 1) AND p.page_title = cv.page_title
INNER JOIN enwiki_p.templatelinks tl
  ON p.page_id = tl.tl_from
WHERE tl_title LIKE 'Wikiproject%'
LIMIT 5;

This doesn't check the WikiProjects database; instead it looks for "template links" on the talk pages of the articles. This does a fine job, but it is not foolproof because of redirects: {{WikiProject NYC}} redirects to {{WikiProject New York City}}, so querying for the latter will skip any talk pages that use NYC rather than New York City. We could get clever and first use the API to get all redirects of a WikiProject template, then query for all of them. The speed is not great either: the query seems to take around 10 seconds on average. I think this might be acceptable, though.
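
The redirect workaround could look roughly like the sketch below. It assumes the redirect titles have already been fetched (for example through the MediaWiki API's `prop=redirects` query) and only shows how the list becomes one parameterized condition on `tl_title`. The helper names and the `%s` placeholder style (as used by common MySQL drivers) are assumptions, not code from the tool.

```python
def normalize_title(title):
    """Convert a template name to database form: no namespace, underscores."""
    if title.startswith("Template:"):
        title = title[len("Template:"):]
    return title.strip().replace(" ", "_")


def build_template_filter(canonical, redirects):
    """Build a SQL fragment matching a template or any redirect to it."""
    titles = []
    for title in [canonical, *redirects]:
        normalized = normalize_title(title)
        if normalized not in titles:
            titles.append(normalized)
    placeholders = ", ".join(["%s"] * len(titles))
    return f"tl_title IN ({placeholders})", titles


# The redirect list is hard-coded here; in practice it would come from the API.
fragment, params = build_template_filter(
    "Template:WikiProject New York City", ["Template:WikiProject NYC"]
)
```

The fragment would replace the `LIKE` condition in the query above, with `params` passed separately to the database driver so the titles are never interpolated into the SQL string.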

I will try to get a prototype going so we can experiment further.

Eran has come up with a most wondrous solution of joining against the categorylinks table, which both eliminates the issue with redirects, and is surprisingly very fast. I'm upset I didn't think of this myself, both for our purpose and for one of my bot tasks =P I will resume work on this now and hopefully have something to show soon!

Pull request at https://github.com/Niharika29/PlagiabotWeb/pull/21 (note the comment at the top) and deployed at https://tools.wmflabs.org/plagiabot

A few notes from a product perspective:

  • The way EranBot detects WikiProjects is different than the way we do. We use templatelinks which is more accurate. Consequently if you filter for a specific WikiProject, you may see entries that don't have a bubble for that WikiProject. I don't think there's a way around this unless we update EranBot. We can't auto-add bubbles for the selected WikiProjects because if I filter for say, "Film" and "Medicine", an entry might be only Film and not Medicine, if that makes sense.
  • Clicking on the WikiProject bubbles appends that WikiProject to the current list of WikiProject filters. We could alternatively make it reset it and show only the WikiProject they clicked on. I am unsure which is more favourable.
  • The filters now live on the left of the interface like the wireframes above. I tried centering them and it didn't look as purty in my opinion.
  • Despite my initial testing, this can run a little slow. We have to use regex to query for multiple WikiProjects which I think is what is sometimes taking a toll on performance.
  • This will require a fair amount of testing. There's a lot of components involved here and the little man inside of me is saying there might be some Select2-related bugs I haven't discovered yet.
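
For reference, the regex mentioned above could be assembled along these lines. This is only a sketch: it assumes category names follow the "WikiProject X articles" pattern, and it uses Python's `re.escape`, whose escaping rules are not guaranteed to match MySQL's `REGEXP` syntax in every case. `build_category_regex` is a hypothetical helper.

```python
import re


def build_category_regex(projects):
    """Build one anchored pattern matching the category of any selected project."""
    names = [re.escape(p.strip().replace(" ", "_")) for p in projects]
    return "^WikiProject_(" + "|".join(names) + ")_articles$"


pattern = build_category_regex(["Film", "Medicine"])
```

Anchoring the pattern at both ends keeps "Film" from also matching a longer name such as "Filmmaking".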

A few notes from a product perspective:

  • The way EranBot detects WikiProjects is different than the way we do. We use templatelinks which is more accurate. Consequently if you filter for a specific WikiProject, you may see entries that don't have a bubble for that WikiProject. I don't think there's a way around this unless we update EranBot. We can't auto-add bubbles for the selected WikiProjects because if I filter for say, "Film" and "Medicine", an entry might be only Film and not Medicine, if that makes sense.

I think this is backwards: we're using categorylinks, but EranBot is using templatelinks. IMO, we shouldn't worry about this too much. Once PageAssessments is live, we should switch both of them to use it instead.

  • Clicking on the WikiProject bubbles appends that WikiProject to the current list of WikiProject filters. We could alternatively make it reset it and show only the WikiProject they clicked on. I am unsure which is more favourable.

I think resetting it would be more intuitive. @DannyH?

Yeah, I think reset is better.

Also: There is a little man inside of Leon.

No problem.

I think this is backwards: we're using categorylinks, but EranBot is using templatelinks. IMO, we shouldn't worry about this too much. Once PageAssessments is live, we should switch both of them to use it instead.

You are correct, but turns out we have a problem: The category names are not consistent. Example: Political positions of Hillary Clinton is part of WikiProject Hillary Clinton, but the category is WikiProject Hillary Rodham Clinton articles. We're surely going to run into the same issue with other WikiProjects... Should we wait for Page Assessments, or move forward with this partial solution? Right now it's possible clicking on a WikiProject bubble will yield no results :/

@MusikAnimal: If it works for WikiProject Medicine, let's go ahead and use it. They are the main customer for this feature, AFAIK.

That should be no problem.

@DannyH with that I think we're okay to move forward with testing. Just note that some WikiProjects won't work... so if you click on the bubble and you don't get the right results, check the on-wiki category on the talk page of the given article and see if it matches up. If it does match up, then you've found a bug :)

  • The way EranBot detects WikiProjects is different than the way we do. We use templatelinks which is more accurate. Consequently if you filter for a specific WikiProject, you may see entries that don't have a bubble for that WikiProject. I don't think there's a way around this unless we update EranBot. We can't auto-add bubbles for the selected WikiProjects because if I filter for say, "Film" and "Medicine", an entry might be only Film and not Medicine, if that makes sense.
  • Despite my initial testing, this can run a little slow. We have to use regex to query for multiple WikiProjects which I think is what is sometimes taking a toll on performance.

Sorry if it seems like I'm beating the point to death but I still think we're missing the really obvious and simplest solution. When we fetch the wikiprojects for each record on page load, we can simply insert them into the DB at our end (we can optimize this in several ways - not query for wikiprojects if the DB already has them, for example) and simply search through them when we want to display records for a specific search query. Or alternatively use a cron job script which runs every few minutes or such.

The only "risk" that's been pointed out with the cron job approach is that it might be that the top record doesn't show any Wikiprojects on page load. Compared to what we have right now, with a lot of category and wikiproject names mismatch, I don't think that risk is significant.
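
A rough sketch of the "store them at our end" idea, using an in-memory SQLite database so it runs standalone. The `wikiprojects` table, its columns, the `id` column on `copyright_diffs`, and the `lookup` callable are all assumptions made for illustration; the real tool uses a different database and schema.

```python
import sqlite3


def backfill_wikiprojects(conn, lookup):
    """Save WikiProjects for every record that has none stored yet."""
    rows = conn.execute(
        "SELECT d.id, d.page_title FROM copyright_diffs d "
        "LEFT JOIN wikiprojects w ON w.diff_id = d.id "
        "WHERE w.diff_id IS NULL"
    ).fetchall()
    for diff_id, page_title in rows:
        # Records whose lookup returns nothing stay unsaved and are retried next run.
        for project in lookup(page_title):
            conn.execute(
                "INSERT INTO wikiprojects (diff_id, project) VALUES (?, ?)",
                (diff_id, project),
            )
    conn.commit()
    return len(rows)


def find_by_project(conn, project):
    """Filter records by searching the locally stored WikiProjects."""
    rows = conn.execute(
        "SELECT d.page_title FROM copyright_diffs d "
        "JOIN wikiprojects w ON w.diff_id = d.id "
        "WHERE w.project = ? ORDER BY d.id",
        (project,),
    )
    return [title for (title,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE copyright_diffs (id INTEGER PRIMARY KEY, page_title TEXT)")
conn.execute("CREATE TABLE wikiprojects (diff_id INTEGER, project TEXT)")
conn.executemany(
    "INSERT INTO copyright_diffs (id, page_title) VALUES (?, ?)",
    [(1, "Aspirin"), (2, "Casablanca_(film)")],
)
# Stand-in for the real WikiProject lookup.
known = {"Aspirin": ["Medicine"], "Casablanca_(film)": ["Film"]}
processed = backfill_wikiprojects(conn, lambda title: known.get(title, []))
```

Run from a cron job, `backfill_wikiprojects` only touches records without stored projects, so repeated runs are cheap, and `find_by_project` then needs just the one local database.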

No I get you! A cron job makes sense, before I just thought having EranBot do it was preferable as we're essentially duplicating the work it's doing. I'm not worried about the top few records not having WikiProjects saved, especially if the cron runs every few minutes, it will catch up quickly. Storing on page load without a cron job I don't think will work because it requires a user to load the page for anything to get saved, right? We have to account for the time during which no one is using the tool.

I also confused myself, the WikiProject bubbles we see come from the WikiProjects database, so apparently it is incomplete... e.g. {{WikiProject Tambayan Philippines}} was added to Talk:Jones, Isabela years ago but the database still doesn't have any WikiProjects saved (not that we'd have many people filtering for that WikiProject :). This is ok for now I think, if we store what the WikiProjects database has at least the filtering will match up with what's in the interface. Or we could do what MusikBot does and parse the talk page markup, a more fool-proof solution, and save that which we will use for both the interface and for the filtering. I could turn this into a cron job pretty easily, but you'll have to be OK with Ruby :) With PHP we'd need something similar to the Nokogiri gem.
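
To give an idea of what parsing the talk page markup involves, here is a small regex-based sketch. It is not MusikBot's actual code: the pattern, the wrapper exclusion list, and the helper name are assumptions, and it does not resolve template redirects or shortcut names.

```python
import re

# Matches "{{WikiProject Something" up to the first "|" or "}}".
TEMPLATE_RE = re.compile(
    r"\{\{\s*(WikiProject[ _][^|{}]+?)\s*(?:\||\}\})", re.IGNORECASE
)

# Wrapper templates that are not WikiProjects themselves (assumed list).
EXCLUDED = {"wikiproject banner shell"}


def extract_wikiprojects(wikitext):
    """Return the WikiProject banner names found in talk page wikitext."""
    found = []
    for match in TEMPLATE_RE.finditer(wikitext):
        name = " ".join(match.group(1).replace("_", " ").split())
        if name.lower() not in EXCLUDED and name not in found:
            found.append(name)
    return found


sample = (
    "{{WikiProject banner shell|1=\n"
    "{{WikiProject Tambayan Philippines|class=Start}}\n"
    "{{WikiProject_Cities}}\n"
    "}}"
)
projects = extract_wikiprojects(sample)
```

A real implementation would also need to handle banners that go by a redirect or a non-standard name, which this pattern would miss.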

The bigger question I think is if we should hold off for PageAssessments which makes all of this work redundant? The more I think about it, I'm OK with either waiting or going with a cron job, because it seems like a bad user experience to click one of those WikiProject bubbles and get no results.

No I get you! A cron job makes sense, before I just thought having EranBot do it was preferable as we're essentially duplicating the work it's doing. I'm not worried about the top few records not having WikiProjects saved, especially if the cron runs every few minutes, it will catch up quickly. Storing on page load without a cron job I don't think will work because it requires a user to load the page for anything to get saved, right? We have to account for the time during which no one is using the tool.

I didn't understand. If nobody uses the tool for five hours and then someone comes along and loads it then the records which were added during those five hours will get loaded and saved. Am I missing something?

I also confused myself, the WikiProject bubbles we see come from the WikiProjects database, so apparently it is incomplete... e.g. {{WikiProject Tambayan Philippines}} was added to Talk:Jones, Isabela years ago but the database still doesn't have any WikiProjects saved (not that we'd have many people filtering for that WikiProject :).

Good question. The wikiprojects are all there in the database:

MariaDB [s52475__wpx_p]> SELECT * FROM projectindex WHERE pi_page LIKE 'Talk:Jones,_Isabela';
+---------+---------------------+--------------------------------+
| pi_id   | pi_page             | pi_project                     |
+---------+---------------------+--------------------------------+
| 6659364 | Talk:Jones,_Isabela | Wikipedia:Tambayan_Philippines |
+---------+---------------------+--------------------------------+
1 row in set (0.07 sec)

But we filter out all WikiProjects that are inconsistently named in the database, to avoid program-specific WikiProjects. Normal WikiProjects are listed in the database as "Wikipedia:WikiProject_Something":

+---------+-------------------------------+--------------------------------+
| pi_id   | pi_page                       | pi_project                     |
+---------+-------------------------------+--------------------------------+
| 1841888 | Talk:Makam_Sultan_Abdul_Samad | Wikipedia:WikiProject_Malaysia |
+---------+-------------------------------+--------------------------------+

But program-related WikiProjects are named Wikipedia:Something:

MariaDB [s52475__wpx_p]> SELECT DISTINCT pi_project FROM projectindex WHERE pi_project NOT LIKE 'Wikipedia:Wikiproject_%';
+----------------------------------------------------------------+
| pi_project                                                     |
+----------------------------------------------------------------+
| Wikipedia:GLAM/The_Children's_Museum_of_Indianapolis           |
| Wikipedia:GLAM/Museum_of_Modern_Art                            |
| Wikipedia:GLAM/Balboa_Park                                     |
| Wikipedia:GLAM/Delaware_Art_Museum                             |
| Wikipedia:GLAM/British_Library                                 |
| Wikipedia:GLAM/GibraltarpediA                                  |
| Wikipedia:GLAM/Chemical_Heritage_Foundation                    |
| Wikipedia:GLAM/Herbert_Art_Gallery_and_Museum                  |
| Wikipedia:GLAM/Indiana_Historical_Society                      |
| Wikipedia:GLAM/University_of_California_Riverside_Libraries    |
| Wikipedia:GLAM/National_Archives_and_Records_Administration    |
| Wikipedia:Featured_topics                                      |
| Wikipedia:GLAM/Smithsonian_Institution_Archives                |
| Wikipedia:GLAM/YMT                                             |
| Wikipedia:GLAM/Israel_Museum,_Jerusalem                        |
| Wikipedia:GLAM/JoburgpediA                                     |
| Wikipedia:GLAM/Johns_Hopkins_University                        |
| Wikipedia:Jewish_Labour_Bund_Task_Force                        |
| Wikipedia:Version_1.0_Editorial_Team/Core_topics               |
| Wikipedia:Tambayan_Philippines                                 |
| Wikipedia:GLAM/Newcomb_Archives_and_Vorhoff_Library            |
| Wikipedia:GLAM/Archives_of_American_Art                        |
| Wikipedia:GLAM/MonmouthpediA                                   |
| Wikipedia:GLAM/Teylers                                         |
| Wikipedia:Tambayan_Philippines/Task_force_Philippine_History   |
| Wikipedia:GLAM/Philadelphia_Museum_of_Art                      |
| Wikipedia:GLAM/Derby                                           |
| Wikipedia:Article_Incubator                                    |
| Wikipedia:GLAM/Smithsonian_Institution                         |
| Wikipedia:GLAM/George_Washington_University                    |
| Wikipedia:GLAM/British_Museum                                  |
| Wikipedia:Alphabet_Task_Force                                  |
| Wikipedia:GLAM/National_Railway_Museum                         |
| Wikipedia:GLAM/Pritzker                                        |
| Wikipedia:GLAM/History_of_the_Paralympic_movement_in_Australia |
+----------------------------------------------------------------+
35 rows in set (12.06 sec)

Tambayan Philippines seems to be an exception to the rule.
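
The naming rule can be sketched as two small checks. This assumes the prefix comparison is case-insensitive, which the query above suggests (Wikipedia:WikiProject_Malaysia does not show up in its results), and that sub-projects can be recognised by a slash in the page name. Both helper names are hypothetical.

```python
PREFIX = "wikipedia:wikiproject_"


def is_standard_wikiproject(pi_project):
    """True for names of the form Wikipedia:WikiProject_Something."""
    return pi_project.lower().startswith(PREFIX)


def is_subpage(pi_project):
    """True when the name after the namespace contains a slash."""
    return "/" in pi_project.split(":", 1)[-1]


names = [
    "Wikipedia:WikiProject_Malaysia",
    "Wikipedia:Tambayan_Philippines",
    "Wikipedia:GLAM/Derby",
    "Wikipedia:Featured_topics",
]
# Non-standard names that are not subpages: candidates for being tracked anyway.
exceptions = [
    n for n in names if not is_standard_wikiproject(n) and not is_subpage(n)
]
```

With the four sample names, `exceptions` holds Wikipedia:Tambayan_Philippines and Wikipedia:Featured_topics, the kind of entries the current filter drops.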

This is ok for now I think, if we store what the WikiProjects database has at least the filtering will match up with what's in the interface. Or we could do what MusikBot does and parse the talk page markup, a more fool-proof solution, and save that which we will use for both the interface and for the filtering. I could turn this into a cron job pretty easily, but you'll have to be OK with Ruby :) With PHP we'd need something similar to the Nokogiri gem.

The bigger question I think is if we should hold off for PageAssessments which makes all of this work redundant? The more I think about it, I'm OK with either waiting or going with a cron job, because it seems like a bad user experience to click one of those WikiProject bubbles and get no results.

I don't think PageAssessments will make this redundant. It will surely give us an easier query to do but nonetheless, having wikiprojects stored in the database will boost performance at little extra cost.

  • Searching through saved wikiprojects is going to be much faster than querying PageAssessments tables repetitively.
  • We won't have to fetch Wikiprojects for records over and over again every time a user loads a page.

There is the small question of updating our wikiprojects but that can be achieved with a daily cron-job.

I didn't understand. If nobody uses the tool for five hours and then someone comes along and loads it then the records which were added during those five hours will get loaded and saved. Am I missing something?

If no one uses the tool for five hours, nothing gets saved. So let's say someone comes along and loads the page, and the first 50 records only span the past 30 minutes. They will have to hit "Load more" repeatedly for data on the other 4 1/2 hours to get saved. Or perhaps I'm misunderstanding you? The filtering is a "search" so to speak, so we need each record to preemptively have WikiProjects saved. It's also possible that the person who first loads the tool after those 5 hours goes straight to a WikiProject filter; maybe they had it bookmarked.

I don't think PageAssessments will make this redundant. It will surely give us an easier query to do but nonetheless, having wikiprojects stored in the database will boost performance at little extra cost.

  • Searching through saved wikiprojects is going to be much faster than querying PageAssessments tables repetitively.
  • We won't have to fetch Wikiprojects for records over and over again every time a user loads a page.

There is the small question of updating our wikiprojects but that can be achieved with a daily cron-job.

I'm not actually sure what the schema looks like, but if it's as simple as looking up a given page and getting its WikiProjects back, it shouldn't be very slow. I think it's the regex that's slowing it down for categorylinks, and even that goes pretty fast in most of my tests.

Other non-recent articles missing WikiProjects: Talk:Conservative Party (UK) leadership election, 2016 (part of Wikipedia:WikiProject_Elections_and_Referendums since Oct 2015), Talk:Big Four (Western Europe) (part of WP:EU since Oct 2014), and then of course all of the recently created articles.

If we wanted to update WikiProjects via cron job, it'd have to go through every record in the copyright_diffs table, right (or at least the open ones, which we care most about)? That seems expensive. At the time of writing there are 14,091 open cases, or 2,081 since the beginning of June. We're looking at ~5,500 since June if we were to update all records since June. Anyway I don't think we need to worry about updating the WikiProjects, at least for now. If I understand correctly PageAssessments will solve this issue, and presumably would be more up-to-date with recent creations that are tagged with a WikiProject.

All in all without PageAssessments I'm OK with a cron job. MusikBot doesn't mind either :) I only recommend that because we're part of the way there, and it'd be more accurate about detecting WikiProjects. For the specific sub-WikiProjects, maybe we could just skip over ones that are a subpage of some other WikiProject. Also some like Wikipedia:Featured_topics I think would be beneficial to be tracked.

If no one uses the tool for five hours, nothing gets saved. So let's say someone comes along and loads the page, and the first 50 records only span the past 30 minutes. They will have to hit "Load more" repeatedly for data on the other 4 1/2 hours to get saved. Or perhaps I'm misunderstanding you?

That's highly unlikely given that Eranbot only churns out about 70-80 records per day. But I also think cron job is a better way to handle this. :)

The filtering is a "search" so to speak, so we need each record to preemptively have WikiProjects saved. It's also possible that the person who first loads the tool after those 5 hours goes straight to a WikiProject filter; maybe they had it bookmarked.

I agree with this.

I'm not actually sure what the schema looks like, but if it's as simple as looking up a given page and getting its WikiProjects back, it shouldn't be very slow. I think it's the regex that's slowing it down for categorylinks, and even that goes pretty fast in most of my tests.

The schema for the PageAssessments tables is almost exactly the same as the WikiprojectX table we use. It also contains columns for assessments.

Other non-recent articles missing WikiProjects: Talk:Conservative Party (UK) leadership election, 2016 (part of Wikipedia:WikiProject_Elections_and_Referendums since Oct 2015), Talk:Big Four (Western Europe) (part of WP:EU since Oct 2014), and then of course all of the recently created articles.

Hmm, this is concerning. We should switch to PageAssessments from WikiprojectX tables when it's deployed. It will be more reliable.

If we wanted to update WikiProjects via cron job, it'd have to go through every record in the copyright_diffs table, right (or at least the open ones, which we care most about)? That seems expensive. At the time of writing there are 14,091 open cases, or 2,081 since the beginning of June. We're looking at ~5,500 since June if we were to update all records since June. Anyway I don't think we need to worry about updating the WikiProjects, at least for now. If I understand correctly PageAssessments will solve this issue, and presumably would be more up-to-date with recent creations that are tagged with a WikiProject.

PageAssessments won't magically solve this issue. It'll give us a table similar to WikiprojectX and we'll have to update our database ourselves.

All in all without PageAssessments I'm OK with a cron job. MusikBot doesn't mind either :) I only recommend that because we're part of the way there, and it'd be more accurate about detecting WikiProjects. For the specific sub-WikiProjects, maybe we could just skip over ones that are a subpage of some other WikiProject. Also some like Wikipedia:Featured_topics I think would be beneficial to be tracked.

We are already skipping sub-projects. Did you happen to find any subproject in the records?

That's highly unlikely given that Eranbot only churns out about 70-80 records per day. But I also think cron job is a better way to handle this. :)

Agreed

PageAssessments won't magically solve this issue. It'll give us a table similar to WikiprojectX and we'll have to update our database ourselves.

Do we need to update our database though? At that point we could just JOIN and get up-to-date data, won't have to worry about the concern of a given article changing WikiProjects. I guess we'll find out how efficient it is when the time comes.

We are already skipping sub-projects. Did you happen to find any subproject in the records?

No I meant that as an alternative approach as some WikiProjects like Featured_topics are helpful but don't start with WikiProject_.

I'm gunna try to write a quick script that will update copyright_diffs.

@MusikAnimal: How hard would it be to accommodate for capitalization differences? See WikiProject Women's Health vs. Category:WikiProject Women's health articles.

Not hard, but I think we've ditched the category idea. They can have completely different wording, e.g. Category:WikiProject Hillary Rodham Clinton articles versus WikiProject Hillary Clinton.
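
A quick sketch of why case-insensitive matching only goes so far. It assumes the category name is derived as "Category:<project> articles"; the helper names are hypothetical.

```python
def category_for(project):
    """Derive the expected category name from a WikiProject name."""
    return f"Category:{project} articles"


def matches_category(project, category):
    """Compare the derived and actual category names, ignoring case."""
    return category_for(project).casefold() == category.casefold()


capitalization_only = matches_category(
    "WikiProject Women's Health",
    "Category:WikiProject Women's health articles",
)
different_wording = matches_category(
    "WikiProject Hillary Clinton",
    "Category:WikiProject Hillary Rodham Clinton articles",
)
```

Case folding makes the Women's Health pair match, but the Hillary Clinton pair still fails because the wording itself differs.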

About to share my Ruby code for the cron job :)

Cron job is running and plagiabot has been updated. Pull request ready to be reviewed: https://github.com/Niharika29/PlagiabotWeb/pull/21/files