
Integrate Pageviews Analysis with Page Pile
Closed, ResolvedPublic

Description

Asaf is proposing that we "take Pageviews Analysis to the next step". Program leaders need to learn the collective outcomes of their work for reporting both to the WMF and to external partners (e.g. GLAM institutions), and one of the key outcomes is pageviews of articles created or improved during a program. At the moment we rely on a tool @Magnus created called Treeviews, which is currently not working, according to @Ijon and @Esh77.

So, is it possible to add a feature to Pageviews Analysis that collects pageview data for a Page Pile ID, to take pageviews to that next step?

Please let @Abit know if you have questions.

Thanks!
Edward

https://tools.wmflabs.org/pageviews/#project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Cat|Dog

https://tools.wmflabs.org/pagepile/

http://tools.wmflabs.org/glamtools/treeviews/

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · May 10 2016, 7:04 PM

@Samwalton9 might be interested as well.

Hooking up to the Page Pile API is pretty straightforward. The only issue here is that Pageviews is hardcoded to accept 10 pages max. Beyond that the normal chart interface is not as helpful; e.g. 100 pages isn't going to look good even as a pie chart.

Maybe instead we could have a table view like we do with Langviews, where we simply have a row for each page with the total number of views, average views, and maybe some other data.
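A table view like that boils down to aggregating per-page daily counts into totals and averages. A minimal sketch, assuming a hypothetical data shape (a dict of page title to daily counts) rather than the actual Pageviews Analysis internals:

```python
# Sketch: turn per-page daily pageview counts into table rows with
# total and average views. The data shape is hypothetical, not the
# actual Pageviews Analysis internals.
def build_table_rows(daily_counts):
    """daily_counts: dict mapping page title -> list of daily view counts."""
    rows = []
    for page, counts in daily_counts.items():
        total = sum(counts)
        average = total / len(counts) if counts else 0
        rows.append({"page": page, "total": total, "average": average})
    # Sort descending by total views, like a leaderboard.
    return sorted(rows, key=lambda r: r["total"], reverse=True)
```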

How does that sound?

I'm thinking we'll adapt the code from Langviews and make a new app in this suite of tools called "Massviews" (it has to be consistent with the play on words: Langviews, Topviews, etc.) :)

I see that some of the recent piles are over 1,000 pages in size. Collections this big will not only take quite some time to process, but are also likely to suffer from backend errors that will leave you with incomplete data. E.g. you may ask for data on 1,000 pages but only get back data on, say, 800 of them. T124314 will hopefully mitigate or resolve this issue, but either way I think we'll have to put a cap on the number of pages: maybe 500 for right now, and bump it to 1,000 once the backend response times are improved.
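The two safeguards described here (a hard cap on pile size, plus detecting which requested pages came back without data) could look something like the following sketch; the function names and the 500-page constant are illustrative, not the actual Massviews code:

```python
# Sketch: cap a Page Pile at a fixed size, and detect incomplete
# results from the pageviews backend. Names are illustrative.
PAGE_CAP = 500

def cap_pile(pages, cap=PAGE_CAP):
    """Return (pages to process, number of pages dropped by the cap)."""
    return pages[:cap], max(0, len(pages) - cap)

def find_missing(requested, results):
    """Pages we asked about but got no data back for (backend errors)."""
    return [p for p in requested if p not in results]
```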

In the short term, we have been working with the developer of the Google Spreadsheets Wikipedia tool. It is namespace-agnostic, so it's super easy to grab non-article namespaces -- not quite a PagePile, but fairly dynamic as a short-term fix (we have a demo at https://docs.google.com/spreadsheets/d/1hUbMHmjoewO36kkE_LlTsj2JQL9018vEHTeAP7sR5ik/edit with Wikipedia Library project pages).

As far as I can tell, treeviews works just fine, with and without PagePile.

Minor snag: if you enter a PagePile ID, you have to click "Add category" to get the Run button. Will fix that soon. Example with PagePile:
https://tools.wmflabs.org/glamtools/treeviews/?q={%22pagepile%22%3A%223042%22%2C%22rows%22%3A[]}

Got a very beta version running at https://tools.wmflabs.org/massviews

For simple Page Piles it works great, e.g. #3028
http://tools.wmflabs.org/massviews/?platform=all-access&agent=user&source=pagepile&target=3028&range=latest-20&sort=views&direction=1

Larger collections are truncated at 500 pages, and even then you might see a few requests fail with "cassandra backend errors" (which is what T124314 is all about). This is why we make you wait 90 seconds before using the tool again; otherwise it may fail completely. You might find that even 90 seconds is still not enough.
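The 90-second cooldown is just a timestamp check before starting a new run; a minimal sketch (the function and constant names are assumptions, not the real Massviews implementation):

```python
# Sketch of the 90-second cooldown described above: refuse to start a
# new large run until enough time has passed since the previous one.
COOLDOWN_SECONDS = 90

def can_run(last_run_at, now):
    """last_run_at: epoch seconds of the previous run (None if first run)."""
    if last_run_at is None:
        return True
    return (now - last_run_at) >= COOLDOWN_SECONDS
```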

So essentially if you use this tool every few minutes or so, you should have minimal problems. Just don't expect to be able to run enormous collections of pages one right after the other :)

Let me know what you think!

Very nicely done! Thank you @MusikAnimal!

I have heard about the issue with most tools maxing out at 500 requests, so that is a bummer, but still, 500 pages is a great place to start. Thank you so much for creating this! :)

@Magnus I will let Asaf and @Esh77 chime in, but he was saying that some pages were returning zero views.

Ijon added a subscriber: Nemo_bis.May 12 2016, 9:24 PM

@Nemo_bis has identified the bug -- @Esh77 used a PagePile that had the wiki name as 'He' rather than 'he'. That's enough to cause the PageViews API to fail. I actually encountered it myself as I was building my own tool, and quickly added lowercasing of the input, without thinking that might be the reason Magnus's tool was not getting results too.

This should be fixed in the PageViews API itself, but in the meantime TreeViews would do well to lowercase the wiki-name input, and PagePile itself should too.
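The defensive lowercasing being suggested is a one-liner applied to the wiki name before building the API request; a sketch, with a hypothetical helper name:

```python
# Sketch: normalize the wiki name before calling the Pageviews API,
# since 'He' vs 'he' is enough to make the API return no results,
# as described above. The helper name is illustrative.
def normalize_wiki(wiki):
    """Lowercase and trim the wiki/domain name from user or pile input."""
    return wiki.strip().lower()
```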

Ijon added a comment.May 12 2016, 9:27 PM

(I had filed T134926 against the API itself.)

@MusikAnimal I have been getting "langview" errors for PagePile 3053.

@Sadads You should not see the word "Langviews" anywhere... I'm assuming you are using the tool in a non-English language, or your computer is set to something that's not English? If you just see "langviews-error" or something like that, then the i18n library isn't falling back to English like it's supposed to :(

Anyway, it looks like the pages in that Page Pile all have Project: before the actual page name. This is why it isn't working.

E.g. Project:Wikipedia:Culture/New_York_Public_Library should be Wikipedia:Culture/New_York_Public_Library
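Stripping that extraneous leading prefix before querying would sidestep the problem; a sketch (illustrative helper, not actual Massviews or PagePile code):

```python
# Sketch: drop an extraneous leading "Project:" that some piles carry
# in front of the real title. Illustrative, not the actual tool code.
def strip_project_prefix(title, prefix="Project:"):
    """Remove a single leading prefix from a page title, if present."""
    return title[len(prefix):] if title.startswith(prefix) else title
```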

@Magnus is there a way to limit the number of pages returned by the API? For instance Pile 3030 has 690,381 pages, with a download size of nearly 2MB! With Massviews anyway, we are only going to show the first 500.

My hope is we can put the total number of pages in the metadata, then allow for a limit parameter to get back only N pages.

So:

https://tools.wmflabs.org/pagepile/api.php?id=3030&action=get_data&format=json&metadata=1&limit=500

which will give us:

{
  "pages": {
    ...500 pages...
  },
  "wiki": "wikidatawiki",
  "id": 3030,
  "length": 690381
}

Most importantly, this will speed things up and reduce download size, since we'd only fetch the first 500; it would also let me show the warning "There are 690,381 pages, only the first 500 will be processed".
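A client-side sketch of how the proposed parameters would be used; the parameter names follow the example URL above, and whether PagePile accepts them exactly like this depends on the eventual implementation:

```python
# Sketch of a client for the proposed limit/metadata parameters.
# Parameter names follow the example URL above; the actual PagePile
# API may differ.
BASE = "https://tools.wmflabs.org/pagepile/api.php"

def pile_url(pile_id, limit=500):
    """Build a PagePile request for the first `limit` pages plus metadata."""
    return (f"{BASE}?id={pile_id}&action=get_data&format=json"
            f"&metadata=1&limit={limit}")

def truncation_notice(metadata, limit=500):
    """Warning text when the pile's total length exceeds the limit.

    metadata: parsed JSON response containing a 'length' field with
    the pile's total page count. Returns None if no warning is needed.
    """
    total = metadata.get("length", 0)
    if total > limit:
        return f"There are {total:,} pages, only the first {limit} will be processed"
    return None
```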

Sorry to duplicate; I wasn't sure of the best place, so I also created an issue on Bitbucket.


@MusikAnimal I am generating it from PetScan, so perhaps that would be a @Magnus bug? It must be misreading the Project: prefix -- one of the perpetual problems with tabulating data around non-mainspace pages has been transporting namespace data between Magnus's tools.

Nemo_bis removed a subscriber: Nemo_bis.May 14 2016, 9:33 AM
kaldari renamed this task from Integrate Pageviews Analysis with Page Pile to Tracking: Integrate Pageviews Analysis with Page Pile.May 17 2016, 5:19 PM
DannyH renamed this task from Tracking: Integrate Pageviews Analysis with Page Pile to Integrate Pageviews Analysis with Page Pile (tracking).May 17 2016, 5:25 PM
Phabricator_maintenance renamed this task from Integrate Pageviews Analysis with Page Pile (tracking) to Integrate Pageviews Analysis with Page Pile.Aug 14 2016, 12:08 AM

@MusikAnimal: I think we have sufficient integration with PagePiles to mark this as resolved. The only remaining issue is a minor edge-case bug that is upstream in the PagePiles API, but I doubt that's ever going to be fixed. What do you think?

MusikAnimal closed this task as Resolved.Sep 28 2016, 3:14 AM


Yes I believe we can resolve this. And for those concerned about the extraneous Project: namespace (as with Project:Wikipedia:Culture/New_York_Public_Library), you can alternatively use the "Wikilinks" source on Massviews and put the URL to any wiki page (your sandbox, for instance) containing links to the pages you want pageviews data on.

MusikAnimal moved this task from Backlog to Done on the Tool-Pageviews board.Sep 28 2016, 3:16 AM

Addendum: the "limit" parameter suggested by @MusikAnimal has now been implemented.