
Implement 'Pages Created' downloadable csv
Closed, Resolved | Public | 8 Story Points

Description

The Pages Created downloadable reports give details on all articles created during an event. This report in csv (spreadsheet) format will provide event organizers with data that they can sort and recombine in order to create documents or other reports for partners, grantors, bosses, etc.

  • Metric definitions: See below, under "Definitions of Metrics"
  • Event details: In addition to the data/metrics, the CSV file will contain some descriptive information about the event and the report. See below under "Event details."
  • Report filename: when the user saves the report, the filename should follow this format: pages-created_event-name
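A minimal sketch of the filename rule above. The spec only gives the pattern pages-created_event-name; the whitespace-to-hyphen rule, the `.csv` extension, and the function name `report_filename` are assumptions made for illustration.

```python
import re


def report_filename(event_name: str, extension: str = "csv") -> str:
    """Build 'pages-created_<event-name>.<extension>' from an event title."""
    # Assumption: runs of whitespace collapse to a single hyphen. All other
    # characters, including non-Latin scripts, are kept unchanged.
    slug = re.sub(r"\s+", "-", event_name.strip())
    return f"pages-created_{slug}.{extension}"


print(report_filename("Art and Feminism 2019"))
```

Characters that are unsafe in filenames would need extra handling; this sketch does not attempt that.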

Report Content

Metrics/column names

  • The left-most column of the report will be a list of page titles.
  • The metrics in each row will be presented/calculated as they apply to the particular page listed in the left column. E.g., "Bytes changed" means bytes changed for that article (as opposed to the same figure on the Event Summary reports, where it totals the bytes changed for the whole event).
  • The default sort order will be by "Avg. daily pageviews", descending

The column headings will be, in order (pls use this approved wording):

  • Title
  • URL
  • Creator
  • Wiki
  • Edits during event [method defined in T206821]
  • Bytes changed during event [method defined in T206820]
  • Pageviews, cumulative [method defined in T206817]
  • Avg. daily pageviews [method defined in T206817]
  • Incoming links [Release II, method defined in T214219]
  • More page metrics [see below for formatting info]
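The column order and default sort can be sketched as follows. This is an illustration only, not the Event Metrics implementation; the helper names and the choice to sort "n/a" rows last are assumptions.

```python
import csv
import io

# Column headings in the approved order from the list above.
COLUMNS = [
    "Title", "URL", "Creator", "Wiki",
    "Edits during event", "Bytes changed during event",
    "Pageviews, cumulative", "Avg. daily pageviews",
    "Incoming links", "More page metrics",
]


def sort_key(row):
    """Sort by "Avg. daily pageviews", descending."""
    value = row["Avg. daily pageviews"]
    # Assumption: rows without a numeric value (e.g. "n/a") go last.
    return (0, -value) if isinstance(value, (int, float)) else (1, 0)


def render_rows(rows):
    """Return the header plus sorted data rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in sorted(rows, key=sort_key):
        writer.writerow(row)  # missing columns are written as empty cells
    return buffer.getvalue()
```

Because the heading "Pageviews, cumulative" contains a comma, the CSV writer wraps that header cell in double quotes.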

Event details

At the bottom of the csv report, below the data above, please list the data in the table below:

  • Please separate the event details from the report with a line of 7 dashes, as shown.
  • The timezone notation and all dates/times use the timezone of the event as set in the Settings, not that of the user who downloads the report.
  • The "last updated" time is the time of the last Update, not of the download.
———————
Pages Created: Eventname
Timezone: Timezonecountry/City
Start date: yyyy-mm-dd hh:mm
End date: yyyy-mm-dd hh:mm
Last updated: yyyy-mm-dd hh:mm
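A sketch of how the event-details footer might be assembled. The function name, the two-cell (label, value) row layout, and the use of ASCII hyphens for the 7-dash separator are assumptions. Input datetimes are assumed to be timezone-aware (for example, stored in UTC).

```python
from datetime import datetime, timedelta, timezone, tzinfo


def event_details_rows(event_name: str, tz_label: str, tz: tzinfo,
                       start: datetime, end: datetime,
                       last_updated: datetime):
    """Return the footer rows appended below the report data."""

    def fmt(moment: datetime) -> str:
        # Always render in the event's timezone, never the downloader's.
        return moment.astimezone(tz).strftime("%Y-%m-%d %H:%M")

    return [
        ["-" * 7],  # separator line of 7 dashes
        ["Pages Created:", event_name],
        ["Timezone:", tz_label],
        ["Start date:", fmt(start)],
        ["End date:", fmt(end)],
        ["Last updated:", fmt(last_updated)],  # time of last Update, not download
    ]
```

In practice `tz` could be supplied by `zoneinfo.ZoneInfo` using the event's configured timezone name; a fixed-offset `timezone(timedelta(hours=-5))` works as well for a quick check.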

Metric definitions & formatting

  • Title: The page title of each Main space page created (consistent with all active filters).
  • URL: The URL of the page listed.
  • Creator: The username of the person who created the article. In the Wikitext version of this report, the name is combined with the URL of the user's userpage (on the same wiki as the Page Created), so that the names are linked.
  • Wiki: The wiki where the article exists. Limited to the short list of wikis defined on the Event Setup screen for the event. For space reasons, label all Wikipedias using only the language name: "Spanish," "French," etc. (i.e., omit "Wikipedia"). List "Commons" and "Wikidata" as such.
  • Edits during event: The number of edits to the article during the event period.
  • Bytes changed during event: The net bytes changed to the page during the event period. If the value is negative, please include a - (but don't use a + for positive numbers).
  • Pageviews, cumulative: Pageviews to the Main space page from creation until the most recent data available as of the last data Update. (The granularity of the Pageviews API is one day, meaning the most recent data you can get is yesterday's.) If the user requests stats during the day of creation, we will show "n/a" (for "not available") rather than 0, which is misleading.
  • Avg. daily pageviews: The average over the preceding 30 days. If 30 days are not available, use the average of however many days are available. If the user requests stats during the day of creation (when no figures are available), we will show "n/a" (for "not available") rather than 0.
  • Incoming links: A count of links to the article (cumulative, i.e., since creation).
  • More page metrics: A URL that links users to the XTools "Page History" page for that article. In CSV reports, just list the URL. In Wikitext reports, combine the URL with the word "more" to form links.
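The formatting rules for bytes changed and pageviews can be sketched like this. The function names are hypothetical, `daily_views` is assumed to be a list of per-day counts ordered oldest first, and rounding the average to a whole number is an assumption the spec does not state.

```python
from typing import Sequence, Union

NOT_AVAILABLE = "n/a"


def format_bytes_changed(net_bytes: int) -> str:
    """Negative values carry a '-', positive values carry no '+'."""
    # str() on an int already behaves this way, so spreadsheets
    # continue to treat the cell as a number.
    return str(net_bytes)


def cumulative_pageviews(daily_views: Sequence[int]) -> Union[int, str]:
    """Total views since creation, or "n/a" when no data exists yet."""
    return sum(daily_views) if daily_views else NOT_AVAILABLE


def avg_daily_pageviews(daily_views: Sequence[int],
                        window: int = 30) -> Union[int, str]:
    """Average over the last `window` days, or fewer if that is all we have."""
    if not daily_views:
        return NOT_AVAILABLE  # e.g. requested on the day of creation
    recent = daily_views[-window:]
    return round(sum(recent) / len(recent))  # rounding is an assumption
```

Returning "n/a" instead of 0 keeps a page with no data yet distinguishable from a page that genuinely received no views.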

Data that are fixed at event close vs. data that continue to develop

Figures like Pageviews naturally continue to develop after the event is over. Other figures can be considered fixed once the event period is over; these could be stored and need never be calculated again. Here is a breakdown for this report:

Remain fixed

  • Wiki
  • Creator
  • Edits during event
  • Bytes changed during event

Continue to develop

  • Title
  • URL
  • Pageviews, cumulative
  • Avg. daily pageviews
  • Incoming links
  • More page metrics [URL]
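One way to act on this breakdown is to skip the fixed columns once an Update has already run after the event ended. The sketch below is an assumption about how that check could look, not the project's actual logic; the names are hypothetical.

```python
from datetime import datetime
from typing import Optional, Set

FIXED_COLUMNS = {
    "Wiki", "Creator", "Edits during event", "Bytes changed during event",
}
DEVELOPING_COLUMNS = {
    "Title", "URL", "Pageviews, cumulative", "Avg. daily pageviews",
    "Incoming links", "More page metrics",
}


def columns_to_refresh(event_end: datetime,
                       last_updated: Optional[datetime]) -> Set[str]:
    """Decide which columns an Update needs to recalculate."""
    # If the previous Update ran after the event ended, the fixed
    # figures are already final and can be reused from storage.
    if last_updated is not None and last_updated > event_end:
        return set(DEVELOPING_COLUMNS)
    # Otherwise (never updated, or last updated during the event),
    # everything must be calculated.
    return FIXED_COLUMNS | DEVELOPING_COLUMNS
```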

Event Timeline

jmatazzoni triaged this task as Normal priority. Oct 15 2018, 9:56 PM
jmatazzoni updated the task description.
jmatazzoni updated the task description.

Note that following T210682 ("Don't refresh fixed data if last updated datestamp is after end date of the event"), this report will need to check the archive table if the page doesn't exist in the page table. All the metrics listed above are still available after the page is deleted; we would just need to refer to a different table.

There is also an edge case where a page is oversighted or revision-deleted, in which case it won't be in either the page or the archive table.
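The lookup order described here can be sketched as below. Plain dictionaries stand in for the page and archive tables, and `locate_page` is a hypothetical helper; the real report would query the database.

```python
from typing import Any, Mapping, Optional, Tuple


def locate_page(title: str,
                page_table: Mapping[str, Any],
                archive_table: Mapping[str, Any]) -> Optional[Tuple[str, Any]]:
    """Find a page's record, falling back to the archive for deleted pages."""
    if title in page_table:
        return ("page", page_table[title])
    if title in archive_table:
        # Deleted page: the metrics are still available, just from archive.
        return ("archive", archive_table[title])
    # Edge case: oversighted or revision-deleted, so in neither table.
    return None
```

Callers need to handle the `None` case explicitly, since no metrics can be computed for such a page.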

Desirable but not for MVP / Deprecated Metrics

This is a list of metrics we wanted for this report but which were judged out of scope for the first MVP:

  • Bytes changed subsequently
  • # of words changed subsequently
  • Still exists? [T206695]
  • Words added during event [T206690]
  • Words removed during event [T206690]
  • Net % change in words during event [T206690]
  • Article class (where available)

Note that most of the work for this will be done with T205502. Wikitext is easier since we can do our development within the browser, as opposed to having to download CSV and open it in spreadsheet software over and over.

jmatazzoni updated the task description.
MusikAnimal moved this task from Ready to In Development on the Community-Tech-Sprint board.
jmatazzoni updated the task description.

@MusikAnimal, FYI I changed the spec in the Description of "Bytes changed" slightly to get rid of the + for positive numbers—because it makes the number align left instead of right in CSVs (they don't see it as a number anymore). The - for negative numbers is OK though, and doesn't have that problem. Here is the new spec, fyi:

  • Bytes changed during event: The net bytes changed to the page during the event period. If the "Bytes changed" is a negative number, please include a - (but don't use a + for positive numbers).

> FYI I changed the spec in the Description of "Bytes changed" slightly to get rid of the + for positive numbers

Got it, thanks!

This is ready for review: https://github.com/wikimedia/eventmetrics/pull/204

dom_walden added a subscriber: dom_walden. Edited Mar 5 2019, 9:02 AM

Checked that the figures were correct for some Wikidata items (I have not really tested Wikidata up to this point). Saw no problems, although I did not know where to get figures for page views.

@MusikAnimal The only problem I saw was that it did not appear to be correctly encoding apostrophes. For example, https://eventmetrics-dev.wmflabs.org/programs/76/events/259 has cells such as "Switzerland women's national under-20 volleyball team" (affects both Title and URL columns). Perhaps similar to T215923? The encoding is correct in the wikitext format.

I also checked that I could perform mathematical operations on the integer and date data when I viewed it in LibreOffice, just to check the format of the data was OK.

aezell added a comment. Mar 5 2019, 4:45 PM

Apostrophes in CSV formats are always problematic. I suspect there's a flag on the CSV generator to wrap cells with double-quotes when the content has single quotes within it. I know I've had to do that in previous work.
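To illustrate the kind of flag described above, here is a sketch using Python's standard `csv` module, which is not the generator Event Metrics uses. In this module an apostrophe is not a special character, so the default quoting leaves the cell bare, while `csv.QUOTE_ALL` wraps every cell in double quotes. The helper name `to_csv_line` is hypothetical.

```python
import csv
import io


def to_csv_line(cells, quoting=csv.QUOTE_MINIMAL):
    """Serialize one row to CSV text with the given quoting policy."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=quoting, lineterminator="\n")
    writer.writerow(cells)
    return buffer.getvalue()


title = "Switzerland women's national under-20 volleyball team"

# Default policy: only cells containing commas, quotes or newlines are wrapped.
print(to_csv_line([title, 3]), end="")

# QUOTE_ALL: every cell is wrapped in double quotes, apostrophe left intact.
print(to_csv_line([title, 3], quoting=csv.QUOTE_ALL), end="")
```

Either way the apostrophe itself is written unchanged, so an encoded apostrophe in the output would point to escaping applied before the value reaches the CSV writer.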

> The only problem I saw was that it did not appear to be correctly encoding apostrophes. For example, https://eventmetrics-dev.wmflabs.org/programs/76/events/259 has cells such as "Switzerland women's national under-20 volleyball team" (affects both Title and URL columns). Perhaps similar to T215923? The encoding is correct in the wikitext format.

Yes it's the same issue as T215923. I've got a fix up for review now :)

> Yes it's the same issue as T215923. I've got a fix up for review now :)

In that case, I will put this task into Product Sign-off as I have nothing else to do.

@MusikAnimal @dom_walden, I'm looking at the CSV report. It looks great except I notice the Creator column is missing. See screenshot below.

> I'm looking at the CSV report. It looks great except I notice the Creator column is missing.

Done with https://github.com/wikimedia/eventmetrics/pull/213

jmatazzoni closed this task as Resolved. Mar 8 2019, 2:29 AM
jmatazzoni moved this task from Product sign-off to Q3 2018-19 on the Community-Tech-Sprint board.
jmatazzoni updated the task description.

> Event details: In addition to the data/metrics, the CSV file will contain some descriptive information about the event and the report. See below under "Event details."

Event details correct for the few that I checked.

> Report filename: when the user saves the report, the filename should follow this format: pages-created_event-name

I did not test this with a large range of characters, although I noticed that it supports Cyrillic. When creating an event there is some validation of the event name (e.g. "/" is not allowed in names).

> The default sort order will be by "Avg. daily pageviews", descending

The rows of the csv are in this order.

> Wiki: The wiki where the article exists. Limited to the short list of wikis defined on the Event Setup screen for the event. For space reasons, label all Wikipedias using only the language name: "Spanish," "French," etc. (i.e., omit "Wikipedia"). List "Commons" and "Wikidata" as such.

@jmatazzoni Should I raise this as a separate bug? Do we also want to rename these on the "EVENT SUMMARY" and "ALL EDITS" pages (and resp. downloadable reports)?

> Edits during event: The edit count to the article during the event period.

@jmatazzoni Currently, this includes the edit that created the article. Perhaps after T217455 we don't want it to?

Inside the Wikipedia software, we may not distinguish "edit" from "create". The organiser might know this, but if they are "[creating] reports for partners, grantors, bosses, etc.", those people may not.

> Bytes changed during event: The net bytes changed to the page during the event period. If the Bytes Changed is a negative number, please include a - (but don't use a + for positive numbers).

I checked this for a small sample, but not systematically. I did not see any numbers which used "+". I did not see negative numbers, and I don't see how this would be possible if we are counting net bytes changed since creation. The lowest we could possibly get would be 0 (e.g. create an article then remove all the content).

> Pageviews, cumulative: Pageviews to the Main space page from creation until most recent data available as of the last data Update. (Granularity of Pageviews API is one day, meaning you always get yesterday's data.) If the user requests stats during the day of creation, we will show "n/a", for "not available" rather than 0, which is misleading.
> Avg. daily pageviews: Avg. Pageviews is an average over the preceding 30 days. If 30 days are not available, use the average of however many days are available. If the user requests stats during the day of creation (when no figures are available), we will show "n/a", for "not available" rather than 0.

Accuracy of "pageviews, cumulative" tested elsewhere (I think). I have not looked at "avg. daily pageviews" as we already know there are bugs (T217704).

@jmatazzoni If the pages created report is run on the same day as the page is created, then the columns are blank (in both csv and wikitext). Shall I raise a new bug? (example in https://eventmetrics-dev.wmflabs.org/programs/118/events/297/pages-created, but will only work today).

> Incoming links: A count of links to the article (cumulative, i.e., since creation)

@jmatazzoni Currently, it is only the current links to this article, rather than any page that has ever linked to this article. I assume the latter information is not available.

This has already been resolved so I assume we're all set here, but just in case:

> Wiki: The wiki where the article exists. Limited to the short list of wikis defined on the Event Setup screen for the event. For space reasons, label all Wikipedias using only the language name: "Spanish," "French," etc. (i.e., omit "Wikipedia"). List "Commons" and "Wikidata" as such.

We don't want to do this. Using the domain en.wikipedia, commons.wikimedia, etc., is consistent with the rest of the application and avoids a bunch of API calls and logic complexity to localize "Wikipedia" etc. for the interface language. There are also tentative plans to introduce Wiktionary, Wikisource, perhaps others, into Event Metrics. In that case simply putting "Spanish" is ambiguous as it wouldn't specify the wiki family.

> Edits during event: The edit count to the article during the event period.

> Currently, this includes the edit that created the article. Perhaps after T217455 we don't want it to?

A page creation requires an edit. I don't think this concept is the same as pages created vs improved, and all edits should be counted. That is certainly the behaviour I would expect as a user (the raw edit count would match what I see at Special:Contributions).

> Incoming links: A count of links to the article (cumulative, i.e., since creation)

> Currently, it is only the current links to this article, rather than any page that has ever linked to this article. I assume the latter information is not available.

Correct, we can only count the current number of links to the article.