Create a method to fetch information about uploaded files in local wikis
Closed, Resolved | Public | 5 Estimated Story Points

Description

Deliverable:

  • When checking for uploaded files in EventMetrics, also check uploaded files that were uploaded to the given local wikis, not just to commons.

More information

Unlike Grant Metrics currently, in Event Metrics the various reports that provide metrics about Files Uploaded will include uploads to local wikis (that are specified during event setup). The information about uploaded files needs to be expanded to always include the requested local wikis when computing the metrics listed below. (As now on Grant Metrics, Commons uploads will be checked when Commons is specified as a wiki of interest.)

Note that "Pages with uploaded files" is a new stat for Grant Metrics.

Definitions of metrics relating to 'Files Uploaded'

As used in the "Event summary" reports T205561

  • Files uploaded A count of the files uploaded during the event. Includes all file types. Unlike Grant Metrics currently, we will count files uploaded to the individual specified wikis as well as to Commons. As on current Grant Metrics, Commons is counted only if Commons is specified as a wiki of interest during setup.
  • Unique pages with uploaded files [The method for this is in task, T215356] A count of how many pages have Files Uploaded on them, on all wikis (i.e., not just those specified for the event).
  • Uploaded files in use A count of the uploaded files that are in use on at least one page on any wiki.
  • Avg. daily views to files uploaded [the method for this is in T206700] Pageviews per day to all pages on which Files Uploaded have been placed. Counts pageviews on all wikis that include articles with uploaded files, not just wikis specified as wikis of interest in event setup. Avg. is calculated from a 30-day sample (or, if fewer than 30 days are available, from as many days as are available).
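
A minimal sketch of the averaging rule in the last bullet, assuming the daily pageview counts are already available as a list with the most recent day last. The function name `avg_daily_views`, the choice to sample the most recent days, and returning `None` to stand for "n/a" are illustrative assumptions, not EventMetrics code.

```python
from typing import Optional, Sequence

SAMPLE_DAYS = 30  # sample size named in the metric definition


def avg_daily_views(daily_views: Sequence[int]) -> Optional[float]:
    """Average daily pageviews over a sample of at most SAMPLE_DAYS days.

    If fewer than SAMPLE_DAYS days are available, average over the days
    that exist. Returns None (to be shown as "n/a") when there is no data.
    """
    # Assumption: the sample is the most recent SAMPLE_DAYS entries.
    sample = list(daily_views)[-SAMPLE_DAYS:]
    if not sample:
        return None
    return sum(sample) / len(sample)
```

With only two days of data the result is the mean of those two days; with more than 30 days, older days fall out of the sample.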

As used in the "Files Uploaded" reports T212547

Essentially, everything in this report is "related" to uploaded files. The main metric, though, on which everything else in the report depends is this:

  • Filename Gives the name of each file uploaded during the event, to both Commons and local wikis (if they are specified as wikis of interest in event setup). This is a file-by-file listing of the Files Uploaded metric (see above).

Event Timeline

Niharika updated the task description.
Mooeypoo renamed this task from "Create a method to fetch information about uploaded files" to "Create a method to fetch information about uploaded files in local wikis". Oct 16 2018, 11:37 PM
Mooeypoo updated the task description.

@Mooeypoo, this is a high-value Method ticket you need to go over and rewrite for Implementation.

@jmatazzoni is there a similar task to this? I feel like I just wrote something about this...

In T206819#4882432, @Mooeypoo wrote:

@jmatazzoni is there a similar task to this? I feel like I just wrote something about this...

Yes, this is closely related (as you note in the ticket) to Create a method for putting 'Avg. daily views to files uploaded' metric into 'Event Summary' reports

@Mooeypoo, isn't T206700: Create a method for putting 'Avg. daily views to pages that have files uploaded actually a Parent task of this, not a subtask? You don't need Avg. Daily Views to get Files Uploaded, but you do need Files Uploaded to get Avg. Daily Views for Files Uploaded. Right? Making the change. Tell me if I screwed up.

You're right, I think I connected the tasks wrong. Thanks for fixing it!

@jmatazzoni Just a note... we're already showing "files uploaded" and "file usage" in the interface now, so we will need to continue to do that after this is implemented. But, it's kinda weird to show "Files uploaded: 0" when your event is only for say, en.wikipedia, and you don't care about files at all. So I'm thinking, if Commons is not included as part of the event, we will only show Files Uploaded/Usage if at least one file was uploaded. That sound okay?

Since this speaks to the user experience, I'm copying @Prtksxna for his opinion. As for myself, I would prefer to display all metrics all the time. For one thing, demonstrating that Files Uploaded is something we track, even if it doesn't apply to the current event, is a good way to help users discover that feature. In general, I don't like "hidden settings" (if x happens, then change y without telling the user about it) since they put questions in users' minds. I think it's better if the UI stays consistent from event to event, so that elements don't move around, users get accustomed to reading a report that stays in a standard format—and when users don't see something they don't have to wonder "is that broken?"

After all, if an event produced 0 images, that is valid and possibly useful data. And I've specified a number of metrics for which we will list "n/a" as opposed to zero, to show that that metric is "not available" (e.g., pageviews on the first day of an event). If some are n/a and some are 0, that is clear. If some are n/a and some disappear, then that is mysterious. E.g., imagine someone who meant to specify Commons but forgot. If he sees 0 Files Uploaded, that may spark him to fix his settings. If the metric just goes missing...

I also have to add that I absolutely want all metrics displayed in the downloadable reports. So we're really talking only about the on-page displays.

Now, having said all that, I guess I have to ask: how do the metrics work currently? Do they all disappear from the on-page display in the event they are zero? So would that be a ticket I'd have to write and another job of work to do, to fix this the way I prefer? Because adding more work is not something I want to be doing just now.

is this how all the metrics work currently? Do they all disappear from the on-page display in the event they are zero?

Nope, we show them even if they are zero. This is except for the newer metrics like pageviews and bytes added, which we don't show at all.

I agree with you and will move forward with always showing files uploaded/usage, even if it is zero. This makes it a lot easier :) That said I believe this is ready for review: https://github.com/wikimedia/eventmetrics/pull/171

If we want to hide some metrics if they are zero, I suggest creating a separate ticket as it's a little tricky. I only asked here because this is a new metric (for non-Commons anyway).

@jmatazzoni Just a note... we're already showing "files uploaded" and "file usage" in the interface now, so we will need to continue to do that after this is implemented. But, it's kinda weird to show "Files uploaded: 0" when your event is only for say, en.wikipedia, and you don't care about files at all. So I'm thinking, if Commons is not included as part of the event, we will only show Files Uploaded/Usage if at least one file was uploaded. That sound okay?

I think we should be showing the information that is of relevance to the event, even if it is zero. But if a metric isn't applicable to an event we should omit it. Cluttering the page with metrics that aren't applicable would increase the users' cognitive load without benefit.

I believe this is ready for review: https://github.com/wikimedia/eventmetrics/pull/171

I reviewed the code and it looks good to me.

Pages with uploaded files A count of how many pages have Files Uploaded on them, on all wikis (i.e., not just those specified for the event).

This was overlooked. The "deliverable" implies we wanted to extend the current metrics used on Commons to local wikis, but "pages with uploaded files" is a new metric entirely. We are not doing this for Commons uploads, either. As a new metric, I think this should be a separate ticket. You can still QA the "Files uploaded" and "Uploaded files in use" (worded as "Files in use") on https://eventmetrics-dev.wmflabs.org

Just to note this in the record, I crossed out the following sentence in the Description:

This tracking of local file uploads also needs to work when the Category filter is used. I.e., if organizers are using categories to define an Event, the system must track file uploads made to those categories.

Leon and I talked about this: we added a new ticket for using categories with files (T214744) and said it would work only on Commons.

@MusikAnimal I believe the figure for "Files in use" for https://eventmetrics-dev.wmflabs.org/programs/91/events/187/revisions is incorrect. Clicking through I can see that all 3 of the uploaded files are used in at least one place (on Commons and some in other places as well).

I notice that the link in the Page column for "File:Masjid Quba' Rao-Rao.jpg" is incorrect; takes you to https://commons.wikimedia.org/wiki/File:Masjid_Quba%26#039;%20Rao-Rao.jpg. It appears to be HTML encoding the apostrophe in the href. Perhaps the method is also making a similar mistake?

EDIT: This may be a false alarm. The SQL query for the metric only looks for links in the "0" namespace. I guess this is intentional.

I attempted to compare the metrics to the data in the revisions tables
(/programs/m/events/n/revisions). But, this does not appear to show me every
file uploaded for an event (I am not sure why). I abandoned this approach.

Looking at the backend database (of commons) for
https://eventmetrics-dev.wmflabs.org/programs/104/events/243 matched what I was
seeing in the UI.

Similarly /programs/91/events/187 the figures in the UI match what I see when
looking at the files uploaded in commons (after I worked out the false alarm
from above).

I have not so far looked at any of the other events in great detail.

The two SQL statements for the metrics appear to be:

SELECT img_name AS count FROM commonswiki_p.image
WHERE (img_timestamp BETWEEN $start_date AND $end_date)
AND (img_user_text IN ($list_of_user_names));

SELECT COUNT(DISTINCT(img_name)) AS count FROM enwiki_p.imagelinks
INNER JOIN enwiki_p.image ON il_to = img_name
WHERE (il_from_namespace = 0)
AND (img_timestamp BETWEEN $start_date AND $end_date)
AND (img_user_text IN ($list_of_user_names));

They seem pretty straightforward (no complicated logic, and at most one
JOIN). I am not aware of anything that could go wrong (for what that's
worth, considering my limited knowledge in this area).
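
To make the two statements concrete, here is a rough sketch that runs equivalent queries against an in-memory SQLite database. It is not the EventMetrics code. The tables are toy stand-ins for `image` and `imagelinks` without the `commonswiki_p.` / `enwiki_p.` prefixes, every row is invented, the function names are hypothetical, and the first query uses `COUNT(*)` on the assumption that a count is what the metric needs. User names and dates are passed as bound parameters rather than interpolated into the SQL.

```python
import sqlite3


def files_uploaded(conn, start, end, usernames):
    """Count files whose current version was uploaded in [start, end] by the given users."""
    if not usernames:
        return 0
    marks = ",".join("?" * len(usernames))  # one placeholder per user name
    sql = (
        "SELECT COUNT(*) FROM image "
        "WHERE img_timestamp BETWEEN ? AND ? "
        f"AND img_user_text IN ({marks})"
    )
    return conn.execute(sql, [start, end, *usernames]).fetchone()[0]


def files_in_use(conn, start, end, usernames):
    """Count those files that are linked from at least one namespace-0 page."""
    if not usernames:
        return 0
    marks = ",".join("?" * len(usernames))
    sql = (
        "SELECT COUNT(DISTINCT img_name) FROM imagelinks "
        "INNER JOIN image ON il_to = img_name "
        "WHERE il_from_namespace = 0 "
        "AND img_timestamp BETWEEN ? AND ? "
        f"AND img_user_text IN ({marks})"
    )
    return conn.execute(sql, [start, end, *usernames]).fetchone()[0]


def demo_connection():
    """Build a toy database; every row here is invented for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE image (img_name TEXT PRIMARY KEY, img_timestamp TEXT, img_user_text TEXT);
        CREATE TABLE imagelinks (il_from INTEGER, il_from_namespace INTEGER, il_to TEXT);
        INSERT INTO image VALUES
            ('A.jpg', '20180818010000', 'Alice'),
            ('B.jpg', '20180818020000', 'Alice'),
            ('C.jpg', '20180818030000', 'Mallory'),
            ('D.jpg', '20180901000000', 'Alice');
        INSERT INTO imagelinks VALUES
            (1, 0, 'A.jpg'),
            (2, 0, 'A.jpg'),
            (3, 6, 'B.jpg');
        """
    )
    return conn
```

In the toy data, `B.jpg` is linked only from a page outside namespace 0, so it counts as uploaded but not as in use. That is the same effect as the false alarm described above.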

I say that assuming a number of things:

  • It gets the users' names from centralauth_p.globaluser. Is this always accurate, up-to-date, etc.?
  • Does the image table ever have duplicate images? For example, if you upload a new version of an image, does it create another row? (I have not seen it do this.)

As for the bug with HTML-encoded apostrophes in the hrefs: I see no evidence
that it is related to this work. The image and/or imagelinks tables would need
to store the encoding incorrectly, and I can see no sign of that. It is
probably just a front-end bug.
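
One plausible cause of a link like that, offered as a guess rather than a diagnosis of the EventMetrics templates, is that the title is HTML-escaped before it is percent-encoded into the URL. The sketch below contrasts the two orders of operation. The function names are hypothetical, and Python's `html.escape` writes the apostrophe as `&#x27;` rather than the `&#039;` seen in the broken link, but the ordering problem is the same.

```python
import html
from urllib.parse import quote

BASE = "https://commons.wikimedia.org/wiki/"


def href_escaped_first(title):
    """Problematic order: escape for HTML, then percent-encode the result."""
    escaped = html.escape(title, quote=True)
    # The "&" of the HTML entity is now percent-encoded into the path as %26.
    return BASE + quote(escaped.replace(" ", "_"), safe=":")


def href_encoded_first(title):
    """Percent-encode the raw title, then escape the whole URL for the attribute."""
    url = BASE + quote(title.replace(" ", "_"), safe=":")
    return html.escape(url, quote=True)
```

Calling both functions with `"File:Masjid Quba' Rao-Rao.jpg"` shows the difference: only the first result contains `%26`.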

The only other problems I can see are with revisions, which are again not
related. So, I will probably raise both as separate bugs.

Thanks for the thorough review @dom_walden!!!

From T206819#4934901, yes it is intentional that we only look at the mainspace.

I attempted to compare the metrics to the data in the revisions tables
(/programs/m/events/n/revisions). But, this does not appear to show me every
file uploaded for an event (I am not sure why). I abandoned this approach.

Interesting! I'm not sure why that is, either. Something's buggy, that's for sure. Nice catch!

It gets the users' names from centralauth_p.globaluser. Is this always accurate, up-to-date, etc.?

Correct.

Does the image table ever have duplicate images? For example, if you upload a new version of an image, does it create another row? (I have not seen it do this.)

No, it should only contain the current visible image. Older versions I believe are stored in the oldimage table.

In T206819#4945620, @MusikAnimal wrote:

Thanks for the thorough review @dom_walden!!! Nice catch! ... Interesting! I'm not sure why that is, either. Something's buggy, that's for sure. ...

Should this move back to In Dev then?

I assume the revisions table is not related to this work. I guess I need to raise it as a separate bug (assuming it is a bug, I don't yet know how it was designed to work).

Therefore, I see no reason why this ticket should not go into the Product sign-off column.

@jmatazzoni That being said...

The revisions table appears to return the whole day (days?) of the event (from midnight to midnight), sometimes getting the date conversion wrong in the case of non-UTC events.

However, even with the correct dates and times, I notice that the revision table for https://eventmetrics-dev.wmflabs.org/programs/104/events/243/revisions would show 135 file uploads. 3 files[1] were uploaded during the event but a few weeks later (outside the event time) someone uploaded a new version. Therefore, they are not included in the file uploads metric.

Still some (product/technical) decisions to be made here?

  1. https://commons.wikimedia.org/wiki/File:Diadumene_neozelanica_by_Tony_Wills.jpg
     https://commons.wikimedia.org/wiki/File:Corynactis_australis_by_Tony_Wills.jpg
     https://commons.wikimedia.org/wiki/File:Corynactis_australis2_by_Tony_Wills.jpg

In T206819#4947149, @dom_walden wrote:

@jmatazzoni That being said...

The revisions table appears to return the whole day (days?) of the event (from midnight to midnight), sometimes getting the date conversion wrong in the case of non-UTC events.

Can you say more about the issue here? I gather it relates to a mismatch between Auckland time and UTC. And I imagine it has to do with why the photo I looked at is listed in Commons as having been uploaded on the 17th instead of the 18th, when the event happened. But what exactly is the problem this causes for our count, if you can state it clearly?

However, even with the correct dates and times, I notice that the revision table for https://eventmetrics-dev.wmflabs.org/programs/104/events/243/revisions would show 135 file uploads. 3 files[1] were uploaded during the event but a few weeks later (outside the event time) someone uploaded a new version. Therefore, they are not included in the file uploads metric.

Still some (product/technical) decisions to be made here?

So, again, please say more about what exactly happened, in your estimation. These files were uploaded and then, I suppose, replaced with newer versions. From the POV of Files Uploaded, that should not matter. @MusikAnimal, what would it take to fix this? This seems related to the issue we've talked about of metrics that should not change once the event is over.

(From the POV of files in use, these replaced images presumably are not used anywhere... Though actually, I'd say the replacement is still the same image. It depends on how you look at it. I'd be content to count that either way.)

The "Edit List" aka "revisions" is a neglected part of the application. The issues Dom uncovered have probably been there since the beginning. I've created T215926 and am fixing this now since it is affecting QA testing of new features such as this one. I am almost done and will have a PR shortly.

I'm not sure about the files uploaded / replaced files issue. This would seemingly affect how stats are generated, since we use image to get the uploaded timestamp, and this table refers only to the most recent version of the file. Older versions that were uploaded during the event may be ignored, which probably isn't what we want. It is likely a big-ish change to fix this. Regardless I think it's a separate issue from this task, which is about extending current Commons files uploaded metrics to local file uploads (and Dom has identified a preexisting issue with this functionality).
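
The replaced-file problem can be illustrated with a toy SQLite schema. This is a sketch under assumptions, not a proposed fix: the `oldimage` columns (`oi_name`, `oi_timestamp`, `oi_user_text`) follow my understanding of MediaWiki's naming and should be checked against the real replica views, every row is invented, and the function names are hypothetical.

```python
import sqlite3

# Event window in UTC, as 14-digit MediaWiki-style timestamps (illustrative).
EVENT_START, EVENT_END = "20180817220000", "20180818040000"


def demo_connection():
    """Build a toy database with one untouched file and one replaced file."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE image (img_name TEXT, img_timestamp TEXT, img_user_text TEXT);
        CREATE TABLE oldimage (oi_name TEXT, oi_timestamp TEXT, oi_user_text TEXT);
        -- Kept.jpg: uploaded during the event and never replaced.
        INSERT INTO image VALUES ('Kept.jpg', '20180818010000', 'Alice');
        -- Replaced.jpg: uploaded during the event, new version weeks later.
        INSERT INTO oldimage VALUES ('Replaced.jpg', '20180817230000', 'Alice');
        INSERT INTO image VALUES ('Replaced.jpg', '20180910120000', 'Bob');
        """
    )
    return conn


def count_current_versions_only(conn, users):
    """Look at the image table only, so only the latest version's timestamp is seen."""
    marks = ",".join("?" * len(users))
    sql = (
        "SELECT COUNT(*) FROM image WHERE img_timestamp BETWEEN ? AND ? "
        f"AND img_user_text IN ({marks})"
    )
    return conn.execute(sql, [EVENT_START, EVENT_END, *users]).fetchone()[0]


def count_any_version(conn, users):
    """Count files with any version, current or old, uploaded in the window."""
    marks = ",".join("?" * len(users))
    sql = (
        "SELECT COUNT(DISTINCT name) FROM ("
        " SELECT img_name AS name, img_timestamp AS ts, img_user_text AS who FROM image"
        " UNION ALL"
        " SELECT oi_name, oi_timestamp, oi_user_text FROM oldimage"
        f") WHERE ts BETWEEN ? AND ? AND who IN ({marks})"
    )
    return conn.execute(sql, [EVENT_START, EVENT_END, *users]).fetchone()[0]
```

The first function misses `Replaced.jpg` because its row in `image` now carries the later timestamp. The second variant would also count a participant's re-upload of someone else's file, which may or may not be wanted.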

Sorry, I have put two issues together in a slightly confusing manner, one with the Edit List and another with Files Uploaded and Files Used.

In T206819#4947149, @dom_walden wrote:

@jmatazzoni That being said...

The revisions table appears to return the whole day (days?) of the event (from midnight to midnight), sometimes getting the date conversion wrong in the case of non-UTC events.

Can you say more about the issue here? I gather it relates to a mismatch between Auckland time and UTC. And I imagine it has to do with why the photo I looked at is listed in Commons as having been uploaded on the 17th instead of the 18th, when the event happened. But what exactly is the problem this causes for our count, if you can state it clearly?

This is the issue with the Edit List (which Leon has raised T215926 for). As I understand it, the timespan for the event in UTC is from 17 Aug 22:00:00 to 18 Aug 04:00:00. Getting the conversion to UTC wrong and only fetching revisions for the 18th (UTC) misses anything from the 17th, such as the original uploads of those files I linked. They are not currently included in the Edit List.
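
A small sketch of the conversion problem, assuming the event ran from 10:00 to 16:00 local time on 18 Aug in Auckland at a fixed UTC+12 offset. The local times are inferred from the UTC span above, the year 2018 is an assumption, and the function names are hypothetical. A fixed offset keeps the example self-contained; real code would use a proper time zone database.

```python
from datetime import datetime, timedelta, timezone

NZST = timezone(timedelta(hours=12))  # assumed fixed offset; no DST handling


def event_window_utc(local_start, local_end):
    """Convert timezone-aware local datetimes to a UTC (start, end) pair."""
    return local_start.astimezone(timezone.utc), local_end.astimezone(timezone.utc)


def naive_day_window(local_start):
    """The problematic variant: treat the event's calendar date as a UTC day."""
    day = datetime(local_start.year, local_start.month, local_start.day,
                   tzinfo=timezone.utc)
    return day, day + timedelta(days=1)


def in_window(ts_utc, window):
    """Return True if the UTC timestamp falls inside the (start, end) window."""
    start, end = window
    return start <= ts_utc <= end
```

An upload at 23:00 UTC on 17 Aug falls inside the converted window but outside the naive "18th in UTC" window, which matches the uploads missing from the Edit List.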

However, even with the correct dates and times, I notice that the revision table for https://eventmetrics-dev.wmflabs.org/programs/104/events/243/revisions would show 135 file uploads. 3 files[1] were uploaded during the event but a few weeks later (outside the event time) someone uploaded a new version. Therefore, they are not included in the file uploads metric.

Still some (product/technical) decisions to be made here?

So, again, please say more about what exactly happened, in your estimation. These files were uploaded and then, I suppose, replaced with newer versions.

That is correct. This is the issue with Files Uploaded and Files Used, which Leon has summarised above.

In T206819#4948234, @MusikAnimal wrote:

I'm not sure about the files uploaded / replaced files issue. This would seemingly affect how stats are generated, since we use image to get the uploaded timestamp, and this table refers only to the most recent version of the file. Older versions that were uploaded during the event may be ignored, which probably isn't what we want. It is likely a big-ish change to fix this.

To fix this, I imagine we could switch so that we're looking at the "Date of page creation," which doesn't change, instead of whatever date we're looking at now? It would be a separate ticket, as you say. And first I'd need to decide that there is no such thing as events that are all about re-uploading images.... And, I imagine this updating of images thing is an edge case. But if we wanted to make that change, Leon, that doesn't sound so bad. Why do you say it's a big-ish change to switch the file uploaded lookups to that? Isn't that what we do for Pages Created?

@dom_walden, all of the above said, and two tickets being created out of your explorations, is this ticket ready to move forward?

To fix this, I imagine we could switch so that we're looking at the "Date of page creation," which doesn't change, instead of whatever date we're looking at now? It would be a separate ticket, as you say. And first I'd need to decide that there is no such thing as events that are all about re-uploading images.... And, I imagine this updating of images thing is an edge case. But if we wanted to make that change, Leon, that doesn't sound so bad. Why do you say it's a big-ish change to switch the file uploaded lookups to that? Isn't that what we do for Pages Created?

I guess it depends on whether re-uploads by participants are of any interest (say if the older version was not by the participant). If not, yes going by the File page itself should make this much more straightforward :)

@dom_walden, all of the above said, and two tickets being created out of your explorations, is this ticket ready to move forward?

Yes. I will do that now.