
Update and expand the COVID-19 data page
Closed, Resolved (Public)

Description

Hang Do Thi Duc, hang@wikimedia.org

This is a follow-up request for support to update and expand the COVID-19 data page (https://wikimediafoundation.org/covid19/data). I am sending this in coordination with Lauren Dickinson from the Communications department.
Here are our (clarifying) questions:

  1. Diego's site (https://covid-data.wmflabs.org/): are we using the same list that he is using? If yes, that would mean we can use his data on our page as well. In the spreadsheet you distinguish between the "Comms list" and Diego's counts, and at the same time we have heard from your team that we can use the total editor and edit counts from his site (with the starting date of January 1, 2020). We would like clarification on this.
  2. Is a weekly update of data realistic for you to deliver? Please see our desired schedule with details on what data is needed here: https://docs.google.com/document/d/1F50Tspn0TXrTuN7hJbG7KTJZ1ruy3SQzewc7MLTzd-8/edit#heading=h.7i64rqa93rhm

Please let us know what works best!

This is an effort by the Communications department that we have already collaborated on, so I would like you to refer to those previous conversations. But if there are any further questions about priority, please let me know!

Medium: This is important to my work

Event Timeline

LGoto triaged this task as Medium priority. Apr 27 2020, 4:50 PM
LGoto added a project: Epic.
LGoto moved this task from Triage to Epics on the Product-Analytics board.
kzimmerman edited subscribers, added: hdothiduc; removed: SNowick_WMF, cchen, Mayakp.wiki. Apr 27 2020, 7:10 PM

Hi @hdothiduc! @SNowick_WMF, @Mayakp.wiki, and @cchen are making subtasks to reflect the weekly updates for the data they've already pulled, and they'll also include links to their related notebooks & repos (so someone else can take over if needed).

We'll plan on doing these updates manually for now.

@hdothiduc regarding your first question:

Diego's site (https://covid-data.wmflabs.org/): are we using the same list that he is using? If yes, that would mean we can use his data on our page as well. In the spreadsheet you distinguish between the "Comms list" and Diego's counts, and at the same time we have heard from your team that we can use the total editor and edit counts from his site (with the starting date of January 1, 2020). We would like clarification on this.

Yes, we are using the same list, so you can get total editor and edit counts from his site.

Thank you Kate for clarifying! We kept the total editor and edit counts in the weekly update request, so we can have one consistent Data-current-as-of-date for each section that will be updated on the data page.

I would still be curious what the distinction was in this spreadsheet tab - but it's not crucial for my work anymore, as I assume Shay, Maya or Connie will give us updated numbers according to the request.

Let me know if you want me to update the schedule in the Google Doc linked in the task description, e.g. change the weekly updates to every Wednesday or Thursday or change the frequency!

@hdothiduc The "comms" list was the original list we pulled, which did not include new articles that had been added over time. With these updates, I assume you want:

  • total articles, including any new ones that have been linked to COVID-19
  • edits/editors/pageviews for the above

With that in mind, when you do the next update, can you link the [1] reference in the References section to https://covid-data.wmflabs.org/pagesNoHumans, rather than the (static) meta page it links to currently? (See screenshot below).

Shay and Connie are working on the updates now, using the updated list from Diego's page.

Aaah thank you! That all makes sense now.
Yes, that sounds great! I will update that link when I update the data on the page!

I have updated the schedule in the Google Doc linked in the description of the task (to May 1, May 8, May 15).

@hdothiduc I found some unexpected differences in pageview counts between our "old" Comms list (which is used for https://analytics.wikimedia.org/published/dashboards/Wikipedia_C-19_Comms_Stats/) and the more recent version of Diego's list. I need to check in with him to see what's missing: I may be using the wrong list, or some pages may have been removed.

I'm still working on the weekly updates dashboard but until we get the list confirmed we are stuck in orbit. Will update you as soon as I get clarification.

Thank you @SNowick_WMF for the update! Once you get clarification, do you mind updating the schedule (same link in the description of the task)? So we can plan around those dates. Thank you very much!

Yes, thanks, @SNowick_WMF and all! We're still getting solid media interest on these stats and sharing on social media, so it would be great to update the data soon. Let us know if there's any way we can help + when you expect the new data to be available.

@hdothiduc I've gotten more clarification on the pages and am setting up the queries now. Getting everything automated should be complete this week, but I set the deadline for Friday, May 8 to give a realistic timeline, since we need to integrate archived and new data and make the weekly update schedule function smoothly. I can also provide the data you need for this week in a manual report; please let me know if that would help.

Thank you @SNowick_WMF! I adjusted the dates for the following weeks accordingly.
If a manual export is easy and quick to create, it would be great to have it earlier.

Just to give you context:
I am creating a new visualization for Section 4 (Breakdown by language) (here's a draft), so it would be helpful to use updated data for this section in particular. But no worries if it's not easy!

Hi @hdothiduc I'll send you a manual update on Wednesday 5/6 so you have the data you need.

Hi @hdothiduc, because we added more articles and languages to the Related Pages list, the latest data does not line up with the old data in the usual Covid-19 Infographic GDoc.
I made a new data sheet here: https://docs.google.com/spreadsheets/d/11EBCQo5RK82AlPMpcBRTAARHB7EK3Zpp3rC9jJR3z7Y/edit?usp=sharing
If you need totals for pageviews, you can use the new data, with the caveat that it doesn't line up exactly with the old data because some articles have moved or changed. The view counts are very close, just not exact, between 2020-03-13 and 2020-04-03 because of a problem with collecting pages. Let me know if you have questions. Shay

hdothiduc added a comment. Edited May 7 2020, 2:22 PM

Thank you very much @SNowick_WMF! I tagged you in some comments in the text document linked in the task description and have invited you to a meeting for later.
For documentation, the comments were about:

  • # of edits
  • Average # of edits made per hour since Dec. 1, 2019
  • total number of editors
  • total number of pageviews per language for the date range *

* We are using the data in the rows and columns, so the granularity is important. However, we also use the total numbers, and having them stated explicitly will make it easier for anyone to check my work once I use the data in any visualizations.
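For anyone checking the "average # of edits made per hour" figure: the thread does not say how it is computed, so this is only a minimal sketch, assuming it is the total number of edits divided by the hours elapsed since Dec. 1, 2019. The function name and the sample edit count are hypothetical, not values from the spreadsheet.

```python
from datetime import datetime


def average_edits_per_hour(total_edits, start, end):
    """Return total_edits divided by the number of hours between start and end.

    Assumption: `end` is exclusive, so data "through 2020-04-30" uses
    an end of 2020-05-01 00:00.
    """
    hours = (end - start).total_seconds() / 3600
    if hours <= 0:
        raise ValueError("end must be later than start")
    return total_edits / hours


# Hypothetical edit count, chosen only to make the arithmetic easy to follow.
start = datetime(2019, 12, 1)
end = datetime(2020, 5, 1)
example_average = average_edits_per_hour(7296, start, end)
```

Stating the start and end timestamps next to the figure makes it possible to reproduce the number from the edit total alone.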

I've just looked at the data again and I do have some more questions that we can discuss in the meeting:

  • Section 4 (Breakdown by language): as you can see in the draft, I am using this data in one single visualization comparing totals of editors, pageviews and articles for each language, so it's important that the date range for this data is the same. It looks like the data for the editors starts January 1, 2020, while the others start December 1, 2019.
  • In the editors column, languages are duplicated. What does that mean?

I also wanted to check in with you on the status of the central page to host the data we are showing on the data page, so we can link to it. @LDickinsonWMF had discussed that this is important for transparency and our communication goals. Do you have any updates regarding that? Maybe that's a question for @kzimmerman

SNowick_WMF added a comment. Edited May 8 2020, 5:58 PM

@hdothiduc https://docs.google.com/spreadsheets/d/11EBCQo5RK82AlPMpcBRTAARHB7EK3Zpp3rC9jJR3z7Y/edit?usp=sharing now has all historical data and updated data through 2020-04-30.

I am waiting to hear back about getting the missing editor/edit counts for 2019-12, so stay tuned. Everything else is up to date.

  • # of edits, Average # of edits made per hour since Dec. 1, 2019 UPDATED
  • total number of editors * 2019-12 data UPDATED
  • total number of pageviews per language for the date range *
  • Looks like the data for the Editors by Language starts January 1 2020, while the others start December 1 2019 - PENDING
  • In the editors column languages are duplicated, what is the meaning? - FIXED

PENDING:

hdothiduc added a comment. Edited May 8 2020, 11:30 PM

Awesome, thank you very much @SNowick_WMF! Nice checklist, too!

Since there is data in tabs that is based on the old static "strongly-related" list and other data is based on the new dynamic list, I suggest each spreadsheet tab should indicate which list (maybe with the links) was used.
For example, I am not sure whether the pageviews total tab uses the new dynamic list or the static list.

I have a couple of questions that might also need input from @LDickinsonWMF.
In the schedule, I had grouped some of the updates by section. The idea behind it is that each section has one indicator for the date range of the data (see red circle below).


In the spreadsheet, I see articles and languages have this date range: 2019-12-01 - 2020-05-05
Editors and Edits are for: 2019-12-01 - 2020-04-30
I don't feel strongly about either of these date ranges, so whatever is easier for you is ok for us! It's just important to have one date range for every data point in a section (the sections I refer to are in the schedule)

Similarly, we will update this section:


That means the total pageviews (number at the top) and the daily pageviews in the line chart should cover the same date range. Maybe it's best to just use December 1, 2019 - April 30, 2020, @LDickinsonWMF? Otherwise, Shay, can you provide the necessary data to cover the same date range? I think that would mean summing up all values in the total_views column in this tab here (Pageview Archive Updated).

Also, the total_views column contains numbers in mixed formats. Would it be possible to remove the thousands separators (3,236,133 -> 3236133)? That would make it easier for me to copy the data, convert it to JSON format and add it to my page template in WordPress.
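As a fallback on my side, the mixed formats could also be normalized before the JSON conversion. This is only a rough sketch: the row layout, the dates and the second view count are made up for illustration, and only the total_views column name and the 3,236,133 example come from the spreadsheet.

```python
import json


def parse_count(value):
    """Convert a count such as '3,236,133' or 3236133 to an int."""
    if isinstance(value, int):
        return value
    return int(str(value).replace(",", "").strip())


# Hypothetical rows as they might look after copying them out of the sheet.
rows = [
    {"date": "2020-04-29", "total_views": "3,236,133"},
    {"date": "2020-04-30", "total_views": 2950000},
]

# Normalize every count, then serialize for the page template.
cleaned = [
    {"date": row["date"], "total_views": parse_count(row["total_views"])}
    for row in rows
]
payload = json.dumps(cleaned)
```

Having the numbers in one format in the spreadsheet itself would still be preferable, so that anyone copying the data gets the same values.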

Let me know if anything here is confusing and I can explain further!

Summary

  • Data for one date range for Section 1
  • Data for one date range for Section 2
  • One consistent number format

Hey @SNowick_WMF / @hdothiduc: I just read through the above thread and am jumping in!

I agree with Hang about the date ranges: Being consistent by section is more important than having fresher data (although having fresh data is great!). So, if it's possible to pull everything from December 1, 2019 - May 5, 2020, that would be ideal. But if it is faster / more efficient for this update to pull through April 30, 2020, that works great too. Whatever is best for you + your team, Shay.

Thank you both!

Hi @LDickinsonWMF, good point. I will be updating everything to 5/14 for the deadline and will make sure end dates are consistent across datasets going forward.

Thank you very much @SNowick_WMF for all your work.
(I was out Thursday and Friday last week. And I am going through the data now)

I noticed that adding up all daily pageviews from "Pageviews Archive (Updated)" doesn't give the total number on "Pageviews Month + Total" (366882694 vs 269319592) - and I believe they should be the same, right?
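For reference, this is roughly the kind of check I mean, as a small sketch. The function name and the daily values are placeholders, not the real spreadsheet data, and I am assuming the stated total is meant to cover exactly the same days as the daily rows.

```python
def reconcile_total(daily_views, stated_total):
    """Return (summed, difference), where difference = summed - stated_total."""
    summed = sum(daily_views.values())
    return summed, summed - stated_total


# Placeholder daily counts keyed by date; the real values live in the sheet.
daily_views = {"2020-04-28": 100, "2020-04-29": 250, "2020-04-30": 150}

summed, difference = reconcile_total(daily_views, stated_total=500)
# A difference of 0 means the two tabs agree for this date range.
```

A non-zero difference can come from either tab, so the check only shows that the two disagree, not which one is wrong.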

@SNowick_WMF: Are you comfortable with us including a link to this document on the Foundation webpage?

Specifically, it would be hyperlinked in the first reference: "Strongly-related Wikipedia articles include all COVID-19 related pages except for the human-related ones (e.g. Tom Hanks)..."

@hdothiduc Yes, the totals should have matched. I fixed the pageview counts and they are aligned now.

@LDickinsonWMF Diego's list of related pages is probably better to link to: https://covid-data.wmflabs.org/pagesNoHumans. We aren't using that exact list because we've filtered out any non-Wikipedia pages, but it's the basis of our list. The methodology he used is explained here: https://paws-public.wmflabs.org/paws-public/User:Diego_(WMF)/CoronaAllRelatedPagesMarch30.ipynb. If you prefer to link to the automated pageview count, I am OK with that as well.

Great!

For Section 4 (Breakdown by language), I have to do some data processing to fit the needs of my visualization, which is a list of the top 15 languages by pageviews. For each language I need the number of editors, articles and pageviews (and the name of the language in the local language). I uploaded my Jupyter notebook here to do exactly that.
I have used the result from above on the develop instance for review.
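For readers without access to the notebook, the core step can be sketched as below. This is not the notebook's code, just a simplified illustration: the function name, the input layout (one dict per metric, keyed by language code) and the sample numbers are all assumptions.

```python
def top_languages(pageviews, editors, articles, n=15):
    """Return the n languages with the most pageviews, with all three metrics.

    A language missing from `editors` or `articles` gets None for that
    metric, so gaps stay visible instead of being shown as zero.
    """
    ranked = sorted(pageviews, key=pageviews.get, reverse=True)[:n]
    return [
        {
            "language": lang,
            "pageviews": pageviews[lang],
            "editors": editors.get(lang),
            "articles": articles.get(lang),
        }
        for lang in ranked
    ]


# Made-up sample values, only to show the shape of the inputs and output.
pageviews = {"en": 900, "de": 300, "es": 500}
editors = {"en": 40, "es": 20}
articles = {"en": 12, "de": 5, "es": 7}

top_two = top_languages(pageviews, editors, articles, n=2)
```

The local language names would be joined in separately, since they do not come from the spreadsheet.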

SNowick_WMF closed this task as Resolved. Jul 28 2020, 5:08 PM