covid19 data preservation
Closed, Resolved | Public | 5 Estimated Story Points

Description

Coronavirus data preservation

Event Timeline

Change 583678 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Temporarilty disable webrequest deletion for 1 week

https://gerrit.wikimedia.org/r/583678

Change 583678 merged by Ottomata:
[operations/puppet@production] Temporarilty disable webrequest deletion for 1 week

https://gerrit.wikimedia.org/r/583678

fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Ping @Isaac, how is it going? Let us know when we can re-enable deletion.

@Ottomata thanks for the ping -- I have one final adjustment to make to the query, but then we will start loading in data from January 1st onwards. The full query takes a few hours per day of data, so it might take a day or two for us to get through January (and verify that everything is still working as expected), but I will keep you posted on progress.

@Isaac Also, please document where the results of this query are stored (it should be a location on HDFS); it would also be wise to put the query in source control.

> Also, please document where the results of this query are stored (it should be a location on HDFS); it would also be wise to put the query in source control.

@Nuria sounds good. I've been using the SWAP notebooks, and I was hesitant to share the notebook publicly before we had agreed on privacy filters (so that I didn't accidentally leak information). It's not beautiful source control, but it is mirrored here: https://github.com/geohci/covid-19-sessions

I'm happy to work on a better solution if you'd like, but for now I'll just keep that consistent with what's on SWAP.

And unless @JAllemandou indicates otherwise, I'll be storing the external table at hdfs://analytics-hadoop/user/isaacj/covid19 and it'll be in Hive under isaacj.covid19_sessions.
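For readers unfamiliar with external tables, here is a minimal sketch of how a table like this could be declared over that HDFS location. Only the table name and the location come from this task. The column names, the partition column, and the Parquet storage format are hypothetical placeholders, not the actual schema.

```python
# Sketch only: builds a Hive DDL statement for an external table.
# Table name and LOCATION come from the comment above; the columns,
# partition column and storage format are hypothetical placeholders.

TABLE = "isaacj.covid19_sessions"
LOCATION = "hdfs://analytics-hadoop/user/isaacj/covid19"


def external_table_ddl(table, location, columns, partition_columns):
    """Return a CREATE EXTERNAL TABLE statement as a string."""
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    parts = ", ".join(f"`{name}` {hive_type}" for name, hive_type in partition_columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


ddl = external_table_ddl(
    TABLE,
    LOCATION,
    columns=[("session_id", "string"), ("page_title", "string")],  # hypothetical
    partition_columns=[("day", "string")],  # hypothetical
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the metadata and leaves the files under the HDFS location in place.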

> I have one final adjustment to make to the query

That was an unnecessarily vague statement on my part: the adjustment is an attempt to get slightly more complete Wikidata ID joins, per the issues raised in T249773. It won't affect anything else in the query, though, because it's a left join: even if I get it wrong (miss some Wikidata IDs, for instance), I can just drop the column and redo the join later.
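To illustrate why a left join is low-risk here, the following toy example uses an in-memory SQLite database. The table names, column names, and ID values are hypothetical and unrelated to the real query. It only shows that rows from the left table survive when the mapping is incomplete.

```python
import sqlite3

# Toy data: table names, column names and IDs are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pageviews (page_id INTEGER, views INTEGER);
    CREATE TABLE wikidata_ids (page_id INTEGER, wikidata_id TEXT);
    INSERT INTO pageviews VALUES (1, 100), (2, 50), (3, 7);
    -- page 3 has no Wikidata ID, mimicking an incomplete mapping
    INSERT INTO wikidata_ids VALUES (1, 'Q_a'), (2, 'Q_b');
    """
)

# Every pageviews row is kept; a missing match yields NULL (None in Python).
rows = conn.execute(
    """
    SELECT p.page_id, p.views, w.wikidata_id
    FROM pageviews AS p
    LEFT JOIN wikidata_ids AS w ON p.page_id = w.page_id
    ORDER BY p.page_id
    """
).fetchall()

for row in rows:
    print(row)
```

One caveat with this pattern: if the mapping table held more than one row per key, the join would duplicate left-hand rows, so the "drop the column and redo the join" recovery assumes at most one Wikidata ID per page.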

> I'm happy to work on a better solution if you'd like, but for now I'll just keep that consistent with what's on SWAP.

SWAP + GitHub sounds good, nice.

@Isaac we need to re-enable deletion of webrequest data. Are you ready for us to do so? Please let us know. cc @JAllemandou @elukey

Thanks for the ping @Nuria

@JAllemandou @elukey can you confirm what days would be deleted at this point?

I have the data that we need through January 14th and am working on gathering the rest of January (at which point I'll be sufficiently ahead of normal deletion that the onus will be on me to generate the dataset before webrequest data is purged for a given day).

Actually, sorry, give me several more hours with the January 9th-14th data. I just realized that the "Coronavirus" article, which was really the only source of information on Wikipedia in these early weeks, was not part of our "Covid-19-related articles" dataset, so I'd like to rebuild those datasets.

Article traffic for Coronavirus: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=0&start=2020-01-01&end=2020-01-23&pages=Coronavirus

Update: I have now pulled the data up to January 28th and will continue to build the dataset so that I stay ahead of the normal deletion process. I assume that re-enabling deletion would make January 22nd the oldest day for which data exists, but correct me if I'm wrong.

Correct, 90 days ago. That's January 23 now :) @Nuria, let me know if I should re-enable deletion.
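The retention arithmetic in the last two comments can be sketched as follows. The 90-day window comes from the thread. The comment date of 2020-04-22 is an assumption (the thread does not state it), chosen because it is consistent with "January 23 now". The exact cutoff semantics of the deletion job are likewise assumed.

```python
from datetime import date, timedelta

RETENTION_DAYS = 90  # from the thread: "90 days ago"


def oldest_retained_day(today, retention_days=RETENTION_DAYS):
    """Oldest day still present if data older than the retention window is purged."""
    return today - timedelta(days=retention_days)


def days_of_headroom(pulled_through, today, retention_days=RETENTION_DAYS):
    """How many days the extracted dataset is ahead of the deletion cutoff."""
    return (pulled_through - oldest_retained_day(today, retention_days)).days


# Assumed comment date; the thread does not state it.
today = date(2020, 4, 22)
print(oldest_retained_day(today))                  # 2020-01-23
print(days_of_headroom(date(2020, 1, 28), today))  # 5
```

With data pulled through January 28th, the dataset would be five days ahead of the cutoff under these assumptions, so extraction needs to keep advancing at least one day per calendar day to stay ahead.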

Change 591935 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Re-enable webrequest data deletion job

https://gerrit.wikimedia.org/r/591935

Change 591935 merged by Ottomata:
[operations/puppet@production] Re-enable webrequest data deletion job

https://gerrit.wikimedia.org/r/591935

Thanks @Ottomata and @Nuria for coordinating this so we had time to put together the dataset!

Nuria updated the task description.
Nuria set the point value for this task to 5.
Nuria set Final Story Points to 5.