Coronavirus data preservation
Event Timeline
Change 583678 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Temporarily disable webrequest deletion for 1 week
Change 583678 merged by Ottomata:
[operations/puppet@production] Temporarily disable webrequest deletion for 1 week
@Ottomata thanks for the ping -- I have one final adjustment to make to the query, but then we will start loading in data from January 1st onwards. Each day takes a few hours to run the full query, so it might take a day or two for us to get through January (and verify that everything is still working as expected), but I will keep you updated on progress.
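A minimal sketch of the day-by-day backfill described above, with a hypothetical `run_day` callable standing in for the actual Hive query (the real query is not shown in this thread):

```python
from datetime import date, timedelta

def backfill(start: date, end: date, run_day):
    """Run the (hypothetical) daily session query for each day in [start, end]."""
    processed = []
    day = start
    while day <= end:
        run_day(day)  # stands in for submitting the real query for that day's partition
        processed.append(day)
        day += timedelta(days=1)
    return processed

# Illustration with a no-op in place of the real query:
days = backfill(date(2020, 1, 1), date(2020, 1, 31), lambda d: None)
print(len(days))  # 31 days in January
```

Processing one day at a time like this also makes it easy to resume after a failure: only the days not yet in `processed` need to be rerun.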
@Isaac Also, please be so kind as to document where the results of this query are stored (it should be a location on HDFS); it would also be wise to put the query under source control.
@Nuria sounds good -- I've been using the SWAP notebooks, and I was hesitant to share the notebook publicly before we had agreed on privacy filters (so that I didn't accidentally leak information). It's not beautiful source control, but it is mirrored here: https://github.com/geohci/covid-19-sessions
I'm happy to work on a better solution if you'd like, but for now I'll just keep that consistent with what's on SWAP.
And unless @JAllemandou indicates otherwise, I'll be storing the external table at hdfs://analytics-hadoop/user/isaacj/covid19 and it'll be in Hive under isaacj.covid19_sessions.
I have one final adjustment to make to the query
To be less vague about that: the adjustment attempts to get slightly more complete Wikidata ID joins, per the issues raised in T249773. It won't affect anything else in the query, though, because it's a left join, so even if I get it wrong (miss some Wikidata IDs, for instance), I can just drop the column and redo the join at a later point.
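A minimal sketch of why the left join is low-risk, using plain Python dicts in place of the Hive tables (table contents, column names, and the Wikidata ID shown are all illustrative, not from the actual dataset):

```python
# Left join: every session row survives; a missed Wikidata ID only
# produces a None in the joined column, which can be dropped and
# recomputed later without touching the rest of the data.
sessions = [
    {"page": "Coronavirus", "views": 120},
    {"page": "Pandemic", "views": 45},
]
wikidata_ids = {"Coronavirus": "Q-placeholder"}  # "Pandemic" intentionally missing

joined = [
    {**row, "wikidata_id": wikidata_ids.get(row["page"])}  # None when the ID is missing
    for row in sessions
]
print(joined)
```

Because the join only ever adds a column, a buggy ID mapping cannot drop or duplicate session rows, which is the property the comment above relies on.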
I'm happy to work on a better solution if you'd like, but for now I'll just keep that consistent with what's on SWAP.
swap + github sounds good, nice
@Isaac we need to re-enable deletion of webrequest data; are you ready for us to do so? Please let us know. cc @JAllemandou @elukey
Thanks for the ping @Nuria
@JAllemandou @elukey can you confirm what days would be deleted at this point?
I have the data that we need through January 14th and am working on gathering the rest of January (at which point I'll be sufficiently ahead of the normal deletion schedule that the onus will be on me to generate the dataset for a given day before its webrequest data is purged).
Actually, sorry, give me several more hours with the January 9th - 14th data. I just realized that the "Coronavirus" article was not part of our "Covid-19-related articles" dataset, and that article is really the only source of information on Wikipedia in these early weeks, so I'd like to rebuild those datasets.
Article traffic for Coronavirus: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=0&start=2020-01-01&end=2020-01-23&pages=Coronavirus
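The Pageviews tool linked above is a front end for per-article pageview counts; the same numbers can be fetched programmatically. Below is a sketch that constructs the request URL for the Wikimedia REST API's per-article pageviews endpoint (the endpoint path format is my assumption of the current API, so verify it against the API docs before relying on it):

```python
from urllib.parse import quote

def pageviews_url(project, article, start, end,
                  access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews URL (assumed Wikimedia REST API format)."""
    base = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
    return (f"{base}/{project}/{access}/{agent}/"
            f"{quote(article, safe='')}/{granularity}/{start}/{end}")

# Matches the query in the link above: en.wikipedia, all-access, user agent,
# 2020-01-01 through 2020-01-23.
url = pageviews_url("en.wikipedia", "Coronavirus", "20200101", "20200123")
print(url)
```

Percent-encoding the article title matters for titles containing spaces or slashes, which is why `quote(..., safe='')` is used rather than raw string interpolation.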
Update: I have now pulled the data up to January 28th and will continue to build the dataset so that I stay ahead of the normal deletion process. I assume that restoring the deletion would mean that January 22nd is the oldest day for which data exists, but correct me if I'm wrong.
Correct: the retention window is 90 days, so January 23 is now the oldest retained day :) @Nuria Let me know if I should re-enable deletion.
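The 90-day window above is simple date arithmetic; a small sketch (the specific date used is illustrative only, chosen because 90 days before it lands on January 23, consistent with the comment above):

```python
from datetime import date, timedelta

RETENTION_DAYS = 90  # webrequest retention window mentioned in this thread

def oldest_retained_day(today: date) -> date:
    """Oldest day still present once the purge job runs on `today`."""
    return today - timedelta(days=RETENTION_DAYS)

# Illustrative: 90 days before 2020-04-22 is 2020-01-23 (2020 is a leap year,
# so February contributes 29 days to the count).
print(oldest_retained_day(date(2020, 4, 22)))  # 2020-01-23
```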
Change 591935 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Re-enable webrequest data deletion job
Change 591935 merged by Ottomata:
[operations/puppet@production] Re-enable webrequest data deletion job