
Publish dump scraper reports
Closed, Resolved · Public

Description

We have produced a series of raw analyses of Cite and Kartographer usage across all Wikimedia wiki pages by scraping the June 2023 HTML page dumps. These currently live in Cloud VPS, attached to the private instance where we produced them, and need to be published to a permanent, public location.

The files are hefty at 3.4 GB total, which needs to be considered when choosing the hosting.

Our current understanding is that we can make the data permanently available by uploading the files to the Analytics clients, where they will be served on the web under https://analytics.wikimedia.org/published/datasets/one-off/ , and then publishing pure metadata about the collection to Figshare, linking back to the files hosted by WMF.
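The mechanism described above can be sketched roughly as follows. This is a minimal sketch, assuming the usual convention that files placed under a published root on an analytics client are mirrored to the web; the dataset directory name and the placeholder file are made up for illustration, and the demo defaults to a temp directory so it runs anywhere (on a real stat host the root would be /srv/published).

```shell
# Sketch only: dataset name "cite-kartographer-2023-06" and the sample
# file are assumptions, not the real report filenames.
ROOT="${PUBLISHED_ROOT:-$(mktemp -d)}"   # /srv/published on a real stat host
DEST="$ROOT/datasets/one-off/cite-kartographer-2023-06"
mkdir -p "$DEST"
printf 'wiki,pages,cite_refs\n' > "$DEST/summary.csv"
# Anything under the published root is periodically synced to
# https://analytics.wikimedia.org/published/
ls "$DEST"
```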

What we want to publish

- Draft publication metadata
- Explanation of columns
- Summary spreadsheet comparing all wikis
- Sample summary of a single wiki: P51077 (total uncompressed size 24 MB)
- Sample of per-page data: P51076 (total uncompressed size 3.3 GB)
- Sample of per-page map externaldata: P51080 (total uncompressed size 26 MB)

Published links

Related Objects

37 related tasks, all Resolved. Assignees include awight, thiemowmde, taavi, Andrew, and Protsack.stephan; one was a bug report.

Event Timeline

Next step is for me to upload a small sample of the data to a Phabricator paste, then write to wiki-research-l to ask for recommendations. If nobody responds, we might just upload to Figshare or Zenodo.

Request for advice posted as:
https://lists.wikimedia.org/hyperkitty/list/wiki-research-l@lists.wikimedia.org/message/M3IDFYT44O2NDGKKU7FG5Q25YTY4KGCS/

awight renamed this task from "Preserve dump scraper reports" to "Publish dump scraper reports". Aug 23 2023, 7:40 AM
awight moved this task from Backlog to Doing on the WMDE-TechWish-Maintenance-2023 board.
awight updated the task description.
awight moved this task from Doing to Watching / Epic on the WMDE-TechWish-Maintenance-2023 board.

@BTullis @Stevemunene
I'm homing in on https://analytics.wikimedia.org/published/datasets/one-off/ as a final resting place for this data set, and wanted to check with you first whether it's okay to host an additional 3.4 GB on that filesystem.

@awight: @BTullis is out this week and we really want his input on this. Sorry for the delay, we'll try to move this forward early next week.

@Gehel Thanks for the acknowledgement! There's no huge rush, waiting a week or two to hear back is fine. We know we'll publish the data somewhere, so we're not blocked on creating the metadata and other tasks.

@BTullis @Gehel Let us know how we can help with the decision about where to publish this data. Everything is ready on our side.

Hi @awight - Apologies for the delay in getting back to you about this.

I've checked, and everything is fine for you to proceed from our side; your understanding of the technical side is correct.
We currently have 95 GB free on the file system hosting these datasets, so an extra 3.4 GB will be fine. I don't think yours will even be an outlier: one dataset is over 500 GB, and tens of GB isn't uncommon either.

You can choose whichever stat server you wish to host the original copy on, but I would suggest one of the higher numbers like stat1009, as we're gradually refreshing them in numerical order.
This is entirely up to you, though, and you're free to go with whichever server is most convenient for you.

Let us know if anything doesn't behave as expected and I'll look into it for you.

The data and metadata are published; the final step is to announce on the wiki-research-l mailing list.

awight moved this task from Doing to Done on the WMDE-TechWish-Maintenance-2023 board.