
RBrounley_WMF (Ryan)
User

Projects

User does not belong to any projects.

User Details

User Since
May 29 2020, 11:52 PM (10 w, 3 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
RBrounley (WMF) [ Global Accounts ]

Recent Activity

Tue, Jul 14

RBrounley_WMF added a comment to T257480: Sample HTML Dumps - Request for feedback.

English Wiki has 15m articles (I believe)
a full enwiki dump is clocking in at 944gb or something insanely large

I'm pretty sure a large part of this issue comes from how redirects are handled rather than from the compression format. Enwiki has 9.3M redirects. Right now the HTML of an article is fully reproduced for each redirect (i.e. not just a redirect to [[article]] but the full text of that article as the reader would see it). English Wikipedia has just over 6M articles in the classic sense, so reproducing the full article text in the redirects is probably what inflates the dump to 15M full articles and a very large file (as opposed to 6M full articles and ~9M very tiny files that just indicate that they are redirects).
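
As a rough back-of-the-envelope check of the reasoning above, the sketch below estimates how large the dump might be if redirects were written as small stubs. The 200-byte stub size, the decimal reading of "gb", and the idea that every page currently in the dump has the same average size are all assumptions made for illustration, not figures from the task.

```python
GB = 10**9  # assumption: decimal gigabytes; the comment does not specify the unit


def estimate_dump_size(total_bytes, n_articles, n_redirects, stub_bytes):
    """Estimate the dump size if redirects were stored as small stubs.

    Simplifying assumption: every page in the current dump (article or
    redirect) has the same average size.
    """
    avg_page = total_bytes / (n_articles + n_redirects)
    return n_articles * avg_page + n_redirects * stub_bytes


# Figures quoted in the thread: ~944 GB, ~6M articles, ~9.3M redirects.
current = 944 * GB
estimate = estimate_dump_size(current, 6_000_000, 9_300_000, stub_bytes=200)
print(f"estimated size with redirect stubs: {estimate / GB:.0f} GB")
```

Under these assumptions the dump would shrink to well under half of its current size, which is consistent with the point that redirects, not the compression format, dominate the total.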

Tue, Jul 14, 2:25 PM · Analytics-Radar, Dumps-Generation

Jul 10 2020

RBrounley_WMF added a comment to T257480: Sample HTML Dumps - Request for feedback.

A couple of quick thoughts about the format: it would be good for the articles to be written into subdirectories for the larger wikis, so that we don't have hundreds of thousands of files (or millions!) in one directory. See https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/DumpHTML/+/refs/heads/master/dumpHTML.inc#477 for how this was done way back when these were produced by an extension (in 2008). I think they used three levels of subdirs as the default back then, but this could be adjustable depending on the size of the wiki.
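
One way the subdirectory idea could look is sketched below. It derives the directory levels from a hash of the page title; this is an illustrative assumption and is not claimed to match what the DumpHTML extension actually did. The function name `sharded_path`, the use of SHA-1, and the `%2F` escaping of slashes are all hypothetical choices.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(title, depth=3, suffix=".html"):
    """Return a relative path such as 'a/b/c/<title>.html' for a page title.

    Each directory level is one hex digit of the title's SHA-1 digest, so
    every level has at most 16 entries and `depth` levels give 16**depth
    leaf directories. `depth` is adjustable per wiki.
    """
    digest = hashlib.sha1(title.encode("utf-8")).hexdigest()
    shards = list(digest[:depth])
    # Subpage titles contain '/', which would otherwise create extra levels.
    safe_name = title.replace("/", "%2F") + suffix
    return PurePosixPath(*shards, safe_name)


print(sharded_path("Albert Einstein"))
print(sharded_path("Small wiki page", depth=1))
```

With the default of three levels there are 4096 leaf directories, so roughly 6M article files would average about 1,500 files per directory.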

Although the large tech partners that will consume these dumps will probably be fine with one large gz tarball, we want these to be easily usable by volunteers and researchers too. So I'd consider also providing them in a format that makes parallel processing of the dumps possible, such as bz2 multistream format with 100 or 1000 pages per 'stream', maybe without any tarring up at all. It might also be nice to have a closing html tag; the sample articles I looked at didn't have one.
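
To illustrate why a multistream layout helps with parallel processing, here is a minimal sketch using Python's standard `bz2` module. Each batch of pages is compressed as its own complete bz2 stream and the streams are concatenated, so a worker that knows a stream's byte offset can decompress just that stream. The NUL separator between pages, the in-memory index, and the function names are assumptions for illustration, not the layout of any existing dump.

```python
import bz2
import io

SEP = "\0"  # hypothetical page separator inside a stream


def write_multistream(pages, out, pages_per_stream=100):
    """Write (title, html) pairs to `out` as concatenated bz2 streams.

    Returns an index: a list of (byte_offset, [titles]) with one entry
    per stream.
    """
    index = []
    batch = []

    def flush():
        if not batch:
            return
        payload = SEP.join(html for _, html in batch).encode("utf-8")
        index.append((out.tell(), [title for title, _ in batch]))
        out.write(bz2.compress(payload))  # one self-contained bz2 stream
        batch.clear()

    for page in pages:
        batch.append(page)
        if len(batch) == pages_per_stream:
            flush()
    flush()  # write the final, possibly shorter, batch
    return index


def read_stream(f, offset, next_offset=None):
    """Decompress only the stream that starts at `offset`."""
    f.seek(offset)
    data = f.read() if next_offset is None else f.read(next_offset - offset)
    return bz2.decompress(data).decode("utf-8").split(SEP)


pages = [(f"Page {i}", f"<html><body>{i}</body></html>") for i in range(250)]
buf = io.BytesIO()
index = write_multistream(pages, buf, pages_per_stream=100)
second = read_stream(buf, index[1][0], index[2][0])
print(len(index), len(second))
```

Because the result is still a valid multi-stream bz2 file, tools that read the whole file sequentially keep working, while index-aware consumers can hand different offsets to different workers.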

Jul 10 2020, 3:31 PM · Analytics-Radar, Dumps-Generation

Jul 8 2020

RBrounley_WMF updated subscribers of T257480: Sample HTML Dumps - Request for feedback.
Jul 8 2020, 6:25 PM · Analytics-Radar, Dumps-Generation
RBrounley_WMF added a subtask for T254275: HTML Dumps - June/2020: T257480: Sample HTML Dumps - Request for feedback.
Jul 8 2020, 4:50 PM · Analytics-Radar, Platform Engineering, Dumps-Generation
RBrounley_WMF added a parent task for T257480: Sample HTML Dumps - Request for feedback: T254275: HTML Dumps - June/2020.
Jul 8 2020, 4:50 PM · Analytics-Radar, Dumps-Generation
RBrounley_WMF created T257480: Sample HTML Dumps - Request for feedback.
Jul 8 2020, 4:50 PM · Analytics-Radar, Dumps-Generation

Jul 7 2020

RBrounley_WMF closed T255524: HTML Dumps 429 error on RESTBase endpoints, a subtask of T254275: HTML Dumps - June/2020, as Resolved.
Jul 7 2020, 12:59 AM · Analytics-Radar, Platform Engineering, Dumps-Generation
RBrounley_WMF closed T255524: HTML Dumps 429 error on RESTBase endpoints as Resolved.
Jul 7 2020, 12:58 AM · Traffic, Operations

Jun 24 2020

RBrounley_WMF added a comment to T254275: HTML Dumps - June/2020.

Yep, sorry about the delay here @Sj. @Kelson, this is really interesting to learn about. I'd love to learn more about your work and how we might best collaborate with each other and fill some of the technical gaps. I'll ping you off-phab with some questions once I've done some more reading, and if you're available earlier than your (great-sounding) techtalk I'd love to have a quick video-chat meeting with you. And thank you for your patience whilst I'm digging into the many years of history here!

Jun 24 2020, 5:40 PM · Analytics-Radar, Platform Engineering, Dumps-Generation

Jun 16 2020

RBrounley_WMF added a comment to T254275: HTML Dumps - June/2020.

Great, thanks @CDanis - cited you here on the sub-task related to the 429 errors we're getting. https://phabricator.wikimedia.org/T255524

Jun 16 2020, 3:16 AM · Analytics-Radar, Platform Engineering, Dumps-Generation
RBrounley_WMF added a subtask for T254275: HTML Dumps - June/2020: T255524: HTML Dumps 429 error on RESTBase endpoints.
Jun 16 2020, 3:15 AM · Analytics-Radar, Platform Engineering, Dumps-Generation
RBrounley_WMF added a parent task for T255524: HTML Dumps 429 error on RESTBase endpoints: T254275: HTML Dumps - June/2020.
Jun 16 2020, 3:15 AM · Traffic, Operations
RBrounley_WMF created T255524: HTML Dumps 429 error on RESTBase endpoints.
Jun 16 2020, 3:14 AM · Traffic, Operations

Jun 15 2020

RBrounley_WMF added a comment to T254275: HTML Dumps - June/2020.

@ArielGlenn - oh great, yeah I misunderstood that. So the first run is obviously expensive on RESTBase, since it has to grab all of the pages, but after that we're thinking about listening to Kafka through the endpoint below or something similar, and then updating the dump via an upsert-type approach that hits RESTBase only for the changes... @Ottomata, @Milimetric - want to make sure I have this right from our call. For now, we're running these bi-weekly and still designing how the second dumps will work haha.

https://stream.wikimedia.org/?doc#/Streams/get_v2_stream_recentchange
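
The upsert idea described above could be sketched roughly as follows. This is not the team's implementation: `apply_change`, the dict-based `store`, and the injected `fetch_html` callable are hypothetical, and the event fields used (`wiki`, `namespace`, `type`, `title`) are assumed to follow the recentchange stream's schema. Deletions and page moves are deliberately left out of this sketch.

```python
def apply_change(store, event, fetch_html, wiki="enwiki"):
    """Upsert one recentchange-style event into `store` (title -> HTML).

    Only edits and page creations in the main namespace of `wiki` trigger
    a fetch; everything else is ignored. Returns True if `store` changed.
    """
    if event.get("wiki") != wiki:
        return False
    if event.get("namespace") != 0:
        return False
    if event.get("type") not in ("edit", "new"):
        return False  # log entries, categorization, etc. are skipped here
    title = event["title"]
    store[title] = fetch_html(title)  # re-render only the changed page
    return True


# Stand-in for a RESTBase request, so the sketch runs without network access.
def fake_fetch(title):
    return f"<html><body>{title} (fresh)</body></html>"


store = {"Foo": "<html><body>Foo (stale)</body></html>"}
events = [
    {"wiki": "enwiki", "namespace": 0, "type": "edit", "title": "Foo"},
    {"wiki": "enwiki", "namespace": 0, "type": "new", "title": "Bar"},
    {"wiki": "dewiki", "namespace": 0, "type": "edit", "title": "Baz"},
    {"wiki": "enwiki", "namespace": 1, "type": "edit", "title": "Talk:Foo"},
]
updated = [apply_change(store, e, fake_fetch) for e in events]
print(updated, sorted(store))
```

In a real consumer the events would arrive from the stream rather than a list, and `fetch_html` would be the only place that touches RESTBase, which keeps the request volume proportional to the number of changed pages.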
Jun 15 2020, 10:23 PM · Analytics-Radar, Platform Engineering, Dumps-Generation
RBrounley_WMF updated subscribers of T254275: HTML Dumps - June/2020.

Hey all -

Jun 15 2020, 10:09 PM · Analytics-Radar, Platform Engineering, Dumps-Generation

Jun 2 2020

RBrounley_WMF created T254275: HTML Dumps - June/2020.
Jun 2 2020, 7:27 PM · Analytics-Radar, Platform Engineering, Dumps-Generation