
[Spike] Quantify pages and revisions as relevant to dumps
Closed, Resolved · Public · 3 Estimated Story Points

Description

To inform the general work of dumps in the new architecture (Event Platform -> Spark -> XML Publishing), do a quick analysis of:

  • number of pages by wiki and namespace
  • revisions per page by namespace
  • anything else that might be easy to count and useful

Details

Other Assignee
VirginiaPoundstone

Event Timeline

WDoranWMF set the point value for this task to 3. Aug 24 2023, 2:18 PM

First, some setup.

A working table, called analyze_page_statistics, with total edits, bot edits, and anon edits for each page on each wiki:

create table analyze_page_statistics as

select 'history-2023-07' as source,
       wiki_db,
       page_namespace_historical as page_namespace,
       page_id,
       count(revision_id) as edits,
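       -- non-empty array means the editor was identified as a bot (by name or group) at the time of the edit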
       sum(if(size(event_user_is_bot_by_historical) > 0, 1, 0)) as bot_edits,
       sum(if(event_user_is_anonymous, 1, 0)) as anon_edits
  from wmf.mediawiki_history
 where snapshot='2023-07'
   and not page_is_deleted
   and not page_is_redirect
 group by
       wiki_db,
       page_namespace_historical,
       page_id

In a previous attempt I also imported a dumps snapshot, but the numbers were very similar, so I didn't dig too hard into it. I also realized that a bunch of pages had been deleted, so I rebuilt the table without them (the query above shows the second attempt).

With this, I can run some queries.

The above gives us 524,283,851 pages across all projects and namespaces to play with.
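
For reference, that figure is just the row count of analyze_page_statistics (one row per wiki, namespace, and page), so anyone can reproduce it, or break it down per wiki and namespace, with something along these lines:

 select count(1) as pages
   from analyze_page_statistics

 -- or, split by wiki and namespace
 select wiki_db,
        page_namespace,
        count(1) as pages
   from analyze_page_statistics
  group by wiki_db, page_namespace
  order by pages desc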

First, I wanted to know how many revisions the vast majority of pages have [1].

content    p999    max_edits    max_bot_edits    max_anon_edits
true        655       160835            56352             34104
false       284      1932417           831710            361350

So 99.9% of content pages have at most 655 edits, and the most edits any content page has is 160,835. Non-content pages, on the other hand, have an even steeper power curve, with almost two million edits as the max. On some of these pages bots are a significant presence, but not the majority, so we can ignore that dimension for now and focus on overall edits.
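
For anyone who wants to double-check the bot observation, a quick sketch against the working table (not run as part of this analysis) could look like:

 select wiki_db,
        page_namespace,
        page_id,
        edits,
        bot_edits,
        anon_edits,
        round(bot_edits / edits, 2) as bot_share
   from analyze_page_statistics
  where edits > 100000
  order by edits desc
  limit 50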

So let's see which namespaces, on which wikis, have pages with more than 100k edits [2].

wiki            ns    content    max_edits
enwiki           4    false        1932417
dewiki           4    false         877030
metawiki         2    false         751218
commonswiki      4    false         714667
zhwiki           4    false         556862
wikidatawiki     4    false         509245
enwiki           2    false         467614
ruwiki           4    false         404976
enwiki          11    false         395998
hewiki           4    false         385639
eswikinews      10    false         335821
eswiki           4    false         318701
dewiki          10    false         314372
nlwiki           4    false         268003
frwiki           4    false         266421
enwikinews      10    false         232797
frwikinews      10    false         227120
frwiki          10    false         224636
fawiki          10    false         224223
itwikinews      10    false         217246
plwiki           4    false         214497
test2wiki        3    false         188785
plwiki         102    false         186295
commonswiki      2    false         179321
enwiki           3    false         163316
ruwiki           2    false         161601
metawiki         0    true          160835
enwiki           1    false         154359
ptwikinews      10    false         148158
dewiki           5    false         142461
hywiki          10    false         133181
enwiki           5    false         125916
svwiki           4    false         122425
hywiktionary    10    false         122317
eswiki          10    false         122075
hywikiquote     10    false         121967
dewikinews      10    false         119601
eowikinews      10    false         117548
enwiki          10    false         107212
frwiki         102    false         105516
ruwiki         104    false         104401

Some detail on the namespaces that show up here [3]:

ns     ns name          wikis
0      (main)           metawiki
1      Talk             enwiki
2      User             commonswiki ruwiki metawiki enwiki
3      User talk        test2wiki enwiki
4      Project          commonswiki wikidatawiki hewiki svwiki zhwiki frwiki eswiki ruwiki dewiki plwiki nlwiki enwiki
5      Project talk     dewiki enwiki
10     Template         hywikiquote enwikinews frwiki dewiki frwikinews enwiki eswikinews hywiktionary itwikinews hywiki eowikinews eswiki fawiki ptwikinews dewikinews
11     Template talk    enwiki
102    Wikiprojekt      plwiki
102    Projet           frwiki
104    Проект           ruwiki

So only a single wiki has a content page with more than 100k edits (meta, which makes sense). To me this is pretty convincing evidence that we could create a simpler dump where we only output the last 1k edits for any non-content page, and then, once in a while, dump everything for those namespaces too. It's something to talk to the community about, but it feels like being more mindful here could dramatically increase the job's performance and, therefore, its frequency.
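
As a rough sketch of what "last 1k edits per non-content page" could look like in the Spark job, the cutoff could be applied with a window over revisions; column names follow wmf.mediawiki_history as used in the setup query above, and the 1,000 cutoff is just the number floated here:

   with last_revisions as (
 select *,
        row_number() over (partition by wiki_db, page_id
                           order by event_timestamp desc) as recency_rank
   from wmf.mediawiki_history
  where snapshot = '2023-07'
    and event_entity = 'revision'
    -- a real job would also join wmf_raw.mediawiki_project_namespace_map
    -- and only apply the cutoff where namespace_is_content is false
)
 select *
   from last_revisions
  where recency_rank <= 1000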

For more analysis, anyone can use this table, so if you have questions, do ask!

[1] overall stats by content/not-content

   with namespaces as (
 select dbname, namespace, namespace_is_content
   from wmf_raw.mediawiki_project_namespace_map
  where snapshot = '2023-08'
)
 select /*+ BROADCAST(namespaces) */
        namespace_is_content,
        approx_percentile(edits, 0.75) as edits_p75,
        approx_percentile(edits, 0.999) as edits_p999,
        max(edits) as edits_max,
        approx_percentile(bot_edits, 0.75) as bot_edits_p75,
        approx_percentile(bot_edits, 0.999) as bot_edits_p999,
        max(bot_edits) as bot_edits_max,
        max(anon_edits) as anon_edits_max
   from analyze_page_statistics
            inner join
        namespaces                              on dbname = wiki_db
                                                and namespace = page_namespace
  group by namespace_is_content

[2] which namespaces, on which wikis, have pages with more than 100,000 edits?

   with namespaces as (
 select dbname, namespace, namespace_is_content
   from wmf_raw.mediawiki_project_namespace_map
  where snapshot = '2023-08'
)
 select /*+ BROADCAST(namespaces) */
        wiki_db,
        page_namespace,
        namespace_is_content,
        max(edits) as max_edits
   from analyze_page_statistics
            inner join
        namespaces                  on namespaces.namespace = page_namespace
                                    and namespaces.dbname = wiki_db
  where edits > 100000
  group by
        wiki_db,
        page_namespace,
        namespace_is_content

[3] namespace details

   with namespaces as (
 select dbname, namespace, namespace_canonical_name
   from wmf_raw.mediawiki_project_namespace_map
  where snapshot = '2023-08'
),      wikis as (
 select wiki_db,
        page_namespace,
        max(edits) as max_edits
   from analyze_page_statistics
  group by wiki_db, page_namespace
 having max(edits) > 100000
)
 select /*+ BROADCAST(namespaces) */
        collect_set(wiki_db) as wikis,
        page_namespace,
        namespace_canonical_name
   from wikis
            inner join
        namespaces                  on namespaces.namespace = page_namespace
                                    and namespaces.dbname = wiki_db
  group by
        page_namespace,
        namespace_canonical_name

> This is pretty convincing evidence to me that we could maybe create a simpler dump, where we only output the last 1k edits for any non-content page.

Should we open a task to investigate this before we close this one? I'm thinking we could run both jobs, the 1k-capped dump and the full dump, and observe whether there is a significant difference in performance and runtime.
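
One cheap way to size that difference before building anything: compare, from the working table, how many revisions a full dump would have to write for non-content pages versus a dump capped at the last 1,000 revisions per page. A sketch, reusing the namespace-map join from [1]:

   with namespaces as (
 select dbname, namespace, namespace_is_content
   from wmf_raw.mediawiki_project_namespace_map
  where snapshot = '2023-08'
)
 select /*+ BROADCAST(namespaces) */
        sum(edits) as revisions_full,
        sum(least(edits, 1000)) as revisions_capped_at_1k
   from analyze_page_statistics
            inner join
        namespaces              on dbname = wiki_db
                                and namespace = page_namespace
  where not namespace_is_content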