
Parse wikidumps and extract redirect information for 1 small wiki, Romanian
Closed, ResolvedPublic

Description

Parse wiki dumps and extract redirect information for 1 small wiki, Romanian.

We provide info in mediawiki history on whether a page is currently a redirect or not, but we don't have historical info about it. The dumps, however, have this information, so we can parse them and extract
redirect information for historical revisions of pages. The catch is that since MediaWiki is multilingual, the redirect code depends on the language.

This work needs to be coded in a distributed fashion in PySpark or similar, using data in Hadoop, rather than as a one-off job on the stats machine.

Event Timeline

Nuria created this task.Sep 5 2019, 4:37 PM
Restricted Application added a subscriber: Aklapper. Sep 5 2019, 4:37 PM
Nuria updated the task description. Sep 5 2019, 9:48 PM
Restricted Application added a subscriber: Strainu. Sep 5 2019, 9:48 PM
fdans assigned this task to leila.Sep 9 2019, 3:54 PM
fdans moved this task from Incoming to Radar on the Analytics board.
MGerlach claimed this task.Sep 12 2019, 9:03 AM
MGerlach added subscribers: leila, MGerlach.

Martin will work on this project as part of his onboarding.

leila triaged this task as Medium priority.Sep 12 2019, 3:03 PM

Language-dependent Redirect Codes

We can extract the aliases for the redirect code from the corresponding *-siteinfo-namespaces.json.gz dump.
This file contains a dictionary in which we can find all the redirect codes for every wiki.
Playing with a few examples, I obtain the following results.

../data/ikwiki-20190901-siteinfo-namespaces.json.gz
['#REDIRECT']
../data/dewiki-20190901-siteinfo-namespaces.json.gz
['#WEITERLEITUNG', '#REDIRECT']
../data/enwiki-20190901-siteinfo-namespaces.json.gz
['#REDIRECT']
../data/frwiki-20190901-siteinfo-namespaces.json.gz
['#REDIRECTION', '#REDIRECT']
../data/rowiki-20190901-siteinfo-namespaces.json.gz
['#REDIRECTEAZA', '#REDIRECT']
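A minimal sketch of how the aliases listed above could be read from such a dump. It assumes the file follows the standard MediaWiki siteinfo API layout (a top-level "query" object with a "magicwords" list); the paths and filenames are illustrative, not the author's actual code.

```python
# Hypothetical sketch: read redirect magic-word aliases from a *-siteinfo-namespaces.json.gz dump.
# Assumes the standard siteinfo API layout ("query" -> "magicwords"); verify against a real file.
import gzip
import json

def redirect_aliases(path):
    """Return the redirect aliases declared in a siteinfo dump (e.g. ['#REDIRECTEAZA', '#REDIRECT'])."""
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        siteinfo = json.load(f)
    for magicword in siteinfo.get("query", {}).get("magicwords", []):
        if magicword.get("name") == "redirect":
            return magicword.get("aliases", [])
    return []

print(redirect_aliases("../data/rowiki-20190901-siteinfo-namespaces.json.gz"))
# expected, per the listing above: ['#REDIRECTEAZA', '#REDIRECT']
```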

This is a great finding!
This file does not contain only the redirect aliases, but also every other alias that might be useful :)
Awesome

@JAllemandou
I came up with a first solution in Spark (see attached notebooks; I ran this on the notebook server).
This creates a dataframe with all revision entries that are identified as redirects based on their content (page_id, revision_id, redirect_page).
I tested it on rowiki and it runs in no time.
I extract the redirect aliases automatically, so in principle this could be applied to any wiki.

Happy to discuss how to improve the Spark usage.
Let me know what would be a good way to iterate on these results.
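Since the attached notebooks are not reproduced here, the following is only a minimal PySpark sketch of the kind of extraction described, assuming wmf.mediawiki_wikitext_history exposes wiki_db, snapshot, page_id, revision_id and revision_text, and that the alias list comes from the siteinfo dump; the snapshot value and names are illustrative.

```python
# Minimal sketch of extracting redirect revisions from the wikitext history (not the attached notebook).
# Assumed columns: wiki_db, snapshot, page_id, revision_id, revision_text.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("rowiki-redirects").getOrCreate()

aliases = ["#REDIRECTEAZA", "#REDIRECT"]  # taken from the rowiki siteinfo dump
# one case-insensitive pattern capturing the matched alias and the target page title
pattern = r"(?i)^\s*(" + "|".join(aliases) + r")\s*\[\[([^\]\|#]+)"

revisions = (
    spark.table("wmf.mediawiki_wikitext_history")
    .where((F.col("wiki_db") == "rowiki") & (F.col("snapshot") == "2019-09"))
)

redirects = (
    revisions
    .withColumn("redirect_tag", F.regexp_extract("revision_text", pattern, 1))
    .withColumn("redirect_page", F.regexp_extract("revision_text", pattern, 2))
    .where(F.col("redirect_tag") != "")
    .select("page_id", "revision_id", "redirect_tag", "redirect_page")
)

redirects.show(10, truncate=False)
```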

Hi @MGerlach,
Awesome results :)

I have some requests for improvement before we plan on how we move forward:

  • Can you confirm which spark-kernel you use? I assume it's PySpark Yarn (large) but I would rather be sure
  • Let's remove the filter for namespace 0, as we're interested in all namespaces
  • Let's have a precise time measurement (and a manual check of how busy the cluster is when the job runs), to better evaluate the performance cost
  • Finally global statistics would be very welcome:
    • How many revisions?
    • How many redirects? By redirect-tag?

Many thanks for the great work :)

MGerlach added a comment. Edited Sep 23 2019, 1:30 PM

Thanks for the feedback @JAllemandou

Regarding your questions:

  • yes, I used PySpark Yarn (large)
  • looking at all namespaces is ok
  • I will make some systematic checks for running time
    • I would simply use %%time

CPU times: user 308 ms, sys: 36 ms, total: 344 ms
Wall time: 28.9 s

    • is there an alternative (preferred) way? (a possible approach is sketched after this list)
  • I will compute some global statistics, which we can easily get from the derived dataframes.
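A minimal sketch of one way to time the Spark action explicitly and compute the global statistics mentioned above, assuming revisions and redirects are dataframes built as in the earlier sketch (illustrative, not the attached notebook):

```python
# Sketch: time the action explicitly and compute global statistics.
# Spark transformations are lazy, so only the actions (count) are worth timing.
import time

start = time.time()
n_revisions = revisions.count()
n_redirects = redirects.count()
print(f"revisions: {n_revisions}, redirects: {n_redirects}, "
      f"elapsed: {time.time() - start:.1f}s")

# redirects broken down by the alias (redirect-tag) that was matched
redirects.groupBy("redirect_tag").count().show()
```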

Questions from my side:

  • For large wikis (e.g. frwiki) I get a memory error
    • the following error: "Container killed by YARN for exceeding memory limits. 6.5 GB of 6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead."
    • What I don't understand is that when using LIMIT = 10,000,000 in the query, the error goes away, which might seem natural since I limit the number of results from the query. However, I only get ~2,000,000 results from the query, which is much smaller than 10,000,000.
    • as a result I do not fully understand how the LIMIT statement prevents the memory overflow.
  • if the code ran, would the idea be to add a column to the mediawiki_history table which states the redirect for each revision_id?
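For reference, the error message points at the YARN executor memory overhead. A sketch of how that setting could be raised when the session is created; the values are illustrative, and on the notebook servers the kernel configuration may already fix these settings, so this is an assumption to verify rather than a recommended fix.

```python
# Sketch: raise the memory overhead suggested by the YARN error message.
# Values are illustrative; the notebook kernel may override session configs.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("frwiki-redirects")
    .config("spark.yarn.executor.memoryOverhead", "2048")  # MiB
    .config("spark.executor.memory", "6g")
    .getOrCreate()
)
```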
MGerlach added a comment. Edited Sep 24 2019, 3:58 PM

Memory error persists

@JAllemandou
Main problem: memory error for large (and even not super large) wikis such as frwiki.
I implemented some of your suggestions from the discussion today with Andrew:

  • processing a single query
  • only keeping a minimal amount of text (substrings of the redirect command and the redirect page title)
  • not saving as pandas, but simply applying the count() function to see how many results we get.

Attached is a new notebook (executed with PySpark - YARN (large)).
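A minimal sketch of the adjustments listed above, reusing a frwiki revisions dataframe built as in the earlier sketch: match on a short prefix of the text only, and finish with count() instead of collecting to pandas. Names and the prefix length are assumptions, not the notebook's actual code.

```python
# Sketch: keep only a short substring of the wikitext and count the matches
# instead of materializing them with toPandas().
import pyspark.sql.functions as F

pattern = r"(?i)^\s*(#REDIRECTION|#REDIRECT)\s*\[\[([^\]\|#]+)"  # frwiki aliases

n_redirects = (
    revisions  # frwiki revisions, built as in the earlier sketch
    .withColumn("text_head", F.substring("revision_text", 1, 512))
    .where(F.regexp_extract("text_head", pattern, 1) != "")
    .count()
)
print(n_redirects)
```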

leila moved this task from Staged to In Progress on the Research board.Sep 26 2019, 3:32 PM
leila added a comment.Sep 26 2019, 3:35 PM

@JAllemandou Martin and I had a chat now about this task. Can you give an update on what is left for Martin to do on this task? (I'm aware of the memory issues and my understanding is that that's not something you want Martin to work on. If that's not the case, please let us know as well.)

Matching data to the mediawiki_history table

The historical redirect table is extracted from wmf.mediawiki_wikitext_history.
The above code extracts, for each revision_id, the redirect command (e.g. #redirect, #REDIRECT, or #Weiterleitung) and the redirect page (i.e. where it redirects to).
My aim was to write code that could join that information into the wmf.mediawiki_history table for a single snapshot of a given wiki project (see the notebook).
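A minimal sketch of such a join, assuming the revision rows of wmf.mediawiki_history carry wiki_db, snapshot, event_entity, revision_id and page_title as in the public schema, and that redirects is the dataframe from the earlier sketch; this is illustrative, not the notebook's code.

```python
# Sketch: join the extracted redirect info onto revision rows of
# wmf.mediawiki_history for one wiki and snapshot (assumed schema).
import pyspark.sql.functions as F

history = (
    spark.table("wmf.mediawiki_history")
    .where(
        (F.col("wiki_db") == "rowiki")
        & (F.col("snapshot") == "2019-09")
        & (F.col("event_entity") == "revision")
    )
)

history_with_redirects = history.join(
    redirects.select("revision_id", "redirect_tag", "redirect_page"),
    on="revision_id",
    how="left",
)

# revisions with no match keep NULL redirect columns, i.e. they are not redirects
history_with_redirects.select(
    "revision_id", "page_title", "redirect_tag", "redirect_page"
).show(10, truncate=False)
```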

Hi @MGerlach and @leila - the kids and I have been sick for almost all of last week, which explains why I haven't answered quickly.
I have spent time trying to get a precise answer to the memory issue, but couldn't get as precise as I would have expected :(

I confirm what @leila says: This memory issue is not to be solved by @MGerlach, and the current state for that task is good for me as it is :)
If ok for @Ottomata and @Nuria I'm gonna close this task and reopen a new one with the memory issue and the path to production for the feature.
Thanks a lot @MGerlach for the awesome work :)

@JAllemandou thanks for expanding. I've moved this task to the Done lane in the Research board. I'll also remove MGerlach as the assignee per what you described. Please update as you see fit and thanks for your work on this. :)

leila removed MGerlach as the assignee of this task.Sep 30 2019, 6:17 PM

@MGerlach congratulations on finishing your first task. :)

leila closed this task as Resolved.Thu, Dec 5, 7:56 PM