
Outreachy Application Task (Round 24): Build Python library to work with html-dumps
Closed, Resolved · Public

Description

This task is the application microtask for T302237.

Overview

For this task, you're being asked to complete this notebook that explores the difference between two formats of Wikipedia articles (the raw wikitext and parsed HTML): https://public.paws.wmcloud.org/User:Isaac_(WMF)/Outreachy%20Summer%202022/Wikipedia_HTML_Dumps.ipynb

Using your knowledge of Python and data analysis, do your best to complete the notebook. Imagine that your audience for the notebook is people who are new to Wikimedia data analysis (very possibly like you before starting this task) and provide lots of details to explain what you are doing and seeing in the data.

The full Outreachy project will involve more comprehensive coding than what is being asked for here with support from your mentors (and some opportunities for additional explorations as desired). This task will introduce some of the basic concepts and give us a sense of your Python skills, how well you work with new data, documentation of your code, and description of your thinking and results. We are not expecting perfection -- just do your best and explain what you're doing and why! For inspiration, see this example of working with Wikipedia edit tag data and Python library created by a past Outreachy participant.

Set-up

  • Make sure that you can login to the PAWS service with your wiki account: https://paws.wmflabs.org/paws/hub
  • Using this notebook as a starting point, create your own notebook (see these instructions for forking the notebook to start with) and complete the functions / analyses. All PAWS notebooks have the option of generating a public link, which can be shared back so that we can evaluate what you did. Use a mixture of code cells and markdown to document what you find and your thoughts.
  • As you have questions, feel free to add comments to this task (and please don't hesitate to answer other applicants' questions if you can help).
  • If you feel you have completed your notebook, you may request feedback and we will provide high-level feedback on what is good and what is missing. To do so, send an email to both of the mentors (mgerlach@wikimedia.org and isaac@wikimedia.org) with the link to your public PAWS notebook. We will try to make time to give this feedback once to anyone who would like it.
  • When you feel you are happy with your notebook, you should include the public link in your final Outreachy project application as a recorded contribution. We encourage you to record contributions as you go as well to track progress.

Event Timeline


Hi @Isaac and @MGerlach, I am new to open source; I have knowledge of Python but do not have much experience with data analysis. I am stuck on the code and need guidance on how I can learn and start to contribute to this project. Can we have a Google Meet?

@Radhika_Saini can you provide more specific details about what you are stuck on? We are unable to do videochats for the application period but if there are specific parts where you are stuck or aspects you do not understand, we and the other applicants can help you figure them out.

I just wanted to clarify: do we just have to extract the content from the sections as text that only contains what is displayed on the Wikipedia page, or do we also include class styles and other descriptions, like the table description that is present in the example I have shared?

@ShivaniSangwan For the # TODO: write a function for extracting the article text, you want to just extract the text of the article (and not all of the syntax etc. information). For an example, see the output from print(wt.strip_code()) a few cells above. There is no right answer here about what the "true" text of the article is, but you want the extracted text to make sense and look mostly like natural language. So the National teamYearAppsGoalsNorth Korea198820198970Total90 from wt.strip_code() is not a great outcome, and neither are the Category: prefixes for links; ideally your function for the HTML would do better.
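To illustrate the kind of custom handling this points toward, here is a minimal, untested sketch that assumes the article HTML is available in a string called html; it drops tables and other non-prose elements before pulling out paragraph text, so table cells don't run together in the output:

from bs4 import BeautifulSoup

def html_to_plaintext(html):
    # rough sketch: extract readable article text from the HTML,
    # skipping tables/styles/references so cells don't get concatenated
    soup = BeautifulSoup(html, 'html.parser')
    for element in soup(['table', 'style', 'link', 'meta', 'sup']):
        element.decompose()
    # join paragraph-level text with spaces so words don't run together
    return ' '.join(p.get_text(' ', strip=True) for p in soup.find_all('p'))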

Also, for the final text preprocessing task, should we process each section differently (say, if it consists of tables or any other different class type) to give a cleaner output, as seen on the actual Wikipedia page?

Yes, feel free to write custom code for handling tables or other elements. It doesn't have to be perfect; just make sure to explain what you are doing via comments/markdown.

I have attached the content output of a section after using wt.get_sections(); it also contains an HTML comment.

Thanks for the example -- you can exclude comments in your output.
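For reference, one way to do that with mwparserfromhell (a sketch, assuming section is one of the Wikicode objects returned by wt.get_sections()) is to remove the comment nodes before any further processing:

# assumes `section` is a Wikicode object from wt.get_sections()
for comment in section.filter_comments():
    section.remove(comment)
# note: wt.strip_code() should drop comments on its own as well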

Hi @Isaac and @MGerlach , is there any reason why this (mwparserfromhell) method considers Category as Wikilinks too?

(screenshot attached)

Hello @Isaac and @MGerlach

I am working on the microtask and was trying to scrape data from html.
While selecting different parts of the HTML I was facing two problems, which I solved with other approaches, but I am still looking for the correct approach.

First, how can we filter out fields like we did in the case of wikitext?
Second, while using Beautiful Soup's functions inside the print function I'm getting the following error, which I'm unable to understand.

Here I was trying to use the .get() function:

print(f'* {count} wikilinks ({", ".join([str(l.get('title',None) for l in links[:2]])}, ...) ')

 Input In [17]
    print(f'* {count} wikilinks ({", ".join([str(l.get('title',None) for l in links[:2]])}, ...) ')
                                                        ^

but it is giving the following error:

SyntaxError: invalid syntax

Thank you!
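If anyone hits the same error: prior to Python 3.12, an f-string delimited with single quotes cannot reuse single quotes inside its expressions, and the brackets in that line are also unbalanced. A small sketch of a workaround (assuming links is a list of bs4 tags and count is already defined) is to build the preview string first:

# assumes `links` is a list of bs4 <a> tags and `count` is already defined
preview = ", ".join(str(a.get("title")) for a in links[:2])
print(f"* {count} wikilinks ({preview}, ...)")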

@Appledora

Hi @Isaac and @MGerlach , is there any reason why this (mwparserfromhell) method considers Category as Wikilinks too?

(screenshot attached)

This is the expected behavior. Categories are also Wikipedia pages (though in a different namespace). In the article, you will see the categories as blue links at the bottom.

@MGerlach , after reading the paper by Mitrevski et al., I had expected there to be less information in the HTML code compared to the wikitext. However, while doing the first to-do, I have found the outcomes to be the opposite. Could you give me any pointers on it? Any supplementary literature would also be appreciated.
Also, for TODO#1, I am trying to replicate some of the mwparserfromhell functions/methods. I just want to clarify whether this is what was essentially asked for in this TODO. Thanks!

@Appledora

Hi @Isaac and @MGerlach , is there any reason why this (mwparserfromhell) method considers Category as Wikilinks too?

(screenshot attached)

This is the expected behavior. Categories are also Wikipedia pages (though in a different namespace). In the article, you will see the categories as blue links at the bottom.

I understand the explanation. However, the HTML code also defines templates with a mw:WikiLink relation, while categories have the relation type mw:PageProp/Category. Of course, I could programmatically combine the category links with wikilinks as well. But in that case, what else should I consider as wikilinks?
An example of what I get by filtering with mw:WikiLink is given below. I hope my question wasn't too stupid :3 Thanks.

(screenshot attached)

In #TODO1,
Do we have to extract this URL: url = "https://en.wikipedia.org/wiki/Chang_Gum-chol"? If I am not wrong,
we then find out the categories which this URL contains, then convert this URL to wikitext, then again find out the total categories, and then compare the counts?

If I am wrong, kindly guide me.

Hi @Radhika_Saini. Have you followed the starter notebook attached to this microtask? If you follow it along and extract the first sample article, you will see that it's a JSON file. This JSON file has a key called article_body, which contains both the wikitext and the HTML-parsed version of that article. If I am not wrong, you only have to process the HTML code to complete your tasks.
@Isaac and @MGerlach , correct me if I am wrong.

Hi all, what I have understood so far is that we have to extract the content from particular HTML dumps. The wikitext versions of those dumps are given, and now we have to extract the HTML version with web scraping to the best of our knowledge and fetch the maximum possible information that we can get from enwikinews-NS0-20220201-ENTERPRISE-HTML.json.tar. Additionally, we have to compare what differences we are getting in both versions (HTML and wikitext) and how things are better or worse in the HTML version. I am trying to fetch the data from the tar file rather than only article_body and am now trying to extract the maximum possible information using web scraping. Kindly guide me on whether I am approaching the task in the right way.
Thank you in advance!!

after reading the paper by Mitrevski et al., I had expected there to be less information in the HTML code compared to the wikitext. However, while doing the first to-do, I have found the outcomes to be the opposite.

@Appledora can you explain more? My takeaway from that work is that the HTML often has much more content.

However, the HTML code also defines templates with a mw:WikiLink relation, while categories have the relation type mw:PageProp/Category. Of course, I could programmatically combine the category links with wikilinks as well. But in that case, what else should I consider as wikilinks?

These are good questions that don't have obvious answers. Taking a step back, ideally a parser gives as much control to the user to decide what they want to keep / remove while having reasonable default behavior. All of the examples you've given above are indeed wikilinks but they point to content in different namespaces. If possible, I'd keep the different types of links separate from each other when you do the parsing (and a user could also decide to add them back together if that's what made sense for their use case). Hope that helps.

I am trying to replicate some of the mwparserfromhell functions/methods. I just want to clarify whether this is what was essentially asked for in this TODO.

Yes, though to my point above, it's perfectly fine to reach a different answer because you choose to provide more fine-grained categories of content. For example, maybe the generic tag makes sense in the wikitext but you'd want to break it down into more specific categories when working with the HTML. If you get a different result though, just make sure to explain why.

Do we have to extract this URL: url = "https://en.wikipedia.org/wiki/Chang_Gum-chol"? If I am not wrong, we then find out the categories which this URL contains, then convert this URL to wikitext, then again find out the total categories, and then compare the counts?

If I am not wrong, you only have to process the HTML code to complete your tasks.

Thanks @Appledora for chiming in. @Radhika_Saini when parsing the HTML, you should just use the HTML text that is in the json object and do not need to inspect other articles etc. to do the parsing. Hope that helps.

now we have to extract the HTML version with web scraping to the best of our knowledge

@Antima_Dwivedi You do not need to scrape the webpages -- the HTML is already present in the HTML dump that is accessible on PAWS without making any additional web requests.

enwikinews-NS0-20220201-ENTERPRISE-HTML.json.tar

That's for Wikinews -- you'll want enwiki-NS0-20220201-ENTERPRISE-HTML.json.tar.gz which is for English Wikipedia.

@Appledora can you explain more? My takeaway from that work is that the HTML often has much more content.

@Isaac This screenshot is directly from WikiHist.html: English Wikipedia's Full Revision History in HTML Format by Mitrevski et al. Upon reading this section, I had assumed that issues with macro expansion in the historical revisions might also exist in other places. Thus I was assuming some form of information loss. I hope I am being coherent here :3

(screenshot of the relevant passage attached)

Upon reading this section, I had assumed that issues with macro expansion in the historical revisions might also exist in other places.

@Appledora thanks for clarifying. The issue mentioned in that passage is specific to generating the HTML for historical revisions -- e.g., taking the wikitext for an article from 2010 and trying to convert it into HTML. That is very difficult to do and will likely result in missing/incorrect content. In this case, the HTML dumps that you are working with are created from the current versions of articles so they don't have this issue and you can assume the HTML is complete and correct.

Thanks, @Isaac, for the explanations. But as you mentioned, and as I have discovered while working on the data, HTML does seem to have more content than the wikitext. Is that owing to the inner workings of the parser, i.e. mwparserfromhell, or is it the actual case? And once again, I really appreciate you bearing with me today.

HTML does seem to have more content than the wikitext. Is that owing to the inner workings of the parser, i.e. mwparserfromhell, or is it the actual case?

It's the actual case. See Figure 4 in the paper for the example of links in wikitext vs. HTML: https://arxiv.org/pdf/2001.10256.pdf#page=6
The reason there are more links in the HTML than appear directly in the wikitext is because many templates add a lot of extra content to Wikipedia articles. So in the wikitext of a Wikipedia article about movies, you might see something like {{Film genres}}. The effect of this on the HTML is adding all the content on this page: https://en.wikipedia.org/wiki/Template:Film_genres (which looks to be well over 100 links). Hope that helps.
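As a rough illustration of that difference, a minimal sketch (assuming wikitext and html hold the two versions of the same article from the dump) could compare the two link counts directly:

import mwparserfromhell
from bs4 import BeautifulSoup

# assumes `wikitext` and `html` hold the two versions of one article
wt = mwparserfromhell.parse(wikitext)
wikitext_links = wt.filter_wikilinks()

soup = BeautifulSoup(html, 'html.parser')
html_links = [a for a in soup.find_all('a') if 'mw:WikiLink' in (a.get('rel') or [])]

# the HTML count is typically larger because transcluded templates add links
print(len(wikitext_links), len(html_links))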

That clears up a lot of things. Thanks, @Isaac !

@Isaac @MGerlach
My question is:
Why is this particular template represented differently in HTML than the rest of the templates? Should I consider this as a template in my implementation of wt.filter_templates()? I know {{}} represents a template.

In wikitext :

{{DEFAULTSORT:Chang, Gum-chol}}

In HTML:

<meta content="Chang, Gum-chol" id="mwOA" property="mw:PageProp/categorydefaultsort"/>

Rest of the templates are in this form (the tags have the data-mw attribute in this form):

<style about="#mwt17" data-mw='{"parts":[{"template":{"target":{"wt":"reflist","href":"./Template:Reflist"},"params":{},"i":0}}]}' data-mw-deduplicate="TemplateStyles:r1011085734" id="mwKQ" typeof="mw:Extension/templatestyles mw:Transclusion">

Why is this particular template represented differently in HTML than the rest of the templates? Should I consider this as a template in my implementation of wt.filter_templates()? I know {{}} represents a template.

@Talika2002 good catch! The reason those curly brackets don't behave like other templates is that it's technically not a template (though obviously it looks very similar). It's called a magic word and there are several that are parsed in a special way from wikitext -> HTML. I don't myself know the different ways in which they will all show up in the HTML (some become just standard text in the article; others affect meta tags as you showed). If you think you can handle their behavior specifically, go for it. Otherwise, I'd just leave a comment/note in your notebook noting their existence. mwparserfromhell unfortunately treats them like a template, which is not technically correct, so you would see a difference between counts of "templates" between the two and that is expected.
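One way to account for this when comparing counts (a sketch, assuming wt is the parsed wikicode; the prefix list below is illustrative, not exhaustive) is to set magic words aside explicitly:

# assumes `wt` is a mwparserfromhell Wikicode object; prefix list is illustrative only
MAGIC_WORD_PREFIXES = ('DEFAULTSORT:', 'DISPLAYTITLE:')
templates, magic_words = [], []
for tpl in wt.filter_templates():
    name = str(tpl.name).strip().upper()
    if name.startswith(MAGIC_WORD_PREFIXES):
        magic_words.append(tpl)
    else:
        templates.append(tpl)
print(len(templates), len(magic_words))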

@Isaac and @Talika2002 , I didn't quite get the question posed here. Could I kindly have some more examples/explanations on it?

@Appledora,

All the templates in the HTML are represented like so:

<div about="#mwt1" class="shortdescription nomobile noexcerpt noprint searchaux" data-mw='{"parts":[{"template":{"target":{"wt":"short description","href":"./Template:Short_description"},"params":{"1":{"wt":"North Korean footballer"}},"i":0}}]}' id="mwAg" style="display:none" typeof="mw:Transclusion">
    North Korean footballer
   </div>
<table about="#mwt2" class="box-Orphan plainlinks metadata ambox ambox-style ambox-Orphan" data-mw='{"parts":[{"template":{"target":{"wt":"Orphan","href":"./Template:Orphan"},"params":{"date":{"wt":"January 2022"}},"i":0}}]}' id="mwBA" role="presentation" typeof="mw:Transclusion">
<style about="#mwt3" data-mw='{"parts":[{"template":{"target":{"wt":"family name hatnote","href":"./Template:Family_name_hatnote"},"params":{"1":{"wt":"Chang"},"2":{"wt":""},"lang":{"wt":"Korean"}},"i":0}}]}' data-mw-deduplicate="TemplateStyles:r1033289096" id="mwBg" typeof="mw:Extension/templatestyles mw:Transclusion">

You get the point. And so I implemented my filter_templates() function to find all the tags that have the attribute 'data-mw' in this format. But what I noticed was that there were only 8 templates being returned by my function, whereas there were 9 being returned by the filter_templates() function of mwparserfromhell. It turns out there was one odd template that was represented in HTML like this:

<meta content="Chang, Gum-chol" id="mwOA" property="mw:PageProp/categorydefaultsort"/>

And so I wasn't sure if I had to consider this as a template or not. Isaac basically answers this question.

Hope this helps.
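For anyone implementing something similar, a minimal sketch (assuming html holds the article HTML; this mirrors the approach described above rather than prescribing it) is to look for any tag with a data-mw attribute whose parts include a template:

import json
from bs4 import BeautifulSoup

# assumes `html` holds the Parsoid HTML of one article
soup = BeautifulSoup(html, 'html.parser')
templates = []
for tag in soup.find_all(attrs={'data-mw': True}):
    data_mw = json.loads(tag['data-mw'])
    parts = data_mw.get('parts', [])
    if any(isinstance(part, dict) and 'template' in part for part in parts):
        templates.append(tag)
print(len(templates))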

@Appledora,

Also, this is what wt.filter_templates() returns:

['{{short description|North Korean footballer}}',
 '{{Orphan|date=January 2022}}',
 '{{family name hatnote|Chang||lang=Korean}}',
 '{{Infobox football biography\n|name           = Chang Gum-chol\n|image          = \n|birth_date     = \n|birth_place    = \n|height         = \n|position       = [[Midfielder]]\n|currentclub    = \n|clubnumber     = \n|youthyears1    = \n|youthclubs1    = \n|years1         = |clubs1 = |caps1 =  |goals1 = \n|nationalyears1 = 1988–1989\n|nationalteam1  = [[North Korea national football team|North Korea]]\n|nationalcaps1  = 9\n|nationalgoals1 = 0\n}}',
 '{{Infobox Korean name\n|context = north\n| hangul  = \n| hanja   = \n| rr      = \n| mr      = \n}}',
 '{{NFT player|pid=84845}}',
 '{{reflist}}',
 '{{DEFAULTSORT:Chang, Gum-chol}}',
 '{{NorthKorea-footy-bio-stub}}']

{{DEFAULTSORT:Chang, Gum-chol}} is represented with {{}} in the wikitext, which mwparserfromhell assumes to be a template. But then why was it represented differently from the rest of the templates in the HTML? It turns out it's not really a template but a magic word.

Hope this clarifies it further.

I went through the thread again and dug around about magic words more :D Thanks both of you!

Hi @Isaac and @MGerlach,
Hope you are doing well.
While forking the notebook, when I add ?format=raw at the end of the URL, it does not give me the option to download in the .ipynb format. Will it be okay if I first download it in PDF format and then convert it into .ipynb via an online converter? If there is any other way to download it, please let me know.
Thank you in advance!!

I'm facing the same problem too; could you share your solution if you solved it?

@Antima_Dwivedi Hi, I noticed you are having problems with downloading the notebook. I hope you're no longer facing them, but here's what I did.

  • Concatenated ?format=raw after the ipynb URL, which brought me to a raw text page.
  • Right-clicked and selected Save as..., which saved the page as a .txt file.
  • Simply renamed the extension .txt -> .ipynb and uploaded.

Hope this helps.

@Vaishnavi_1210 this is how I solved my issue.

Hi @Isaac and @MGerlach,
Hope you are doing well.
While forking the notebook, when I add ?format=raw at the end of the URL, it does not give me the option to download in the .ipynb format. Will it be okay if I first download it in PDF format and then convert it into .ipynb via an online converter? If there is any other way to download it, please let me know.
Thank you in advance!!

I'm facing the same problem too; could you share your solution if you solved it?

Hi @Vaishnavi_1210 , if you are using Firefox, then put your cursor anywhere in the address bar where this URL https://public.paws.wmcloud.org/User:Isaac_(WMF)/Outreachy%20Summer%202022/Wikipedia_HTML_Dumps.ipynb?format=raw is present and press the Alt key.
Then you will be able to see the File option in the topmost corner.
Go to File -> Save Page As... and save it from there.
Hope it helps!!

@Isaac @MGerlach ,

I don't understand what's going on in this <link> tag in the HTML code:

<link about="#mwt13" data-mw='{"parts":[{"template":{"target":{"wt":"Infobox Korean name\n","href":"./Template:Infobox_Korean_name"},"params":{"context":{"wt":"north"},"hangul":{"wt":""},"hanja":{"wt":""},"rr":{"wt":""},"mr":{"wt":""}},"i":0}}]}' href="./Category:Articles_needing_Korean_script_or_text#Chang%20Gum-chol" id="mwCg" rel="mw:PageProp/Category" typeof="mw:Transclusion"/>

What is this <link> tag doing? Is this linking a category? Or is it a template?

@Isaac @MGerlach ,

I don't understand what's going on in this <link> tag in the HTML code:

<link about="#mwt13" data-mw='{"parts":[{"template":{"target":{"wt":"Infobox Korean name\n","href":"./Template:Infobox_Korean_name"},"params":{"context":{"wt":"north"},"hangul":{"wt":""},"hanja":{"wt":""},"rr":{"wt":""},"mr":{"wt":""}},"i":0}}]}' href="./Category:Articles_needing_Korean_script_or_text#Chang%20Gum-chol" id="mwCg" rel="mw:PageProp/Category" typeof="mw:Transclusion"/>

What is this <link> tag doing? Is this linking a category? Or is it a template?

@Talika2002 this is a template. In the sample article, templates have appeared inside different HTML tags including style, li and div, either as a stand-alone attribute or as part of a dictionary inside the data-mw attribute. Additionally, some magic words are also considered templates by the mwparserfromhell library, e.g.:

<meta content="Chang, Gum-chol" id="mwOA" property="mw:PageProp/categorydefaultsort"/>

@Isaac and @MGerlach , I am a little confused about the following TODO :

Are there features / data that are available in the HTML but not the wikitext?

What exactly should I be showing here? Codes or just study references?
Similarly here,

are there certain words that show up more frequently in the HTML versions but not the wikitext? Why?

what do you mean by words here? tags, attributes, patterns?

Are there features / data that are available in the HTML but not the wikitext?
What exactly should I be showing here? Codes or just study references?

I'm not sure what you mean by Codes or just study references but the thinking with this TODO is that the HTML contains various attributes that aren't included in the wikitext that can tell us things about the article / links / text / etc. So just asking for an example or two of these. Does that help?

are there certain words that show up more frequently in the HTML versions but not the wikitext? Why?
what do you mean by words here? tags, attributes, patterns?

For this one, we mean actual words in the article. So comparing the output of wt.strip_code() for many articles with your corresponding HTML function. In theory the two outputs should be quite similar but in practice, the filtering for the HTML function will be slightly different (and it's working with different source material -- HTML vs. wikitext) so presumably you might see certain words appear more or less often in the HTML outputs. Does that help clarify?
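A small sketch of such a comparison (assuming wikitext_texts and html_texts are lists of plain-text strings produced by wt.strip_code() and by your HTML extraction function for the same articles) could use a simple word counter:

from collections import Counter

# assumes `wikitext_texts` and `html_texts` are lists of extracted plain-text strings
def word_counts(texts):
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

wt_counts = word_counts(wikitext_texts)
html_counts = word_counts(html_texts)

# words that show up much more often in the HTML-derived text
diff = {word: html_counts[word] - wt_counts[word] for word in html_counts}
print(sorted(diff.items(), key=lambda kv: kv[1], reverse=True)[:20])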

Yes @Isaac , I think I got it more or less now. Thanks!

@Isaac @MGerlach,

Are these hrefs considered internal links or external links in a Wikipedia article?

<a class="external text" href="//en.wikipedia.org/w/index.php?title=Special:Whatlinkshere&amp;target=Chang_Gum-chol&amp;namespace=0" rel="mw:ExtLink">link to it</a>

<a class="external text" href="//en.wikipedia.org/w/index.php?title=Special%3ASearch&amp;redirs=1&amp;search=Chang+Gum-chol&amp;fulltext=Search&amp;ns0=1&amp;title=Special%3ASearch&amp;advanced=1&amp;fulltext=Chang+Gum-chol" rel="mw:ExtLink">related articles</a>

<a class="external text" href="//edwardbetts.com/find_link?q=Chang_Gum-chol" rel="mw:ExtLink">Find link tool</a>

Hey, Samanvi here. I have been working mostly with Python for programming purposes and am quite comfortable with it. So, I am excited to be working on this.

@Isaac @MGerlach,

Are these hrefs considered internal links or external links in a Wikipedia article?

<a class="external text" href="//en.wikipedia.org/w/index.php?title=Special:Whatlinkshere&amp;target=Chang_Gum-chol&amp;namespace=0" rel="mw:ExtLink">link to it</a>

<a class="external text" href="//en.wikipedia.org/w/index.php?title=Special%3ASearch&amp;redirs=1&amp;search=Chang+Gum-chol&amp;fulltext=Search&amp;ns0=1&amp;title=Special%3ASearch&amp;advanced=1&amp;fulltext=Chang+Gum-chol" rel="mw:ExtLink">related articles</a>

<a class="external text" href="//edwardbetts.com/find_link?q=Chang_Gum-chol" rel="mw:ExtLink">Find link tool</a>

@Talika2002 Since the 'rel' attribute specifies them as 'ExtLink', I think they are supposed to be considered external links for that article.

@Talika2002 Since the 'rel' attribute specifies them as 'ExtLink', I think they are supposed to be considered external links for that article.

@SamanviPotnuru But aren't all of those links leading back to wikipedia.org? If that's the case, shouldn't they be internal links?

@Isaac @MGerlach,

Are these hrefs considered internal links or external links in a Wikipedia article?

<a class="external text" href="//en.wikipedia.org/w/index.php?title=Special:Whatlinkshere&amp;target=Chang_Gum-chol&amp;namespace=0" rel="mw:ExtLink">link to it</a>

<a class="external text" href="//en.wikipedia.org/w/index.php?title=Special%3ASearch&amp;redirs=1&amp;search=Chang+Gum-chol&amp;fulltext=Search&amp;ns0=1&amp;title=Special%3ASearch&amp;advanced=1&amp;fulltext=Chang+Gum-chol" rel="mw:ExtLink">related articles</a>

<a class="external text" href="//edwardbetts.com/find_link?q=Chang_Gum-chol" rel="mw:ExtLink">Find link tool</a>

@Talika2002 I believe internal links are supposed to be links that redirect the user somewhere within the web page itself. So, even if these links lead back to wikipedia.org, they'll still be considered external links as these links lead to a page external to the current article.

According to this article - Specs/HTML/2.4.0, these are treated as Named external links, which are a type of external link.

@Talika2002 I believe internal links are supposed to be links that redirect the user somewhere within the web page itself. So, even if these links lead back to wikipedia.org, they'll still be considered external links as these links lead to a page external to the current article.

According to this article - Specs/HTML/2.4.0, these are treated as Named external links, which are a type of external link.

That's very helpful, thanks @SamanviPotnuru

@SamanviPotnuru and @Talika2002 , I personally did not quite get the relevance of Named External Links as an explanation for the question. I think NELs are basically those external links that have text in between the tags (e.g. "link to it", "related articles", etc.).
However, after digging around, I found these directives on what can be linked as external links here. This tells us that it is okay to add other wiki articles as external links.

@Appledora Yep you're right about the Named External Links. I think the first paragraph had me convinced.
Thanks for the article!

@SamanviPotnuru and @Talika2002 , I personally did not quite get the relevance of Named External Links as an explanation for the question. I think NELs are basically those external links that have text in between the tags (e.g. "link to it", "related articles", etc.).

@Appledora There is no relevance of NELs for the explanation. I just mentioned it because I found some additional information while digging around.

However, after digging around, I found these directives on what can be linked as external links here. This tells us that it is okay to add other wiki articles as external links.

This was informative. I was talking about the same thing, i.e., how linking other articles will also be considered external links even though they're from Wikipedia.

Exciting to see you all digging into the details of HTML parsing and lots of good points / questions. If you're still curious, you'll find some more details about why those links to Wikipedia are created as external links here: https://en.wikipedia.org/wiki/Wikipedia:Namespace#Virtual_namespaces

In short, when the parser that converts wikitext to HTML reaches a link that is not of the double-bracketed [[link]] format reserved for internal links, I believe it labels it as an external link even if it points to a Wikipedia page. This is likely because it would be overly complicated and error-prone to check all the different formats that a link to Wikipedia could show up as.
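To see how the dump HTML itself makes this distinction, a quick sketch (assuming html holds the article HTML) can simply tally anchors by their rel attribute:

from collections import Counter
from bs4 import BeautifulSoup

# assumes `html` holds the Parsoid HTML of one article
soup = BeautifulSoup(html, 'html.parser')
rel_counts = Counter()
for a in soup.find_all('a'):
    rel = a.get('rel')
    if rel is None:
        key = '(no rel)'
    elif isinstance(rel, list):
        key = ' '.join(rel)
    else:
        key = rel
    rel_counts[key] += 1
print(rel_counts)  # e.g. counts for mw:WikiLink vs. mw:ExtLink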

@Isaac , it seems wikitext parsing still has quite a long way to go to be accurate :v

Hello Everyone!

My name is Alonso Lopez and I am an Outreachy applicant. I am looking forward to contributing to this project and interacting with you all.

@Isaac and @MGerlach , should we mail the task notebook to both of you, or would either one do? Thanks.

@Isaac and @MGerlach , should we mail the task notebook to both of you, or would either one do? Thanks.

@Appledora thanks for asking -- the ideal approach is to send the email to both of us and we will coordinate who provides feedback. I'll update the task to clarify that.

Hi everyone,
welcome to everyone who joined since the last posts. Great to see the ongoing discussions.

As the deadline for the contribution period is (slowly) approaching I wanted to give some updates:

  • One request for everyone working on this task: To get a sense of how many people are intending to apply to the project (T302237), I'd like to ask that you make an initial contribution on the Outreachy site with a link to your current progress by the end of next week (April 17). You can still edit your final application until the final deadline (April 22).
  • If you feel you have completed your notebook, you may request feedback and we will provide high-level feedback on what is good and what is missing. To do so, send an email to the mentors (mgerlach@wikimedia.org and isaac@wikimedia.org) with the public link to your PAWS notebook. We will try to make time to give this feedback once to anyone who would like it (though it is entirely optional). Since this will take us a day or two, it is best to send us your notebook by the end of next week (April 17) to make sure we can get back to you before the final deadline.
  • some general recommendations on the notebooks from previous rounds:
  • Explain each piece of code that you are running. The idea is to make the notebook easy to understand. Don't make the reader have to guess what you were trying to do.
  • Describe your motivation and conclusions for every observation you show. For example, why are you showing X, or Y? and what is your takeaway/conclusions?
  • Avoid long/repetitive code outputs that don't provide relevant information. For example, if you are applying a model that runs 1000 epochs, avoid printing 1000 lines, one for each epoch, because it makes the notebook difficult to read. If you think that there is relevant information in those outputs, think about how to show that information in a way that is compact and easy to understand (for example, a plot).

Thanks for your efforts. Keep the questions / collaboration among yourself going!

Hi @MGerlach , just for the sake of clarification: recording contributions and making the final application are not the same, right? I know that contributions can be updated, something like a version-control mechanism. But can we also edit our applications once we send them in? Thanks.

@Appledora

Hi @MGerlach , just for the sake of clarification: recording contributions and making the final application are not the same, right? I know that contributions can be updated, something like a version-control mechanism. But can we also edit our applications once we send them in? Thanks.

Yes, these are not the same thing. You can edit the final application. The Outreachy application guide states that:

Only applicants that record a contribution to a project will be able to create a final application for that project. We encourage applicants to submit their final application at least a day before the deadline (April 22, 2022 at 4pm UTC). You can edit your final application until the deadline.

My request above was to record a contribution (not the final application) by April 17 so we can get a sense of how many applicants there are. For the evaluation we will only consider the final application (not the previous contributions).
I hope this helps.

This definitely helps. I really apologize for being so redundant, and thanks for bearing with me.

Hi,

My question is related to this page: https://en.wikipedia.org/wiki/Chang_Gum-chol

When I inspect the webpage in the browser, wikilinks have /wiki/ prepended to them. However, the HTML scraped in the notebook doesn't have it. Is this HTML processed before being saved to the DB?

@FatimaArshad-DS, hello. The HTML saved inside the HTML-dump is generated by an internal Wikipedia API (see Parsoid for reference) from the Wikitext code. This is why the generated HTML and the browser HTML are entirely different things. Hope this helps.
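Related to the question above: in the dump HTML, internal links use relative hrefs such as ./Chang_Gum-chol rather than the /wiki/Chang_Gum-chol form the browser shows. A hypothetical helper to turn such an href back into a page title (a sketch for illustration, not part of any library) might look like:

from urllib.parse import unquote

def href_to_title(href):
    # './Some_Page#Section' -> 'Some Page' (illustrative only)
    if href.startswith('./'):
        href = href[2:]
    return unquote(href.split('#')[0]).replace('_', ' ')

print(href_to_title('./Category:Articles_needing_Korean_script_or_text#Chang%20Gum-chol'))
# -> 'Category:Articles needing Korean script or text'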

"# TODO: write a function for extracting the article text

  1. It doesn't have to look the same as the output of wt.strip_code() above (in fact, it likely won't)
  2. but it should be very similar in that you should aim for something
  3. that captures the text of the article without a lot of markup etc.
  4. NOTE: straightforward HTML -> text functions likely won't perform well here and you'll probably
  5. want to write something more custom to handle the specifics of Wikipedia articles"

Do we have to write a generic function here, or can we make use of one specific to this webpage for the moment?

@FatimaArshad-DS , you have to write a generic function because the later tasks ask you to work on more than one (at least 100) articles.

"# TODO: write a function for extracting the article text

I am stuck on this TODO. I extracted the text but am not sure if it is right or wrong, because the extracted result follows a similar template to wt.strip_code() above but there is other information as well.

@Radhika_Saini, if you don't mind, this is what I did: I iterated over all the visible tags in the HTML and extracted text from them. Optionally, I also iterated over other page elements like templates and categories to extract text (if present) from them. My approach also extracts some stub information, which I couldn't omit tbh. Hope this helps.

Thanks, @Appledora. But what's the need to go inside the tags when we have content in the HTML (above, in article_body) to extract?

Basically, I wanted to be flexible about what I can and cannot extract and implemented the function accordingly. Otherwise, just using the default bs4 get_text() method should suffice for the purpose. However, as you mentioned in your earlier comment, the mwparserfromhell output extracts more text than bs4, and I wanted to remedy that in my custom implementation, hence taking the long way of iterating over tags, which is not perfect either tbh. I hope I understood your question properly this time :3
@Radhika_Saini

@Appledora Actually, I did get you. But I am only saying that there is the code of a single web page in article_body and the TODO is to extract the text from that, so there is no need to go inside the tags to get more information or text, which is not even asked for.
Please correct me if I am wrong, @Isaac @MGerlach.
Do we need to go inside the tags and links for more information?

@Appledora There is no need to go inside tags manually. You can extract all the visible text very easily using BS4 :)

@Appledora, do you find the 1000 articles in one place, or is there a database? Or do you find them individually?

Basically, I wanted to be flexible about what I can and cannot extract and implemented the function accordingly. Otherwise, just using the default bs4 get_text() method should suffice for the purpose. However, as you mentioned in your earlier comment, the mwparserfromhell output extracts more text than bs4, and I wanted to remedy that in my custom implementation, hence taking the long way of iterating over tags, which is not perfect either tbh. I hope I understood your question properly this time :3
@Radhika_Saini

@FatimaArshad-DS , in this comment I tried explaining why I chose to iterate over each visible tag individually. I wanted the function to be as customizable as possible. Hope you can understand my motivation.

@Appledora, do you find the 1000 articles in one place, or is there a database? Or do you find them individually?

@Radhika_Saini , if I understand you correctly, you're asking how I am analyzing data from more than one article. So remember the starter notebook at the beginning of this microtask? In the first few cells, it tells you how you can download your own Wikipedia HTML dump or, better, how you can access the already stored English dump on the PAWS server. Using the Python tarfile library I iterate over the tar file, which has an article's dump on each line. I then simply maintain a counter and extract the required number of articles into a list.

Maybe I am not explaining my approach very clearly. So, please feel free to ask follow-up questions.

@Appledora did you find any dataset which contains the content of 1000 articles to work with for analysis?

@Radhika_Saini , no, I am not using any external dataset. I created my own from the HTML dump.

Does it happen to anyone else... PAWS stops saving notebook after a while?

I still don't understand the concept of templates. What are they?

I still don't understand the concept of templates. What are they?

You can take a look at the above comments, which I hope will clarify a lot of the confusion. Apart from that, you can take a look at these pages:
https://en.wikipedia.org/wiki/Help:A_quick_guide_to_templates and https://en.wikipedia.org/wiki/Help:Template

@FatimaArshad-DS , in a very basic sense, a template is exactly what you would expect it to be. It is officially defined in Wikipedia as:

Wikimedia pages are embedded into other pages to allow for the repetition of information

Templates can be interpreted as prebuilt structures where you can insert data against certain keys. There are templates for all sorts of things. For example, this is a template for emojis where, by changing the internal values, you can show different emojis on the webpage.
Pretty much the only thing you need to look for to identify a template is its Template namespace.

Does it happen to anyone else... PAWS stops saving notebook after a while?

And yes, it does. Keep an eye on the "Saving started ..." notification at the bottom; in its absence, I took manual backups.

@Appledora Can you please share some links? How can I make my own dataset from this HTML dump?

@Isaac @MGerlach

I had two questions.

  1. In this TODO: 'Are there features / data that are available in the HTML but not the wikitext?', do you mean that we need to look for attributes in individual HTML tags to find the extra data?
  2. In the final application:
  • Are there any community-specific questions that we have to answer?
  • Also, how should I go about writing the 'Outreachy internship project timeline' in the final application? Any pointers/tips would be helpful.

@Talika2002 , I think HTML Specs might give you some helpful pointers on your first query.

@Appledora Can you please share some links? How can I make my own dataset from this HTML dump?

@Radhika_Saini , hmm... I didn't really use any other blogs or SO posts to create my dataset. I kind of merged the ideas presented in the starter notebook with my previous pandas experience. I don't think sharing code is allowed, but here's my thought process:

1. The PAWS server has an internal dump directory, where the Wikipedia dumps are stored as tar files. The tar files are further divided into chunks of around 10 GB each. You can programmatically pick any chunk you want.
2. Each line of a chunk corresponds to a single Wikipedia article and its related information in the form of a JSON object.
3. Python has a tarfile library/module which can be used to iterate over the tar file line by line. You can use a counter in combination with this library to iterate over your preferred number of article samples and store them in a list.
4. Now you can iterate over this article list and load each item as JSON (because that's what they are).
5. JSON files essentially just have a key-value structure. Familiarize yourself with the structure of these JSONs and go ahead and extract whatever features you want from them.
6. I store the features in lists as I go along, and I feel that it is a rather nasty way of doing it.
7. Once you're done storing your features, you can convert them to a pandas DataFrame and manipulate them as you wish.

Hope this helps.
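As a rough illustration of steps 1-4 above (a minimal sketch rather than anyone's actual submission; the dump filename and the sample size of 100 articles are assumptions):

import json
import tarfile

DUMP_PATH = 'enwiki-NS0-20220201-ENTERPRISE-HTML.json.tar.gz'  # adjust to the chunk available on PAWS
SAMPLE_SIZE = 100

articles = []
with tarfile.open(DUMP_PATH, 'r:gz') as tar:
    for member in tar:
        f = tar.extractfile(member)
        if f is None:
            continue
        for line in f:
            articles.append(json.loads(line))  # one JSON object (article) per line
            if len(articles) >= SAMPLE_SIZE:
                break
        if len(articles) >= SAMPLE_SIZE:
            break

# each item should contain an 'article_body' key with 'wikitext' and 'html'
print(len(articles), list(articles[0].keys())[:5])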

@Isaac and @MGerlach , are we allowed to ask for feedback on our final applications? I particularly need help with the project timeline, like several others here, I think.
Also, are there any community-specific questions for this task?
Thanks!

@MGerlach & @Isaac I am also having the same doubt and need some guidance on the community-specific questions and on a project timeline covering the time commitments of both the applicants and the mentors for the final application, so that everything can be planned smoothly.

@Talika2002 @Appledora @Robot_Jelly
Some comments on the final application, especially with respect to the project timeline and community specific questions:

  • The most important aspect of the application is the notebook from the application task. We won't be able to give additional feedback on the other aspects in the final application.
  • Project timeline: try to give a rough sketch of the different steps you plan to take in order to complete the project (T302237 mentions 4 phases: becoming familiar with the dumps, writing the code, writing the documentation, and performing analysis). No need to be perfect or too detailed. The aim is to organize what things you think should be done and in which order. This is not set in stone and will likely change as you embark on the project. Feel free to also use it to identify which aspects are most interesting to you and which you would like to spend more time on, as well as any additional steps you might think are pertinent for your work on this project.
  • Community-specific questions: you can skip this question if you want.

Text coming from the wikitext is in a pretty format. Was anyone able to pretty-print the HTML text?

@FatimaArshad-DS , have you tried Beautiful Soup's soup.prettify()?
If you want to adopt a more customizable approach, here's a SO post about it.
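A quick usage example of that suggestion (assuming html holds the article HTML):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# print only a slice so the notebook stays readable
print(soup.prettify()[:2000])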

Hi all,
just a reminder: if you have not done so already, don't forget to submit your final application on the Outreachy website before the deadline on Friday, April 22 at 4pm UTC (a little bit more than 3 days from when I am posting this).
Even if you sent your notebook to Isaac or me for feedback during the past weeks (thanks to everyone who shared their progress), you still need to submit the application on the Outreachy site. Please also make sure to include the public link to your notebook (see the documentation for how to get the public link).

Thanks for all the great contributions and discussions.

Hi @MGerlach,

Do we still get a chance for your feedback after the deadline?

Hi everyone,
the final application deadline has passed. I wanted to thank you for all the hard work and effort you put into your submissions. You all did a really good job, not only in your analyses and notebooks but also in being curious, asking questions, and helping each other out!

We are now reviewing the submissions. According to information from the organizers, the selection of interns will be announced on May 20 at 4pm UTC.

Do not hesitate to reach out in case you have any questions.

MGerlach claimed this task.