User Details
- User Since: Mar 27 2022, 5:38 PM (14 w, 2 d)
- Availability: Available
- LDAP User: Unknown
- MediaWiki User: Appledora
Apr 17 2022
@FatimaArshad-DS, have you tried BeautifulSoup's soup.prettify()?
If you want to adopt a more customizable approach, here's a SO post about it.
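For reference, here is a minimal sketch of what soup.prettify() does; the sample markup is just a placeholder, swap in the HTML string of a dump article:

```python
from bs4 import BeautifulSoup

# Placeholder markup; replace with the HTML of one dump article.
html = "<p>Hello <b>world</b>, see <a href='https://example.org'>this</a>.</p>"

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())  # re-indents the parse tree, one tag per line, for easier reading
```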
Apr 15 2022
@Radhika_Saini, hmm... I didn't really use any other blogs or SO posts to create my dataset. I kind of merged the ideas presented in the starter notebook with my previous pandas experience. I don't think sharing code is allowed, but here's my thought process (there's a rough sketch after the list below):
1. The PAWS server has an internal dump directory where the Wikipedia dumps are stored as tarfiles. The tarfiles are further divided into chunks of around 10 GB each. You can programmatically pick any chunk you want.
2. Each line of a chunk corresponds to a single Wikipedia article and its related information, in the form of a JSON object.
3. Python has a tarfile library/module which can be used to iterate over the tarfile line by line. You can use a counter in combination with this library to iterate over your preferred number of article samples and store them in a list.
4. Now you can iterate over this article list and load each item as JSON (because that's what they are).
5. JSON files essentially just have a key-value structure. Familiarize yourself with the structure of these JSONs and go ahead and extract whatever features you want from them.
6. I store the features in lists as I go along, and I feel that it is a rather nasty way of doing it.
7. Once you're done storing your features, you can convert them to a pandas dataframe and manipulate them as you wish.
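To make the steps above concrete, here is a rough sketch of how that iteration could look. The dump file name, the sample size, and every top-level key other than article_body (e.g. name, identifier) are assumptions; check them against the chunk you actually pick on PAWS:

```python
import json
import tarfile

import pandas as pd

DUMP_PATH = "enwiki-NS0-ENTERPRISE-HTML.json.tar.gz"  # assumed name of one dump chunk
N_SAMPLES = 100                                       # number of articles to keep

rows = []
with tarfile.open(DUMP_PATH, mode="r:gz") as tar:      # step 3: iterate over the tarfile
    for member in tar:
        fh = tar.extractfile(member)
        if fh is None:                                 # skip directory entries
            continue
        for line in fh:                                # step 2: one article per line
            article = json.loads(line)                 # step 4: load the line as JSON
            body = article.get("article_body", {})     # step 5: pick the features you want
            rows.append(
                {
                    "title": article.get("name"),             # assumed key
                    "identifier": article.get("identifier"),  # assumed key
                    "wikitext": body.get("wikitext"),          # assumed sub-key
                    "html": body.get("html"),                  # assumed sub-key
                }
            )
            if len(rows) >= N_SAMPLES:
                break
        if len(rows) >= N_SAMPLES:
            break

df = pd.DataFrame(rows)                                # step 7: manipulate as a dataframe
print(df.shape)
```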
@Talika2002, I think the HTML Specs might give you some helpful pointers on your first query.
Apr 14 2022
@FatimaArshad-DS, in a very basic sense, the template is exactly what you would expect it to be. It is officially defined on Wikipedia as:
Wikimedia pages are embedded into other pages to allow for the repetition of information
Templates can be interpreted as prebuilt structures, where you can insert data against certain keys. There are templates for all sorts of things. For example, this is a template for emojis where by changing the internal values you can show different emojis on the webpage.
Pretty much the only thing you need to look for to identify a template is its Template namespace.
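As an illustration, here is a rough sketch of that namespace check. It assumes the dump HTML uses Parsoid-style hrefs of the form ./Template:Something; the fragment below is made up:

```python
from bs4 import BeautifulSoup

# Made-up fragment; in practice, feed in the dump HTML of an article.
html = '<p><a href="./Template:Smiley" title="Template:Smiley">smiley</a> and <a href="./Cat">Cat</a></p>'
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all("a", href=True):
    target = link["href"].lstrip("./")          # drop the Parsoid-style "./" prefix
    if target.startswith("Template:"):          # the Template namespace check
        print("template link:", target)
```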
Apr 11 2022
@Radhika_Saini, no, I am not using any external dataset. I created my own from the HTML dump.
Apr 10 2022
Basically, I wanted to be flexible about what I can and cannot extract, and implemented the function accordingly. Otherwise, just using the default bs4 get_text() method should suffice for the purpose. However, as you mentioned in your earlier comment, the mwparserfromhell output extracts more text than bs4, and I wanted to remedy that in my custom implementation, hence taking the long way of iterating over tags, which is not perfect either tbh. I hope I understood your question properly this time :3
@Radhika_Saini
Apr 9 2022
@Radhika_Saini, if you don't mind, this is what I did: I iterated over all the visible tags in the HTML and extracted text from them. Optionally, I also iterated over other page elements like templates and categories to extract text (if present) from them. My approach also extracts some stub information, which I couldn't omit tbh. Hope this helps.
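In case it helps, here is a minimal sketch of that idea, not my exact code: it walks the text nodes and keeps only those whose parent tag is normally rendered. The INVISIBLE set below is just a guess at what counts as non-visible.

```python
from bs4 import BeautifulSoup

# Tags whose text is usually not rendered on the page -- tune this to taste.
INVISIBLE = {"script", "style", "head", "title", "meta", "[document]"}

def visible_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    chunks = []
    for node in soup.find_all(string=True):      # every text node in the document
        if node.parent.name not in INVISIBLE:
            text = node.strip()
            if text:
                chunks.append(text)
    return " ".join(chunks)

# The simpler default mentioned above:
# BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
```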
@FatimaArshad-DS, you have to write a generic function because the later tasks ask you to work on more than one article (at least 100).
Apr 8 2022
Hi @Mike_Peel, I am very, very late to the party, but I have started getting familiar with Wikidata and creating my page. My question is probably redundant and dumb, but I am curious about your expectations for this microtask. For our created pages, would you prefer them to be structured and represented in a conventional Wikipedia article style? Or would you rather have a more descriptive page (something like a Jupyter notebook) that represents the creator's thought process and explorations? Thanks.
@FatimaArshad-DS, hello. The HTML saved inside the HTML dump is generated by an internal Wikipedia API (see Parsoid for reference) from the wikitext code, which is why the generated HTML and the HTML you see in the browser are entirely different things. Hope this helps.
This definitely helps. I really apologize for being so redundant, and thanks for bearing with me.
Hi @MGerlach, just for the sake of clarification: recording contributions and making the final application are not the same thing, right? I know that contributions can be updated, something like a version-control mechanism. But can we also edit our applications once we send them in? Thanks.
Apr 4 2022
@Isaac, it seems parsing wikitext still has a long way to go before it's accurate :v
Apr 2 2022
@SamanviPotnuru and @Talika2002, I personally did not quite get the relevance of Named External Links as an explanation for the question. I think NELs are basically those external links that have text in between the tags (e.g., "link to it", "related articles", etc.).
However, after digging around, I found these directives on what can be linked as external links here. This tells us that it is okay to add other wikiarticles as external links.
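For what it's worth, here is a small sketch of how external links that carry anchor text could be picked out. It assumes the dump HTML marks external links with rel="mw:ExtLink", which is how Parsoid output usually tags them; the sample fragment is invented.

```python
from bs4 import BeautifulSoup

# Invented fragment with one named and one bare external link.
html = (
    '<p>See <a rel="mw:ExtLink" href="https://example.org">link to it</a> '
    'and <a rel="mw:ExtLink" href="https://example.com"></a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Keep only the external links that also have visible anchor text.
named_ext_links = [
    (a.get_text(strip=True), a.get("href"))
    for a in soup.find_all("a")
    if "mw:ExtLink" in (a.get("rel") or []) and a.get_text(strip=True)
]
print(named_ext_links)  # [('link to it', 'https://example.org')]
```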
Apr 1 2022
Yes @Isaac , I think I got it more or less now. Thanks!
Mar 31 2022
@Isaac and @MGerlach, I am a little confused about the following TODO:
Are there features / data that are available in the HTML but not the wikitext?
What exactly should I be showing here? Code, or just study references?
Similarly here,
are there certain words that show up more frequently in the HTML versions but not the wikitext? Why?
What do you mean by "words" here? Tags, attributes, patterns?
Mar 30 2022
I went through the thread again and dug around about magic words more :D Thanks both of you!
@Isaac and @Talika2002 , I didn't quite get the question posed here. Could I kindly have some more examples/explanations on it?
That clears up a lot of things. Thanks, @Isaac !
Mar 29 2022
Thanks, @Isaac, for the explanations. But as you mentioned, and as I have discovered while working on the data, the HTML does seem to have more content than the wikitext. Is that owing to the inner workings of the parser, i.e. mwparserfromhell, or is it the actual case? And once again, I really appreciate you bearing with me today.
@Appledora can you explain more? My takeaway from that work is that the HTML often has much more content.
Hi, @Radhika_Saini. Have you followed the starter notebook attached to this microtask? If you follow it along and extract the first sample article, you will see that it's a JSON file. This JSON file has a key called article_body, which contains both the wikitext and the parsed HTML version of that article. If I am not wrong, you only have to process the HTML code to complete your tasks.
@Isaac and @MGerlach, correct me if I am wrong.
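For example, a single extracted article could be inspected like this; the file name and the sub-key names under article_body are assumptions, only article_body itself appears in the starter notebook:

```python
import json

# Hypothetical file holding one line pulled out of the dump chunk.
with open("sample_article.json", encoding="utf-8") as fh:
    article = json.load(fh)

body = article["article_body"]      # key mentioned above
print(body.keys())                  # check which versions are present
wikitext = body.get("wikitext")     # assumed sub-key
html = body.get("html")             # assumed sub-key
```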
@MGerlach, after reading the paper by Mitrevski et al., I had expected there to be less information in the HTML code compared to the wikitext. However, while doing the first TODO, I have found the outcome to be the opposite. Could you give me any pointers on it? Any supplementary literature would also be appreciated.
Also, for TODO#1, I am trying to replicate some of the mwparserfromhell functions/methods. I just want to clarify whether this is what we were essentially asked to do in this TODO. Thanks!
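Concretely, the kind of mwparserfromhell behaviour I'm trying to reproduce from the HTML side looks roughly like this; the wikitext snippet is made up:

```python
import mwparserfromhell as mwph

# Made-up wikitext with a template, a wikilink, and some formatting.
wikitext = "{{Infobox person|name=Ada}} '''Ada''' was a [[mathematician]]."
code = mwph.parse(wikitext)

print(code.filter_templates())   # the templates, e.g. {{Infobox person|...}}
print(code.filter_wikilinks())   # the internal links, e.g. [[mathematician]]
print(code.strip_code())         # plain text, which I try to match from the HTML
```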
Mar 27 2022
@Antima_Dwivedi Hi, I noticed you are having problems with downloading the notebook. I hope you're no longer facing it, but here's what I did.
Hello, Nazia here. Absolutely delighted to try these challenges out. I have gone through the setup stages, and for the sake of clarification (and as some PSA), I would like to elaborate on my understanding of the tasks here:
- There are 6 tasks to be done in the notebook.
- We will be dealing with two types of markup: HTML and wikitext.
- The high-level goal is to extract wikitext-equivalent information from the HTML-formatted pages.
- There are additional documents we may need to go through to complete the microtasks.
Please correct me if I am wrong, @Isaac, @MGerlach, and the rest.