
Jun 29 2023

Appledora closed T316941: NLP Tools for Content Gaps as Resolved.
Jun 29 2023, 6:34 PM · Research (FY2022-23-Research-April-June), Epic
Appledora closed T328260: NLP Tools: Sentence Tokenization, a subtask of T316941: NLP Tools for Content Gaps, as Resolved.
Jun 29 2023, 6:34 PM · Research (FY2022-23-Research-April-June), Epic
Appledora closed T328260: NLP Tools: Sentence Tokenization as Resolved.
Jun 29 2023, 6:34 PM · Research (FY2022-23-Research-April-June)
Appledora closed T328264: NLP Tools: Word Tokenization, a subtask of T316941: NLP Tools for Content Gaps, as Resolved.
Jun 29 2023, 6:33 PM · Research (FY2022-23-Research-April-June), Epic
Appledora closed T328264: NLP Tools: Word Tokenization as Resolved.
Jun 29 2023, 6:33 PM · Research (FY2022-23-Research-April-June)
Appledora updated the task description for T328264: NLP Tools: Word Tokenization.
Jun 29 2023, 6:32 PM · Research (FY2022-23-Research-April-June)
Appledora updated the task description for T328264: NLP Tools: Word Tokenization.
Jun 29 2023, 6:27 PM · Research (FY2022-23-Research-April-June)
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [26.06.2023 - 27.06.2023]

  1. Final presentation preparations, feedback and discussions
  2. Package released on PyPI
  3. Wrapped up with a presentation to the research team!
Jun 29 2023, 6:25 PM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [21.06.2023]

  1. Exploratory notebook generation
  2. Test PyPI uploads
  3. Listed newer low priority issues
Jun 29 2023, 6:24 PM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [13.06.2023]

  1. Benchmarking output and log analysis
  2. Closed remaining MRs
  3. Discussing packaging decisions
Jun 29 2023, 6:19 PM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [06.06.2023]

  1. Shifted from the sentence-level dataset to a paragraph-level one due to bad segmentation errors
  2. Trained individual + combined + clusterwise SPC models
Jun 29 2023, 6:17 PM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [30.05.2023]

  1. Sentence Evaluation Dataset expansion
  2. Fixed recall greater than 1 error
  3. Clustered NWS language scripts
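
The update does not say what caused the recall-greater-than-1 error. One common cause, shown here purely as an assumption, is counting the same gold item more than once when predictions contain duplicates. A minimal sketch of a multiset-based match that keeps recall within [0, 1] (the function names are hypothetical, not taken from the project):

```python
from collections import Counter


def naive_recall(predicted, gold):
    """Buggy variant: every prediction found in gold counts as a hit."""
    hits = sum(1 for item in predicted if item in gold)
    return hits / len(gold) if gold else 0.0


def precision_recall(predicted, gold):
    """Match each gold item at most once by intersecting multisets."""
    pred_counts = Counter(predicted)
    gold_counts = Counter(gold)
    # Counter & Counter keeps the minimum count per item.
    matched = sum((pred_counts & gold_counts).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall


predicted = ["a", "a", "b"]  # "a" is predicted twice
gold = ["a", "b"]
buggy = naive_recall(predicted, gold)
precision, recall = precision_recall(predicted, gold)
print(buggy, precision, recall)
```

With the duplicate prediction, the naive count yields a recall above 1, while the multiset version caps it at 1.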
Jun 29 2023, 6:02 PM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [23.05.2023]

  1. Word tokenization benchmarking error logging integration
  2. Fixed stat-machine-related errors + upgraded notebooks
  3. Identified newer edge-cases for sentence tokenization
Jun 29 2023, 6:00 PM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [16.05.2023]

  1. Completed MR on Word Tokenization Evaluation dataset generation
  2. FLORES language name alignment sheet
Jun 29 2023, 5:58 PM · Research (FY2022-23-Research-April-June), Epic

May 9 2023

Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [09.05.2023]

  1. Addressed reviews on current MR
  2. Created new benchmarking dataset for sentence tokenization evaluation
  3. Scripts on wikilink parsing
May 9 2023, 12:45 PM · Research (FY2022-23-Research-April-June), Epic

May 2 2023

Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [02.05.2023]

  1. 3-way alignment between BN/EN/DE splits
  2. Script writing on word tokenization wikipedia benchmark dataset building
May 2 2023, 12:55 PM · Research (FY2022-23-Research-April-June), Epic

Apr 28 2023

Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [25.04.2023]

  1. Addressed reviews on the MR
  2. Additional alignment stats for BN-EN splits
  3. Upgraded existing notebooks to spark3
Apr 28 2023, 3:31 AM · Research (FY2022-23-Research-April-June), Epic

Apr 18 2023

Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [18.04.2023]

  1. Pushed MR on NWS word tokenization evaluation
  2. Calculated BN-EN split alignment
  3. Working on upgrading spark version
Apr 18 2023, 5:27 PM · Research (FY2022-23-Research-April-June), Epic

Apr 17 2023

Appledora updated the task description for T328264: NLP Tools: Word Tokenization.
Apr 17 2023, 1:45 PM · Research (FY2022-23-Research-April-June)
Appledora updated the task description for T328260: NLP Tools: Sentence Tokenization.
Apr 17 2023, 1:43 PM · Research (FY2022-23-Research-April-June)
Appledora updated the task description for T328260: NLP Tools: Sentence Tokenization.
Apr 17 2023, 1:43 PM · Research (FY2022-23-Research-April-June)
Appledora updated the task description for T328260: NLP Tools: Sentence Tokenization.
Apr 17 2023, 1:42 PM · Research (FY2022-23-Research-April-June)
Appledora updated the task description for T328264: NLP Tools: Word Tokenization.
Apr 17 2023, 1:41 PM · Research (FY2022-23-Research-April-June)
Appledora updated the task description for T328260: NLP Tools: Sentence Tokenization.
Apr 17 2023, 1:40 PM · Research (FY2022-23-Research-April-June)
Appledora updated the task description for T328264: NLP Tools: Word Tokenization.
Apr 17 2023, 1:38 PM · Research (FY2022-23-Research-April-June)
Appledora updated the task description for T328264: NLP Tools: Word Tokenization.
Apr 17 2023, 1:36 PM · Research (FY2022-23-Research-April-June)
Appledora updated the task description for T328260: NLP Tools: Sentence Tokenization.
Apr 17 2023, 1:35 PM · Research (FY2022-23-Research-April-June)
Appledora updated the task description for T328260: NLP Tools: Sentence Tokenization.
Apr 17 2023, 1:33 PM · Research (FY2022-23-Research-April-June)
Appledora updated the task description for T328264: NLP Tools: Word Tokenization.
Apr 17 2023, 1:14 PM · Research (FY2022-23-Research-April-June)
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [03.04.2023] - [11.04.2023]

  1. Informal quarterly review
  2. Annotated sentence evaluation dataset for Bangla
  3. Addressed reviews on earlier MRs
  4. Identified some more issues
Apr 17 2023, 12:22 PM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [28.03.2023]

  1. Added test modules for NWS sentence tokenizations.
  2. Integrated abbreviations in SPC tokenization
  3. Defined the unsupervised word tokenization performance evaluation scheme
Apr 17 2023, 12:20 PM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [21.03.2023]

  1. Added MR on SPC integration with training and corpus collection scripts
  2. Adapted the test suite to the new repo structure
  3. Updated the ground truth dataset format for evaluation
Apr 17 2023, 11:49 AM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [14.03.2023]

  1. Merged remaining MRs on packaging and code restructuring.
  2. Started working on NWS word tokenization with sentencepiece
  3. Corpus collection for SPC tokenizer and training
Apr 17 2023, 11:45 AM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [07.03.2023]

  1. Identified newer issues stemming from the tokenization class implementations
  2. Set up repo for packaging/easy installation
  3. Added testing modules
Apr 17 2023, 11:38 AM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [21.02.2023] - [28.02.2023]

  1. Restructured tokenizer class
  2. Addressed reviews on optimizing the word-tokenization schemes
  3. Reorganized
Apr 17 2023, 11:36 AM · Research (FY2022-23-Research-April-June), Epic

Feb 15 2023

Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [15.02.2023]

  1. Using a JSON of the abbreviations files
  2. Implemented character-level word tokenization scheme
  3. Minor CI/CD reconfiguration
Feb 15 2023, 9:09 AM · Research (FY2022-23-Research-April-June), Epic

Feb 7 2023

Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [07.02.2023]

  1. Adapted to use the abbreviation lists from a pickle file
  2. Minor modifications in the notebook
  3. Implemented a rule-based word tokenization method for whitespace-delimited languages (resources from this paper)
  4. Some issue clean-up
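
A minimal sketch of what a rule-based tokenizer for whitespace-delimited languages could look like. The pattern below is an assumption for illustration and is not the rule set from the referenced paper or from the project:

```python
import re

# Hypothetical rule: a word is a run of word characters, optionally joined
# by apostrophes; any other non-space character is its own token.
TOKEN_RE = re.compile(r"\w+(?:['\u2019]\w+)*|[^\w\s]")


def tokenize_words(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


print(tokenize_words("Hello, world!"))
print(tokenize_words("Don't stop."))
```

Note that Python's `\w` does not match combining marks, so scripts that rely on them (many Indic scripts, for example) would likely need a broader pattern.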
Feb 7 2023, 1:32 PM · Research (FY2022-23-Research-April-June), Epic

Feb 2 2023

Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [01.02.2023]

  1. Uploaded language wise filtered abbreviations lists with an MR
  2. Trained sentencepiece on sample group of languages
  3. Created new MR for sentencepiece scripts
  4. More gitlab issue-cleanup
Feb 2 2023, 12:21 PM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [24.01.2023]

  1. Updated abbreviation pipeline to consider minimum word frequency
  2. Moved tasks to phabricator to establish hierarchical structure
  3. Built the pipeline for sentencepiece corpus generation
Feb 2 2023, 12:17 PM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [9.01.2023] and [17.01.2023]

  1. Informal review
  2. Started cleaning up issues on GitLab, to make them more verbose
  3. Debugged some pyspark issues related to resource requirements + working with distributed files
  4. Grouping languages by cluster (delimiter + fallback wise)
  5. Started annotating FLORES101 dataset for sentence segmentation task.
Feb 2 2023, 12:13 PM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [3.01.2023]

  1. Finished adapting abbreviation filtering code for each wiki project
  2. Got started on word tokenization
  3. Addressed reviews of an open MR
Feb 2 2023, 12:08 PM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [20.12.2022]

  1. Mostly spent time trying to get familiarized with pyspark
  2. Updated the Word tokenization literature review with more information on existing opensource tools
  3. Discovered some additional edge cases in sentence tokenization (e.g., parenthesis and quotation tracking)
Feb 2 2023, 12:05 PM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [13.12.2022]

  1. Adapted Martin's code on wikitext processing for abbreviation filtering
  2. Moved to statmachine for running simulations using pyspark
  3. Started literature review + background study on Word Tokenization
Feb 2 2023, 12:03 PM · Research (FY2022-23-Research-April-June), Epic
Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [06.12.2022]

  1. Addressed reviews on the abbreviation filtering scheme
  2. Discussion on sentence segmentation evaluation datasets
  3. Wrote the algorithm for abbreviation filtering
Feb 2 2023, 12:01 PM · Research (FY2022-23-Research-April-June), Epic

Feb 1 2023

Appledora renamed T328265: Word Tokenization: White-spaced language Tokenization from White-spaced language Tokenization to Word Tokenization: White-spaced language Tokenization.
Feb 1 2023, 2:57 PM · Research (FY2022-23-Research-October-December)

Jan 30 2023

Appledora updated subscribers of T328272: Sentence Tokenization: Evaluation Pipeline.
Jan 30 2023, 9:58 AM · Research (FY2022-23-Research-October-December)
Appledora created T328272: Sentence Tokenization: Evaluation Pipeline.
Jan 30 2023, 9:58 AM · Research (FY2022-23-Research-October-December)
Appledora created T328270: Sentencepiece: all non-whitespace languages.
Jan 30 2023, 9:46 AM · Research (FY2022-23-Research-October-December)
Appledora created T328269: Sentencepiece: Language Family Wise training.
Jan 30 2023, 9:44 AM · Research (FY2022-23-Research-October-December)
Appledora updated the task description for T328260: NLP Tools: Sentence Tokenization.
Jan 30 2023, 9:43 AM · Research (FY2022-23-Research-April-June)
Appledora updated the task description for T328264: NLP Tools: Word Tokenization.
Jan 30 2023, 9:42 AM · Research (FY2022-23-Research-April-June)
Appledora created T328267: Word Tokenization: Non-whitespace languages.
Jan 30 2023, 9:41 AM · Research (FY2022-23-Research-October-December)
Appledora created T328265: Word Tokenization: White-spaced language Tokenization.
Jan 30 2023, 9:35 AM · Research (FY2022-23-Research-October-December)
Appledora created T328264: NLP Tools: Word Tokenization.
Jan 30 2023, 9:32 AM · Research (FY2022-23-Research-April-June)
Appledora created T328263: Consider abbreviations while sentence splitting.
Jan 30 2023, 9:30 AM · Research (FY2022-23-Research-October-December)
Appledora created T328261: Rule-based Sentence Tokenization.
Jan 30 2023, 9:28 AM · Research (FY2022-23-Research-October-December)
Appledora updated the task description for T328260: NLP Tools: Sentence Tokenization.
Jan 30 2023, 9:26 AM · Research (FY2022-23-Research-April-June)
Appledora created T328260: NLP Tools: Sentence Tokenization.
Jan 30 2023, 9:25 AM · Research (FY2022-23-Research-April-June)

Nov 28 2022

Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [18.11.2022] and [27.11.2022]

  1. Implemented abbreviation replacement scheme
  2. Performance analysis of segmentation before and after abbreviation post-processing
  3. Implemented a filtration scheme for the wiktionary abbreviations
  4. Performance analysis of abbreviation filtration across a range of frequency ratio thresholds
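
The feed does not define the frequency ratio used for filtration. As a sketch under a loudly stated assumption, the ratio below is the share of a candidate's occurrences that are followed by a period; `filter_abbreviations`, `min_ratio`, and `min_count` are hypothetical names:

```python
import re
from collections import Counter


def filter_abbreviations(candidates, corpus, min_ratio=0.5, min_count=1):
    """Keep candidates that mostly occur with a trailing period.

    Assumed ratio: count("abbr.") / (count("abbr.") + count("abbr")).
    """
    # Letter runs, each optionally followed by a single period.
    tokens = re.findall(r"[^\W\d_]+\.?", corpus.lower())
    counts = Counter(tokens)
    kept = []
    for candidate in candidates:
        base = candidate.lower().rstrip(".")
        with_period = counts[base + "."]
        total = with_period + counts[base]
        if total == 0 or total < min_count:
            continue  # too rare to judge
        if with_period / total >= min_ratio:
            kept.append(candidate)
    return kept


corpus = "Dr. Smith met Prof. Jones. The art. was late. Art is long and art is hard."
print(filter_abbreviations(["Dr.", "Prof.", "art."], corpus))
```

Sweeping `min_ratio` over a range of values is one way to run the kind of threshold analysis mentioned above.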
Nov 28 2022, 1:03 PM · Research (FY2022-23-Research-April-June), Epic

Nov 11 2022

Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [11.11.2022]

  1. Curated list of abbreviations for all languages with a wiktionary project.
  2. Working on integrating the abbreviation search as a replacement scheme.
Nov 11 2022, 6:04 PM · Research (FY2022-23-Research-April-June), Epic
Appledora updated the task description for T322070: Onboard Nazia onto analytics infrastructure.
Nov 11 2022, 6:02 PM · Research

Nov 4 2022

Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [04.11.2022]

  1. Server Onboarding
  2. Building a deterministic Benchmark Module and dataset development
  3. Going through the example PySpark notebook by Martin and other walkthrough documentation by Isaac
Nov 4 2022, 8:13 PM · Research (FY2022-23-Research-April-June), Epic
Appledora updated the task description for T322070: Onboard Nazia onto analytics infrastructure.
Nov 4 2022, 5:36 AM · Research

Oct 29 2022

Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [18.10.2022] and [25.10.2022]

  1. Compiled a list of Unicode sentence terminators
  2. Built a benchmark sample for four languages (EN, ES, DE, AR)
  3. Implemented the naive rule-based sentence segmenter
  4. Collected a dataset for testing and future supervised training
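
A minimal sketch of a naive rule-based segmenter of this kind. The terminator set is a small illustrative subset chosen for this example, not the project's compiled Unicode list, and `naive_segment` is a hypothetical name:

```python
import re

# Illustrative subset: Latin terminators, Arabic question mark,
# Urdu full stop, and the Devanagari danda.
SENTENCE_TERMINATORS = ".!?\u061F\u06D4\u0964"

# Split on whitespace that directly follows a terminator.
_SPLIT_RE = re.compile(r"(?<=[" + re.escape(SENTENCE_TERMINATORS) + r"])\s+")


def naive_segment(text):
    """Return the sentences of text, keeping each terminator attached."""
    return [part for part in _SPLIT_RE.split(text.strip()) if part]


print(naive_segment("Hello world. How are you? Fine!"))
print(naive_segment("Dr. Smith arrived."))
```

The second call splits after "Dr.", which illustrates why a purely terminator-based rule struggles with abbreviations.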
Oct 29 2022, 1:13 PM · Research (FY2022-23-Research-April-June), Epic

Oct 26 2022

Appledora added a comment to T321086: Requesting access to analytics-privatedata-users for appledora.

Hey @Dzahn , for now I have updated my email with the contractor email. Hope this helps!

Oct 26 2022, 4:01 PM · SRE, SRE-Access-Requests

Oct 20 2022

Appledora updated the task description for T321086: Requesting access to analytics-privatedata-users for appledora.
Oct 20 2022, 10:34 AM · SRE, SRE-Access-Requests

Oct 16 2022

Appledora added a comment to T316941: NLP Tools for Content Gaps.
  1. Finished the review on standard tools of Sentence Tokenization doc
  2. Informal Report on Milestones link
  3. Collecting resources for rule-based segmentation
  4. Isaac created a meta page
Oct 16 2022, 9:24 AM · Research (FY2022-23-Research-April-June), Epic

Oct 7 2022

Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [07.10.2022]

  1. Building a report on Sentence Tokenization link
  2. Renewed focus on memory footprint and compute cost
Oct 7 2022, 5:21 PM · Research (FY2022-23-Research-April-June), Epic

Sep 30 2022

Appledora added a comment to T316941: NLP Tools for Content Gaps.

update: [30.09.2022]

  1. Set up the basic project structure on GitLab. PR#2
  2. Started analysis on pre-established NLP packages (NLTK, Gensim, Spacy) PR#3
Sep 30 2022, 9:39 AM · Research (FY2022-23-Research-April-June), Epic

Jun 26 2022

Appledora added a watcher for Wikimedia-Hackathon-2022: Appledora.
Jun 26 2022, 2:25 PM

Apr 27 2022

Appledora removed a watcher for good first task: Appledora.
Apr 27 2022, 5:35 AM

Apr 17 2022

Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@FatimaArshad-DS , have you tried BeautifulSoup's soup.prettify()?
If you want to adopt a more customizable approach, here's a SO post about it.

Apr 17 2022, 3:59 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 15 2022

Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Isaac and @MGerlach , are we allowed to ask for feedback on our final applications? Particularly need help on the project timeline, like several others here I think.
Also, are there any community-specific questions for this task?
Thanks!

Apr 15 2022, 8:23 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Appledora Can you please share some links? How can I make my own dataset from this HTML dump?

@Radhika_Saini , hmm... I didn't really use any other blogs or SO posts to create my dataset. I kind of merged the ideas presented in the starter notebook with my previous pandas experience. I don't think sharing code is allowed, but here's my thought process:

1. The PAWS server has an internal dump directory where the Wikipedia dumps are stored as tarfiles. The tarfiles are further divided into chunks of around 10 GB each. You can programmatically pick any chunk you want.
2. Each line of a chunk corresponds to a single Wikipedia article and its related information, in the form of a JSON object.
3. Python has a tarfile module which can be used to iterate over the tarfile line by line. You can use a counter in combination with it to iterate over your preferred number of article samples and store them in a list.
4. Now you can iterate over this article list and load each item as JSON (because that's what they are).
5. JSON objects essentially just have a key-value structure. Familiarize yourself with their structure and go ahead and extract whatever features you want from them.
6. I store the features in lists as I go along, and I feel that it is a rather nasty way of doing it.
7. Once you're done storing your features, you can convert them to a pandas dataframe and manipulate it as you wish.
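
The steps above can be sketched roughly as follows. This is a generic illustration, not the code from the actual submission: the in-memory archive, the member name `chunk_0.ndjson`, and the JSON keys `name` and `html` are all made up for the demo, and the real dump path and schema are not shown here.

```python
import io
import json
import tarfile


def sample_articles(tar_fileobj, max_articles):
    """Read up to max_articles JSON lines from the files inside a tar archive."""
    articles = []
    with tarfile.open(fileobj=tar_fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            for raw_line in tar.extractfile(member):
                if len(articles) >= max_articles:  # the counter from step 3
                    return articles
                line = raw_line.strip()
                if line:
                    articles.append(json.loads(line))  # step 4
    return articles


def build_demo_archive():
    """Create a tiny in-memory tar.gz with one line-delimited JSON chunk."""
    rows = [
        {"name": "Article A", "html": "<p>Hello</p>"},
        {"name": "Article B", "html": "<p>World!</p>"},
        {"name": "Article C", "html": "<p>Unused</p>"},
    ]
    payload = "\n".join(json.dumps(row) for row in rows).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="chunk_0.ndjson")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


articles = sample_articles(build_demo_archive(), max_articles=2)
# Steps 5 and 6: pull out whatever features you need (hypothetical ones here).
features = [{"title": a["name"], "html_length": len(a["html"])} for a in articles]
print(features)
```

A list of dicts like `features` can be passed straight to `pandas.DataFrame(features)` for step 7, which also avoids keeping one separate list per feature.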
Apr 15 2022, 7:26 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Talika2002 , I think HTML Specs might give you some helpful pointers on your first query.

Apr 15 2022, 7:16 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 14 2022

Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

Does this happen to anyone else... PAWS stops saving the notebook after a while?

Apr 14 2022, 5:35 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@FatimaArshad-DS , in a very basic sense, the template is exactly what you would expect it to be. It is officially defined in Wikipedia as :

Wikimedia pages are embedded into other pages to allow for the repetition of information

Templates can be interpreted as prebuilt structures, where you can insert data against certain keys. There are templates for all sorts of things. For example, this is a template for emojis where by changing the internal values you can show different emojis on the webpage.
Pretty much the only thing you need to look for to identify a template is its Template namespace.

Apr 14 2022, 5:33 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 11 2022

Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Radhika_Saini , no, I am not using any external dataset. I created my own from the html dump.

Apr 11 2022, 8:16 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 10 2022

Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Appledora. Do you find the 1000 articles in one place or any database present?
or you find out individually?

Apr 10 2022, 8:22 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

Basically, I wanted to be flexible about what I can and cannot extract, and implemented the function accordingly. Otherwise, just using the default bs4 get_text() method should suffice. However, as you mentioned in your earlier comment, mwparserfromhell extracts more text than bs4. I wanted to remedy that in my custom implementation, hence the long way of iterating over tags, which is not perfect either tbh. I hope I understood your question properly this time :3
@Radhika_Saini

Apr 10 2022, 8:14 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 9 2022

Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Radhika_Saini, if you don't mind, this is what I did: I iterated over all the visible tags in the HTML and extracted text from them. Optionally, I also iterated over other page elements like templates and categories to extract text (if present) from them. My approach also extracts some stub information, which I couldn't omit tbh. Hope this helps.
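
A generic sketch of the "iterate over visible tags" idea, using only the standard library rather than bs4. The set of hidden tags is an assumption picked for this example, and `VisibleTextExtractor` is a hypothetical name, not the function from the task notebook:

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text that is not nested inside non-visible elements."""

    # Assumed set of containers whose text a browser does not render.
    HIDDEN_TAGS = {"script", "style", "head", "title"}

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._hidden_depth == 0:
            self.chunks.append(text)


def extract_visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><title>T</title><style>p{}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x;</script><p>Bye</p></body></html>"
)
print(extract_visible_text(sample))
```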

Apr 9 2022, 7:53 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@FatimaArshad-DS , you have to write a generic function because the later tasks ask you to work on more than one (at least 100) articles.

Apr 9 2022, 5:03 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 8 2022

Appledora added a comment to T301733: What's in a name? Task 1.

Hi @Mike_Peel , I am very very late to the party. But I have started getting familiar with WikiData and creating my page. My question is probably redundant and dumb, but I am curious about what's your expectation from this microtask. For our created pages, would you prefer it be structured and represented in a conventional Wikipedia article style? Or would you rather prefer a more descriptive page (something like a jupyter notebook) that would represent the creator's thought process and explorations? Thanks.

Apr 8 2022, 9:59 PM · Outreachy (Round 24)
Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@FatimaArshad-DS, hello. The HTML saved inside the HTML-dump is generated by an internal Wikipedia API (see Parsoid for reference) from the Wikitext code. This is why the generated HTML and the browser HTML are entirely different things. Hope this helps.

Apr 8 2022, 6:47 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

This definitely helps. I really apologize for being so redundant, and thanks for bearing with me.

Apr 8 2022, 11:15 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

Hi @MGerlach , just for the sake of clarification, recording contributions and making the final application are not the same, right? I know that contributions can be updated, something like a version control mechanism. But can we also edit our applications once we send them in? Thanks.

Apr 8 2022, 10:36 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 6 2022

Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Isaac and @MGerlach , should we mail the task notebook to both of you or either one would do it? Thanks.

Apr 6 2022, 3:16 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 5 2022

Appledora added a watcher for Bengali-Sites: Appledora.
Apr 5 2022, 6:52 AM

Apr 4 2022

Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Isaac , it seems parsing wikitext has a significantly long way to go to be accurate, so far :v

Apr 4 2022, 7:09 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Appledora added a watcher for good first task: Appledora.
Apr 4 2022, 5:40 AM

Apr 2 2022

Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@SamanviPotnuru and @Talika2002 , I personally did not quite get the relevance of Named External Links as an explanation for the question. I think NELs are basically those external links that have text in between the tags (e.g., "link to it", "related articles", etc.).
However, after digging around, I found these directives on what can be linked as external links here. This tells us that it is okay to add other wikiarticles as external links.

Apr 2 2022, 8:19 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Apr 1 2022

Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

Yes @Isaac , I think I got it more or less now. Thanks!

Apr 1 2022, 9:46 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Mar 31 2022

Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Isaac and @MGerlach , I am a little confused about the following TODO :

Are there features / data that are available in the HTML but not the wikitext?

What exactly should I be showing here? Code or just study references?
Similarly here,

are there certain words that show up more frequently in the HTML versions but not the wikitext? Why?

What do you mean by words here? Tags, attributes, patterns?

Mar 31 2022, 10:18 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Isaac @MGerlach ,

I don't understand what's going on in this <link> tag in the HTML code:

<link about="#mwt13" data-mw='{"parts":[{"template":{"target":{"wt":"Infobox Korean name\n","href":"./Template:Infobox_Korean_name"},"params":{"context":{"wt":"north"},"hangul":{"wt":""},"hanja":{"wt":""},"rr":{"wt":""},"mr":{"wt":""}},"i":0}}]}' href="./Category:Articles_needing_Korean_script_or_text#Chang%20Gum-chol" id="mwCg" rel="mw:PageProp/Category" typeof="mw:Transclusion"/>

What is this <link> tag doing? Is this linking a category? Or is it a template?
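
One way to explore the question is to parse the tag and look at its attributes. The sketch below uses only the standard library; `LinkInspector` is a hypothetical helper. Judging from the attributes alone, and stated here as a tentative reading rather than a confirmed answer, `rel` marks the element as a category link, while `typeof` and `data-mw` suggest that the link was emitted by a template transclusion:

```python
import json
from html.parser import HTMLParser

# The tag from the comment above, kept as a raw string so that the JSON
# escape "\n" inside data-mw survives unchanged.
LINK_HTML = r"""<link about="#mwt13" data-mw='{"parts":[{"template":{"target":{"wt":"Infobox Korean name\n","href":"./Template:Infobox_Korean_name"},"params":{"context":{"wt":"north"},"hangul":{"wt":""},"hanja":{"wt":""},"rr":{"wt":""},"mr":{"wt":""}},"i":0}}]}' href="./Category:Articles_needing_Korean_script_or_text#Chang%20Gum-chol" id="mwCg" rel="mw:PageProp/Category" typeof="mw:Transclusion"/>"""


class LinkInspector(HTMLParser):
    """Collect rel, typeof, href and template names from <link> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        info = {
            "rel": attr.get("rel"),
            "typeof": attr.get("typeof"),
            "href": attr.get("href"),
            "templates": [],
        }
        if attr.get("data-mw"):
            data_mw = json.loads(attr["data-mw"])
            for part in data_mw.get("parts", []):
                # Parts that are not dicts carry no template information.
                if isinstance(part, dict) and "template" in part:
                    info["templates"].append(part["template"]["target"]["wt"].strip())
        self.links.append(info)


inspector = LinkInspector()
inspector.feed(LINK_HTML)
inspector.close()
link = inspector.links[0]
print(link["rel"], link["typeof"], link["templates"])
```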

Mar 31 2022, 8:07 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Antima_Dwivedi Hi, I noticed you are having problems with downloading the notebook. I hope you're no longer facing it, but here's what I did.

  • Appended ?format=raw to the ipynb URL, which took me to a raw text page.
  • Right-clicked and selected Save as..., which saved the page as a .txt file.
  • Renamed the extension from .txt to .ipynb and uploaded it.
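
The first step can also be done programmatically. A small sketch that only builds the URL; the example address is a placeholder, not a real notebook, and `raw_notebook_url` is a hypothetical helper:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def raw_notebook_url(notebook_url):
    """Return notebook_url with format=raw set in its query string."""
    parts = urlsplit(notebook_url)
    query = dict(parse_qsl(parts.query))
    query["format"] = "raw"
    return urlunsplit(parts._replace(query=urlencode(query)))


print(raw_notebook_url("https://example.org/notebooks/task.ipynb"))
```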

Hope this helps.

Mar 31 2022, 10:03 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Mar 30 2022

Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

I went through the thread again and dug around about magic words more :D Thanks both of you!

Mar 30 2022, 7:26 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

@Isaac and @Talika2002 , I didn't quite get the question posed here. Could I kindly have some more examples/explanations on it?

Mar 30 2022, 1:47 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)
Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

That clears up a lot of things. Thanks, @Isaac !

Mar 30 2022, 3:48 AM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)

Mar 29 2022

Appledora added a comment to T302242: Outreachy Application Task (Round 24): Build Python library to work with html-dumps.

Thanks, @Isaac , for the explanations. But as you mentioned, and as I have discovered while working on the data, HTML does seem to have more content than the wikitext. Is that owing to the inner workings of the parser, i.e., mwparserfromhell, or is it actually the case? And once again, I really appreciate you bearing with me today.

Mar 29 2022, 7:06 PM · Research (FY2021-22-Research-April-June), Outreach-Programs-Projects, Outreachy (Round 24)