Development plan for T386128: Outreachy 30: Addressing the new Lusophone technological wishlist proposals
Profile
- Name: Nisha Chandila
- Email: nishachandila21@gmail.com
- GitHub: https://github.com/NishaChandila
- Mentors: @Ederporto, @Arcstur
Problem Statement
Many Wikipedia pages contain dead links—references that no longer work because the original webpage has been deleted, moved, or altered. This creates several challenges:
- Editors and readers rely on citations to verify facts, but broken links make it difficult to trust information.
- No easy way exists to find all pages with broken links on Portuguese Wikipedia.
- Editors waste time manually searching for broken citations, slowing down their work.
- Readers lose confidence in Wikipedia’s accuracy when they encounter dead links.
Proposed Solution
Step 1: Research and Preparation
Understanding the MediaWiki API
To effectively scan Wikipedia for broken links, we will use the MediaWiki API to retrieve articles that contain external citations, specifically those using citation templates like:
- {{Cite web}}
- {{Cite book}}
- {{Cite journal}}
The tool will extract the external URLs embedded in these citation templates and check each one for errors.
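As a rough illustration, the sketch below uses the API's `list=embeddedin` module to find articles that transclude a given citation template. The endpoint is Portuguese Wikipedia's; the template title ("Predefinição:Citar web", assumed here to be the local counterpart of {{Cite web}}) and the User-Agent string are placeholders to confirm during research.

```python
import itertools
import requests

API_URL = "https://pt.wikipedia.org/w/api.php"  # Portuguese Wikipedia endpoint
HEADERS = {"User-Agent": "lusophone-link-checker/0.1 (prototype)"}  # placeholder UA

def pages_embedding_template(template_title):
    """Yield titles of main-namespace pages that transclude a template,
    following the API's continuation parameters between batches."""
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template_title,
        "einamespace": 0,   # article namespace only
        "eilimit": 50,
        "format": "json",
    }
    while True:
        data = requests.get(API_URL, params=params, headers=HEADERS, timeout=30).json()
        for page in data["query"]["embeddedin"]:
            yield page["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow API pagination

# Print the first ten articles that use the (assumed) local {{Cite web}}.
for title in itertools.islice(pages_embedding_template("Predefinição:Citar web"), 10):
    print(title)
```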
Analyzing Common Citation Issues
Our tool will identify and categorize common citation issues, including:
- Dead links (404 errors): Referenced webpage no longer exists.
- Redirected links: URLs that have changed and may now point to different or unintended content.
- Timeout errors: Websites that fail to load, indicating potential accessibility issues or server downtime.
- Incomplete citations: Missing essential details like the publication date, author name, or other critical information needed for full citation accuracy.
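To keep these categories consistent between the checking code and the reports, the tool could represent each finding with a small record type. The field and category names below are illustrative, not final:

```python
from dataclasses import dataclass
from enum import Enum

class IssueType(Enum):
    DEAD_LINK = "dead link"            # e.g. HTTP 404/410
    REDIRECTED = "redirected"          # URL now resolves elsewhere
    TIMEOUT = "timeout"                # server failed to respond in time
    INCOMPLETE = "incomplete citation" # missing date, author, etc.

@dataclass
class CitationIssue:
    article: str        # Wikipedia article title
    url: str            # cited URL ("" for purely incomplete citations)
    issue: IssueType
    detail: str = ""    # e.g. final redirect target or missing field names
```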
Step 2: Prototype Development
Building a Link Checker
The script will:
- Retrieve articles containing external citations.
- Check each link’s status to determine if it is active or broken.
- Identify errors such as 404 Not Found, 403 Forbidden, timeout failures, and redirects.
- Flag missing or incomplete citation details for editors to review.
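A first prototype can be as small as the sketch below: one function that fetches each URL with `requests` and reports its HTTP status or failure mode. The example URLs are placeholders, not real citations.

```python
import requests

def check_status(url, timeout=10):
    """Return the HTTP status code for a URL, or a short error label."""
    try:
        return requests.get(url, timeout=timeout, allow_redirects=True).status_code
    except requests.Timeout:
        return "timeout"
    except requests.RequestException as exc:
        return type(exc).__name__  # e.g. ConnectionError

# Example run over hand-picked placeholder URLs:
for url in ["https://example.com/", "https://example.com/missing-page"]:
    print(url, "->", check_status(url))
```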
Why Not Use Existing Solutions?
While tools like InternetArchiveBot and the German Broken Link Script exist, they have limitations for Portuguese Wikipedia. This project aims to develop a dedicated tool that:
- Specifically addresses the needs of the Portuguese Wikipedia community.
- Improves accuracy in detecting and reporting broken links.
- Provides a structured and user-friendly report for editors, reducing manual effort.
Step 3: Testing and Feedback
Testing on Wikipedia Articles
The tool will be tested on 100+ Wikipedia articles to evaluate:
- Detection accuracy: confirming that broken links are correctly identified.
- Error classification: categorizing different types of link failures (404, redirects, timeouts, etc.).
- Performance: measuring how efficiently the script processes multiple articles.
Gathering Feedback from Editors
Once reports are generated, they will be shared with Wikipedia editors—particularly in the Portuguese Wikipedia community—to:
- Assess usability and clarity of the reports.
- Identify false positives or missed broken links.
- Suggest improvements for better integration into Wikipedia’s workflow.
Based on feedback, the tool will be refined and optimized to ensure it effectively meets the needs of editors.
Step 4: Developing the Link Checker for Portuguese Wikipedia
Script Development Approach
Extract Relevant Articles
- Use the MediaWiki API to retrieve articles that contain external citations.
- Focus on citation templates such as {{Cite web}}, {{Cite book}}, and {{Cite journal}}.
- Query the Portuguese Wikipedia endpoint (pt.wikipedia.org) directly so that only Portuguese-language articles are retrieved.
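For a single article, retrieving the current wikitext might look like this (using `prop=revisions` with `formatversion=2`; the article title in the usage line is an arbitrary example):

```python
import requests

API_URL = "https://pt.wikipedia.org/w/api.php"

def fetch_wikitext(title):
    """Fetch the latest wikitext of one Portuguese Wikipedia article."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": 2,   # cleaner JSON: pages come back as a list
    }
    data = requests.get(API_URL, params=params, timeout=30).json()
    page = data["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]

print(fetch_wikitext("Lisboa")[:200])  # arbitrary example article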
Parse and Extract Links
- Use mwparserfromhell to process and analyze Wikipedia’s wikitext.
- Extract external URLs from citation templates while ensuring proper formatting.
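A sketch of that parsing step, assuming the tool matches templates by name; the Portuguese template names in the set are guesses to verify against pt.wikipedia's actual template titles:

```python
import mwparserfromhell

# Lower-cased template names to match; the Portuguese entries are assumptions.
CITATION_TEMPLATES = {"cite web", "cite book", "cite journal",
                      "citar web", "citar livro", "citar periódico"}

def extract_citation_urls(wikitext):
    """Return url= parameter values from citation templates in wikitext."""
    urls = []
    code = mwparserfromhell.parse(wikitext)
    for template in code.filter_templates():
        name = str(template.name).strip().lower()
        if name in CITATION_TEMPLATES and template.has("url"):
            urls.append(str(template.get("url").value).strip())
    return urls
```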
Check Link Status
- Use Requests and urllib to send HTTP requests and analyze response codes.
- Identify broken links based on error codes such as 404 Not Found, 403 Forbidden, and timeout failures.
- Detect redirects and validate if they still point to relevant content.
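One way the status check could map responses to the categories from Step 1; the HEAD-then-GET fallback and the exact labels are design assumptions, not a final scheme:

```python
import requests

def classify_link(url, timeout=15):
    """Classify a cited URL as ok / dead / forbidden / redirected / timeout / error."""
    headers = {"User-Agent": "lusophone-link-checker/0.1 (prototype)"}  # placeholder UA
    try:
        # HEAD is cheaper; some servers reject it, so fall back to GET.
        response = requests.head(url, headers=headers,
                                 timeout=timeout, allow_redirects=True)
        if response.status_code in (405, 501):
            response = requests.get(url, headers=headers,
                                    timeout=timeout, allow_redirects=True)
    except requests.Timeout:
        return "timeout"
    except requests.RequestException:
        return "error"          # DNS failure, connection refused, etc.
    if response.status_code in (404, 410):
        return "dead"
    if response.status_code == 403:
        return "forbidden"
    if response.history:        # at least one redirect was followed
        return "redirected"
    return "ok" if response.ok else f"http_{response.status_code}"
```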
Enhance Detection with Web Scraping
- Integrate BeautifulSoup to further analyze webpage content.
- Identify soft 404 errors (pages that exist but indicate missing content).
- Verify if redirected links match the original citation context.
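A heuristic along these lines could flag soft 404s; the phrase list is a placeholder that would need tuning against real Portuguese-language error pages:

```python
import requests
from bs4 import BeautifulSoup

# Phrases that often signal a "soft 404" page (illustrative, not exhaustive).
SOFT_404_PHRASES = ["página não encontrada", "page not found",
                    "conteúdo indisponível", "404"]

def looks_like_soft_404(url, timeout=15):
    """Fetch a page that returned HTTP 200 and scan its title and first
    heading for wording that suggests the content is actually gone."""
    response = requests.get(url, timeout=timeout)
    soup = BeautifulSoup(response.text, "html.parser")
    candidates = [soup.title.get_text() if soup.title else ""]
    h1 = soup.find("h1")
    if h1:
        candidates.append(h1.get_text())
    text = " ".join(candidates).lower()
    return any(phrase in text for phrase in SOFT_404_PHRASES)
```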
Store and Format Results
- Save detected issues in CSV or JSON format for structured data storage.
- Generate an HTML report for easy review by Wikipedia editors.
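Persisting the findings is straightforward with the standard library; the column names below mirror the illustrative `CitationIssue` fields from Step 1 and assume the issues arrive as a list of dicts:

```python
import csv
import json

def save_results(issues, csv_path="broken_links.csv", json_path="broken_links.json"):
    """Write the detected issues (a list of dicts) to CSV and JSON."""
    fieldnames = ["article", "url", "issue", "detail"]
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(issues)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(issues, f, ensure_ascii=False, indent=2)
```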
Automation and Continuous Updates
- Use Jupyter Notebook for testing and iterative improvements.
- Host and maintain the script on GitHub, allowing community contributions and updates.
- Explore integration with Wikimedia bots to automate periodic checks and updates.
Leveraging Existing Tools and Custom Enhancements
While existing solutions like InternetArchiveBot focus on archiving, our tool will specifically detect and report broken links on Portuguese Wikipedia. It builds on:
- MediaWiki API for article retrieval
- mwparserfromhell for template parsing
- Requests and urllib for HTTP request handling
- BeautifulSoup for HTML analysis
- CSV, JSON, and HTML for structured reporting
Together, these components make the solution efficient and adaptable to the unique needs of the Portuguese Wikipedia community.
Step 5: Tools, Technologies, and Strategic Alignment
Tools and Technologies Used
- Programming Language: Python
- Wikipedia API and Parsing: MediaWiki API, mwparserfromhell
- Web Request and Link Checking: Requests, urllib, BeautifulSoup
- Data Storage and Reporting: CSV, JSON, HTML report generation
- Testing and Deployment: Jupyter Notebook, GitHub
Connection to Wikimedia’s Strategy
This project aligns with Wikimedia’s strategic goals by enhancing citation maintenance and improving Wikipedia’s reliability:
- Better Editing for Everyone (Recommendation 2): Simplifies the process of finding and fixing broken links, allowing editors to focus on improving content rather than manually checking citations.
- Stronger, Smarter Knowledge (Recommendation 9): Improves citation quality, ensuring Wikipedia remains a trustworthy source of information.
- Empowering Smaller Communities (Lusophone Wishlist): Addresses a critical need for Portuguese Wikipedia, where existing tools like InternetArchiveBot have limitations.
Expected Outcomes and Benefits
- For Editors: A clear, automated report that highlights broken or incomplete citations, enabling faster and more efficient corrections.
- For Readers: Strengthened trust in Wikipedia’s reliability through properly maintained references.
- For the Wikimedia Community: A proactive approach to maintaining citation quality, reducing manual errors, and keeping content up to date.
Project Timeline for the Script Development
June
Week 1 (June 3 - June 7)
- Research MediaWiki API and citation templates ({{Cite web}}, {{Cite book}}, {{Cite journal}}).
- Study how Wikipedia stores external links.
- Analyze existing tools like InternetArchiveBot and the German Broken Link Script.
Week 2 (June 10 - June 14)
- Experiment with MediaWiki API to extract external citations.
- Set up Python environment with necessary libraries (requests, mwparserfromhell, BeautifulSoup).
- Write initial script to retrieve Wikipedia articles with citations.
Week 3 (June 17 - June 21)
- Develop parsing logic using mwparserfromhell to extract URLs from citation templates.
- Implement HTTP request handling to check link statuses (404, 403, timeouts, etc.).
- Categorize errors and create a basic logging mechanism.
Week 4 (June 24 - June 28)
- Optimize the script to improve efficiency and reduce unnecessary API calls.
- Start working on storing results in structured formats (CSV, JSON).
- Conduct preliminary tests on a small Wikipedia dataset.
July
Week 5 (July 1 - July 5)
- Implement a web scraping mechanism using BeautifulSoup to detect soft 404 errors.
- Develop logic for identifying redirected links and verifying if they are still relevant.
- Begin refining error classification (distinguishing between temporary and permanent issues).
Week 6 (July 8 - July 12)
- Create an HTML-based report format for easy review by Wikipedia editors.
- Improve data visualization (tables, color-coding broken links).
- Run broader tests on at least 100 Wikipedia articles.
Week 7 (July 15 - July 19)
- Gather feedback from a few Wikipedia editors to refine report clarity.
- Optimize script performance to handle larger datasets efficiently.
- Explore GitHub integration for community contributions.
Week 8 (July 22 - July 26)
- Implement automation features (Jupyter Notebook for testing, scheduling periodic link checks).
- Document code and create README for future maintainers.
- Finalize version 1 of the tool.
August
Week 9 (July 29 - August 2)
- Conduct final round of testing with a focus on error detection accuracy.
- Fix any major issues found during feedback/testing.
Week 10 (August 5 - August 9)
- Deploy the tool and share it with the Portuguese Wikipedia community.
- Gather final feedback and document any future improvements.
Week 11 (August 12 - August 16)
- Wrap up documentation and finalize the project report.
- Submit the tool and report for review.