
GSoC 2026 Proposal Wikifile-Transfer Enhancement

Description

Project Title: Wikifile-Transfer: Batch Upload, History, Metadata Extraction & Testing
Project task: T415562
Proposal Link: Proposal
Organization: Wikimedia Foundation (Indic-TechCom)
Mentors: @ParasharSarthak @Jnanaranjan_sahu
Project Size: Large (~350 hours)


About Me

Contact Information

Goals and Qualifications

I am applying to GSoC with Wikimedia to scale production-level applications using Python (Flask/Celery) and React. My primary goal is to evolve Wikifile-Transfer into a robust, job-based batch transfer pipeline that reduces repetitive manual work for Wikimedia contributors across Indic-language communities.

I am a third-year undergraduate student at Indraprastha Institute of Information Technology Delhi (IIIT-Delhi). I have focused heavily on building fault-tolerant backend pipelines, async concurrency, and write-ahead logging. My background in distributed state durability aligns directly with the architectural challenges of this project: replacing volatile task queues with durable, crash-safe MySQL state machines.

Prior Contributions

To demonstrate my ownership of the codebase and technical readiness, I completed the prerequisite microtasks (T415715 and T415717) via PR #61. I treated these not merely as bug fixes, but as foundational groundwork for the batch-upload project:

  • Custom Exception Hierarchy: Created AppError, MediaWikiAPIError, and UploadError to enforce clean error boundaries.
  • Structured Logging: Implemented rotating, structured file-based logging (logging_config.py) to provide clear observability across request and worker processes.
  • API Standardization: Replaced broad, silent exception handlers with explicit catches and standardized JSON API responses.
  • Deterministic Cleanup: Built a safe cleanup primitive (_safe_remove_temp()) to guarantee temporary files are wiped from the Docker container on both success and failure paths.

These reliability improvements ensure the system does not leave orphaned temporary files or hide critical MediaWiki API failures as we scale to batch processing.


Project Abstract & Problem Statement

Existing System Overview

Currently, Wikifile-Transfer acts as an effective but highly constrained single-transfer utility.

  • In app.py, it handles exactly one upload at a time, branching into a synchronous path for files under 50 MB and a Celery queue path for larger ones.
  • In utils.py, get_localized_wikitext() uses mwparserfromhell to localize licensing templates via templatelist.py, but it completely bypasses category localization.
  • Crucially, the tool's state is ephemeral: nothing is persisted durably, and the frontend offers only limited visibility into task failures or past transfers.

The Problems

This architecture creates three major bottlenecks for Wikimedia contributors:

  1. The Repetition Problem: users transferring many non-free media files must repeat the same UI flow once per file.
  2. The Black Box Problem: without durable database persistence, users cannot easily view transfer history, diagnose failures, or resume interrupted workflows.
  3. The Metadata Problem: categories remain in the source language and require tedious, manual localization on the target wiki post-transfer.

Technical Approach & Architecture

Solution Architecture: A Job-Based Transfer Pipeline

Batch uploading should be implemented as a batched job pipeline rather than a single large HTTP request, to avoid timeouts and to allow per-file recovery. Inspired by Wikimedia extensions like UploadWizard and BatchUploadTools, I will evolve Wikifile-Transfer into a job-based transfer pipeline with three distinct layers:

  • Intake Layer (React Frontend & Flask API): The React frontend will handle file arrays, breaking the flow into review, submission, and progress states. /api/upload/batch will validate the request, write to the database, dispatch tasks, and return 202 Accepted.
  • Transfer Orchestration Layer (Celery): Celery remains the execution engine. For every file in a batch, a unique upload_image_task is triggered so a single failed file does not fail the entire batch.
  • Result Tracking Layer (MySQL/SQLAlchemy): A durable MySQL/SQLAlchemy layer will become the source of truth for transfer history, while Redis continues to serve Celery task execution.

Note: The legacy single-upload route will be refactored into a thin compatibility wrapper that calls this new batch service layer with a single-item array, preventing duplicated business logic.
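
As a rough illustration of the intake flow described above (validate, persist, dispatch, return 202 Accepted), here is a framework-free sketch. The `submit_batch` function, the `InMemoryStore` class, and the `dispatch` callable are hypothetical names used only for this example; in the real application the store would be the MySQL/SQLAlchemy layer and `dispatch` would presumably wrap `upload_image_task.delay`. The required field names are borrowed from the TransferJob model in this proposal.

```python
import itertools

# Field names borrowed from the TransferJob model in this proposal.
REQUIRED_FIELDS = ("srcUrl", "trproject", "trlang", "trfilename")


class InMemoryStore:
    """Hypothetical stand-in for the durable MySQL/SQLAlchemy layer."""

    def __init__(self):
        self.jobs = {}
        self._ids = itertools.count(1)

    def add_job(self, batch_id, item):
        job_id = next(self._ids)
        self.jobs[job_id] = {**item, "batch_id": batch_id, "status": "QUEUED"}
        return job_id


def submit_batch(payload, store, dispatch, batch_id=1):
    """Validate, persist, then dispatch one task per file.

    Returns a (response_body, http_status) pair.
    """
    files = payload.get("files")
    if not isinstance(files, list) or not files:
        return {"error": "files must be a non-empty list"}, 400
    for index, item in enumerate(files):
        if not isinstance(item, dict):
            return {"error": f"file {index} must be an object"}, 400
        missing = [key for key in REQUIRED_FIELDS if not item.get(key)]
        if missing:
            return {"error": f"file {index} is missing {missing}"}, 400

    # Persist every job before dispatching, so no task runs without a DB row.
    job_ids = [store.add_job(batch_id, item) for item in files]
    for job_id in job_ids:
        dispatch(job_id)  # assumption: upload_image_task.delay(job_id) in the real app
    return {"batch_id": batch_id, "job_ids": job_ids}, 202
```

Rejecting the whole request on the first invalid item is one possible policy; accepting the valid items and reporting the invalid ones individually would also be compatible with the per-file job model.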

The State Machine & Database Design

History and safe retry require durable state tracking in MySQL via SQLAlchemy. I will define two persistent tables, UploadBatch and TransferJob, with clear parent-child relationships and explicit status fields to track each batch and file-level transfer attempt.

Model 1: UploadBatch (The Parent)
  • Purpose: Represents a single user submission of N files.
  • States: PENDING, RUNNING, PARTIAL, COMPLETED, FAILED
  • Fields: id (PK), user_id (FK), created_at, total_files, status
  • Rule: A batch becomes PARTIAL when at least one file succeeds and at least one fails; it becomes FAILED only when all child jobs fail.

Model 2: TransferJob (The Child)
  • Purpose: Represents the lifecycle of an individual file transfer.
  • States: QUEUED, PROCESSING, SUCCESS, FAILED, RETRYING
  • Fields: id (PK), batch_id (FK), srcUrl, trproject, trlang, trfilename, status, error_log, target_url, retry_count
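
To make the parent/child status rule concrete, here is a minimal sketch of how a batch status could be derived from its child job statuses. The PARTIAL and FAILED rules come from the model description above; the rules for PENDING, RUNNING, and COMPLETED, and the choice to compute PARTIAL only once every job is terminal, are assumptions that would need to be agreed with mentors. `derive_batch_status` is a hypothetical helper name.

```python
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "QUEUED"
    PROCESSING = "PROCESSING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    RETRYING = "RETRYING"


class BatchStatus(str, Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    PARTIAL = "PARTIAL"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


TERMINAL = {JobStatus.SUCCESS, JobStatus.FAILED}


def derive_batch_status(job_statuses):
    """Roll child job statuses up into a single batch status."""
    statuses = [JobStatus(s) for s in job_statuses]  # accepts members or plain strings
    # Assumption: an empty or not-yet-started batch is PENDING.
    if not statuses or all(s is JobStatus.QUEUED for s in statuses):
        return BatchStatus.PENDING
    # Assumption: any non-terminal job keeps the batch RUNNING.
    if any(s not in TERMINAL for s in statuses):
        return BatchStatus.RUNNING
    succeeded = sum(s is JobStatus.SUCCESS for s in statuses)
    if succeeded == len(statuses):
        return BatchStatus.COMPLETED
    if succeeded == 0:
        return BatchStatus.FAILED  # only when all child jobs fail
    return BatchStatus.PARTIAL  # at least one success and at least one failure
```

Deriving the batch status from the jobs, rather than updating it independently, keeps the two tables from drifting apart after a retry.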

History Dashboard & Safe Targeted Retry

The React frontend will feature a new History Dashboard that queries the durable DB records. It will support filtering by status, expanding a parent batch to view per-file results, and linking directly to successful target wiki pages.
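
A small sketch of the response shape such a dashboard might consume: batches filtered by status, sliced with offset-based pagination, each carrying its nested per-file jobs. The function name `build_history_page`, the dictionary keys, and the default page size are all assumptions for illustration; a real implementation would likely push the filter and the LIMIT/OFFSET into the SQL query rather than slicing in Python.

```python
def build_history_page(batches, jobs, status=None, offset=0, limit=20):
    """Return one page of batches, each with its nested job records."""
    # Group jobs under their parent batch so the UI can expand a row.
    jobs_by_batch = {}
    for job in jobs:
        jobs_by_batch.setdefault(job["batch_id"], []).append(job)

    # Optional status filter, then offset-based pagination.
    selected = [b for b in batches if status is None or b["status"] == status]
    page = selected[offset:offset + limit]
    return {
        "total": len(selected),
        "offset": offset,
        "limit": limit,
        "batches": [{**b, "jobs": jobs_by_batch.get(b["id"], [])} for b in page],
    }
```

Returning `total` alongside the page lets the frontend render page controls without a second request.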

Targeted Retry Logic

If 10 files are submitted and 1 fails due to a network timeout, the user will see 9 completed jobs and 1 failure. Clicking Retry will trigger /api/retry/<job_id>. This endpoint is designed defensively:

  • It checks the database to ensure the job status is FAILED.
  • It increments the retry_count, resets the state to QUEUED, and re-dispatches the specific Celery task.
  • This ensures failed files can be retried without repeating successful uploads.
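
The three steps above can be sketched as a single conditional UPDATE, which also guards against a double-clicked Retry button re-dispatching the same job twice. This example uses the standard-library `sqlite3` module purely as a stand-in for MySQL; the table name `transfer_job`, the function `retry_job`, and the `dispatch` callable are hypothetical.

```python
import sqlite3


def retry_job(conn, job_id, dispatch):
    """Re-queue a job only if it is currently FAILED.

    Returns True when the job was reset and re-dispatched, False otherwise.
    """
    # The status check and the reset happen in one statement, so two
    # concurrent retry requests cannot both succeed.
    cursor = conn.execute(
        "UPDATE transfer_job "
        "SET status = 'QUEUED', retry_count = retry_count + 1 "
        "WHERE id = ? AND status = 'FAILED'",
        (job_id,),
    )
    conn.commit()
    if cursor.rowcount != 1:
        return False  # unknown job, or not in the FAILED state
    dispatch(job_id)  # assumption: re-dispatches the Celery task in the real app
    return True


def make_demo_db():
    """Build a tiny in-memory table for experimenting with retry_job."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transfer_job ("
        "id INTEGER PRIMARY KEY, "
        "status TEXT NOT NULL, "
        "retry_count INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute("INSERT INTO transfer_job (id, status) VALUES (1, 'FAILED')")
    conn.execute("INSERT INTO transfer_job (id, status) VALUES (2, 'SUCCESS')")
    conn.commit()
    return conn
```

In the Flask endpoint, a False result would map naturally to a 409 Conflict or 404 Not Found response, though the exact status codes are a design choice still to be made.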

Category Localization Strategy (Metadata Extraction)

Category localization will extend the existing get_localized_wikitext() pipeline rather than inventing a secondary parser. Crucially, the approach will be strictly best-effort and non-destructive:

  1. Parse: Extract source wikitext and detect [[Category:...]] tags using mwparserfromhell.
  2. Query: Resolve possible target-wiki category equivalents using MediaWiki API lookups, including langlinks when a translated category page exists.
  3. Replace: Swap the category node in the AST only when a matching localized category page exists on the target wiki.
  4. Fallback (Preserve): If no safe mapping exists, preserve the original category. This is safer than silently dropping metadata, as it retains structural categorization that local wiki admins can clean up later.
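
The Query, Replace, and Fallback steps can be illustrated without a parser by working on category titles directly. The sketch below assumes a MediaWiki `action=query&prop=langlinks` response requested with `formatversion=2`, where each page carries a `langlinks` list of objects with `lang` and `title` keys; that shape should be verified against the live API before relying on it. The function names and the sample data are invented for illustration, and title normalization reported by the API is not handled here.

```python
def build_category_map(api_response, target_lang):
    """Map source category titles to target-language titles from a langlinks response."""
    mapping = {}
    for page in api_response.get("query", {}).get("pages", []):
        for link in page.get("langlinks", []):
            if link.get("lang") == target_lang and link.get("title"):
                mapping[page["title"]] = link["title"]
                break
    return mapping


def localize_categories(categories, mapping):
    """Replace a category only when a mapping exists; otherwise preserve it."""
    return [mapping.get(title, title) for title in categories]


# Invented sample response: one category has a Hindi equivalent, one does not.
SAMPLE_RESPONSE = {
    "query": {
        "pages": [
            {
                "title": "Category:India",
                "langlinks": [{"lang": "hi", "title": "श्रेणी:भारत"}],
            },
            {"title": "Category:Obscure local topic"},
        ]
    }
}
```

In the actual pipeline the titles would come from the wikilink nodes that mwparserfromhell exposes, and the replacement would be written back into the same AST; this sketch only shows the decision rule.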

Scope Control

To ensure this 350-hour project remains highly feasible and focused on reliability, the following are out of scope:

  • A full Media Asset Management (MAM) interface
  • Cross-wiki file deduplication or SHA1 hash checking
  • Perfect AI or machine-translation of obscure, un-linked categories

Risks and Mitigations

Batch uploading and metadata manipulation inherently introduce new failure modes. My architecture mitigates these systematically:

Risk | Impact | Mitigation Strategy
API rate limiting and gateway timeouts | High | Sending many files simultaneously could trip Wikimedia API limits. Mitigation: Celery tasks will use bounded concurrency, request timeouts, and retry/backoff for MediaWiki API calls; the frontend will read status from the backend rather than waiting on a long-running request.
History lost due to Celery result expiry | Medium | Redis result backends are volatile. Mitigation: MySQL is the durable source of truth. Redis is used solely as a message broker.
Data loss from category mangling | High | Unsafe AST replacement could corrupt page formatting. Mitigation: the preserve-on-fallback rule ensures that if API:Langlinks returns no mapping, the AST node is left completely untouched.
System instability or silent failures | High | Uncaught exceptions during batching leave zombie jobs. Mitigation: custom exceptions (MediaWikiAPIError) and _safe_remove_temp() isolate failures, ensuring the database status correctly flips to FAILED and temporary files are wiped.
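
As an illustration of the retry/backoff mitigation for API calls, here is a minimal exponential backoff sketch with "full jitter" (each delay is drawn uniformly between zero and an exponentially growing cap). The helper names, the default limits, and the choice of retryable exceptions are assumptions. Celery's own task options such as `autoretry_for` and `retry_backoff` may well make a hand-written helper unnecessary at the task level.

```python
import random
import time


def backoff_delays(max_retries, base=1.0, cap=60.0, rng=random.random):
    """Yield one delay per retry: rng() scaled by min(cap, base * 2**attempt)."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_backoff(
    func,
    max_retries=4,
    base=1.0,
    cap=60.0,
    retryable=(TimeoutError, ConnectionError),
    sleep=time.sleep,
    rng=random.random,
):
    """Call func(), retrying on retryable errors with jittered exponential backoff."""
    delays = backoff_delays(max_retries, base=base, cap=cap, rng=rng)
    while True:
        try:
            return func()
        except retryable:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error to the caller
            sleep(delay)
```

Injecting `sleep` and `rng` keeps the helper deterministic under test, so the retry schedule can be asserted without real waiting.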

Timeline and Development Plan

This schedule is structured for a Large (~350 hour) project. It prioritizes the core backend orchestration first, so that by the midterm users can submit multiple files, each file gets its own tracked job, failures are visible, and temp cleanup is reliable. The second half is reserved for metadata localization, E2E testing, and hardening.

Community Bonding (May 1 – May 24, 2026)

  • Finalize the UploadBatch and TransferJob SQLAlchemy schema definitions with mentors.
  • Draft React UI wireframes for the History dashboard and Batch submission dropzone.
  • Agree on exact retry semantics and category-localization edge-case handling.
  • Ensure the existing GitHub Actions CI workflow is fully passing on the main branch.

Milestone 1: Batch Engine & Persistence (May 25 – June 28)

Goal: The core orchestration engine is built, and batch uploads are processed safely with durable state tracking.

  • Week 1 (May 25 - May 31): Initialize Flask-Migrate. Implement UploadBatch and TransferJob SQLAlchemy models. Build the /api/upload/batch REST route to accept arrays of file metadata.
  • Week 2 (Jun 1 - Jun 7): Refactor upload_image_task in tasks.py to accept a job_id. Implement database state transitions (QUEUED -> PROCESSING -> SUCCESS/FAILED). Update the error catcher to log specific AppError messages to the database error_log column.
  • Week 3 (Jun 8 - Jun 14): Update the React frontend. Modify the dropzone component to accept multiple files. Build a basic Batch Upload progress UI that polls the API for task status updates.
  • Week 4 (Jun 15 - Jun 21): Write backend Pytest unit tests for the batch API and Celery state transitions. Ensure _safe_remove_temp correctly triggers for all items in a batch on both success and failure paths. Refactor the legacy single-upload route to use the new batch service layer under the hood.
  • Week 5 (Jun 22 - Jun 28): Buffer week for debugging, mentor code reviews, and preparing the codebase for the Midterm Evaluation.

Milestone 2: History Dashboard & Targeted Retry (June 29 – July 26)

Goal: The system becomes a fully operational tracking tool with a persistent dashboard and targeted retry capabilities.

  • Week 6 (Jun 29 - Jul 5): Implement /api/history with offset-based pagination initially. Map DB relationships so queries for batches return nested TransferJob arrays efficiently.

Midterm Evaluations (July 6 – July 10): Mentors and contributors submit their midterm evaluations. The codebase is stable: users can reliably submit batch uploads, and state is tracked safely in MySQL.

  • Week 7 (Jul 6 - Jul 12): Build the React History tab. Implement frontend state filtering (All, Failed, Successful, In Progress) and expand/collapse logic for batch rows.
  • Week 8 (Jul 13 - Jul 19): Implement the /api/retry/<job_id> endpoint. Add strict backend validation to allow retries only for FAILED jobs. Add contextual Retry buttons to the React History UI.
  • Week 9 (Jul 20 - Jul 26): Write backend integration tests for the Retry flow. Ensure the retry_count increments correctly and the job seamlessly re-enters the Celery queue.

Milestone 3: Metadata Localization & Quality Hardening (July 27 – Aug 16)

Goal: Categories are safely localized via API lookups, and the CI/CD pipeline enforces >80% test coverage.

  • Week 10 (Jul 27 - Aug 2): Extend utils.py to parse [[Category:XYZ]] using mwparserfromhell. Implement batched API:Langlinks fetch logic and the safe-fallback preservation rule.
  • Week 11 (Aug 3 - Aug 9): Integrate the new category localization function into the Celery task pre-upload flow. Complete the Pytest suite for get_localized_wikitext(), heavily mocking MediaWiki API responses to test edge cases.
  • Week 12 (Aug 10 - Aug 16): Write Cypress End-to-End (E2E) tests. Simulate a multi-file upload, use Cypress network intercepts to force a 502 failure on one file, and assert that the UI displays a functional Retry button. Finalize GitHub Actions configuration for automated testing.

Final Evaluations (Aug 17 – Aug 24, 2026)

  • Week 13 (Aug 17 - Aug 24): Final buffer and UI polish. Write the remaining documentation (README updates, API docs, architecture notes), prepare for the Final Evaluation, and submit Final Project Materials.

Testing & Documentation Plan

Reliability is not a side quest; it is a strict requirement for batch systems that magnify the cost of hidden failures. Testing will be integrated continuously:

Backend (Pytest)

I will utilize unittest.mock.patch to simulate specific failure modes: download API 403/404, missing imageinfo, failed CSRF token fetch, upload response missing imageinfo, and cleanup after failure. Tests will cover:

  • AST manipulation correctness in utils.py (simulating both exact matches and missing translations from API:Langlinks)
  • Database state transitions, ensuring a MediaWikiAPIError correctly transitions a TransferJob to FAILED and captures the stack trace in the error_log
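
A minimal sketch of the second kind of test, using `unittest.mock` to force an API failure and asserting the resulting state. `MediaWikiAPIError` is redefined locally as a stand-in for the class in the codebase, and `run_transfer` is a hypothetical, simplified version of the task body; this version stores only the error message, whereas capturing a full stack trace would need `traceback.format_exc()` or similar.

```python
from unittest import mock


class MediaWikiAPIError(Exception):
    """Local stand-in for the exception class named in this proposal."""


def run_transfer(job, upload):
    """Simplified task body: move the job through PROCESSING to a terminal state."""
    job["status"] = "PROCESSING"
    try:
        job["target_url"] = upload(job["srcUrl"])
    except MediaWikiAPIError as exc:
        job["status"] = "FAILED"
        job["error_log"] = str(exc)
    else:
        job["status"] = "SUCCESS"
    return job


def test_api_error_marks_job_failed():
    # side_effect makes the mock raise instead of returning a value.
    upload = mock.Mock(side_effect=MediaWikiAPIError("missing imageinfo"))
    job = run_transfer({"srcUrl": "https://example.org/a.png"}, upload)
    assert job["status"] == "FAILED"
    assert job["error_log"] == "missing imageinfo"
    upload.assert_called_once_with("https://example.org/a.png")


def test_success_records_target_url():
    upload = mock.Mock(return_value="https://example.org/wiki/File:a.png")
    job = run_transfer({"srcUrl": "https://example.org/a.png"}, upload)
    assert job["status"] == "SUCCESS"
    assert job["target_url"] == "https://example.org/wiki/File:a.png"
```

Against the real code, `mock.patch` would replace the MediaWiki client call inside the task module instead of passing the upload callable in as an argument.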

Frontend (Cypress)

I will write automated E2E tests simulating the complete user journey:

  • Mocking an OAuth session
  • Submitting 3 files via the batch dropzone
  • Intercepting the network to force 1 upload to fail
  • Asserting that the History UI correctly displays 2 successes, 1 failure, and a clickable Retry button
  • Clicking Retry and asserting the state updates to PROCESSING

Documentation

I will update README.md to reflect the new batch capabilities. I will also write a brief developer guide explaining the state machine, how to extend the TransferJob schema, and how to run the Cypress/Pytest suites locally.


Related Work and Inspiration

This proposal is informed by Wikimedia’s existing upload ecosystem. These projects motivated the batch/job/history design, but the implementation plan above is an original, clean-room extension of the existing Wikifile-Transfer codebase rather than a port or copy.

Reference inspiration used for design concepts only:

  • Wikimedia UploadWizard
  • ProfessionalWiki SimpleBatchUpload
  • lokal-profil BatchUploadTools
  • JeanFred MassUploadLibrary
  • Wikimedia Commons Android App

How these references shaped the architecture:

  • UploadWizard and SimpleBatchUpload guided the decision to make batch upload a multi-step workflow rather than a single form submission.
  • BatchUploadTools reinforced the backend orchestration model with a central coordinator to handle repeated uploads, logging, and state.
  • MassUploadLibrary provided the conceptual precedent for treating metadata as a first-class concern, justifying the separation of category localization from upload execution.

Event Timeline

@Aklapper, could you add this as a subtask of the Wikifile-Transfer enhancement project so that the mentors can see it?


Hi, thanks for submitting your GSoC 2026 project proposal with Wikimedia!

Please make sure you’ve also submitted your proposal on the official Summer of Code website: https://summerofcode.withgoogle.com. The deadline for both submission and any edits is the same, so ensure everything is finalized before March 31, 18:00 UTC, as changes won’t be possible after that.

We strongly recommend completing any updates at least 30 minutes before the deadline to avoid last-minute glitches or unexpected technical issues.

Wishing you all the best for your application. Hope to see you as part of the program soon! 🚀

Yes, I have submitted my proposal on the GSoC platform as well. I really appreciate the support and am looking forward to contributing to Wikimedia!