
GSoC 2026: Programs & Events Dashboard system-wide metrics and data downloads
Status: Closed, Declined (Public)

Description

Profile Information

Name: Arpit Tripathi
Slack Handle: nexpectArpit
Email: arpittripathiayo@gmail.com
Phone No.: +918707424119
Github: nexpectArpit
Linkedin: arpit tripathi
Location: India (UTC+5:30)
Working hours: 8:30 AM to 8:30 PM UTC


Synopsis

The Programs & Events Dashboard collects data on article edits, article creations, uploads, and participation, and turns it into metrics that are used to assess the impact of programs. The system already handles this data at a smaller scale: contributions are processed in stages and converted into reusable metrics, which are displayed on dashboards for individual courses and campaigns. This makes it easy to monitor activity and performance per course or campaign. However, accessing data at a system-wide level is still difficult. Metrics are available in different parts of the system, but there is no centralised way to retrieve or download aggregated data across all programs.
This project addresses the problem of accessing that data in a centralised way. It will enable structured access to aggregated data and provide improved ways to retrieve and export it. The solution will build on the existing architecture while making the system more consistent, scalable, and easier to use for analysing overall impact.

Estimated Size: 350 hours

Possible mentors: @Ragesoss and @Abishekdascs


Understanding of the Current System

Overview of the Dashboard

The Programs & Events Dashboard is used to manage and track different types of programs such as courses, campaigns, and edit-a-thons. It collects contribution data from Wikimedia projects and organizes it so that user activity and program impact can be measured. The system is mainly structured around entities like courses, campaigns, users, and articles. A course acts as the main unit where contributions are tracked, while campaigns group multiple courses together. Users represent contributors, and articles represent the content being edited. These entities are connected through relational models such as courses_users and articles_courses, which also store cached metrics like character counts, revision counts, and references added.

Data Flow in the System

The data flow starts with fetching contribution data such as revisions, edits, uploads, and article details from MediaWiki APIs. This raw data is then processed through backend services, mainly through the UpdateCourseStats pipeline.
One important concept used here is the timeslice-based approach. Instead of processing all data together, the system breaks contributions into smaller time-based segments using models like ArticleCourseTimeslice, CourseWikiTimeslice, and CourseUserWikiTimeslice. These timeslices store partial metrics for specific time ranges, which are later aggregated. After processing, the aggregated values are stored in models like Course, ArticlesCourses, and CoursesUsers as cached fields. This reduces repeated computation. Background workers like UpdateCourseWorker and DailyUpdateWorker are used to keep this data updated over time.
Finally, this processed data is exposed through controllers and used by the frontend to display dashboards and statistics.
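
To make the timeslice idea concrete, here is a minimal plain-Ruby sketch of summing partial per-timeslice metrics into one cached total per course. The `Timeslice` struct and `aggregate_timeslices` helper are hypothetical stand-ins for illustration; the real models are ActiveRecord classes and the actual aggregation logic in the Dashboard may differ.

```ruby
# Hypothetical stand-in for a timeslice row (the real models are ActiveRecord classes).
Timeslice = Struct.new(:course_id, :start_time, :character_sum, :revision_count)

# Sum the partial metrics of every timeslice into one total per course,
# which is the kind of value that would then be cached on the course record.
def aggregate_timeslices(timeslices)
  timeslices.group_by(&:course_id).transform_values do |slices|
    {
      character_sum: slices.sum(&:character_sum),
      revision_count: slices.sum(&:revision_count)
    }
  end
end

slices = [
  Timeslice.new(1, '2026-01-01', 120, 2),
  Timeslice.new(1, '2026-01-02', 80, 1),
  Timeslice.new(2, '2026-01-01', 300, 4)
]
cached_totals = aggregate_timeslices(slices)
puts cached_totals.inspect
```

Because each timeslice only covers a bounded time range, a change inside one range only requires that slice to be recomputed before the totals are summed again.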

How Metrics are Computed

Currently the system uses a precomputed and cached approach for metrics computation. Instead of calculating metrics directly from raw revisions every time, it processes data through services and stores summarized values. The main pipeline is handled by UpdateCourseStats, which updates timeslices and then refreshes cached values in different models. Higher-level aggregation is done through components like CourseStatistics, MonthlyReport, and CoursesPresenter, which combine data across multiple courses or campaigns. Metrics include values like total edits, characters added, articles created, references added, uploads, and page views. These are computed using a mix of timeslice aggregation and SQL-based summations. Caching is applied using Rails cache and Redis, especially at the campaign level, to avoid expensive repeated queries. Even though this setup is efficient, the computation logic is spread across multiple services and is not centralized.

Existing Data Access and Export System

The system provides access to data through various JSON endpoints such as course-level and campaign-level APIs. These endpoints are used by the frontend to display metrics and statistics. For data downloads, there is an existing CSV export system implemented using ReportsController and multiple CSV builder classes. Exports are available for courses and campaigns, including data like students, articles, uploads, and assignments. These files are generated asynchronously using ReportCsvWorker and stored for download.
There is also a system-wide export available, such as all_courses_and_instructors_csv, but it is limited and restricted to admin usage. There is no dedicated API or flexible mechanism for retrieving system-wide aggregated metrics. Most exports are static and not designed for broader analysis or reuse.


Proposed Solution and Implementation Approach

Starting from the existing metrics pipeline

I will begin by understanding and relying on the current metrics pipeline instead of modifying it. Right now the system processes data through ArticleCourseTimeslice, CourseWikiTimeslice, and CourseUserWikiTimeslice, and then aggregates everything through CourseCacheManager. Because of this, the courses table already stores precomputed values like character_sum, view_sum, user_count, and article_count. This becomes the starting point of the implementation.
Instead of working with raw revisions or timeslices again, I will directly use these cached fields from courses. All further steps in the implementation will depend on this decision. Aggregation will always be done on Course.nonprivate so that private data is excluded from system-wide results.

Introducing a snapshot layer on top of existing data

Once the data source is clear (the courses table), the next step is to avoid recomputing system-wide metrics repeatedly. For this, I will introduce a snapshot layer. I will create a new table system_stats which will store precomputed system-wide metrics. Instead of calculating metrics on every request, I will compute them once and store them as a snapshot. This snapshot becomes the single source of truth for system-wide data.
The metrics will be stored as a serialized hash so that new fields can be added later without schema changes. This follows the same pattern used by CourseStat, which stores Wikidata metrics in a serialized stats_hash field. This step prepares the system for fast reads, but at this point the snapshot is still empty, so the next step is to build how this data gets populated.
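
As a rough illustration of why a serialized hash avoids schema changes, the sketch below round-trips a metrics hash through a text value. The helpers `dump_metrics` and `load_metrics` are hypothetical, and JSON is used here only as a simple stand-in; the encoding the Dashboard actually uses for serialized columns may be different.

```ruby
require 'json'

# Hypothetical helper: encode a metrics hash as text for a single text column.
def dump_metrics(metrics)
  JSON.generate(metrics)
end

# Hypothetical helper: decode the stored text; a missing value becomes an empty hash.
def load_metrics(text)
  return {} if text.nil? || text.empty?

  JSON.parse(text, symbolize_names: true)
end

stored = dump_metrics({ total_characters: 1_000, total_views: 50 })
metrics = load_metrics(stored)

# A metric added in a later release needs no migration:
# older snapshots simply lack the key, so readers supply a default.
puts metrics.fetch(:total_uploads, 0)
```

The trade-off is that individual metrics cannot be indexed or filtered in SQL, which is acceptable when snapshots are only ever read whole.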

# db/migrate/YYYYMMDD_create_system_stats.rb
create_table :system_stats do |t|
  t.text :metrics_data       # serialized hash, same pattern as CourseStat
  t.datetime :computed_at, null: false
  t.timestamps
end
add_index :system_stats, :computed_at

Building the aggregation layer that fills the snapshot

After the snapshot table is ready, I will implement the logic that actually computes and fills it. For this, I will create a service called SystemMetricsAggregator. This service will follow the same pattern used in CoursesPresenter, especially the COURSE_SUMS_SQL query. Since all required values already exist as cached columns in the courses table, aggregation will be done using a single SQL query with multiple SUM operations on Course.nonprivate. This reduces the entire system-wide computation to a single SQL query with no JOINs, which I expect to complete in under 50ms even on large datasets; this figure is an estimate that will be verified against realistic data.
So the flow becomes:

  • read from courses (cached metrics)
  • run aggregation query (SUM)
  • store result into system_stats

Once this step is completed, the system has a working way to compute system-wide metrics, but it still needs to be triggered at the right time.

# lib/analytics/system_metrics_aggregator.rb
class SystemMetricsAggregator
  # The column order here determines the index of each value in `totals` below.
  SYSTEM_SUMS_SQL = <<~SQL.squish
    SUM(character_sum), SUM(view_sum), SUM(user_count),
    SUM(article_count), SUM(new_article_count), SUM(revision_count),
    SUM(references_count), SUM(upload_count), COUNT(*)
  SQL

  # Runs one aggregate query over non-private courses and stores a new snapshot.
  def self.compute_and_store
    totals = Course.nonprivate.pick(Arel.sql(SYSTEM_SUMS_SQL))
    SystemStat.create!(
      metrics_data: { total_characters: totals[0], total_views: totals[1], ... },
      computed_at: Time.zone.now
    )
  end
end

Connecting aggregation with the update cycle

After building the aggregator, the next step is to ensure that it runs automatically with fresh data. I will introduce a background worker called SystemMetricsUpdateWorker. This worker will be triggered after DailyUpdateWorker finishes, so that all course-level updates (timeslices → caches → courses table) are already completed before system-wide aggregation runs.
This creates a proper pipeline:

  • DailyUpdateWorker updates courses
  • SystemMetricsUpdateWorker aggregates system-wide data
  • system_stats gets updated snapshot

At this point, the backend pipeline for generating system-wide metrics becomes complete and consistent.

# app/workers/daily_update_worker.rb (modified)
def perform
  DailyUpdate.new
  SystemMetricsUpdateWorker.perform_async  # trigger system snapshot
end

Exposing the computed data through API and exports

Once the snapshot is being generated correctly, the next step is to make this data accessible. I will add a new endpoint, /system_metrics.json. This endpoint will return the latest snapshot from system_stats. To avoid repeated database queries, I will cache the response using Redis with a fixed TTL. After this, I will extend the existing CSV export system. I will create a SystemCsvBuilder which will iterate through Course.nonprivate using find_each(batch_size: 1000) to handle large datasets safely. This follows the same batching approach already used in AllCoursesAndInstructorsCsvBuilder. CSV generation will be handled by ReportCsvWorker, following the same asynchronous pattern already used in the system.
So at this stage:

  • API provides structured data
  • CSV provides downloadable data
  • both rely on the same underlying system

# lib/analytics/system_csv_builder.rb
def generate_csv
  CSV.generate do |csv|
    csv << CSV_HEADERS
    Course.nonprivate.find_each(batch_size: 1000) do |course|
      csv << [course.slug, course.character_sum, course.view_sum, ...]
    end
  end
end

Adding frontend support after backend is stable

After the backend pipeline (aggregation → snapshot → API → export) is stable, I will add a frontend component. I will reuse existing components like <OverviewStat /> and patterns from CampaignStats so that the UI remains consistent.
The component will:

  • fetch data from /system_metrics.json
  • display aggregated metrics
  • provide a download button for CSV export

Since the backend is already complete at this point, the frontend becomes a thin layer on top of it.

Testing, validation, and performance checks

Once all parts are connected, I will validate the entire flow step by step.
I will write tests for:

  • SystemMetricsAggregator
  • API endpoint (/system_metrics.json)
  • SystemCsvBuilder
  • SystemMetricsUpdateWorker

I will ensure that:

  • aggregation runs in a single query (no N+1)
  • find_each keeps memory usage stable
  • private courses are excluded correctly
  • edge cases like empty datasets and null values are handled
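
One edge case worth spelling out: an SQL SUM over zero rows returns NULL, which arrives in Ruby as nil, so a snapshot computed on an empty dataset would otherwise store nil values. A minimal sketch of normalizing the result row follows; `METRIC_KEYS` and `normalize_totals` are hypothetical names, and the key list is shortened for illustration.

```ruby
# Hypothetical key order, assumed to match the order of the aggregate columns.
METRIC_KEYS = %i[total_characters total_views total_users].freeze

# Map the row returned by the aggregate query to a hash,
# treating a missing row or NULL (nil) values as 0.
def normalize_totals(row)
  values = Array(row)
  METRIC_KEYS.each_with_index.to_h { |key, index| [key, values[index].to_i] }
end

puts normalize_totals([nil, 5, nil]).inspect
puts normalize_totals(nil).inspect
```

Using `to_i` also converts decimal results from the database driver into plain integers, which keeps the stored hash simple.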

I will also verify performance on large datasets to ensure the system scales properly. Key risks include memory pressure during CSV export of large datasets (mitigated by find_each batching) and potential cache staleness (mitigated by Redis TTL and a manual refresh option similar to CampaignsController#refresh_stats).
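
The staleness mitigation (a fixed TTL plus a manual refresh) can be sketched in plain Ruby as below. `TtlCache` is a hypothetical in-memory stand-in used only to show the behaviour; in the Dashboard this role would be played by the existing Rails cache backed by Redis.

```ruby
# Hypothetical in-memory cache with expiry, standing in for a Redis-backed cache.
class TtlCache
  # clock is injectable so the expiry behaviour can be tested without sleeping.
  def initialize(ttl_seconds, clock: -> { Time.now })
    @ttl = ttl_seconds
    @clock = clock
    @value = nil
    @expires_at = nil
  end

  # Return the cached value; recompute it with the block once the TTL has passed.
  def fetch
    now = @clock.call
    if @expires_at.nil? || now >= @expires_at
      @value = yield
      @expires_at = now + @ttl
    end
    @value
  end

  # Manual refresh: drop the entry so the next fetch recomputes it.
  def expire!
    @expires_at = nil
  end
end

cache = TtlCache.new(3600)
snapshot = cache.fetch { { total_views: 50 } } # computed on first call
snapshot = cache.fetch { { total_views: 99 } } # served from cache within the TTL
puts snapshot.inspect
```

With this shape, a snapshot is at most one TTL out of date, and an admin-triggered refresh only needs to expire the entry.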

Final Execution Flow

The implementation follows a strict build order:

  1. Start from existing cached metrics in courses
  2. Add system_stats as a snapshot layer
  3. Build SystemMetricsAggregator to compute totals
  4. Connect it with DailyUpdateWorker via SystemMetricsUpdateWorker
  5. Expose data through API and CSV export
  6. Add frontend visualization on top
  7. Validate everything through testing and performance checks

Each step builds on the previous one, and the system becomes usable progressively instead of all at once.

Diagram 1: System Architecture Overview

[Image: Screenshot 2026-03-22 at 11.32.05 AM.png]

Diagram 2: Backend Data Pipeline and System-wide Aggregation Flow

This diagram shows how contribution data is processed into timeslices and aggregated into cached metrics in the courses table through ArticlesCourses, CoursesUsers, and CourseCacheManager.

[Image: Screenshot 2026-03-30 at 12.59.09 AM.png]

Diagram 3: CSV Export and Asynchronous Processing Flow

This diagram explains how system-wide CSV downloads are generated using background workers. It shows the request flow from the user to the ReportsController, the delegation to ReportCsvWorker, batch processing via SystemCsvBuilder, and final file generation.

[Image: Screenshot 2026-03-22 at 2.57.05 AM.png]


Timeline (350 hours, ~12 weeks)

Timeline Overview

| Period | Phase | Focus Area | Duration |
| --- | --- | --- | --- |
| May 1 – May 24 | Community Bonding | System understanding, mentor alignment, design discussion | ~3 weeks |
| May 25 – June 8 | Phase 1 | Aggregation foundation (core backend) | 2 weeks |
| June 9 – June 22 | Phase 2 | API layer + caching integration | 2 weeks |
| June 23 – July 6 | Phase 3 | CSV export + async processing | 2 weeks |
| July 6 – July 10 | Midterm Evaluation | Backend completion milestone | - |
| July 7 – July 20 | Phase 4 | Frontend integration | 2 weeks |
| July 21 – August 3 | Phase 5 | Performance + edge cases | 2 weeks |
| August 4 – August 17 | Phase 6 | Testing, documentation, buffer | 2 weeks |
| August 17 – August 24 | Final Evaluation | Final submission | - |

Before Midterm Evaluation

Community Bonding (May 1 – May 24)

This phase is used to get comfortable with how the Dashboard already computes and serves metrics. The focus will be on understanding how data flows from timeslices into cached course-level fields, and how those cached values are later used for aggregation. The existing aggregation logic (CoursesPresenter, CourseStatistics) and CSV generation flow will be studied carefully, along with running sample queries on the development database to understand dataset size and performance behavior.
Discussions with mentors such as @Ragesoss and @FRomeo_WMF will help narrow down which system-wide metrics are actually useful, instead of building unnecessary outputs. Based on these discussions, a short design outline will be prepared describing how aggregation, snapshot storage, and exports will fit into the existing system.

Phase 1: Aggregation Foundation (May 25 – June 8)

The work begins by introducing a system-wide aggregation layer that operates directly on cached course-level metrics. Since the system already stores precomputed values in the courses table, this phase focuses on reusing those values instead of recomputing from raw revisions. A snapshot-based approach will be introduced to store aggregated results, so that repeated queries do not trigger heavy computations.
Deliverable: A working aggregation mechanism that computes global metrics and persists them as snapshots for reuse.

Phase 2: API Layer + Caching (June 9 – June 22)

Once aggregation is stable, the next step is to make this data accessible. A structured API layer will be introduced to expose system-wide metrics, while a caching layer (Redis-based) ensures that repeated requests do not hit the database unnecessarily. Background updates will be integrated with the existing update cycle so that snapshots remain reasonably fresh without requiring real-time computation.
Deliverable: System-wide metrics accessible through an API, backed by caching for efficient retrieval.

Phase 3: CSV Export + Async Processing (June 23 – July 6)

With aggregation and API in place, the system is extended to support data downloads. The focus here is on building a scalable export mechanism that can handle large datasets. Instead of loading all records into memory, batch processing (find_each) will be used. CSV generation will follow the existing asynchronous pattern using background workers, ensuring that large exports do not block user requests.
Deliverable: An asynchronous CSV export system capable of generating system-wide reports safely and efficiently.
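
As one possible variation on the batching idea, the sketch below writes rows to an IO object batch by batch instead of building the whole CSV string in memory. `write_csv_in_batches` is a hypothetical helper, and plain hashes stand in for course records; in the Dashboard, find_each would supply the batches.

```ruby
require 'csv'
require 'stringio'

# Write rows to any IO in fixed-size batches so that only one batch of
# records is processed at a time. The block converts a record into a row.
def write_csv_in_batches(io, headers, records, batch_size: 1000)
  csv = CSV.new(io)
  csv << headers
  records.each_slice(batch_size) do |batch|
    batch.each { |record| csv << yield(record) }
  end
  io
end

# Plain hashes used as stand-ins for course records.
courses = [
  { slug: 'School/Course_A', character_sum: 1200 },
  { slug: 'School/Course_B', character_sum: 300 }
]
out = write_csv_in_batches(StringIO.new, %w[slug character_sum], courses, batch_size: 1) do |course|
  [course[:slug], course[:character_sum]]
end
puts out.string
```

Passing a File instead of a StringIO would let a background worker stream a large export straight to disk.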

Midterm Evaluation (July 6 – July 10)

At this point, the core backend functionality is expected to be complete. This includes aggregation based on cached data, snapshot storage, API access, and CSV export through background processing.

Phase 4: Frontend Integration (July 7 – July 20)

After the backend stabilizes, system-wide metrics will be made visible through the interface. Instead of introducing new UI patterns, existing dashboard components will be reused to maintain consistency. The goal here is to present aggregated data in a simple and understandable way, along with providing access to CSV downloads.
Deliverable: System-wide metrics integrated into the dashboard with basic interaction and export support.

Phase 5: Performance and Stability (July 21 – August 3)

This phase focuses on ensuring the system behaves well under realistic conditions. Aggregation queries and caching behavior will be reviewed, especially for larger datasets. Edge cases such as missing snapshots, stale cache entries, or empty datasets will be handled to ensure the system remains reliable.
Deliverable: A stable and scalable implementation that performs consistently across different data sizes.

Phase 6: Testing, Documentation, Buffer (August 4 – August 17)

The final phase focuses on validation and completeness. Instead of adding new features, the effort shifts to testing the full pipeline end-to-end and ensuring that each component behaves as expected. Documentation will be prepared to explain how metrics are computed, how data can be accessed, and how exports are generated. Buffer time is reserved to incorporate mentor feedback and handle any remaining issues.
Deliverable: A well-tested, documented, and production-ready implementation.

Final Evaluation (August 17 – August 24)

By the end of the project, the system will support system-wide metrics aggregation, efficient data access through APIs and caching, scalable CSV exports, and basic frontend visibility.

NOTE: The proposed solution mentioned above is adaptable. I am open to feedback and can make changes as needed based on discussions with my mentors.

Post GSoC

The current work focuses on making system-wide data accessible in a structured way, but there are a few natural directions in which it can be extended. Filtering options such as date range, campaign, or wiki would let users explore the data instead of just seeing totals. Analytics with trends over time or program comparisons would show how initiatives perform. The API could be extended so that other tools and researchers can pull the data more easily. The frontend could be improved with visualizations such as charts and breakdown views so that the data can be interpreted directly from the dashboard. These additions extend the same aggregation and snapshot approach without requiring major changes to the existing system.


About Me

I am currently pursuing a Bachelor's degree in CSE and have worked with Python, Java, JavaScript, ReactJS, and Ruby. I have also worked with the MERN stack and FastAPI, and I am exploring Ruby on Rails, where I am still improving my understanding while going through the codebase. On the database side, I have worked with MySQL, PostgreSQL, MongoDB, and Redis, and I have some knowledge of computer networking and low-level design (LLD). Familiarity with concepts like OOP has helped me understand how systems behave internally. Alongside development, I regularly practice DSA problems on competitive programming platforms. Currently I am learning about computer architecture and exploring Golang.


Past Experience

Past Contributions to Wiki Education Dashboard - Link
| Pull Request | Title | Involved |
| --- | --- | --- |
| #6745 (merged) | Cap weekMeetings to timeline_end instead of course end date | Updated weekMeetings to cap with timeline_end; fixed meetings showing beyond assignment end; timeline logic |
| #6733 (merged) | Remove unused private methods in RevisionScoreImporter | Removed unused private methods from RevisionScoreImporter after data rearchitecture; cleaned orphaned methods |
| #6715 (merged) | hamburger menu disappearance fix | Debugged UI issue; fixed hamburger menu disappearing on mobile for campaign tabs; corrected nav layout issue where desktop links were overlapping |
| #6696 (merged) | Fix stale sandbox URLs when user is renamed | Added after_save callback in User model to update sandbox URLs on username change; synced assignment sandbox_url with new username while preserving custom URLs; model callback + data consistency |
| #6681 (merged) | Fix Timeline Sunday Start bug | Fixed weekMeetings logic to skip empty leading week; corrected week numbering using weeksBeforeTimeline offset |
| #6678 (merged) | Support multi-wiki article quality ratings | Refactored RatingImporter for on-demand rating updates across multiple wikis; added international rating mappings + exposed rating_class in API/views |
| #6674 (merged) | CI: Replace Redis and MariaDB actions with native service containers | Replaced Redis GitHub Action with native service container in CI |
| #6671 (open) | Fix interwiki assignments and user links | Restored interwiki prefix handling in AssignmentManager and models; fixed article/user link logic to work correctly across wikis |
| #6664 (merged) | Fix NotSignedInError to redirect with UI instead of plain text | Replaced plain text NotSignedInError with proper UI redirect flow; added post-login redirect handling with validation for safe paths; auth + redirect logic |
| #6660 (merged) | Redesign /usage page with improved layout and interactive features | Redesigned /usage page layout with structured stats and sections; added interactive wiki list with sorting and dropdown views; UI redesign |
| #6656 (merged) | Fix tooltip icons mobile responsive | Fixed CSS rule hiding tooltip icons on mobile views |
| #6643 (merged) | Update Article Views documentation to reflect new calculation method | Updated Article Views tooltip docs to reflect new calculation method |
| #6753 (merged) | Scope .table-responsive CSS to mobile only | Fixes desktop UI scroll issue |
| #6757 (merged) | Allow Declining Submitted Courses Gracefully | Adds a Decline Course workflow with a persistent declined state to prevent repeated “Unsubmitted Course” alert emails; introduces declined flag; stops alert loop |

  • During my contributions, I went through multiple PR review iterations which helped me understand the codebase better. The feedback and cross-questioning improved my reasoning and how I approach changes.
  • I have also created a few issues based on my observations while working on the codebase, which helped me explore different parts of the system further.

Availability and Communication

During the community bonding period, I look forward to understanding the project more deeply with the guidance of @FRomeo_WMF and other mentors, which will help me start the implementation with better clarity. I will stay in regular contact with mentors through the channels they prefer, such as Slack, Zulip, or GitHub. I will give updates on progress, raise any blockers early, and keep communication steady throughout the project. I can dedicate about 25 to 30 hours a week during the coding period, and more if needed. My end-semester exams are in the last week of May, so my availability during that time might be a bit reduced; I have accounted for this while planning the initial project phase.

Note: I'm open to changes, especially after talking with mentors. What matters most is aligning with real project needs.

Event Timeline

Pppery added a project: Trash.
nexpectArpit updated the task description.

@Aklapper Apologies for that, I was just starting to use the platform.

Aklapper edited projects, added Google-Summer-of-Code (2026); removed Trash.

@nexpectArpit In case this is related to GSoC, please follow the instructions. Thanks.

nexpectArpit renamed this task from "in between prop" to "GSoC 2026: Programs & Events Dashboard system-wide metrics and data downloads". (Mar 22 2026, 6:37 PM)
nexpectArpit updated the task description.

Hi, thanks for submitting your GSoC 2026 project proposal with Wikimedia!

Please make sure you’ve also submitted your proposal on the official Summer of Code website: https://summerofcode.withgoogle.com. The deadline for both submission and any edits is the same, so ensure everything is finalized before March 31, 18:00 UTC, as changes won’t be possible after that.

We strongly recommend completing any updates at least 30 minutes before the deadline to avoid last-minute glitches or unexpected technical issues.

Wishing you all the best for your application. Hope to see you as part of the program soon! 🚀

Hi, thank you for your submission and the effort you put into your proposal. This year we received over 380 strong applications, and unfortunately we were not able to offer you a slot. This was a very competitive process, and many high quality proposals could not be selected. We truly encourage you to stay engaged and continue contributing to Wikimedia projects. Over the years, many contributors who were not selected for Google Summer of Code have gone on to make impactful contributions and become long term members of the community. Please do not see this as a failure, but as a step forward in your journey. We would love to stay in touch and support your continued involvement.

If you would like guidance on how to contribute to our projects outside GSoC, feel free to reach out to any of the mentors or org admins, they will be happy to help you get started.

You can get started or continue contributing here:

We hope to see your contributions in our community soon.