Outreachy Round 29: Proposal for ‘Improve observability of Wiki Education Dashboard’ project.
Open, Low, Public

Assigned To

Authored By

	empty-codes
	Oct 24 2024, 5:46 PM

Description

My Meta-wiki User Page: Emptycodes
Typical working hours: 10:00am - 6:00pm PT (6pm - 2am WAT / 5pm - 1am UTC); I am open to flexible hours as needed.

Project ‘Improve observability of Wiki Education Dashboard’

This project focuses on improving the observability of the Dashboard and making it easier for system administrators — and possibly end-users as well — to detect and understand problems with the system.

Problem Statements

The current Sentry setup does not reliably detect common service disruptions and is too noisy, making it difficult to surface new bugs or regressions.
Performance problems are not always isolated effectively for quick troubleshooting.
System administrators and potentially end-users face difficulty detecting and understanding problems within the system due to inadequate visibility of errors and performance issues.

Mentor: @Ragesoss

Deliverables

Improved error tracking and alert configuration in Sentry and New Relic.
A user-friendly system status and performance information UI.
Documentation for Dashboard deployment in Cloud VPS and Toolforge.

Implementation Details

[A] Existing Features:

The following gems are used for error reporting and observability (relevant to this project):

1. Sidekiq (with sidekiq-unique-jobs and sidekiq-cron)

Function: Manages background worker jobs, ensuring that tasks (e.g., data updates, notifications) run asynchronously without blocking the main application.
Integration:
- Workers on VMs: Each worker is assigned a separate virtual machine (VM) to isolate potential bottlenecks and failures, ensuring smoother processing for large tasks (e.g., course updates).
- Queues: Separate queues for long, medium, and short jobs are created, with dedicated VMs (8-core for long/medium jobs, 4-core for short jobs).
- Cron Jobs: sidekiq-cron schedules tasks at regular intervals, such as course updates.
- Error Handling: Uses Sentry for reporting errors in job execution.

2. Sentry (sentry-ruby, sentry-rails, sentry-sidekiq)

Function: Provides error reporting and observability for both Ruby server-side code and client-side JavaScript.
Integration:
- Error Logging: Sentry captures and logs errors across the system. It sends "envelopes" containing error sessions to Sentry via background jobs (handled by Sidekiq). The Sentry.capture_exception or Sentry.capture_message methods log errors and messages.
- Contextual Data: The system provides extra context in logs, such as project info, revision IDs, or error counts (e.g., error_count for LiftWingApi or ReferenceCounterApi). This helps track specific actions or issues.
- Admin Access: Only administrators can view Sentry logs, but developers can point their environment to Sentry to test logging.
- Custom Error Reporting: Developers can create custom error reports, such as sending logs from course updates (log_update_progress) or user actions (FAQ query logging).
- Error Handling Logic in lib/errors: Custom error handling logic tracks errors for designated processes, e.g., the course update process. Specific logs track the progress of updates (e.g., UpdateLogger.update_course) and ensure errors are reported with contextual information (e.g., sentry_tag_uuid).

3. New Relic (newrelic_rpm)

Function: Performance monitoring tool that tracks the application's performance, including metrics like queue latency and throughput.
Integration:
- Queue Latency Monitoring: The NewRelicQueueLatencyLogger middleware tracks how long jobs have been in the queue (e.g., for Sidekiq workers). Latency is added as a custom attribute in New Relic for detailed performance analysis.
- Performance Graphs: New Relic provides visual insights into system performance, such as latency spikes or processing bottlenecks.
- Queue Info: The logger captures detailed info for specific jobs, such as course updates, adding context to performance metrics.
- Development Overhead: Running New Relic in developer mode adds overhead, so it is typically enabled in production for performance analysis.
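
The queue latency measurement described above can be sketched as a Sidekiq-style server middleware. This is only an illustration, not the Dashboard's actual NewRelicQueueLatencyLogger: the class name QueueLatencySketch, the recorder callable and the queue name 'long_update' are hypothetical, and the sketch assumes job['enqueued_at'] holds epoch seconds as a float. In the real middleware the latency would be attached as a New Relic custom attribute rather than passed to a callable.

```ruby
# Hypothetical sketch of a Sidekiq-style server middleware that measures
# how long a job waited in its queue before a worker picked it up.
class QueueLatencySketch
  # recorder: any callable taking (queue_name, latency_seconds).
  # now: injectable clock so the example is deterministic.
  def initialize(recorder, now: -> { Time.now.to_f })
    @recorder = recorder
    @now = now
  end

  # Same shape as a Sidekiq server middleware: call(worker, job, queue) { ... }
  def call(_worker, job, queue)
    enqueued_at = job['enqueued_at'] # assumed: epoch seconds (Float)
    @recorder.call(queue, (@now.call - enqueued_at).round(3)) if enqueued_at
    yield # run the job itself
  end
end

# Usage with a fixed clock and an in-memory recorder.
recorded = []
middleware = QueueLatencySketch.new(
  ->(queue, latency) { recorded << [queue, latency] },
  now: -> { 1_000.0 }
)
result = middleware.call(nil, { 'enqueued_at' => 988.5 }, 'long_update') { :job_ran }
```

Injecting the clock and the recorder keeps the sketch testable without Sidekiq or New Relic installed.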

4. Performance Monitoring Gems

rack-mini-profiler: Provides real-time performance profiling of pages. It shows how much time each part of the page load process takes.
stackprof: Provides more in-depth performance profiling, allowing the analysis of CPU time spent on specific parts of the code.

5. Webpack

Function: Manages bundling and source maps for JavaScript assets.
Integration:
- Source Maps: In production, Webpack generates full source maps (devtool: 'source-map') to aid debugging of JavaScript errors. This ensures that even minified code can be traced back to the original source.
- Development Configuration: In development, cheaper source maps (eval-cheap-source-map) are used for faster builds while still supporting debugging.

Talk page where users can report bugs, observations, and errors: https://meta.wikimedia.org/w/index.php?title=Talk:Programs_%26_Events_Dashboard&action=edit&section=new

Existing general-purpose sites (primarily for developers) are used to check system status, verify the overall health and performance of the wiki environment, and find out about major outages. They will be used as a reference when building the UI.

[B] Proposed Features:

Main Task	Subtasks	Issue	Implementation
Reducing Noise in Sentry	1. Adjusting Sentry Alerts to Reduce Noise	Difficult to track series of failures affecting update systems	- Configure Sentry Alerts: Adjust alerts to reduce noise following best practices.
	2. Suppressing Benign Sentry TCP Errors	Encountering TCP connection errors in development can clutter logs and mislead beginners during setup	- Modify Sentry Configuration: Add logic to check whether the environment is development and whether the sentry_dsn environment variable is set. Then either initialize Sentry only if sentry_dsn is set, or specifically capture and suppress the logging of getaddrinfo errors in development.
			- Add a note in docs that explains Sentry is optional in development and how to configure or skip it since only admins have access to Sentry server normally.
	3. Managing Storage of Error Records	Storing individual error records without a deletion strategy risks filling the database and contributes to noise on Sentry	- Set Retention Policies: Configure Sentry to retain events based on age or volume.
			- Document and implement retention policies.
	4. Preventing Excessive Sentry Logging from ReferenceCounterApi	Non-200 responses flood Sentry logs	- Determine and Configure Log Thresholds: For example, log only the first occurrence of a specific non-200 response within a 5-minute window.
			- Use Sentry Tagging: Group logs by error type or response code.
	5. Enhancing Handled Errors in Sentry	Current error handling is too generic (You can find an example of 'handled' errors in replica.rb)	- Expand TYPICAL_ERRORS: Include additional network errors to be handled.
			- Filter Handled Errors in Sentry: Implement a mechanism to filter out handled errors so they do not clutter Sentry logs.
	6. Setting Custom Event(s) for Systematic Course Update Failures	Consistent failures in updating specific courses may go unnoticed	- Track Pre- and Post-Update Statistics: Log and compare course stats before and after updates.
			- Custom Sentry Event for Course Errors: Aggregate errors related to UpdateCourseStats by course, not just by error type.
Enhancing Sentry Configuration	1. Setting Additional Sentry Configurations	Sentry is not fully configured for optimal tracking and error logging as seen in sentry.rb	- Set Up Additional Configurations such as transaction tracing, breadcrumbs, and cron job monitoring.
			- Use Source Maps: Upload generated source maps by webpack to Sentry if not already doing so.
	2. Handling Bugs Without Explicit Errors	Rare bugs may occur without explicit errors in the JavaScript console or network requests	- Implement additional input/response validation to reduce the chance of unexpected results.
			- If invalid input / responses are captured, trigger custom Sentry events capturing relevant data.
			- Log issues directly to the console in development/staging to avoid production noise.
	3. Preventing Server Running out of Resources Failure Mode	This failure mode was addressed by disabling manual updates. It is not a perfect fix, however, as multiple users report that the Dashboard did not update stats even after days.	- Set up a Sentry alert for Sidekiq tasks that take an abnormally long time or fail to complete. Include details in the alert, such as program name, expected completion time, and elapsed time, so administrators can investigate.
Optimizing Performance	1. Implementing Dynamic Queue Sorting for Updates	Current queue sorting behavior is static, relying only on program length	- Implement Dynamic Queue Sorting: Prioritize jobs based on average update time instead of fixed characteristics like program length.
			- Dynamically adjust the load across different queues to prevent a bottleneck, especially in the long queue.
			- Implement a smarter sorting algorithm that recalculates average update times regularly; configure Sidekiq to support dynamic priority-based processing across different queues.
			- Expand New Relic's queue latency monitoring logic in nr_sidekiq_queue_latency.rb to be more robust.
	2. Preventing Server Running Out of Memory Failure Mode	Update processes could run out of memory, potentially causing crashes. Killed processes due to memory exhaustion do not generate Sentry logs	- Set Up Memory Monitoring: Use New Relic and define memory thresholds.
			- Implement Custom Sentinel Process: Monitor for unexpected terminations and alert when they occur.
			- (Optional) Automatic Process Management: Configure how the Linux OOM killer treats these processes and log shutdown events. Note that it is important to terminate processes gracefully, for example with application-level shutdown tools.
	3. Monitoring Update Scheduling Queue Health	Outage caused by a backlog in the update scheduling queue	- Implement Automated Monitoring: Set up Sidekiq and New Relic metrics to track queue health.
			- Automatic Queue Recovery: Implement mechanisms to clear or retry jobs if they get stuck.
	4. Optimizing New Relic Configuration for Global Server	Exceeding the data ingestion limit in New Relic too early in the month prevents proper performance monitoring for the rest of the period.	- Modify the configuration to optimize which data gets sent to New Relic.
			- Focus on capturing essential metrics, balancing between detailed data and the allowed quota.
Building a status UI to present system status and performance information	1. Design Process: Define the visual and functional layout of the Status Info UI.		- User Research and Requirement Analysis: Convert user needs into user and system requirements, audit existing metrics, and prioritize available and future metrics by importance ✔️
			- Develop a new site architecture ✔️
			- Wireframing and Prototyping
	2. Development: Implement the UI and integrate it with the backend data sources.		- Frontend Development
			- Backend Integration
	3. Deployment		TBD
	4. Testing and Maintenance : Ensure the reliability of the UI and plan for ongoing updates.		- Unit and Integration Testing + Other Tests
			- Maintenance: Maintain comprehensive documentation for developers and end-users
Improving Documentation	1. Write documentation about how the app is deployed in Cloud VPS and Toolforge	T375642 - It would be very useful to have a wiki page somewhere with more complete and accurate information about the deployment, how to debug issues, restart the app, etc. We have a fair amount of documentation in the codebase, but it's spread out and not geared towards letting others easily hop in and fix production problems. The app runs in Cloud VPS, in the globaleducation project. There is also the Toolforge tool wikiedudashboard which is running the code from WikiEduDashboardTools
Additional Tasks	1. Addressing Concerns About Decreasing Pageviews	Multiple users expressed concern over decreasing pageviews	- Create a tooltip explaining that pageview fluctuations are normal, as they are just estimates based on a 30-day period ✔️
			- Integrate the tooltip into the UI where pageviews are displayed ✔️
	2. Enhancing User-Friendly Error Handling	TODOs in lib/errors/rescue_errors.rb to improve handling of HTML errors.	- Add more user-friendly handling for HTML errors.
			- Review current error handling logic.
			- Implement user-friendly error messages.
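
The log threshold idea from the ReferenceCounterApi subtask in the table above (report only the first occurrence of a specific non-200 response within a 5-minute window) could look roughly like this. It is a minimal sketch under stated assumptions: the ReportThrottle class and its report? method are hypothetical names, the state lives in memory for a single process, and a real setup with several Sidekiq workers would likely need a shared store instead.

```ruby
# Hypothetical throttle: let a given key through at most once per time window.
class ReportThrottle
  def initialize(window_seconds: 300)
    @window = window_seconds
    @last_reported = {} # key => time of the last report that was let through
  end

  # Returns true if the caller should report this occurrence.
  # now is injectable so the example is deterministic.
  def report?(key, now: Time.now.to_f)
    last = @last_reported[key]
    return false if last && (now - last) < @window

    @last_reported[key] = now
    true
  end
end

throttle = ReportThrottle.new(window_seconds: 300)
decisions = [
  throttle.report?(['ReferenceCounterApi', 503], now: 0.0),   # first occurrence
  throttle.report?(['ReferenceCounterApi', 503], now: 120.0), # inside the window
  throttle.report?(['ReferenceCounterApi', 404], now: 130.0), # different status code
  throttle.report?(['ReferenceCounterApi', 503], now: 301.0)  # window has elapsed
]
```

The caller would wrap its error-reporting call in `if throttle.report?(key)`, so suppressed occurrences never reach Sentry. Keying on both API name and status code keeps distinct failures visible.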

Note: All tasks are subject to change and approval by the project mentor.
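
As an illustration of the dynamic queue sorting subtask listed in the table above, the sketch below picks a queue from a course's recent update durations rather than from program length. Everything here is an assumption made for the example: the queue names, the thresholds in seconds, and the idea that recent durations are available as a plain array.

```ruby
# Hypothetical queue picker based on recent update durations.
# Thresholds (in seconds) and queue names are illustrative assumptions.
QUEUE_THRESHOLDS = [
  [60.0, 'short_update'],
  [600.0, 'medium_update']
].freeze
DEFAULT_QUEUE = 'long_update'

# Mean of the recorded durations, or nil when there is no history.
def average(durations)
  return nil if durations.empty?

  durations.sum / durations.size.to_f
end

# Chooses the first queue whose upper limit exceeds the average duration.
def queue_for(recent_durations)
  avg = average(recent_durations)
  return DEFAULT_QUEUE if avg.nil? # no history yet: be conservative

  match = QUEUE_THRESHOLDS.find { |limit, _name| avg < limit }
  match ? match.last : DEFAULT_QUEUE
end

fast_course = queue_for([10, 20, 30])
mid_course  = queue_for([120, 480])
slow_course = queue_for([900, 1500])
new_course  = queue_for([])
```

Recomputing the average from a sliding window of recent updates would let a course move between queues as its size or activity changes.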

Current Initiatives in the Project

- The Status UI
Based on a discussion with the mentor and insights gathered from user research on the Programs & Events Dashboard talk page (https://meta.wikimedia.org/w/index.php?title=Talk:Programs_%26_Events_Dashboard&action=edit&section=new), I developed an initial set of requirements for the status UI, covering the prioritized data points and the functionality users need to monitor system status and performance effectively. I then designed a site architecture map and categorized the proposed metrics to provide an intuitive starting point for data exploration:

[Figure: Draft of WikiEduDashboard Proposed Status UI-light.png]

Google Drive Link: https://drive.google.com/drive/folders/1RwcY-5Hg8QntjiTjhn7Oe0dOOuJrwG6R
To support these requirements, I also created a prototype for the system status page in Canva, using dummy data to showcase functionality. This prototype features two alternative layouts: a widget-based design and a full-page format. Both formats aim to offer users clear, immediate access to key metrics and drill-down options for more detailed insights.
Both designs are intended as starting points and are adaptable to future feedback and refinements.

Prototype
Status UI Widget View

[Figure: Proposed Prototype of Status UI Widget View.png]

or alternatively, Status UI Full View

[Figure: Proposed Prototype of Status UI Full View.png]

Timeline

Summarized timeline capturing the main milestones in three-week segments:

Weeks	Description
Weeks 1 - 3	Initial research, documentation, start of Sentry noise reduction, and UI wireframing and prototype creation.
Weeks 4 - 6	Start development stage of UI, start of Sentry storage management, and performance optimizations.
Weeks 7 - 9	Continue with queue optimizations and Sentry alerts, and prepare for UI deployment.
Weeks 10 - 12	Complete UI testing and deployment, add final improvements to Sentry / New Relic, and monitor the improvements to ensure everything works as expected.
Week 13	Final documentation and finishing touches + project report.

Week 1 (Dec. 9 - 13)

Discuss project priorities and goals with mentors and change timeline and deliverables as needed.
Detailed documentation on current Sentry and New Relic implementations.
Research how the Dashboard uses CloudVPS and Toolforge and define documentation scope
Create site architecture map of proposed user-facing UI.
Milestone: Draft of current Sentry/New Relic documentation and created folder for the user-facing UI.

Week 2 (Dec 16 - 20)

Complete analysis of Sentry/New Relic implementations.
Attempt to configure Sentry alerts and begin initial noise reduction (sub)tasks.
Start development stage of UI, integrating feedback.
Start Documentation
Milestone: Progress on UI, completed Sentry/New Relic documentation.

Week 3 (Dec 23 - 27)

Continue noise reduction work in Sentry.
Continue developing UI, integrating feedback.
Ongoing Documentation
Milestone: Progress on noise reduction task, UI and Updated Documentation.

Week 4 (Dec 30 - Jan 3)

Continue developing UI, integrating feedback.
Begin integrating dynamic data display in the UI.
Milestone: Progress on UI.

Week 5 (Jan 6 - 10)

Start work on Sentry storage management (retention policy) setup.
Start implementing handled errors filtering and enhancements.
Continue developing UI, integrating feedback.
Milestone: Documentation of proposed storage management approach and implemented UI elements.

Week 6 (Jan 13 - 17)

Finish Sentry storage management and document retention policy setup
Begin performance optimization, focusing on dynamic queue sorting.
Start setting up memory monitoring.
Milestone: Draft on performance optimization strategies, documented progress on Sentry setup.

Week 7 (Jan 20 - 24)

Continue dynamic queue sorting and implement job/task duration alerts in Sentry.
Further UI development; expand UI for detailed performance views.
Milestone: Mid-development UI and feedback on progress with queue optimizations.

Week 8 (Jan 27 - Jan 31)

Complete job/task duration alerts in Sentry.
Implement new handled error detection logic and improve Sentry tagging.
Continue UI refinement, add tooltips and user-oriented details.
Milestone: Review of improved alerts functionality and Refined UI.

Week 9 (Feb 3 - 7)

Finalize UI improvements (tooltips, filtering/breakdown selectors).
Gather final feedback on UI before Deployment
Milestone: UI refinement and feedback before Deployment.

Week 10 (Feb 10 - 14)

Start UI deployment stage: decide on tools and methods, and update documentation.
Enhance systematic course update failure handling in Sentry.
Milestone: Review of current Sentry Improvements, Deployed UI.

Week 11 (Feb 17 - 21)

Refine memory monitoring and automatic queue recovery.
Finalize the dynamic queue sorting implementation.
Final UI touches based on performance insights + Add unit / integration tests to UI code
Milestone: Review of improvements, Documented performance insights and final UI adjustments.

Week 12 (Feb 24 - Feb 28)

Monitor added configurations to Sentry / New Relic and status UI and improve all as needed
Finish adding tests to UI code
Milestone: Monitoring results, Finished tests for UI code.

Week 13 (Mar 3 - 7)

Final testing, bug fixes, documentation and finishing touches.
Project completion and internship end on Mar 7 2025
Milestone: Final project report and documentation of completed project to mentors and community.

(Note: All public and general holidays are assumed to be observed and all timelines are subject to change and approval by the project mentor. I am flexible and can accommodate any changes as needed.)

WikiEduDashboard Contribution Period Contributions

Merged PR: Adds an additional string to the DELETED_REVISION_ERRORS array and renames a cassette (fixes this issue).
Documented Debugging Process: GitHub Issue, Phabricator Issue: T377898
Merged PR: Feature to allow Programs to be Deleted before Removing Campaigns on Programs & Events Dashboard (fixes this issue).
Merged PR: Fixing layout bug on Articles Edited table (fixes this issue).
Merged PR: Adding Tooltip for Article Views Metric (fixes this issue).
Merged PR: Handles Truncated API responses that lead to revisions being null and causing a 'NoMethodError' (fixes these issues: WikidataDiffAnalyzerGem Issue, WikiEdu Issue).

Related Objects

Status	Assigned	Task
In Progress	debt	T372834 Coordinate Wikimedia's participation in Outreachy Round 29
In Progress	None	T374390 Outreachy Round 29: Improve observability of Wiki Education Dashboard
Open	empty-codes	T378119 Outreachy Round 29: Proposal for ‘Improve observability of Wiki Education Dashboard’ project.

Event Timeline

empty-codes created this task. Oct 24 2024, 5:46 PM

Restricted Application added a subscriber: Aklapper. Oct 24 2024, 5:46 PM

@empty-codes: Hi, please provide a link to the corresponding task for context. Thanks!

empty-codes removed empty-codes as the assignee of this task. Oct 24 2024, 7:01 PM

empty-codes claimed this task.

empty-codes triaged this task as Low priority.

empty-codes updated the task description.

@Ragesoss Hello, please I would like to request some feedback on my proposal. Are my tasks feasible / on the right path? Thank you in advance!

In T378119#10261190, @Aklapper wrote:

@empty-codes: Hi, please provide a link to the corresponding task for context. Thanks!

Oh sorry, you mean the parent outreachy project, the proposal is for?

Aklapper added a parent task: T374390: Outreachy Round 29: Improve observability of Wiki Education Dashboard. Oct 25 2024, 9:37 AM

This looks like a solid timeline.

empty-codes updated the task description. Oct 28 2024, 5:13 AM

empty-codes updated the task description. Oct 28 2024, 6:36 AM

empty-codes updated the task description. Oct 29 2024, 1:05 PM

empty-codes updated the task description. Oct 29 2024, 2:05 PM

empty-codes updated the task description.

empty-codes renamed this task from "Draft Proposal for Review" to "Outreachy Round 29 Proposal for ‘Improve observability of Wiki Education Dashboard’ project.". Oct 29 2024, 2:12 PM

empty-codes updated the task description.

empty-codes renamed this task from "Outreachy Round 29 Proposal for ‘Improve observability of Wiki Education Dashboard’ project." to "Outreachy Round 29: Proposal for ‘Improve observability of Wiki Education Dashboard’ project.". Oct 29 2024, 2:30 PM

empty-codes updated the task description.

empty-codes updated the task description. Oct 29 2024, 2:46 PM

empty-codes updated the task description. Oct 29 2024, 3:02 PM

empty-codes updated the task description. Oct 29 2024, 3:12 PM

empty-codes updated the task description.

empty-codes updated the task description. Fri, Dec 6, 1:39 PM

empty-codes updated the task description. Mon, Dec 9, 5:24 PM

Week 1 (Dec. 9 - 13) Update

Tasks Completed:

Completed community bonding tasks: created my Meta-wiki user page, wrote my first blog post, updated my Phabricator task, and joined Zulip
Started Analysis of Sentry Implementation
Started Documentation on P&E Tools for Cloud
Created GitHub branch for UI and a draft PR using site architecture map as reference

Learnings:

What 'core values' are and what I think mine are
Learnt about Cloud VPS and Toolforge in detail; understood how the Dashboard servers are set up and deployed
Expanded my knowledge of Sentry concepts and configuration options
Explored Rails MVC structure and Sidekiq APIs

Milestone Achieved ✔️

Ragesoss moved this task from Backlog to Project Proposal (Intern) on the Outreachy (Round 29) board. Tue, Dec 17, 6:38 PM

Week 2 (Dec. 16 - 20) Update

Tasks Completed:

Created first MVP of UI
Completed Analysis of Sentry Implementation
Implemented actions to reduce noise in Sentry
Wrote 2nd blog post on the theme 'Everyone Struggles'

Learnings:

How Rails maps the MVC structure; Haml syntax
Understood what git rebase does and when it should be used
Learnt various ways to filter out data in Sentry

Milestone Achieved ✔️

empty-codes added a comment. Sun, Dec 29, 1:23 PM

This comment was removed by empty-codes.

Week 3 (Dec. 23 - 27) Update

Tasks Completed:

Continued work on UI; wrote tests (specs) for controller and service class
Monitored Sentry dashboards and archived more un-actionable issues
Created syntax to add filters to Sentry SDK Configuration

Learnings:

Sentry configuration keywords like before_send, config.excluded_exceptions
Sentry event structure:

{ 'level': 'error', 'exception': { 'values': [{ 'module': 'exceptions', 'type': ' ', 'value': ........

More Ruby and Rails concepts and syntax

Milestone Achieved ✔️
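
To make the before_send learning above concrete, here is a minimal sketch of a filter shaped like a sentry-ruby before_send callback. The constant names and the list of error classes are assumptions for illustration, not the Dashboard's configuration, and a plain hash stands in for the Sentry event object. In an application the lambda would be assigned with `config.before_send = ...` inside `Sentry.init`, while `config.excluded_exceptions` is the simpler option when a whole exception class should always be ignored.

```ruby
# Error classes treated as already handled (illustrative list only).
HANDLED_ERROR_NAMES = %w[Errno::ECONNREFUSED Net::ReadTimeout SocketError].freeze

# Shaped like a sentry-ruby before_send callback: receives (event, hint)
# and returns the event to send it, or nil to drop it.
FILTER_HANDLED_ERRORS = lambda do |event, hint|
  exception = hint[:exception]
  return nil if exception && HANDLED_ERROR_NAMES.include?(exception.class.name)

  event
end

event = { 'level' => 'error' } # stand-in for a Sentry event object
dropped = FILTER_HANDLED_ERRORS.call(event, { exception: Errno::ECONNREFUSED.new })
kept = FILTER_HANDLED_ERRORS.call(event, { exception: RuntimeError.new('boom') })
```

Here `dropped` is nil, so the connection error would never be sent, while `kept` is the unchanged event. Matching on class names as strings avoids loading libraries only to reference their error constants.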

