My Meta-wiki User Page: Emptycodes
Typical working hours: 10:00am - 6:00pm PT (6pm - 2am WAT / 5pm - 1am UTC). I am open to flexible hours as needed.
Project ‘Improve observability of Wiki Education Dashboard’
This project focuses on improving the observability of the Dashboard and making it easier for system administrators — and possibly end-users as well — to detect and understand problems with the system.
Problem Statements
- The current Sentry setup does not reliably detect common service disruptions and is too noisy, making it difficult to surface new bugs or regressions.
- Performance problems are not always isolated effectively for quick troubleshooting.
- System administrators and potentially end-users face difficulty detecting and understanding problems within the system due to inadequate visibility of errors and performance issues.
Mentor: @Ragesoss
Deliverables
- Reduced noise in Sentry ✔️ - This was achieved by separating errors into actionable and non-actionable ones and fixing the actionable errors, with 15+ commits and 8+ merged PRs (see Commit History and PR History).
- A user-friendly system status and performance information UI. ✔️ - See https://dashboard.wikiedu.org/status and https://outreachdashboard.wmflabs.org/status
- Documentation for Dashboard Deployment in CloudVPS and Toolforge ✔️ - View Admin Guide
Implementation Details
[A] Existing Features:
The following gems are used for error reporting and observability (relevant to this project):
1. Sidekiq (with sidekiq-unique-jobs and sidekiq-cron)
- Function: Manages background worker jobs, ensuring that tasks (e.g., data updates, notifications) run asynchronously without blocking the main application.
- Integration:
- Workers on VMs: Each worker is assigned a separate virtual machine (VM) to isolate potential bottlenecks and failures, ensuring smoother processing for large tasks (e.g., course updates).
- Queues: Separate queues for long, medium, and short jobs are created, with dedicated VMs (8-core for long/medium jobs, 4-core for short jobs).
- Cron Jobs: sidekiq-cron schedules tasks at regular intervals, such as course updates.
- Error Handling: Uses Sentry for reporting errors in job execution.
2. Sentry (sentry-ruby, sentry-rails, sentry-sidekiq)
- Function: Provides error reporting and observability for both Ruby server-side code and client-side JavaScript.
- Integration:
- Error Logging: Sentry captures and logs errors across the system. It sends "envelopes" containing error sessions to Sentry via background jobs (handled by Sidekiq). The Sentry.capture_exception or Sentry.capture_message methods log errors and messages.
- Contextual Data: The system provides extra context in logs, such as project info, revision IDs, or error counts (e.g., error_count for LiftWingApi or ReferenceCounterApi). This helps track specific actions or issues.
- Admin Access: Only administrators can view Sentry logs, but developers can point their environment to Sentry to test logging.
- Custom Error Reporting: Developers can create custom error reports, such as sending logs from course updates (log_update_progress) or user actions (FAQ query logging).
- Error Handling Logic in lib/errors: Custom error handling logic tracks errors for designated processes, e.g., the course update process. Specific logs track the progress of updates (e.g., UpdateLogger.update_course) and ensure errors are reported with contextual information (e.g., sentry_tag_uuid).
3. New Relic (newrelic_rpm)
- Function: Performance monitoring tool that tracks the application's performance, including metrics like queue latency and throughput.
- Integration:
- Queue Latency Monitoring: The NewRelicQueueLatencyLogger middleware tracks how long jobs have been in the queue (e.g., for Sidekiq workers). Latency is added as a custom attribute in New Relic for detailed performance analysis.
- Performance Graphs: New Relic provides visual insights into system performance, such as latency spikes or processing bottlenecks.
- Queue Info: The logger captures detailed info for specific jobs, such as course updates, adding context to performance metrics.
- Development Overhead: Running New Relic in developer mode adds overhead, so it is typically enabled in production for performance analysis.
4. Performance Monitoring Gems
- rack-mini-profiler: Provides real-time performance profiling of pages. It shows how much time each part of the page load process takes.
- stackprof: Provides more in-depth performance profiling, allowing the analysis of CPU time spent on specific parts of the code.
5. Webpack
- Function: Manages bundling and source maps for JavaScript assets.
- Integration:
- Source Maps: In production, Webpack generates full source maps (devtool set to source-map) to aid debugging of JavaScript errors. This ensures that even minified code can be traced back to the original source.
- Development Configuration: In development, cheaper source maps (eval-cheap-source-map) are used for faster builds while still supporting debugging.
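The queue latency measurement described under New Relic above can be illustrated roughly as follows. This is a minimal sketch, not the Dashboard's actual NewRelicQueueLatencyLogger: the class name QueueLatencySketch, the reporter callback, and the queue name are hypothetical, and it assumes the job hash carries an 'enqueued_at' timestamp in epoch seconds (newer Sidekiq versions may use a different unit). In a real middleware, the reporter would forward the value to New Relic as a custom attribute.

```ruby
# Hypothetical Sidekiq-style server middleware that measures queue latency.
class QueueLatencySketch
  # reporter: any callable receiving (queue_name, latency_seconds).
  # clock: callable returning the current epoch time in seconds
  #        (injectable so the sketch can be exercised deterministically).
  def initialize(reporter, clock: -> { Time.now.to_f })
    @reporter = reporter
    @clock = clock
  end

  # Same shape as a Sidekiq server middleware: call(worker, job, queue) { ... }
  def call(_worker, job, queue)
    enqueued_at = job['enqueued_at'] # assumed to be epoch seconds
    if enqueued_at
      # Time the job spent waiting in the queue, never negative.
      latency = [@clock.call - enqueued_at.to_f, 0.0].max
      @reporter.call(queue, latency)
    end
    yield # run the actual job
  end
end

# Usage with a fake clock and an in-memory reporter.
reported = []
middleware = QueueLatencySketch.new(
  ->(queue, latency) { reported << [queue, latency] },
  clock: -> { 1_000.0 }
)
result = middleware.call(nil, { 'enqueued_at' => 940.0 }, 'long_update') { :job_ran }
```

Injecting the clock and the reporter keeps the latency calculation separate from the monitoring backend, which makes it easy to test without a New Relic account.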
Users can leave comments about bugs, observations, and errors at https://meta.wikimedia.org/w/index.php?title=Talk:Programs_%26_Events_Dashboard&action=edit&section=new.
The following general sites (primarily for developers) can be used to check system status, verify the overall health and performance of the wiki environment, and find out about major outages. They serve as a reference for building the UI:
- https://stats.wikimedia.org/#/all-projects
- https://pageviews.wmcloud.org/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=0&range=latest-20&pages=Cat%7CDog
- https://www.wikimediastatus.net/
- https://performance.wikimedia.org/
[B] Proposed Features:
| Main Task | Subtasks | Issue | Implementation |
|---|---|---|---|
| Reducing Noise in Sentry | 1. Adjusting Sentry Alerts to Reduce Noise | Difficult to track series of failures affecting update systems | - Configure Sentry Alerts: Adjust alerts to reduce noise following best practices. |
| | 2. Suppressing Benign Sentry TCP Errors | TCP connection errors in development can clutter logs and mislead beginners during setup | - Modify Sentry Configuration: Add logic to check whether the environment is development and whether the sentry_dsn environment variable is set. Then either initialize Sentry only if sentry_dsn is set, or specifically capture and suppress the logging of getaddrinfo errors in development. |
| | | | - Add a note in the docs explaining that Sentry is optional in development and how to configure or skip it, since normally only admins have access to the Sentry server. |
| | 3. Managing Storage of Error Records | Storing individual error records without a deletion strategy risks filling the database and contributes to noise on Sentry | - Set Retention Policies: Configure Sentry to retain events based on age or volume. |
| | | | - Document and implement retention policies. ✔️ (Issues deleted after 3 months automatically) |
| | 4. Preventing Excessive Sentry Logging from ReferenceCounterApi | Non-200 responses flood Sentry logs | - Determine and Configure Log Thresholds: For example, log only the first occurrence of a specific non-200 response within a 5-minute window. |
| | | | - Use Sentry Tagging: Group logs by error type or response code. |
| | 5. Enhancing Handled Errors in Sentry | Current error handling is too generic (an example of 'handled' errors can be found in replica.rb) | - Expand TYPICAL_ERRORS: Include additional network errors to be handled. |
| | | | - Filter Handled Errors in Sentry: Implement a mechanism to filter out handled errors so they do not clutter Sentry logs. |
| | 6. Setting Custom Event(s) for Systematic Course Update Failures | Consistent failures in updating specific courses may go unnoticed | - Track Pre- and Post-Update Statistics: Log and compare course stats before and after updates. |
| | | | - Custom Sentry Event for Course Errors: Aggregate errors related to UpdateCourseStats by course, not just by error type. |
| Enhancing Sentry Configuration | 1. Setting Additional Sentry Configurations | Sentry is not fully configured for optimal tracking and error logging, as seen in sentry.rb | - Set Up Additional Configurations such as transaction tracing, breadcrumbs, and cron job monitoring. |
| | | | - Use Source Maps: Upload the source maps generated by Webpack to Sentry if not already doing so. |
| | 2. Handling Bugs Without Explicit Errors | Rare bugs may occur without explicit errors in the JavaScript console or network requests | - Implement additional input/response validation to reduce the chance of unexpected results. |
| | | | - If invalid inputs or responses are captured, trigger custom Sentry events capturing relevant data. |
| | | | - Log issues directly to the console in development/staging to avoid production noise. |
| | 3. Preventing Server Running Out of Resources Failure Mode | Failure mode solved by disabling manual updates. Not a perfect fix, however, as multiple users report that the Dashboard did not update stats even after days of updating. | - Set up a Sentry alert for Sidekiq tasks that take an abnormally long time or fail to complete. Include details in the alert such as program name, expected completion time, and elapsed time, so administrators can investigate. |
| Optimizing Performance | 1. Implementing Dynamic Queue Sorting for Updates | Current queue sorting behavior is static, relying only on program length | - Implement Dynamic Queue Sorting: Prioritize jobs based on average update time instead of fixed characteristics like program length. |
| | | | - Dynamically adjust the load across different queues to prevent a bottleneck, especially in the long queue. |
| | | | - Implement a smarter sorting algorithm that recalculates average update times regularly; configure Sidekiq to support dynamic priority-based processing across different queues. |
| | | | - Expand New Relic's queue latency monitoring logic in nr_sidekiq_queue_latency.rb to be more robust. |
| | 2. Preventing Server Running Out of Memory Failure Mode | Update processes could run out of memory, potentially causing crashes. Processes killed due to memory exhaustion do not generate Sentry logs | - Set Up Memory Monitoring: Use New Relic and define memory thresholds. |
| | | | - Implement Custom Sentinel Process: Monitor for unexpected terminations and alert when they occur. |
| | | | - (Optional) Automatic Process Management: Use the Linux OOM killer and log shutdown events. It is important to terminate processes gracefully, for example with application-level shutdown tools. |
| | 3. Monitoring Update Scheduling Queue Health | Outage caused by a backlog in the update scheduling queue | - Implement Automated Monitoring: Set up Sidekiq and New Relic metrics to track queue health. |
| | | | - Automatic Query Recovery: Implement mechanisms to clear or retry jobs if stuck. |
| | 4. Optimizing New Relic Configuration for Global Server | Exceeding the data ingestion limit in New Relic too early in the month prevents proper performance monitoring for the rest of the period. | - Modify the configuration to optimize which data gets sent to New Relic. |
| | | | - Focus on capturing essential metrics, balancing detailed data against the allowed quota. |
| Building a status UI to present system status and performance information | 1. Design Process: Define the visual and functional layout of the Status Info UI. | | - User Research and Requirement Analysis: Convert user needs into user and system requirements, audit existing metrics, and prioritize available and future metrics ✔️ |
| | | | - Develop a new site architecture ✔️ |
| | | | - Wireframing and Prototyping ✔️ |
| | 2. Development: Implement the UI and integrate it with the backend data sources. ✔️ | | - Frontend Development ✔️ |
| | | | - Backend Integration ✔️ |
| | 3. Deployment ✔️ | | TBD |
| | 4. Testing and Maintenance: Ensure the reliability of the UI and plan for ongoing updates. ✔️ | | - Unit and Integration Testing + Other Tests ✔️ |
| | | | - Maintenance: Maintain comprehensive documentation for developers and end-users |
| Improving Documentation | 1. Write documentation about how the app is deployed in Cloud VPS and Toolforge | T375642 - It would be very useful to have a wiki page somewhere with more complete and accurate information about the deployment, how to debug issues, restart the app, etc. We have a fair amount of documentation in the codebase, but it's spread out and not geared towards letting others easily hop in and fix production problems. The app runs in Cloud VPS, in the globaleducation project. There is also the Toolforge tool wikiedudashboard which is running the code from WikiEduDashboardTools ✔️ | |
| Additional Tasks | 1. Addressing Concerns About Decreasing Pageviews | Multiple users expressed concern over decreasing pageviews | - Create a tooltip explaining that pageview fluctuations are normal, as the figures are estimates based on a 30-day period ✔️ |
| | | | - Integrate the tooltip into the UI where pageviews are displayed ✔️ |
| | 2. Enhancing User-Friendly Error Handling | TODOs in lib/errors/rescue_errors.rb to improve handling for HTML errors. | - Add more user-friendly handling for HTML errors. |
| | | | - Review current error handling logic. |
| | | | - Implement user-friendly error messages. |
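The log threshold proposed for ReferenceCounterApi in the table above (report only the first occurrence of a given non-200 response within a 5-minute window) could look roughly like the sketch below. The class name ReportThrottle and the key format are hypothetical and not part of the Dashboard codebase. The state lives in process memory, so a deployment with several worker processes would likely need a shared store instead.

```ruby
# Hypothetical throttle: allow one report per key per time window.
class ReportThrottle
  # window_seconds: length of the suppression window (300 s = 5 minutes).
  # clock: callable returning the current time in seconds (injectable for tests).
  def initialize(window_seconds: 300, clock: -> { Time.now.to_f })
    @window_seconds = window_seconds
    @clock = clock
    @last_reported = {} # key => time of the last report that was allowed
  end

  # Returns true if the caller should report now, false if it is suppressed.
  def allow?(key)
    now = @clock.call
    last = @last_reported[key]
    return false if last && (now - last) < @window_seconds

    @last_reported[key] = now
    true
  end
end

# Usage with a controllable clock; keys combine the API name and status code.
current_time = 0.0
throttle = ReportThrottle.new(window_seconds: 300, clock: -> { current_time })

decisions = []
decisions << throttle.allow?(['ReferenceCounterApi', 503]) # first occurrence
current_time = 120.0
decisions << throttle.allow?(['ReferenceCounterApi', 503]) # same key, inside the window
decisions << throttle.allow?(['ReferenceCounterApi', 404]) # different status code
current_time = 301.0
decisions << throttle.allow?(['ReferenceCounterApi', 503]) # window has elapsed
```

A caller would wrap its error reporting in `if throttle.allow?(key)`, so repeated identical failures inside the window are dropped before they reach Sentry.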
Note: All tasks are subject to change and approval by the project mentor.
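The dynamic queue sorting idea in the table above (prioritizing by average update time rather than program length) might be sketched as follows. This is an illustration under stated assumptions: the class QueueSelector, the threshold values, and the queue names are hypothetical placeholders, not values taken from the Dashboard.

```ruby
# Hypothetical helper: choose a queue from a course's recent update durations.
class QueueSelector
  # Thresholds (in seconds) are illustrative assumptions, not measured values.
  def initialize(short_max: 60, medium_max: 600)
    @short_max = short_max
    @medium_max = medium_max
  end

  # durations: recent update times in seconds for one course.
  # fallback: queue to use when there is no update history yet.
  def queue_for(durations, fallback: 'medium_update')
    return fallback if durations.empty?

    average = durations.sum.to_f / durations.size
    if average <= @short_max
      'short_update'
    elsif average <= @medium_max
      'medium_update'
    else
      'long_update'
    end
  end
end

selector = QueueSelector.new
queues = [
  selector.queue_for([12, 20, 31]),       # average 21 s
  selector.queue_for([300, 420]),         # average 360 s
  selector.queue_for([1800, 2400, 900]),  # average 1700 s
  selector.queue_for([])                  # no history yet
]
```

Recomputing the average from a sliding window of recent updates would let a course move between queues as its update cost changes, which is the behavior the static, length-based sorting lacks.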
Current Initiatives in the Project
- The Status UI
Based on a discussion with the mentor and insights gathered from user research on the Programs & Events Dashboard talk page (https://meta.wikimedia.org/w/index.php?title=Talk:Programs_%26_Events_Dashboard&action=edit&section=new), I developed an initial set of requirements for the status UI, including the prioritized data points and functionality users need to monitor system status and performance effectively. I then designed a site architecture map and categorized the proposed metrics to provide an intuitive starting point for data exploration:
Google Drive Link: https://drive.google.com/drive/folders/1RwcY-5Hg8QntjiTjhn7Oe0dOOuJrwG6R
To support these requirements, I also created a prototype for the system status page in Canva, using dummy data to showcase functionality. This prototype features two alternative layouts: a widget-based design and a full-page format. Both formats aim to offer users clear, immediate access to key metrics and drill-down options for more detailed insights.
Both designs are intended as starting points and are adaptable to future feedback and refinements.
Prototype
Status UI Widget View
or alternatively, Status UI Full View
- Live(?) Link: https://www.canva.com/design/DAGU0nls4WM/_I-fK4qkrM6g3sA2TqVdaQ/view?utm_content=DAGU0nls4WM&utm_campaign=designshare&utm_medium=link&utm_source=editor
- Google Drive Photos: https://drive.google.com/file/d/1Bmga_VPRZe9hMj93ZQBXasSbNeTdWWeE/view?usp=sharing and https://drive.google.com/file/d/1gfGCC2tjl016moP0fHbJSNv-uqDPmo53/view?usp=sharing
Timeline
Summarized timeline capturing the main milestones in three-week segments:
Weeks | Description |
---|---|
Weeks 1 - 3 | Initial research, documentation, start of Sentry noise reduction, and UI wireframing and prototype creation. |
Weeks 4 - 6 | Start development stage of UI, start of Sentry storage management, and performance optimizations. |
Weeks 7 - 9 | Continue with queue optimizations and Sentry alerts, and prepare for UI deployment. |
Weeks 10 - 12 | Complete UI testing and deployment, add final improvements to Sentry / New Relic, and monitor the improvements to ensure everything works as expected. |
Week 13 | Final documentation and finishing touches + project report. |
Week 6 (Jan 13 - 17)
- Finish Sentry storage management and document retention policy setup
- Begin performance optimization, focusing on dynamic queue sorting.
- Start setting up memory monitoring.
- Milestone: Draft on performance optimization strategies, documented progress on Sentry setup.
Week 7 (Jan 20 - 24)
- Continue dynamic queue sorting and implement job/task duration alerts in Sentry.
- Further UI development; expand UI for detailed performance views.
- Milestone: Mid-development UI and feedback on progress with queue optimizations.
Week 8 (Jan 27 - Jan 31)
- Complete job/task duration alerts in Sentry.
- Implement new handled error detection logic and improve Sentry tagging.
- Continue UI refinement, add tooltips and user-oriented details.
- Milestone: Review of improved alerts functionality and Refined UI.
Week 9 (Feb 3 - 7)
- Finalize UI improvements (tooltips, filtering/breakdown selectors).
- Gather final feedback on UI before Deployment
- Milestone: UI refinement and feedback before Deployment.
Week 10 (Feb 10 - 14)
- Start UI Deployment Stage: Decide on tools and methods + Update Documentation
- Enhance systematic course update failure handling in Sentry.
- Milestone: Review of current Sentry Improvements, Deployed UI.
Week 11 (Feb 17 - 21)
- Refine memory monitoring and automatic query recovery.
- Finalize the dynamic queue sorting implementation.
- Final UI touches based on performance insights + Add unit / integration tests to UI code
- Milestone: Review of improvements, Documented performance insights and final UI adjustments.
Week 12 (Feb 24 - Feb 28)
- Monitor added configurations to Sentry / New Relic and status UI and improve all as needed
- Finish adding tests to UI code
- Milestone: Monitoring results, Finished tests for UI code.
Week 13 (Mar 3 - 7)
- Final testing, bug fixes, documentation and finishing touches.
- Project completion and internship end on Mar 7, 2025
- Milestone: Final project report and documentation of completed project to mentors and community.
(Note: All public and general holidays are assumed to be observed and all timelines are subject to change and approval by the project mentor. I am flexible and can accommodate any changes as needed.)
Contributions During the WikiEduDashboard Contribution Period
- Merged PR: Adds an additional string to the DELETED_REVISION_ERRORS array and renames a cassette (fixes this issue).
- Documented Debugging Process: GitHub Issue, Phabricator Issue: T377898
- Merged PR: Feature to allow Programs to be Deleted before Removing Campaigns on Programs & Events Dashboard (fixes this issue).
- Merged PR: Fixing layout bug on Articles Edited table (fixes this issue).
- Merged PR: Adding Tooltip for Article Views Metric (fixes this issue).
- Merged PR: Handles truncated API responses that lead to revisions being null and causing a 'NoMethodError' (fixes these issues: WikidataDiffAnalyzerGem Issue, WikiEdu Issue).