Outreachy Round 29: Proposal for ‘Improve observability of Wiki Education Dashboard’ project.
Open, Low priority, Public

Description

My Meta-wiki User Page: Emptycodes
Typical working hours: 10:00am - 6:00pm PT (6pm - 2am WAT / 5pm - 1am UTC). I am open to flexible hours as needed.

Project ‘Improve observability of Wiki Education Dashboard’

This project focuses on improving the observability of the Dashboard and making it easier for system administrators — and possibly end-users as well — to detect and understand problems with the system.

Problem Statements

  • The current Sentry setup does not reliably detect common service disruptions and is too noisy, making it difficult to surface new bugs or regressions.
  • Performance problems are not always isolated effectively for quick troubleshooting.
  • System administrators and potentially end-users face difficulty detecting and understanding problems within the system due to inadequate visibility of errors and performance issues.

Mentor: @Ragesoss

Deliverables

Implementation Details

[A] Existing Features:

The following gems and tools are used for error reporting and observability (relevant to this project):

1. Sidekiq (with sidekiq-unique-jobs and sidekiq-cron)

  • Function: Manages background worker jobs, ensuring that tasks (e.g., data updates, notifications) run asynchronously without blocking the main application.
  • Integration:
    • Workers on VMs: Each worker is assigned a separate virtual machine (VM) to isolate potential bottlenecks and failures, ensuring smoother processing for large tasks (e.g., course updates).
    • Queues: Separate queues for long, medium, and short jobs are created, with dedicated VMs (8-core for long/medium jobs, 4-core for short jobs).
    • Cron Jobs: sidekiq-cron schedules tasks at regular intervals, such as course updates.
    • Error Handling: Uses Sentry for reporting errors in job execution.

2. Sentry (sentry-ruby, sentry-rails, sentry-sidekiq)

  • Function: Provides error reporting and observability for both Ruby server-side code and client-side JavaScript.
  • Integration:
    • Error Logging: Sentry captures and logs errors across the system. It sends "envelopes" containing error sessions to Sentry via background jobs (handled by Sidekiq). The Sentry.capture_exception or Sentry.capture_message methods log errors and messages.
    • Contextual Data: The system provides extra context in logs, such as project info, revision IDs, or error counts (e.g., error_count for LiftWingApi or ReferenceCounterApi). This helps track specific actions or issues.
    • Admin Access: Only administrators can view Sentry logs, but developers can point their environment to Sentry to test logging.
    • Custom Error Reporting: Developers can create custom error reports, such as sending logs from course updates (log_update_progress) or user actions (FAQ query logging).
    • Error Handling Logic in lib/errors: Custom error handling logic tracks errors for designated processes, e.g., the course update process. Specific logs track the progress of updates (e.g., UpdateLogger.update_course) and ensure errors are reported with contextual information (e.g., sentry_tag_uuid).

3. New Relic (newrelic_rpm)

  • Function: Performance monitoring tool that tracks the application's performance, including metrics like queue latency and throughput.
  • Integration:
    • Queue Latency Monitoring: The NewRelicQueueLatencyLogger middleware tracks how long jobs have been in the queue (e.g., for Sidekiq workers). Latency is added as a custom attribute in New Relic for detailed performance analysis.
    • Performance Graphs: New Relic provides visual insights into system performance, such as latency spikes or processing bottlenecks.
    • Queue Info: The logger captures detailed info for specific jobs, such as course updates, adding context to performance metrics.
    • Development Overhead: Running New Relic in developer mode adds overhead, so it is typically enabled in production for performance analysis.
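
The queue latency measurement described above can be sketched as follows. This is a minimal illustration, not the Dashboard's actual NewRelicQueueLatencyLogger: the class name QueueLatencySketch, the injected reporter, and the worker and queue names are hypothetical, and it assumes the job payload stores enqueued_at as epoch seconds. In a real setup the reporter would hand the attributes to the monitoring agent.

```ruby
# Sketch of a Sidekiq-style server middleware that measures how long a job
# waited in the queue. All names here are hypothetical stand-ins.
class QueueLatencySketch
  def initialize(reporter:, clock: -> { Time.now.to_f })
    @reporter = reporter # callable that receives a Hash of attributes
    @clock = clock       # injectable clock so the sketch is testable
  end

  # Server middleware is invoked with the job instance, the job payload,
  # and the queue name, and must yield to run the job.
  def call(_job_instance, job_payload, queue)
    enqueued_at = job_payload['enqueued_at'] # assumed to be epoch seconds
    if enqueued_at
      latency = (@clock.call - enqueued_at.to_f).round(3)
      @reporter.call({ queue: queue,
                       job_class: job_payload['class'],
                       latency_seconds: latency })
    end
    yield
  end
end

recorded = []
middleware = QueueLatencySketch.new(
  reporter: ->(attributes) { recorded << attributes },
  clock: -> { 1_000.5 } # fixed time for the example
)
payload = { 'class' => 'ExampleWorker', 'enqueued_at' => 990.0 }
result = middleware.call(nil, payload, 'long') { :job_ran }
```

Injecting the reporter keeps the latency calculation separate from any specific monitoring vendor, which also makes it easy to unit test.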

4. Performance Monitoring Gems

  • rack-mini-profiler: Provides real-time performance profiling of pages. It shows how much time each part of the page load process takes.
  • stackprof: Provides more in-depth performance profiling, allowing the analysis of CPU time spent on specific parts of the code.

5. Webpack

  • Function: Manages bundling and source maps for JavaScript assets.
  • Integration:
    • Source Maps: In production, Webpack generates full source maps (devtool set to source-map) to aid debugging of JavaScript errors, so that even minified code can be traced back to the original source.
    • Development Configuration: In development, cheaper source maps (eval-cheap-source-map) are used for faster builds while still supporting debugging.

Talk page where users can leave comments about bugs, observations, or errors: https://meta.wikimedia.org/w/index.php?title=Talk:Programs_%26_Events_Dashboard&action=edit&section=new

Current general sites for checking system status, verifying the overall health and performance of the wiki environment, and finding out about major outages (primarily for developers). To be used as a reference in building the UI.

[B] Proposed Features:

The proposed work is grouped by main task. Each subtask lists the issue it addresses and the planned implementation.

Main Task: Reducing Noise in Sentry

1. Adjusting Sentry Alerts to Reduce Noise

  • Issue: Difficult to track series of failures affecting update systems.
  • Implementation:
    • Configure Sentry Alerts: Adjust alerts to reduce noise following best practices.

2. Suppressing Benign Sentry TCP Errors

  • Issue: Encountering TCP connection errors in development can clutter logs and mislead beginners during setup.
  • Implementation:
    • Modify Sentry Configuration: Add logic to check whether the environment is development and whether the sentry_dsn environment variable is set. Then either initialize Sentry only if sentry_dsn is set, or specifically capture and suppress the logging of getaddrinfo errors in development.
    • Add a note in the docs explaining that Sentry is optional in development and how to configure or skip it, since normally only admins have access to the Sentry server.

3. Managing Storage of Error Records

  • Issue: Storing individual error records without a deletion strategy risks filling the database and contributes to noise on Sentry.
  • Implementation:
    • Set Retention Policies: Configure Sentry to retain events based on age or volume.
    • Document and implement retention policies. ✔️ (Issues deleted after 3 months automatically)

4. Preventing Excessive Sentry Logging from ReferenceCounterApi

  • Issue: Non-200 responses flood Sentry logs.
  • Implementation:
    • Determine and Configure Log Thresholds: For example, log only the first occurrence of a specific non-200 response within a 5-minute window.
    • Use Sentry Tagging: Group logs by error type or response code.

5. Enhancing Handled Errors in Sentry

  • Issue: Current error handling is too generic (an example of 'handled' errors can be found in replica.rb).
  • Implementation:
    • Expand TYPICAL_ERRORS: Include additional network errors to be handled.
    • Filter Handled Errors in Sentry: Implement a mechanism to filter out handled errors so they do not clutter Sentry logs.

6. Setting Custom Event(s) for Systematic Course Update Failures

  • Issue: Consistent failures in updating specific courses may go unnoticed.
  • Implementation:
    • Track Pre- and Post-Update Statistics: Log and compare course stats before and after updates.
    • Custom Sentry Event for Course Errors: Aggregate errors related to UpdateCourseStats by course, not just by error type.

Main Task: Enhancing Sentry Configuration

1. Setting Additional Sentry Configurations

  • Issue: Sentry is not fully configured for optimal tracking and error logging, as seen in sentry.rb.
  • Implementation:
    • Set up additional configurations such as transaction tracing, breadcrumbs, and cron job monitoring.
    • Use Source Maps: Upload the source maps generated by Webpack to Sentry, if not already doing so.

2. Handling Bugs Without Explicit Errors

  • Issue: Rare bugs may occur without explicit errors in the JavaScript console or network requests.
  • Implementation:
    • Implement additional input/response validation to reduce the chance of unexpected results.
    • If invalid inputs or responses are captured, trigger custom Sentry events capturing relevant data.
    • Log issues directly to the console in development/staging to avoid production noise.

3. Preventing Server Running out of Resources Failure Mode

  • Issue: This failure mode was solved by disabling manual updates. It is not a perfect fix, however, as multiple users report that the dashboard did not update stats even after days of updating.
  • Implementation:
    • Set up a Sentry alert for Sidekiq tasks that take an abnormally long time or fail to complete. Include details in the alert such as program name, expected completion time, and elapsed time, so administrators can investigate.

Main Task: Optimizing Performance

1. Implementing Dynamic Queue Sorting for Updates

  • Issue: Current queue sorting behavior is static, relying only on program length.
  • Implementation:
    • Implement Dynamic Queue Sorting: Prioritize jobs based on average update time instead of fixed characteristics like program length.
    • Dynamically adjust the load across different queues to prevent a bottleneck, especially in the long queue.
    • Implement a smarter sorting algorithm that recalculates average update times regularly; configure Sidekiq to support dynamic priority-based processing across different queues.
    • Expand New Relic's Queue Latency Monitoring logic in nr_sidekiq_queue_latency.rb to be more robust.

2. Preventing Server Running Out of Memory Failure Mode

  • Issue: Update processes could run out of memory, potentially causing crashes. Processes killed due to memory exhaustion do not generate Sentry logs.
  • Implementation:
    • Set Up Memory Monitoring: Use New Relic and define memory thresholds.
    • Implement Custom Sentinel Process: Monitor for unexpected terminations and alert when they occur.
    • (Optional) Automatic Process Management: Configure the Linux OOM killer and log shutdown events. Note that it is important to terminate processes gracefully (e.g., with application-level shutdown tools).

3. Monitoring Update Scheduling Queue Health

  • Issue: Outage caused by a backlog in the update scheduling queue.
  • Implementation:
    • Implement Automated Monitoring: Set up Sidekiq and New Relic metrics to track queue health.
    • Automatic Query Recovery: Implement mechanisms to clear or retry jobs if stuck.

4. Optimizing New Relic Configuration for Global Server

  • Issue: Exceeding the data ingestion limit in New Relic too early in the month prevents proper performance monitoring for the rest of the period.
  • Implementation:
    • Modify the configuration to optimize which data gets sent to New Relic.
    • Focus on capturing essential metrics, balancing detailed data against the allowed quota.

Main Task: Building a status UI to present system status and performance information

1. Design Process: Define the visual and functional layout of the Status Info UI.

  • Implementation:
    • User Research and Requirement Analysis: Convert user needs into user and system requirements, audit existing metrics, and prioritize the importance of available and future metrics ✔️
    • Develop a new site architecture ✔️
    • Wireframing and Prototyping ✔️

2. Development: Implement the UI and integrate it with the backend data sources. ✔️

  • Implementation:
    • Frontend Development ✔️
    • Backend Integration ✔️

3. Deployment ✔️

  • Implementation: TBD

4. Testing and Maintenance: Ensure the reliability of the UI and plan for ongoing updates. ✔️

  • Implementation:
    • Unit and Integration Testing + Other Tests ✔️
    • Maintenance: Maintain comprehensive documentation for developers and end-users

Main Task: Improving Documentation

1. Write documentation about how the app is deployed in Cloud VPS and Toolforge

  • Issue: T375642 - It would be very useful to have a wiki page somewhere with more complete and accurate information about the deployment, how to debug issues, restart the app, etc. We have a fair amount of documentation in the codebase, but it's spread out and not geared towards letting others easily hop in and fix production problems. The app runs in Cloud VPS, in the globaleducation project. There is also the Toolforge tool wikiedudashboard which is running the code from WikiEduDashboardTools ✔️

Main Task: Additional Tasks

1. Addressing Concerns About Decreasing Pageviews

  • Issue: Multiple users expressed concern over decreasing pageviews.
  • Implementation:
    • Create a tooltip explaining that pageview fluctuations are normal, as they are just 30-day-based estimates ✔️
    • Integrate the tooltip into the UI where pageviews are displayed ✔️

2. Enhancing User-Friendly Error Handling

  • Issue: TODOs in lib/errors/rescue_errors.rb to improve handling for HTML errors.
  • Implementation:
    • Add more user-friendly handling for HTML errors.
    • Review current error handling logic.
    • Implement user-friendly error messages.
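
The log threshold idea proposed for ReferenceCounterApi (report only the first occurrence of a given non-200 response within a 5-minute window) could look roughly like the sketch below. The class name FirstOccurrenceThrottle and the key format are hypothetical, and the state is kept in memory for one process only; with several Sidekiq processes, a shared store would be needed instead.

```ruby
# Sketch of "report only the first occurrence per window" throttling.
# Hypothetical helper, not part of the Dashboard codebase.
class FirstOccurrenceThrottle
  def initialize(window_seconds: 300, clock: -> { Time.now.to_f })
    @window_seconds = window_seconds
    @clock = clock
    @last_reported = {} # key => time of the last report that was let through
  end

  # Returns true when the caller should report this key, false when the
  # same key was already reported inside the current window.
  def report?(key)
    now = @clock.call
    last = @last_reported[key]
    return false if last && (now - last) < @window_seconds

    @last_reported[key] = now
    true
  end
end

now = 0.0
throttle = FirstOccurrenceThrottle.new(window_seconds: 300, clock: -> { now })
decisions = []
decisions << throttle.report?('ReferenceCounterApi:404') # first occurrence
now = 120.0
decisions << throttle.report?('ReferenceCounterApi:404') # inside the window
decisions << throttle.report?('ReferenceCounterApi:500') # different key
now = 301.0
decisions << throttle.report?('ReferenceCounterApi:404') # window has elapsed
```

A caller would wrap its Sentry reporting in a check such as "report only if throttle.report?(key)", using the response code as part of the key so that distinct failures are still surfaced.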

Note: All tasks are subject to change and approval by the project mentor.
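
As an illustration of the dynamic queue sorting subtask, the sketch below picks a queue from the average of recent update durations rather than from a fixed characteristic such as program length. The queue names, the thresholds, and the fallback for courses with no history are all assumptions made for the example.

```ruby
# Sketch: choose a queue from the average of recent update durations.
# Queue names and thresholds (in seconds) are hypothetical.
QUEUE_LIMITS = [
  ['short_update', 60.0],
  ['medium_update', 600.0]
].freeze
LONG_QUEUE = 'long_update'

# Average of the recorded durations, or nil when there is no history yet.
def average_duration(durations)
  return nil if durations.empty?

  durations.sum / durations.size.to_f
end

# Picks the first queue whose limit exceeds the average; anything slower
# goes to the long queue. Courses without history use the fallback queue.
def queue_for(durations, fallback: 'medium_update')
  average = average_duration(durations)
  return fallback if average.nil?

  match = QUEUE_LIMITS.find { |_queue, limit| average < limit }
  match ? match.first : LONG_QUEUE
end

assignments = {
  quick_course: queue_for([10, 20, 30]),
  typical_course: queue_for([300, 500]),
  slow_course: queue_for([1200, 900]),
  new_course: queue_for([])
}
```

Recomputing the average on a schedule, as the subtask suggests, would let a course move between queues as its update times change.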

Current Initiatives in the Project

The Status UI
Based on a discussion with the mentor and on user research drawn from the Programs & Events Dashboard talk page (https://meta.wikimedia.org/w/index.php?title=Talk:Programs_%26_Events_Dashboard&action=edit&section=new), I developed an initial set of requirements for the status UI, covering the prioritized data points and functionality users need to monitor system status and performance effectively. I then designed a site architecture map and categorized the proposed metrics to provide an intuitive starting point for data exploration:

[Figure: Draft of WikiEduDashboard Proposed Status UI (light theme)]

Google Drive Link: https://drive.google.com/drive/folders/1RwcY-5Hg8QntjiTjhn7Oe0dOOuJrwG6R
To support these requirements, I also created a prototype for the system status page in Canva, using dummy data to showcase functionality. This prototype features two alternative layouts: a widget-based design and a full-page format. Both formats aim to offer users clear, immediate access to key metrics and drill-down options for more detailed insights.
Both designs are intended as starting points and are adaptable to future feedback and refinements.

Prototype
Status UI Widget View

[Figure: Proposed Prototype of Status UI Widget View]

Or alternatively, Status UI Full View

[Figure: Proposed Prototype of Status UI Full View]

Timeline

Summarized timeline capturing the main milestones in three-week segments:

  • Weeks 1 - 3: Initial research, documentation, start of Sentry noise reduction, and UI wireframing and prototype creation.
  • Weeks 4 - 6: Start of UI development, start of Sentry storage management, and performance optimizations.
  • Weeks 7 - 9: Continue with queue optimizations and Sentry alerts, and prepare for UI deployment.
  • Weeks 10 - 12: Complete UI testing and deployment, add final improvements to Sentry / New Relic, and monitor the improvements to ensure everything works as intended.
  • Week 13: Final documentation and finishing touches, plus project report.

Week 6 (Jan 13 - 17)

  • Finish Sentry storage management and document retention policy setup
  • Begin performance optimization, focusing on dynamic queue sorting.
  • Start setting up memory monitoring.
  • Milestone: Draft on performance optimization strategies, documented progress on Sentry setup.

Week 7 (Jan 20 - 24)

  • Continue dynamic queue sorting and implement job/task duration alerts in Sentry.
  • Further UI development; expand UI for detailed performance views.
  • Milestone: Mid-development UI and feedback on progress with queue optimizations.

Week 8 (Jan 27 - Jan 31)

  • Complete job/task duration alerts in Sentry.
  • Implement new handled error detection logic and improve Sentry tagging.
  • Continue UI refinement, add tooltips and user-oriented details.
  • Milestone: Review of improved alerts functionality and Refined UI.

Week 9 (Feb 3 - 7)

  • Finalize UI improvements (tooltips, filtering/breakdown selectors).
  • Gather final feedback on UI before Deployment
  • Milestone: UI refinement and feedback before Deployment.

Week 10 (Feb 10 - 14)

  • Start UI deployment stage: decide on tools and methods, and update documentation.
  • Enhance systematic course update failure handling in Sentry.
  • Milestone: Review of current Sentry Improvements, Deployed UI.

Week 11 (Feb 17 - 21)

  • Refine memory monitoring and automatic query recovery.
  • Finalize the dynamic queue sorting implementation.
  • Final UI touches based on performance insights + Add unit / integration tests to UI code
  • Milestone: Review of improvements, Documented performance insights and final UI adjustments.

Week 12 (Feb 24 - Feb 28)

  • Monitor added configurations to Sentry / New Relic and status UI and improve all as needed
  • Finish adding tests to UI code
  • Milestone: Monitoring results, Finished tests for UI code.

Week 13 (Mar 3 - 7)

  • Final testing, bug fixes, documentation and finishing touches.
  • Project completion and internship end on Mar 7 2025
  • Milestone: Final project report and documentation of completed project to mentors and community.

(Note: All public and general holidays are assumed to be observed and all timelines are subject to change and approval by the project mentor. I am flexible and can accommodate any changes as needed.)

Contributions During the WikiEduDashboard Contribution Period

  1. Merged PR: Adds an additional string to the DELETED_REVISION_ERRORS array and renames a cassette (fixes this issue).
  2. Documented debugging process: GitHub issue; Phabricator issue T377898.
  3. Merged PR: Feature to allow Programs to be deleted before removing Campaigns on Programs & Events Dashboard (fixes this issue).
  4. Merged PR: Fixing layout bug on Articles Edited table (fixes this issue).
  5. Merged PR: Adding tooltip for Article Views metric (fixes this issue).
  6. Merged PR: Handles truncated API responses that lead to revisions being null and causing a 'NoMethodError' (fixes these issues: WikidataDiffAnalyzerGem issue, WikiEdu issue).

Event Timeline

Aklapper added a project: Outreachy (Round 29).

@empty-codes: Hi, please provide a link to the corresponding task for context. Thanks!

empty-codes claimed this task.
empty-codes triaged this task as Low priority.
empty-codes updated the task description.

@Ragesoss Hello, I would like to request some feedback on my proposal. Are my tasks feasible and on the right path? Thank you in advance!

@empty-codes: Hi, please provide a link to the corresponding task for context. Thanks!

Oh sorry, do you mean the parent Outreachy project that the proposal is for?

This looks like a solid timeline.

empty-codes renamed this task from Draft Proposal for Review to Outreachy Round 29 Proposal for ‘Improve observability of Wiki Education Dashboard’ project. (Oct 29 2024, 2:12 PM)
empty-codes updated the task description.
empty-codes renamed this task from Outreachy Round 29 Proposal for ‘Improve observability of Wiki Education Dashboard’ project. to Outreachy Round 29: Proposal for ‘Improve observability of Wiki Education Dashboard’ project. (Oct 29 2024, 2:30 PM)
empty-codes updated the task description.

Week 1 (Dec. 9 - 13) Update

Tasks Completed:
  • Completed community bonding tasks: created Meta-Wiki user page, wrote my first blog post, updated my Phabricator task, joined Zulip
  • Started Analysis of Sentry Implementation
  • Started Documentation on P&E Tools for Cloud
  • Created GitHub branch for UI and a draft PR using site architecture map as reference
Learnings:
  • What 'core values' are and what I think mine are
  • Learnt about Cloud VPS and Toolforge in detail; understood how the dashboard servers are set up and deployed
  • Expanded my knowledge of Sentry concepts and configuration options
  • Explored Rails MVC structure and Sidekiq APIs
Milestone Achieved ✔️

Week 2 (Dec. 16 - 20) Update

Tasks Completed:
  • Created first MVP of UI
  • Completed Analysis of Sentry Implementation
  • Implemented actions to reduce noise in Sentry
  • Wrote 2nd blog post on the theme 'Everyone Struggles'
Learnings:
  • How Rails maps the MVC structure; Haml syntax
  • Understood what git rebase does and when it should be used
  • Learnt various ways to filter out data in Sentry
Milestone Achieved ✔️

Week 3 (Dec. 23 - 27) Update

Tasks Completed:
  • Continued work on UI; wrote tests (specs) for controller and service class
  • Monitored Sentry dashboards and archived more un-actionable issues
  • Drafted the code to add filters to the Sentry SDK configuration
Learnings:
  • Sentry configuration keywords like before_send, config.excluded_exceptions
  • Sentry event structure:
{ 'level': 'error', 'exception': { 'values': [{ 'module': 'exceptions', 'type': ' ', 'value': ........
  • More Ruby and Rails Concepts and Syntax
Milestone Achieved ✔️
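
To make the before_send idea concrete, here is a minimal sketch of a filter with the same contract: return the event to send it, or nil to drop it. It operates on a plain Hash shaped like the event structure noted above, which is an assumption for illustration; the real SDK passes its own event object to the callback, and the list of ignored exception types is hypothetical.

```ruby
# Exception types treated as noise in this sketch (hypothetical list).
IGNORED_TYPES = ['Net::ReadTimeout', 'Errno::ECONNREFUSED'].freeze

# Mirrors the contract of a before_send callback: return the event to keep
# it, or nil to drop it. The event here is a plain Hash stand-in.
filter_event = lambda do |event, _hint = {}|
  values = event.dig('exception', 'values') || []
  noisy = values.any? { |value| IGNORED_TYPES.include?(value['type']) }
  noisy ? nil : event
end

timeout_event = {
  'level' => 'error',
  'exception' => { 'values' => [{ 'type' => 'Net::ReadTimeout', 'value' => 'timed out' }] }
}
bug_event = {
  'level' => 'error',
  'exception' => { 'values' => [{ 'type' => 'NoMethodError', 'value' => 'undefined method' }] }
}

dropped = filter_event.call(timeout_event)
kept = filter_event.call(bug_event)
```

Exceptions that should never be reported at all are usually a better fit for config.excluded_exceptions, while a callback like this suits decisions that depend on the event's contents.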

Week 4 (Dec. 30 - Jan 3rd) Update

Tasks Completed:
  • Continued work on UI; Updated queue status to be displayed based on set latency thresholds
  • Completed Documentation efforts with 2 Merged PRs (#6082, #6084):
  • Noise reduction work in Sentry: Debugged some recurrent Sentry errors
Learnings:
  • Learnt about all(?) the APIs and toolforge tools the Dashboard uses; Improved my knowledge of the Dashboard Infrastructure overall
  • More about writing tests for code, ergo TDD
  • Explored debugging tools like binding.pry, binding_of_caller ......
Milestone Achieved ✔️
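
Displaying queue status from latency thresholds can be sketched as below. The threshold values and the status labels are hypothetical; the Dashboard's real thresholds are not stated in this update.

```ruby
# Hypothetical thresholds in seconds; not the Dashboard's actual values.
DEFAULT_THRESHOLDS = { warning: 300, critical: 1800 }.freeze

# Maps a queue latency (seconds) to a status label for display.
def queue_status(latency_seconds, thresholds = DEFAULT_THRESHOLDS)
  return 'Critical' if latency_seconds >= thresholds[:critical]
  return 'Backlogged' if latency_seconds >= thresholds[:warning]

  'Normal'
end

statuses = [12, 450, 3600].map { |latency| queue_status(latency) }
# A queue of short jobs could be given stricter limits:
short_queue_status = queue_status(90, { warning: 60, critical: 600 })
```

Passing thresholds per call allows each queue to have its own limits, since an acceptable wait for long updates may already signal a backlog for short ones.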

Week 5 (Jan 6th - Jan 10th) Update

Tasks Completed:
  • Wrote 3rd blog post
  • Discussed expectations for the UI with mentor
  • Focused on sentry noise reduction work:
    • Fixed NoMethodError causing noise
    • Debugging other issues
Learnings:
  • HTTP libraries in Ruby: Faraday and Net::HTTP, and their use cases
  • StandardError as the superclass for application errors
  • Applied load balancing concepts in practice
  • Understood the concept of 'replication lag' concerning WMF Cloud replica databases
Milestone Achieved ✔️

Week 6 (Jan 13th - Jan 17th) Update

Tasks Completed:
  • Focused on reducing noise in Sentry work:
    • Found possible causes and their corresponding fixes for a 500 Internal Server Error - Merged PR
    • Investigated 2 issues that turned out to be caused by external sources (not our code) and documented my findings
    • Refactored error handling logic by creating a custom error class and using it in relevant classes - PR
Learnings:
  • CGI escape and unescape methods
  • The concept of promises in JavaScript
  • That rescue_from comes from ActiveSupport::Rescuable, which Rails controllers include, so it isn't automatically available in plain Ruby classes
  • How autoloading and eager loading work in relation to Zeitwerk configuration
Milestone Achieved ✔️
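
The custom error class refactor and the rescue_from learning can be illustrated with a small sketch. The names ApiRequestError and PlainFetcher are hypothetical and do not come from the PR; the point is that a plain Ruby class rescues explicitly instead of relying on rescue_from.

```ruby
# Hypothetical custom error carrying extra context (an HTTP status).
class ApiRequestError < StandardError
  attr_reader :status

  def initialize(message = 'API request failed', status: nil)
    super(message)
    @status = status
  end
end

# A plain Ruby class has no rescue_from, so it uses an explicit rescue.
class PlainFetcher
  def fetch
    yield
  rescue ApiRequestError => e
    # Convert the error into a result the caller can inspect or report.
    { ok: false, status: e.status, message: e.message }
  end
end

fetcher = PlainFetcher.new
failure = fetcher.fetch { raise ApiRequestError.new('upstream error', status: 500) }
success = fetcher.fetch { :data }
```

Keeping one shared error class means every caller can rescue the same type and attach the same contextual fields when reporting.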

Week 7 (Jan 20th - Jan 24th) Update

Tasks Completed:
  • Focused on benign React Errors such as NoMethodErrors, TypeErrors, NotFoundErrors on Sentry:
    • Investigated and documented each issue (error occurrence) and reported my findings to my mentor daily
    • Opened a draft PR for the fixes; fixes found for 8 errors so far
Learnings:
  • Google Translate sometimes manipulates the DOM and ends up causing: NotFoundError: Failed to execute 'removeChild' on 'Node': The node to be removed is not a child of this node. This blog post I found explains it in detail.
  • Web crawlers often encounter errors that normal users won't see
  • Debugging errors where the stack trace shows only minified JS, usually from the react-dom file, is very time-consuming
Milestone Achieved ✔️

Week 8 (Jan 27th - 31st) Update

Tasks Completed:
Learnings:
  • Sentry's new UI has the option to add comments; it's very helpful for feedback
  • When RSpec stubs a method without an explicit return value, the stub returns nil
  • Importance of double checking your code + usefulness of .inspect
  • When something is nil, Ruby (and therefore Rails) interpolates it as an empty string
  • How to filter events out in Sentry, more about Sentry event structures, and which Sentry code is deprecated
Milestone Achieved ✔️
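
A short example of the nil interpolation behaviour and why .inspect helps when debugging. The variable name revision_id is only illustrative.

```ruby
revision_id = nil

# Interpolating nil calls to_s on it, which gives an empty string, so the
# message silently loses the value.
interpolated = "revision: #{revision_id}"

# inspect keeps nil visible in logs and error messages.
inspected = "revision: #{revision_id.inspect}"

puts interpolated
puts inspected
```

Using inspect in log and error messages makes an unexpected nil obvious instead of leaving a blank that is easy to overlook.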

Week 9 (Feb 3rd - 7th) Update

Tasks Completed:
  • Created draft PR for work on adding filters to the Sentry SDK configuration
  • Started debugging a MediaWiki API error
Learnings:
  • A lot of big codebases have unclear / not up-to-date documentation
  • More about writing tests for sentry
  • Regex patterns and how to use them
Milestone Achieved ✔️

Week 10 (Feb 10th - 14th) Update

Tasks Completed:
  • Continued working on draft PR for work on adding filters to sentry sdk configuration
  • Updated WikiApi::PageFetchError handling for the StudentGreetingChecker class, see commit
  • Opened Draft PR for MediawikiApi::ApiError fix
  • Continued monitoring Sentry dashboards
Learnings:
  • Apache's default 'LimitRequestFieldSize' is 8190 bytes
  • When you hit error 414 (URI Too Long), it's best to switch to a POST request
  • That errors can pop up from anywhere; browsers, external servers, extensions, APIs, and 3rd party dependencies
  • .djvu is a file extension

Milestone Achieved ✔️
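
The 414 lesson can be sketched as a simple check that chooses POST once the encoded URL would exceed a size limit. The helper name and the example URL are hypothetical. The limit reuses the 8190-byte figure noted above; in Apache the request line itself is governed by LimitRequestLine, which has the same default, and other servers may use different limits.

```ruby
require 'uri'

# Assumed limit in bytes, taken from the Apache default noted above.
MAX_URL_BYTES = 8190

# Returns :get when the full URL fits within the limit, otherwise :post.
def http_method_for(base_url, params, limit: MAX_URL_BYTES)
  url = "#{base_url}?#{URI.encode_www_form(params)}"
  url.bytesize > limit ? :post : :get
end

api_url = 'https://example.org/w/api.php' # placeholder endpoint
short_params = { action: 'query', titles: 'A|B' }
long_params = { action: 'query', titles: (1..2000).map { |i| "Page #{i}" }.join('|') }

short_choice = http_method_for(api_url, short_params)
long_choice = http_method_for(api_url, long_params)
```

Checking the encoded length matters because characters such as the pipe separator grow when percent-encoded, so the raw title list understates the real URL size.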

Week 11 (Feb 17th - 21st) Update

Tasks Completed:
  • Continued monitoring Sentry dashboards
  • Investigated MediawikiApi::HttpError - unexpected HTTP response (414) error
  • Learnt how to convert MediaWiki GET requests to POST
Learnings:
  • Adding http_method: :post to query parameters is the easiest way to convert a request to POST
  • You can still send parameters in the query URL of a POST request, but the request body is the preferred place for them

Milestone Achieved ✔️
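
The difference between sending parameters in the query URL and in the request body can be seen by building both requests with Net::HTTP from the standard library. Nothing is sent over the network here, and the endpoint and parameters are placeholders rather than the Dashboard's actual calls.

```ruby
require 'net/http'
require 'uri'

uri = URI('https://example.org/w/api.php') # placeholder endpoint
params = { 'action' => 'query', 'titles' => 'A|B', 'format' => 'json' }

# GET: the parameters are encoded into the URL's query string.
get_uri = uri.dup
get_uri.query = URI.encode_www_form(params)
get_request = Net::HTTP::Get.new(get_uri)

# POST: the same parameters travel in the form-encoded request body, so
# the request line stays short however many titles are included.
post_request = Net::HTTP::Post.new(uri)
post_request.set_form_data(params)

puts get_request.path
puts post_request.path
puts post_request.body
```

Because the POST path never grows with the parameter list, it avoids the URL length limits that trigger a 414 response.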

Week 12 (Feb 24th - 28th) Update

Tasks Completed:
  • Merged PR for 414 error fix: #6212
  • Started researching authenticated MediaWiki requests
  • Started documenting overall observations about the dashboard system
Learnings:
  • Time flies really fast

Milestone Achieved ✔️

Week 13 (Mar 3 - 7th) Update

Tasks Completed:

I am really proud of what I have achieved during this internship:

and I am thankful to my mentor and the community. I look forward to making more contributions and doing more impactful work.
It has been a very good run, thank you.