Page MenuHomePhabricator

Possible statsv corruption?
Closed, ResolvedPublic

Description

I'm seeing a few odd top-level metrics being created in Graphite that don't belong.

MediaWiki_timing_changesListSpecialPage_backendResponse_Recentchanges
ve_mwTarget_behavior_firstTransaction
media_thumbnail_client_1024                            MediaWiki_timing_changesListSpecialPage_backendResponse_Recentchangeslinked   ve_mwTarget_behavior_lastTransactionTillSaveDialogOpen
media_thumbnail_client_1280                            MediaWiki_timing_changesListSpecialPage_ready_Recentchanges                   ve_mwTarget_behavior_saveDialogClose
media_thumbnail_client_1920                            MediaWiki_timing_changesListSpecialPage_ready_Recentchangeslinked             ve_mwTarget_behavior_saveDialogOpenTillReview
media_thumbnail_client_600                             MediaWiki_timing_structuredChangeFilters_backendResponse_Recentchanges        ve_mwTarget_behavior_saveDialogOpenTillSave
media_thumbnail_client_800                             MediaWiki_timing_structuredChangeFilters_backendResponse_Recentchangeslinked  ve_mwTarget_performance_system_activation
MediaWiki_echo_overlay_api                             MediaWiki_timing_structuredChangeFilters_ready_Recentchanges                  ve_mwTarget_performance_system_apiLoad
MediaWiki_echo_overlay_ooui                            MediaWiki_timing_structuredChangeFilters_ready_Recentchangeslinked            ve_mwTarget_performance_system_restbaseLoad
MediaWiki_echo_unseen_click                            ve_mwTarget_performance_user_reviewError
MediaWiki_RevisionSlider_event_ini                     ve_mwTarget_performance_user_saveError_unknown
MediaWiki_RevisionSlider_timing_init                   ve_mobile_performance_system_apiLoad
MediaWiki_RevisionSlider_timing_initFetchRevisionData  ve_mobile_performance_system_restbaseLoad

The underscore is normally used to substitute invalid characters. However, in this case it looks like basically something somewhere just replaced path separator dots with underscores. And the metrics all come from StatsV, so it seems like it's probably not related with individual metric producers.

Looking at the actual values being logged here, they seem fairly infrequent. So overall it seems like things are working, but something somewhere is causing these sometimes to be logged in a corrupted way.

https://graphite.wikimedia.org/render?target={media,MediaWiki,ve}_*.count&from=now-70d&to=now&width=800&height=800&hideLegend=false&lineMode=staircase&areaMode=all

Last 70 days

Event Timeline

Krinkle created this task.Mar 12 2018, 9:41 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 12 2018, 9:41 PM
Krinkle moved this task from Inbox to Radar on the Performance-Team board.Mar 19 2018, 8:09 PM
Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.
fdans triaged this task as Low priority.Apr 12 2018, 5:29 PM

The following aggregate query indicates we lose upto 100K-500K events on some days. And have lost 6.7 million events in total since February.

https://graphite.wikimedia.org/render?target=legendValue(alias(summarize(sumSeries(transformNull({media,MediaWiki,ve,eventlogging}_*.sum,0)),%2224h%22,%22sum%22),%22daily%22),%22total%22)&from=20171018&to=20180413&width=1000&height=500

total: 6,724,005 events lost

Not sure I totally understand the problem. What is an example of a good metric name and a corrupted one here? You are saying that metric names are coming in with dots in the name as they should, and sometimes something is replacing those dots with underscores, but not all of the time?

Not sure I totally understand the problem. What is an example of a good metric name and a corrupted one here? You are saying that metric names are coming in with dots in the name as they should, and sometimes something is replacing those dots with underscores, but not all of the time?

Indeed. The same metrics both work and not work at the same time. Basically, any top-level metric in Graphite that contains an underscore is result of this problem. I confirmed each of those comes only from statsv and does use dots in the emitting code.

Examples:

eventlogging_client_errors_NavigationTiming_nonCompliant
..
MediaWiki_timing_structuredChangeFilters_backendResponse_Recentchanges
..
ve_mwTarget_behavior_saveDialogClose
..
ve_mwTarget_performance_system_restbaseLoad
Milimetric raised the priority of this task from Low to Medium.Oct 22 2018, 3:46 PM
Ottomata claimed this task.Oct 23 2018, 3:10 PM
Ottomata added a project: Analytics-Kanban.

Wow huh, I don't know what could cause this, but as far as I can tell, it isn't happening anymore. Any objections to just closing?

Krinkle closed this task as Resolved.Oct 29 2018, 9:29 PM

Looking at graphite1001 now:

krinkle at graphite1001.eqiad.wmnet in /var/lib/carbon/whisper
$ ls -d -1 *_*

eventlogging_client_errors_NavigationTiming_nonCompliant
media_thumbnail_client_1024
media_thumbnail_client_1280
media_thumbnail_client_1920
media_thumbnail_client_320
media_thumbnail_client_600
media_thumbnail_client_800
MediaWiki_echo_overlay_api
MediaWiki_echo_overlay_ooui
MediaWiki_echo_unseen_click
MediaWiki_RevisionSlider_event_init
MediaWiki_RevisionSlider_timing_init
MediaWiki_RevisionSlider_timing_initFetchRevisionData
MediaWiki_timing_changesListSpecialPage_backendResponse_Recentchanges
MediaWiki_timing_changesListSpecialPage_backendResponse_Recentchangeslinked
MediaWiki_timing_changesListSpecialPage_ready_Recentchanges
MediaWiki_timing_changesListSpecialPage_ready_Recentchangeslinked
MediaWiki_timing_structuredChangeFilters_backendResponse_Recentchanges
MediaWiki_timing_structuredChangeFilters_backendResponse_Recentchangeslinked
MediaWiki_timing_structuredChangeFilters_ready_Recentchanges
MediaWiki_timing_structuredChangeFilters_ready_Recentchangeslinked

Looking inside each of these, the last write to any of these metrics was 28 April 2018. I've deleted them again. Let's close this.

Mentioned in SAL (#wikimedia-operations) [2018-10-30T06:57:50Z] <Krinkle> graphite1001: Remove Graphite data from corrupted names under media_* and ve_* (T189530)

Mentioned in SAL (#wikimedia-operations) [2018-10-30T06:58:38Z] <Krinkle> graphite1004: Remove Graphite data from corrupted names under media_* and ve_* (T189530)

Mentioned in SAL (#wikimedia-operations) [2018-10-30T07:01:38Z] <Krinkle> graphite2001: Remove Graphite data from corrupted names under media_* and ve_* (T189530)