
Impact analysis for Haproxykafka data loss
Closed, ResolvedPublic

Description

In T400039 we learnt that "We lost about 2 weeks of logs from cp5017 and about 10 weeks of logs from cp3071." We estimate that other hosts may have been affected by the same problem, but they recovered while these two did not.

We are trying to assess whether this had an impact on pageview traffic.

Event Timeline

Asked DPE for an estimate of the loss last week. @Ahoelzl asked me to reach out to SRE.
SRE will help us get an estimate of the traffic that was not sent to webrequest and the proportion of it that was likely pageviews.

Using the up metric in Prometheus we can estimate the duration of the downtime for haproxykafka on the affected cache nodes:

  • For cp5017, the query sum_over_time((up{instance="cp5017:9341"} == bool 0)[90d:30s]) * 30 returns 1.14 million seconds, or approximately 13.1 days.
  • For cp3071, the query sum_over_time((up{instance="cp3071:9341"} == bool 0)[90d:30s]) * 30 returns 6.36 million seconds, or approximately 73.6 days.
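As a rough illustration of what these queries compute: the subquery evaluates `up` every 30 seconds over 90 days, `== bool 0` turns each sample into 1 when the target was down, and `sum_over_time(...) * 30` converts the count of down samples into seconds. The Python sketch below mirrors that arithmetic; the sample series and the helper names are hypothetical and only the 6.36 million seconds figure comes from the query result above.

```python
SECONDS_PER_DAY = 86_400


def downtime_seconds(up_samples, step_seconds=30):
    """Count samples where up == 0 and scale by the subquery step.

    Mirrors sum_over_time((up == bool 0)[90d:30s]) * 30.
    """
    down_samples = sum(1 for value in up_samples if value == 0)
    return down_samples * step_seconds


def seconds_to_days(seconds):
    """Convert a duration in seconds to days."""
    return seconds / SECONDS_PER_DAY


# Hypothetical series of four 30-second samples, two of them "down".
samples = [1, 0, 0, 1]
print(downtime_seconds(samples))          # 60 seconds of downtime

# The cp3071 result reported above, converted to days.
print(round(seconds_to_days(6.36e6), 1))  # 73.6
```

Because each down sample is counted as a full 30-second step, the result is an approximation of the real downtime rather than an exact measurement.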

I attempted to use Turnilo webrequest_sampled_live (https://w.wiki/ExKR) to get a more granular view of traffic per CDN host, but it appears that breakdown is no longer available:

image.png (523×1 px, 38 KB)

As a fallback, using pageviews_daily, which reports 780,594,326 total pageviews across the CDN for the last day, and assuming the 56 cache_text hosts are the only ones reporting pageviews, we estimate an average of:

  • 13,939,184 pageviews per CDN server per day

Using this as a baseline:

  • Estimated pageviews lost for cp5017: 13.9M * 13.1 days ≈ 182.6 million pageviews
  • Estimated pageviews lost for cp3071: 13.9M * 73.6 days ≈ 1.026 billion pageviews

These are upper-bound estimates and assume full loss of logging for the entire duration and even distribution of traffic.
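The arithmetic behind these estimates can be reproduced with a few lines of Python. This is only a sketch of the calculation described above: it uses the figures quoted in this comment and inherits the same assumptions (56 cache_text hosts, traffic spread evenly across them, total loss for the whole downtime). The variable and function names are illustrative.

```python
TOTAL_DAILY_PAGEVIEWS = 780_594_326  # pageviews_daily, last day
CACHE_TEXT_HOSTS = 56                # assumed to be the only hosts serving pageviews

# Downtime per host in days, from the Prometheus estimates above.
DOWNTIME_DAYS = {"cp5017": 13.1, "cp3071": 73.6}

# Average pageviews per host per day, assuming an even distribution.
per_host_daily = TOTAL_DAILY_PAGEVIEWS / CACHE_TEXT_HOSTS


def estimated_loss(days):
    """Upper-bound pageviews lost for one host that was down for `days`."""
    return per_host_daily * days


print(f"per host per day: {per_host_daily:,.0f}")
for host, days in DOWNTIME_DAYS.items():
    print(f"{host}: {estimated_loss(days) / 1e6:,.1f} million pageviews")
```

In practice traffic is not evenly distributed across data centers or hosts, so the per-host baseline is the weakest part of the estimate.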

Thanks @Vgutierrez !
Can you please remind me of the time period of this loss for the two nodes? From the original task I see 2025-06-30 to 2025-07-07 and then 2025-07-15 to 2025-07-21, which can't be the whole picture given that cp3071 had 10 weeks of loss.

Sure, Prometheus shows that cp5017 was impacted from 2025-06-30 to 2025-07-07, recovered, and was then impacted again from 2025-07-15 to 2025-07-21. cp3071 was impacted from 2025-05-11 to 2025-07-23.

image.png (695×887 px, 50 KB)
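As a sanity check, the date ranges above can be compared with the durations derived from the `up` metric (about 13.1 and 73.6 days). The sketch below assumes each range runs from its start date to its end date with the end treated as an exclusive boundary; the thread does not state exact start and end times, so the totals are only expected to agree to within a day or so.

```python
from datetime import date

# Outage windows per host as (start, end) dates, taken from the comment above.
OUTAGE_WINDOWS = {
    "cp5017": [
        (date(2025, 6, 30), date(2025, 7, 7)),
        (date(2025, 7, 15), date(2025, 7, 21)),
    ],
    "cp3071": [
        (date(2025, 5, 11), date(2025, 7, 23)),
    ],
}


def total_days(windows):
    """Sum the length in days of each (start, end) window, end exclusive."""
    return sum((end - start).days for start, end in windows)


for host, windows in OUTAGE_WINDOWS.items():
    print(host, total_days(windows))
# cp5017 13
# cp3071 73
```

Both totals are close to the Prometheus-based estimates, which suggests the two sources describe the same outages.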

Thanks @Vgutierrez! It seems like we had a 2% loss of pageviews based on the estimates provided.
Is it possible to know which countries were affected by the traffic loss? I guess regionally it would be ESEAP (cp5017.eqsin.wmnet) and NWE (cp3071.esams.wmnet).
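The thread does not show how the 2% figure was derived. One way to arrive at a similar number, sketched below, is to compare the host-days of lost logging with the total host-days across the fleet over the longest outage window. The choice of denominator (56 hosts over the 73.6 days of the cp3071 outage) is an assumption made for illustration, not something stated in this task.

```python
CACHE_TEXT_HOSTS = 56
DOWNTIME_DAYS = {"cp5017": 13.1, "cp3071": 73.6}

# Assumed comparison window: the full duration of the longer outage.
WINDOW_DAYS = max(DOWNTIME_DAYS.values())

# Host-days without logging versus host-days available in the window.
lost_host_days = sum(DOWNTIME_DAYS.values())
total_host_days = CACHE_TEXT_HOSTS * WINDOW_DAYS

# With evenly distributed traffic, the daily pageview count cancels out,
# so the share of lost pageviews equals the share of lost host-days.
loss_share = lost_host_days / total_host_days

print(f"{loss_share:.1%}")  # 2.1%
```

Under the even-distribution assumption the result does not depend on the daily pageview total, only on the number of hosts and the downtime durations.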

Hi @Vgutierrez, is it possible to know which countries were affected by the traffic loss?

Sorry for the delay, @Mayakp.wiki.

esams is the main DC for the following countries:

Afghanistan
Aland Islands
Angola
Armenia
Austria
Azerbaijan
Belarus (Probenet Data)
Belgium
Benin
Bosnia and Herzegovina
British Indian Ocean Territory
Bulgaria
Burundi
Congo, Democratic Republic of the
Congo (Inferred from DR Congo as probe was close to border)
Côte d'Ivoire (Inferred from Ghana for closest probe)
Croatia (local name Hrvatska)
Denmark
Estonia
Europe region (misc)
Faroe Islands
Finland
Georgia
Germany
Ghana
Greece
Guernsey
Guinea (Inferred from Ghana)
Holy See (Vatican City State)
Hungary
Iceland
Iran (Islamic Republic of)
Iraq
Ireland
Isle of Man
Israel
Jersey
Jordan
Kyrgyzstan
Latvia
Liberia (Inferred from Ghana)
Lithuania
Luxembourg
Macedonia, the Former Yugoslav Republic of
Moldova, Republic of
Monaco
Montenegro
Namibia
Netherlands
Norway
Poland
Romania
Russian Federation
San Marino
Serbia
Sierra Leone (Inferred from Ghana)
Slovakia (Probenet Data)
Slovenia
Somalia (Inferred from Kenya for closer/better probes)
South Sudan
Svalbard and Jan Mayen Islands
Sweden
Syrian Arab Republic
Tajikistan
Togo
Turkmenistan
Uganda
Ukraine
United Kingdom
Uzbekistan

and eqsin is the main DC for the following countries:

Asia-Pacific region (misc)
Australia
Bangladesh
Bhutan
Brunei Darussalam
Cambodia
China
Christmas Island
Cocos (Keeling) Islands
Guam
Hong Kong
India
Indonesia
Japan
Kiribati
Korea, Republic of
Lao People's Democratic Republic
Macao
Malaysia
Maldives
Marshall Islands
Mauritius
Micronesia, Federated States of
Mongolia
Myanmar
Nauru
Nepal
New Caledonia
Northern Mariana Islands
Oman
Pakistan
Palau
Philippines
Qatar
Réunion
Singapore
Sri Lanka
Taiwan
Thailand
Timor-Leste
Tonga
Tuvalu
United States Minor Outlying Islands
Viet Nam
Mayakp.wiki closed this task as Resolved. (Edited Aug 22 2025, 7:58 PM)
Mayakp.wiki claimed this task.
Mayakp.wiki moved this task from Doing to Needs sign-off on the Movement-Insights board.

We have assessed that the impact of this data loss is low and that it doesn't need a correction or estimation in our metrics.
Moreover, with all the heuristic changes and the backfilling of data during this period (see T395934), we can't do much anyway. We will note this for YoY comparisons or any future needs.

Sheet for rough estimates