
NEW BUG REPORT <"Domain" field issue: some domains have trailing dots>
Closed, Resolved | Public | BUG REPORT

Description

Data Platform Engineering Bug Report or Data Problem Form.

Please fill out the following
Please ensure you set priority

What kind of problem are you reporting?

  • Access related problem
  • Service related problem
  • Data related problem
For a data related problem:
  • Is this a data quality issue? Yes
  • What datasets and/or dashboards are affected? wmf.unique_devices_per_domain_monthly
  • What are the observed vs expected results?

Observed results: I ran the following query:

SELECT 
  SUM(uniques_estimate) as monthly_uniques,
  year,
  month,
  domain
FROM wmf.unique_devices_per_domain_monthly
WHERE year=2025
GROUP BY 
  year, 
  month,
  domain

In the results, I found that some values in the domain field have a trailing dot (".org."). This produces duplicate rows for the same domain, each with different values. See the screenshots below. Some of the domains with a trailing dot have no associated data (or rather, their unique devices count is 0), but others do have nonzero values (see, e.g., the en.m.wikipedia screenshot below).
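As a rough illustration of how the split rows could be spotted, the sketch below groups rows by domain with one trailing dot removed and reports any domain that appears in more than one raw form. It assumes the query results have been loaded as a list of dicts keyed by the column names above. The function name `find_split_domains` and the sample values are hypothetical, not taken from the actual table.

```python
from collections import defaultdict


def find_split_domains(rows):
    """Map each dot-stripped domain to its raw variants, keeping only
    domains that show up in more than one raw form."""
    variants = defaultdict(set)
    for row in rows:
        raw = row["domain"]
        # Treat "example.org." and "example.org" as the same domain.
        stripped = raw[:-1] if raw.endswith(".") else raw
        variants[stripped].add(raw)
    return {d: sorted(v) for d, v in variants.items() if len(v) > 1}


# Hypothetical sample rows shaped like the query output.
sample_rows = [
    {"year": 2025, "month": 5, "domain": "en.m.wikipedia.org", "monthly_uniques": 100},
    {"year": 2025, "month": 5, "domain": "en.m.wikipedia.org.", "monthly_uniques": 7},
    {"year": 2025, "month": 5, "domain": "fr.wikipedia.org", "monthly_uniques": 50},
]

for domain, raw_forms in find_split_domains(sample_rows).items():
    print(domain, raw_forms)
```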

Expected results: We would expect a single row of data per domain (per month & year).

It would be great if (1) the pipeline were fixed to normalize the trailing dot, and (2) the existing data were corrected to merge the entries.

Screenshot 2025-06-03 at 11.54.33 AM.png (181×526 px, 26 KB)

Screenshot 2025-06-03 at 11.55.46 AM.png (207×626 px, 36 KB)

Screenshot 2025-06-03 at 11.57.10 AM.png (202×638 px, 33 KB)

Event Timeline

CMyrick-WMF triaged this task as Low priority.

(Please let me know if you need me to upload higher-res screenshots.)

@BBlack and @KOfori, any chance this is related to the Varnish upgrade?

@Ahoelzl please evaluate for fixes:

  1. Pipeline fix: Normalize domains by stripping trailing dots during data processing
  2. Historical correction: Merge the existing split data retroactively
  3. Validation: Add checks to prevent similar issues
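A minimal sketch of what those three steps might look like on rows shaped like the table's columns. All function names here are hypothetical, and this is not the pipeline's actual code. One loud assumption: merging is done by summing `uniques_estimate`, which may overcount if the same device reached both the dotted and undotted host, so a real backfill might need to recompute from source instead.

```python
from collections import defaultdict


def normalize_domain(domain):
    """Step 1: strip any trailing dots from a host name."""
    return domain.rstrip(".")


def merge_rows(rows):
    """Step 2: merge rows that differ only by a trailing dot.

    Assumption: estimates are summed, which is a simplification.
    """
    totals = defaultdict(int)
    for row in rows:
        key = (row["year"], row["month"], normalize_domain(row["domain"]))
        totals[key] += row["uniques_estimate"]
    return [
        {"year": y, "month": m, "domain": d, "uniques_estimate": n}
        for (y, m, d), n in sorted(totals.items())
    ]


def domains_with_trailing_dot(rows):
    """Step 3: validation helper listing domains that still end in a dot."""
    return sorted({row["domain"] for row in rows if row["domain"].endswith(".")})


# Hypothetical rows for illustration only.
example = [
    {"year": 2025, "month": 5, "domain": "en.m.wikipedia.org", "uniques_estimate": 100},
    {"year": 2025, "month": 5, "domain": "en.m.wikipedia.org.", "uniques_estimate": 7},
    {"year": 2025, "month": 5, "domain": "de.wikipedia.org.", "uniques_estimate": 0},
]

merged = merge_rows(example)
leftover = domains_with_trailing_dot(merged)
if leftover:
    raise ValueError(f"domains still have a trailing dot: {leftover}")
```

Running the validation helper on the merged output, as in the last lines, is one way to keep un-normalized hosts from slipping back in.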

This was a known change when we migrated from Varnish to HAProxy. We decided not to normalize the hosts, to keep the data as close as possible to the source.
My assumption is that hits whose domain has a trailing dot are most probably bots. I'd be super happy to be proven wrong and asked to normalize though :)

Agreed, they look like bots.
We are working on an improvement to the bot detection pipeline currently.
Maybe after the update, these requests won't exist any more.
Nevertheless, it makes sense to normalize, since we are planning to backfill unique devices numbers soon.

JAllemandou claimed this task.

I confirm this bug has been fixed with T401666.
I'm closing it; feel free to reopen as needed.