Page MenuHomePhabricator

AQS 2.0: Device Analytics query timestamp hour issue
Closed, ResolvedPublic

Description

Summary: the existing production AQS 1.0 code does various timestamp string manipulations during validation. In AQS 2.0, we mirrored part of this, but not all of it, resulting in AQS 2.0 using an incorrectly formatted timestamp for Device Analytics queries. This can, if we choose, be as simple as a one-line fix (see the end of this very long task description).

Details:

The Cassandra tables queried by AQS store timestamps as strings. In some cases, these strings are stored as YYYYMMDD and in other cases they are stored as YYYYMMDDHH. Relevant to this task, the table queried by Device Analytics uses YYYYMMDD. (There are other cases as well, but they're not relevant to this task.)

The AQS 1.0 code uses common functions for timestamp validation and manipulation. These functions accept various options, which control their behavior. In particular, for the case of unique devices, the AQS 1.0 code makes this call:

aqsUtil.validateStartAndEnd(rp, {
        // YYYYMMDD dates are allowed, but need an hour to pass validation
        fakeHour: true,
        // YYYYMMDDHH dates are also allowed, but the hour should be stripped
        // to match how cassandra stores records
        stripHour: true,
    });

This invokes this function, which in turn invokes this function. The two relevant bits of code (which are split between the two functions) are:

const hour = opts.fakeHour ? '00' : timestamp.substring(8, 10);

and

if (opts.zeroHour || opts.stripHour) {
        rp.start = rp.start.substring(0, 8);
        rp.end = rp.end.substring(0, 8);
}

Observe that the code first adds a "fake hour" of "00" to the timestamp, and then later strips it back off. This results in a YYYYMMDD timestamp.

When we created AQS 2.0, we started with tables in the Page Analytics part of the dataset, where the hour is stored in Cassandra in YYYMMDDHH format. We therefore added this code in the aqsassist ValidateTimestamp function:

	if len(param) == 8 {
		timestamp = fmt.Sprintf("%s00", param)
	} else {
		timestamp = param
	}

This code adds the fake "00" hour if necessary, to make the timestamp into a YYYYMMDDHH format. But we did not add corresponding code to strip that fake hour back off after validation, likely because Page Analytics didn't need that, and we didn't realize that other datasets were formatted differently.

Fix:

There are several ways we could fix this, including:

  • adding more options to aqsassist.ValidateTimestamp, similar to the existing AQS 1.0 code, to give callers fine-grained control over timestamp validation and manipulation. This is powerful, but could be confusing, and also involves the most code changes.
  • requiring callers to pass timestamps in the length they want them returned. aqsassist.ValidateTimestamp would check the length of the input and return an output of equal length (or an error, if the input was malformed). This is simpler, but a little subtle and it might not be obvious to callers what was happening. I'm also not sure it would handle all cases we'll encounter in the rest of the services.
  • always return a YYYYMMDDHH format, and make the caller strip the HH part if that's not appropriate for a given situation

The last one is my favorite. It involves adding literally one line to unique-devices_handler.go. It keeps the aqsassist.ValidateTimestamp contract simple and straightforward: the same thing is always returned, and if particular callers want to manipulate that, they can.

However, I'm open to discussion, happy to hear thoughts from @SGupta-WMF or others, and will be glad for this to be fixed, regardless of how it is done.

Event Timeline

BPirkle edited projects, added AQS2.0 (Sprint 10); removed AQS2.0.

@BPirkle Thank you for the detailed analysis and suggestions :) .
While I am tempted to go for approach one which is -

adding more options to aqsassist.ValidateTimestamp, similar to the existing AQS 1.0 code, to give callers fine-grained control over timestamp validation and manipulation

But I realise this can be confusing in the long run and is needed only for device-analytics now .

So , I think its better for us to go for the last option and strip the HH part in the unique-devices_handler.go, which would be fast and easy to do .

The change is quite straight forward but with the updated date ranges , we are getting more data from the db it seems . So the tests have to be updated and corrected .
This task will therefore take a little longer than expected. Thanks!

Change 920658 had a related patch set uploaded (by Sg912; author: Sg912):

[generated-data-platform/aqs/device-analytics@main] Bug: T336216

https://gerrit.wikimedia.org/r/920658

QA: it is expected that the endpoint response will change for some calls - that's the whole point of this fix. Specifically, the first item in the results will become the second item, and a new first item will appear in the response. This should match the behavior of the existing AQS 1.0 response for an equivalent API call.

Test status: QA PASS

Prod: test with an hour interval for daily granularity

{
   "items":[
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"daily",
         "timestamp":"20210501",
         "devices":65224992,
         "offset":13722719,
         "underestimate":51502273
      }
   ]
}

Local AQS 2.0: test with an hour interval for daily granularity
"http://localhost:8089/metrics/unique-devices/en.wikipedia/all-sites/daily/2021050101/2021050101"

{
   "items":[
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"daily",
         "timestamp":"20210501",
         "devices":65224992,
         "offset":13722719,
         "underestimate":51502273
      }
   ]
}

Prod: test with a day interval for daily granularity

{
   "items":[
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"daily",
         "timestamp":"20210501",
         "devices":65224992,
         "offset":13722719,
         "underestimate":51502273
      },
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"daily",
         "timestamp":"20210502",
         "devices":69336341,
         "offset":14612490,
         "underestimate":54723851
      }
   ]
}

Local AQS 2.0: test with a day interval for daily granularity

{
   "items":[
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"daily",
         "timestamp":"20210501",
         "devices":65224992,
         "offset":13722719,
         "underestimate":51502273
      },
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"daily",
         "timestamp":"20210502",
         "devices":69336341,
         "offset":14612490,
         "underestimate":54723851
      }
   ]
}

Prod: test with more than a day interval for daily granularity

{
   "items":[
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"daily",
         "timestamp":"20210501",
         "devices":65224992,
         "offset":13722719,
         "underestimate":51502273
      },
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"daily",
         "timestamp":"20210502",
         "devices":69336341,
         "offset":14612490,
         "underestimate":54723851
      },
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"daily",
         "timestamp":"20210503",
         "devices":71881893,
         "offset":15226411,
         "underestimate":56655482
      }
   ]
}

Local AQS 2.0: test with more than a day interval for daily granularity

{
   "items":[
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"daily",
         "timestamp":"20210501",
         "devices":65224992,
         "offset":13722719,
         "underestimate":51502273
      },
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"daily",
         "timestamp":"20210502",
         "devices":69336341,
         "offset":14612490,
         "underestimate":54723851
      },
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"daily",
         "timestamp":"20210503",
         "devices":71881893,
         "offset":15226411,
         "underestimate":56655482
      }
   ]
}

Prod: test with a month interval for monthly granularity
'https://wikimedia.org/api/rest_v1/metrics/unique-devices/en.wikipedia/all-sites/monthly/20210501/20210601'

{
   "items":[
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"monthly",
         "timestamp":"20210501",
         "devices":837775205,
         "offset":356510693,
         "underestimate":481264512
      },
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"monthly",
         "timestamp":"20210601",
         "devices":804483345,
         "offset":346193130,
         "underestimate":458290215
      }
   ]
}

Local AQS 2.0: test with a month interval for monthly granularity

{
   "items":[
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"monthly",
         "timestamp":"20210501",
         "devices":837775205,
         "offset":356510693,
         "underestimate":481264512
      },
      {
         "project":"en.wikipedia",
         "access-site":"all-sites",
         "granularity":"monthly",
         "timestamp":"20210601",
         "devices":804483345,
         "offset":346193130,
         "underestimate":458290215
      }
   ]
}