After spending a good while reviewing all this documentation (and other related documents over the last few weeks), I wanted to say that I was able to run a MediaWiki container, configure Metrics Platform and test all the functions that currently work with Metrics Platform (submitInteraction, submitClick, dispatch and submit). I also learned how to test WikimediaEvents, and it seems to be an important extension for running instruments right now. I think the existing documentation is good, but the process was a little hard, and I think the reason is that the knowledge is spread out across a lot of different pages/documents. Outdated and new documents also coexist, which creates extra confusion because you don't know which is the right path; I tried the right function with the wrong schema/configuration, and vice versa, several times. As I said, I think all those pages are really good, but in addition to them we'd need more pages like https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To (I think that one is a bit outdated because it doesn't use Metrics Platform, but the structure is interesting because it shows the full lifecycle of an event).
Tue, Nov 28
Mon, Nov 27
Wed, Nov 22
Tue, Nov 21
The last change removes all the code related to the partial migration of the instrument. It reverts all the changes previously made in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/799353/
Mon, Nov 20
Thu, Nov 16
Wed, Nov 15
Tue, Nov 14
Let's try a different tack:
What would you be proposing if the dataset in question did contain PII? How would you propose to solve the problems you've articulated here?
I'm not trying to be obtuse; I realize that this dataset doesn't contain PII, but it cannot be true that there is no other way. Ultimately what I want to get to is a) does this use case warrant an exception, and if so b) why? Exceptions are Bad™ and should be avoided, so why should we make one here and, perhaps more importantly, what are the criteria (i.e. what would we use in subsequent requests to decide whether to do so again)?
In addition to what I said above (or instead of it), could it make sense to have read-only access to the Cassandra cluster (without the purpose of fetching or using that data directly to populate our local test env)? We could use everything we learn about the data to improve our mock/synthetic data generator scripts and get ahead of data-related errors/surprises.
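For what it's worth, here is a toy sketch of what such a generator script could look like (the column names and layout are made up, not the exact aqs-docker-cassandra-test-env format), covering the articles/articlesJSON quirk discussed further down in this task:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
)

// Toy generator for synthetic top_pageviews rows. It alternates which of the
// two columns (articles vs articlesJSON) holds the payload, like the real
// dataset does for part of 2022. Column names are illustrative only.
func main() {
	w := csv.NewWriter(os.Stdout)
	defer w.Flush()

	w.Write([]string{"project", "year", "month", "day", "articles", "articlesJSON"})
	for i := 0; i < 4; i++ {
		payload := fmt.Sprintf(`[{"article":"Article_%d","views":%d,"rank":1}]`, i, 100+i)
		row := []string{"en.wikipedia", "2022", "01", fmt.Sprintf("%02d", i+1), payload, ""}
		if i%2 == 1 {
			// Move the payload to the other column for every second row.
			row[4], row[5] = "", payload
		}
		w.Write(row)
	}
}
```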
Mon, Nov 13
The last change just removes a sample value we had added to the config.yaml file to force a redeployment of the image for this service. We didn't realize we need that field to be empty to be able to run all the services and Cassandra in a Docker Compose project and run our QA test suite properly.
In T350882#9327876, @Eevans wrote (quoting @Sfaci's comment of Fri, Nov 10 below, T350882#9322001):
In both of these cases, wouldn't it be better to use the code (legacy aqs and/or analytics) to suss out the contract, rather than using queries to reverse-engineer them on a reactive basis?
The new csv for top_pageviews dataset is already included in the aqs-docker-cassandra-test-env
The service is already deployed and routed to production and it's working fine. Thanks @hnowlan!!!!!
Fri, Nov 10
First of all, I wanted to say that I appreciate your support (helping us with the data and trying to improve all this), and I totally agree with you. We know this is not a best practice and, even without any privacy concerns about this data, fetching data just to populate our local test env is not the right way to deal with it. In fact, that is not really our purpose. We take the opportunity to use it to populate the local test env, but what we are really trying to do by fetching this data is to understand how it's structured and which edge cases we can expect to find. In fact, most of the time we are just reacting to an unexpected situation. For example:
- With the last sample data you provided us, we found that the top_pageviews dataset contains the information about the articles in two different fields (articles and articlesJSON). For some rows the data is in the articles field and, for others, it is in articlesJSON, which was really unexpected and something we didn't know. Data from 2022-01 to 2022-10 seems to be the affected range: we got an "unexpected end of JSON input" error when requesting that date range, because we were looking for the data in the wrong field. The fix is an if statement in the code that looks for the value in the first field and, if it isn't there, tries the other one (see the sketch after this list). I guess this is due to some error when ingesting data into Cassandra, or something similar, but it's something I would never have imagined without looking at the real dataset.
- Some days ago I reached out to you to ask for data about the mediarequest_top_files dataset. Doing that, we learned how filepaths are stored in Cassandra: some punctuation marks are URL-decoded and others are stored as they are (or vice versa, or something similar; I don't remember it well). That was something we didn't know, and I don't know if there was another way to find it out. I mean something like this: Angkor_-_Zentrum_des_K%C3%B6nigreichs_der_Khmer_(CC_BY-SA_4.0).webm
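A minimal sketch of the kind of fallback described in the first point above, assuming a row type with both columns; the type and field names here are hypothetical, not the actual AQS 2.0 code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// row mirrors the two competing columns found in top_pageviews: some rows
// carry the payload in "articles", others in "articlesJSON". The names are
// illustrative, not the actual AQS 2.0 schema.
type row struct {
	Articles     string
	ArticlesJSON string
}

type article struct {
	Article string `json:"article"`
	Views   int    `json:"views"`
	Rank    int    `json:"rank"`
}

// decodeArticles tries the "articles" column first and falls back to
// "articlesJSON" when it is empty; without the fallback, rows in the
// 2022-01..2022-10 range produce "unexpected end of JSON input".
func decodeArticles(r row) ([]article, error) {
	payload := r.Articles
	if payload == "" {
		payload = r.ArticlesJSON
	}
	var out []article
	if err := json.Unmarshal([]byte(payload), &out); err != nil {
		return nil, fmt.Errorf("decoding top articles: %w", err)
	}
	return out, nil
}

func main() {
	// Example row where only articlesJSON is populated.
	r := row{ArticlesJSON: `[{"article":"Main_Page","views":123,"rank":1}]`}
	fmt.Println(decodeArticles(r))
}
```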
Thu, Nov 9
I have been taking a look at this data and it seems that the dataset has two fields, articles and articlesJSON; sometimes one of them is empty and the other is filled with data, and vice versa.
It's something I think we didn't expect to find, but I have taken a look at the AQS 1.0 code and there is an if statement there to check which field really has the data. It seems that's the issue: another data quirk we hadn't seen.
The data we need to debug this issue is already available at T350882: Query additional sample data for AQS testing
It seems we have a fix for the registered_user editor's endpoint. The fixed service is already running in the staging environment. The following is a sample request that works fine:
Wed, Nov 8
Can you provide more details about the specific errors you got?
I think I have found the piece of code we have to fix. I think the issue is related to the fact that our test env differs a bit from the production one. I'm not totally sure, but I think a field (other_tags) is mapped as a string when it should be an array (it contains only one value, but it's an array of strings in production). That would explain why it runs locally but the data is not found in production. registered-users is the only endpoint that uses that field to filter, and the code works properly in our test env but not in production.
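To illustrate the kind of mismatch I mean, here is a minimal, hedged sketch; the type and field names are made up, and the real service has its own Druid response types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// In our local Druid test env the dimension behaved like a plain string,
// so a mapping like this worked locally:
type editRowLocal struct {
	OtherTags string `json:"other_tags"`
}

// In production other_tags is an array of strings (even when it only holds
// one value), so it has to be mapped as a slice instead:
type editRowProd struct {
	OtherTags []string `json:"other_tags"`
}

func main() {
	prod := []byte(`{"other_tags":["some-tag"]}`)

	var l editRowLocal
	fmt.Println(json.Unmarshal(prod, &l)) // fails: cannot unmarshal array into string

	var p editRowProd
	fmt.Println(json.Unmarshal(prod, &p), p.OtherTags) // works for single- and multi-value rows
}
```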
In the meantime, while we try to figure out what the issue is, I have requested new data to add to our test env so we can debug what's happening in the service for the specific dates that are failing.
Hi again @Eevans!
I'm sorry for bothering you again. I'm here to ask for new data. We need some new pageviews data to debug a bug we have in production (it seems to be related to some specific dates). An endpoint is failing for a specific date range, and we need to add a couple of years to the script that fetches that data.
Both edit and editor have been affected by this issue:
Tue, Nov 7
Oct 31 2023
Oct 30 2023
I have moved this task to "In code review" because Surbhi and I have made some comments that we think need to be reviewed at this pending MR: https://gitlab.wikimedia.org/repos/generated-data-platform/aqs/aqs_tests/-/merge_requests/28
It seems this bug is just about the data we have available in our druid-test-env. Taking a look at the dataset, we found that the field had a different value in our test env. That's why we have pushed a new version of our dataset with that value changed.
After pulling this change from the test-env repo, QA testing can be restarted.
Oct 22 2023
@Ladsgroup Keep in mind that these tests have been run locally to test the fix before deploying to production.
The fix is done and merged and these tests are showing that it's working fine, but the service hasn't been deployed yet. Hopefully we'll do that next Monday. We'll ping you through this ticket as soon as it's done.
Oct 20 2023
In T347899#9268234, @Lokal_Profil wrote, quoting T347899#9258560 by @Sfaci:
Just wondering, for example, why the item File:)(_-_Flickr_-_Time.Captured..jpg is listed as "is not correct". I think we already understand the issue, and that case doesn't match the failure pattern (which is about a combination of certain punctuation marks, because not all of them are stored the same way in the datasets). When I request:
https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/all-referers/all-agents/%2Fwikipedia%2Fcommons%2F0%2F00%2F)(_-_Flickr_-_Time.Captured..jpg/monthly/20230101/20231001
I get a good response:
"items": [ { "referer": "all-referers", "file_path": "/wikipedia/commons/0/00/)(_-_Flickr_-_Time.Captured..jpg", "granularity": "monthly", "timestamp": "2023010100", "agent": "all-agents", "requests": 16 }, { "referer": "all-referers", "file_path": "/wikipedia/commons/0/00/)(_-_Flickr_-_Time.Captured..jpg", "granularity": "monthly", "timestamp": "2023020100", . . . . . .In some cases you are adding the prefix File: but I think that is not part of the filepath, right? In that case the one that exists is )(_-_Flickr_-_Time.Captured..jpg instead of File:)(_-_Flickr_-_Time.Captured..jpg`. Is that the reason to be included as "is not correct"?
Please correct me if I'm wrong.
Thanks!!
For )(_-_Flickr_-_Time.Captured..jpg I think the underlying issue might be that it is unclear which characters are expected to be encoded or not. In my case (where I got 404s for files with parentheses) I now see that I had encoded the parentheses.
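As a purely client-side illustration of that ambiguity, here is a hedged sketch of a retry that leaves the parentheses unencoded when the encoded form returns a 404; this is not something AQS itself does, and the helper is made up:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

const urlTemplate = "https://wikimedia.org/api/rest_v1/metrics/mediarequests/per-file/" +
	"all-referers/all-agents/%s/monthly/20230101/20231001"

// fetchMonthly tries the file path as given and, on a 404, retries with the
// parentheses left unencoded, since they appear to be stored raw in the dataset.
func fetchMonthly(filePath string) (*http.Response, error) {
	resp, err := http.Get(fmt.Sprintf(urlTemplate, filePath))
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusNotFound {
		return resp, nil
	}
	resp.Body.Close()
	raw := strings.NewReplacer("%28", "(", "%29", ")").Replace(filePath)
	return http.Get(fmt.Sprintf(urlTemplate, raw))
}

func main() {
	resp, err := fetchMonthly("%2Fwikipedia%2Fcommons%2F0%2F00%2F)(_-_Flickr_-_Time.Captured..jpg")
	if err == nil {
		fmt.Println(resp.Status)
		resp.Body.Close()
	}
}
```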
The geo-analytics service only accepts requests about wikipedia projects (non-wikipedia projects are not available), and that detail was missing in the documentation for AQS 2.0.
We'll take the opportunity to fix that as well.
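A minimal sketch of the kind of validation this implies; the function name and the exact rule are only an assumption, not the actual geo-analytics code:

```go
package main

import (
	"fmt"
	"strings"
)

// validateProject rejects non-wikipedia projects, since geo-analytics only
// accepts requests about wikipedia projects. The real service may apply a
// different rule; this is only an illustration.
func validateProject(project string) error {
	if strings.HasSuffix(project, ".wikipedia") || strings.HasSuffix(project, ".wikipedia.org") {
		return nil
	}
	return fmt.Errorf("project %q not supported: geo-analytics only serves wikipedia projects", project)
}

func main() {
	fmt.Println(validateProject("en.wikipedia"))      // <nil>
	fmt.Println(validateProject("commons.wikimedia")) // error
}
```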