Page MenuHomePhabricator

Analyse a small sample of the most often used query patterns on WDQS
Open, Needs TriagePublic

Description

Our contractors would like to analyse the queries that the WDQS is most busy with. For this, they will need to get a sample of those, knowing that they want a not-detailed look into the exact queries but rather to find the patterns that are reproduced the most.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMon, Mar 23, 11:25 AM

Specification:

  • fetch a random sample of SPARQL queries from /sparql and /bigdata/namespace/wdq/sparql paths in the wmf.webrequest;
  • provide exploratory data analysis to sort out the most frequently used exactly identical queries;
  • perform simple feature engineering to characterize the queries;
  • perform query clustering, characterize large clusters, and then study the distribution of query frequency across the clusters;
  • report.
GoranSMilovanovic added a comment.EditedThu, Mar 26, 11:39 PM

Thu 26 Mar 2020 11:35:59 PM UTC

  • a sample of SPARQL queries from wmf.webrequest was obtained by randomly sampling 1% of all queries that were sent out to WDQS on each day from 2020-03-01 to 2020-03-20;
  • the sample is now cleaned by removing all http_status == 4** (client-side errors; I guess we are not interested in malformed queries), checked for consistency, and URLdecoded() so that it encompasses only SPARQL code in the uri_query field;
  • next steps:
    • inspecting identical queries for any specifics;
    • exploratory data analysis;
    • feature engineering: describing queries from their SPARQL language constituents.

Following the cleaning operations the sample size is 1,599,546 queries, of which 1,089,981 are unique.

The fields that we keep for all further analyses from wmf.webrequest are the following:

  • dt
  • uri_query
  • http_status
  • cache_status
  • time_firstbyte
  • response_size
  • agent_type
  • content_type
  • access_method
GoranSMilovanovic added a comment.EditedFri, Mar 27, 11:44 AM

Fri 27 Mar 2020 11:16:16 AM UTC

  • incorrect HiveQL sampling fixed (the first sample encompassed only queries from the first hour of each day from 2020-03-01 to 2020-03-20);
  • the new sample, encompassing approximately 1% of all queries from 2020-03-01 to 2020-03-20, encompasses 1,095,712 queries and 815,542 unique queries.
  • Meeting with @Jakob_WMDE today, very helpful:
    • learned about some JS libraries that parse SPARQL
    • this might help me a lot to improve the current feature engineering.

Current status:

  • parsed SPARQL; completing the initial feature engineering phase and then
  • validating the features against something useful to optimize, say server response time,
  • most probably with XGBoost.

If the validation turns out to be any successful, the next step will be to perform query clustering + characterize the clusters and report.

I will be most probably need several feature engineering - ML validation - clustering loops before I figure out the most useful query analysis approach.

GoranSMilovanovic added a comment.EditedTue, Mar 31, 9:35 PM

Current status:

  • parsed SPARQL; initial, approximate feature engineering phase completed;
  • NEXT: validating the features by optimizing server response time (time_firstbyte from wmf.webrequest) by
  • XGBoost with cross-validation;
  • goal: select the number of features, varying by % of queries that used a particular feature, starting from .25%;
  • experiments ongoing on stat1005.