Page MenuHomePhabricator

Analyse a small sample of the most often used query patterns on WDQS
Closed, ResolvedPublic

Description

Our contractors would like to analyse the queries that the WDQS is most busy with. For this, they will need to get a sample of those, knowing that they want a not-detailed look into the exact queries but rather to find the patterns that are reproduced the most.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 23 2020, 11:25 AM

Specification:

  • fetch a random sample of SPARQL queries from /sparql and /bigdata/namespace/wdq/sparql paths in the wmf.webrequest;
  • provide exploratory data analysis to sort out the most frequently used exactly identical queries;
  • perform simple feature engineering to characterize the queries;
  • perform query clustering, characterize large clusters, and then study the distribution of query frequency across the clusters;
  • report.
GoranSMilovanovic added a comment.EditedMar 26 2020, 11:39 PM

Thu 26 Mar 2020 11:35:59 PM UTC

  • a sample of SPARQL queries from wmf.webrequest was obtained by randomly sampling 1% of all queries that were sent out to WDQS on each day from 2020-03-01 to 2020-03-20;
  • the sample is now cleaned by removing all http_status == 4** (client-side errors; I guess we are not interested in malformed queries), checked for consistency, and URLdecoded() so that it encompasses only SPARQL code in the uri_query field;
  • next steps:
    • inspecting identical queries for any specifics;
    • exploratory data analysis;
    • feature engineering: describing queries from their SPARQL language constituents.

Following the cleaning operations the sample size is 1,599,546 queries, of which 1,089,981 are unique.

The fields that we keep for all further analyses from wmf.webrequest are the following:

  • dt
  • uri_query
  • http_status
  • cache_status
  • time_firstbyte
  • response_size
  • agent_type
  • content_type
  • access_method
GoranSMilovanovic added a comment.EditedMar 27 2020, 11:44 AM

Fri 27 Mar 2020 11:16:16 AM UTC

  • incorrect HiveQL sampling fixed (the first sample encompassed only queries from the first hour of each day from 2020-03-01 to 2020-03-20);
  • the new sample, encompassing approximately 1% of all queries from 2020-03-01 to 2020-03-20, encompasses 1,095,712 queries and 815,542 unique queries.
  • Meeting with @Jakob_WMDE today, very helpful:
    • learned about some JS libraries that parse SPARQL
    • this might help me a lot to improve the current feature engineering.

Current status:

  • parsed SPARQL; completing the initial feature engineering phase and then
  • validating the features against something useful to optimize, say server response time,
  • most probably with XGBoost.

If the validation turns out to be any successful, the next step will be to perform query clustering + characterize the clusters and report.

I will be most probably need several feature engineering - ML validation - clustering loops before I figure out the most useful query analysis approach.

GoranSMilovanovic added a comment.EditedMar 31 2020, 9:35 PM

Current status:

  • parsed SPARQL; initial, approximate feature engineering phase completed;
  • NEXT: validating the features by optimizing server response time (time_firstbyte from wmf.webrequest) by
  • XGBoost with cross-validation;
  • goal: select the number of features, varying by % of queries that used a particular feature, starting from .25%;
  • experiments ongoing on stat1005.
GoranSMilovanovic added a comment.EditedApr 6 2020, 11:16 AM

Current status:

  • pilot/research experiments completed:
    • research phase:
    • model server response times from the features extracted as atomic elements of the SPARQL queries in the sample;
    • experimented with various feature selections (size of the feature vocabulary);
    • model: XGBoost for regression, RMSE optimization;
    • results: everything between approx. R = .72 (test data set) and R = .91 (train data set) can be achieved;
  • first "usable" model:
    • goal: categorize unusually long server response times (> upper inner fence, Q3 + 1.5*IQR - "mild outliers");
    • method: XGBoost optimization of logistic loss (i.e. say Binomial Regression from an ensemble of Decision Trees); using 1,000 features only (the full feature set is >2M features);
    • result: accuracy 92% on both train and test data (approx. a 50:50 split of 1M queries in the sample).

NEXT steps:

  • running full CV cycles across learning rate, tree depth, taking best iterations in n-fold CVs only;
  • singling out the most reliable model;
  • attempt to predict extreme outliers (> upper inner fence, Q3 + 3*IQR - "extreme outliers");
  • reporting until Wednesday, 2020/04/09;
  • clustering queries from the most important features in server response time optimization (if necessary - to discuss with the team).

What we have learned thus far:

  • we can optimize server response time from feature engineering obtained from parsing SPARQL;
  • we can scale this approach and train models that work with considerably more than 1M of queries (sample size used now);
  • we need to think carefully about the deployment of a future ML optimization system for WDQS if that is the road to be taken.
  • Update Mon 06 Apr 2020 04:54:47 PM CEST: modeling extreme outliers on server response time (based on time_firstbyte from wmf.webrequest): 95% accuracy on both train and held out test data set.
  • Note: consider re-formulating the problem as a multi-class labeling + softmax function optimization.
GoranSMilovanovic added a comment.EditedApr 9 2020, 10:27 PM

Update Thu 09 Apr 2020 10:19:24 PM UTC:

  • XGBoost w. gbtree on a binary classification problem ("typical" vs. "extreme outlier" server response times) cross-validation started on stat1005;
  • using 9 data sets with varying number of features (<100 - 2000);
  • splitting test from train data for each data set;
  • additionally setting xgboost internal cross-validation controls too;
  • cross-validating across: learning rate (eta, 4 levels), subsample (rows, 4 levels) parameter to build trees, max_depth (how deep trees are allowed, 4 levels);
  • number of iterations set to monotonically decrease with eta;
  • keeping colsample_bytree (proportion of features used to build each tree) fixed at .5;
  • setting max_delta_step to 1 - documented to be useful to prevent overfitting in highly unbalanced designs in binary classification (as ours is);
  • optimization objective: logarithmic loss;
  • model selection: ROC Analysis -> AUC.

Resource consumption: 32 cores, approx. 15Gb RAM.
Approximate running time guesstimate: 24 - 30h.

GoranSMilovanovic added a comment.EditedApr 15 2020, 7:57 AM

@darthmon_wmde @WMDE-leszek

Update wed, 15. apr 2020. 09:56:39 CEST

  • First report on modelling results, to be discussed in a meeting 10:00 CEST today.

Update Thu Apr 16 10:21:32 UTC 2020:

  • following the meeting with thephp.cc yesterday:
    • The modelling approach will change from more predictive to more explanatory, i.e. the variables that could not be used for prediction (cache_status, for example) will be encompassed in order to study the WDQS responses in more detail;
    • A model to estimate the effect of the number of concurrently running queries on the server response time will be developed;
    • All planned improvements of the currently used family of XGBoost models (feature engineering fixes in the first place) will be implemented;
    • Next meeting to be scheduled for sometime near the end of the following week.
Gehel added a comment.Apr 16 2020, 1:37 PM

A few additional notes:

  • There is probably better / more useful information published as part of the new events published directly from WDQS. Check directly with @dcausse or @JAllemandou if needed.
  • complete query time is probably a better predictor of resource usage than TTFB (results are streamed, computation still happens after first byte).
  • it's fairly easy to add metrics to the events from WDQS, maybe we should add the current concurrency levels, and maybe the current CPU load. Open a ticket for us if you think that might be useful.
GoranSMilovanovic added a comment.EditedApr 20 2020, 8:43 PM

@Gehel First of all, thank you for all the insights that you have brought into the discussion thus far.

There is probably better / more useful information published as part of the new events published directly from WDQS.

Indeed the schema that you have referred to seems to be more suitable to work with than wmf.webrequest. Please, if you, or @dcausse or @JAllemandou can help me clarify the following:

  • What is the difference between event.wdqs_internal_sparql_query and event.wdqs_external_sparql_query in the WMF Data Lake?
  • Could you confirm that the query_time field represents "complete query time" as described by @Gehel in T248308#6062499 (above)?

Thanks.

it's fairly easy to add metrics to the events from WDQS, maybe we should add the current concurrency levels, and maybe the current CPU load. Open a ticket for us if you think that might be useful.

Let's see what can be inferred from the variables present in the new schemata now. Then we can decide together if it would be beneficial for us to have any new metrics added there.

GoranSMilovanovic added a comment.EditedApr 24 2020, 2:59 AM

Update Fri 24 Apr 2020 04:01:17 AM CEST and in respect to T248308#6062005:

  • A new sample of approximately 1M SPARQL queries was drawn from the new events schema : event.wdqs_external_sparql_query sugested by @Gehel; the most important improvement should be the presence of the query_time field which is a more precise measure of the WDQS processing time then the time_firstbyte variable from the previously used wmf.webrequest schema
  • Sampling: approx. 1% of all queries per day, 2020/04/01 - 2020/04/21
  • Besides the SPARQL queries themselves, the following variables were collected from the event.wdqs_external_sparql_query table:
    • dt - timestamp
    • format - JSON, XML, etc. - the desired output format
    • http.method - GET, POST, etc.
    • http.status_code - HTTP status code, not used in the analysis (but it could be used as a criterion);
    • backend_host - backend host
    • datacenter - data center
    • query_time - WDQS query processing time;
    • unfortunately, the event.wdqs_external_sparql_query does not record cache status - we have initially planned to use this feature in the analysis (see: T248308#6062005)
  • Goal: generate a list of the most critical features that influence query processing time from what can be extracted from the available SPARQL queries
  • Feature engineering procedures were slightly improved: still analyzing SPARQL as if it was a natural language + new procedures are less error-prone than the ones previously used
  • Modelling approach: same as before,
    • split query_time to derive a binary criterion: "typical processing time" vs "extreme outlier processing time",
    • optimize w. XGBoost, GBTree booster - sequential decision trees w. automatic feature selection,
    • find out what are the most important features that can help sort out the two classes of queries by length of processing time,
    • run a linear model to determine which of the selected features influence the processing times positively (making them longer - slower processing) or negatively (making them shorter - faster processing).
  • RESULTS:
    • None of the variables obtained from schema and not derived from SPARQL queries feature engineering procedures are selected as important for the "typical processing time" vs "extreme outlier processing time" dichotomy (except: hour of the day, derived from the dt timestamp);
    • Essentially there are no differences in comparison to the results reported in our April 15 meeting: an accuracy of approx. 92% can be achieved with a True Positive Rate (TPR) of approx. 60% and a very low False Positive Rate (FPR) + the model can be improved by selecting a proper decision threshold as exemplified in the report in T248308#6057950;
    • Running a linear model with the selected features against query_time in seconds as a criterion enables us to see what features influence the WDQS processing time positively or negatively: the results are summarized in the following table:

  • and the columns in the table are the following: feature - an extracted feature from SPARQL, weigth - regression coefficient. Because each query is characterized by a count of uses of each particular feature it mentions at all, the coefficients mean the following: each additional use of a particular feature in a query increases its processing time by the value of the coefficient. For example, row number 96, feature: nchar, which represents the length of the query in characters, tells that each additional character in the query contributes to an increase in 0.0867218 seconds in query processing time. Another example, row number 68, feature: __vars__ which represents the number of unique variables instantiated in a query, tells as that with an addition of an additional variable the query processing time jumps for 5.15947 seconds. This interpretation of the regression coefficients does not hold only for the features of the form f_ds_hour_5 - this particular one represents the fact that the query was run between 5 and 6 in the morning UTC. We also find f_ds_hour_1, f_ds_hour_22, and similar in the table. The value of the coefficient for these variables tells us how much additional seconds of query processing time obtains *relative to the first hour of the day* (i.e. f_ds_hour_0, which presents a baseline and thus is not found in the table). Be aware of the fact that the linear model is very imprecise in this case; I would interpret the coefficients by their sign only (e.g. slowing down or speeding up the processing; every negative coefficients means that the presence of the respective feature *reduces* the processing time).
    • A question was risen in our April 15 meeting: what is the effect of the number of concurrently run queries on the query processing time in general? Answer: the effect is miserable, close to zero.
  • The way I see it now:
    • if this selection of features that characterize the WDQS processing time can help us develop whatever optimization system we plan to develop:
    • it takes approx. three hours to run the whole procedure (ETL, Feature Engineering, XGBoost Model) on our stat100* servers;
    • the Future Engineering process (tons of regex) is the bottleneck and should be migrated to Spark (as well as the code for production should migrate from R to Python);
    • because we feature engineer the pragmatics of SPARQL usage too (i.e. we extract not only abstract pieces of code but particular variable names, comments, etc) which most certainly vary quite a lot with time, we should retrain at least once weekly to derive a new feature watchlist;
    • Can these results be improved? - improvements are always possible but consider the improvement/(time x resources) ratio first...
GoranSMilovanovic added a comment.EditedApr 24 2020, 11:02 AM
  • Fri 24 Apr 2020 meeting w. thephp.cc follow-up:
  • @Addshore I will provide additional datasets (feature frequency distribution, feature coverage across the SPARQL queries, etc) here, sometime during the day;
  • thephp.cc suggestion: modelling the WDQS processing time while taking into account per server concurrency (i.e. the number of queries run at the same time), and not concurrency across the whole system globally, very much makes sense.
  • @Gehel: once again, thank you very much for clarifying the details of the event.wdqs_external_sparql_query schema (especially important was the cache behavior related info).
  • @darthmon_wmde @WMDE-leszek At this point, I would put any additional work here just as much as it is needed to consolidate the reports, brush-up a detail or two, and wrap-up what we have now in a concise manner. Before taking any further steps I would suggest that is better to have some decision on this approach: a go (we will it use it to optimize WDQS in the future somehow), or a no go (we've learned something, but the optimization will rely on a different background). Take into your consideration that experiments like the ones that I have been conducting in the previous days are costly in terms of both time and resources (computational time and memory on our stat100* servers: I've been using stat1005 + my own server for computations to avoid overloading our infrastructure and save some time). If the decision is a go - we need to discuss the real-world implementation (and that will almost certainly mean Python/Pyspark development). Thanks.

@Addshore

  • queries_vocabulary.csv - all features extracted from approx. 1M SPARQL queries, 1 - 21. April 2020; statistic: total feature frequency (including multiple occurrences of the same feature in a query);
  • queries_coverage.csv - all features extracted from approx. 1M SPARQL queries, 1 - 21. April 2020; statistics: 1. number of unique queries in the sample that made use of a particular feature; 2. % of unique queries in the sample that made use of the respective feature.

Google Drive: datasets

Thanks a lot, @GoranSMilovanovic for all your hard work!

@darthmon_wmde @WMDE-leszek At this point, I would put any additional work here just as much as it is needed to consolidate the reports, brush-up a detail or two, and wrap-up what we have now in a concise manner. Before taking any further steps I would suggest that is better to have some decision on this approach: a go (we will it use it to optimize WDQS in the future somehow), or a no go (we've learned something, but the optimization will rely on a different background). Take into your consideration that experiments like the ones that I have been conducting in the previous days are costly in terms of both time and resources (computational time and memory on our stat100* servers: I've been using stat1005 + my own server for computations to avoid overloading our infrastructure and save some time). If the decision is a go - we need to discuss the real-world implementation (and that will almost certainly mean Python/Pyspark development). Thanks.

I would say that, for now, we should wait a bit and decide first how we tackle the suggestions coming from thePHP.cc's report in order to prioritise the steps and, how you nicely put it, "discuss the real-world implementation".

Update Mon 27 Apr 2020 10:10:23 PM UTC:

Final reports

  • Here goes the Part A of the Final Report which encompasses the Exploratory Data Analysis (EDA) only, encompassing: (1) the characteristics of the sample of SPARQL queries used in this study, (2) the overview of the number of queries run per (a) day of week, (b) hour of day, (c) WMF Datacenter/Host, (d) HTTP method of request, (e) server HTTP response code, and (f) the desired output format, (3) the mean and median WDQS query processing times across the mentioned (a) - (f) variables, and (4) the distributions of WDQS processing times across WMF Datacenter/Hosts and output format.

Summary

  • The eqiad data center is receiving tons of queries in comparison to codfw.
  • The XML output format seems to take much more to process in comparison to JSON and text/plain (except for we really have only few observations of text/plain in the sample).
  • The distributions of the WDQS processing time across the crucial variables (WMF Datacenter/Host, Output format) are highly skewed towards short processing times - so we really need to focus on the outliers seriously (as already did in the ML approach).

Next:

  • share the dataset of most frequently observed SPARQL queries at the WDQS endpoint;
  • share Part B of the Report: optimizing the WDQS processing times w. XGBoost and features parsed from SPARQL (nothing new, all covered in our meetings with thephp.cc, just a wrap-up).

Update Mon 27 Apr 2020 10:31:05 PM UTC:

The most frequently observed SPARQL queries dataset

  • Selection criteria: the query was observed >= 50 times in the WDQS endpoint sample (approx. 1M queries, 2020/04/01 - 2020/04/21).
  • For each query we report the mean WDQS processing time, the median WDQS processing time, and the standard deviation of processing time;
  • the dataset is sorted in descending order of mean WDQS processing time;
  • the Percent column stands for the % of the total number of queries in the sample represented by the respective (repeatedly observed) query and does not sum up to 100% because, again, we report only on the queries that were observed 50 or more times on the endpoint.

Here goes the dataset:

Columns:

  • uniqueSparqlId - the unique ID of the query - never mind, I need it for some join operations on data frames;
  • sparql - the query itself
  • Num_Observations - how many times was the query observed in the sample;
  • mean_query_time - the mean WDQS processing time for this query
  • median_query_time - the median WDQS processing time for this query
  • stdev_query_time - the standard deviation of the WDQS processing time for this query
  • Percent - explained above.
GoranSMilovanovic added a comment.EditedApr 28 2020, 2:26 AM

Update Tue 28 Apr 2020 02:17:33 AM UTC

Here goes the update report on SPARQL feature selection via XGBoost:

  • The model performance was improved mainly by (a) improving upon the feature engineering process (currently: not great, not terrible), and (b) controlling for a highly imbalanced design (i.e. the number of queries with "typical" processing times heavily outnumber the number of queries with "extreme" processing times in the sample) by switching from XGBoost control parameters (like scale_pos_weight) to a manually implemented Downsampling strategy;
  • I have switched from a definition of "extremely long processing time" as an extreme outlier to a definition which takes it to be a *mild outlier*: it poses a more difficult binary classification problem but still we get significant improvements (spot the difference between the Hit and False Alarm rate):
  • model accuracy is around 85%;
  • Hit rate (or True Positive Rate) is around 72%, and
  • False alarm rate (or False Positive Rate) is about 13%.

NOTE. The results are somewhat better when the no train/test split takes place (that happens in model selection only): acc = 86.1%, TPR = 75.4%, FPR = 12.5%, all at the decision threshold of .11.

The list of critical SPARQL features (plus what has been extracted as a feature from event.wdqs_external_sparql_query is found in *Section 4. Selected features*.

Good night.

UPDATE Wed 29 Apr 2020 11:04:48 AM UTC

  • Entered linear model results: whether the features influence response time in a positive or negative direction (Section 4.2 The direction of influence of the most important features).

@WMDE-leszek @darthmon_wmde @Lydia_Pintscher @Addshore @Gehel @Samantha_Alipio_WMDE

This could be useful for tomorrow's discussion on repeated queries:

Columns:

  • uniqueSparqlId: forget it
  • sparql: a SPARQL query
  • Cluster: a cluster to which the respective query belongs to.

What is this:

  • The most frequently observed 3,000 SPARQL queries
  • from an approx. 1M sample of queries observed at the WDQS endpoint
  • were selected and than grouped into clusters
  • by relying on their similarity across the features that describe them;
  • the features were previously selected by XGBoost (see above: all the struggle to find a set of features that can predict the query processing time).

How can this be used?

  • Similar queries belong to the same cluster (column: Cluster in the dataset)
  • once again: we are looking at the 3,000 most frequently repeated queries;
  • so we get to observe structurally similar queries that were often used -->
  • --> ideas on how to approach their optimization
  • (e.g. pre-compute a set of results that matches many structurally similar queries --> serve from an API, Elastic, something...)

@WMDE-leszek @darthmon_wmde Do we need anything else here in the foreseeable future?

I believe we pause with this analysis for now. So I guess we could close this task to simplify your task bookkeeping @GoranSMilovanovic ?

GoranSMilovanovic closed this task as Resolved.May 19 2020, 7:36 PM

@WMDE-leszek Res, non verba.

GoranSMilovanovic reopened this task as Open.Jul 14 2020, 7:54 PM
  • Re-opening the task to address the question of automated vs. non-automated SPARQL queries observed at the WDQS end-point.
  • Reference: WMDE in-house email and Google Meet discussions with @darthmon_wmde and @Lydia_Pintscher.
  • First step: analyze the frequency distribution of the user_agent field (string) from wmf.webrequest where queries are SPARQL.
  • Second step: proceed towards a working definition of a non-automated WDQS query.

First step: analyze the frequency distribution of the user_agent field (string) from wmf.webrequest where queries are SPARQL.

I suggest you use events instead fo webrequest: event.wdqs_internal_sparql_query and event.wdqs_external_sparql_query.

I have done some work emcompassing user-agent frequency analysis and I 'm in the process of writing the findings for this end of week.

@JAllemandou Please see T248308#6080150. I also see that event.wdqs_external_sparql_query encompasses the user_agent_map so yes I will go for it and not for wmf.webrequest.

I have done some work emcompassing user-agent frequency analysis and I 'm in the process of writing the findings for this end of week.

Nice, thank you, please share your work with us.

If we are both already analyzing the WDQS responses, why don't we meet (w. @Lydia_Pintscher @darthmon_wmde and other interested parties) to exchange and discuss our findings?

@JAllemandou However...

0: jdbc:hive2://an-coord1001.eqiad.wmnet:1000> select user_agent_map from event.wdqs_external_sparql_query where year = 2020 and month = 5 and day = 1 limit 10;
going to print operations logs
printed operations logs
Getting log thread is interrupted, since query is done!
INFO  : Compiling command(queryId=hive_20200715080808_74c55818-8e11-44d6-94b7-cfb516d24001): select user_agent_map from event.wdqs_external_sparql_query where year = 2020 and month = 5 and day = 1 limit 10
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:user_agent_map, type:map<string,string>, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20200715080808_74c55818-8e11-44d6-94b7-cfb516d24001); Time taken: 0.082 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20200715080808_74c55818-8e11-44d6-94b7-cfb516d24001): select user_agent_map from event.wdqs_external_sparql_query where year = 2020 and month = 5 and day = 1 limit 10
INFO  : Completed executing command(queryId=hive_20200715080808_74c55818-8e11-44d6-94b7-cfb516d24001); Time taken: 0.0 seconds
INFO  : OK
user_agent_map
{"os_family":"Other","os_major":"-","os_minor":"-","browser_major":"-","browser_family":"Other","device_family":"Other","wmf_app_version":"-"}
{"os_family":"Other","os_major":"-","os_minor":"-","browser_major":"-","browser_family":"Other","device_family":"Other","wmf_app_version":"-"}
{"os_family":"Other","os_major":"-","os_minor":"-","browser_major":"-","browser_family":"Other","device_family":"Other","wmf_app_version":"-"}
{"os_family":"Other","os_major":"-","os_minor":"-","browser_major":"-","browser_family":"Other","device_family":"Other","wmf_app_version":"-"}
{"os_family":"Other","os_major":"-","os_minor":"-","browser_major":"-","browser_family":"Other","device_family":"Other","wmf_app_version":"-"}
{"os_family":"Windows","os_major":"10","os_minor":"-","browser_major":"81","browser_family":"Chrome","device_family":"Other","wmf_app_version":"-"}
{"os_family":"Other","os_major":"-","os_minor":"-","browser_major":"-","browser_family":"Other","device_family":"Other","wmf_app_version":"-"}
{"os_family":"Other","os_major":"-","os_minor":"-","browser_major":"-","browser_family":"Other","device_family":"Other","wmf_app_version":"-"}
{"os_family":"Other","os_major":"-","os_minor":"-","browser_major":"-","browser_family":"Other","device_family":"Other","wmf_app_version":"-"}
{"os_family":"Other","os_major":"-","os_minor":"-","browser_major":"-","browser_family":"Other","device_family":"Other","wmf_app_version":"-"}
10 rows selected (0.159 seconds)

Tested for '2020/06/01` too, with similar results.

SELECT
    http.request_headers['user-agent'],
    user_agent_map,
    count(1) as c
FROM event.wdqs_external_sparql_query
WHERE year = 2020 and month = 5 and day = 1
GROUP BY
    http.request_headers['user-agent'],
    user_agent_map
ORDER BY c DESC
LIMIT 100;
GoranSMilovanovic added a comment.EditedJul 21 2020, 8:30 PM

@Lydia_Pintscher

Let's see if there is anything interesting here [shared dataset]

Data:

  • it is produced from a sample of SPARQL queries from event.wdqs_external_sparql_query,
  • time span from 20. June - 20. July,
  • sampling: 1% of queries observed each day were randomly selected,
  • final dataset has 2,097,070 queries;
  • the user_agent fields and query processing times were extracted to produce the ref_user_agent_sample.csv dataset;
  • only user_agents who appear at least twice in the sample were kept.

Columns in ref_user_agent_sample.csv:

  • count: how many times was this user_agent observed in the sample?
  • mean_time - mean query processing time for this user_agent
  • median_time - median query processing time for this user_agent
  • min_time - minimum query processing time for this user_agent
  • max_time - maximum query processing time for this user_agent
  • range_time - maximum minus minimum query processing time for this user_agent
  • percent_count: this user_agent accounts for how many % of the total queries in the sample (note: when user_agents who did not appear more than twice were filtered out);

+ the dataset is sorted in decreasing order of count.

I will now test a larger sample (order of magnitude+) to see how much would it cost us to process it.

@Lydia_Pintscher There is absolutely no correlation between

(a) how often does a particular user_agent value appears, and
(b) the mean, or median WDQS processing time for that user_agent's SPARQL queries.

We can search for particular user_agents with high average query processing time, or study the variability of query processing times per user_agents, however... as a feature in a predictive model... probably not.

Again, I will take a look at a larger sample of queries.

@GoranSMilovanovic I finally published a wiki page with most of the results I found: https://wikitech.wikimedia.org/wiki/User:Joal/WDQS_Traffic_Analysis
Sorry for the delay ...

@JAllemandou Awesome! You did a nice EDA here + you've analyzed both event.wdqs_external_sparql_query and event.wdqs_internal_sparql_query - while I've focused only on the external source in my previous analyses...

So, we do need ML to be able to predict query processing time after all:

  • if you take a look at my Report in T248308#6087571
  • you will find out that many features like query length, concurrency, etc. actually do contribute to query processing time,
  • when combined in XGBoost; however, and exactly like your analyses show us,
  • taken in isolation from other candidate features they do not show significant correlations with query processing times themselves.

Q. I remember you've mentioned somewhere - in a doc shared with @Addshore, I guess - that you've used Apache Jena AQR to parse the queries, probably to obtain algebraic representations of SPARQL and extract some features from it; do we have Jena installed somewhere on the stat100* servers?

Maybe we should meet to discuss our analyses at some point - and if you find some time.

@GoranSMilovanovic I have indeed done some analysis using Apache Jena parser to extract algebraic representation of queries. Not yet to the level of completion I like though. I'll be on holidays until August 15th starting tonight - let's discuss when I come back?

@JAllemandou Superfine. Enjoy your holidays!

@Lydia_Pintscher

Let's see if there is anything interesting here:

{F31943519}

There is definitely something interesting there! (We should get this list once a quarter or so to find new uses of our data ;-)
Thanks!

@Lydia_Pintscher You're welcome.

We should get this list once a quarter or so to find new uses of our data

It is perfectly doable. Let's discuss this on Monday and see what data and statistics precisely do we want to have reported regularly.

@Lydia_Pintscher We forgot to mention this task in our recent 1:1. In the meantime, I've tested a 10% daily queries sample and the statistics of the smaller, previously used 1% daily queries sample, turn out to be quite representative. However, if tabulation - e.g. counts and average query response times, and similar, per user agent - is really all that we need here, then we do not need to sample anything at all, just let PySpark do it in the Analytics Cluster and follow everything up to some amount of time in the past.

Change 617863 had a related patch set uploaded (by GoranSMilovanovic; owner: GoranSMilovanovic):
[analytics/wmde/WD/WD_HumanEdits@master] T248308

https://gerrit.wikimedia.org/r/617863

Change 617863 merged by GoranSMilovanovic:
[analytics/wmde/WD/WD_HumanEdits@master] T248308

https://gerrit.wikimedia.org/r/617863

GoranSMilovanovic closed this task as Resolved.Aug 18 2020, 10:01 AM