
Compare the performance of sparql queries between the full graph and the subgraphs
Closed, Resolved · Public · 8 Estimated Story Points

Assigned To
Authored By
dcausse
Jan 15 2024, 8:45 AM
Referenced Files
F41740308: Screenshot 2024-02-01 at 10.17.28 AM.png
Feb 1 2024, 5:03 PM
F41722573: Screenshot 2024-01-27 at 6.38.45 AM.png
Jan 27 2024, 12:54 PM
F41722569: Screenshot 2024-01-27 at 6.36.58 AM.png
Jan 27 2024, 12:54 PM
Restricted File
Jan 26 2024, 7:36 PM
F41720006: Screenshot 2024-01-26 at 10.20.48 AM.png
Jan 26 2024, 4:25 PM
F41720004: Screenshot 2024-01-26 at 10.19.36 AM.png
Jan 26 2024, 4:25 PM
F41718197: Screenshot 2024-01-25 at 7.43.14 PM.png
Jan 26 2024, 1:58 AM

Description

Using IGUANA, we should compare how the queries perform when run against the full graph vs. the wikidata main graph and scholarly subgraphs.

Metrics to collect are based upon IGUANA's standard metrics as defined at https://iguana-benchmark.eu/docs/3.3/usage/metrics/ .

  • Query mixes per hour (expresses total queries contained within query set, considering for example complex nested queries)
  • Queries per second
  • Number of queries per hour
  • Average queries per second
  • Number of queries
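
As a rough sketch of how these metrics relate to one another (this is my reading of the linked IGUANA docs, for illustration only; the case class and helper below are hypothetical, not IGUANA's code):

// Illustrative sketch only: approximate relationships between the metrics listed above.
// Assumes per-query execution stats collected over a benchmark run of `hours` wall-clock hours
// with `queriesInMix` distinct queries in the query mix.
case class QueryStat(queryId: String, successfulRuns: Long, totalTimeSeconds: Double)

def sketchMetrics(stats: Seq[QueryStat], hours: Double, queriesInMix: Long) = {
  val noQ    = stats.map(_.successfulRuns).sum            // NoQ: number of successfully executed queries
  val noQPH  = noQ / hours                                 // NoQPH: successful queries per hour
  val qmph   = noQPH / queriesInMix                        // QMPH: full query mixes completed per hour
  val qps    = stats.map(s => s.queryId -> s.successfulRuns / s.totalTimeSeconds).toMap // QPS per query
  val avgQPS = qps.values.sum / qps.size                   // AvgQPS: mean of the per-query QPS values
  (noQ, noQPH, qmph, qps, avgQPS)
}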

Queries that succeeded against the "full" and "main" (non-scholarly) graphs on wdqs1022 and wdqs1023 will be used. There are two passes for these queries:

  1. Sampled from bots
  2. Sampled from full traffic.

AC:

  • An analysis report is available

Event Timeline

dcausse renamed this task from Com to Compare the performance of sparql queries between the full graph and the subgraphs. Jan 15 2024, 8:49 AM
dcausse updated the task description.
Gehel set the point value for this task to 8. Jan 15 2024, 4:31 PM

For the first pass, an hour-long test is being conducted from stat1006 with config file wdqs-split-test.yml as follows.

datasets:
  - name: "split"
connections:
  - name: "baseline"
    endpoint: "https://wdqs1022.eqiad.wmnet/sparql"
  - name: "wikidata_main_graph"
    endpoint: "https://wdqs1024.eqiad.wmnet/sparql"

tasks:
  - className: "org.aksw.iguana.cc.tasks.impl.Stresstest"
    configuration:
      timeLimit: 3600000 # 1 hour, in milliseconds
      warmup:
        timeLimit: 30000
        workers:
          - threads: 4
            className: "SPARQLWorker"
            queriesFile: "queries_for_performance.txt"
            timeOut: 5000
      queryHandler:
        className: "DelimInstancesQueryHandler"
        configuration:
          delim: "### BENCH DELIMITER ###"
      workers:
        - threads: 4
          className: "SPARQLWorker"
          queriesFile: "queries_for_performance.txt"
          timeOut: 60000 # per-query timeout of 60 s (ms)
          parameterName: "query"
          gaussianLatency: 100 # gaussian-distributed delay (ms) added between queries

metrics:
  - className: "QMPH"
  - className: "QPS"
  - className: "NoQPH"
  - className: "AvgQPS"
  - className: "NoQ"

storages:
  - className: "NTFileStorage"
    configuration:
      fileName: result.nt

queries_for_performance.txt is based on the following basic code, which selects queries known to succeed against both the full graph and the main (non-scholarly) graph while returning similar results, so as to reduce garbage input and better control the parameters of the test.

scala> val wikidata = spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/wikidata_classified.parquet")
scala> val full = spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/full_classified.parquet")
scala> val joined5 = wikidata.as("w").join(full.as("f")).where("w.id = f.id and w.success = true and  w.success = f.success and w.resultSize = f.resultSize and w.reorderedHash = f.reorderedHash").select(concat(col("w.query"), lit("\n### BENCH DELIMITER ###"))).distinct
scala> joined5.repartition(1).write.option("compression", "none").text("queries_for_performance_2024_01_25.txt")
scala> :quit

$ hdfs dfs -copyToLocal hdfs://analytics-hadoop/user/dr0ptp4kt/queries_for_performance_2024_01_25.txt/part-00000-6b8caed3-3a4d-4cb2-bf74-6bbcd7af0478-c000.txt ./queries_for_performance.txt
$ /usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -jar iguana-3.3.3.jar wdqs-split-test.yml
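
As a quick sanity check on the exported file (a hypothetical snippet, not necessarily something that was run), the number of queries IGUANA will parse out of it can be counted by splitting on the delimiter; it should match joined5.count:

scala> val raw = scala.io.Source.fromFile("queries_for_performance.txt").mkString
scala> val numQueries = raw.split("### BENCH DELIMITER ###").map(_.trim).count(_.nonEmpty)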

The IGUANA build is based on https://gitlab.wikimedia.org/repos/search-platform/IGUANA/-/merge_requests/4 .

Dropping in a screenshot from Grafana for this first pass. I made a copy of result.nt to result.nt.001 and am re-running to see whether server behavior is similar.

Screenshot 2024-01-25 at 7.43.14 PM.png

Now, a screenshot showing the re-run. And then a screenshot showing them side-by-side. This is just for the visual, and the data produced from IGUANA (what is in the .nt output that we can convert to a handy CSV) should be more telling.

Next up, I'll randomize the order of the queries and do it again.

Screenshot 2024-01-26 at 10.19.36 AM.png

Screenshot 2024-01-26 at 10.20.48 AM.png

Now, the screenshot from the randomized-order queries. I'll run one more time to check that comparable output is achieved. Those were produced with the following commands; the latest output file has been moved to result.nt.003.

scala> val joined6 = wikidata.as("w").join(full.as("f")).where("w.id = f.id and w.success = true and  w.success = f.success and w.resultSize = f.resultSize and w.reorderedHash = f.reorderedHash").select(concat(col("w.query"), lit("\n### BENCH DELIMITER ###"))).distinct.sample(withReplacement=false, fraction=1.0, seed=42)
scala> joined6.count // matches same as joined5.count
scala> joined6.repartition(1).write.option("compression", "none").text("queries_for_performance_randomized_2024_01_26.txt")
scala> :quit
$ hdfs dfs -copyToLocal hdfs://analytics-hadoop/user/dr0ptp4kt/queries_for_performance_randomized_2024_01_26.txt/part-00000-131df78f-da7a-4ffc-aad4-9874342165ca-c000.txt ./queries_for_performance_randomized.txt 
$ sha1sum queries_for_performance.txt queries_for_performance_randomized.txt
$ # they're different
$ diff queries_for_performance.txt queries_for_performance_randomized.txt | wc -l
$ # they're very different
$ cp wdqs-split-test.yml wdqs-split-test-randomized.yml
$ # changed pointers to query file to be queries_for_performance_randomized.txt
$ bash start-iguana.sh wdqs-split-test-randomized.yml
$ mv result.nt result.nt.003
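
As an aside, an alternative way to get a deterministic shuffle of the query order (rather than relying on sample with fraction 1.0 plus the distinct/repartition behavior) would be an explicit orderBy over a seeded rand; this is just a sketch of an alternative, not what was run:

scala> import org.apache.spark.sql.functions.rand
scala> val shuffled = joined5.orderBy(rand(42))  // deterministic shuffle, seed 42
scala> shuffled.repartition(1).write.option("compression", "none").text("queries_shuffled_alternative.txt")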

{F41720432}

Now a screenshot from the re-run of the randomized order queries, followed by a screenshot showing the two runs on the randomized order queries side by side.

Screenshot 2024-01-27 at 6.36.58 AM.png

Screenshot 2024-01-27 at 6.38.45 AM.png

Here are the data produced by IGUANA once piped through the CSV utility introduced in https://gitlab.wikimedia.org/repos/search-platform/IGUANA/-/merge_requests/3/diffs with a command of the following form. (For the attentive reader: I had to rename the originally produced files to have an .nt extension so that the underlying Jena libraries would not throw an exception.)

/usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -cp iguana-3.3.3.jar org.aksw.iguana.rp.analysis.TabularTransform -e result.003.nt > result.003.execution.csv

run | endpointLabel | taskStartDate | successfullQueries | successfullQueriesQPH | avgqps | queryMixesPH
non-randomized 1 | baseline | 2024-01-25T22:18:57.753Z | 15538 | 17512.446990539123 | 32.60590311357346 | 0.9895715087607575
non-randomized 1 | wikidata_main_graph | 2024-01-25T23:19:56.948Z | 16773 | 19125.484555828807 | 33.86191297163505 | 1.0807190233276154
non-randomized 2 | baseline | 2024-01-26T01:47:41.634Z | 15893 | 17955.609618256018 | 32.97284513274341 | 1.0146131897076351
non-randomized 2 | wikidata_main_graph | 2024-01-26T02:48:41.047Z | 16780 | 19145.810254441058 | 34.085209300591515 | 1.0818675625496446
randomized 1 | baseline | 2024-01-26T16:51:54.091Z | 15180 | 17068.107622599186 | 32.88786330044905 | 0.9644633340452725
randomized 1 | wikidata_main_graph | 2024-01-26T17:52:52.903Z | 15929 | 17969.809300477013 | 33.91560037068121 | 1.0154155676372838
randomized 2 | baseline | 2024-01-26T19:37:30.811Z | 15211 | 17054.882354485933 | 33.00710905229813 | 0.9637160170924978
randomized 2 | wikidata_main_graph | 2024-01-26T20:38:29.989Z | 16084 | 18210.142239149543 | 34.14020362715409 | 1.0289960015341326

Keep in mind that a delay between queries was introduced in the configuration for these "stress tests" (a "stress test" here means that execution of the queries goes continuously for the specified time interval at the configured concurrency and delay). This was to more closely model what a somewhat busy, but not completely saturated, WDQS node might experience, although we should be mindful that the server specs differ a bit between these test servers and the WDQS hosts used for serving end-user WDQS production requests. When interpreting a value like avgqps, remember that it is akin to what might happen if queries were executed serially without delay, assuming JVM performance could be held constant for such request patterns (that is generally not possible to guarantee, so caveats abound; in other words, it's entirely possible that avgqps would degrade in reality).

The successfullQueriesQPH metric is probably the most interesting one. It suggests roughly a 5%-6% speed advantage for the smaller "main" graph versus the fully populated "full" graph for this query mix when conditions model a somewhat busy WDQS node (again, remember that the server spec differs a bit between the SUT and production nodes, so there is a caveat). During the very first run the performance difference was more pronounced - closer to 10% - but subsequent runs were closer to the 5%-6% level for wikidata_main_graph versus baseline.
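
For reference, that ratio can be read straight off the execution-level CSV. A Spark sketch, assuming the CSV produced by the -e TabularTransform command above (with its header row and the endpointLabel / successfullQueriesQPH columns shown):

val exec = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("result.003.execution.csv")
val qph = exec.select("endpointLabel", "successfullQueriesQPH").collect()
  .map(r => r.getString(0) -> r.get(1).toString.toDouble).toMap
// Ratio > 1 means the main graph completed more successful queries per hour than the full graph.
val speedup = qph("wikidata_main_graph") / qph("baseline")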

Additional basic summary statistics on the data from the per-query CSV exports (using the -q flag) against the .nt files are to come. Note that in Andrea's previous analysis these sorts of statistics (as well as some tweaks to get somewhat finer precision via BigDecimal instead of Double types) were incorporated directly into the Java source of IGUANA - see https://github.com/dice-group/IGUANA/compare/main...AndreaWesterinen:IGUANA:main for changes up to June 13, 2022 against the current main branch of IGUANA; n.b., future readers may need to re-correlate the code changes as IGUANA upstream changes. But I opted to make fewer changes to our fork (i.e., I didn't merge Andrea's fork into our fork, even if there is some dependency similarity in the POMs), as this data can be determined with Spark summary stat calls.
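
Since that's the plan, here is a rough sketch of the Spark summary stat calls that could reproduce the per-query table below from the -q CSV export (assuming result.query.csv has the per-query header shown a bit further down, with one row per query per endpoint):

import org.apache.spark.sql.functions._

val perQuery = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("result.query.csv")

perQuery.groupBy("endpointLabel").agg(
  avg("qps").as("mean_qps"),
  expr("percentile_approx(qps, 0.5)").as("median_qps"),
  stddev("qps").as("stddev_qps"),
  max("qps").as("max_qps"),
  expr("percentile_approx(qps, array(0.99, 0.95, 0.75, 0.5, 0.25, 0.01))").as("qps_percentiles"),
  sum("success").as("total_with_success")
).show(false)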

We may be interested in how to take forward some of the enhancement opportunities for IGUANA upstream (Andrea's fork and David's CSV utilities in our fork, for example) should we see the need for more IGUANA work later, but then again we may not do that as our needs are narrower.

Following below are "per-query" summary stats. For now I just put this together by bringing the CSV data into Google Sheets - all of the columns are calculated over the "per-query" rows (note how the Mean corresponds basically with the avgqps value calculated up above). The underlying CSV data don't contain the actual queries (the .nt files from which they're generated do), but rather rows of this form:

endpointLabel,task,queryId,totalTime,success,failed,timeouts,resultSize,unknownException,wrongCodes,qps,penalizedQPS
baseline,http://iguana-benchmark.eu/resource/1706221131/1/1,http://iguana-benchmark.eu/resource/1989023647/sparql0,53.592,2,0,0,1,0,0,37.319002836244216,37.319002836244216

No big surprises here. The "per-query" behavior was similar between nodes. The "main" graph skewed somewhat faster over the full range of queries, with one exception: the absolute fastest single query for the "randomized 1" run was slightly faster on the "baseline" full graph. But generally, everything else skewed faster for the "main" graph.

Per-query theoretical throughput (queries per second for given query)

Run | Endpoint Label | Mean | Median | Standard Deviation | Max (fastest) | 99% (very fast) | 95% | 75% | 50% | 25% | 1% (pretty slow) | Total w/ success
non-randomized 1 | baseline | 32.6059031135735 | 34.3489164537969 | 19.4337414434464 | 120.235661897318 | 76.8554161439261 | 55.7841155056668 | 49.1222469015747 | 34.3489164537969 | 14.3159887843845 | 0.0564663314012815 | 15538
non-randomized 1 | wikidata_main_graph | 33.8619129716351 | 35.7193884840691 | 20.2056740376445 | 148.610491900728 | 81.2789060283893 | 57.5619081064873 | 50.4922999242615 | 35.7193884840691 | 15.098897780462 | 0.0625188498260728 | 16773
non-randomized 2 | baseline | 32.9728451327437 | 34.7318699638788 | 19.5908672232246 | 128.890893858348 | 74.7142465127726 | 56.1419267909274 | 49.7404172035179 | 34.7318699638788 | 14.6689672498449 | 0.0569581938930891 | 15893
non-randomized 2 | wikidata_main_graph | 34.0852093005914 | 36.106296938186 | 20.1931723422722 | 130.25921583952 | 82.0449565998977 | 57.6139754652666 | 50.625221485344 | 36.106296938186 | 15.378937007874 | 0.0622306059862422 | 16780
randomized 1 | baseline | 32.8878633004489 | 34.6404323125952 | 19.8757923913207 | 136.072935093209 | 79.2782608462366 | 56.2164107372478 | 49.4926998267755 | 34.6404323125952 | 14.1755500113404 | 0.0557895216707227 | 15180
randomized 1 | wikidata_main_graph | 33.9156003706814 | 35.7091844022282 | 20.2748013579631 | 132.082948091401 | 81.0501654381498 | 57.747392312958 | 50.5101525406606 | 35.7091844022282 | 15.079658294943 | 0.0574330048487202 | 15929
randomized 2 | baseline | 33.007109052298 | 34.5670904661201 | 19.8511760316909 | 133.904659882163 | 81.4017649973028 | 56.1335754963953 | 49.6176341128876 | 34.5670904661201 | 14.3154187934342 | 0.0538090222917457 | 15211
randomized 2 | wikidata_main_graph | 34.1402036271541 | 36.0958706323996 | 20.2577595936201 | 134.156157767641 | 83.0239310934627 | 57.7850982363029 | 50.5292946486809 | 36.0958706323996 | 15.7122834258313 | 0.0589775584599512 | 16084

A run is in progress for 78K+ queries from a set of 100,000 random queries. It should be done in under 10 hours from now.

scala> val full_random = spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/full_random_classified.parquet")

scala> val wikidata_random = spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/wikidata_random_classified.parquet")

scala> full_random.count
res0: Long = 100000 

scala> wikidata_random.count
res6: Long = 100000  

scala> val joined11 = wikidata_random.as("w").join(full_random.as("f")).where("w.id = f.id and w.success = true and  w.success = f.success and w.resultSize = f.resultSize and w.reorderedHash = f.reorderedHash").select(concat(col("w.query"), lit("\n### BENCH DELIMITER ###"))).distinct.sample(withReplacement=false, fraction=1.0, seed=42)

scala> joined11.count
res0: Long = 78862

scala> joined11.repartition(1).write.option("compression", "none").text("queries_for_performance_2024_01_31.txt")

scala> :quit

$ hdfs dfs -copyToLocal hdfs://analytics-hadoop/user/dr0ptp4kt/queries_for_performance_2024_01_31.txt/part-00000-29c4e72d-800d-4148-b804-8e428ee71e9e-c000.txt ./queries_for_performance_file_renamed_randomized_2024_01_31.txt

$ bash start-iguana.sh wdqs-split-test-randomized-2024-01-31.yml

start-iguana.sh previously ran from stat1006, but this time around it's running from stat1008 in order to use more RAM for the larger query mix.

Here's the output from the latest run based upon a larger set of queries from a random sample of WDQS queries.

$ /usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -cp iguana-3.3.3.jar org.aksw.iguana.rp.analysis.TabularTransform -e result.nt > result.execution.csv
$ cut -f1,3,5,6,7,9 -d"," result.execution.csv | sed 's/,/|/g'
endpointLabel | taskStartDate | successfullQueries | successfullQueriesQPH | avgqps | queryMixesPH
baseline | 2024-01-31T23:20:44.567Z | 319857 | 136612.71246575614 | 18.83670491311007 | 1.732300885924224
wikidata_main_graph | 2024-02-01T04:23:01.613Z | 331473 | 147674.12233239523 | 19.55930142298825 | 1.8725637484770261

Here's the screen capture from Grafana.

Screenshot 2024-02-01 at 10.17.28 AM.png

The non-scholarly "main" wikidata_main_graph (wdqs1024) window completed more queries. Toward the end of its window, a larger garbage collection pass occurred on the baseline ("full") graph node (wdqs1022), whose benchmark window had closed much earlier. This happened to coincide with a small window of temporarily failing queries, which is aggregated across all three nodes: "main", "full", and "scholarly". Depending on how one looks at it, this corresponds to maybe 1.5%-3.0% of the wikidata_main_graph's run time, so I wouldn't read too much into it.

To repeat the verbiage from the earlier runs...

Following below are "per-query" summary stats. For now I just put this together by bringing the CSV data into Google Sheets - all of the columns are calculated over the "per-query" rows (note how the Mean corresponds basically with the avgqps value calculated up above). The underlying CSV data don't contain the actual queries (the .nt files from which they're generated do), ...

The CSV data were generated with the following command:
/usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -cp iguana-3.3.3.jar org.aksw.iguana.rp.analysis.TabularTransform -q result.nt > result.query.csv

Run | Endpoint Label | Mean | Median | Standard Deviation | Max (fastest) | 99% (very fast) | 95% | 75% | 50% | 25% | 1% (pretty slow) | Total w/ success
randomized 1 | baseline | 18.8367049131101 | 14.6999663404689 | 16.3589173757083 | 127.433177227691 | 59.009472115968 | 50.5734395961334 | 30.3470335487675 | 14.6999663404689 | 4.97164300568995 | 0 | 319857
randomized 1 | wikidata_main_graph | 19.5593014229883 | 16.0982853987134 | 16.5098295290687 | 121.141149629509 | 58.9613256488317 | 51.0426872548935 | 31.751311031492 | 16.0982853987134 | 5.37249826361878 | 0 | 331473

Although the max and 99th percentile queries were just ever so slightly faster on the baseline "full" graph, more generally things were faster on the non-scholarly "main" graph. The performance difference is obvious but not dramatic.

Here's the content of wdqs-split-test-randomized-2024-01-31.yml, comments removed for brevity. The main differences in this configuration file from the one presented earlier are the five hours allowed per graph (timeLimit of 18000000 ms), to accommodate the larger query mix, and the updated filename pointing to the larger query mix based on the set of queries from the random sample.

datasets:
  - name: "split"
connections:
  - name: "baseline"
    endpoint: "https://wdqs1022.eqiad.wmnet/sparql"
  - name: "wikidata_main_graph"
    endpoint: "https://wdqs1024.eqiad.wmnet/sparql"

tasks:
  - className: "org.aksw.iguana.cc.tasks.impl.Stresstest"
    configuration:
      timeLimit: 18000000
      warmup:
        timeLimit: 30000
        workers:
          - threads: 4
            className: "SPARQLWorker"
            queriesFile: "queries_for_performance_file_renamed_randomized_2024_01_31.txt"
            timeOut: 5000
      queryHandler:
        className: "DelimInstancesQueryHandler"
        configuration:
          delim: "### BENCH DELIMITER ###"
      workers:
        - threads: 4
          className: "SPARQLWorker"
          queriesFile: "queries_for_performance_file_renamed_randomized_2024_01_31.txt"
          timeOut: 60000
          parameterName: "query"
          gaussianLatency: 100

metrics:
  - className: "QMPH"
  - className: "QPS"
  - className: "NoQPH"
  - className: "AvgQPS"
  - className: "NoQ"

storages:
  - className: "NTFileStorage"
    configuration:
      fileName: result.nt

@dr0ptp4kt thanks! is the difference in the number of successful queries only explained by the improvement in query time or are there some improvements in the number of queries that timeout as well?


Good question! It appears to be related to query time.

Looking at this latest run, for example, there were no recorded timeouts according to the CSV of the IGUANA .nt (🤔).

Taking things on a head-to-head basis for identical queries between the endpoints, here's what we see for the difference in speed, computed as wikidata_main_graph QPS minus baseline QPS. It's unsurprising given the distribution shown in the prior Phabricator comment, but it is another way of seeing that, under the parameters of this test anyway (a controlled flow rather than a maximum-load test), about 70% of the queries noted as successful were faster when run against wikidata_main_graph. Note that about 16% of the queries hit wrongCodes / failed; these are discussed after the table.

Per-query wikidata_main_graph QPS minus baseline QPS | descriptor
0.722596509877809 | average
0.244672300065055 | median
79.4339558877256 | 100% max (i.e., wikidata_main_graph's biggest winner)
21.0654641024791 | 99%
6.88080533343067 | 95%
1.38414473312972 | 75%
0.244672300065055 | 50%
0.0139828813684474 | 42%
0 | 41%
0 | 26%
-0.00701117502390231 | 25%
-0.215374628998983 | 20%
-0.598658931613195 | 15%
-1.41867399989265 | 10%
-4.16152316076897 | 5%
-18.0068429593504 | 1%
-80.2800161266253 | 0% min (i.e., baseline's biggest winner)

About 58% of queries tilted toward wikidata_main_graph and about 25% tilted toward baseline, and 58/(58+25) is about 0.7. The queries where the difference is negligible probably don't matter that much. Yet there's a bit more detail to consider in IGUANA's conception here...
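
A sketch of how this head-to-head comparison can be computed from the -q per-query CSV (same assumptions as the earlier Spark sketch, plus the assumption that queryId is stable across the two endpoints' rows, so a self-join pairs each query with itself):

import org.apache.spark.sql.functions._

val perQuery = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("result.query.csv")

val main = perQuery.filter(col("endpointLabel") === "wikidata_main_graph")
  .select(col("queryId"), col("qps").as("qps_main"))
val base = perQuery.filter(col("endpointLabel") === "baseline")
  .select(col("queryId"), col("qps").as("qps_base"))

// Per-query delta: positive means the query ran faster against the main graph.
val deltas = main.join(base, "queryId").withColumn("delta", col("qps_main") - col("qps_base"))

deltas.agg(
  avg(when(col("delta") > 0, 1.0).otherwise(0.0)).as("fraction_faster_on_main"),
  expr("percentile_approx(delta, array(1.0, 0.99, 0.95, 0.75, 0.5, 0.25, 0.1, 0.05, 0.01, 0.0))").as("delta_percentiles")
).show(false)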

For the sake of completeness, and because this may be interesting to consider later on or to contextualize the QPS distributions in the prior Phabricator comment: looking at a different class of issues, let's suppose we use wrongCodes as a proxy for things that could have gone wrong. wrongCodes and failed map to each other in the CSV, and qps lands as 0 for these records (penalizedQPS, not included in the tables above, lands by default as 0.01666666667 for these records, but that is close enough to 0 if we wanted to look at it that way). These records thus drive down the summary mean, median, and so on. As an aside, in terms of actual time (totalTime), these wrongCodes records occupy very little time.

Endpoint Label | count wrongCodes | sum wrongCodes | count failed | sum failed | count timeout | count QPS < 1.0 | count QPS < 5.0 | count QPS < 20.0 | count QPS < 80.0 | count QPS < 200.0
baseline | 12784 | 61863 | 12784 | 61863 | 0 | 12973 | 19750 | 46667 | 78649 | 78710
wikidata_main_graph | 12786 | 64129 | 12786 | 64129 | 0 | 12934 | 18909 | 44848 | 78642 | 78710

The wrongCodes resulted for basically the same queries across both endpoints (and each row of the CSVs is a per-query row, as we know).

The ratio of wikidata_main_graph to baseline for sum wrongCodes (1.036629326091525) is very close to the ratio up above for successfullQueries (1.036316228814752).
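
For reference, a sketch of how the wrongCodes aggregates and that ratio can be pulled from the same per-query CSV (same column assumptions as before):

import org.apache.spark.sql.functions._

val perQuery = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("result.query.csv")

val byEndpoint = perQuery.groupBy("endpointLabel").agg(
  count(when(col("wrongCodes") > 0, true)).as("count_wrongCodes"),
  sum("wrongCodes").as("sum_wrongCodes"),
  count(when(col("failed") > 0, true)).as("count_failed"),
  sum("failed").as("sum_failed"),
  count(when(col("timeouts") > 0, true)).as("count_timeout")
)
byEndpoint.show(false)

// Compare this with the successfullQueries ratio noted above.
val sums = byEndpoint.collect()
  .map(r => r.getAs[String]("endpointLabel") -> r.getAs[Long]("sum_wrongCodes")).toMap
val wrongCodesRatio = sums("wikidata_main_graph").toDouble / sums("baseline")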

I summarized at https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Graph_split_IGUANA_performance . When we have a mailing list post during the next week or so, we'll want to move this to be a subpage of the target page of the post.

I actually just added a link to https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update#See_also . Marking this here ticket as resolved after noticing it was still open.