
Perform more constraint type checks in PHP before falling back to SPARQL
Closed, Resolved · Public · 2 Story Points

Description

WikibaseQualityConstraints can check “type” and “value type” constraints in PHP or in SPARQL: first it attempts to check whether item A is an instance of item B (or its subclasses) by loading the statements of A, and its “instance of” item(s), and their “subclass of” items, etc. – but if a certain, configurable limit of entities is exceeded, this check is aborted and we fall back to asking the Wikidata Query Service instead via a SPARQL query.
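The PHP-side check described above can be sketched as a bounded breadth-first search up the "subclass of" hierarchy. This is an illustrative sketch only, not the extension's actual code; the function names and the `get_superclasses` callback are hypothetical:

```python
from collections import deque

def is_subclass_of(start_id, target_id, get_superclasses, max_entities=20):
    """Bounded breadth-first search up the "subclass of" hierarchy.

    get_superclasses(item_id) -> list of direct superclass item IDs.
    Returns True/False if the answer was found within the entity budget,
    or None to signal that the caller should fall back to SPARQL.
    """
    seen = set()
    queue = deque([start_id])
    while queue:
        current = queue.popleft()
        if current == target_id:
            return True
        if current in seen:
            continue
        seen.add(current)
        if len(seen) > max_entities:
            return None  # budget exceeded: fall back to the query service
        for superclass in get_superclasses(current):
            if superclass not in seen:
                queue.append(superclass)
    return False
```

The key point for this task is the `max_entities` budget: raising it lets more checks complete in this (fast) loop before the (slow) SPARQL fallback is triggered.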

The tracking we added in T204715: Add more tracking for constraint checks reveals that, currently, the PHP checks are much faster than the SPARQL checks (both the successful ones and the ones that fall back to SPARQL), and the majority of successful checks are in PHP rather than in SPARQL – see the “Type Check Runtimes” and “Type Check Distribution” panels. I think we should therefore raise the limit on the PHP checks, to give them more time before falling back to SPARQL. (The limit is the WBQualityConstraintsTypeCheckMaxEntities setting, configured in wmf-config/InitialiseSettings.php.)

I think the “Type Check Distribution” panel could be used to validate the change: ideally, the ratio of successful PHP checks to successful SPARQL checks should increase.

Event Timeline


The only thing we have right now is the number of checks that stayed within the threshold vs. the number of checks that exceeded it. Do you think we should add additional tracking before deploying the config change?

No, the config change should not be harmful; it might just not be as effective as "double the limit" might suggest. That would be possible to tell from the distribution. But we should be able to deploy the change now and see whether it brings us from ~70% to ~80% or to ~90% of checks handled by PHP.

Config change is reviewed and will be deployed with Monday’s EU SWAT, so moving out of the review column for now.

Change 476822 merged by jenkins-bot:
[operations/mediawiki-config@master] Perform more PHP constraint checks before falling back

https://gerrit.wikimedia.org/r/476822

Mentioned in SAL (#wikimedia-operations) [2018-12-03T12:19:53Z] <lucaswerkmeister-wmde@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:476822|Perform more PHP constraint checks before falling back (T209504)]] (duration: 00m 48s)

Change 478630 had a related patch set uploaded (by Michael Große; owner: Michael Große):
[operations/mediawiki-config@master] Perform even more PHP constraint checks before falling back

https://gerrit.wikimedia.org/r/478630

Hm, perhaps it would be good to have some tracking on the number of entities processed in PHP before we increase the limit further. What do you think?

Michael added a comment. Edited Dec 10 2018, 2:23 PM

@Lucas_Werkmeister_WMDE
Good idea. Number of entities processed might be interesting. But also how the number of constraint checks per checked entity is distributed. Maybe we can calculate/track how some key aspects of the distribution change, e.g. median, third quartile, 95th percentile and max ? That should give us a better idea where to put the limit.

Yes, if we track the number of entities, we should be able to get different percentiles in Grafana. I’ll upload a patch for this, hopefully we can get it merged in time for this week’s train.
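As a rough illustration of the idea discussed above (not the extension's actual metrics code — in production this aggregation happens in Graphite/Grafana, not in application code), the distribution aspects Michael listed can be computed from a series of tracked entity counts:

```python
import statistics

def summarize(entity_counts):
    """Summarize the distribution of tracked entity counts with the
    aspects discussed above: median, third quartile, 95th percentile, max."""
    # quantiles(n=100) gives the 1st..99th percentile cut points
    percentiles = statistics.quantiles(entity_counts, n=100)
    return {
        "median": statistics.median(entity_counts),
        "p75": percentiles[74],
        "p95": percentiles[94],
        "max": max(entity_counts),
    }
```

Comparing these summary statistics before and after a config change is what the Grafana panels mentioned below make possible.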

Change 478701 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseQualityConstraints@master] Track number of entities checked in PHP

https://gerrit.wikimedia.org/r/478701

Change 478701 merged by jenkins-bot:
[mediawiki/extensions/WikibaseQualityConstraints@master] Track number of entities checked in PHP

https://gerrit.wikimedia.org/r/478701

Lucas_Werkmeister_WMDE changed the task status from Open to Stalled.
Lucas_Werkmeister_WMDE removed Lucas_Werkmeister_WMDE as the assignee of this task.

Waiting for train deployment. Then we can go ahead with the next config change either on Thursday or next week.

Lucas_Werkmeister_WMDE changed the task status from Stalled to Open.

New tracking has been deployed and added to Grafana (“PHP Type Check Entities” panel), so we should be able to see how much these numbers move up when we bump the limit.

I won’t be around during today’s Morning SWAT, and there are no deploys on Fridays, so I’m afraid the config change will have to wait until next week. (On the plus side, that means we collect more data for the current config.)

New tracking has been deployed and added to Grafana (“PHP Type Check Entities” panel), so we should be able to see how much these numbers move up when we bump the limit.

I see now that this gives us only the distribution of check counts among entities that were successfully completed in PHP, i.e. those with <= 20 checks. However, what would be much more useful to know is the distribution of checks that exceeded that limit and were handled by SPARQL.

Also, while I saw a small change in the type check distribution from the change we made, I didn't see any change in the type check runtime. What kind of change are we expecting there?

However, what would be much more useful to know is the distribution of checks that exceeded that limit and were handled by SPARQL.

Well, that’s the “Type Check Runtimes” and “Type Check Distribution” panels… which information exactly do you want to have? The number of entities checked doesn’t make sense in SPARQL.

What kind of change are we expecting there?

In the “Type Check Runtime” panel, I would expect the times of PHP checks to increase slightly, and the times of SPARQL checks probably wouldn’t change (their rate should decrease, but we don’t currently plot that). However, the overall time of “type constraint” and “value type constraint” checks, as shown in the “Constraint Check Upper Runtime” panel (first panel on the dashboard), should hopefully decrease somewhat. (Perhaps we should move the overall runtime of constraint checks with those two types into a separate panel, and show there not just the upper runtime but also some percentiles.)

In the “Type Check Runtime” panel, I would expect the times of PHP checks to increase slightly, and the times of SPARQL checks probably wouldn’t change (their rate should decrease, but we don’t currently plot that).

Isn't this one the rate? Which, by the way, could have a better color choice to make it more accessible.

However, what would be much more useful to know is the distribution of checks that exceeded that limit and were handled by SPARQL.

Well, that’s the “Type Check Runtimes” and “Type Check Distribution” panels… which information exactly do you want to have? The number of entities checked doesn’t make sense in SPARQL.

I would like to change this configuration to a value so that about 90% of entities are successfully checked in PHP and see if that changes the timings the way we want.
Currently, we are checking about 65%-70% of entities in PHP. I would like to understand if that 20% that I would like to shift to PHP comes from entities with typically 20-40 checks each or from entities with 20-110 checks each. Then we can adjust the config accordingly.

Also, as you mentioned in a comment further above, the total number of entities checked would be interesting.

Isn't this one the rate? Which, by the way, could have a better color choice to make it more accessible.

No, sorry, I meant the rate of SPARQL checks over time, not compared to PHP checks. (Though that rate should also shift towards PHP checks, yes – I mentioned that in the task description.)

The colors are picked by Grafana, I didn’t do anything special there.

Currently, we are checking about 65%-70% of entities in PHP. I would like to understand if that 20% that I would like to shift to PHP comes from entities with typically 20-40 checks each or from entities with 20-110 checks each. Then we can adjust the config accordingly.

I’m afraid we just don’t have that information. When we fall back to SPARQL, we just ask the query service “is X a subclass of Y?” – there’s no way to know how many entities it looked at to arrive at the answer.
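For reference, the "is X a subclass of Y?" fallback described above boils down to an ASK query over the "subclass of" (P279) property path, which returns only true or false. This is a hedged sketch of what such a query looks like — the exact query text the extension actually sends may differ:

```python
def build_subclass_ask_query(item_id, target_id):
    """Build a SPARQL ASK query checking whether item_id reaches
    target_id via zero or more "subclass of" (P279) steps.

    The query service only answers true/false; as noted above, the
    number of entities it traversed to get there is not exposed.
    """
    return (
        "ASK {{ wd:{item} wdt:P279* wd:{target} . }}"
        .format(item=item_id, target=target_id)
    )
```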

Change 478630 merged by jenkins-bot:
[operations/mediawiki-config@master] Perform even more PHP constraint checks before falling back

https://gerrit.wikimedia.org/r/478630

Mentioned in SAL (#wikimedia-operations) [2018-12-18T12:17:13Z] <zfilipin@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:478630|Perform even more PHP constraint checks before falling back (T209504)]] (duration: 00m 46s)

@Lucas_Werkmeister_WMDE Interestingly, and independently from today's deployment, https://www.wikidata.org/wiki/Special:ConstraintReport/Q41 is broken o.O

Addshore moved this task from incoming to in progress on the Wikidata board. Dec 19 2018, 12:47 PM

@Lucas_Werkmeister_WMDE Interestingly, and independently from today's deployment, https://www.wikidata.org/wiki/Special:ConstraintReport/Q41 is broken o.O

It looks like that might have just been a runtime issue and this page renders for me now.

Addshore closed this task as Resolved.
Addshore claimed this task.

The change can very clearly be seen:

  • Increase in type checks being performed in PHP
  • Decrease in SPARQL type checks
  • Increase in median time for SPARQL query success (as the smaller ones are now being done in PHP)
Restricted Application added a project: User-Addshore. Dec 20 2018, 11:21 AM

Well, none of those changes actually tell us if that’s a good thing or not, I think :) but the overall mean and median times for “type” and “value type” checks also went down noticeably, which is great.

That said, I’m still not sure if we shouldn’t increase the limit even further? Even now, type checks running into the limit are still much faster than SPARQL checks (roughly 20 vs. 140 ms).

Well, none of those changes actually tell us if that’s a good thing or not, I think :) but the overall mean and median times for “type” and “value type” checks also went down noticeably, which is great.

That said, I’m still not sure if we shouldn’t increase the limit even further? Even now, type checks running into the limit are still much faster than SPARQL checks (roughly 20 vs. 140 ms).

There were not specific acceptance criteria so I was simply validating that "Perform more constraint type checks in PHP before falling back to SPARQL" had been done.

I also think we could increase the limit more.

Addshore reopened this task as Open.Dec 21 2018, 4:52 PM
Addshore removed Addshore as the assignee of this task.

Moving back to stalled as we want to increase this more, but we can't do that until the new year!

It's the new year. Should we unstall this?

It's the new year. Should we unstall this?

I don't think deployments really start again until next week, so we can't action this quite yet.

Next week we could bump the value from 20 to 30? Thoughts @Lucas_Werkmeister_WMDE ?
If that still proves to be too small we can continue increasing it throughout next week.

@Addshore

From wmf-config/InitialiseSettings.php:

'wgWBQualityConstraintsTypeCheckMaxEntities' => [
	'default' => 1000,
	'wikidatawiki' => 100,
],

I can bump to 150 as a similar 50% increase to what you suggested.

Change 483185 had a related patch set uploaded (by Tarrow; owner: Tarrow):
[operations/mediawiki-config@master] Increase PHP constraint check entities to 150

https://gerrit.wikimedia.org/r/483185

Change 483185 merged by jenkins-bot:
[operations/mediawiki-config@master] Increase PHP constraint check entities to 150

https://gerrit.wikimedia.org/r/483185

Mentioned in SAL (#wikimedia-operations) [2019-01-09T17:39:07Z] <tarrow> That last one was SWAT: [[gerrit:483185|T209504 Increase PHP constraint check entities to 150]]

It looks like we could probably bump this up another level, or a few more levels?

It looks like we could probably bump this up another level, or a few more levels?

Could we have a graph showing the changes over time in Q21510865_ValueTypeChecker and Q21503250_TypeChecker from the Constraint Check Mean Runtime Graph? As I understand, these are the metrics we really care about, and we should maybe stop when we see that they get worse again.

Addshore claimed this task.
Addshore added a subscriber: Tarrow.

I have made https://grafana.wikimedia.org/d/q3H7iFwmz/t209504 specifically for this task.
I have also added the deployments that we have already done.
I'll try poking this value some more today

Change 484370 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] wgWBQualityConstraintsTypeCheckMaxEntities 300

https://gerrit.wikimedia.org/r/484370

Change 484370 merged by jenkins-bot:
[operations/mediawiki-config@master] wgWBQualityConstraintsTypeCheckMaxEntities 300

https://gerrit.wikimedia.org/r/484370

Mentioned in SAL (#wikimedia-operations) [2019-01-15T09:10:07Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: wgWBQualityConstraintsTypeCheckMaxEntities 300, T209504 (duration: 00m 46s)

The change from 150 -> 300 didn't really seem to impact performance in any direction.
Perhaps we have found the sweet spot?
Ping @Lucas_Werkmeister_WMDE @Michael @Tarrow

It feels absurd that loading 300 entities from the database and deserializing them is faster than a single SPARQL query, but if that’s the way it is… looking at the last 90 days on that dashboard, the mean, p75 and p95 type check runtime went down significantly when we bumped the value from 20 to 100 and haven’t changed very much since then, so perhaps that’s the best we can do. Let’s wait a few more days to see the results of the latest config change, but after that I guess we can close this task.

Hm, looking at the last 2 days on the new dashboard (permalink) I feel like the runtime slightly deteriorated since the last config change? But there’s also some odd-looking patterns there, so perhaps we should observe for a bit longer.

Hm, looking at the last 2 days on the new dashboard (permalink) I feel like the runtime slightly deteriorated since the last config change? But there’s also some odd-looking patterns there, so perhaps we should observe for a bit longer.

Yup, looking at more data than when I wrote T209504#4880277 now it does look worse.
Perhaps we should tweak it down a little @Lucas_Werkmeister_WMDE ?

Probably, but I’d wait for a few days just to see more data; it’s not a catastrophic increase by any means.

Okay, 300 is definitely too much, the mean runtime is now worse than before we started tweaking the limit (see last 90 days). Let’s go back to 150, that still seems to have been marginally better than 100.

Change 485628 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[operations/mediawiki-config@master] Decrease WBQualityConstraintsTypeCheckMaxEntities

https://gerrit.wikimedia.org/r/485628

Change 485628 merged by jenkins-bot:
[operations/mediawiki-config@master] Decrease WBQualityConstraintsTypeCheckMaxEntities

https://gerrit.wikimedia.org/r/485628

Mentioned in SAL (#wikimedia-operations) [2019-01-22T10:15:37Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T209504 Decrease WBQualityConstraintsTypeCheckMaxEntities from 300 to 150 (duration: 00m 47s)

It doesn't look like it's decreasing again after going back to 150... Could the increase be related to T204031: Deploy regular running of wikidata constraint checks using the job queue, rather than to the maximum number of entities checked?

That’s possible, but I’m not sure why that would happen. Doing more checks shouldn’t really affect the mean or median time of individual checks… unless we’re already overloading the wdqs-internal cluster and it’s taking longer to respond to queries?

I think the limit is more or less fine, for now. Let’s finish the job queue rollout first. Should we take another look at this afterwards, or leave it alone?