
Look into truthy nt dump performance
Closed, Resolved · Public

Description

Yesterday @ArielGlenn and I were looking into the performance of the various entity dumpers:

  • full json creation is taking about 23-26h
  • full ttl creation is taking about 24-30h
  • truthy nt creation is taking about 36-48h

The truthy nt dumps are quite slow (notably slower than the other dump types), so we should profile them and see what we can do.

This is currently awaiting deployment of 56906993f95067ec156cf3412f2dabaefce282ad and subsequent evaluation.

Event Timeline

hoo created this task. Sep 27 2017, 9:08 AM

Change 381229 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[purtle@master] Skip simple strings in N3Quoter::escapeLiteral

https://gerrit.wikimedia.org/r/381229
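The patch's core idea — returning a literal unchanged when it contains nothing that needs escaping, so the common case skips the per-character work — can be sketched as follows (a Python illustration of the concept only; Purtle itself is PHP, and the exact character set N3Quoter::escapeLiteral handles is an assumption here):

```python
import re

# Assumed set of characters needing escaping in N-Triples/Turtle
# literals: backslash, double quote, and common control characters.
NEEDS_ESCAPE = re.compile(r'["\\\n\r\t]')

ESCAPES = {'\\': '\\\\', '"': '\\"', '\n': '\\n', '\r': '\\r', '\t': '\\t'}

def escape_literal(value: str) -> str:
    # Fast path: most literals are plain text and can be returned
    # as-is, which is what the "skip simple strings" patch exploits.
    if NEEDS_ESCAPE.search(value) is None:
        return value
    return ''.join(ESCAPES.get(ch, ch) for ch in value)
```

Since most Wikidata labels and descriptions are plain text, the fast path would be taken for the vast majority of literals.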

Mentioned in SAL (#wikimedia-operations) [2017-09-28T15:22:48Z] <hoo> Ran scap pull on mwdebug1001 after experiments with T176844

thiemowmde assigned this task to hoo. Sep 28 2017, 4:34 PM
thiemowmde moved this task from incoming to in progress on the Wikidata board.
thiemowmde moved this task from Proposed to Review on the Wikidata-Former-Sprint-Board board.
thiemowmde added a subscriber: thiemowmde.

Change 381229 merged by jenkins-bot:
[purtle@master] Skip simple strings in N3Quoter::escapeLiteral

https://gerrit.wikimedia.org/r/381229

hoo added a comment (Edited). Sep 29 2017, 11:55 AM

I just benchmarked this some more:

I did 6 alternating runs with nt and ttl with --sharding-factor 8 --shard 0 --limit 1000 on mwdebug1001:
(38.077+33.798+33.044+34.293+33.142+34.518)/(23.859+24.070+23.178+23.632+23.666+23.450) = 1.458
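For context, the --sharding-factor/--shard options select a subset of the entities; one plausible reading (an assumption — the actual dump script may partition differently, e.g. by hashing) is a modulo split over numeric entity ids:

```python
def in_shard(entity_id: int, sharding_factor: int, shard: int) -> bool:
    # Hypothetical partition scheme: an entity belongs to shard
    # (entity_id mod sharding_factor); only matching entities are dumped.
    return entity_id % sharding_factor == shard

# --sharding-factor 8 --shard 0 would then select every 8th entity:
subset = [eid for eid in range(1, 33) if in_shard(eid, 8, 0)]
print(subset)  # [8, 16, 24, 32]
```

With --limit 1000 on top of that, each run dumps the first 1000 entities of the selected shard.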

After that I applied the optimizations and did 6 alternating runs with each format again:
(29.819+29.264+30.078+30.423+29.261+29.209)/(23.116+23.053+23.216+23.193+22.974+23.688) = 1.279

Ratio non-optimized / optimized version:
(38.077+33.798+33.044+34.293+33.142+34.518)/(29.819+29.264+30.078+30.423+29.261+29.209) = 1.162

Ratio first / second ttl run (in a perfect world this would be 1):
(23.859+24.070+23.178+23.632+23.666+23.450)/(23.116+23.053+23.216+23.193+22.974+23.688) = 1.019
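Each ratio above is the sum of six wall-clock timings divided by the sum of the six corresponding runs in the other configuration (equivalently, the ratio of mean run times); the first one, for example, can be recomputed as:

```python
# Timings (seconds) from the six alternating runs on mwdebug1001.
nt_runs  = [38.077, 33.798, 33.044, 34.293, 33.142, 34.518]
ttl_runs = [23.859, 24.070, 23.178, 23.632, 23.666, 23.450]

# Sum-of-runs ratio == ratio of mean run times.
ratio = sum(nt_runs) / sum(ttl_runs)
print(round(ratio, 3))  # 1.458
```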

hoo added a comment. Sep 29 2017, 2:46 PM

Using sql wikidatawiki "SELECT page_title FROM (SELECT page_title, page_namespace FROM page ORDER BY page_id DESC LIMIT 1100000) AS 1Mpages WHERE page_namespace = 0 ORDER BY rand() LIMIT 2500", I got a random list of 2500 recently created Items.

I just re-did the above tests with --list-file /tmp/2500-ids.txt (no shard set):

Non-optimized nt vs. ttl:
(38.450+38.485+39.022+38.650+38.813+38.278)/(31.397+30.718+30.874+30.857+31.300+31.566) = 1.240

Optimized nt vs. ttl:
(35.546+34.241+34.222+34.829+34.638+34.891)/(30.747+30.314+30.968+30.761+30.607+31.159) = 1.129

Ratio non-optimized / optimized version:
(38.450+38.485+39.022+38.650+38.813+38.278)/(35.546+34.241+34.222+34.829+34.638+34.891) = 1.112

Ratio first / second ttl run (in a perfect world this would be 1):
(31.397+30.718+30.874+30.857+31.300+31.566)/(30.747+30.314+30.968+30.761+30.607+31.159) = 1.012

hoo added a comment (Edited). Sep 29 2017, 4:47 PM

With PHP 5.5.9-1ubuntu4.22 on snapshot1005:

6 alternating runs with --sharding-factor 5 --shard 0 --limit 2000 (non-optimized nt vs. ttl):
(62.110+65.767+63.446+63.018+66.504+64.367)/(43.654+44.974+47.964+44.570+47.304+48.088) = 1.393

Optimized nt vs. ttl:
(56.080+60.318+58.797+55.146+60.437+65.327)/(42.851+44.759+41.383+43.726+44.448+43.175) = 1.368

Ratio non-optimized / optimized version:
(62.110+65.767+63.446+63.018+66.504+64.367)/(56.080+60.318+58.797+55.146+60.437+65.327) = 1.082

Ratio first / second ttl run (in a perfect world this would be 1):
(43.654+44.974+47.964+44.570+47.304+48.088)/(42.851+44.759+41.383+43.726+44.448+43.175) = 1.062

Given the large run-to-run variation reflected in the last number, these results are not very meaningful.
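The run-to-run noise can be quantified directly from the ttl timings: a quick coefficient-of-variation check (sample standard deviation over the mean) shows the snapshot1005 runs spreading roughly three times as much as the mwdebug1001 runs, which is why these ratios are less trustworthy:

```python
import statistics

# First ttl run on each host, taken from the timings above.
ttl_mwdebug1001  = [23.859, 24.070, 23.178, 23.632, 23.666, 23.450]
ttl_snapshot1005 = [43.654, 44.974, 47.964, 44.570, 47.304, 48.088]

for host, runs in (("mwdebug1001", ttl_mwdebug1001),
                   ("snapshot1005", ttl_snapshot1005)):
    # Coefficient of variation: run-to-run spread relative to the mean.
    cv = statistics.stdev(runs) / statistics.mean(runs)
    print(f"{host}: {100 * cv:.1f}% spread")  # ~1.3% vs. ~4.2%
```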

hoo moved this task from Doing to Done on the Wikidata-Former-Sprint-Board board. Oct 25 2017, 3:54 PM
hoo updated the task description.
Lydia_Pintscher closed this task as Resolved. Nov 23 2017, 1:33 PM