
Look into truthy nt dump performance
Closed, Resolved · Public

Description

Yesterday @ArielGlenn and I were looking into the performance of the various entity dumpers:

  • full json creation is taking about 23-26h
  • full ttl creation is taking about 24-30h
  • truthy nt creation is taking about 36-48h

The truthy nt dumps are quite slow (and notably slower than the other dump types), thus we should profile them and see what we can do.

This is currently awaiting deployment of 56906993f95067ec156cf3412f2dabaefce282ad and subsequent evaluation.

Event Timeline

Change 381229 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[purtle@master] Skip simple strings in N3Quoter::escapeLiteral

https://gerrit.wikimedia.org/r/381229

Mentioned in SAL (#wikimedia-operations) [2017-09-28T15:22:48Z] <hoo> Ran scap pull on mwdebug1001 after experiments with T176844

thiemowmde moved this task from incoming to in progress on the Wikidata board.
thiemowmde moved this task from Proposed to Review on the Wikidata-Former-Sprint-Board board.
thiemowmde subscribed.

Change 381229 merged by jenkins-bot:
[purtle@master] Skip simple strings in N3Quoter::escapeLiteral

https://gerrit.wikimedia.org/r/381229
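The idea behind the merged change is a fast path that skips escaping entirely when a literal contains none of the characters that need it, since most literals are plain text. A minimal Python sketch of that idea (the actual purtle code is PHP; the exact character set and escape table here are assumptions, not the patch's logic):

```python
import re

# Characters that need escaping in N-Triples/Turtle string literals:
# backslash, double quote, and ASCII control characters.
NEEDS_ESCAPE = re.compile(r'[\\"\x00-\x1f]')

ESCAPES = {
    '\\': '\\\\',
    '"': '\\"',
    '\n': '\\n',
    '\r': '\\r',
    '\t': '\\t',
}

def escape_literal(s: str) -> str:
    # Fast path: most literals contain no special characters at all,
    # so a single regex scan lets us return the string unchanged
    # instead of walking it character by character.
    if NEEDS_ESCAPE.search(s) is None:
        return s
    out = []
    for c in s:
        if c in ESCAPES:
            out.append(ESCAPES[c])
        elif ord(c) < 0x20:
            # Remaining control characters as \uXXXX escapes.
            out.append('\\u%04X' % ord(c))
        else:
            out.append(c)
    return ''.join(out)
```

The win comes from the cheap scan in the common case, not from a faster escape loop.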

I just benchmarked this some more:

I did 6 alternating runs with nt and ttl with --sharding-factor 8 --shard 0 --limit 1000 on mwdebug1001:
(38.077+33.798+33.044+34.293+33.142+34.518)/(23.859+24.070+23.178+23.632+23.666+23.450) = 1.458

After that I applied the optimizations and did 6 alternating runs with each format again:
(29.819+29.264+30.078+30.423+29.261+29.209)/(23.116+23.053+23.216+23.193+22.974+23.688) = 1.279

Ratio non-optimized / optimized version:
(38.077+33.798+33.044+34.293+33.142+34.518)/(29.819+29.264+30.078+30.423+29.261+29.209) = 1.162

Ratio first / second ttl run (in a perfect world this would be 1):
(23.859+24.070+23.178+23.632+23.666+23.450)/(23.116+23.053+23.216+23.193+22.974+23.688) = 1.019
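The sums behind these four ratios check out; for the record, a quick Python recomputation from the raw per-run timings above (presumably seconds per run):

```python
# First mwdebug1001 benchmark: 6 alternating nt/ttl runs,
# before and after the optimization (timings as reported above).
nt_before  = [38.077, 33.798, 33.044, 34.293, 33.142, 34.518]
ttl_before = [23.859, 24.070, 23.178, 23.632, 23.666, 23.450]
nt_after   = [29.819, 29.264, 30.078, 30.423, 29.261, 29.209]
ttl_after  = [23.116, 23.053, 23.216, 23.193, 22.974, 23.688]

def ratio(a, b):
    # Ratio of total runtime of series a over series b.
    return round(sum(a) / sum(b), 3)

print(ratio(nt_before, ttl_before))  # 1.458  nt vs. ttl, non-optimized
print(ratio(nt_after, ttl_after))    # 1.279  nt vs. ttl, optimized
print(ratio(nt_before, nt_after))    # 1.162  nt, non-optimized vs. optimized
print(ratio(ttl_before, ttl_after))  # 1.019  ttl run-to-run drift
```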

Using sql wikidatawiki "SELECT page_title FROM (SELECT page_title, page_namespace FROM page ORDER BY page_id DESC LIMIT 1100000) AS 1Mpages WHERE page_namespace = 0 ORDER BY rand() LIMIT 2500" I got a random list of 2500 recently created Items.

I just re-did the above tests with --list-file /tmp/2500-ids.txt (no shard set):

Non-optimized nt vs. ttl:
(38.450+38.485+39.022+38.650+38.813+38.278)/(31.397+30.718+30.874+30.857+31.300+31.566) = 1.240

Optimized nt vs. ttl:
(35.546+34.241+34.222+34.829+34.638+34.891)/(30.747+30.314+30.968+30.761+30.607+31.159) = 1.129

Ratio non-optimized / optimized version:
(38.450+38.485+39.022+38.650+38.813+38.278)/(35.546+34.241+34.222+34.829+34.638+34.891) = 1.112

Ratio first / second ttl run (in a perfect world this would be 1):
(31.397+30.718+30.874+30.857+31.300+31.566)/(30.747+30.314+30.968+30.761+30.607+31.159) = 1.012

With PHP 5.5.9-1ubuntu4.22 on snapshot1005:

6 alternating runs with --sharding-factor 5 --shard 0 --limit 2000, non-optimized nt vs. ttl:
(62.110+65.767+63.446+63.018+66.504+64.367)/(43.654+44.974+47.964+44.570+47.304+48.088) = 1.393

Optimized nt vs. ttl:
(56.080+60.318+58.797+55.146+60.437+65.327)/(42.851+44.759+41.383+43.726+44.448+43.175) = 1.368

Ratio non-optimized / optimized version:
(62.110+65.767+63.446+63.018+66.504+64.367)/(56.080+60.318+58.797+55.146+60.437+65.327) = 1.082

Ratio first / second ttl run (in a perfect world this would be 1):
(43.654+44.974+47.964+44.570+47.304+48.088)/(42.851+44.759+41.383+43.726+44.448+43.175) = 1.062

Given the large run-to-run variance in the last ratio (1.062 where 1 is expected), these results are not very meaningful.
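To put a number on that variance: recomputing from the snapshot1005 ttl timings above (presumably seconds per run), the scatter within a single series is roughly 4% of the mean, which is the same order as the ~6% drift between the two ttl runs:

```python
import statistics

# ttl timings from the two PHP 5.5 series on snapshot1005.
ttl_run1 = [43.654, 44.974, 47.964, 44.570, 47.304, 48.088]
ttl_run2 = [42.851, 44.759, 41.383, 43.726, 44.448, 43.175]

# Drift between the two series; should be ~1 in a perfect world.
run_ratio = round(sum(ttl_run1) / sum(ttl_run2), 3)

# Scatter within the first series, relative to its mean.
rel = statistics.stdev(ttl_run1) / statistics.mean(ttl_run1)

print(run_ratio, round(rel, 3))  # 1.062 0.042
```

With noise of that size, six runs per configuration cannot resolve an 8% effect reliably.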

hoo updated the task description.