
Investigate duplication of strings in wb_terms table for wikidatawiki
Closed, Resolved · Public

Description

  • Duplication across all strings (labels, descriptions, aliases)
  • Duplication split by type
    • label
    • description
    • aliases
  • Duplication between labels and aliases

Event Timeline

Restricted Application added a subscriber: Aklapper. Mar 7 2019, 9:30 AM
Addshore updated the task description. Mar 7 2019, 9:31 AM

For labels, descriptions and aliases we already have the values.

[2018-12-06 19:10:16] <joal> addshore: TL;DR - total-bytes is ~45G over ~100M unique strings. Usefull-bytes (minus duplication) is 1 order of magnitude smaller: 4G, with ~25M duplicates - The idea of creating an indirection table could be very valuable :)

Exact analysis, run on 2018-12-06:

val df = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20181001")
val base_rdd = df.select("labels", "descriptions", "aliases").rdd
val strings = base_rdd.flatMap(r => {
  r.getMap[String,String](0).values ++
  r.getMap[String,String](1).values ++
  r.getMap[String,Seq[String]](2).values.flatMap(l => l)
})

val grouped_strings = strings.map(s => (s, 1)).reduceByKey(_+_)


// Byte lengths are taken as UTF-8 and widened to Long before multiplying,
// so a very frequent string cannot overflow Int.
val total_bytes = grouped_strings.map(t => t._1.getBytes("UTF-8").length.toLong * t._2).sum()
val duplicate_bytes = grouped_strings.map(t => t._1.getBytes("UTF-8").length.toLong * (t._2 - 1)).sum()

println(f"Total bytes for strings: $total_bytes%15.0f")
println(f"Total duplicate bytes for strings: $duplicate_bytes%15.0f")
println(f"Useful bytes for strings: ${total_bytes - duplicate_bytes}%15.0f")

// Total bytes for strings: 45,724,033,674
// Total duplicate bytes for strings: 41,630,588,801
// Useful bytes for strings: 4,093,444,873
// Useful bytes are 1 order of magnitude less than total bytes

// Cross-check useful bytes for strings:
grouped_strings.map(_._1.getBytes("UTF-8").length.toLong).sum() == (total_bytes - duplicate_bytes)
// true


// How many unique strings?
grouped_strings.count()
// 98,524,732

// How many strings with exactly 1 instance?
grouped_strings.filter(t => t._2 == 1).count()
// 72,584,179
// Leaving 25,940,553 unique strings having multiple instances

// --> If we go for table indirection, we'll need ~100M 4-byte integer ids
// --> 400,000,000 bytes - 1 order of magnitude less than the unique string size
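
For readers without access to the cluster, the same accounting as the Spark job above can be tried on a small in-memory sample with plain Scala collections. This is only a minimal sketch, not the analysis that produced the numbers above: the `DedupStats` object, the `Stats` case class and its field names are hypothetical, and byte lengths are assumed to be UTF-8.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object DedupStats {
  // Summary of one collection of strings; names are illustrative only.
  final case class Stats(
      totalBytes: Long,
      duplicateBytes: Long,
      uniqueStrings: Long,
      singleOccurrence: Long
  ) {
    // Bytes left if every distinct string is stored exactly once.
    def usefulBytes: Long = totalBytes - duplicateBytes
  }

  def compute(strings: Seq[String]): Stats = {
    // In-memory equivalent of map(s => (s, 1)).reduceByKey(_ + _).
    val counts: Seq[(String, Long)] =
      strings.groupBy(identity).toSeq.map { case (s, occurrences) =>
        (s, occurrences.size.toLong)
      }

    def utf8Length(s: String): Long = s.getBytes(UTF_8).length.toLong

    Stats(
      // Every occurrence counts towards the total.
      totalBytes = counts.map { case (s, n) => utf8Length(s) * n }.sum,
      // All occurrences after the first one are duplicates.
      duplicateBytes = counts.map { case (s, n) => utf8Length(s) * (n - 1) }.sum,
      uniqueStrings = counts.size.toLong,
      singleOccurrence = counts.count { case (_, n) => n == 1L }.toLong
    )
  }
}
```

Calling `DedupStats.compute` on a handful of labels and descriptions returns the same five quantities reported in this task (total, duplicate and useful bytes, unique strings, one-occurrence strings), which makes it easy to sanity-check the definitions before running them on a full snapshot.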

Addshore closed this task as Resolved. Mar 7 2019, 11:16 AM
Addshore claimed this task.

Running from /user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190204

Fields                                Strings  Unique Strings  One occurrence strings
Labels, Descriptions & Aliases  1,996,735,054     106,498,268              71,574,791
Labels & Aliases                  342,328,983      89,450,632              57,859,429
Labels                            279,406,857      77,753,053              48,276,975
Descriptions                    1,654,406,071      17,180,878              13,867,657
Aliases                            62,922,126      13,765,857              11,584,411
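
The table does not list the overlap between labels and aliases directly, but it can be derived from the unique-string column by inclusion-exclusion: unique labels plus unique aliases, minus unique strings in the combined set, gives the number of distinct strings that occur both as a label and as an alias. A small sketch of that arithmetic follows. The `Overlap` object and `sharedUnique` helper are hypothetical names, and the sketch assumes each combined row counts unique strings over the union of the types it names.

```scala
object Overlap {
  // |A ∩ B| = |A| + |B| - |A ∪ B|, applied to counts of distinct strings.
  def sharedUnique(uniqueA: Long, uniqueB: Long, uniqueUnion: Long): Long =
    uniqueA + uniqueB - uniqueUnion

  // Unique-string counts copied from the table (20190204 snapshot).
  val uniqueLabels = 77753053L
  val uniqueAliases = 13765857L
  val uniqueLabelsAndAliases = 89450632L
  val uniqueDescriptions = 17180878L
  val uniqueAll = 106498268L

  // Distinct strings used both as a label and as an alias.
  val labelAliasOverlap: Long =
    sharedUnique(uniqueLabels, uniqueAliases, uniqueLabelsAndAliases)

  // Distinct strings shared between descriptions and the label/alias set.
  val descriptionOverlap: Long =
    sharedUnique(uniqueLabelsAndAliases, uniqueDescriptions, uniqueAll)
}
```

With the figures from the table this gives 2,068,278 distinct strings shared between labels and aliases, and 133,242 shared between descriptions and the combined label/alias set.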

Raw results from a notebook with P8168:

-------------------LAD----------------------------------
Total bytes for strings:     55403285421
Total duplicate bytes for strings:     50665680015
Useful bytes for strings:      4737605406
Total strings:      1996735054
Total unique strings:       106498268
Total one occurrence strings:        71574791
-----------------------------------------------------
----------------------LA-------------------------------
Total bytes for strings:      8602346166
Total duplicate bytes for strings:      4564380973
Useful bytes for strings:      4037965193
Total strings:       342328983
Total unique strings:        89450632
Total one occurrence strings:        57859429
-----------------------------------------------------
------------------------L-----------------------------
Total bytes for strings:      7764983650
Total duplicate bytes for strings:      4023237872
Useful bytes for strings:      3741745778
Total strings:       279406857
Total unique strings:        77753053
Total one occurrence strings:        48276975
-----------------------------------------------------
-------------------------D----------------------------
Total bytes for strings:     46800939255
Total duplicate bytes for strings:     46098575317
Useful bytes for strings:       702363938
Total strings:      1654406071
Total unique strings:        17180878
Total one occurrence strings:        13867657
-----------------------------------------------------
------------------------A-----------------------------
Total bytes for strings:       837362516
Total duplicate bytes for strings:       506426114
Useful bytes for strings:       330936402
Total strings:        62922126
Total unique strings:        13765857
Total one occurrence strings:        11584411
-----------------------------------------------------
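
To compare the types at a glance, the share of bytes that are duplicates can be computed from each block of raw results. The following is a minimal sketch; the `DuplicateShare` object and its `share` method are hypothetical names, and the inputs are copied from the output above.

```scala
object DuplicateShare {
  // Fraction of stored bytes that repeat an already-seen string.
  def share(duplicateBytes: Long, totalBytes: Long): Double =
    if (totalBytes == 0L) 0.0 else duplicateBytes.toDouble / totalBytes.toDouble

  // (block name, total bytes, duplicate bytes) from the raw results.
  val rows: Seq[(String, Long, Long)] = Seq(
    ("LAD", 55403285421L, 50665680015L),
    ("LA", 8602346166L, 4564380973L),
    ("L", 7764983650L, 4023237872L),
    ("D", 46800939255L, 46098575317L),
    ("A", 837362516L, 506426114L)
  )

  def main(args: Array[String]): Unit =
    rows.foreach { case (name, total, dup) =>
      // Print the duplicate share as a percentage with two decimals.
      println(f"$name%-4s ${share(dup, total) * 100}%6.2f%% duplicate bytes")
    }
}
```

On these inputs descriptions come out at roughly 98% duplicate bytes, against roughly 52% for labels, so descriptions account for most of the potential savings.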

Restricted Application added a project: User-Addshore. Mar 7 2019, 11:16 AM