F28340637
raw.txt
Authored By
Addshore
Mar 7 2019, 10:03 AM
2019-03-07 10:03:04 (UTC+0)
val df = spark.read.parquet("/user/joal/wmf/data/wmf/mediawiki/wikidata_parquet/20190204")
val base_rdd = df.select("labels", "descriptions", "aliases").rdd
//val base_rdd = df.select("labels", "aliases").rdd
//val base_rdd = df.select("labels").rdd
//val base_rdd = df.select("aliases").rdd
//val base_rdd = df.select("descriptions").rdd
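// Flatten every label, description, and alias value into a single RDD of strings.
// The column indices follow the order of the select above.
val strings = base_rdd.flatMap(r => {
  r.getMap[String,String](0).values ++                    // labels
  r.getMap[String,String](1).values ++                    // descriptions
  r.getMap[String,Seq[String]](2).values.flatMap(l => l)  // aliases (a list per key)
})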
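// Count the occurrences of each distinct string.
val grouped_strings = strings.map(s => (s, 1)).reduceByKey(_+_)
// Byte sizes use an explicit UTF-8 encoding and are widened to Long before
// multiplying, so that length * count cannot overflow an Int.
val total_bytes = grouped_strings.map(t => t._1.getBytes("UTF-8").length.toLong * t._2).sum()
val duplicate_bytes = grouped_strings.map(t => t._1.getBytes("UTF-8").length.toLong * (t._2 - 1)).sum()
// Sanity check: useful bytes equal the summed size of the unique strings:
grouped_strings.map(_._1.getBytes("UTF-8").length.toLong).sum() == (total_bytes - duplicate_bytes)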
// true
val all_strings = strings.count()
// How many unique strings?
val unique_strings = grouped_strings.count()
// 98,524,732
// How many strings with exactly 1 instance?
val oneoc_strings = grouped_strings.filter(t => t._2 == 1).count()
// 72,584,179
// Leaving 25,940,553 unique strings having multiple instances
// --> If we go for table-indirection, we'll need ~100M ints (4 bytes each)
// --> ~400,000,000 bytes - 1 order of magnitude less than unique string size
println(f"-----------------------------------------------------")
println(f"Total bytes for strings: $total_bytes%15.0f")
println(f"Total duplicate bytes for strings: $duplicate_bytes%15.0f")
println(f"Useful bytes for strings: ${total_bytes - duplicate_bytes}%15.0f")
// Total bytes for strings: 45,724,033,674
// Total duplicate bytes for strings: 41,630,588,801
// Useful bytes for strings: 4,093,444,873
// Useful is 1 order of magnitude less than used
println(f"Total strings: $all_strings%15.0f")
println(f"Total unique strings: $unique_strings%15.0f")
println(f"Total one occurrence strings: $oneoc_strings%15.0f")
println(f"-----------------------------------------------------")
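
The accounting in the script can be checked on a small in-memory sample without a Spark cluster. The sketch below is a minimal plain-Scala version under stated assumptions: the sample strings and the names `DedupSketch`, `Stats`, and `byteStats` are hypothetical, bytes are counted as UTF-8, and the indirection table is assumed to cost one 4-byte Int id per unique string.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object DedupSketch {
  // Hypothetical result holder; not part of the original script.
  final case class Stats(totalBytes: Long, duplicateBytes: Long,
                         uniqueStrings: Long, singletons: Long) {
    // Bytes left once every string is stored exactly once.
    def usefulBytes: Long = totalBytes - duplicateBytes
    // Assumption: one 4-byte Int id per unique string in the indirection table.
    def indexBytes: Long = uniqueStrings * 4L
  }

  def byteStats(strings: Seq[String]): Stats = {
    // Local equivalent of map(s => (s, 1)).reduceByKey(_ + _).
    val counts: Map[String, Long] =
      strings.groupBy(identity).map { case (s, occ) => (s, occ.size.toLong) }
    // Pair each unique string's UTF-8 size with its occurrence count.
    val sized: Seq[(Long, Long)] =
      counts.toSeq.map { case (s, n) => (s.getBytes(UTF_8).length.toLong, n) }
    Stats(
      totalBytes = sized.map { case (len, n) => len * n }.sum,
      duplicateBytes = sized.map { case (len, n) => len * (n - 1) }.sum,
      uniqueStrings = counts.size.toLong,
      singletons = counts.count { case (_, n) => n == 1L }.toLong
    )
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample data, for illustration only.
    val sample = Seq("human", "human", "human", "city", "Z\u00fcrich")
    val st = byteStats(sample)
    println(s"total=${st.totalBytes} duplicate=${st.duplicateBytes} useful=${st.usefulBytes}")
    println(s"unique=${st.uniqueStrings} singletons=${st.singletons} index=${st.indexBytes}")
  }
}
```

On this sample the useful bytes equal the summed sizes of the three unique strings (5 + 4 + 7 = 16), which is the same identity the script checks on the full dataset.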
Attached To
P8168 (An Untitled Masterwork)