To get familiar with the data, please provide quantitative information about the dataset using Spark in a notebook (probably with Python, as it makes producing charts easier).
The data can be found in:
Multiple snapshot dates are available, as well as multiple wikis (wikidata and commons). Just pick one date with wikidata data :)
To list them, in Hive or spark-sql:
use discovery; show partitions wikibase_rdf;
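
From a Python notebook, picking a snapshot could look roughly like the sketch below. It assumes the partition strings have the form `date=.../wiki=...` and that the date values sort chronologically as plain strings; neither is confirmed here, so check the real output of `show partitions` first. The helper names and the sample values are hypothetical, for illustration only.

```python
def parse_partition(spec):
    """Turn a partition string like 'k1=v1/k2=v2' into a dict."""
    return dict(part.split("=", 1) for part in spec.split("/"))


def pick_snapshot(partition_specs, wiki="wikidata"):
    """Return the latest date that has a partition for the given wiki.

    Assumes partition keys named 'date' and 'wiki' (an assumption) and
    dates that compare correctly as strings, e.g. YYYYMMDD.
    """
    parsed = map(parse_partition, partition_specs)
    dates = [p["date"] for p in parsed if p.get("wiki") == wiki and "date" in p]
    if not dates:
        raise ValueError(f"no partition found for wiki={wiki!r}")
    return max(dates)


def list_partitions(spark):
    """Fetch partition strings through an existing SparkSession.

    Not called in this sketch; SHOW PARTITIONS returns a single
    column named 'partition'.
    """
    rows = spark.sql("SHOW PARTITIONS discovery.wikibase_rdf").collect()
    return [row["partition"] for row in rows]


# Made-up sample values, standing in for list_partitions(spark).
sample = [
    "date=20200101/wiki=wikidata",
    "date=20200108/wiki=commons",
    "date=20200108/wiki=wikidata",
    "date=20200115/wiki=commons",
]
print(pick_snapshot(sample))
```

In the notebook, `pick_snapshot(list_partitions(spark))` would replace the sample list, and the chosen date can then be used to filter the table before computing any statistics.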