Tue, Feb 20
Some ideas for improvement:
- Use parquet file format instead of default hive format
- Store data daily instead of hourly (~10Mb per hour in hive format, meaning ~250Mb per day, plus parquet compaction --> Should be good)
- Add request_count instead of keeping distincts
- Prevent errors when appending data by using OVERWRITE
- Enforce the number of files Hive outputs (the default produces far too many small files, which is inefficient)
Updated version of the code below:
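The updated code itself is not reproduced here; purely as a hypothetical Python sketch of the last bullet, here is one way to pick how many files to coalesce a partition into, assuming ~250MB per daily partition and ~256MB target files (both assumptions, not the real job settings):

```python
import math

# Assumed sizes, not the real job settings: ~250MB of data per daily
# partition (from the estimate above) and ~256MB target output files.
DAILY_PARTITION_BYTES = 250 * 1024 * 1024
TARGET_FILE_BYTES = 256 * 1024 * 1024

def num_output_files(partition_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """How many files to coalesce a partition into so each file is
    close to the target size (always at least one file)."""
    return max(1, math.ceil(partition_bytes / target_file_bytes))
```

With these numbers a daily partition fits in a single file (`num_output_files(DAILY_PARTITION_BYTES)` returns 1), while a 2GB partition would get 8 files.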
@Lydia: You should split by editor type. The editors you are talking about are, I think, what we call registered-user editors in Wikistats 2.
Please let us know if I'm wrong!
Mon, Feb 19
@bmansurov No worries :) The whole point of these two things is to work for 'every' wiki :)
@bmansurov and @diego : Data is available up to and including 2018-01 at hdfs:///user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01.
I think we're not going to put more effort into productionization as of now, but new imports can be done.
@bmansurov : There is an example command line in the header-comment of the XmlConverter file.
Little reminder: these two patches deal with huge datasets (2TB of bz2-compressed XML and 18TB of snappy-compressed parquet). My wish is really for them to be productionized so that the data they import/compute is not duplicated.
Took longer than I anticipated, but it's done: /user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01
Looks like the patches work :)
Thu, Feb 15
After more thought, it looks like the current need is only a cron job that checks that new data flows in regularly, and emails if it doesn't.
Accumulators and reports of execution might come in a second round (after using spark2 and having better understood some of its benefits)
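A minimal sketch of such a freshness check; the staleness threshold and the way the latest partition timestamp is obtained are assumptions, not actual job config:

```python
import time

# Assumed threshold: alert if no new partition has landed for two days.
MAX_LAG_SECONDS = 2 * 24 * 3600

def is_stale(latest_partition_ts, now=None, max_lag=MAX_LAG_SECONDS):
    """True when the newest partition is older than the allowed lag,
    i.e. when the cron job should send an alert email."""
    if now is None:
        now = time.time()
    return (now - latest_partition_ts) > max_lag
```

A cron job would compute `latest_partition_ts` from the newest directory in HDFS and send the alert email when `is_stale(...)` returns True.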
@bmansurov : This doc is a bit outdated. It should work, but there is better tooling now.
First round of discussion with the team:
- Things we agree on:
- using multiple datasources in druid (snapshots) seems the way to go to facilitate rollbacks (naming convention could follow our snapshots: YYYY-MM)
- Data quality checks using old/new datasources in druid seems also interesting for both data quality and cache warming.
- Thing still to be discussed: how do we swap from the old datasource to the new one in AQS when we think it's ready (or the other way around when we roll back)? Multiple ideas:
- Use cassandra as a key/value config store (no deploy needed, changes can be pushed via API, but we'd be using a ''data'' store for config)
- Use a dedicated file with its own repo (deploy needed)
- Use another conf system (etcd, zookeeper...) -- again, yet another tool...
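To illustrate the snapshot-per-datasource idea, a sketch of the naming convention and of computing a rollback target; the base datasource name and the `_` separator are assumptions, not a decided convention:

```python
from datetime import date

def datasource_name(base, snapshot):
    # e.g. ("mediawiki_history", "2018-01") -> "mediawiki_history_2018_01"
    # Assumed separator: Druid datasource names traditionally avoid '-'.
    return f"{base}_{snapshot.replace('-', '_')}"

def previous_snapshot(snapshot):
    """YYYY-MM of the month before the given snapshot (rollback target)."""
    year, month = map(int, snapshot.split("-"))
    d = date(year, month, 1)
    prev = date(d.year - 1, 12, 1) if d.month == 1 else date(d.year, d.month - 1, 1)
    return f"{prev.year:04d}-{prev.month:02d}"
```

Swapping would then amount to pointing AQS at `datasource_name(base, current)`, or on rollback at `datasource_name(base, previous_snapshot(current))`.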
Mon, Feb 12
Fri, Feb 9
@bmansurov: The last snapshot I generated was at the beginning of 2017-06 (named 2017-05, since the last full month is May 2017). It's available in two formats:
- hdfs:///user/joal/wmf/data/raw/mediawiki/xmldumps/20170601 in XML (files are stored in per-wiki folders, formatted as Hive partitions)
- hdfs:///user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2017-05 in parquet (same here, files in per-wiki folders so they're accessible as partitions).
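As an illustration, those Hive-style paths can be built mechanically; the `wiki_db=` key for the per-wiki subfolders is an assumption, only the `snapshot=` level appears verbatim above:

```python
BASE = "hdfs:///user/joal/wmf/data/wmf/mediawiki/wikitext"

def snapshot_path(base, snapshot, wiki_db=None):
    """Build a Hive-style partition path, e.g. .../snapshot=2017-05."""
    path = f"{base}/snapshot={snapshot}"
    if wiki_db is not None:
        # Hypothetical per-wiki partition key.
        path += f"/wiki_db={wiki_db}"
    return path
```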
Thu, Feb 8
Tue, Feb 6
Mon, Feb 5
Wed, Jan 24
Fixed as of 2018-01-23.
@ezachte : Could you launch a backfill of 2018-01-01 to 2018-01-22 ?
Many thanks !
Jan 23 2018
Found the precise line: https://github.com/wikimedia/analytics-camus/blob/master/camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/partitioner/DefaultPartitioner.java#L67
Will patch partition checker accordingly.
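For context, the logic at the linked line appears to boil down to flooring timestamps into fixed-duration buckets. A sketch of that bucketing (hourly granularity is an assumption; the checker has to use the same value as Camus):

```python
HOUR_MS = 60 * 60 * 1000  # assumed partition granularity

def partition_for(timestamp_ms, granularity_ms=HOUR_MS):
    """Floor a millisecond timestamp to the start of its partition bucket."""
    return timestamp_ms - (timestamp_ms % granularity_ms)
```

The partition checker then needs to look for the bucket `partition_for(ts)` rather than deriving it independently.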
@Pine: We discussed this naming convention within the team a while ago, and decided to go for "g" instead of "b" for internationalization reasons.
"b" stands for billion, which is very English-centric. In French for instance, the word "billion" means a million millions (see https://en.wikipedia.org/wiki/Billion for more detail).
To mitigate this issue, we decided to pick order-of-magnitude "prefixes" (see https://en.wikipedia.org/wiki/Order_of_magnitude#Uses), where such a difference doesn't exist.
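A toy formatter showing the prefix convention; the rounding and cutoffs here are guesses, not necessarily what Wikistats 2 actually does:

```python
def format_count(n):
    """Format a count with order-of-magnitude prefixes (k, m, g),
    avoiding the English-centric 'b' for billion."""
    for factor, prefix in ((10**9, "g"), (10**6, "m"), (10**3, "k")):
        if n >= factor:
            return f"{n / factor:g}{prefix}"
    return str(n)
```

For example, 8_000_000_000 formats as "8g", i.e. 8 giga == 8 billion views.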
Jan 22 2018
Jan 19 2018
Jan 18 2018
Jan 12 2018
Plenty of possible different ways here. Listing the two that make the most sense to me:
Also submitted a PR to restbase: https://github.com/wikimedia/restbase/pull/941
Jan 11 2018
Jan 9 2018
Jan 5 2018
Jan 4 2018
Jan 3 2018
Eyeballing pageviews for ENWP over the past 2 years (2016 and 2017) in original Wikistats and in Wikistats v2, the numbers seem coherent: ~8g pageviews per month (8 billion, g standing for giga == billion). And 8 * 23 = 184g, which is not far from 176.
Can you tell us more precisely which discrepancies you saw in ENWP?
Actually I made a mistake yesterday: this table is not available in Hive. It is temporarily created, loaded into Druid, then deleted. I don't think we should document it as if it were available. Do you agree @Nuria and @Milimetric ?
Dec 21 2017
Metric name links and 'More info' links only work for the Reading section. The other sections don't have pages as of now.
Do we create a section in our wiki with a page for each metric? Or do we gather all of them on a single page?
Dec 19 2017
Thanks for your ticket, it is very clear and well documented :)
I'll try to give you answers to some of the things you pointed:
- We know about the chart misalignment (T182817), we will work on correcting this (this is a real bug !).
- About data mismatch with https://stats.wikimedia.org/EN/ChartsWikipediaFA.htm, it's because stats.wikimedia.org doesn't include non-content pages in its stats. You can tick the checkbox for splitting by page type, untick the 'non-content' box, and the values should match the ones in wikistats a lot more closely :)
- About the difference in page numbers you observe: it is due to redirects. The new-page metric (as well as the edited-pages one) doesn't include redirect pages. I have checked in the databases: for fa-wiki in our last import there were ~3.8M pages, including ~1.5M redirects (leaving ~2.3M pages with text, whether in content or non-content namespaces).
Dec 18 2017
Dec 15 2017
Thanks for this ticket.
As most data in wikistats-v2 is updated monthly, we decided to show only last month's top pageviews.
This might however change in the future :)
Dec 14 2017
I don't understand this bit ...
The time selector works well on any other page (with charts). The "top" page is special in that it doesn't show time, just a list - I assume this is why.
You can however trust the other charts :)
Thanks @Kipala :), good catch ! Seems related to the dates of the data. In Wikistats 2, data is pulled from 2015-10 onward. The time selector at the back doesn't actually mean anything for that specific page ...
Dec 12 2017
@Ottomata : Super cool ! Many thanks :)
Dec 10 2017
Dec 8 2017
@Pchelolo: It has indeed happened.
The task has been moved to done on our kanban; we'll resolve it after we finalize the discussion :)