Fri, Jun 11
I ran a query and found that the problem is due to some page titles not being correctly decoded, showing up as '?'. Since the top metric doesn't differentiate by page-id, multiple pages sharing the same wrong title get bundled together:
spark.sql(""" SELECT page_title, count(1) as c from wmf.mediawiki_history where snapshot = '2021-05' and wiki_db = 'enwiktionary' and date(event_timestamp) >= '2021-05-01' and event_entity = 'revision' and page_namespace_is_content group by page_title order by c desc limit 10 """).show(10, false)
Tue, Jun 8
Mon, Jun 7
Tue, Jun 1
Mon, May 31
Wed, May 26
Actually this table is now production-style on the cluster, at path hdfs:///wmf/data/wmf/wikidata/item_page_link, or as the Hive table wmf.wikidata_item_page_link.
It is released weekly and takes advantage of page creation/deletion/move events to be as precise as possible (we have monthly snapshots of the page table, and get the current month's info from events).
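A minimal usage sketch (the field names item_id, wiki_db and page_title, as well as the weekly snapshot value, are assumptions here - adjust to the actual schema):

spark.sql("""
  SELECT item_id, wiki_db, page_title
  FROM wmf.wikidata_item_page_link
  WHERE snapshot = '2021-05-24'
  LIMIT 10
""").show(10, false)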
Tue, May 25
The problem I see with using a generic class in the QueryElem object is the conversion to parquet. I don't think it'll work out of the box, which would force us to devise our own conversion. Let's brainstorm ideas on this, possibly in a meeting to make it faster :)
May 21 2021
May 20 2021
Ack - doing so - thanks @elukey
@LSobanski You're absolutely right, this task is about documenting on our end the priorities and sizes of datasets to be backed up, so that we can better inform next steps (including potential implementation) later.
Closing this task :) Thanks for the great work @AKhatun_WMF
Great! Thanks for that :) Closing the ticket.
@AKhatun_WMF That's great! Could you please provide some info on the expected data size in parquet (for daily data, for instance)? Many thanks.
May 19 2021
The new field is in turnilo with data starting from May 18th 2021.
Resolving the task :)
Resolving for now - please reopen if needed :)
May 18 2021
Heya @EBernhardson, not having canary events in the refined data is expected: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/TransformFunctions.scala#L50
I nonetheless think that partitions should be added even if the dataset is empty - ping @Ottomata on this.
The clickstream algorithm resolves one step of redirects, meaning that if page A redirects to page B, views for page A are counted for page B. Multi-step redirects are only resolved one step, meaning that if for instance A -> B -> C and A is visited, the view counts for page B.
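A rough sketch of that single-step resolution in Spark SQL (the pageviews and redirects tables and their page/source/target/views fields are hypothetical, just to illustrate that the redirect map is joined only once):

spark.sql("""
  SELECT coalesce(r.target, v.page) AS counted_page, sum(v.views) AS views
  FROM pageviews v
  LEFT JOIN redirects r ON v.page = r.source
  GROUP BY coalesce(r.target, v.page)
""").show(10, false)

Since the join happens once, views of A land on B even if B itself redirects to C.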
@diego: page-titles in clickstream use _ as separator, not space!
@hnowlan: Logs at DEBUG level (prepare your eyes, those are verbose!)
May 17 2021
And the feature request: https://github.com/apache/druid/issues/11264
Also: one way to get results is to set the time-grain to 'original value'. This makes Calcite use the topN query (a single field in the group-by instead of two). You'll get daily values instead of monthly, but at least you'll have values :)
@hnowlan: Here is a way to access failure logs from today's job (when the host was down):
sudo -u analytics kerberos-run-command analytics yarn logs --applicationId application_1620304990193_40662 | less
@CDanis the patch for Druid is there - sorry for not having acted quicker.
TL;DR: This problem comes from how queries are translated from SQL to a Druid query plan. I don't have a solution for this :(
Hi @SNowick_WMF, I double checked the number of expected rows and got 11161, not 80633 as you mentioned.
May 12 2021
The plan looks great @razzi , and the comments as well!
Just a few nits on some small things.
Hi @Pablo - Do you know in which DB the data is stored? If it is in the centralauth one, we don't have it. This task should then become adding data from centralauth to the lake and to the mediawiki_history dataset.
May 11 2021
Hi - I am trying to make this happen.
Data for the wikidata project is very big (many edits, plus the itemquality model on top of the other ones). Do you need it, or can I leave this project out (the export would then be all models, for all edits, of all projects except enwiki and wikidatawiki)?
May 6 2021
May 4 2021
Thanks @kzimmerman for the heads up :)
On our side, we haven't forgotten about improving the heuristics.
May 3 2021
Apr 29 2021
- A roughly constant number of distinct IPs and user-agents per hour over the day (~180 IPs, ~450 user agents, fewer during the low hours of the circadian pattern)
- Despite being categorized as 'desktop' and 'mobile-web', all the views are from mobile-web, with Android being a good citizen and sending detailed user-agent info, and iOS not so much, making its requests through Pandas-VPN on the desktop site with a non-detailed user-agent.
- I looked at some IPs from the set, and they are from different cloud/dedicated-server providers.
You say I will need to "start reworking some of your script to Airflow" – are there any help materials about what needs to be done?
Apr 27 2021
Hi @Urbanecm - Sorry for the late reply, I wanted to discuss with the team, and it happened yesterday.
Apr 21 2021
Apr 20 2021
Apr 19 2021
Follow up questions after having talked to the team:
- How frequently does the job need to be run, and new data released?
- If not a one-off, would we have a process to review again if your scripts change?
Hi @Urbanecm, thank you for pinging us on this :)
The usual pattern for data publication is to ask for an approval through a security review.
I have quickly checked your code and it seems that the data you use to generate your dataset is already public - in that case the review might not even be needed.
I'll confirm the procedure to follow with the team, and will also ask for minor changes in your code (for instance accessing the Hive user table instead of the production-replica one, to prevent leaking potential PII).
@kzimmerman Hi - Is this task something your team could look at? I have triple checked and confirm that at least a few of the listed pages show unnatural patterns:
- F5 Networks: https://pageviews.toolforge.org/pageviews/?project=en.wikipedia.org&platform=desktop&agent=user&redirects=0&range=latest-90&pages=F5_Networks
- Google Logo: https://pageviews.toolforge.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&redirects=0&range=latest-90&pages=Google_logo
I managed to have this working in my personal database. Can we sync on this via IRC, @nettrom_WMF?
Something to note: Hive separates table metadata from storage. When using external tables in Hive, dropping the tables only deletes the metadata, not the data itself:
hdfs dfs -du -s -h /user/hive/warehouse/bd808.db/*
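A minimal sketch of that behaviour (hypothetical table name and location, run from spark-shell for instance):

// Create an external table pointing at an existing HDFS location (hypothetical names)
spark.sql("""
  CREATE EXTERNAL TABLE bd808.my_table (id BIGINT, title STRING)
  STORED AS PARQUET
  LOCATION 'hdfs:///user/hive/warehouse/bd808.db/my_table'
""")
// Dropping it removes only the Hive metadata; the files stay at the location above
spark.sql("DROP TABLE bd808.my_table")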
Info: there is already a job on the cluster doing the TTL -> RDF conversion. The TTL dumps are imported weekly, and converted to Blazegraph RDF once available.
The job is maintained by the Search Platform team (ping @dcausse :).
Apr 8 2021
Ping @kzimmerman on the above comment - Let's synchronize on who does what :)
Apr 7 2021
Thanks for your suggestion @Marostegui.
The global drift is not big (this month took 4h more than the previous one, less than 10% increase overall).
As discussed with @Milimetric there would be multiple options to try to make the overall process faster, but we are not going to prioritize this for now.
Let's close and reopen if needed.
I have implemented some more logic to get the files we need, so no real need to change here.
This task was more about things to keep in mind if for instance filenames change at some point :)
Feel free to close it if it's not useful. Thank you for your explanations :)
Thank you for the explanation @ArielGlenn.
Let me clarify my 2 concerns (they are minor):
- Job names differ for the same output in dumpstatus.json: for small wikis you should look at xmlstubsdump, while for big ones you should look at xmlstubsdumprecombine (this makes it hard to monitor all projects).
- Filenames share the same pattern between different jobs, making it confusing to get data across multiple projects with a single job. For pages-meta-current, you should get PROJECT-DATE-pages-meta-current.xml.bz2 even if PROJECT-DATE-pages-meta-current*.xml*.bz2 files exist, since small projects won't have the split files and you want all projects to match. For pages-meta-history, you should get PROJECT-DATE-pages-meta-history*.xml*.bz2, as there are supposedly never both single files and split-by-pages files (see the sketch below).
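A rough sketch of the two matching rules above (the dumps base path and the project/date values are hypothetical, just to show the globs):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val dumpsBase = "hdfs:///some/dumps/base"  // hypothetical location
val project = "enwiki"                     // hypothetical example values
val date = "20210401"

// pages-meta-current: match only the single recombined file, never the split parts
val current = fs.globStatus(
  new Path(s"$dumpsBase/$project/$date/$project-$date-pages-meta-current.xml.bz2"))

// pages-meta-history: match all files, whether split or not
val history = fs.globStatus(
  new Path(s"$dumpsBase/$project/$date/$project-$date-pages-meta-history*.xml*.bz2"))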
+1 on the approach (updating the task description for details)
Apr 6 2021
There you go @Qgil :)
Apr 1 2021
We could do:
- fifo - 5%
- default - 35%
- production - 50%
- essential - 10%
I confirm the data is fixed for snapshot=2021-02 - let's keep this open as a reminder to monitor the next snapshot.