Tue, Feb 18
It was done a few days ago :)
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Content/XMLDumps/Mediawiki_wikitext_current exists as well.
Mon, Feb 17
I tried to validate the approach of using normalized_host.project and normalized_host.project_family instead of pageview_info[project] for @Ladsgroup...
About the pageview query, @Nuria is right: partition predicates need to reference single partition fields with simple expressions (the same as what you do in the comments). Otherwise the engine can't determine which partitions to read, so it reads all of them and then filters on their values.
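To illustrate (a minimal sketch assuming a pyspark2 shell where `spark` is the session, and a table partitioned by year/month/day like wmf.pageview_hourly; the dates are made up):

# BAD: the partition fields are wrapped in an expression, so the engine
# cannot tell which partitions match; it reads them all, then filters.
spark.sql("""
    SELECT COUNT(1) FROM wmf.pageview_hourly
    WHERE CONCAT(year, '-', month, '-', day) = '2020-2-1'
""").show()

# GOOD: each partition field is compared to a constant on its own,
# so only the matching partition folders are read.
spark.sql("""
    SELECT COUNT(1) FROM wmf.pageview_hourly
    WHERE year = 2020 AND month = 2 AND day = 1
""").show()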
Thu, Feb 13
After a (not so) quick audit, growth usage is due to the problem described in T245124.
Wed, Feb 12
Tue, Feb 11
Things to consider IMO while doing this work: how do we separate regularly running jobs so that
- Hosts can handle them (computation/RAM)
- They are not subject to interference from user jobs
- There is (at least some) separation of concerns
- We understand the setup (at least somewhat, as well)
Hi @ArielGlenn, I tested reading yowiki-20200101-pages-articles-multistream.xml.bz2 successfully :)
Thanks for the heads-up :)
Hi @leila,
I have checked some rows, and there doesn't seem to be any parsing error: comment values are the same in MySQL, the sqooped data, and the processed data for the rows I checked.
Also, when looking at revisions that happened on the same day as the one holding the comment you mention, that comment pattern is recurrent:
spark.sql("select event_timestamp, revision_id, event_comment from wmf.mediawiki_history where snapshot = '2020-01' and wiki_db = 'yowiki' and event_entity = 'revision' and date(event_timestamp) = '2012-04-04'").show(1000, false)
Mon, Feb 10
Fri, Feb 7
Nothing to add to what Andrew said. Adding partitions whose folders don't follow the field=value convention needs to be done 'manually' (it can be done through a script, but with explicit single-partition commands).
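For reference, a sketch of such a script (hypothetical table name and folder layout, assuming a pyspark2 shell where `spark` is the session):

# One explicit command per partition, since the folder names carry no
# field=value information Hive could discover on its own.
base = "/wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02"
for hour in range(24):
    spark.sql(f"""
        ALTER TABLE my_db.my_raw_events
        ADD IF NOT EXISTS PARTITION (year=2019, month=12, day=2, hour={hour})
        LOCATION '{base}/{hour}'
    """)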
Thu, Feb 6
Hi again @EYener,
Partitions in Hive are a SQL representation of folders. Adding a partition only tells Hive that it should look into a given folder to find the files related to the partition values (for instance, data for partition (year=2019, month=12, day=02, hour=16) is in folder /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/16). This scheme allows queries on big tables to skip reading data from partitions they don't need.
Something else to know: the default way for Hive to encode partition information in folder names is field=value (for instance /my/table/year=2020/month=12/...). But Hive also allows linking partitions to paths that are not field=value structured (as in the first example).
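As an illustration of the difference (hypothetical table names, again from a pyspark2 shell):

# Default field=value layout (/my/table/year=2020/month=12/...): Hive can
# discover all partitions automatically from the folder names.
spark.sql("MSCK REPAIR TABLE my_db.my_table")

# Non field=value layout (as in the first example): that discovery is
# impossible, and each partition must be linked to its folder explicitly
# with ALTER TABLE ... ADD PARTITION (...) LOCATION '...' commands.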
Finally, adding a partition doesn't give you any information about data size, as it only links partition values to a folder. The easiest way to get some information on data size without querying is to look at the folder size. For instance, on a stat machine:
hdfs dfs -du -s -h /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/16
10.9 M  32.8 M  /wmf/data/raw/eventlogging/eventlogging_CentralNoticeBannerHistory/hourly/2019/12/02/16
You can set up the database yourself: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive#Create_your_own_database
Also if you're new to hive, I suggest you have a look at the page above and related pages :)
You should actually be able to create the table yourself. I took the opportunity to write https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive#Create_your_own_database .
You can also come and ping us on irc chan #wikimedia-analytics :)
- About the query:
use cps; show partitions centralnoticebannerhistory20191202;
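(For completeness, a rough sketch of the kind of commands involved, with hypothetical database, table, and column names, issued here through spark.sql from a pyspark2 shell:)

# Create a personal database, then a partitioned table inside it.
spark.sql("CREATE DATABASE IF NOT EXISTS my_user_db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_user_db.my_banner_history (
        raw_event STRING
    )
    PARTITIONED BY (year INT, month INT, day INT, hour INT)
""")
spark.sql("SHOW PARTITIONS my_user_db.my_banner_history").show(truncate=False)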
Wed, Feb 5
Moving the zika-research folder to one-off is a good idea, as the latter already contains research datasets, as you noticed. Keeping it as-is is also not problematic. I'll let you decide @Miriam :)
Tue, Feb 4
Hi @Miriam :)
From stat1007, directories are synced from /srv/published/datasets to https://analytics.wikimedia.org/published/datasets/
There is more in the web URL than in the folder, as data is synced from various places to the same output URL.
I think creating a research subfolder, and then related folders for your project, should be ok.
@Ottomata anything I missed?
Tue, Jan 28
Jan 24 2020
Is this now possible via Spark?
Hi @Iflorez - while my understanding of this is not precise enough, I can already say that the wikidata content is not yet present on the cluster in a productionized way.
So possibly the computation is feasible, but a productionized version of it is not, yet.
Jan 23 2020
Jan 22 2020
In an email, I suggested using a Parquet table to handle DataQuality values, and I think that would indeed help.
To read value-separated text files, Hive uses org.apache.hadoop.mapred.TextInputFormat, a subclass of org.apache.hadoop.mapred.FileInputFormat.
TextInputFormat implements only a small number of methods, and in particular it does not override the two methods of FileInputFormat that contain the failing StopWatch: getSplits (the one we experience the failure with) and listStatus.
On the other hand, org.apache.parquet.hadoop.ParquetInputFormat, also a subclass of FileInputFormat, overrides both of those methods, so it shouldn't fail :)
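As a sketch of what the suggestion amounts to (hypothetical table name and columns; the point is only the STORED AS PARQUET clause; assumes a pyspark2 shell providing `spark`):

# A Parquet-backed table: reads go through ParquetInputFormat, which
# overrides getSplits/listStatus, instead of TextInputFormat falling back
# to FileInputFormat's failing implementation.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_db.data_quality_metrics (
        metric_name  STRING,
        metric_value DOUBLE,
        computed_at  TIMESTAMP
    )
    STORED AS PARQUET
""")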
Jan 21 2020
First step done: Having dask running on our cluster.
Patch merged upstream, will be released as part of commons-compress 1.20
Will follow up to try to force the use of that version once it's published.
@dr0ptp4kt Thanks for making me see things correctly :)
I think you need to kdestroy from the ssh session to clean that session, then kdestroy and kinit again in the notebook terminal to start fresh.
In the meantime, might you be able to use bzip2 to decompress and recompress the problem file(s) so that you can get your import into hadoop done?
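The same recompression can also be sketched in pure Python, if that's easier than the bzip2 CLI (the filename is assumed to be the problem file; the rewrite produces a single-stream bz2 archive):

import bz2
import shutil

# bz2.open reads multistream archives transparently; writing back through
# bz2.open produces a single-stream file.
src_path = "yowiki-20200101-pages-articles-multistream.xml.bz2"
dst_path = "yowiki-20200101-pages-articles-recompressed.xml.bz2"
with bz2.open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
    shutil.copyfileobj(src, dst)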
Jan 20 2020
Jan 17 2020
Jan 15 2020
Git tells me that we bumped guava from 12 to 18 when adding the jsonschema loader.
IIRC the com.github.java-json-tools json-schema-core dependency needs jackson-coreutils, which in turn needs guava, and @Ottomata and I experienced errors with version 12.
Jan 13 2020
Jan 10 2020
Jan 9 2020
Latest development on my end:
- Oozie worked with Luca's patch above
- spark-submit with python worked as well
- pyspark2 still fails
Jan 8 2020
The patch merged by @Nuria had a bug. I commented on the already-merged patch with a solution. For the moment the job is not started.
Jan 7 2020
@Nuria: No email was sent as I discovered the problem before the SLA limit.
Current limit for mediawiki-history jobs is at 39 days (31 + 8), since sqooping was taking a lot longer before.
We probably could set it to 34 to be more reactive (sqoop has regularly finished in less than 24 hours in the past months).
My use-case is in https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/http.py, in which we currently use urllib.
I'll try to update the file to use requests.
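Roughly the direction I have in mind (a sketch, not the actual refinery code; the helper name is hypothetical):

import requests

def fetch_json(url, timeout=60):
    """Hypothetical helper illustrating the urllib -> requests move."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on HTTP 4xx/5xx instead of failing silently
    return response.json()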
No problem adding page_is_redirect. My only concern is that it should be documented and understood for what it is: page_is_redirect is present on every event, but its value only reflects the current (latest) revision of the page.
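To make the caveat concrete (a sketch against wmf.mediawiki_history from a pyspark2 shell; the snapshot/wiki values are just examples):

# page_is_redirect is set on every revision event, but it says whether the
# page is a redirect NOW (latest revision), not at event time: this query
# counts edits to pages that are redirects today.
spark.sql("""
    SELECT COUNT(1) AS edits_to_current_redirects
    FROM wmf.mediawiki_history
    WHERE snapshot = '2020-01'
      AND wiki_db = 'yowiki'
      AND event_entity = 'revision'
      AND page_is_redirect
""").show()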