Sat, Jun 16
I did another quick check this morning: there are some valid user-agent strings longer than 512 characters in our faulty hour (9 out of 64). The other 55 are all identical, with a length of 2035.
I have also successfully parsed user-agents with a length limit of 1024 over the faulty hour, and double-checked how many user-agents would not have been parsed with various limits over another full day of raw webrequest (see the query sketch after the numbers):
- Total number of rows for that day: 3626986512
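For the record, a minimal sketch of the kind of query used for this check (partition values and limit choices are illustrative):
SELECT
  SUM(IF(length(user_agent) > 400, 1, 0))  AS over_400,
  SUM(IF(length(user_agent) > 512, 1, 0))  AS over_512,
  SUM(IF(length(user_agent) > 1024, 1, 0)) AS over_1024,
  COUNT(*) AS total_rows
FROM wmf_raw.webrequest
WHERE year = 2018 AND month = 6 AND day = 15;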
Fri, Jun 15
One problem is related to user-agent parsing of very long strings. To investigate, I launched a Spark shell with the refinery job jar:
sudo -u hdfs spark2-shell --master yarn --conf spark.dynamicAllocation.maxExecutors=256 --jars /srv/deployment/analytics/refinery/artifacts/refinery-job.jar
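The parser can also be poked at from Hive; a hedged sketch (the UDF class name and partition values are assumptions from memory - double-check against refinery-hive):
ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION get_ua_properties AS 'org.wikimedia.analytics.refinery.hive.GetUAPropertiesUDF';
-- Look only at the very long user-agents from the faulty hour
SELECT get_ua_properties(user_agent)
FROM wmf_raw.webrequest
WHERE webrequest_source = 'text'
  AND year = 2018 AND month = 6 AND day = 15 AND hour = 3
  AND length(user_agent) > 1024
LIMIT 10;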
Thu, Jun 14
Tue, Jun 12
Fri, Jun 8
Tue, Jun 5
Mon, Jun 4
For Hive to support JSON files with one record per line, the hcatalog jar needs to be explicitly added to the session (see https://github.com/wikimedia/analytics-refinery/blob/master/hive/webrequest/create_webrequest_raw_table.hql#L19). I assume Hue doesn't do it by default. Let's keep this ticket open to see if we can do anything about this.
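For reference, a minimal sketch of what a session needs before querying the JSON-backed raw table (the jar path is the usual one on our Hive clients but may differ; partition values are illustrative):
ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;
SELECT dt, uri_host
FROM wmf_raw.webrequest
WHERE year = 2018 AND month = 6 AND day = 4
LIMIT 10;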
On a related but different matter: wmf_raw shouldn't be used by regular users. wmf is preferred, as its data is stored as Parquet, etc.
Tue, May 29
@Gilles: Feel free to ping me when you're in if you want some help with the data or the way to play with it.
Tue, May 22
Adding analytics tag :)
May 17 2018
May 15 2018
May 10 2018
May 8 2018
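-- Sanity check: compare per-entity event counts across the last three snapshots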
select snapshot, event_entity, count(*) from mediawiki_history where snapshot in ('2018-03', '2018-02', '2018-04') group by snapshot, event_entity;
May 2 2018
Apr 30 2018
Apr 27 2018
Apr 26 2018
Apr 25 2018
Apr 24 2018
@Nuria: the actions fixed the problem for data up to 2018-02. I restarted a job ending in 2018-03, as the problem is not related to snapshots but to wrong indexation while testing. Will follow up later today when back from my day off.
@Milimetric and @Nuria: This problem is due to me having tested the new mediawiki-reduced job without disabling its indexation step :( I used fake data to test the job, so fake data got indexed as well.
I'm super sorry about that. I have launched a manual reindexation job; this should be fixed during the day.
Apr 23 2018
Hi @GoranSMilovanovic,
The problem I see in your code is that you instantiate the dataframe as an R structure and then convert it to Spark.
The first step involves creating the dataframe and loading it in R, which happens only on the driver. Since your driver has 4g of RAM, you get a memory error.
When dealing with big datasets, you should use Spark's reading functions (they don't load the full dataset into the driver):
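# read.df returns a SparkDataFrame backed by the executors; the file is never fully loaded into the R driver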
df <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "NA")
Maybe you could try that?
+1 for commenting the global check :)
Apr 20 2018
Due to https://gerrit.wikimedia.org/r/#/c/388265/ not having been reflected in wmf.mediawiki_user_history and wmf.mediawiki_page_history, we expected field-definition issues.
After memory tricks from @elukey, both Hadoop indexation and realtime indexation went fine (without any change - incredible).
Let's plan on an update next week for the druid-analytics cluster.
Apr 19 2018
First step of testing confirmed on labs with Druid 0.9.2:
- Indexation from Hadoop
- Realtime indexation with Tranquility
I am not fluent in SparkR, but here are a few thoughts:
- The example config you wrote in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#SparkR_in_production_(stat100*_machines)_example looks super fine to me. I'm not sure about the dynamic allocation for SparkR though (see below).
- It would be interesting to check whether Spark actually launches with dynamic allocation (looking at the Spark UI from yarn.wikimedia.org).
- Using a single executor for a machine-learning algo implemented in MLlib sounds weird. Let's double check the config as stated above.
Apr 18 2018
Another round of discussion with the team:
- Quality checks should happen before data gets loaded into Druid.
- Since T155507, we now have statistics over the data generated by the Mediawiki-history reconstruction job. The first layer of data-quality checking should happen there (subtask: T192481).
- Another layer of data-quality checks should be done over the mediawiki-history-reduced dataset. This implies keeping the data instead of deleting it after Druid indexation (subtask: T192482). A new job step would then check data similarity between the previous and current snapshots (subtask: T192483) - see the sketch after this list.
- With those checks satisfied, we are ok to index the data in Druid; then cache-warming and datasource-swap should happen (no task yet).
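A rough sketch of what the snapshot-similarity check could look like (table name and snapshot values are illustrative, since the kept mediawiki-history-reduced dataset doesn't exist yet):
-- Large relative deltas in per-entity event counts between snapshots flag a problem
SELECT
  cur.event_entity,
  prev.c AS previous_count,
  cur.c  AS current_count,
  (cur.c - prev.c) / prev.c AS relative_growth
FROM
  (SELECT event_entity, COUNT(*) AS c FROM mediawiki_history_reduced
   WHERE snapshot = '2018-04' GROUP BY event_entity) cur
JOIN
  (SELECT event_entity, COUNT(*) AS c FROM mediawiki_history_reduced
   WHERE snapshot = '2018-03' GROUP BY event_entity) prev
ON cur.event_entity = prev.event_entity;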
Apr 16 2018
Apr 12 2018
Quick note: Knowing the domain of any project, it's relatively easy to extract the project-family and the language (if any).
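For instance, a rough HiveQL sketch (illustrative only - a real normalization has to handle multi-part domains like commons.wikimedia.org and mobile subdomains):
SELECT
  split('en.wikipedia.org', '\\.')[0] AS language,        -- 'en'
  split('en.wikipedia.org', '\\.')[1] AS project_family;  -- 'wikipedia'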
Apr 6 2018
Apr 5 2018
Hi @Neil - Would wmf_raw.mediawiki_project_namespace_map satisfy the need? This table is updated every month (snapshot partition) and is defined as explained here on GitHub.
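For instance (the snapshot value is just an example):
SELECT *
FROM wmf_raw.mediawiki_project_namespace_map
WHERE snapshot = '2018-03'
LIMIT 10;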
Mar 30 2018
@Milimetric: I have modified https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid#Delete_segments_from_deep_storage for a better understanding of the issue you encountered.
I think you were trying to delete data that was still available in historical nodes - and Druid doesn't let you do that.
I first disabled the datasource in the coordinator UI, then used the command you pasted, with an added parameter to skip the datasource-availability check (since I had disabled it):
/srv/deployment/analytics/refinery/bin/refinery-drop-druid-deep-storage-data -d 1 -v mediawiki-geowiki-daily --no-datasource-check
It worked as far as I can tell:
hdfs dfs -ls /user/druid/deep-storage/mediawiki-geowiki-daily
ls: `/user/druid/deep-storage/mediawiki-geowiki-daily': No such file or directory