I'm not sure if the spark-cassandra-connector can read a Java truststore from HDFS! I'd go for an automated deployment of the truststore on every cluster host. For the moment that will be enough, as our prod jobs are launched from the cluster (skein). It would probably also be good to have the truststore deployed on stat machines, to allow for manual runs. This should be enough for now, until we move launchers away from skein to k8s - we'll revisit at that time (ping @BTullis :)
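For reference, a minimal PySpark sketch of the connector settings this implies, assuming the truststore is deployed as a local file on every host (path and password below are placeholders, not our actual config):

```python
from pyspark.sql import SparkSession

# Sketch only: spark-cassandra-connector SSL options pointing at a local
# truststore file, which is why the file needs to exist on every cluster host.
spark = (
    SparkSession.builder
    .appName("cassandra-load-example")
    .config("spark.cassandra.connection.ssl.enabled", "true")
    .config("spark.cassandra.connection.ssl.trustStore.path", "/etc/ssl/localcerts/truststore.jks")
    .config("spark.cassandra.connection.ssl.trustStore.password", "********")
    .getOrCreate()
)
```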
Thu, Apr 18
Global execution times have been divided by 3 (10 minutes for 170 jobs). We are using a new launchers queue to launch small jobs and have scaled the Airflow parallelization to 10 tasks. We can replicate this model for other jobs :)
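For anyone wanting to replicate this on their own DAGs, a minimal sketch of the pattern (queue name and task are illustrative, not our actual DAG code):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Sketch only: small launcher tasks go to a dedicated Celery queue and the DAG
# is allowed to run up to 10 of them in parallel.
with DAG(
    dag_id="example_small_jobs",
    start_date=datetime(2024, 4, 1),
    schedule=None,
    max_active_tasks=10,  # the "parallelization of 10 tasks" mentioned above
) as dag:
    BashOperator(
        task_id="launch_small_job",
        bash_command="echo launch",
        queue="launchers",  # hypothetical dedicated queue for small launcher jobs
    )
```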
I think @Ottomata's idea is good: having another column makes it easy to keep the "monotonic" values, while still having a de-duplication key with the new field.
Some of the big files listed above are due to the dumps job not splitting files (for cawiki and cswiki for instance).
For the rest, big files come from big pages (many revisions and big text):
spark.sql("""
  SELECT
    wiki_db,
    page_id,
    count(1) as revision_count,
    sum(revision_text_bytes) as text_weight
  from wmf.mediawiki_history
  where snapshot='2024-03'
    and event_entity = 'revision'
    and event_type = 'create'
    and not revision_is_deleted_by_page_deletion
  group by wiki_db, page_id
  order by text_weight DESC
  limit 20
""").show(100, false)

+-----------+--------+--------------+------------+
|wiki_db    |page_id |revision_count|text_weight |
+-----------+--------+--------------+------------+
|enwiki     |5137507 |1346638       |463589202714|
|ruwiki     |205407  |327734        |98327067901 |
|frwiki     |7846555 |213413        |95076531961 |
|dewiki     |9082349 |373094        |90919451888 |
|commonswiki|1894972 |750375        |88226610099 |
|enwiki     |5149102 |411056        |76082914673 |
|enwiki     |36395484|400807        |74132557523 |
|ruwiki     |148254  |131666        |72615443749 |
|enwiki     |2535910 |505135        |70774375375 |
|dewiki     |7076401 |218421        |66236030573 |
|enwiki     |11424955|220512        |64807978487 |
|enwiki     |972034  |401277        |61837314055 |
|enwiki     |68479621|74790         |57447551495 |
|enwiki     |1470141 |411101        |55771136049 |
|dewiki     |6529924 |200046        |55555153937 |
|zhwiki     |84599   |577396        |54912852986 |
|enwiki     |34745517|378634        |50069312070 |
|zhwiki     |284591  |165702        |49405881699 |
|ruwiki     |15920   |202182        |49323471028 |
|hewiki     |13822   |216934        |48579444220 |
+-----------+--------+--------------+------------+
Wed, Apr 17
The problem has been fixed.
The bug was introduced when we migrated the downstream jobs of the unique-devices tables to use the new Iceberg tables. Druid loading of unique devices happens in 3 jobs for each unique-devices type (per-domain and per-project-family): a daily job for daily uniques, a monthly job for monthly uniques, and a monthly job that compacts daily uniques into monthly segments - it's this last job that was causing issues.
The bug was wrongfully using the first-of-the-month date parameter instead of the table's day field as the ingestion date: data for every day of the month was labelled with the 1st of the month.
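To illustrate the shape of the fix (not the actual patch - table and column names here are assumptions), the compaction query has to derive the ingestion date from each row's day field rather than from the month-start parameter:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unique-devices-compaction-example").getOrCreate()

# Buggy shape: every row of the month was labelled with the month-start parameter,
#   SELECT TO_DATE('{month_start}') AS dt, ...
# Fixed shape: use the row's own day field as the ingestion date.
df = spark.sql("""
    SELECT day AS dt, domain, uniques_estimate
    FROM unique_devices_per_domain_daily
    WHERE day >= DATE '2024-03-01' AND day < DATE '2024-04-01'
""")
```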
Tue, Apr 16
Mon, Apr 15
Fri, Apr 12
Thu, Apr 11
Indeed, we want people to use Spark or Presto instead of Hive, and this is a good example of why :)
Fri, Apr 5
Thank you so much @hashar for unblocking us!
I prefer the "by functionality" organization, for separating schema vs data code.
I think we need the two different functions so that the Iceberg one deletes data before inserting. And actually this could be discussed as well: I think we want this behaviour by default in the Iceberg write function - do you agree?
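As a sketch of what I mean by delete-before-insert being the default (table and column names are illustrative, not our lib code), Spark's DataFrameWriterV2 already gives that semantic on Iceberg tables:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-write-example").getOrCreate()

# Hypothetical daily dataframe; the target Iceberg table is assumed to exist
# and to be partitioned by day.
df = spark.createDataFrame(
    [("2024-04-01", "en.wikipedia", 42)],
    ["day", "domain", "view_count"],
)

# overwritePartitions() deletes and rewrites only the partitions present in df,
# so re-running the same day is idempotent (the delete-before-insert behaviour).
df.writeTo("iceberg_catalog.example_db.example_table").overwritePartitions()
```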
Thu, Apr 4
I have been wondering about how to organize this code.
Indeed, I didn't want to replicate the DataFrameToHive pattern because of its apply function, and I also wanted to avoid putting schema-management and data-management code in the same place, as we are trying to split them functionally.
Should we have two lib files, one for schema and one for data, covering both Hive and Iceberg? Or a single file doing both, as it is now?
Done using the airflow variable.
I also sent a PR to have the defaults set: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/642
Sat, Mar 30
Good catch milimetric! Reviewing right now, will deploy this next week.
Wed, Mar 27
Thanks a lot @nshahquinn-wmf :)
In T356363#9665322, @BTullis wrote: Would we still want this integrated email functionality within refinery, when it's running under airflow?
Tue, Mar 26
We currently have use cases doing exactly this that work; there must have been another issue than the one described here. I think this ticket is invalid.
Done using the Airflow variable mechanism.
Mar 21 2024
I have run our script listing user content on our various machines; the result is below.
@AndrewTavis_WMDE, I'll let you review - let us know when you have copied the things you wish to keep, so that we can delete the rest.
Mar 6 2024
Feb 29 2024
I think we're gonna use this ticket: https://phabricator.wikimedia.org/T262201
Nothing done on my end - possibly one of the 2 jobs failed for real?
Indeed, the job will not be affected by next month's changes:
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blame/main/analytics/dags/clickstream/clickstream_monthly_dag.py#L66
We'll need to keep looking for when those change though :)
Feb 28 2024
We're gonna build a quickfix so that next month's sqoop succeeds (null values in the dropped fields for some projects).
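Roughly, the quickfix amounts to projecting literal NULLs for the columns that no longer exist on already-migrated wikis, so the sqooped output keeps a stable schema. A sketch of the idea (column names follow the pagelinks/linktarget migration and are assumptions here, not the actual refinery code):

```python
# Sqoop free-form query shape for a wiki whose pagelinks columns were dropped:
# keep the old output schema by selecting NULL for the removed columns.
PAGELINKS_QUERY_MIGRATED_WIKI = """
SELECT
  pl_from,
  pl_from_namespace,
  NULL AS pl_namespace,
  NULL AS pl_title,
  pl_target_id
FROM pagelinks
WHERE $CONDITIONS
"""
```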
Feb 27 2024
Feb 22 2024
I don't think this change would really affect query performance, but I'm in favor of doing it for the benefit of relieving some pressure on Kerberos.
Have we looked around to see if there are existing 'dataset' config formats/specs we can already use?
Feb 21 2024
Implementation plan:
- Add a new skip option in https://github.com/wikimedia/analytics-refinery/blob/master/bin/import-mediawiki-dumps#L29 to skip wikis from the wiki-list file the job reads (see the sketch after this list).
- Use this new option to skip wikidatawiki in the puppet systemd-timer setup:
- https://github.com/wikimedia/operations-puppet/blob/7707b14401ffc97e0adc136850f670c826552049/modules/profile/templates/analytics/refinery/job/refinery-import-mediawiki-dumps.sh.erb
- https://github.com/wikimedia/operations-puppet/blob/7707b14401ffc97e0adc136850f670c826552049/modules/profile/manifests/analytics/refinery/job/import_mediawiki_dumps_config.pp
- https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/analytics/refinery/job/import_mediawiki_dumps.pp
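A rough Python sketch of the skip option (the real script may differ - the flag name and wiki-list format are assumptions):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--wiki-list", required=True,
                    help="File containing one wiki db name per line")
parser.add_argument("--skip-wikis", default="",
                    help="Comma-separated wiki db names to skip, e.g. wikidatawiki")
args = parser.parse_args()

skipped = {w for w in args.skip_wikis.split(",") if w}
with open(args.wiki_list) as wiki_list_file:
    wikis = [line.strip() for line in wiki_list_file
             if line.strip() and line.strip() not in skipped]

for wiki in wikis:
    print(f"importing dumps for {wiki}")  # placeholder for the real import step
```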
Feb 12 2024
Thank you so much @Ladsgroup for the recap.
Feb 8 2024
Hi @Ladsgroup,
I have a question for you: have all the projects been migrated to using the new linktarget table for the pagelinks table, even if their columns have not been removed?
I'm asking this for us to adapt our sqoop jobs, as we're starting to experience issues (only testwiki this month).
Feb 7 2024
This has started: the testwiki schema has changed.
I'd also like to talk about https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L622 as the linktarget table is considered private now.
Feb 5 2024
Feb 1 2024
The Data Engineering team has written some code for our cassandra-loading jobs to be able to read a password from a file on HDFS:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-spark/src/main/scala/org/wikimedia/analytics/refinery/spark/utils/WmfCassandraAuthConfFactory.scala
While that could be useful, the spark-thrift server doesn't support user impersonation. The StackOverflow ticket I have read points to https://github.com/apache/kyuubi. We could investigate this.
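Going back to the password-from-HDFS idea above, the general pattern is just reading a small secret file from HDFS on the driver. An illustration only (not what WmfCassandraAuthConfFactory actually does, and the path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-secret-example").getOrCreate()

# Read a one-line password file from HDFS; the file should be readable
# only by the job's user.
password = spark.read.text("hdfs:///user/analytics/cassandra/password.txt").first()[0].strip()
```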
Jan 30 2024
Go for it :)
Jan 29 2024
Thanks a lot for not forgetting about this ticket @mfossati :)
The Data Engineering team is on the road toward providing you with (hopefully) an easy enough way to configure data deletion for your datasets.
In the meantime, manual deletion every now and then should be enough.
I don't think it's worth investing time on this before the new system comes in (probably a few months).
Is that ok for you?
Does your tooling let you control size and throughput?
Jan 26 2024
I think this is a good idea :)
The smaller size shouldn't be an issue, as we're testing functionality rather than scalability.
Jan 24 2024
The task is old but the objective is still valid IMO.
We should talk to @Eevans about this.
Closing as the strategy is to migrate to Iceberg.
Jan 23 2024
Jan 19 2024
Jan 18 2024
Blocked on https://phabricator.wikimedia.org/T355352
Jan 11 2024
Jan 10 2024
@brennen has updated my rights on GitLab, giving me owner rights on the project. Problem solved.
Jan 9 2024
The stewards subgroup is there to support Wikimedia stewards, and is therefore not the correct place for our project. We decided to put it in ci-tools, even if it's probably a bit of a stretch :)
Dec 12 2023
So ya let's go with VCL!
+1
Dec 11 2023
Now output files dropped to 1k! 🎉
Dec 7 2023
Hi @brennen - I've been told you could be the one to ask this question: I'd like to create a new gitlab project for our global JVM POM file, reused globally at the foundation (therefore not under a team's name). I have identified the ci-tools subgroup and the stewards subgroup, and wondered if you thought the latter would be good? Thanks
Dec 5 2023
You guys rock <3
I'm also eager to check whether we run into parquet-decompression issues, as I think we could. Thanks a lot for running those experiments @xcollazo :)
In T346463#9383399, @WDoranWMF wrote: @JAllemandou how complex are the changes? Is it a quick patch to get in or do we need more discussion?
If we start having data about which webrequest hits are prefetches or not, we would definitely be able to investigate! I'm in favor of moving fast and passing this header through as a new webrequest field. No change would be needed in Gobblin, only in the wmf_raw.webrequest and wmf.webrequest schemas, as well as in the refine_webrequest HQL to forward the field.
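The shape of that change would roughly be the following (the field name x_request_prefetch is a placeholder, not a decided name):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .enableHiveSupport()
    .appName("webrequest-prefetch-example")
    .getOrCreate()
)

# Add the new field to both the raw and refined table schemas...
spark.sql("ALTER TABLE wmf_raw.webrequest ADD COLUMNS (x_request_prefetch STRING)")
spark.sql("ALTER TABLE wmf.webrequest ADD COLUMNS (x_request_prefetch STRING)")
# ...and add x_request_prefetch to the SELECT list of the refine_webrequest HQL
# so the value is carried from wmf_raw.webrequest into wmf.webrequest.
```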
Thanks for the ping @Jdforrester-WMF. Data Engineering has not been using statsd as far as I know. We have helped the Performance team with some of its usage - if I recall correctly it was work done with @Krinkle and Gilles - but we have not been maintaining or using statsd.
Let's talk and see how statsd is used nowadays, as the docs are old and talk about Graphite: https://wikitech.wikimedia.org/wiki/Graphite#Data_sources, https://wikitech.wikimedia.org/wiki/Statsd