Today
Oh jeez there's also @wikimedia/url-get that I just found
node-rdkafka-prometheus is now on GitLab. KafkaSSE looks like it's going to be a bit more difficult since the tests require a full Kafka setup and I can't seem to even run it locally... although I guess I can skip getting CI to work for it for now
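If I do end up needing the tests locally, a throwaway single-node Kafka in Docker might be enough to get them running (sketch only; the image tag, port, and container name are assumptions, not anything KafkaSSE's test setup is known to expect):
# start a single-node KRaft broker listening on localhost:9092
docker run -d --name kafka-sse-test -p 9092:9092 apache/kafka:3.7.0
# run the KafkaSSE test suite against localhost:9092, then clean up
docker rm -f kafka-sse-test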
Also added node-rdkafka-prometheus to the list
Yesterday
Seems like we missed KafkaSSE during the GitLab migration
Sat, Oct 5
Wed, Oct 2
I did a manual ingestion and was able to see the tables on datahub if I access them directly through a URL:
https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.cirrus_index,PROD)/Schema?is_lineage_mode=false&schemaFilter=
Tue, Oct 1
We might need to simply ingest all the tables
I can probably take a look at why the table match isn't working; the next thing we could try is providing a custom transform function
Mon, Sep 16
Thu, Sep 12
I don't know how/where spark's appName is autogenerated, but for dags to use spark lineage we should require them to also define a static appName, or else there will be a new pipeline + task(s) for every dag run
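Something like this is all I mean (the job name and query file are made up; the point is just pinning spark.app.name instead of letting Spark generate one per run):
spark3-sql --master yarn --conf spark.app.name=my_dag__my_task -f my_job.hql
That way every run of the task reports lineage under the same pipeline/task instead of creating a new one each time.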
Sun, Sep 8
- Eventstreams is now successfully deployed in the beta cluster with service-utils
- It turns out that KafkaSSE uses bunyan, so logging is a bit weird since the service now emits logs in 2 formats
- Beta logstash does not seem to capture logging from stdout, so it only shows the logs from KafkaSSE. However, the eventstreams logs do exist when looking inside the docker container
Aug 29 2024
Aug 24 2024
Aug 21 2024
I created the databases sandbox and sandbox_iceberg and also created an interlanguage_navigation table in sandbox:
Aug 20 2024
What should the databases be called and where should they live? Should we have separate databases for hive and iceberg tables?
/wmf/data/wmf_test and /wmf/data/wmf_test_iceberg?
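For example, if we go that route (just a sketch of the naming/location question above; nothing here is decided or created):
CREATE DATABASE IF NOT EXISTS wmf_test
LOCATION '/wmf/data/wmf_test';

CREATE DATABASE IF NOT EXISTS wmf_test_iceberg
LOCATION '/wmf/data/wmf_test_iceberg';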
Yeah I think we should prioritize that.
It seems like right now, unless we upgrade to at least Spark 3.4 and Iceberg 1.4, we will not be able to use Datahub's spark lineage connector on iceberg tables
I ran a job using our regular prod configs just without iceberg tables. It ran successfully and outputted this:
Aug 19 2024
Update: Tried using spark 3.3.2 with this:
Aug 15 2024
I'm using the newer acryl-spark-lineage which works for datahub 0.13.3 https://datahubproject.io/docs/metadata-integration/java/acryl-spark-lineage
I can see that Iceberg for Spark 3.1 does not in fact have an icebergCatalog method, but for Spark 3.3 and later it does. Going to see if I can use the Spark 3.3 configs from the airflow dags repo
I ran a simple spark sql job on a statbox with:
sudo -u analytics-privatedata spark3-sql \
  --jars ./acryl-spark-lineage-0.2.16.jar \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --master local[12] \
  --driver-memory 8G \
  --conf "spark.datahub.emitter=file" \
  --conf "spark.datahub.file.filename=./il_lineage" \
  -f il_test.hql
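Once the file output looks sane, pointing it at the live instance should just mean swapping the two file confs above for the rest emitter ones from the acryl-spark-lineage docs (the server URL below is a placeholder, not our actual GMS endpoint):
--conf "spark.datahub.emitter=rest"
--conf "spark.datahub.rest.server=http://datahub-gms.example.org:8080"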
Aug 14 2024
Hmm I guess my problem is different and just lies in the non-standard way packages are installed.
heroku@38b1190a0d63:/workspace$ find /layers/fagiani_apt/apt | grep libGLESv2
/layers/fagiani_apt/apt/usr/lib/x86_64-linux-gnu/libGLESv2.so.2.1.0
/layers/fagiani_apt/apt/usr/lib/x86_64-linux-gnu/libGLESv2.so.2
heroku@38b1190a0d63:/workspace$ npm run scrape
browserType.launch:
╔══════════════════════════════════════════════════════╗
║ Host system is missing dependencies to run browsers. ║
║ Missing libraries: ║
║ libGLESv2.so.2 ║
╚══════════════════════════════════════════════════════╝
Aug 12 2024
Aug 1 2024
I think I'm experiencing a similar error. Suddenly started getting this on a scraping job I have:
2024-08-01T02:42:58+00:00 [test-scrape-pd5kq] ║ Host system is missing dependencies to run browsers. ║
2024-08-01T02:42:58+00:00 [test-scrape-pd5kq] ║ Missing libraries:                                    ║
2024-08-01T02:42:58+00:00 [test-scrape-pd5kq] ║     libOSSlib.so                                      ║
2024-08-01T02:42:58+00:00 [test-scrape-pd5kq] ╚═══════════════════════════════════════════════════════╝
Jul 29 2024
Jul 26 2024
Jul 25 2024
Jul 22 2024
Diffing the output of deeply merging stream defaults, all of the changes are either adding job_name to a disabled hadoop ingestion config or adding analytics_hive_ingestion defaults to streams that have hadoop ingestion enabled
Jul 18 2024
Jul 6 2024
Jun 14 2024
I don't think so. The image suggestion work on Flink never progressed past the original ticket.
Jun 11 2024
Getting rid of service-scaffold-node also means we should get rid of servicelib-node, since it was created for service-scaffold-node (this is why service-scaffold-node depends on packages that don't exist; the project was never finished)
Jun 3 2024
May 29 2024
Don't forget that any CI that has a production deployment pipeline needs the repo to be added to trusted runners and also have their tags protected (Slack thread on protecting tags)
This might be harder than I thought. Creating a dummy Google account to act as the receiver seems to be off the table: all of Google's APIs require OAuth or some manual way for the user to sign in. There is no way to make a pure bot account, and no good way to automate login without being slapped with a ban.
May 28 2024
May 21 2024
Sounds good to me. service-scaffold-node was started to turn service-template-node into a group of libraries and is basically superseded by my effort to replace service-runner (T360924), which is mostly complete
May 15 2024
Would be nice to get a confirmation for archiving node-rdkafka-statsd since it'll progress T349118
May 13 2024
Apr 17 2024
Has this project been discussed across the WMF/Community?
It would be great if there were an RFC process, but there have at least been discussions about what to do with service-runner, and this project is on the radar of the entirety of Data Platform Engineering plus some people on the MW engineering team and the language team. It was also posted on Slack in #engineering-all to give people a heads-up just in case another team was working on something similar. If there's one thing I'm sure about, it's that the consensus is that we need a replacement, whether or not this is it.
Apr 5 2024
The config store repo's CI checks the jsonschema for correctness and validates config values against it. The Datasets Config service repo has dockerized CI using Kokkuri and Blubber.
Mar 26 2024
Feb 27 2024
If it's to a point where we even need to use a new name, might as well break everything. I'd love to join in on the fun
Feb 11 2024
Looking at the logs, this seems to coincide with the redaction patch to eventstreams, but looking at the code I'm having a hard time finding where a memory leak could've happened... what's more confusing is that it's just 1 or 2 pods hitting the limit
Jan 30 2024
Jan 22 2024
Using lz4 compression works, but checking it with parquet-tools doesn't: I see something like compression: UNKNOWN (space_saved: -25%). Seems like a known issue.
Jan 5 2024
INSERT OVERWRITE with PARTITION also doesn't work anymore because Iceberg uses hidden partitioning, so I had to enable Spark's dynamic overwrite mode
https://iceberg.apache.org/docs/latest/spark-writes/#insert-overwrite
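In other words, something along these lines (table names made up):
SET spark.sql.sources.partitionOverwriteMode=dynamic;

-- no PARTITION clause; Iceberg replaces only the partitions produced by the SELECT
INSERT OVERWRITE my_db.my_iceberg_table
SELECT * FROM my_db.my_staging_table;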
TIL that when setting the compression codec to snappy, Iceberg doesn't end the files in HDFS with .snappy.parquet. I had to check whether the format was correct using parquet-tools.
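The check looks roughly like this (paths made up; this assumes the Java parquet-tools, whose subcommand is meta, rather than the pip version, which uses inspect):
hdfs dfs -get /wmf/data/.../some_table/data/00000-0-some-file.parquet /tmp/
parquet-tools meta /tmp/00000-0-some-file.parquet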
Dec 19 2023
Tested whether the COALESCE hints still work in Iceberg by creating 2 tables and filling them with/without the hint. It still seems to work.
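For reference, the hint in question (table names made up; with the hint the write should collapse to a single output file, without it you typically get many small files):
INSERT INTO my_db.iceberg_with_hint
SELECT /*+ COALESCE(1) */ * FROM my_db.some_source_table;

INSERT INTO my_db.iceberg_without_hint
SELECT * FROM my_db.some_source_table;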
Dec 18 2023
Dec 16 2023
Tested on a stat machine with
CREATE EXTERNAL TABLE IF NOT EXISTS `aqs_hourly`(
    `cache_status`  string    COMMENT 'Cache status',
    `http_status`   string    COMMENT 'HTTP status of response',
    `http_method`   string    COMMENT 'HTTP method of request',
    `response_size` bigint    COMMENT 'Response size',
    `uri_host`      string    COMMENT 'Host of request',
    `uri_path`      string    COMMENT 'Path of request',
    `request_count` bigint    COMMENT 'Number of requests',
    `hour`          timestamp COMMENT 'The aggregated hour. Covers from minute 00 to 59'
)
USING ICEBERG
PARTITIONED BY (days(hour));
And
spark3-sql --master yarn \
  --executor-memory 8G \
  --executor-cores 4 \
  --driver-memory 2G \
  --conf spark.dynamicAllocation.maxExecutors=64 \
  -f aqs_hourly_iceberg.hql \
  -d source_table=wmf.webrequest \
  -d webrequest_source=text \
  -d destination_table=tchin.aqs_hourly \
  -d coalesce_partitions=1 \
  -d year=2023 \
  -d month=12 \
  -d day=3 \
  -d hour=0
Dec 14 2023
Dec 11 2023
Dec 2 2023
Nov 14 2023
I think the per-image quota should probably be increased. I tested building a few projects locally and a project with NodeJS and 0 dependencies results in a built image that's 805.58 MB. One with only VueJS as a dependency bumps it up to 858.13 MB. I'm probably not going to be the last one who needs more than 200 MB of working space :/
Nov 13 2023
Example error:
step-export: 2023-11-13T05:41:56.835942824Z ERROR: failed to export: failed to write image to the following tags: [tools-harbor.wmcloud.org/tool-dpe-alerts-dashboard/tool-dpe-alerts-dashboard:latest: PATCH https://tools-harbor.wmcloud.org/v2/tool-dpe-alerts-dashboard/tool-dpe-alerts-dashboard/blobs/uploads/b62dd944-4fad-4ee8-b900-8409f7860d6c?_state=REDACTED: unexpected status code 413 Request Entity Too Large: <html>
step-export: 2023-11-13T05:41:56.835973012Z <head><title>413 Request Entity Too Large</title></head>
step-export: 2023-11-13T05:41:56.835976984Z <body>
step-export: 2023-11-13T05:41:56.835979969Z <center><h1>413 Request Entity Too Large</h1></center>
step-export: 2023-11-13T05:41:56.835983468Z <hr><center>nginx/1.18.0</center>
step-export: 2023-11-13T05:41:56.836002364Z </body>
step-export: 2023-11-13T05:41:56.836005027Z </html>
step-export: 2023-11-13T05:41:56.836008032Z ]
step-export:
step-results: 2023-11-13T05:41:57.433667715Z 2023/11/13 05:41:57 Skipping step because a previous step failed
Oct 26 2023
Oct 11 2023
If we do introduce something, we should use JSDoc3 and follow what's happening on this ticket T138401
Oct 3 2023
Sep 29 2023
DeliveryGuarantee.AT_LEAST_ONCE: The sink will wait for all outstanding records in the Kafka buffers to be acknowledged by the Kafka producer on a checkpoint. No messages will be lost in case of any issue with the Kafka brokers but messages may be duplicated when Flink restarts because Flink reprocesses old input records.
Sep 28 2023
Unaligned checkpoints didn't work. Maybe it's because data is being moved around to new brokers and Kafka is too overloaded.
@bking Gabriele is currently on sick leave but yes let's try incrementing the helm chart version
Sep 19 2023
Aug 31 2023
Associated GitHub PR: https://github.com/wikimedia/jsonschema-tools/pull/48