User Details
- User Since: Nov 1 2022, 1:30 PM (14 w, 9 h)
- Availability: Available
- LDAP User: Stevemunene
- MediaWiki User: SMunene-WMF
Fri, Feb 3
Hi,
This was deployed and tested on the staging environment, but the records were still not created and the behavior was generally the same. I have been exploring how and why we query results the way we do, as per this issue.
authzIdentity is also mentioned here. Asking around on the DataHub Slack for suggestions.
Tue, Jan 31
Yes, adding authzIdentity set to the same value as the username. Leaving out the authentication option for now; I shall update based on results.
Mon, Jan 30
Did some more reading on JAAS user extraction, specifically authzIdentity and java.naming.security.authentication="simple"; it is likely that both config options are required. However, this still does not explain why the previously working JAAS LDAP setup broke.
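For reference, a minimal sketch of what the jaas-ldap.conf entry could look like with both options in place (the LDAP URL and filter below are placeholders, not our real settings; authzIdentity="{uid}" reads the uid attribute, which per the LdapLoginModule docs makes the authorization identity track the username):
WHZ-Authentication {
  com.sun.security.auth.module.LdapLoginModule sufficient
  userProvider="ldaps://ldap.example.org:636/ou=people,dc=example,dc=org"
  authIdentity="{USERNAME}"
  userFilter="(&(objectClass=person)(uid={USERNAME}))"
  java.naming.security.authentication="simple"
  authzIdentity="{uid}"
  useSSL="true";
};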
Fri, Jan 27
We implemented a change to enable OIDC just-in-time provisioning and group extraction, as per a suggestion from the DataHub Slack channel, which was:
The user login -> corpUser generation happens when these properties are set to true: auth.oidc.jitProvisioningEnabled, auth.oidc.extractGroupsEnabled
Our env vars look like this now:
Environment:
  SERVICE_IDENTIFIER: datahub-frontend-main
  JAVA_OPTS: -Xms512m -Xmx512m -Dhttp.port=9002 -Dconfig.file=/datahub/datahub-frontend/conf/application.conf -Djava.security.auth.login.config=/datahub/datahub-frontend/conf/auth/jaas-ldap.conf -Dlogback.configurationFile=/datahub/datahub-frontend/conf/logback.xml -Dlogback.debug=false -Dpidfile.path=/dev/null
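For reference, the two new flags map to plain frontend environment variables via the stock application.conf (env var names as per the DataHub OIDC docs); shown here in the same Environment style as above, with the rest of our OIDC settings omitted:
  AUTH_OIDC_JIT_PROVISIONING_ENABLED: "true"
  AUTH_OIDC_EXTRACT_GROUPS_ENABLED: "true"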
Wed, Jan 25
Updated the configs of the Presto server acting as coordinator and the Presto servers acting as worker nodes with the tuning settings below:
query.max-memory: 200GB
query.max-memory-per-node: 20GB
query.max-total-memory-per-node: 40GB
task.concurrency: 48
# task.max-worker-threads is the Node vCPUs * 4
task.max-worker-threads: 192
node-scheduler.max-splits-per-node: 500
task.http-response-threads: 5000
After review, proceeded to merge the changes to production and ran the puppet agent (sudo run-puppet-agent) on the relevant hosts:
- an-coord1001 - Presto server (acting as coordinator)
- an-presto100[1-5] - Presto servers (acting as workers)
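The puppet runs amount to something like the following cumin invocation (a sketch; run-puppet-agent is the wrapper already referenced above):
sudo cumin 'an-coord1001.eqiad.wmnet,an-presto[1001-1005].eqiad.wmnet' 'run-puppet-agent'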
Tue, Jan 24
Hi @fgiunchedi, the 10 servers were taken out of the cluster due to challenges joining the Presto cluster, documented in T325809, T325331 and T323783.
Mon, Jan 23
Hi @mmartorana, this is still pending; how do I get the task moving again?
Wed, Jan 18
Found a discussion on the Presto GitHub revolving around a similar issue: the number of worker nodes that a cluster can support is limited by the resources (CPU and memory) available to the coordinator.
This was first discussed in Scale to larger clusters #10174, with the general suggestion being to scale up the coordinator.
The follow-up suggestion was to introduce the Presto Disaggregated Coordinator, discussed in Scaling The Presto Coordinator #13814 and in Design Disaggregated Presto Coordinators #15453, which provides a design for the feature. Also mentioned was a symptom similar to the one we were facing: "In certain high QPS use cases, we have found that workers can become starved of splits, by excessive CPU being spent on task updates. This bottleneck in the coordinator is alleviated by reducing the concurrency, but this leaves the cluster under-utilized."
"Furthermore, because of the de-facto constraint that there must be one coordinator per cluster, this limits the size of the worker pool to whatever number can handle the QPS from conditions 1 and 2. This means it’s very difficult to deploy large high-QPS clusters than can take on queries of moderate complexity (such as high stage count queries)." This activity can be seen in the Presto stats (running queries vs. abandoned queries) here, around the time T325331 was brought to our attention.
Hi @odimitrijevic, requesting approval for the Security Issue Access Request.
Thanks.
Wed, Jan 11
In change 878128, Airflow 2.3.4-compatible configuration changes were made but could not be properly tested, since the test instance an-test-client1001.eqiad.wmnet is running Airflow 2.1.4.
The changes made covered:
- connecting Airflow instances to PostgreSQL (T326195)
- handling the connection in a way compatible with Airflow 2.3.4 and upwards (T315580)
Since the changes are only compatible with Airflow 2.3.0 and upwards, pcc runs resulted in the error below.
Error: Evaluation Error: Error while evaluating a Resource Statement, Airflow::Instance[analytics_test]: has no parameter named 'sql_alchemy_schema' has no parameter named 'database' (file: /srv/jenkins/puppet-compiler/39047/change/src/modules/profile/manifests/airflow.pp, line: 222) on node an-test-client1001.eqiad.wmnet
Error: Evaluation Error: Error while evaluating a Resource Statement, Airflow::Instance[analytics_test]: has no parameter named 'sql_alchemy_schema' has no parameter named 'database' (file: /srv/jenkins/puppet-compiler/39047/change/src/modules/profile/manifests/airflow.pp, line: 222) on node an-test-client1001.eqiad.wmnet
Warning: Failed to compile catalog for node an-test-client1001.eqiad.wmnet: Evaluation Error: Error while evaluating a Resource Statement, Airflow::Instance[analytics_test]: has no parameter named 'sql_alchemy_schema' has no parameter named 'database' (file: /srv/jenkins/puppet-compiler/39047/change/src/modules/profile/manifests/airflow.pp, line: 222) on node an-test-client1001.eqiad.wmnet
Error: Failed to compile catalog for node an-test-client1001.eqiad.wmnet: undefined method `to_resource' for nil:NilClass
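For context, the relevant incompatibility appears to be where the metadata DB connection settings live: Airflow 2.3 moved sql_alchemy_conn (and sql_alchemy_schema) from [core] to the new [database] section. A rough airflow.cfg sketch, with a placeholder connection string:
# Airflow < 2.3
[core]
sql_alchemy_conn = postgresql://airflow:***@DBHOST:5432/airflow
# Airflow >= 2.3
[database]
sql_alchemy_conn = postgresql://airflow:***@DBHOST:5432/airflow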
Dec 20 2022
This change has been reverted and we are back to the original an-presto100[1-5] due to an incident: Superset: Presto backend: Unable to access some charts.
More details available in T325331
Dec 8 2022
Batch restarting varnishkafka-webrequest.service in batches of 3 with 30 seconds in between.
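The invocation follows the same pattern as the two below, roughly (assuming the matching P:cache::kafka::webrequest selector):
sudo cumin -b 3 -s 30 P:cache::kafka::webrequest "systemctl restart varnishkafka-webrequest.service"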
Batch restarting varnishkafka-eventlogging.service to pick up the new certs.
stevemunene@cumin1001:~$ sudo cumin -b 3 -s 30 P:cache::kafka::eventlogging "systemctl restart varnishkafka-eventlogging.service"
48 hosts will be targeted:
cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp[6009-6016].drmrs.wmnet,cp[1075,1077,1079,1081,1083,1085,1087,1089].eqiad.wmnet,cp[5017-5024].eqsin.wmnet,cp[3050,3052,3054,3056,3058,3060,3062,3064].esams.wmnet,cp[4037-4044].ulsfo.wmnet
Ok to proceed on 48 hosts? Enter the number of affected hosts to confirm or "q" to quit 48
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (48/48) [08:29<00:00, 10.62s/hosts]
FAIL | | 0% (0/48) [08:29<?, ?hosts/s]
100.0% (48/48) success ratio (>= 100.0% threshold) for command: 'systemctl restar...tlogging.service'.
100.0% (48/48) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
stevemunene@cumin1001:~$
Batch restarting varnishkafka-statsv.service in batches of 3 with 30 seconds in between.
stevemunene@cumin1001:~$ sudo cumin -b 3 -s 30 P:cache::kafka::statsv "systemctl restart varnishkafka-statsv.service"
48 hosts will be targeted:
cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp[6009-6016].drmrs.wmnet,cp[1075,1077,1079,1081,1083,1085,1087,1089].eqiad.wmnet,cp[5017-5024].eqsin.wmnet,cp[3050,3052,3054,3056,3058,3060,3062,3064].esams.wmnet,cp[4037-4044].ulsfo.wmnet
Ok to proceed on 48 hosts? Enter the number of affected hosts to confirm or "q" to quit 48
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (48/48) [08:23<00:00, 10.49s/hosts]
FAIL | | 0% (0/48) [08:23<?, ?hosts/s]
100.0% (48/48) success ratio (>= 100.0% threshold) for command: 'systemctl restar...a-statsv.service'.
100.0% (48/48) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Successfully restarted the services varnishkafka-eventlogging.service, varnishkafka-statsv.service and varnishkafka-webrequest.service, and verified SSL.
Generate the certificates
root@puppetmaster1001:~# cergen --generate --force -c 'varnishkafka' --base-path=/srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d
2022-12-08 09:43:17,994 INFO cergen Generating certificates ['varnishkafka'] with force=True
2022-12-08 09:43:17,994 INFO Certificate(varnishkafka) Generating all files, force=True...
2022-12-08 09:43:17,996 INFO Certificate(varnishkafka) Generating certificate file
/usr/lib/python3/dist-packages/urllib3/connection.py:362: SubjectAltNameWarning: Certificate for puppetmaster1001.eqiad.wmnet has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python3/dist-packages/urllib3/connection.py:362: SubjectAltNameWarning: Certificate for puppetmaster1001.eqiad.wmnet has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
/usr/lib/python3/dist-packages/urllib3/connection.py:362: SubjectAltNameWarning: Certificate for puppetmaster1001.eqiad.wmnet has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
2022-12-08 09:43:19,587 INFO Certificate(varnishkafka) Generating CA certificate file
2022-12-08 09:43:19,587 INFO Certificate(varnishkafka) Generating PKCS12 keystore file
2022-12-08 09:43:19,948 INFO Certificate(varnishkafka) Generating Java keystore file
2022-12-08 09:43:20,933 INFO Certificate(varnishkafka) Importing PuppetCA(puppetmaster1001.eqiad.wmnet_8140) cert into Java keystore
2022-12-08 09:43:21,922 INFO Certificate(varnishkafka) Generating Java truststore file with CA certificate PuppetCA(puppetmaster1001.eqiad.wmnet_8140)
disabling puppet temporarily on cp hosts
stevemunene@cumin1001:~$ sudo cumin A:cp "disable-puppet 'renewing varnishkafka certificates - T323771 - ${USER}'"
96 hosts will be targeted:
cp[2027-2042].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1075-1090].eqiad.wmnet,cp[5017-5032].eqsin.wmnet,cp[3050-3065].esams.wmnet,cp[4037-4052].ulsfo.wmnet
Ok to proceed on 96 hosts? Enter the number of affected hosts to confirm or "q" to quit 96
NO OUTPUT
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (96/96) [01:23<00:00, 1.15hosts/s]
FAIL | | 0% (0/96) [01:23<?, ?hosts/s]
100.0% (96/96) success ratio (>= 100.0% threshold) for command: 'disable-puppet '...1 - stevemunene''.
100.0% (96/96) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
stevemunene@cumin1001:~$
Dec 7 2022
Checking the cert status on one of the cp hosts.
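One way to do that, e.g. reading the subject and expiry off the deployed cert (the path below is an assumption; adjust to wherever the varnishkafka TLS material actually lives on the cp hosts):
sudo openssl x509 -noout -subject -enddate -in /etc/varnishkafka/ssl/varnishkafka.crt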
This is done; all servers successfully joined the cluster.
Dec 5 2022
an-presto1007 is now part of the cluster. The delay in joining was caused by the timing between the puppet run and the addition of the related keytabs.
Setting up the python-is-python3 automation for the Presto servers, then proceeding with adding an-presto1008-15 to the cluster.
@Ottomata This might affect the rare packages still using Python 2, or deployments that had already set up their own symlinks to python3. We shall discuss the potential implications for the deployments already done.
Dec 2 2022
Thanks @elukey, working on that.
Installed the python-is-python3 package and the Presto dependency was able to compile.
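For the record, python-is-python3 is the stock bullseye package that ships the unversioned /usr/bin/python symlink pointing at python3; installing it by hand amounts to something like:
sudo apt-get install -y python-is-python3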
presto-server.service still could not start because of a permission issue:
Dec 02 11:18:08 an-presto1006 presto-server[2138247]: ERROR: [Errno 13] Permission denied: '/srv/presto/var/run'
The directory /srv/presto/var/run was owned by root:root instead of presto:presto. This was fixed by deleting the folder: sudo rm -rf /srv/presto/var
Dec 1 2022
The presto-server service fails to run on the Debian 11 boxes due to a Python issue caused by the unversioned /usr/bin/python required by a dependency.
We do not have the right build for bullseye, thus we need to upgrade the packages for it. Here is a snippet from the logs.
Nov 23 2022
Upgrade to 1.38.2 is done and all data cubes are visible.
Something to note: the pre-configured data cubes are rediscovered with every scan, as seen in the logs below.
Nov 23 09:46:04 an-tool1007 turnilo[1177511]: Introspecting all sources in cluster 'druid-analytics-eqiad'
Nov 23 09:46:04 an-tool1007 turnilo[1177511]: Scanning cluster 'druid-analytics-eqiad' for new sources
Nov 23 09:46:04 an-tool1007 turnilo[1177511]: Cluster 'druid-analytics-eqiad' has never seen 'banner_activity_minutely' and will introspect 'banner_activity_minutely'
Nov 23 09:46:04 an-tool1007 turnilo[1177511]: Cluster 'druid-analytics-eqiad' has never seen 'mediawiki_geoeditors_monthly' and will introspect 'mediawiki_geoeditors_monthly'
Nov 23 09:46:04 an-tool1007 turnilo[1177511]: Cluster 'druid-analytics-eqiad' has never seen 'pageviews_daily' and will introspect 'pageviews_daily'
Nov 23 09:46:04 an-tool1007 turnilo[1177511]: Cluster 'druid-analytics-eqiad' has never seen 'pageviews_hourly' and will introspect 'pageviews_hourly'
Nov 23 09:46:04 an-tool1007 turnilo[1177511]: Cluster 'druid-analytics-eqiad' has never seen 'unique_devices_per_domain_daily' and will introspect 'unique_devices_per_domain_daily'
Nov 23 09:46:04 an-tool1007 turnilo[1177511]: Cluster 'druid-analytics-eqiad' has never seen 'unique_devices_per_domain_monthly' and will introspect 'unique_devices_per_domain_monthly'
Nov 23 09:46:04 an-tool1007 turnilo[1177511]: Cluster 'druid-analytics-eqiad' has never seen 'unique_devices_per_project_family_daily' and will introspect 'unique_devices_per_project_family_daily'
Nov 23 09:46:04 an-tool1007 turnilo[1177511]: Cluster 'druid-analytics-eqiad' has never seen 'unique_devices_per_project_family_monthly' and will introspect 'unique_devices_per_project_family_monthly'
Nov 23 09:46:04 an-tool1007 turnilo[1177511]: Cluster 'druid-analytics-eqiad' has never seen 'virtualpageviews_hourly' and will introspect 'virtualpageviews_hourly'
Nov 23 09:46:04 an-tool1007 turnilo[1177511]: Cluster 'druid-analytics-eqiad' has never seen 'webrequest_sampled_128' and will introspect 'webrequest_sampled_128'
Nov 23 09:46:04 an-tool1007 turnilo[1177511]: Cluster 'druid-analytics-eqiad' has never seen 'webrequest_sampled_live' and will introspect 'webrequest_sampled_live'
Nov 23 09:46:04 an-tool1007 turnilo[1177511]: Cluster 'druid-analytics-eqiad' has never seen 'wmf_netflow' and will introspect 'wmf_netflow'
Nov 15 2022
Got some input from the Turnilo Slack referencing this open issue.
The issue is with the deletion of introspected sources: the function deleteDataCube is called and the filter implemented there is not accurate.
The next step was to implement the recommended change in /src/common/models/sources/sources.ts, changing the filter from dataCubes: sources.dataCubes.filter(dc => dc.name === dataCube.name) to dataCubes: sources.dataCubes.filter(dc => dc.name !== dataCube.name), then build the Node project and test our configs.
Nov 14 2022
Tried various options on the staging environment.
Setting sourceListScan: auto enables the sources of the cluster to be automatically scanned and new sources to be added as data cubes. However, some of the hard-coded data cubes are not visible with this option, while all the non-defined Druid tables are visible. The only table visible with both sourceListScan: auto and sourceListScan: disable is mediawiki_history_beta.
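For reference, the toggle sits at the cluster level of the Turnilo config; a trimmed sketch of the two relevant pieces (field names as in the Turnilo docs, everything else omitted):
clusters:
  - name: druid-analytics-eqiad
    sourceListScan: auto  # "auto" introspects every Druid datasource; "disable" keeps only the dataCubes defined explicitly
dataCubes:
  - name: mediawiki_history_beta
    clusterName: druid-analytics-eqiad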