[00:17:57] (PS1) GoranSMilovanovic: T294985 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/739946
[00:18:22] (CR) GoranSMilovanovic: [V: +2 C: +2] T294984 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/739380 (owner: GoranSMilovanovic)
[00:18:35] (CR) GoranSMilovanovic: [V: +2 C: +2] T294985 [analytics/wmde/WD/WikidataAnalytics] - https://gerrit.wikimedia.org/r/739946 (owner: GoranSMilovanovic)
[00:25:18] (DruidSegmentsUnavailable) firing: More than 30 segments have been unavailable for wmf_netflow on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[00:25:18] (DruidSegmentsUnavailable) firing: More than 20 segments have been unavailable for wmf_netflow on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[00:35:18] (DruidSegmentsUnavailable) resolved: More than 30 segments have been unavailable for wmf_netflow on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[00:35:18] (DruidSegmentsUnavailable) resolved: More than 20 segments have been unavailable for wmf_netflow on the druid_analytics Druid cluster.
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/dashboard/db/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1&var-cluster=druid_analytics - https://alerts.wikimedia.org
[06:33:56] Analytics, Data-Engineering, Event-Platform, Browser-Support-Microsoft-Edge: Problem with delay caused by intake-analytics.wikimedia.org - https://phabricator.wikimedia.org/T295427 (Downsize43) This problem becomes more mysterious. It occurs on my HP laptop running Windows 10 and Microsoft Edge a...
[07:31:15] (CR) Joal: [C: +1] "LGTM :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/739922 (https://phabricator.wikimedia.org/T290516) (owner: Milimetric)
[07:33:08] bonjour
[07:33:15] Good morning :)
[07:33:22] I have a couple of Gobblin code reviews when you have time :)
[07:33:28] I have seen!
[07:33:34] at least one of them :)
[07:33:51] elukey: I assume it'd be better to test before applying, right?
[07:34:18] joal: yep yep even if I am pretty sure that it should work fine
[07:34:24] I can merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/739476 to create the bundle
[07:34:35] so that we can swap the values and test
[07:36:05] elukey: recap on the plan for me: you merge/deploy that patch making truststore available on an-launcher1002 (and possibly other hosts), and then I can test your patch using the truststore in gobblin - right?
[07:36:43] joal: yes correct!
There are gobblin timers also on an-test-coord1001 afaics, not sure if they are running or not though
[07:38:23] elukey: I assume they are running (let's check)
[07:39:56] elukey: I confirm they are running
[07:40:08] ok - let's do it elukey
[07:40:34] joal: all deployed, you can go ahead
[07:40:56] ack - running a manual test after a manual change of conf that mimics your patch
[07:41:02] elukey@an-test-coord1001:~$ ls -l /etc/ssl/localcerts/wmf_trusted_root_cas.jks
[07:41:05] -rw-r--r-- 1 root root 2283 Nov 19 07:38 /etc/ssl/localcerts/wmf_trusted_root_cas.jk
[07:42:13] elukey: dumb question before I test - is there a way to check that the new truststore is actually used?
[07:42:58] joal: if it doesn't the TLS connection shouldn't happen, unless certificate validation is turned off by default (but it is unlikely)
[07:43:10] ok :)
[07:43:49] in the kafka world it seems that "only" hostname validation is disabled (so after validating that the cert is signed by a trusted CA the client validates that the cert holds the hostname of the endpoint as CN: or SAN:)
[07:44:15] in our case all kafka brokers have "CN: kafka-cluster-name"
[07:44:54] so clients can connect to the endpoint, but they cannot validate that they are effectively connected to broker-x
[07:45:19] ok - but the jobs work currently, how so?
[07:45:37] because hostname validation is turned off by default in basically all kafka libs
[07:46:16] except some Python ones (David found the problem using a python snippet to pull data from kafka main IIRC)
[07:46:48] ok - but with that, how will we check that your patch is actually doing something? Does adding a truststore in conf set the validation to true?
[07:47:22] nope, the new truststore holds two CA certs, Puppet CA and PKI Root CA
[07:47:39] so a client using it can choose between the two
[07:47:47] when validating certificates from brokers
[07:48:10] if the client is able to connect to kafka it means that it is using one of the two
[07:48:43] we'll have a moment when a given cluster, like Jumbo, will have some brokers running the old certs and some the new ones
[07:48:50] Gobblin will have to accept both :D
[07:49:32] the bundle has been tested with kafka brokers so it works fine, but we have a kafka test broker with the new certs (kafka-test1006), so we can test Gobblin with it too
[07:49:46] ok I think I get it now
[07:50:31] sorry it is a bit confusing, but the procedure to make this transition transparent is a little elaborate :(
[07:51:27] I had not understood it was about certificate change - I thought it was about hostname-check
[07:54:03] elukey: job worked, kafka config logs tell me about the new truststore - seems working as expected :)
[07:55:20] \o/
[07:56:36] thanks a lot for helping out :)
[07:56:37] elukey: actually - beginning of job worked - it got its kafka-offset as expected - now the mapreduce part of it seems to fail - I need to wait for the job to have completely failed before looking at logs, but my expectation is that the worker nodes don't have the new truststore and try to use it when connecting to kafka
[07:56:59] ahh it was too easy yes
[07:57:14] will triple check in minutes
[07:57:42] going to prep a change to deploy the bundle on hadoop test workers in the meantime
[07:58:49] ok correct assumption: Failed to load SSL keystore
[08:01:40] joal: are you testing on launcher or an-test-coord?
[08:01:51] elukey: I'm testing on launcher
[08:02:18] elukey: I had a change almost ready for gobblin config in there
[08:02:32] ah ok, I thought on the test cluster, then I need to amend my patch to deploy the bundle :D
[08:02:59] PROBLEM - Check unit status of gobblin-webrequest on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit gobblin-webrequest https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:03:31] hm
[08:03:35] will check that--^
[08:07:16] running puppet on worker nodes, it will take a bit
[08:07:21] no prob lu
[08:07:54] Ok, error above was because of my test job stopping the prod one (same name) - it'll recover by itself (currently running)
[08:14:05] RECOVERY - Check unit status of gobblin-webrequest on an-launcher1002 is OK: OK: Status of the systemd unit gobblin-webrequest https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:18:26] Analytics-Radar, Data-Engineering, Event-Platform, SRE, Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (elukey) At this point we need to migrate all Kafka client using TLS to the new bundle before proceeding further...
[08:24:42] Analytics, Data-Engineering, Event-Platform: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (elukey)
[08:25:31] joal: ready for the test :)
[08:25:38] Banzai!
[08:27:40] elukey: successful hadoop job - will rerun with some data movement to triple check, but it looks all good
[08:28:47] \o/
[08:30:22] successful job elukey
[08:32:21] all good :)
[08:34:45] thank youuuu
[08:35:42] Thanks elukey for the change :)
[08:35:56] elukey: let me know when you wish that to be deployed
[08:36:38] joal: during the next ops weekly deployment would be good, but no rush
[08:37:22] Ack elukey - the puppet stuff being done it's only gobblin, right? (test-workers also have it?)
[08:37:52] test-workers done as well
[08:38:05] ok super - prepping the work for next deploy on our side :0
[08:38:53] in theory it should be a no-op after the deployment right?
[08:39:14] absolutely elukey
[08:39:37] elukey: once deployed, new gobblin run uses the new feature and should work
[08:39:54] perfect :)
[08:40:05] I am moving the kafka test cluster to the new certificates btw
[08:40:52] ack elukey
[08:41:03] (CR) Joal: [V: +2 C: +2] "Tested with @elukey - merging for next week deploy" [analytics/refinery] - https://gerrit.wikimedia.org/r/739475 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[08:43:12] elukey: deployment etherpad prepared and patch merged - we're all set
[08:43:58] <3
[08:57:11] Analytics-Radar, WMDE-GeoInfo-FocusArea, WMDE-TechWish-Sprint-2021-11-10: Review existing dashboards and metrics for maps - https://phabricator.wikimedia.org/T295315 (lilients_WMDE)
[09:13:09] kafka-test running with the new certs!
[09:16:08] Analytics-Radar, Data-Engineering, Event-Platform, SRE, Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (elukey) ` elukey@kafka-test1006:~$ openssl s_client -CAfile /etc/ssl/localcerts/wmf_trusted_root_CAs.pem -verif...
[09:16:12] dcausse: --^
[09:16:55] elukey: wow!
[09:17:27] moving kafka main to this new scheme will be way more painful :D
[09:17:28] elukey: is this available with the wmf-certs deb package?
[09:18:13] dcausse: the bundle is stored on all nodes in production at the moment (not on pods, working on it) but the bulk of the work is to change all Kafka broker TLS certificates
[09:18:23] sure
[09:18:43] thanks for doing this!
[09:18:52] and lemme know if I can help testing something
[09:19:17] sure!
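(Editor's sketch) The trust model discussed above, where the certificate chain is validated against a bundle holding both CAs while hostname validation stays off, can be illustrated with Python's standard `ssl` module. This is only an illustration under stated assumptions: Gobblin itself uses the JKS truststore, not this code; `make_client_context` is a hypothetical helper; the bundle path is the one quoted in the `openssl s_client` command in the log.

```python
import ssl

# PEM bundle path as quoted in the log; per the discussion it holds both the
# Puppet CA and the PKI Root CA.
WMF_CA_BUNDLE = "/etc/ssl/localcerts/wmf_trusted_root_CAs.pem"


def make_client_context(cafile=None, check_hostname=False):
    """Build a TLS client context (hypothetical helper).

    cafile: path to a PEM bundle; when given, only the CAs in that file are
            trusted. When None, the system default CAs are loaded instead.
    check_hostname: False mimics the behaviour described in the log, where
            the chain is verified but the broker hostname is not.
    """
    ctx = ssl.create_default_context(cafile=cafile)
    ctx.check_hostname = check_hostname
    # The chain must still validate against a trusted CA in either mode.
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx
```

On a host that has the bundle, `make_client_context(WMF_CA_BUNDLE)` would accept a broker certificate signed by either CA, which is what the transition period with mixed old and new broker certificates needs. Passing `check_hostname=True` corresponds to the stricter mode that T291905 is working towards.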
[10:44:49] (PS1) DCausse: rdf_streaming_updater: add a reconcile event schema [schemas/event/primary] - https://gerrit.wikimedia.org/r/740109 (https://phabricator.wikimedia.org/T279541)
[10:49:55] (CR) DCausse: [C: -1] "still working on the whole system around reconciliation so I might need other bits in these event but would love early feedback, esp. whet" [schemas/event/primary] - https://gerrit.wikimedia.org/r/740109 (https://phabricator.wikimedia.org/T279541) (owner: DCausse)
[11:26:28] (PS6) AKhatun: Save commons json dumps as a table and add fields for wikidata [analytics/refinery/source] - https://gerrit.wikimedia.org/r/739129 (https://phabricator.wikimedia.org/T258834)
[11:42:58] joal: I have tagged you as a reviewer on https://gerrit.wikimedia.org/r/c/operations/alerts/+/740128 regarding the druid alerts.
[13:15:09] Hi btullis - will review!
[13:19:52] Thanks. No hurry, but I think it might stop those alerts we've been getting.
[13:24:27] btullis: I have a question on how you set up the alarms - would you have a minute?
[14:20:14] joal: Yes, sorry I was afk for a while. Any time.
[14:22:24] btullis: batcave?
[14:22:35] On my way..
[14:47:35] elukey: heya - would you have a minute?
[14:47:40] sure
[14:47:57] elukey: we're in the batcave with btullis talking about druid metrics - would you join?
[14:48:09] https://meet.google.com/rxb-bjxn-nip
[14:49:07] elukey: we're wondering about the missing-segment metric
[15:17:34] (CR) MNeisler: [C: +2] Update talk_page_edit schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739865 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[15:18:54] (Merged) jenkins-bot: Update talk_page_edit schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739865 (https://phabricator.wikimedia.org/T286076) (owner: DLynch)
[15:29:35] heya joal :] I'm going to look into the gobblin-webrequest alarm, do you suspect anything that could have caused that?
[15:34:07] (CR) MNeisler: [C: +2] "Thanks for making this change @nettrom and sorry for the delay! Everything looks good to me, approved!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/725351 (https://phabricator.wikimedia.org/T292209) (owner: Nettrom)
[15:35:18] (CR) jerkins-bot: [V: -1] Update documentation for anonymous_user_token [schemas/event/secondary] - https://gerrit.wikimedia.org/r/725351 (https://phabricator.wikimedia.org/T292209) (owner: Nettrom)
[15:35:23] Analytics-Radar, Data-Engineering, Event-Platform, SRE, Patch-For-Review: Allow kafka clients to verify brokers hostnames when using SSL - https://phabricator.wikimedia.org/T291905 (elukey) Let's wait for T296089 before proceeding :)
[15:47:47] (PS3) Bearloga: Update documentation for anonymous_user_token [schemas/event/secondary] - https://gerrit.wikimedia.org/r/725351 (https://phabricator.wikimedia.org/T292209) (owner: Nettrom)
[15:54:59] (CR) jerkins-bot: [V: -1] Update documentation for anonymous_user_token [schemas/event/secondary] - https://gerrit.wikimedia.org/r/725351 (https://phabricator.wikimedia.org/T292209) (owner: Nettrom)
[15:56:08] (PS4) Bearloga: Update documentation for anonymous_user_token [schemas/event/secondary] - https://gerrit.wikimedia.org/r/725351 (https://phabricator.wikimedia.org/T292209) (owner: Nettrom)
[15:58:14] (CR) Bearloga: "Needed to delete materialized schema files and re-materialize 1.1.0 via:" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/725351 (https://phabricator.wikimedia.org/T292209) (owner: Nettrom)
[15:59:14] (CR) Bearloga: [C: +2] Update documentation for anonymous_user_token [schemas/event/secondary] - https://gerrit.wikimedia.org/r/725351 (https://phabricator.wikimedia.org/T292209) (owner: Nettrom)
[16:00:12] (Merged) jenkins-bot: Update documentation for anonymous_user_token [schemas/event/secondary] - https://gerrit.wikimedia.org/r/725351
(https://phabricator.wikimedia.org/T292209) (owner: Nettrom)
[16:35:41] Data-Engineering, Data-Engineering-Kanban: Error creating custom SQL metrics in Superset (event_sanitized.centralnoticebannerhistory) - https://phabricator.wikimedia.org/T292751 (EYener) Hi @BTullis thanks - looking back at this task, it is actually the dataset currently owned by @mpopov (schema: event_s...
[17:11:04] hey mforns - I'm sorry I didn't answer the email about the gobblin error - it was me testing a manual gobblin run with Luca earlier on
[17:11:29] ah! OK no problemo, it was quick to look at :]
[17:12:28] joal: I'm looking now at the Druid segments alerts, I think we need to reduce the size of the segments for those datasets
[17:12:39] I see segments for Netflow of up to 1.5GB
[17:12:46] wow this is big indeed
[17:13:21] mforns: the alerts we have seen in the past few days for segments are false alerts
[17:14:16] mforns: Ben has changed the alert mechanism when moving from Icinga to AlertManager, and since then we are receiving false alerts
[17:14:32] mforns: This should be fixed next week, we had a talk with Ben earlier on today
[17:14:46] also mforns: reducing the segment size for netflow seems a great idea!
[17:15:12] mforns: we didn't do formal ops-week rotation yesterday, my bad I should have mentioned
[17:17:39] no no, should I change the size for webrequest_sampled_128? The biggest segments are at 674MB
[17:17:45] joal: ^
[17:19:17] mforns: IIRC druid said that optimal segment size was 750MB, but now they have changed their way of advising, based on data-points present in the segment - we should review our segments in regard to this instead of just size
[17:20:36] mforns: will leave now - I'll review the druid info with you early next week if ok for you :)
[17:20:52] ok joal :] reading: https://druid.apache.org/docs/latest/operations/segment-optimization.html
[17:21:43] You rock mforns :)
[17:22:05] :D, have a nice weekend!
[17:59:42] Hi mforns.
Sorry I've done a poor job of communicating about this Druid alert.
[17:59:42] I should have copied you on this CR. https://gerrit.wikimedia.org/r/c/operations/alerts/+/740128
[17:59:42] We've been getting these alerts all this week, since I migrated the check from Icinga to Alertmanager. I made a slight change to the logic and it made it a bit trigger-happy. Should be fixed by the above CR.
[18:04:55] no no. no problem at all! actually, it has been good, because it made me look at druid segment sizes, and I spotted some segments that need to be split
[18:05:01] btullis: ^
[18:06:39] Great. Have a good weekend.
[19:13:01] (CR) Nray: [C: +2] Update web_ui_scroll schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739659 (https://phabricator.wikimedia.org/T294246) (owner: Clare Ming)
[19:13:46] (Merged) jenkins-bot: Update web_ui_scroll schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/739659 (https://phabricator.wikimedia.org/T294246) (owner: Clare Ming)
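(Editor's sketch) The segment-size discussion above moves from judging segments by bytes to judging them by the number of rows they hold. A minimal way to turn that into a split estimate is sketched below. The 5 million rows target is an assumption that should be checked against the Druid segment-optimization page linked in the log, `suggested_splits` is a hypothetical helper, and the row counts in the usage note are made up for illustration, since the log only gives byte sizes.

```python
import math

# Assumed per-segment target; verify against the Druid documentation.
TARGET_ROWS_PER_SEGMENT = 5_000_000


def suggested_splits(num_rows, target_rows=TARGET_ROWS_PER_SEGMENT):
    """Return how many segments a time chunk holding num_rows rows would be
    split into so that each segment stays at or below target_rows."""
    if num_rows < 0 or target_rows <= 0:
        raise ValueError("num_rows must be >= 0 and target_rows must be > 0")
    # At least one segment, even for an empty or small chunk.
    return max(1, math.ceil(num_rows / target_rows))
```

For example, a hypothetical chunk of 12 million rows gives `suggested_splits(12_000_000) == 3`, while a chunk of 4 million rows stays as a single segment.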