[01:16:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:43] (SystemdUnitFailed) resolved: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:54:35] * brouberol waves good morin
[06:54:57] * brouberol waves good morning, even. Curses at cat walking on keyboards
[06:58:56] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) The next step is to evacuate `webrequest_upload`. We'll generate a reassignment plan in multiple steps: ` brouberol@kafka-jumbo1010:~/topicmappr$ topicmappr rebuild --topics webrequest_upload --brokers...
[07:02:41] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reassignment-json-file ./webrequest_upload-phase0.json --execute --throttle 30000000 kafka-reassign-partitions --zookeeper conf1007...
[07:03:05] brouberol: o/
[07:03:25] o/
[07:03:46] one thing that I forgot to mention for metrics-fetcher - if you don't want to wait for upstream to merge etc.., you can always add your code as debian patch
[07:03:51] release the deb etc..
[07:04:03] and once upstream merges, we can remove the patch and upgrade the deb version
[07:04:06] good morning folks o/
[07:04:11] bonjour
[07:04:21] oh, indeed. I thought about it but because this is non-blocking for us atm, I'm ok with waiting
[07:04:57] I was both pleased and sorry elukey after the France-Italy rugby match this weekend :)
[07:04:57] one thing I'll probably send upstream as well is the option of passing flags via env vars, so that we only need to run `prometheus-metricsfetcher` and be done w/ it
[07:05:07] instead of passing 22 flags
[07:05:20] oof, yep. This was a rough one
[07:08:27] joal: ah snap didn't follow, I am not a rugby fan, but I reckon that Les Bleus may have destroyed Italy :D
[07:09:14] :)
[07:15:56] Data-Platform-SRE, Data Pipelines: Evaluate the PRESTO_EXPAND_DATA feature flag in superset - https://phabricator.wikimedia.org/T340144 (JAllemandou) >>! In T340144#9225081, @Ottomata wrote: >> While super useful when it works, the feature is not stable enough to roller-out to production > > @JAllemando...
[07:44:04] Data-Engineering, Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (JAllemandou) Indeed that's weird - I'll contact Alluxio folks.
[07:45:24] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (JMeybohm) It should be possible to do the cluster upgrade procedure without an actual update, ye...
[07:52:43] (CR) Peter Fischer: "Since this schema lives under development/ we should be fine without version changes. Moving it here definitely was a good idea, @Gemodena" [schemas/event/primary] - https://gerrit.wikimedia.org/r/963990 (owner: Peter Fischer)
[07:53:27] btullis: I see that you have been able to make progress on https://github.com/apache/superset/issues/25397. Do you need me to help on https://github.com/apache/superset/issues/23483?
[08:02:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:03:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:26] brouberol: I think I'm sort of unblocked on the superset upgrade for now, but it might make sense for us to have a sync at some point to discuss superset in general.
[08:05:14] sure thing
[08:15:42] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:40:10] (CR) Urbanecm: [C: -1] Add analytics for Impressions, Success and Abandonment rate for temporary Users (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[08:43:46] elukey: FYI https://github.com/tarvip/kafkakit-prometheus-metricsfetcher/pull/4
[08:44:33] _that_ I'd like to include as a patch and package it in advance, as it'd help rebalance metrics in a more straightforward fashion I think
[08:46:48] (PS10) Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273)
[08:47:23] (CR) CI reject: [V: -1] Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime)
[08:54:25] I'm about to start a rolling restart of analytics10[70-77] for T344587
[08:55:25] !log started rolling restart of analytics10[70-77] for T344587
[08:55:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:55:41] (PS11) Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273)
[08:57:22] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (dcausse) >>! In T342149#9234617, @JMeybohm wrote: > It should be possible to do the cluster upgr...
[08:58:10] Data-Engineering, Tool-Pageviews: None result with some chars in the file name - https://phabricator.wikimedia.org/T347899 (Lokal_Profil) >>! In T347899#9217630, @MusikAnimal wrote: > Well first, the file was only uploaded 22 hours ago, so the data might simply [[ https://pageviews.wmcloud.org/mediaviews...
[09:13:53] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (JMeybohm) >>! In T342149#9234816, @dcausse wrote: > The operator will purge the H/A metadata if...
[09:24:06] Data-Engineering, Data-Platform-SRE, Discovery-Search (Current work), Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (dcausse) >>! In T342149#9234852, @JMeybohm wrote: >>>! In T342149#9234816, @dcausse wrote: >> @J...
[09:28:28] PROBLEM - SSH on analytics1072 is CRITICAL: connect to address 10.64.21.116 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:30:28] Data-Engineering, Data Engineering and Event Platform Team, Event-Platform: flink-app: swift bucket and zookeeper paths should be templated. - https://phabricator.wikimedia.org/T336901 (JMeybohm)
[09:30:34] Data-Engineering, serviceops, Discovery-Search (Current work), Event-Platform, Patch-For-Review: Improve the flink-app chart to provide more useful defaults - https://phabricator.wikimedia.org/T346315 (JMeybohm)
[10:06:08] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reassignment-json-file ./webrequest_upload-phase1.json --execute --throttle 30000000 kafka-reassign-partitions --zookeeper conf1007...
[10:10:08] PROBLEM - SSH on analytics1073 is CRITICAL: connect to address 10.64.21.117 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:11:38] RECOVERY - SSH on analytics1073 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:16:53] brouberol: yes +1, you can add https://github.com/tarvip/kafkakit-prometheus-metricsfetcher/commit/9533e8132ec539a99f4526e4c95a335bcfffae98.diff under debian/patches etc.. and bump the package version
[10:24:32] RECOVERY - SSH on analytics1072 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:46:59] !log started rolling restart of an-worker1[078-156] for T344587
[10:47:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:08:36] elukey: I need to use the quilt debian source format for this, don't I?
[11:09:18] brouberol: IIRC the git diff format is ok as well
[11:09:26] you can test it with quilt apply
[11:09:38] I'll check it out, thanks
[11:13:06] ah, one thing that isn't covered by this patch though, is the content of the vendor/ folder (that we have in our repo that isn't there upstream), changed by our diff in go.{mod,sum}
[11:16:04] so I might have to introduce a second patch that modifies the vendor content itself
[11:53:29] Data-Platform-SRE: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (BTullis)
[11:56:00] (PS15) Btullis: Update to Superset version 2.1.1 [analytics/superset/deploy] - https://gerrit.wikimedia.org/r/957938 (https://phabricator.wikimedia.org/T335356)
[12:36:25] btullis: would you have a couple of minutes today to talk about where we are with our k8s dse cluster?
[12:38:49] Yep, sure thing. Can you give me 20 minutes first?
[12:38:58] of course
[12:39:48] to give you some context, I started digging into secret management. You _can_ use external tools such as Vault, AWS KMS & many others to store secrets encrypted at rest, but you can also use plain old etcd: https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/
[12:40:34] and it turns out that this ^ seems already setup in our puppet repo: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/kubeadm/templates/encryption-conf.yaml.erb#1
[12:45:13] brouberol: Yes, good stuff. We already use this built-in method for supplying parameters from hiera and other files that are text encoded.
[12:46:02] I have yet to understand how/if this is configured for the DSE cluster though
[12:46:25] I'm trying to make sense of some monitoring. It looks like we're ignoring disk usage monitoring of /var/lib/hadoop/data on Hadoop workers... does that make sense or am I way off?
[12:47:36] brouberol: What I'd like to look at is whether we could use it to supply a binary file: `/etc/security/keytabs/superset/superset.keytab` - I think we would need to base64 encode it for the helm chart, then decode it again to binary within the pod.
[12:47:46] but my very high level thinking right now to integrate kubernetes and Kerberos would be a) leverage secret management in etcd b) encrypt manually generated keytabs as secrets, tied to the app service account c) render these secrets in the pod as keytab files
[12:47:55] slyngs: Let me check.
[12:48:13] again, these are more a collection of hunches than anything else right now
[12:51:07] slyngs: I think we are monitoring it. `/etc/nagios/nrpe.d/check_disk_space_hadoop_worker.cfg` contains: `/usr/lib/nagios/plugins/check_disk -v --units GB -w 32 -c 16 -e -l -r "/var/lib/hadoop/data"`
[12:51:41] Well, that just makes things more complicated :-)
[12:51:45] Thanks
[12:52:12] I think we duplicate the disk check so probably exclude them from the common one: https://github.com/wikimedia/operations-puppet/blob/84edbbf9f057160c0c286bd2bf04c01d71e645de/hieradata/role/common/analytics_cluster/hadoop/worker.yaml#L32-L35
[12:53:33] That was the one I was looking at
[12:54:10] So I just saw the -i "/var/lib/hadoop/data"
[12:59:51] > I think we would need to base64 encode it for the helm chart, then decode it again to binary within the pod.
[12:59:51] This you can probably do with helm functions indeed
[14:19:11] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reassignment-json-file ./webrequest_upload-phase2.json --execute --throttle 60000000 kafka-reassign-partitions --zookeeper conf1007...
[15:21:32] Data-Engineering, Data-Platform-SRE, Data Engineering and Event Platform Team, SRE, and 2 others: Add Antoine_Quhen to the deployment group - https://phabricator.wikimedia.org/T347296 (thcipriani) Approved! Reason for access makes sense.
[15:22:24] Data-Platform-SRE, Discovery-Search (Current work): Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (Gehel)
[15:22:56] Data-Platform-SRE: Prometheus unable to scrape search-loader[12]002 - https://phabricator.wikimedia.org/T348222 (Gehel)
[15:31:32] Data-Platform-SRE, Discovery-Search (Current work): Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 (Gehel)
[15:33:00] Data-Platform-SRE: Consider using git-lfs for elastic plugins repo - https://phabricator.wikimedia.org/T344462 (Gehel)
[15:33:36] Data-Platform-SRE: Migrate apifeatureusage hosts to Bullseye or later - https://phabricator.wikimedia.org/T346053 (Gehel)
[15:48:44] Data-Platform-SRE: Set requests (not limits) for cirrus-streaming-updater in k8s - https://phabricator.wikimedia.org/T348350 (Gehel)
[15:49:38] Data-Platform-SRE, Discovery-Search (Current work): Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (Gehel)
[15:51:37] Data-Platform-SRE, Discovery-Search (Current work): Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (Gehel) Timeboxed to 1 day to understand if we have a real issue. We will re-estimate after that.
[16:06:04] btullis: I see you're rebooting hosts with the reboot-single cookbook one by one, if you need a hand to set up a roll-restart-reboot cookbook, let us know
[16:47:17] Data-Platform-SRE: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reassignment-json-file ./webrequest_upload-phase3.json --execute --throttle 60000000 kafka-reassign-partitions --zookeeper conf1007....
[17:06:08] Data-Engineering, Data Engineering and Event Platform Team, EventStreams, Event-Platform: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (Aklapper)
[17:25:58] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (Audiodude) Looking at that wiki page I linked, it seems at least somewhat out of date. I'd like to work on upgrading Python to at least 3.11, since 3.7 is EOL since June of 2023. Of course this might require upgrading dependencies...
[17:36:29] Data-Engineering: Check home/HDFS leftovers of ryanmax - https://phabricator.wikimedia.org/T325527 (Miriam) Oh sorry @BTullis I completely missed this, and thanks @Sfaci for the ping! Is it possible to move this data to @tizianopiccardi's home, as he is a co-author of the paper?
[17:57:36] Data-Engineering: Check home/HDFS leftovers of ryanmax - https://phabricator.wikimedia.org/T325527 (tizianopiccardi) The files in the folder of `ryanmax` can be deleted. The relevant files were already moved to my home folder.
[18:32:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:35:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:35:34] !log deployed airflow analytics
[18:35:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:45:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:47:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:52:40] (PS1) Joal: Update referer archive job to use icerberg table [analytics/refinery] - https://gerrit.wikimedia.org/r/964573 (https://phabricator.wikimedia.org/T347693)
[19:02:19] (PS2) Joal: Update referer archive job to use icerberg table [analytics/refinery] - https://gerrit.wikimedia.org/r/964573 (https://phabricator.wikimedia.org/T347693)
[19:32:02] Quarry: git-crypt for config.yaml files - https://phabricator.wikimedia.org/T348476 (rook)
[19:32:36] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (rook) >>! In T348184#9233043, @SD0001 wrote: > @rook Are there any docs on how to do deployments once a GitHub PR gets merged? The document you found describes the process. https://wikitech.wikimedia.org/wiki/Portal:Data_Services...
[19:32:42] (SystemdUnitFailed) firing: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:34:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:46:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:47:42] (SystemdUnitFailed) resolved: produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:55:45] Data-Engineering: Check home/HDFS leftovers of zxane - https://phabricator.wikimedia.org/T348127 (Sfaci) No files found for zxane user in stat**/HDFS: `santi@wmf3277 scripts % ./check-users-leftovers zxane ====== stat1004 ====== total 0 ====== stat1005 ====== total 0 ====== stat1006 ====== total 0 ===...
[20:57:25] Data-Engineering: Check home/HDFS leftovers of zxane - https://phabricator.wikimedia.org/T348127 (Sfaci) Open→Resolved a:Sfaci
[20:59:57] Data-Engineering: Check home/HDFS leftovers of tsepothoabala - https://phabricator.wikimedia.org/T348114 (Sfaci) Open→Resolved a:Sfaci No files found for **tsepothoabala** in stat**/HDFS: ` santi@wmf3277 scripts % ./check-users-leftovers tsepothoabala ====== stat1004 ====== total 0 ======...
[21:01:40] Data-Engineering: Check home/HDFS leftovers of essexigyan - https://phabricator.wikimedia.org/T348106 (Sfaci) Open→Resolved a:Sfaci No files found for **essexigyan** in stat**/HDFS. ` santi@wmf3277 scripts % ./check-users-leftovers essexigyan ====== stat1004 ====== total 0 ====== stat1005 =...
[21:47:30] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (Audiodude) Thank you for all the information, it is very helpful! We can stick to asynchronous communication if that's what works best, no problem. I guess we can keep using this ticket for Q&A? Anyways looking at T301469, anothe...
[21:51:09] Quarry: Add maintainers to quarry - https://phabricator.wikimedia.org/T348184 (rook) Ah yes when the k8s investigation ticket was opened the quarry source was hosted in Gerrit. The source has since moved to GitHub and GitHub would be the correct place to do development. I can add some container building logi...
[22:45:48] Quarry, Patch-For-Review: investigate quarry on k8s - https://phabricator.wikimedia.org/T301469 (Audiodude) I'm completely new to Kubernetes but have been reading through https://wikitech.wikimedia.org/wiki/Kubernetes/Kubernetes_Workshop. Does WM Cloud provide k8s clusters, or is it expected that we woul...