[00:33:39] <jinxer-wm>	 (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[01:29:42] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:31:49] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:33:33] <jinxer-wm>	 (HdfsCorruptBlocks) firing: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[05:29:57] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:17:53] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reassignment-json-file ./webrequest_text-phase2.json --execute --throttle 60000000 kafka-reassign-partitions...
[06:34:22] <jinxer-wm>	 (HdfsCorruptBlocks) resolved: HDFS corrupt blocks detected on the analytics-hadoop HDFS cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_corrupt_blocks - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=39&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsCorruptBlocks
[06:48:33] <joal>	 Hi team - I have a power cut this morning, I'll be working offline.
[06:48:47] <joal>	 Back in ~3h
[07:09:51] <jinxer-wm>	 (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[07:20:29] <icinga-wm>	 PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:21:55] <icinga-wm>	 RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:26:15] <icinga-wm>	 PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:27:43] <icinga-wm>	 RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:37:32] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene)
[07:37:47] <icinga-wm>	 PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:39:51] <jinxer-wm>	 (HdfsFSImageAge) firing: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[07:42:05] <icinga-wm>	 RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:54:59] <icinga-wm>	 PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:57:51] <icinga-wm>	 RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:09:21] <icinga-wm>	 PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:09:35] <brouberol>	 I'm eyeing https://phabricator.wikimedia.org/T329398 as something to work on as kafka reassignments are ongoing. Does anyone know how we usually monitor x509 certificate expiration, and make sure certificates are puppetized?
[08:13:21] <btullis>	 brouberol: I can talk to you about this one in our sync if you like. This skein certificate is a bit of an oddity because we don't manage it at present, it's just auto-created with a 1 year expiry.
[08:14:04] <brouberol>	 👍 so at least having an alert on its expiration date would be useful
[08:14:28] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) The new druid workers have fully joined the cluster and we are ready to move on with the next steps. {F38214901} From the conversations on the last druid refresh, the servers w...
[08:15:09] <btullis>	 Yes. I'd start with the one on an-test-client1002, which is associated with our analytics-test airflow instance.
[08:23:33] <btullis>	 The first question I would ask myself is: will I have to use Icinga to check (alert on) the expiration date, or is there a smarter way of doing it with Prometheus these days?
[08:28:37] <brouberol>	 I'm guessing we could rely on exporters such as https://github.com/amimof/node-cert-exporter. I'm not sure whether we could build them to run on Buster though
[08:31:29] <btullis>	 There are some expiry based alerts in prometheus here: https://codesearch.wmcloud.org/search/?q=expiry&files=&excludeFiles=&repos=operations%2Falerts - I would look to see how those values get into prometheus.
[08:32:07] <btullis>	 Maybe some are based on probes, or maybe some are based on the textfile collector of the node exporter. 
[08:32:35] <btullis>	 You could also ask in #wikimedia-observability or tag some people from that team and ask in the ticket.
[08:34:34] <btullis>	 This looks like a probe based check for cassandra TLS expiry: https://gerrit.wikimedia.org/g/operations/puppet/+/a72cec21a8c47d54605de0bcaa50786e0972fc55/modules/cassandra/manifests/instance/monitoring.pp#78
[08:34:39] <icinga-wm>	 RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:35:17] <btullis>	 ...but I don't think that will work in the case of skein, because there isn't a TCP port using this certificate (I don't think).
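For a certificate that no TCP port serves, the textfile-collector route mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: the metric name `x509_cert_not_after_seconds` and the certificate path are hypothetical, and the `notAfter` string is assumed to have been obtained separately (for example from `openssl x509 -enddate`). The function only renders the text that a node exporter textfile collector would read; writing it to the collector directory and the alert rule itself are left out.

```python
import ssl


def render_cert_expiry_metric(cert_path, not_after):
    """Render Prometheus text-format lines for one certificate's expiry.

    cert_path: path used as a label value (hypothetical in this sketch).
    not_after: expiry in the certificate notAfter format,
               e.g. "Jan  1 00:00:00 2024 GMT".
    """
    # ssl.cert_time_to_seconds converts the notAfter string to epoch seconds.
    expiry = ssl.cert_time_to_seconds(not_after)
    lines = [
        # Metric name is an assumption, not an existing metric.
        "# HELP x509_cert_not_after_seconds Certificate expiry as a Unix timestamp.",
        "# TYPE x509_cert_not_after_seconds gauge",
        f'x509_cert_not_after_seconds{{path="{cert_path}"}} {expiry}',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder path and date, for illustration only.
    print(render_cert_expiry_metric("/path/to/cert.crt", "Jan  1 00:00:00 2024 GMT"), end="")
```

An alert could then compare the gauge with the current time in Prometheus, which keeps the expiry threshold in the alert rule rather than in the script.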
[08:42:55] <icinga-wm>	 PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:51:27] <icinga-wm>	 RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:57:13] <icinga-wm>	 PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:57:23] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10BTullis) >>! In T336042#9245252, @Stevemunene wrote: > The new druid workers have fully joined the cluster and we are ready to move on with the next steps. Great!  > From the conversations...
[08:57:36] <btullis>	 ^ I'm going to look at this behaviour of an-master1002
[08:58:12] <btullis>	 I need to reboot an-master1002 anyway for T344671
[08:58:39] <icinga-wm>	 RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:05:53] <icinga-wm>	 PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:11:41] <icinga-wm>	 RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:13:08] <btullis>	 Well, it's not happy. I've tried logging in as root over IPMI but I've not quite got a bash prompt yet.
[09:13:11] <btullis>	 https://usercontent.irccloud-cdn.com/file/Sqn8Xjy5/image.png
[09:13:15] <btullis>	 https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=an-master1002&var-datasource=thanos&var-cluster=analytics
[09:14:33] <elukey>	 https://kafka.apache.org/blog#apache_kafka_360_release_announcement - KRaft ready for production!
[09:15:59] <icinga-wm>	 PROBLEM - SSH on an-master1002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:16:00] <btullis>	 elukey: Ooh, interesting.
[09:18:17] <elukey>	 there is a nice guide to jump from any version to 3.6
[09:18:19] <btullis>	 !log power cycling an-master1002 to address unresponsiveness
[09:18:21] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:18:23] <elukey>	 that is reassuring for us 
[09:18:55] <btullis>	 Yes. Now we just have to make time to prioritise it :-)
[09:20:19] <icinga-wm>	 RECOVERY - SSH on an-master1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:31:55] <btullis>	 !log rebooting an-coord1002 for T344671
[09:31:57] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:34:42] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:51] <jinxer-wm>	 (HdfsFSImageAge) resolved: The HDFS FSImage on analytics-hadoop:an-master1001:10080 is too old. - https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Alerts#HDFS_FSImage_too_old - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&viewPanel=129&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsFSImageAge
[10:09:04] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform, 10Patch-For-Review: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 (10elukey) Tried to deploy es-internal in staging, and got:  ` {"name":"eventstreams","hostname":...
[10:15:32] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Bring druid10[09-11] into service - https://phabricator.wikimedia.org/T336042 (10Stevemunene) This does make sense, Thanks @BTullis. I shall be doing a string of patches for this
[10:34:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:49:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:01:47] <wikibugs>	 10Data-Platform-SRE, 10Data-Catalog: Fix the build process for DataHub - https://phabricator.wikimedia.org/T348738 (10BTullis)
[11:02:28] <wikibugs>	 10Data-Platform-SRE, 10Data-Catalog: Fix the build process for DataHub - https://phabricator.wikimedia.org/T348738 (10BTullis) p:05Triage→03High
[11:04:17] <wikibugs>	 10Data-Platform-SRE, 10Data-Catalog: Fix the build process for DataHub - https://phabricator.wikimedia.org/T348738 (10CodeReviewBot) btullis updated https://gitlab.wikimedia.org/repos/data-engineering/datahub/-/merge_requests/4  Configure gradle proxies for trusted runners only
[11:24:31] <joal>	 \o/ back online :)
[11:24:49] <btullis>	 Ah, we've missed you :-)
[11:25:58] <joal>	 btullis: tell me, what have I missed :)
[11:27:47] <btullis>	 joal: an-master1002 jammed up for unknown reason, rebooted now and back to normal. Unusual though.
[11:28:10] <joal>	 btullis: was it active Namenode at the time?
[11:28:47] <btullis>	 No, standby role at the time. No known impact on HDFS.
[11:29:30] <joal>	 weird :S
[11:29:57] <joal>	 Could it be related to FSimage? a moment when it was either creating the image, or copying it?
[11:30:02] <joal>	 anyhow - thanks for fixing!
[11:30:36] <wikibugs>	 10Data-Platform-SRE, 10Data-Catalog: Fix the build process for DataHub - https://phabricator.wikimedia.org/T348738 (10BTullis) I think that I have fixed this now.  I used the following syntax for the command options, which I believe works in dash: ` - -Dhttp.proxyHost=${HTTP_PROXY:+webproxy} - -Dhttps.proxyHos...
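The `${HTTP_PROXY:+webproxy}` form quoted in the task comment is the POSIX "use alternative value" expansion: it yields the word after `:+` when the variable is set and non-empty, and an empty string otherwise. A small Python emulation of that rule, as a sketch only (the helper name `expand_alternate` and the proxy URL are made up for illustration):

```python
def expand_alternate(env, name, word):
    """Mimic POSIX ${name:+word}.

    Returns `word` if `name` is set to a non-empty value in `env`,
    otherwise returns the empty string.
    """
    return word if env.get(name) else ""


if __name__ == "__main__":
    # Variable set and non-empty: the alternative word is used.
    print(repr(expand_alternate({"HTTP_PROXY": "http://proxy.example:8080"}, "HTTP_PROXY", "webproxy")))
    # Variable unset: expands to nothing.
    print(repr(expand_alternate({}, "HTTP_PROXY", "webproxy")))
    # Variable set but empty: the colon makes this expand to nothing as well.
    print(repr(expand_alternate({"HTTP_PROXY": ""}, "HTTP_PROXY", "webproxy")))
```

This is why the same command line can work on runners that define the proxy variable and on those that do not.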
[11:31:51] <btullis>	 A pleasure. I haven't investigated the cause further yet.
[11:36:59] <btullis>	 joal: I have added you as a reviewer to the chain of patches here, since they fiddle with `yarn-site.xml` and node manager config. https://gerrit.wikimedia.org/r/c/operations/puppet/+/963281
[11:37:16] <joal>	 ack btullis - checking
[11:37:26] <btullis>	 Should be ready to roll out multiple spark shufflers in parallel to the test cluster early next week.
[12:00:23] <btullis>	 !log pushing out presto version 0.283 to the test cluster.
[12:00:24] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:01:29] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (10BTullis) I'm pushing out the new version of presto to the test cluster with: ` btullis@cumin1001:~$ sudo debdeploy deploy -u 2023-10-12-presto.yaml -Q 'P{O:analytics_test_cluster::presto::ser...
[12:04:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:25:51] <brouberol>	 I have a set of 3 xs MRs as prep-work for kafka-jumbo100[1-6] retirement, for anyone interested: https://gerrit.wikimedia.org/r/c/operations/puppet/+/965159 https://gerrit.wikimedia.org/r/c/operations/puppet/+/965160 https://gerrit.wikimedia.org/r/c/operations/puppet/+/965161
[12:28:27] <brouberol>	 tx for the quick review btullis! Do we have to do anything to get https://gerrit.wikimedia.org/r/c/operations/puppet/+/965160 to apply? eg restart a service manually? Or does puppet handle everything?
[12:36:32] <brouberol>	 joal: I've added you as reviewer on https://gerrit.wikimedia.org/r/c/analytics/refinery/+/965166. Feel free to 301 if you think you're not the right person.  Thank you!
[12:38:43] <btullis>	 brouberol: It looks like puppet will restart karapace on any config file change: https://github.com/wikimedia/operations-puppet/blob/a766af194acb07336e95f47c964f65a2634695c7/modules/karapace/manifests/init.pp#L52
[12:38:52] <btullis>	 ...but there's no harm in checking.
[12:39:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:40:45] <brouberol>	 oh, sorry. This one I worked out on my own, but I was more worried about the changes to hieradata/role/common/analytics_cluster/launcher.yaml in https://gerrit.wikimedia.org/r/c/operations/puppet/+/965160
[12:42:55] <brouberol>	 grepping for `datahub_kafka_jumbo` yields no particular result, so that might just be a no-op, but I'm not 100% sure TBH
[12:44:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:00:17] <btullis>	 brouberol: Err, yes this one is a little more complicated. I *think* that the only service impacted will be restarted automatically: https://github.com/wikimedia/operations-puppet/blob/a766af194acb07336e95f47c964f65a2634695c7/modules/airflow/manifests/instance.pp#L423-L434
[13:01:03] <btullis>	 ..but you may want to `systemctl status airflow-scheduler@analytics.service` to make sure.
[13:01:33] <brouberol>	 thanks, on it
[13:02:01] <brouberol>	 that'll be on an-launc
[13:02:09] <btullis>	 Yup
[13:02:13] <brouberol>	 *an-launcher, or every an-* hosts?
[13:02:57] <btullis>	 Just an-launcher1002, because that's the host where our analytics airflow instance is running: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow/Instances#analytics
[13:03:04] <brouberol>	 on an-launcher: systemctl status airflow-scheduler@analytics.service ->  Active: active (running) since Thu 2023-10-12 12:45:10 UTC; 17min ago
[13:03:11] <brouberol>	 seems like you were right!
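The check above (reading the `Active:` line to see when the scheduler last started) can be scripted. The following is a minimal sketch that assumes the line has the same shape as the one pasted above and that the timestamp is in UTC; the function name and the "change applied" time used in the example are hypothetical.

```python
import re
from datetime import datetime, timezone

# Matches e.g. "Active: active (running) since Thu 2023-10-12 12:45:10 UTC; 17min ago"
ACTIVE_RE = re.compile(
    r"Active:\s+(?P<state>\S+)\s+\((?P<sub>[^)]+)\)\s+since\s+\w{3}\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC"
)


def parse_active_line(line):
    """Return (state, substate, start time) from a systemctl status line, or None."""
    match = ACTIVE_RE.search(line)
    if match is None:
        return None
    since = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return match["state"], match["sub"], since


if __name__ == "__main__":
    line = "Active: active (running) since Thu 2023-10-12 12:45:10 UTC; 17min ago"
    state, sub, since = parse_active_line(line)
    # Hypothetical time at which the config change was applied.
    change_applied = datetime(2023, 10, 12, 12, 40, 0, tzinfo=timezone.utc)
    print(state, sub, since.isoformat(), "restarted after change:", since > change_applied)
```

A start time later than the change suggests the service was restarted by it; the puppet report remains the authoritative record of what was actually done.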
[13:04:35] <btullis>	 Great! You can also double-check what puppet logged and did by looking at the last report: https://puppetboard.wikimedia.org/report/an-launcher1002.eqiad.wmnet/461f259564e58631552ed6b85011749591c064cf
[13:10:46] <brouberol>	 oh nice! TIL, thanks
[13:18:36] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Upgrade Presto to version 0.283 - https://phabricator.wikimedia.org/T342343 (10BTullis) I've done some basic tests on the test cluster and things appear to be working properly. ` presto> SELECT node_id,node_version FROM system.runtime.nodes;             node_id...
[13:19:38] <wikibugs>	 10Data-Platform-SRE: Reboot apifeatureusage* hosts - https://phabricator.wikimedia.org/T348418 (10bking) 05Open→03Resolved a:03bking
[13:20:25] <btullis>	 Heads-up, I'm going to reboot archiva1002 in a few minutes. Hopefully it won't affect any deployments.
[13:22:58] <btullis>	 !log rebooting archiva1002.wikimedia.org for T344671
[13:23:00] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:24:44] <wikibugs>	 10Data-Platform-SRE, 10Data-Catalog: Fix the build process for DataHub - https://phabricator.wikimedia.org/T348738 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/datahub/-/merge_requests/4  Configure gradle proxies for trusted runners only
[13:28:56] <wikibugs>	 10Quarry: Remove gerrit git from quarry puppet - https://phabricator.wikimedia.org/T348748 (10rook)
[13:29:05] <inflatador>	 btullis we're in the pairing session if you wanna join
[13:30:35] <wikibugs>	 (03CR) 10Joal: [C: 03+1] "LGTM! Let us know when you'd like to see this deployed @brouberol" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/965166 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol)
[13:31:42] <joal>	 btullis: I've +1ed your patches for the spark shuffler thing - let me know when you wish to apply to test, so that we monitor
[13:31:42] <wikibugs>	 (03CR) 10Brouberol: Remove kafka-jumbo100[1-6] brokers from bootstrap hosts (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/965166 (https://phabricator.wikimedia.org/T336044) (owner: 10Brouberol)
[13:31:45] <inflatador>	 btullis NM, we jumped out...hit me up if you are interested in pairing later though. I don't have too much on my plate
[13:32:36] <btullis>	 inflatador: Gah! Sorry, I missed your message. I'm still around for the next 30 minutes.
[13:45:17] <wikibugs>	 10Data-Platform-SRE, 10Data-Catalog: Fix the build process for DataHub - https://phabricator.wikimedia.org/T348738 (10BTullis) This worked as hoped. We can see that the `publish` stage worked on the `main` branch, which required trusted runners, and the `build` stage worked on the unprotected feature branch. {...
[13:46:14] <btullis>	 joal: I think we should merge the two no-op spark shuffler ones today, then we will be ready for the third one to enable on the test cluster next week.
[13:47:49] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/1  Create a script for packa...
[13:50:48] <joal>	 works for me btullis - Let's merge/deploy/restart today and monitor that no-op is actually no-op :)
[13:52:16] <btullis>	 Cool, just doing a final pcc run after fixing that `if` syntax  suggestion of yours: https://puppet-compiler.wmflabs.org/output/963304/44022/
[13:52:38] <joal>	 btullis: I'll be gone for kids for the next 2 hours, then back
[13:53:07] <btullis>	 OK, it's a date :-)
[13:53:18] <joal>	 :)
[13:59:53] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/2  Set the debian/rules file...
[14:00:04] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/2  Set the debian/rules file...
[14:56:27] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1007....
[14:56:42] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1007.wiki...
[14:57:14] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1007....
[15:05:39] <wikibugs>	 10Data-Engineering, 10serviceops, 10Event-Platform: Traffic for eventstreams-internal seems to be zero for the past months - https://phabricator.wikimedia.org/T348763 (10elukey)
[15:14:15] <wikibugs>	 (03PS2) 10Milimetric: Add siteinfo information to output XML [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/963836 (https://phabricator.wikimedia.org/T348761)
[15:19:41] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search (Current work): Investigate recent CirrusSearch p95 latency - https://phabricator.wikimedia.org/T347988 (10bking)
[15:31:23] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1007.wiki...
[15:33:16] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr) @bking @Papaul    I was able to change netbox to Public Vlan  redoing most of the steps for setting up...
[15:55:42] <joal>	 Here I am :)
[15:56:04] <joal>	 btullis: happy to help monitoring (while in meetings) if you wish to apply the patch
[15:56:29] <btullis>	 joal: Yes, let's do it. 
[15:58:25] <btullis>	 Oh I spotted something that shouldn't be there. Might not make a difference, but I will probably put it back.
[15:58:39] <btullis>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/963281/35/modules/profile/manifests/hadoop/worker.pp#b15
[16:12:13] <btullis>	 joal: That first one is now merged. Should be a noop.
[16:13:53] <btullis>	 second patch is now merged. Should also be a noop.
[16:14:56] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1009....
[16:15:04] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1010....
[16:15:12] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008....
[16:19:32] <wikibugs>	 10Data-Engineering, 10MediaWiki-extensions-EventLogging: Hard-deprecate mw.eventLog.inSample - https://phabricator.wikimedia.org/T348776 (10phuedx)
[16:19:52] <wikibugs>	 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10MediaWiki-extensions-WikimediaEvents: Add mw.eventLog.pageviewInSample() method - https://phabricator.wikimedia.org/T348777 (10phuedx)
[16:21:40] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) I've deployed the first two patches:  * 963281: Support multiple spark yarn shufflers in parallel | https://gerrit.wikimedia...
[16:40:05] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Decommission kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T336044 (10brouberol) ` brouberol@kafka-jumbo1010:~/topicmappr$ kafka reassign-partitions --reassignment-json-file ./webrequest_text-phase3.json --execute --throttle 60000000 kafka-reassign-partitions...
[16:44:43] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:46:35] <wikibugs>	 10Data-Engineering, 10MediaWiki-extensions-EventLogging, 10MediaWiki-extensions-WikimediaEvents: Add mw.eventLog.pageviewInSample() - https://phabricator.wikimedia.org/T348777 (10phuedx)
[16:46:43] <wikibugs>	 10Data-Engineering, 10MediaWiki-extensions-EventLogging: Hard-deprecate mw.eventLog.inSample() - https://phabricator.wikimedia.org/T348776 (10phuedx)
[17:04:50] <wikibugs>	 10Data-Platform-SRE, 10Data-Catalog: Fix the build process for DataHub - https://phabricator.wikimedia.org/T348738 (10BTullis) 05Open→03Resolved
[17:12:24] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) I have added the yarn shuffler jars to the apt repository. ` btullis@apt1001:~/spark$ sudo -i reprepro include bullseye-wiki...
[17:17:39] <wikibugs>	 10Quarry: Quarry suggests invalid database names, and doesn't suggest some valid database names - https://phabricator.wikimedia.org/T289943 (10github-toolforge-bot) siddharthvp closed https://github.com/toolforge/quarry/pull/24
[17:18:11] <wikibugs>	 10Quarry, 10cloud-services-team: Support queries against Quarry's own database and ToolsDB - https://phabricator.wikimedia.org/T151158 (10github-toolforge-bot) siddharthvp closed https://github.com/toolforge/quarry/pull/26
[17:35:07] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1009.wiki...
[17:35:11] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1010.wiki...
[17:35:19] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wiki...
[17:38:50] <ebernhardson>	 where does the code for org.wikimedia:eventutilities live? Not having any luck finding it
[17:38:55] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3): Update AQS API with September net new content data - https://phabricator.wikimedia.org/T348598 (10Ahoelzl) @lbowmaker to be prioritized for Sprint 4
[17:43:11] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1010....
[17:43:18] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1009....
[17:43:38] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008....
[17:59:44] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+2] Refactor schema structure [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965258 (https://phabricator.wikimedia.org/T346106) (owner: 10Kimberly Sarabia)
[18:00:18] <wikibugs>	 (03Merged) 10jenkins-bot: Refactor schema structure [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/965258 (https://phabricator.wikimedia.org/T346106) (owner: 10Kimberly Sarabia)
[18:01:50] <btullis>	 ebernhardson: could it be here? https://github.com/nomoa/wikimedia-event-utilities
[18:02:30] <btullis>	 I found it from the readme here: https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python
[18:03:32] <ebernhardson>	 btullis: aha, thanks! That's not the right place, but it's a fork of the right one, and its pom.xml contains the SCM information pointing at the right place. Thanks!
[18:03:51] <wikibugs>	 10Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (10xcollazo)
[18:07:06] <wikibugs>	 10Data-Platform-SRE: Upgrade Spark to a version with long term Iceberg support, and with fixes to support Dumps 2.0 - https://phabricator.wikimedia.org/T338057 (10xcollazo)
[18:14:25] <btullis>	 ebernhardson: Great! You're welcome.
[18:54:11] <wikibugs>	 10Quarry: git-crypt for config.yaml files - https://phabricator.wikimedia.org/T348476 (10github-toolforge-bot) vivian-rook closed https://github.com/toolforge/quarry/pull/27
[18:54:36] <wikibugs>	 10Quarry: git-crypt for config.yaml files - https://phabricator.wikimedia.org/T348476 (10rook) 05Open→03Resolved a:03rook
[18:56:05] <wikibugs>	 10Quarry: update readme with notes on server setup - https://phabricator.wikimedia.org/T348798 (10rook)
[18:59:54] <wikibugs>	 10Quarry: update readme with notes on server setup - https://phabricator.wikimedia.org/T348798 (10rook) https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Quarry#Deployment felt the more appropriate place.
[19:00:02] <wikibugs>	 10Quarry: update readme with notes on server setup - https://phabricator.wikimedia.org/T348798 (10rook) 05Open→03Resolved
[19:00:48] <wikibugs>	 10Quarry: Quarry suggests invalid database names, and doesn't suggest some valid database names - https://phabricator.wikimedia.org/T289943 (10rook) With PR-24 closed, should this task be closed as resolved?
[19:01:38] <wikibugs>	 10Quarry, 10cloud-services-team: Support queries against Quarry's own database and ToolsDB - https://phabricator.wikimedia.org/T151158 (10rook) With PR-26 closed, should this task be closed as resolved?
[19:03:21] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1010.wiki...
[19:03:33] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1009.wiki...
[19:03:41] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wiki...
[19:34:32] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3): Update AQS API with September net new content data - https://phabricator.wikimedia.org/T348598 (10nshahquinn-wmf) 05Open→03Resolved Yes, this is resolved! Thanks, everyone 😊
[19:45:36] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1009....
[19:45:49] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1010....
[19:49:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:04:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:26:08] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1010.wiki...
[20:26:13] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1009.wiki...
[20:38:55] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008....
[21:17:20] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (10bking) I started a data reload for hosts `wdqs1022-1024`. These are running in a tmux window under my user on `cumin1001`. Based on T323096, we expe...
[21:59:08] <wikibugs>	 10Data-Platform-SRE, 10DC-Ops, 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wiki...
[22:35:23] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Event-Platform: Test common operations in the flink operator/k8s/Flink ZK environment - https://phabricator.wikimedia.org/T342149 (10bking) 05Resolved→03In progress
[22:35:31] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10[DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (10bking)
[22:37:09] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Event-Platform: [SPIKE] Investigate what happens to deployed Flink clusters if the k8s operator goes down? - https://phabricator.wikimedia.org/T346231 (10bking) @gmodena [[ https://phabricator.wikimedia.org/T342149#92286...