[00:01:02] <Amir1>	 cdanis: Do you want me to deploy?
[00:05:21] <icinga-wm>	 PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief2002 is CRITICAL: PROCS CRITICAL: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief
[00:06:19] <icinga-wm>	 RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief2002 is OK: PROCS OK: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief
[00:10:49] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:15:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:18:59] <wikibugs>	 (PS1) Cwhite: logstash: update codfw jobs host to logging-sd2001 [puppet] - https://gerrit.wikimedia.org/r/1109188 (https://phabricator.wikimedia.org/T353912)
[00:19:01] <wikibugs>	 (PS1) Cwhite: logstash: update eqiad jobs host to logging-sd1001 [puppet] - https://gerrit.wikimedia.org/r/1109189 (https://phabricator.wikimedia.org/T353912)
[00:25:51] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:30:06] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1109149|filerepo: Fix schema compatibility constant usage (T383269)]]
[00:30:09] <stashbot>	 T383269: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'commonswiki.file' doesn't exist - https://phabricator.wikimedia.org/T383269
[00:32:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:36:29] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup, cdanis: Backport for [[gerrit:1109149|filerepo: Fix schema compatibility constant usage (T383269)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[00:36:32] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup, cdanis: Continuing with sync
[00:36:32] <stashbot>	 T383269: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'commonswiki.file' doesn't exist - https://phabricator.wikimedia.org/T383269
[00:39:04] <wikibugs>	 SRE, Commons, MediaWiki-Uploading, UploadWizard: Unable to upload images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10443249 (Bugreporter)
[00:39:13] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1109190
[00:39:13] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1109190 (owner: TrainBranchBot)
[00:39:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443250 (phaultfinder)
[00:43:15] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 210991960 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:40] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109149|filerepo: Fix schema compatibility constant usage (T383269)]] (duration: 13m 34s)
[00:43:43] <stashbot>	 T383269: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'commonswiki.file' doesn't exist - https://phabricator.wikimedia.org/T383269
[00:45:15] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 25488 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:59:24] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1109190 (owner: TrainBranchBot)
[01:00:10] <wikibugs>	 (PS1) Gergő Tisza: Create auth.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1109193 (https://phabricator.wikimedia.org/T377187)
[01:09:14] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1109194
[01:09:14] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1109194 (owner: TrainBranchBot)
[01:13:23] <icinga-wm>	 PROBLEM - snapshot of s6 in codfw on backupmon1001 is CRITICAL: snapshot for s6 at codfw (db2197) taken more than 3 days ago: Most recent backup 2025-01-06 01:09:15 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:13:26] <wikibugs>	 (CR) Ssingh: [C:+1] "Let us know when you want to deploy it and if there is anything else required from Traffic around this." [dns] - https://gerrit.wikimedia.org/r/1109193 (https://phabricator.wikimedia.org/T377187) (owner: Gergő Tisza)
[01:15:49] <wikibugs>	 (PS1) Gergő Tisza: Add Apache configuration for auth.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1109196 (https://phabricator.wikimedia.org/T377187)
[01:15:57] <wikibugs>	 (PS2) Gergő Tisza: Create auth.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1109193 (https://phabricator.wikimedia.org/T377187)
[01:16:08] <wikibugs>	 (CR) CI reject: [V:-1] Add Apache configuration for auth.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1109196 (https://phabricator.wikimedia.org/T377187) (owner: Gergő Tisza)
[01:19:26] <tgr|away>	 sukhe: I was thinking of adding https://gerrit.wikimedia.org/r/c/operations/dns/+/1109193 to the puppet window, does that work?
[01:19:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443273 (phaultfinder)
[01:20:21] <sukhe>	 tgr|away: you could, or you can simply ping us when you want to roll it out. rolling this out is quite trivial.
[01:20:24] <sukhe>	 whatever works for you
[01:20:43] <sukhe>	 there is no Puppet window for DNS changes if that is what you were asking
[01:21:09] <wikibugs>	 (PS2) Gergő Tisza: Add Apache configuration for auth.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1109196 (https://phabricator.wikimedia.org/T377187)
[01:22:04] <tgr|away>	 is it ok to deploy it before the Apache changes?
[01:24:00] <sukhe>	 as long as we are OK with the domain existing but not pointing to anything functional without the backend in place
[01:24:54] <sukhe>	 one more thing we can do: you can add the Apache patch to the Puppet window and simply ping us to merge this around that time
[01:25:27] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1109194 (owner: TrainBranchBot)
[01:26:00] <sukhe>	 the Puppet window SRE will have rights to merge this out and most SREs have rolled out DNS changes, so no issues there at all
[01:26:10] <sukhe>	 s/merge this out/merge and roll this out
[01:26:56] <tgr|away>	 thanks, I'll do that
[01:27:27] <sukhe>	 ok. please feel free to ping me if I can help (and if I am not around, just ping us in #wikimedia-traffic)
[01:28:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:32:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:37:01] <wikibugs>	 (PS2) Gergő Tisza: SUL3: Add auth domain to httpbb URL tests [puppet] - https://gerrit.wikimedia.org/r/1099339 (https://phabricator.wikimedia.org/T380574)
[01:37:23] <wikibugs>	 (PS3) Gergő Tisza: SUL3: Add auth domain to httpbb URL tests [puppet] - https://gerrit.wikimedia.org/r/1099339 (https://phabricator.wikimedia.org/T380574)
[01:38:56] <wikibugs>	 (PS2) Gergő Tisza: SUL3: Add auth domain to URL tests [mediawiki-config] - https://gerrit.wikimedia.org/r/1099338 (https://phabricator.wikimedia.org/T380574)
[01:44:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443305 (phaultfinder)
[01:46:17] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/a6ecd352cffc949f5c1f2cf2f34cbfc392034a833edcf512caaf68d4b9d9c117/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:59:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443319 (phaultfinder)
[02:04:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443320 (phaultfinder)
[02:06:17] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:07:05] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:22:05] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:24:03] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2022.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2022.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[02:24:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443344 (phaultfinder)
[02:27:07] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:27:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:37:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:38:07] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:08:06] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:09:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443374 (phaultfinder)
[03:18:06] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:24:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2008 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:29:06] <jinxer-wm>	 FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:29:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2008 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[03:34:06] <jinxer-wm>	 RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster  - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures
[03:50:15] <icinga-wm>	 PROBLEM - snapshot of s2 in codfw on backupmon1001 is CRITICAL: snapshot for s2 at codfw (db2197) taken more than 3 days ago: Most recent backup 2025-01-06 03:31:24 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[04:06:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:13:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:15:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:19:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:23:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:25:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:25:28] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:27:27] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:30:23] <icinga-wm>	 PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 14418MiB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[04:32:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:34:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:39:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:40:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:43:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:00:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:07:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:14:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443430 (phaultfinder)
[05:22:01] <icinga-wm>	 PROBLEM - snapshot of x1 in codfw on backupmon1001 is CRITICAL: snapshot for x1 at codfw (db2197) taken more than 3 days ago: Most recent backup 2025-01-06 05:08:09 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[05:25:28] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:27:27] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[05:28:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:45:55] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:49:57] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:02:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:09:45] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:11:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P71891 and previous config saved to /var/cache/conftool/dbconfig/20250109-061142-root.json
[06:14:39] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add es1042 [puppet] - https://gerrit.wikimedia.org/r/1109302 (https://phabricator.wikimedia.org/T382569)
[06:15:23] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Add es1042 [puppet] - https://gerrit.wikimedia.org/r/1109302 (https://phabricator.wikimedia.org/T382569) (owner: Marostegui)
[06:17:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es1042 depooled T382569', diff saved to https://phabricator.wikimedia.org/P71892 and previous config saved to /var/cache/conftool/dbconfig/20250109-061724-marostegui.json
[06:17:28] <stashbot>	 T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569
[06:18:07] <wikibugs>	 (PS1) Marostegui: es1042: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1109303
[06:18:50] <wikibugs>	 (CR) Marostegui: [C:+2] es1042: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1109303 (owner: Marostegui)
[06:21:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:24:03] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2022.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2022.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[06:26:00] <kart_>	 Deploying cxserver..
[06:26:28] <wikibugs>	 (CR) KartikMistry: [C:+2] Update cxserver to 2025-01-07-045930-production [deployment-charts] - https://gerrit.wikimedia.org/r/1108544 (https://phabricator.wikimedia.org/T377966) (owner: KartikMistry)
[06:26:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P71893 and previous config saved to /var/cache/conftool/dbconfig/20250109-062647-root.json
[06:27:13] <wikibugs>	 (PS1) Marostegui: es1042: Remove note [puppet] - https://gerrit.wikimedia.org/r/1109304
[06:27:38] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2025-01-07-045930-production [deployment-charts] - https://gerrit.wikimedia.org/r/1108544 (https://phabricator.wikimedia.org/T377966) (owner: KartikMistry)
[06:27:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 1%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71894 and previous config saved to /var/cache/conftool/dbconfig/20250109-062749-root.json
[06:27:53] <wikibugs>	 (CR) Marostegui: [C:+2] es1042: Remove note [puppet] - https://gerrit.wikimedia.org/r/1109304 (owner: Marostegui)
[06:30:35] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[06:31:02] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:31:07] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize es1043 [puppet] - https://gerrit.wikimedia.org/r/1109305 (https://phabricator.wikimedia.org/T382569)
[06:32:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:36:05] <icinga-wm>	 PROBLEM - Disk space on dbprov2004 is CRITICAL: DISK CRITICAL - free space: /srv 449399 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dbprov2004&var-datasource=codfw+prometheus/ops
[06:38:28] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Productionize es1043 [puppet] - https://gerrit.wikimedia.org/r/1109305 (https://phabricator.wikimedia.org/T382569) (owner: Marostegui)
[06:40:17] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:40:48] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:41:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1022 T382569', diff saved to https://phabricator.wikimedia.org/P71895 and previous config saved to /var/cache/conftool/dbconfig/20250109-064117-marostegui.json
[06:41:21] <stashbot>	 T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569
[06:41:32] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es1022.eqiad.wmnet with reason: cloning es1042
[06:41:45] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es1022.eqiad.wmnet with reason: cloning es1042
[06:41:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P71896 and previous config saved to /var/cache/conftool/dbconfig/20250109-064153-root.json
[06:42:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71897 and previous config saved to /var/cache/conftool/dbconfig/20250109-064254-root.json
[06:44:54] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:45:29] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:45:41] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db2231 [puppet] - https://gerrit.wikimedia.org/r/1109306 (https://phabricator.wikimedia.org/T373579)
[06:45:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2131 T373579', diff saved to https://phabricator.wikimedia.org/P71898 and previous config saved to /var/cache/conftool/dbconfig/20250109-064556-marostegui.json
[06:46:00] <stashbot>	 T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579
[06:46:08] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Productionize db2231 [puppet] - https://gerrit.wikimedia.org/r/1109306 (https://phabricator.wikimedia.org/T373579) (owner: Marostegui)
[06:47:00] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: cloning db2231
[06:47:14] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: cloning db2231
[06:47:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:49:08] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db2231 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1109307 (https://phabricator.wikimedia.org/T373579)
[06:49:29] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Add db2231 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1109307 (https://phabricator.wikimedia.org/T373579) (owner: Marostegui)
[06:50:23] <icinga-wm>	 PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 14180MiB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[06:51:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db2231 to dbctl depooled T373579', diff saved to https://phabricator.wikimedia.org/P71899 and previous config saved to /var/cache/conftool/dbconfig/20250109-065114-marostegui.json
[06:51:18] <stashbot>	 T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579
[06:52:16] <logmsgbot>	 !log root@cumin1002 START - Cookbook sre.mysql.clone of db2131.codfw.wmnet onto db2231.codfw.wmnet
[06:53:34] <kart_>	 !log Updated cxserver to 2025-01-07-045930-production (T377966, T377813, T381379)
[06:53:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:40] <stashbot>	 T377966: cxserver: Logstash entries seems difficult to read - https://phabricator.wikimedia.org/T377966
[06:53:41] <stashbot>	 T377813: Migrate cxserver code from CommonJS to ESM / ECMAScript - https://phabricator.wikimedia.org/T377813
[06:53:41] <stashbot>	 T381379: Post-creation work for tigwiki - https://phabricator.wikimedia.org/T381379
[06:53:43] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:54:29] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:56:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P71900 and previous config saved to /var/cache/conftool/dbconfig/20250109-065658-root.json
[06:58:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71901 and previous config saved to /var/cache/conftool/dbconfig/20250109-065759-root.json
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T0700)
[07:00:05] <jouncebot>	 marostegui, Amir1, and arnaudb: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T0700).
[07:02:22] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[07:09:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443573 (phaultfinder)
[07:12:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P71902 and previous config saved to /var/cache/conftool/dbconfig/20250109-071203-root.json
[07:13:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71903 and previous config saved to /var/cache/conftool/dbconfig/20250109-071305-root.json
[07:20:28] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:20:48] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:27:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P71904 and previous config saved to /var/cache/conftool/dbconfig/20250109-072709-root.json
[07:28:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71905 and previous config saved to /var/cache/conftool/dbconfig/20250109-072809-root.json
[07:29:12] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2014,2017].codfw.wmnet
[07:32:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2014,2017].codfw.wmnet
[07:34:45] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2014.codfw.wmnet with OS bookworm
[07:34:45] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2017.codfw.wmnet with OS bookworm
[07:34:53] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2014
[07:34:53] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2014
[07:34:54] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2017
[07:34:54] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2017
[07:39:02] <icinga-wm>	 PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:42:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P71906 and previous config saved to /var/cache/conftool/dbconfig/20250109-074214-root.json
[07:42:42] <wikibugs>	 SRE, Infrastructure-Foundations, Mail: Message sizes exceeding limits - https://phabricator.wikimedia.org/T383271#10443606 (Aklapper)
[07:43:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 6%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71907 and previous config saved to /var/cache/conftool/dbconfig/20250109-074314-root.json
[07:44:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443608 (phaultfinder)
[07:44:53] <wikibugs>	 SRE, Commons, MediaWiki-Uploading, UploadWizard: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10443609 (Aklapper)
[07:45:56] <wikibugs>	 SRE, Commons, MediaWiki-Uploading: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10443610 (Aklapper)
[07:46:45] <wikibugs>	 (PS1) Marostegui: wmnet: Switchover m1-master proxy [dns] - https://gerrit.wikimedia.org/r/1109374
[07:52:21] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2014.codfw.wmnet with reason: host reimage
[07:52:39] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2017.codfw.wmnet with reason: host reimage
[07:55:08] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2014.codfw.wmnet with reason: host reimage
[07:58:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 7%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71908 and previous config saved to /var/cache/conftool/dbconfig/20250109-075820-root.json
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T0800). nyaa~
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:00:28] <wikibugs>	 SRE, Commons, MediaWiki-Uploading: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10443634 (Underbar_dk) This is really weird: I tried uploading some other image instead, that went through fine (https://commons.wikimedia.org/wiki/File:San_yan_cho...
[08:01:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2017.codfw.wmnet with reason: host reimage
[08:06:05] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109117 (owner: Muehlenhoff)
[08:13:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 8%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71909 and previous config saved to /var/cache/conftool/dbconfig/20250109-081324-root.json
[08:15:04] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2014.codfw.wmnet with OS bookworm
[08:18:07] <wikibugs>	 sre-alert-triage, SRE-swift-storage: Alert in need of triage: Dell PowerEdge RAID Controller (instance ms-be1091) - https://phabricator.wikimedia.org/T383300 (LSobanski) NEW
[08:18:29] <wikibugs>	 sre-alert-triage, SRE-swift-storage: Alert in need of triage: Dell PowerEdge RAID Controller (instance thanos-be1005) - https://phabricator.wikimedia.org/T383301 (LSobanski) NEW
[08:19:07] <wikibugs>	 sre-alert-triage, Infrastructure-Foundations: Alert in need of triage: BGP status (instance cr2-eqord) - https://phabricator.wikimedia.org/T383302 (LSobanski) NEW
[08:19:19] <wikibugs>	 sre-alert-triage, Infrastructure-Foundations: Alert in need of triage: Netbox report network (instance netbox1003) - https://phabricator.wikimedia.org/T383303 (LSobanski) NEW
[08:20:57] <wikibugs>	 (PS1) JMeybohm: aptrepo: Add bookworm components calico329 and kubernetes131 [puppet] - https://gerrit.wikimedia.org/r/1109379 (https://phabricator.wikimedia.org/T341984)
[08:21:50] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2017.codfw.wmnet with OS bookworm
[08:22:04] <icinga-wm>	 RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:22:09] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2017.codfw.wmnet
[08:22:12] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2017.codfw.wmnet
[08:22:18] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2014.codfw.wmnet
[08:22:20] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2014.codfw.wmnet
[08:22:57] <wikibugs>	 sre-alert-triage, SRE-swift-storage: Alert in need of triage: Dell PowerEdge RAID Controller (instance thanos-be1005) - https://phabricator.wikimedia.org/T383301#10443704 (MatthewVernon)
[08:22:58] <wikibugs>	 Puppet, SRE-swift-storage, SRE-tools, DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10443703 (MatthewVernon)
[08:22:59] <wikibugs>	 sre-alert-triage, SRE-swift-storage: Alert in need of triage: Dell PowerEdge RAID Controller (instance ms-be1091) - https://phabricator.wikimedia.org/T383300#10443705 (MatthewVernon)
[08:23:47] <wikibugs>	 sre-alert-triage, SRE-swift-storage: Alert in need of triage: Dell PowerEdge RAID Controller (instance ms-be1091) - https://phabricator.wikimedia.org/T383300#10443709 (MatthewVernon) This (and the thanos one) are casualties of us still not having working tooling on the Supermicro Config J systems (see T3...
[08:24:26] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2011-2013].codfw.wmnet
[08:24:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443712 (phaultfinder)
[08:26:13] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2011-2013].codfw.wmnet
[08:28:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 9%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71910 and previous config saved to /var/cache/conftool/dbconfig/20250109-082829-root.json
[08:31:19] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2011.codfw.wmnet with OS bookworm
[08:31:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2011
[08:31:38] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2011
[08:31:49] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2013.codfw.wmnet with OS bookworm
[08:31:50] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2012.codfw.wmnet with OS bookworm
[08:32:08] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2013
[08:32:08] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2013
[08:32:09] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2012
[08:32:09] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2012
[08:33:03] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es1043.eqiad.wmnet with reason: cloning
[08:33:17] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es1043.eqiad.wmnet with reason: cloning
[08:35:18] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:35:52] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Switchover m1-master proxy [dns] - https://gerrit.wikimedia.org/r/1109374 (owner: Marostegui)
[08:36:04] <icinga-wm>	 PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:43:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71911 and previous config saved to /var/cache/conftool/dbconfig/20250109-084335-root.json
[08:44:58] <wikibugs>	 (CR) Volans: [C:+2] ownership: Traffic cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[08:49:17] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2011.codfw.wmnet with reason: host reimage
[08:49:37] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2012.codfw.wmnet with reason: host reimage
[08:49:44] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2013.codfw.wmnet with reason: host reimage
[08:50:20] <wikibugs>	 (CR) Gmodena: [C:+1] eventstreams: add wikidata & commons RDF update streams [deployment-charts] - https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921) (owner: DCausse)
[08:50:54] <wikibugs>	 (Merged) jenkins-bot: ownership: Traffic cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[08:52:41] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2011.codfw.wmnet with reason: host reimage
[08:55:33] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2012.codfw.wmnet with reason: host reimage
[08:57:31] <logmsgbot>	 !log root@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2131.codfw.wmnet onto db2231.codfw.wmnet
[08:57:46] <wikibugs>	 (CR) Volans: [C:+2] ownership: Collaboration Services cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104954 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[08:58:29] <wikibugs>	 (CR) Brouberol: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1109117 (owner: Muehlenhoff)
[08:58:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71912 and previous config saved to /var/cache/conftool/dbconfig/20250109-085840-root.json
[08:59:36] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2013.codfw.wmnet with reason: host reimage
[09:02:16] <vgutierrez>	 !log update to haproxy 2.8.13 on component thirdparty/haproxy28 bullseye-wikimedia (apt.wm.o) - T383111
[09:02:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:19] <stashbot>	 T383111: Upgrade haproxy to 2.8.13 on cp hosts - https://phabricator.wikimedia.org/T383111
[09:04:11] <wikibugs>	 (Merged) jenkins-bot: ownership: Collaboration Services cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104954 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[09:12:20] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2011.codfw.wmnet with OS bookworm
[09:12:24] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:13:13] <wikibugs>	 (PS1) Vgutierrez: hiera: remove redundant cp hosts definitions [puppet] - https://gerrit.wikimedia.org/r/1109383
[09:13:27] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4052.ulsfo.wmnet} and A:cp
[09:13:43] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109383 (owner: Vgutierrez)
[09:13:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71913 and previous config saved to /var/cache/conftool/dbconfig/20250109-091345-root.json
[09:15:23] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2012.codfw.wmnet with OS bookworm
[09:16:08] <icinga-wm>	 RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:17:39] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] P:tcpircbot: add DNS hosts to allowed CIDRs for tcpircbot [puppet] - https://gerrit.wikimedia.org/r/1109120 (owner: Ssingh)
[09:17:58] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4052.ulsfo.wmnet} and A:cp
[09:18:59] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4052.ulsfo.wmnet} and A:cp
[09:19:18] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2013.codfw.wmnet with OS bookworm
[09:19:43] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2011.codfw.wmnet
[09:19:45] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2011.codfw.wmnet
[09:19:51] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2012.codfw.wmnet
[09:19:53] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2012.codfw.wmnet
[09:19:59] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2013.codfw.wmnet
[09:20:01] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2013.codfw.wmnet
[09:20:28] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Python3-Porting: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#10443781 (Volans) @elukey good question. Surely we're not working on this but we still have python2 code around, not too much but there is. I'm...
[09:21:10] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve2001.codfw.wmnet
[09:23:17] <wikibugs>	 (PS2) Vgutierrez: hiera: remove cp4052 deprecated hiera values [puppet] - https://gerrit.wikimedia.org/r/1109383
[09:23:44] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4052.ulsfo.wmnet} and A:cp
[09:25:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:26:09] <wikibugs>	 (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109383 (owner: Vgutierrez)
[09:28:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71914 and previous config saved to /var/cache/conftool/dbconfig/20250109-092850-root.json
[09:31:55] <wikibugs>	 (CR) Vgutierrez: [C:+2] hiera: remove cp4052 deprecated hiera values [puppet] - https://gerrit.wikimedia.org/r/1109383 (owner: Vgutierrez)
[09:33:13] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on ml-serve2001 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 2, Failed: 0, Spare: 1 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T383307 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:33:18] <wikibugs>	 ops-codfw, SRE, DC-Ops: Degraded RAID on ml-serve2001 - https://phabricator.wikimedia.org/T383307 (ops-monitoring-bot) NEW
[09:36:32] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Make profile::presto::server::ferm_srange optional [puppet] - https://gerrit.wikimedia.org/r/1109117 (owner: Muehlenhoff)
[09:40:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P71915 and previous config saved to /var/cache/conftool/dbconfig/20250109-094022-root.json
[09:42:59] <wikibugs>	 (PS3) Muehlenhoff: Switch presto in test cluster to nftables-compatible firewall settings [puppet] - https://gerrit.wikimedia.org/r/1109095
[09:43:51] <wikibugs>	 (PS3) TChin: mw-content-history-reconcile-enrich: Enable K8 HA [deployment-charts] - https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176)
[09:43:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71916 and previous config saved to /var/cache/conftool/dbconfig/20250109-094355-root.json
[09:44:02] <wikibugs>	 (CR) TChin: mw-content-history-reconcile-enrich: Enable K8 HA (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: TChin)
[09:44:03] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1109379 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[09:44:03] <jinxer-wm>	 FIRING: [2x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:44:13] <wikibugs>	 (PS1) Jelto: Rename kubernetes20[53,56,58] to wikikube-worker[2192-2194] [puppet] - https://gerrit.wikimedia.org/r/1109389 (https://phabricator.wikimedia.org/T377877)
[09:44:56] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109095 (owner: Muehlenhoff)
[09:46:48] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm, thanks for updating the probe!" [deployment-charts] - https://gerrit.wikimedia.org/r/1109043 (https://phabricator.wikimedia.org/T382617) (owner: DDesouza)
[09:47:52] <wikibugs>	 (CR) Jelto: [C:+1] aptrepo: Add bookworm components calico329 and kubernetes131 [puppet] - https://gerrit.wikimedia.org/r/1109379 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[09:48:27] <wikibugs>	 (PS1) Marostegui: db2231: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1109390 (https://phabricator.wikimedia.org/T373579)
[09:50:56] <wikibugs>	 (CR) Marostegui: [C:+2] db2231: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1109390 (https://phabricator.wikimedia.org/T373579) (owner: Marostegui)
[09:51:19] <wikibugs>	 (PS4) Muehlenhoff: Switch presto in test cluster to nftables-compatible firewall settings [puppet] - https://gerrit.wikimedia.org/r/1109095
[09:52:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 1%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71917 and previous config saved to /var/cache/conftool/dbconfig/20250109-095207-root.json
[09:52:18] <wikibugs>	 (Abandoned) Btullis: Add conftool-data for dbstore hosts to dbctl [puppet] - https://gerrit.wikimedia.org/r/1108071 (https://phabricator.wikimedia.org/T382947) (owner: Btullis)
[09:52:46] <wikibugs>	 (CR) JMeybohm: [C:+1] Rename kubernetes20[53,56,58] to wikikube-worker[2192-2194] [puppet] - https://gerrit.wikimedia.org/r/1109389 (https://phabricator.wikimedia.org/T377877) (owner: Jelto)
[09:53:37] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109095 (owner: Muehlenhoff)
[09:55:18] <icinga-wm>	 RECOVERY - MD RAID on ml-serve2001 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[09:55:18] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2001.codfw.wmnet
[09:55:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P71918 and previous config saved to /var/cache/conftool/dbconfig/20250109-095527-root.json
[09:55:36] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve2001.codfw.wmnet
[09:56:04] <icinga-wm>	 RECOVERY - Disk space on dbprov2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dbprov2004&var-datasource=codfw+prometheus/ops
[09:58:04] <moritzm>	 !log installing glibc bugfix updates for Bookworm
[09:58:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:29] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1099725 (https://phabricator.wikimedia.org/T379892) (owner: Abijeet Patro)
[10:00:17] <wikibugs>	 (CR) Muehlenhoff: [C:+2] matomo: Restrict access to Envoy port [puppet] - https://gerrit.wikimedia.org/r/1103313 (owner: Muehlenhoff)
[10:01:34] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good to me. Thanks." [puppet] - https://gerrit.wikimedia.org/r/1109095 (owner: Muehlenhoff)
[10:03:08] <wikibugs>	 (CR) JMeybohm: [C:+2] aptrepo: Add bookworm components calico329 and kubernetes131 [puppet] - https://gerrit.wikimedia.org/r/1109379 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm)
[10:04:07] <wikibugs>	 (CR) Nikerabbit: [C:+1] Enable Translate message bundle Scribunto library on MetaWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1099725 (https://phabricator.wikimedia.org/T379892) (owner: Abijeet Patro)
[10:05:51] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbprov2004.codfw.wmnet with reason: os upgrade
[10:06:07] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbprov2004.codfw.wmnet with reason: os upgrade
[10:06:49] <wikibugs>	 (CR) David Caro: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1102887 (https://phabricator.wikimedia.org/T379259) (owner: Volans)
[10:07:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71919 and previous config saved to /var/cache/conftool/dbconfig/20250109-100712-root.json
[10:09:40] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2001.codfw.wmnet
[10:10:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P71920 and previous config saved to /var/cache/conftool/dbconfig/20250109-101033-root.json
[10:11:09] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[2053,2056,2058].codfw.wmnet
[10:11:20] <wikibugs>	 (CR) Ladsgroup: [C:+2] db_maint_mapper_sal.py: Update list of nicks [software] - https://gerrit.wikimedia.org/r/1108879 (owner: Marostegui)
[10:11:55] <wikibugs>	 (Merged) jenkins-bot: db_maint_mapper_sal.py: Update list of nicks [software] - https://gerrit.wikimedia.org/r/1108879 (owner: Marostegui)
[10:12:51] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[2053,2056,2058].codfw.wmnet
[10:15:41] <wikibugs>	 (CR) Volans: [C:+2] sre.hosts.upgrade-and-reboot: remove cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1102887 (https://phabricator.wikimedia.org/T379259) (owner: Volans)
[10:17:19] <wikibugs>	 (CR) Jelto: [C:+2] Rename kubernetes20[53,56,58] to wikikube-worker[2192-2194] [puppet] - https://gerrit.wikimedia.org/r/1109389 (https://phabricator.wikimedia.org/T377877) (owner: Jelto)
[10:20:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2126 db2226 T373579', diff saved to https://phabricator.wikimedia.org/P71921 and previous config saved to /var/cache/conftool/dbconfig/20250109-102010-marostegui.json
[10:20:14] <stashbot>	 T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579
[10:20:44] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2126,2187,2226].codfw.wmnet with reason: maintenance
[10:20:49] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2126,2187,2226].codfw.wmnet with reason: maintenance
[10:21:07] <wikibugs>	 (Merged) jenkins-bot: sre.hosts.upgrade-and-reboot: remove cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1102887 (https://phabricator.wikimedia.org/T379259) (owner: Volans)
[10:21:27] <marostegui>	 !log Move db2187:3312 under db2226 s2 codfw dbmaint T373579
[10:21:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71922 and previous config saved to /var/cache/conftool/dbconfig/20250109-102218-root.json
[10:23:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: user@0.service on ganeti6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:25:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 10%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71923 and previous config saved to /var/cache/conftool/dbconfig/20250109-102512-root.json
[10:25:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 10%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71924 and previous config saved to /var/cache/conftool/dbconfig/20250109-102522-root.json
[10:25:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P71925 and previous config saved to /var/cache/conftool/dbconfig/20250109-102538-root.json
[10:27:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling pc4', diff saved to https://phabricator.wikimedia.org/P71926 and previous config saved to /var/cache/conftool/dbconfig/20250109-102700-ladsgroup.json
[10:30:34] <wikibugs>	 (CR) Btullis: modules+hiera: Add module to do Ceph mounts and mount ml-lab /home (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1109044 (owner: Klausman)
[10:31:40] <jinxer-wm>	 FIRING: [3x] KubernetesRsyslogDown: rsyslog on kubernetes2053:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:34:14] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.upgrade for pc1016.eqiad.wmnet
[10:37:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71927 and previous config saved to /var/cache/conftool/dbconfig/20250109-103723-root.json
[10:38:20] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for pc1016.eqiad.wmnet
[10:40:07] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.mysql.upgrade for pc2015.codfw.wmnet
[10:40:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 25%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71928 and previous config saved to /var/cache/conftool/dbconfig/20250109-104017-root.json
[10:40:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 25%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71929 and previous config saved to /var/cache/conftool/dbconfig/20250109-104027-root.json
[10:40:33] <wikibugs>	 (CR) Fabfur: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1108875 (owner: Vgutierrez)
[10:40:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443973 (phaultfinder)
[10:40:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P71930 and previous config saved to /var/cache/conftool/dbconfig/20250109-104043-root.json
[10:40:45] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch presto in test cluster to nftables-compatible firewall settings [puppet] - https://gerrit.wikimedia.org/r/1109095 (owner: Muehlenhoff)
[10:42:36] <icinga-wm>	 PROBLEM - MariaDB Replica IO: pc4 on pc1016 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@pc2015.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on pc2015.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:42:48] <marostegui>	 Amir1: ^
[10:43:06] <Amir1>	 the cookbook should have downtimed it
[10:43:15] <Amir1>	 it's the replica, sigh
[10:43:21] <marostegui>	 Yeah I was going to say
[10:43:41] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on pc1016.eqiad.wmnet with reason: Reboot
[10:43:54] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1016.eqiad.wmnet with reason: Reboot
[10:45:11] <logmsgbot>	 !log root@cumin2002 START - Cookbook sre.puppet.renew-cert for dbprov2004.codfw.wmnet: Renew puppet certificate - root@cumin2002
[10:45:56] <logmsgbot>	 !log root@cumin2002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for dbprov2004.codfw.wmnet: Renew puppet certificate - root@cumin2002
[10:45:57] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for pc2015.codfw.wmnet
[10:47:36] <icinga-wm>	 RECOVERY - MariaDB Replica IO: pc4 on pc1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:48:14] <wikibugs>	 (PS1) Muehlenhoff: Switch Presto access for coordinators to nftables-compatible firewall settings [puppet] - https://gerrit.wikimedia.org/r/1109393
[10:51:29] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109393 (owner: Muehlenhoff)
[10:52:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71931 and previous config saved to /var/cache/conftool/dbconfig/20250109-105228-root.json
[10:52:36] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109393 (owner: Muehlenhoff)
[10:55:13] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[10:55:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 50%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71932 and previous config saved to /var/cache/conftool/dbconfig/20250109-105523-root.json
[10:55:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 50%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71933 and previous config saved to /var/cache/conftool/dbconfig/20250109-105533-root.json
[10:57:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repool pc4', diff saved to https://phabricator.wikimedia.org/P71934 and previous config saved to /var/cache/conftool/dbconfig/20250109-105708-ladsgroup.json
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1100)
[11:05:26] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:07:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 6%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71935 and previous config saved to /var/cache/conftool/dbconfig/20250109-110734-root.json
[11:10:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 75%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71936 and previous config saved to /var/cache/conftool/dbconfig/20250109-111029-root.json
[11:10:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 75%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71937 and previous config saved to /var/cache/conftool/dbconfig/20250109-111038-root.json
[11:10:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443999 (phaultfinder)
[11:11:22] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:18:02] <wikibugs>	 (PS2) Giuseppe Lavagetto: ClusterConfig: add support for dumps trait [mediawiki-config] - https://gerrit.wikimedia.org/r/1109108 (https://phabricator.wikimedia.org/T382947)
[11:18:02] <wikibugs>	 (PS2) Giuseppe Lavagetto: Use a bespoke database configuration for dumps [mediawiki-config] - https://gerrit.wikimedia.org/r/1109109 (https://phabricator.wikimedia.org/T382947)
[11:19:19] <wikibugs>	 (CR) CI reject: [V:-1] Use a bespoke database configuration for dumps [mediawiki-config] - https://gerrit.wikimedia.org/r/1109109 (https://phabricator.wikimedia.org/T382947) (owner: Giuseppe Lavagetto)
[11:22:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 7%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71938 and previous config saved to /var/cache/conftool/dbconfig/20250109-112239-root.json
[11:23:23] <wikibugs>	 SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10444031 (hnowlan)
[11:23:59] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2022.codfw.wmnet with OS bookworm
[11:24:04] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022
[11:24:04] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022
[11:24:16] <wikibugs>	 SRE, SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10444033 (MatthewVernon) There is no row in the `object` table with rowid 423322 (in any copy), the other complained-of row is extant: ` 2701219|f/f8/Kuro...
[11:25:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 100%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71939 and previous config saved to /var/cache/conftool/dbconfig/20250109-112534-root.json
[11:25:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 100%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71940 and previous config saved to /var/cache/conftool/dbconfig/20250109-112543-root.json
[11:29:49] <logmsgbot>	 !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2022.codfw.wmnet with OS bookworm
[11:34:36] <wikibugs>	 SRE, SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10444067 (MatthewVernon) Attempting the recovery operation (i.e. `sqlite3 4077d9164732d6587761ef101bcbc280.db .recover >recovered.sql`) gives us 3 files w...
[11:35:01] <wikibugs>	 (CR) Hnowlan: [C:+2] changeprop-jobqueue: remove support for video transcoding [deployment-charts] - https://gerrit.wikimedia.org/r/1108737 (https://phabricator.wikimedia.org/T355292) (owner: Hnowlan)
[11:36:32] <wikibugs>	 (Merged) jenkins-bot: changeprop-jobqueue: remove support for video transcoding [deployment-charts] - https://gerrit.wikimedia.org/r/1108737 (https://phabricator.wikimedia.org/T355292) (owner: Hnowlan)
[11:37:22] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:37:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 8%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71942 and previous config saved to /var/cache/conftool/dbconfig/20250109-113744-root.json
[11:45:12] <wikibugs>	 (CR) Fabfur: [C:+2] haproxy:benthos: produce msg compatible with our schema guidelines [puppet] - https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur)
[11:52:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 9%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71943 and previous config saved to /var/cache/conftool/dbconfig/20250109-115250-root.json
[11:53:07] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply
[11:53:19] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[11:53:45] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[11:53:56] <wikibugs>	 (CR) Btullis: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1109393 (owner: Muehlenhoff)
[11:54:15] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[11:57:51] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply
[11:58:35] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply
[11:59:13] <claime>	 hnowlan: \o/
[12:03:58] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[12:04:09] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[12:05:11] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2053 to wikikube-worker2192
[12:05:32] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[12:05:37] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[12:06:49] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[12:06:51] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Switch Presto access for coordinators to nftables-compatible firewall settings [puppet] - https://gerrit.wikimedia.org/r/1109393 (owner: Muehlenhoff)
[12:07:25] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[12:07:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71944 and previous config saved to /var/cache/conftool/dbconfig/20250109-120755-root.json
[12:08:41] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[12:12:21] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2053 to wikikube-worker2192 - jelto@cumin1002"
[12:12:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2053 to wikikube-worker2192 - jelto@cumin1002"
[12:12:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:12:58] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2192
[12:13:12] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2192
[12:13:51] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2053 to wikikube-worker2192
[12:14:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2056 to wikikube-worker2193
[12:15:02] <wikibugs>	 (PS1) Muehlenhoff: Remove access for piccardi [puppet] - https://gerrit.wikimedia.org/r/1109401
[12:15:05] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[12:18:09] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove access for piccardi [puppet] - https://gerrit.wikimedia.org/r/1109401 (owner: Muehlenhoff)
[12:18:53] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2056 to wikikube-worker2193 - jelto@cumin1002"
[12:19:38] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2056 to wikikube-worker2193 - jelto@cumin1002"
[12:19:38] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:19:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2193
[12:23:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71945 and previous config saved to /var/cache/conftool/dbconfig/20250109-122301-root.json
[12:24:11] <logmsgbot>	 !log root@cumin2002 START - Cookbook sre.idm.logout Logging Piccardi out of all services on: 2310 hosts
[12:24:29] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2193
[12:25:03] <logmsgbot>	 !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Piccardi out of all services on: 2310 hosts
[12:25:08] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2056 to wikikube-worker2193
[12:25:49] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2058 to wikikube-worker2194
[12:26:11] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[12:28:25] <jinxer-wm>	 RESOLVED: [2x] SystemdUnitFailed: user@0.service on ganeti6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:29:32] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2058 to wikikube-worker2194 - jelto@cumin1002"
[12:30:06] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2058 to wikikube-worker2194 - jelto@cumin1002"
[12:30:06] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:30:06] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2194
[12:30:17] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2194
[12:30:56] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2058 to wikikube-worker2194
[12:31:35] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2192.codfw.wmnet wikikube-worker2193.codfw.wmnet wikikube-worker2194.codfw.wmnet on all recursors
[12:31:39] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2192.codfw.wmnet wikikube-worker2193.codfw.wmnet wikikube-worker2194.codfw.wmnet on all recursors
[12:34:00] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2192.codfw.wmnet with OS bookworm
[12:34:10] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2192
[12:34:25] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[12:37:26] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2193.codfw.wmnet with OS bookworm
[12:37:36] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2193
[12:37:47] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2192 - jelto@cumin1002"
[12:37:51] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2192 - jelto@cumin1002"
[12:37:51] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:37:51] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2192.codfw.wmnet 221.48.192.10.in-addr.arpa 1.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:37:54] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[12:37:54] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2192.codfw.wmnet 221.48.192.10.in-addr.arpa 1.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:37:55] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2192
[12:38:05] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2192
[12:38:05] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2192
[12:38:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71946 and previous config saved to /var/cache/conftool/dbconfig/20250109-123806-root.json
[12:40:36] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2194.codfw.wmnet with OS bookworm
[12:40:46] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2194
[12:41:14] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2193 - jelto@cumin1002"
[12:41:19] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2193 - jelto@cumin1002"
[12:41:19] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:41:19] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2193.codfw.wmnet 62.48.192.10.in-addr.arpa 2.6.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:41:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2193.codfw.wmnet 62.48.192.10.in-addr.arpa 2.6.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:41:23] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2193
[12:41:31] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[12:41:33] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2193
[12:41:33] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2193
[12:43:22] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:43:24] <icinga-wm>	 RECOVERY - snapshot of s6 in codfw on backupmon1001 is OK: Last snapshot for s6 at codfw (db2197) taken on 2025-01-09 11:52:22 (527 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[12:43:30] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye
[12:44:56] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2194 - jelto@cumin1002"
[12:45:00] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2194 - jelto@cumin1002"
[12:45:01] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:45:01] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2194.codfw.wmnet 224.32.192.10.in-addr.arpa 4.2.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:45:04] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2194.codfw.wmnet 224.32.192.10.in-addr.arpa 4.2.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:45:04] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2194
[12:45:16] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2194
[12:45:16] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2194
[12:53:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71947 and previous config saved to /var/cache/conftool/dbconfig/20250109-125313-root.json
[12:54:28] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@9073e46]: Refine refactoring
[12:54:49] <logmsgbot>	 !log aqu@deploy2002 deploy aborted: Refine refactoring (duration: 00m 20s)
[12:55:51] <wikibugs>	 (CR) David Caro: prometheus-node-kernel-panic: use prom labels (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) (owner: FNegri)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1300)
[13:02:23] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:04:10] <wikibugs>	 SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10444250 (akosiaris)
[13:04:40] <wikibugs>	 SRE, MW-on-K8s, serviceops, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10444253 (akosiaris) I've updated the list of servers to mark out some that are to be decommissioned, namely the ones in {T383226}
[13:08:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71948 and previous config saved to /var/cache/conftool/dbconfig/20250109-130818-root.json
[13:09:30] <wikibugs>	 (PS3) Máté Szabó: Unify IPInfo access levels [mediawiki-config] - https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086)
[13:10:09] <wikibugs>	 (CR) Máté Szabó: Unify IPInfo access levels (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) (owner: Máté Szabó)
[13:11:20] <moritzm>	 !log installing sqlparse security updates
[13:11:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:38] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@9073e46]: Refine refactoring
[13:21:29] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@9073e46]: Refine refactoring (duration: 02m 51s)
[13:21:50] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1088-1092].eqiad.wmnet
[13:21:52] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1088-1092].eqiad.wmnet
[13:23:23] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:25:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:25:58] <wikibugs>	 SRE, SRE-Access-Requests, Gerrit-Privilege-Requests, Infrastructure Security, LDAP-Access-Requests: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10444361 (MoritzMuehlenhoff) In progress→Resolved a:Dzahn This looks all complete and Da...
[13:26:27] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383213#10444364 (kamila)
[13:28:23] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:29:09] <wikibugs>	 (PS1) Muehlenhoff: Switch Presto access to nftables-compatible firewall settings [puppet] - https://gerrit.wikimedia.org/r/1109411
[13:29:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10444382 (phaultfinder)
[13:31:34] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109411 (owner: Muehlenhoff)
[13:33:05] <wikibugs>	 (PS1) Muehlenhoff: Presto: Remove ferm support [puppet] - https://gerrit.wikimedia.org/r/1109412
[13:36:01] <wikibugs>	 (CR) Xcollazo: "LGTM." [deployment-charts] - https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: TChin)
[13:37:23] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:38:00] <wikibugs>	 (CR) Gmodena: [C:+1] mw-content-history-reconcile-enrich: Enable K8 HA [deployment-charts] - https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: TChin)
[13:38:22] <wikibugs>	 (PS1) Kamila Součková: kubernetes: rename mw145[7-9] to wikikube-worker* [puppet] - https://gerrit.wikimedia.org/r/1109413 (https://phabricator.wikimedia.org/T365571)
[13:39:54] <moritzm>	 !log installing jinja2 security updates
[13:39:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:23] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[13:44:03] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2022.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2022.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:44:44] <wikibugs>	 SRE, SRE-Access-Requests, Gerrit-Privilege-Requests, Infrastructure Security, LDAP-Access-Requests: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10444423 (WMDE-leszek) thank you all.
[13:51:11] <wikibugs>	 (PS1) DCausse: Add gmodena to analytics-search-users [puppet] - https://gerrit.wikimedia.org/r/1109417
[13:58:24] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2192.codfw.wmnet with OS bookworm
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1400)
[14:00:05] <jouncebot>	 abijeet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:14] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2193.codfw.wmnet with OS bookworm
[14:01:11] <Lucas_WMDE>	 o/
[14:01:15] <Lucas_WMDE>	 I can probably deploy in a few minutes
[14:02:23] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[14:04:26] <kart_>	 Lucas_WMDE: let me ping abijeet
[14:05:35] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2194.codfw.wmnet with OS bookworm
[14:06:12] <Lucas_WMDE>	 (I can deploy now btw)
[14:07:57] <kart_>	 cool.
[14:11:12] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2193.codfw.wmnet with OS bookworm
[14:11:16] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2193
[14:11:16] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2193
[14:12:05] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2194.codfw.wmnet with OS bookworm
[14:12:08] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2194
[14:12:08] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2194
[14:14:49] <wikibugs>	 (03CR) 10FNegri: prometheus-node-kernel-panic: use prom labels (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri)
[14:15:36] <Lucas_WMDE>	 maybe Nikerabbit could accompany the deployment of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1099725 if abijeet isn’t around?
[14:16:01] <Lucas_WMDE>	 or we wait / postpone
[14:16:11] <wikibugs>	 (03CR) 10Btullis: "Is there a relevant ticket that we can link to, for the record?" [puppet] - 10https://gerrit.wikimedia.org/r/1109417 (owner: 10DCausse)
[14:17:54] <kart_>	 let's wait. Nikerabbit is in a meeting and it seems abijeet is having internet issues.
[14:18:02] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] profile::tlsproxy::envoy: Explicitly configure retries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1109070 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris)
[14:18:39] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1109411 (owner: 10Muehlenhoff)
[14:18:59] <Lucas_WMDE>	 ok sure
[14:19:05] <Lucas_WMDE>	 thanks for checking
[14:19:52] <wikibugs>	 (03PS2) 10DCausse: Add gmodena to analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/1109417 (https://phabricator.wikimedia.org/T383333)
[14:19:57] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1100166 (https://phabricator.wikimedia.org/T381389) (owner: 10Cathal Mooney)
[14:19:58] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] Remove Tech News feed URL from Planet [puppet] - 10https://gerrit.wikimedia.org/r/1109131 (owner: 10Amire80)
[14:21:44] <wikibugs>	 (03CR) 10DCausse: "sure, filed T383333" [puppet] - 10https://gerrit.wikimedia.org/r/1109417 (https://phabricator.wikimedia.org/T383333) (owner: 10DCausse)
[14:23:22] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add gmodena to analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/1109417 (https://phabricator.wikimedia.org/T383333) (owner: 10DCausse)
[14:24:30] <logmsgbot>	 !log dcaro@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1012.eqiad.wmnet with OS bullseye
[14:26:53] <wikibugs>	 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10444603 (10MatthewVernon) So we have 3 database files with at least similar contents (same number of rows, inspecting differences by hand it seems to be th...
[14:27:31] <abijeet>	 hello, am I too late for the deployment window?
[14:27:38] <wikibugs>	 (03PS1) 10David Caro: cloudceph: move cloudcephosd1012 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1109424 (https://phabricator.wikimedia.org/T309789)
[14:28:52] <wikibugs>	 (03CR) 10David Caro: [C:03+2] cloudceph: move cloudcephosd1012 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1109424 (https://phabricator.wikimedia.org/T309789) (owner: 10David Caro)
[14:29:02] <abijeet>	 Poke Lucas_WMDE :-)
[14:29:08] <Lucas_WMDE>	 hi!
[14:29:09] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2193.codfw.wmnet with reason: host reimage
[14:29:11] <Lucas_WMDE>	 we can deploy now :)
[14:29:37] <abijeet>	 Thank you, and sorry for not being here on time!
[14:29:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099725 (https://phabricator.wikimedia.org/T379892) (owner: 10Abijeet Patro)
[14:30:32] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Translate message bundle Scribunto library on MetaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099725 (https://phabricator.wikimedia.org/T379892) (owner: 10Abijeet Patro)
[14:30:55] <wikibugs>	 (03PS3) 10Stevemunene: Make WikibaseQualityConstraints use split-graph query service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105879 (https://phabricator.wikimedia.org/T374021)
[14:31:01] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2194.codfw.wmnet with reason: host reimage
[14:31:35] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1099725|Enable Translate message bundle Scribunto library on MetaWiki (T379892)]]
[14:31:37] <stashbot>	 T379892: Initial roll-out of Scribunto library for accessing message bundles - https://phabricator.wikimedia.org/T379892
[14:31:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Make WikibaseQualityConstraints use split-graph query service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105879 (https://phabricator.wikimedia.org/T374021) (owner: 10Stevemunene)
[14:32:06] <logmsgbot>	 !log dcaro@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye
[14:32:41] <wikibugs>	 (03CR) 10Volans: "thanks for the patch, minor nit inline, the rest looks good." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney)
[14:33:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2193.codfw.wmnet with reason: host reimage
[14:33:54] <wikibugs>	 (03PS4) 10Stevemunene: Make WikibaseQualityConstraints use split-graph query service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105879 (https://phabricator.wikimedia.org/T374021)
[14:37:12] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2194.codfw.wmnet with reason: host reimage
[14:37:54] <logmsgbot>	 !log dcaro@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1012.eqiad.wmnet with OS bullseye
[14:38:51] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 abi, lucaswerkmeister-wmde: Backport for [[gerrit:1099725|Enable Translate message bundle Scribunto library on MetaWiki (T379892)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:38:54] <stashbot>	 T379892: Initial roll-out of Scribunto library for accessing message bundles - https://phabricator.wikimedia.org/T379892
[14:38:58] <Lucas_WMDE>	 abijeet: please test :)
[14:39:02] <abijeet>	 Lucas_WMDE, on it
[14:39:08] <wikibugs>	 (03PS4) 10Cathal Mooney: QoS rules for cloudcephosd [puppet] - 10https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501)
[14:39:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1081.eqiad.wmnet - https://phabricator.wikimedia.org/T381878#10444634 (10Jelto) >>! In T381878#10441783, @Jclark-ctr wrote: > @Jelto   i performed flea power drain and looks to im...
[14:40:47] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] P:tcpircbot: add DNS hosts to allowed CIDRs for tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/1109120 (owner: 10Ssingh)
[14:41:21] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2192.codfw.wmnet with OS bookworm
[14:41:25] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2192
[14:41:25] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2192
[14:41:42] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] QoS rules for cloudcephosd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501) (owner: 10Cathal Mooney)
[14:47:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1081.eqiad.wmnet - https://phabricator.wikimedia.org/T381878#10444684 (10Jclark-ctr) @Jelto  i am going to start flea power draining them and reimaging them  wanted to try to reso...
[14:48:02] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1081.eqiad.wmnet - https://phabricator.wikimedia.org/T381878#10444685 (10Jclark-ctr) 05Open→03Resolved
[14:50:15] <icinga-wm>	 RECOVERY - snapshot of s2 in codfw on backupmon1001 is OK: Last snapshot for s2 at codfw (db2197) taken on 2025-01-09 13:49:20 (888 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[14:51:02] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] kubernetes: rename mw145[7-9] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1109413 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková)
[14:52:29] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[1457-1459].eqiad.wmnet
[14:52:45] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2193.codfw.wmnet with OS bookworm
[14:53:30] <abijeet>	 Lucas_WMDE, all good.
[14:54:06] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 abi, lucaswerkmeister-wmde: Continuing with sync
[14:54:09] <Lucas_WMDE>	 alright, thanks!
[14:54:10] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[1457-1459].eqiad.wmnet
[14:54:15] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] kubernetes: rename mw145[7-9] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1109413 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková)
[14:56:49] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2194.codfw.wmnet with OS bookworm
[14:58:12] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1073.eqiad.wmnet with OS bookworm
[14:58:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1073.eqiad.wmnet - https://phabricator.wikimedia.org/T381789#10444725 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube...
[15:03:09] <wikibugs>	 (03CR) 10TChin: [C:03+2] mw-content-history-reconcile-enrich: Enable K8 HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin)
[15:03:46] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "🚢" [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková)
[15:04:28] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099725|Enable Translate message bundle Scribunto library on MetaWiki (T379892)]] (duration: 32m 53s)
[15:04:30] <wikibugs>	 (03Merged) 10jenkins-bot: mw-content-history-reconcile-enrich: Enable K8 HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin)
[15:04:32] <stashbot>	 T379892: Initial roll-out of Scribunto library for accessing message bundles - https://phabricator.wikimedia.org/T379892
[15:04:46] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10444746 (10phaultfinder)
[15:04:50] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková)
[15:04:57] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:05:49] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:06:06] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[15:06:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:37] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1457 to wikikube-worker1093
[15:06:57] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:07:54] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1458 to wikikube-worker1094
[15:08:17] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1459 to wikikube-worker1095
[15:09:02] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[15:09:03] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:09:16] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[15:09:21] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[15:10:27] <wikibugs>	 10ops-codfw, 06DC-Ops, 07Kubernetes: hw troubleshooting: Comm Error: backplane 0 for wikikube-worker2192.codfw.wmnet - https://phabricator.wikimedia.org/T383339 (10Jelto) 03NEW
[15:10:55] <wikibugs>	 (03PS1) 10Cathal Mooney: WMCS: Modify QoS marking for Ceph OSD heartbeat traffic [puppet] - 10https://gerrit.wikimedia.org/r/1109434 (https://phabricator.wikimedia.org/T371501)
[15:10:57] <wikibugs>	 10ops-codfw, 06DC-Ops, 07Kubernetes: hw troubleshooting: Comm Error: backplane 0 for wikikube-worker2192.codfw.wmnet - https://phabricator.wikimedia.org/T383339#10444802 (10Jelto)
[15:11:10] <wikibugs>	 (03Merged) 10jenkins-bot: create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková)
[15:11:26] <wikibugs>	 10ops-codfw, 06DC-Ops, 07Kubernetes: hw troubleshooting: Comm Error: backplane 0 for wikikube-worker2192.codfw.wmnet - https://phabricator.wikimedia.org/T383339#10444809 (10Jelto) Similar to issues in eqiad, like T381878
[15:12:42] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1458 to wikikube-worker1094 - kamila@cumin1002"
[15:12:51] <wikibugs>	 (03PS2) 10Cathal Mooney: WMCS: Modify QoS marking for Ceph OSD heartbeat traffic [puppet] - 10https://gerrit.wikimedia.org/r/1109434 (https://phabricator.wikimedia.org/T371501)
[15:13:33] <kamila_>	 jelto: caught your wikikube-worker2192.mgmt.codfw.wmnet in sync-netbox-hiera, proceeding unless you stop me
[15:13:35] <wikibugs>	 (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109434 (https://phabricator.wikimedia.org/T371501) (owner: 10Cathal Mooney)
[15:14:07] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:14:21] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1458 to wikikube-worker1094 - kamila@cumin1002"
[15:14:21] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:14:22] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1094
[15:14:43] <jelto>	 kamila_: yes please proceed. Thank you!
[15:15:08] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1057.eqiad.wmnet with OS bookworm
[15:15:15] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1073.eqiad.wmnet - https://phabricator.wikimedia.org/T381789#10444826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube...
[15:15:59] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1094
[15:16:28] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:16:29] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1095
[15:16:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:16:38] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1458 to wikikube-worker1094
[15:16:42] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1073.eqiad.wmnet with reason: host reimage
[15:17:52] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1095
[15:18:30] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1459 to wikikube-worker1095
[15:18:58] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:18:59] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1093
[15:19:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10444840 (10phaultfinder)
[15:19:59] <logmsgbot>	 !log jelto@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2192.codfw.wmnet with OS bookworm
[15:20:28] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1073.eqiad.wmnet with reason: host reimage
[15:20:30] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1093
[15:21:08] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1457 to wikikube-worker1093
[15:21:15] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1093.eqiad.wmnet wikikube-worker1094.eqiad.wmnet wikikube-worker1095.eqiad.wmnet on all recursors
[15:21:17] <wikibugs>	 06SRE, 10Ceph, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10444844 (10dcaro) Our current version of ceph does not support the `mon_use_min_delay_socket=true` option :/, so only for osds then.   To set...
[15:21:18] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1093.eqiad.wmnet wikikube-worker1094.eqiad.wmnet wikikube-worker1095.eqiad.wmnet on all recursors
[15:21:54] <jelto>	 !log homer 'lsw1-d3-codfw*' commit 'T377877'
[15:21:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:57] <stashbot>	 T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877
[15:22:08] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1109434 (https://phabricator.wikimedia.org/T371501) (owner: 10Cathal Mooney)
[15:22:11] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[15:22:17] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[15:22:52] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] WMCS: Modify QoS marking for Ceph OSD heartbeat traffic [puppet] - 10https://gerrit.wikimedia.org/r/1109434 (https://phabricator.wikimedia.org/T371501) (owner: 10Cathal Mooney)
[15:23:17] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1093.eqiad.wmnet with OS bookworm
[15:23:20] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1093
[15:23:20] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1093
[15:23:32] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1094.eqiad.wmnet with OS bookworm
[15:23:35] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1094
[15:23:36] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1094
[15:23:44] <logmsgbot>	 ����
[15:23:45] <logmsgbot>	 ����
[15:23:52] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1095.eqiad.wmnet with OS bookworm
[15:23:55] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1095
[15:23:55] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1095
[15:23:56] <wikibugs>	 10ops-codfw, 06DC-Ops, 07Kubernetes: hw troubleshooting: Comm Error: backplane 0 for wikikube-worker2192.codfw.wmnet - https://phabricator.wikimedia.org/T383339#10444858 (10Jelto) The following commands have to be executed when the host is back (just noting it down so I don't forget it):  ` cookbook sre.host...
[15:24:13] <wikibugs>	 (03PS1) 10Bking: dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109436 (https://phabricator.wikimedia.org/T380615)
[15:24:20] <sukhe>	 that was me trying to see why I can't send messages to logmsgbot from the DNS host :)
[15:24:35] <jelto>	 !log homer 'lsw1-c5-codfw*' commit 'T377877'
[15:24:37] <kamila_>	 sukhe: ok, glad I didn't break something :D 
[15:24:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:24] <jelto>	 !log homer 'cr*codfw*' commit 'T377877'
[15:25:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:34] <wikibugs>	 (03PS2) 10Bking: dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109436 (https://phabricator.wikimedia.org/T380615)
[15:27:20] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2193-2194].codfw.wmnet
[15:27:23] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2193-2194].codfw.wmnet
[15:28:02] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109436 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking)
[15:28:25] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383341 (10Jelto) 03NEW
[15:28:29] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109436 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking)
[15:28:44] <logmsgbot>	 !log testing update from dns host
[15:29:35] <wikibugs>	 (03PS1) 10Muehlenhoff: Track LDAP access for fceratto [puppet] - 10https://gerrit.wikimedia.org/r/1109439
[15:29:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10444906 (10phaultfinder)
[15:29:49] <abijeet>	 thanks, Lucas_WMDE 
[15:30:09] <logmsgbot>	 !log sukhe@dns1004: START - running authdns-update
[15:30:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Track LDAP access for fceratto [puppet] - 10https://gerrit.wikimedia.org/r/1109439 (owner: 10Muehlenhoff)
[15:31:22] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1069.eqiad.wmnet with OS bookworm
[15:31:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T381770#10444916 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube...
[15:31:47] <logmsgbot>	 !log sukhe@dns1004: END - running authdns-update
[15:32:23] <wikibugs>	 (03PS3) 10Ssingh: P:dns::auth::update: log authdns-update run to SAL [puppet] - 10https://gerrit.wikimedia.org/r/1109141
[15:33:03] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4776/co" [puppet] - 10https://gerrit.wikimedia.org/r/1109141 (owner: 10Ssingh)
[15:33:33] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1057.eqiad.wmnet with reason: host reimage
[15:33:48] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[15:33:54] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[15:37:06] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1057.eqiad.wmnet with reason: host reimage
[15:38:27] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:38:52] <wikibugs>	 (03PS1) 10Bking: dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s, part 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109441 (https://phabricator.wikimedia.org/T380615)
[15:39:26] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:39:43] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:39:43] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1073.eqiad.wmnet with OS bookworm
[15:39:52] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1073.eqiad.wmnet - https://phabricator.wikimedia.org/T381789#10444937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-wor...
[15:40:21] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1073.eqiad.wmnet - https://phabricator.wikimedia.org/T381789#10444954 (10Jclark-ctr) 05Open→03Resolved Reimaged passed with no issues
[15:44:27] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:45:53] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[15:46:00] <logmsgbot>	 !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
[15:46:37] <wikibugs>	 (03PS4) 10Ssingh: P:dns::auth::update: log authdns-update run to SAL [puppet] - 10https://gerrit.wikimedia.org/r/1109141
[15:46:39] <inflatador>	 !log bking@an-airflow1005 stopping airflow-search services as part of k8s migration T380615
[15:46:40] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1243.eqiad.wmnet with OS bookworm
[15:46:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:42] <stashbot>	 T380615: Migrate the airflow-search database to Kubernetes - https://phabricator.wikimedia.org/T380615
[15:46:48] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T383051#10445015 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host w...
[15:47:31] <wikibugs>	 (03CR) 10BBlack: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1109141 (owner: 10Ssingh)
[15:48:56] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] P:dns::auth::update: log authdns-update run to SAL [puppet] - 10https://gerrit.wikimedia.org/r/1109141 (owner: 10Ssingh)
[15:49:11] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1069.eqiad.wmnet with reason: host reimage
[15:49:49] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/search AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[15:50:09] <wikibugs>	 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10445019 (10MatthewVernon) As perhaps expected, the final transaction before the incident is a DELETE of the various thumbnails of 300px-Gascones,_molino_(1...
[15:50:18] <wikibugs>	 (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109436 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking)
[15:50:23] <icinga-wm>	 PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 14831MiB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[15:52:01] <icinga-wm>	 RECOVERY - snapshot of x1 in codfw on backupmon1001 is OK: Last snapshot for x1 at codfw (db2197) taken on 2025-01-09 15:14:09 (381 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[15:52:25] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[15:52:30] <logmsgbot>	 !log sukhe@dns1004: START - running authdns-update
[15:52:40] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply
[15:53:18] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1069.eqiad.wmnet with reason: host reimage
[15:54:07] <logmsgbot>	 !log sukhe@dns1004: END - running authdns-update
[15:55:28] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[15:55:46] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:56:03] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[15:56:04] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1057.eqiad.wmnet with OS bookworm
[15:56:14] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1073.eqiad.wmnet - https://phabricator.wikimedia.org/T381789#10445031 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube...
[15:57:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1057.eqiad.wmnet - https://phabricator.wikimedia.org/T381676#10445032 (10Jclark-ctr) 05Open→03Resolved Reimaged server without issues.  it was posted onto T381789 ticket by mistake
[15:58:04] <wikibugs>	 06SRE, 10Observability-Metrics: Add slabinfo prometheus exporter - https://phabricator.wikimedia.org/T160071#10445038 (10tappof) https://github.com/prometheus/node_exporter/pull/2376
[15:58:33] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10445039 (10MoritzMuehlenhoff)
[15:59:54] <wikibugs>	 (03PS1) 10TChin: mw-content-history-reconcile-enrich: Add HA storageDir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109448 (https://phabricator.wikimedia.org/T375176)
[16:00:05] <jouncebot>	 dduvall and dancy: May I have your attention please! Train log triage. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1600)
[16:01:31] <wikibugs>	 (03CR) 10Volans: [C:03+2] ownership: Data Platform cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104950 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans)
[16:04:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10445058 (10phaultfinder)
[16:05:52] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1243.eqiad.wmnet with reason: host reimage
[16:07:34] <wikibugs>	 (03Merged) 10jenkins-bot: ownership: Data Platform cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104950 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans)
[16:08:35] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1243.eqiad.wmnet with reason: host reimage
[16:10:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10445069 (10Jclark-ctr) Rebalanced AA breaker and BB breaker
[16:11:48] <dduvall>	 jouncebot: nowandnext
[16:11:48] <jouncebot>	 For the next 0 hour(s) and 48 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1600)
[16:11:49] <jouncebot>	 In 0 hour(s) and 48 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1700)
[16:11:51] <wikibugs>	 (03PS1) 10Bking: dse-k8s-eqiad: empty out values-postgresql-airflow-search.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109449 (https://phabricator.wikimedia.org/T380615)
[16:12:00] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:12:14] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: empty out values-postgresql-airflow-search.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109449 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking)
[16:13:27] <wikibugs>	 (03PS1) 10Jelto: Rename kubernetes20[49-52] to wikikube-worker219[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/1109453 (https://phabricator.wikimedia.org/T377877)
[16:14:53] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] Rename kubernetes20[49-52] to wikikube-worker219[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/1109453 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto)
[16:15:30] <wikibugs>	 (03PS1) 10David Caro: ceph::conf: allow passing min_delay option [puppet] - 10https://gerrit.wikimedia.org/r/1109454 (https://phabricator.wikimedia.org/T371501)
[16:15:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10445106 (10phaultfinder)
[16:15:47] <wikibugs>	 (03PS2) 10Bking: dse-k8s-eqiad: increase airflow-search pg instance disk size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109449 (https://phabricator.wikimedia.org/T380615)
[16:16:15] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] dse-k8s-eqiad: increase airflow-search pg instance disk size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109449 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking)
[16:16:19] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: increase airflow-search pg instance disk size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109449 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking)
[16:16:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ceph::conf: allow passing min_delay option [puppet] - 10https://gerrit.wikimedia.org/r/1109454 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro)
[16:19:27] <wikibugs>	 (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: increase airflow-search pg instance disk size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109449 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking)
[16:20:21] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[16:21:08] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:21:09] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1069.eqiad.wmnet with OS bookworm
[16:21:15] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply
[16:21:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T381770#10445122 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-wor...
[16:21:24] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1093.eqiad.wmnet with OS bookworm
[16:21:24] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1094.eqiad.wmnet with OS bookworm
[16:21:28] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1095.eqiad.wmnet with OS bookworm
[16:21:30] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T381770#10445123 (10Jclark-ctr) 05Open→03Resolved flea power drain and Reimaged server
[16:21:31] <dduvall>	 heads up, i'm going to use the remainder of this window to get wmf.11 back to group1
[16:23:26] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109455 (https://phabricator.wikimedia.org/T382362)
[16:23:28] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109455 (https://phabricator.wikimedia.org/T382362) (owner: 10TrainBranchBot)
[16:23:39] <wikibugs>	 (03PS4) 10Fabfur: varnish: pass WME HEAD reqs to pass for ATS [puppet] - 10https://gerrit.wikimedia.org/r/1101909 (https://phabricator.wikimedia.org/T381771)
[16:24:14] <wikibugs>	 (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109455 (https://phabricator.wikimedia.org/T382362) (owner: 10TrainBranchBot)
[16:26:26] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101909 (https://phabricator.wikimedia.org/T381771) (owner: 10Fabfur)
[16:27:04] <wikibugs>	 (03PS3) 10Brouberol: dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s, part 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109441 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking)
[16:27:09] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s, part 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109441 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking)
[16:27:40] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:27:52] <wikibugs>	 (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s, part 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109441 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking)
[16:27:57] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s, part 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109441 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking)
[16:28:32] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply
[16:29:30] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply
[16:32:42] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10445199 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr
[16:33:06] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[16:33:07] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1243.eqiad.wmnet with OS bookworm
[16:33:16] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T383051#10445203 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikik...
[16:33:33] <wikibugs>	 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10445205 (10MatthewVernon) I've copied proxy-access and server logs from the frontends and serverlog from the backends onto cumin1002 to give myself a littl...
[16:33:40] <wikibugs>	 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T383051#10445206 (10Jclark-ctr) 05Open→03Resolved Reimaged passed with no issues
[16:35:07] <wikibugs>	 (03PS1) 10JMeybohm: Support multiple kubernetes-client versions [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1109458 (https://phabricator.wikimedia.org/T341984)
[16:40:00] <logmsgbot>	 !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.11  refs T382362
[16:40:03] <stashbot>	 T382362: 1.44.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T382362
[16:43:24] <cdanis>	 dduvall: looking good?
[16:43:31] <wikibugs>	 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10445230 (10MatthewVernon) To summarise:   - 07:19:14 - final successful PUT   - 07:19:50 - final successful DELETE (recorded in databases OK)   - 07:20:28...
[16:43:38] <dduvall>	 cdanis: so far so good
[16:44:31] <cdanis>	 jouncebot: nowandnext
[16:44:32] <jouncebot>	 For the next 0 hour(s) and 15 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1600)
[16:44:32] <jouncebot>	 In 0 hour(s) and 15 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1700)
[16:44:43] <cdanis>	 dduvall: mind if I do a config deploy now?
[16:46:54] <dduvall>	 yeah, no problem
[16:47:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis)
[16:47:16] <wikibugs>	 (03CR) 10CDanis: [C:03+2] group1: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis)
[16:47:22] <cdanis>	 The change '1109133' has been rejected (Code-Review -2) by 'CDanis'
[16:47:24] <cdanis>	 lol
[16:47:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis)
[16:47:28] <wikibugs>	 (03PS2) 10David Caro: ceph::conf: allow passing min_delay option [puppet] - 10https://gerrit.wikimedia.org/r/1109454 (https://phabricator.wikimedia.org/T371501)
[16:48:02] <wikibugs>	 (03PS1) 10Hnowlan: shellbox-video: scale down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109459 (https://phabricator.wikimedia.org/T383317)
[16:48:03] <wikibugs>	 (03Merged) 10jenkins-bot: group1: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis)
[16:48:33] <logmsgbot>	 !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1109133|group1: enable OpenTelemetry exports (T340552)]]
[16:48:37] <stashbot>	 T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552
[16:50:27] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022
[16:50:43] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.dns.netbox
[16:53:42] <logmsgbot>	 !log cdanis@deploy2002 cdanis: Backport for [[gerrit:1109133|group1: enable OpenTelemetry exports (T340552)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:53:46] <stashbot>	 T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552
[16:53:57] <logmsgbot>	 !log cdanis@deploy2002 cdanis: Continuing with sync
[16:54:07] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2022 - elukey@cumin1002"
[16:54:11] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2022 - elukey@cumin1002"
[16:54:12] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:54:12] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2022.codfw.wmnet 212.32.192.10.in-addr.arpa 2.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:54:15] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2022.codfw.wmnet 212.32.192.10.in-addr.arpa 2.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[16:54:15] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2022
[16:54:17] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2022
[16:54:17] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022
[16:54:42] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022
[16:54:42] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022
[16:55:39] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2022.codfw.wmnet with OS bookworm
[16:55:43] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022
[16:55:43] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022
[17:00:04] <jouncebot>	 jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1700).
[17:00:05] <jouncebot>	 tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:27] <jhathaway>	 o/
[17:02:56] <logmsgbot>	 !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109133|group1: enable OpenTelemetry exports (T340552)]] (duration: 14m 22s)
[17:02:59] <stashbot>	 T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552
[17:03:30] <tgr|away>	 o/
[17:04:05] <tgr|away>	 jhathaway: one of the patches is for DNS rather than puppet, I hope that's OK
[17:05:41] <jhathaway>	 that is fine, however on the two puppet patches I don't see any reviews, I'm not sure if I have enough context to review them
[17:09:13] <tgr|away>	 jhathaway: do you know who should review them?
[17:10:13] <elukey>	 tgr|away: I'd suggest pinging somebody from #wikimedia-serviceops for those, so that they are aware and can provide assistance. I guess that the new vhost also needs to be deployed to the k8s pods, right?
[17:10:28] <wikibugs>	 (03PS1) 10CDanis: tweak down sample rates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109461
[17:10:56] <tgr|away>	 yeah, would that involve a different piece of code?
[17:11:29] <wikibugs>	 (03PS2) 10CDanis: tweak down sample rates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109461
[17:11:31] <elukey>	 lemme check
[17:12:34] <jhathaway>	 thanks elukey, I'll ask in serviceops
[17:12:42] <elukey>	 IIUC the config needs to be applied on the deployment servers via a puppet run, so that the corresponding yaml files for helmfile are updated
[17:13:02] <elukey>	 and after that, a deploy would need to be kicked off to refresh the httpd config
[17:14:12] <wikibugs>	 (03CR) 10CDanis: [C:03+2] tweak down sample rates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109461 (owner: 10CDanis)
[17:14:16] <elukey>	 tgr|away: not sure how urgent this is but maybe we could follow up with serviceops to gather +1s and then deploy early next week? This seems like something that needs to happen during a mediawiki maintenance window
[17:14:30] <elukey>	 Cc: jhathaway: --^
[17:14:43] <tgr|away>	 early next week would be fine
[17:14:50] <elukey>	 just to be sure
[17:15:19] <wikibugs>	 (03Merged) 10jenkins-bot: tweak down sample rates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109461 (owner: 10CDanis)
[17:15:20] <elukey>	 ack super, going afk but ping me over the next few days if anything is needed
[17:15:25] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[17:15:29] <tgr|away>	 do you mean a mediawiki infrastructure window? or should I schedule a custom one?
[17:16:01] <elukey>	 infra window yes, it seems like a good one in my opinion
[17:16:06] <elukey>	 since we are adding a vhost etc..
[17:16:22] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[17:16:23] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply
[17:16:23] <elukey>	 so in there we can pack puppet + mw deploy
[17:16:31] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2022.codfw.wmnet with reason: host reimage
[17:16:53] <tgr|away>	 thx, I'll reschedule
[17:16:57] <elukey>	 np!
[17:17:02] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply
[17:17:03] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[17:17:07] <jhathaway>	 thanks, sorry for the delay
[17:18:34] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[17:18:35] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[17:18:38] <stashbot>	 cdanis@deploy2002: Failed to log message to wiki. Somebody should check the error logs.
[17:19:13] <wikibugs>	 06SRE, 10Ceph, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10445361 (10cmooney) This seems to be working ok following the merge.  Packets are being properly matched in the iptables rules and the DSCP m...
[17:19:47] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[17:19:49] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[17:20:17] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2022.codfw.wmnet with reason: host reimage
[17:20:56] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1093.eqiad.wmnet with OS bookworm
[17:20:59] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1093
[17:20:59] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1093
[17:21:32] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[17:21:33] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[17:22:45] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[17:22:47] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[17:25:02] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[17:25:03] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[17:25:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:26:29] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[17:35:17] <wikibugs>	 (03PS1) 10CDanis: mw-*: trace sampling rate: another tweak [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109466
[17:35:47] <wikibugs>	 (03CR) 10CDanis: [C:03+2] mw-*: trace sampling rate: another tweak [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109466 (owner: 10CDanis)
[17:37:06] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1094.eqiad.wmnet with OS bookworm
[17:37:07] <wikibugs>	 (03Merged) 10jenkins-bot: mw-*: trace sampling rate: another tweak [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109466 (owner: 10CDanis)
[17:37:09] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1094
[17:37:10] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1094
[17:37:29] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
[17:37:31] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1095.eqiad.wmnet with OS bookworm
[17:37:39] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1095
[17:37:40] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1095
[17:38:42] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
[17:38:43] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
[17:39:49] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
[17:39:50] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[17:41:03] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2022.codfw.wmnet with OS bookworm
[17:41:17] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[17:41:19] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[17:42:06] <wikibugs>	 (03PS1) 10Ladsgroup: mariadb: Add file tables and OAuthRateLimiter table to the catalog [puppet] - 10https://gerrit.wikimedia.org/r/1109467 (https://phabricator.wikimedia.org/T363581)
[17:42:28] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[17:42:29] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[17:44:03] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[17:44:04] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[17:45:16] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[17:45:17] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[17:47:19] <wikibugs>	 (03PS1) 10Kamila Součková: kubernetes: reclaim eqiad videoscaler hosts [puppet] - 10https://gerrit.wikimedia.org/r/1109469 (https://phabricator.wikimedia.org/T354791)
[17:47:32] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[17:47:33] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[17:47:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] kubernetes: reclaim eqiad videoscaler hosts [puppet] - 10https://gerrit.wikimedia.org/r/1109469 (https://phabricator.wikimedia.org/T354791) (owner: 10Kamila Součková)
[17:48:57] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[17:48:58] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[17:49:38] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[17:49:39] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply
[17:50:17] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply
[17:51:23] <wikibugs>	 (03PS1) 10Kamila Součková: kubernetes: fix my previous host rename CR [puppet] - 10https://gerrit.wikimedia.org/r/1109471 (https://phabricator.wikimedia.org/T365571)
[17:53:00] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] kubernetes: fix my previous host rename CR [puppet] - 10https://gerrit.wikimedia.org/r/1109471 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková)
[17:53:33] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] kubernetes: fix my previous host rename CR [puppet] - 10https://gerrit.wikimedia.org/r/1109471 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková)
[17:55:34] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1093.eqiad.wmnet with OS bookworm
[17:55:39] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1094.eqiad.wmnet with OS bookworm
[17:55:48] <logmsgbot>	 !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1095.eqiad.wmnet with OS bookworm
[17:58:10] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1093.eqiad.wmnet with OS bookworm
[17:58:13] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1093
[17:58:13] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1093
[17:58:17] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1094.eqiad.wmnet with OS bookworm
[17:58:21] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1094
[17:58:21] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1094
[17:58:26] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1095.eqiad.wmnet with OS bookworm
[17:58:30] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1095
[17:58:30] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1095
[18:00:05] <jouncebot>	 bd808: #bothumor My software never has bugs. It just develops random features. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1800).
[18:00:05] <jouncebot>	 swfrench-wmf: OwO what's this, a deployment window?? MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1800). nyaa~
[18:00:16] <wikibugs>	 (CR) BCornwall: [C:+2] ncredir: Add wikimedia.ro/wikipedia.ro [puppet] - https://gerrit.wikimedia.org/r/1109123 (https://phabricator.wikimedia.org/T222080) (owner: BCornwall)
[18:01:12] <swfrench-wmf>	 cdanis: how are things looking in terms of the trace sampling tuning you've been doing?
[18:01:31] <swfrench-wmf>	 I have a change planned for the infra window, but can hold off for a bit if you need time
[18:03:19] <wikibugs>	 (CR) BCornwall: [C:-1] "Looks like you forgot to remove the original entries to -01 :)" [puppet] - https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080) (owner: Dzahn)
[18:08:48] <wikibugs>	 (PS1) CDanis: final tweak [deployment-charts] - https://gerrit.wikimedia.org/r/1109473
[18:13:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1093.eqiad.wmnet with reason: host reimage
[18:13:59] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1095.eqiad.wmnet with reason: host reimage
[18:14:10] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1094.eqiad.wmnet with reason: host reimage
[18:14:46] <wikibugs>	 (PS3) Scott French: mw-(web|api-ext): revert to multi-DC sizing [deployment-charts] - https://gerrit.wikimedia.org/r/1078481 (https://phabricator.wikimedia.org/T376519)
[18:17:02] <wikibugs>	 (CR) Scott French: [C:+2] mw-(web|api-ext): revert to multi-DC sizing [deployment-charts] - https://gerrit.wikimedia.org/r/1078481 (https://phabricator.wikimedia.org/T376519) (owner: Scott French)
[18:17:23] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1093.eqiad.wmnet with reason: host reimage
[18:17:37] <wikibugs>	 (CR) CDanis: [C:+2] final tweak [deployment-charts] - https://gerrit.wikimedia.org/r/1109473 (owner: CDanis)
[18:18:10] <wikibugs>	 (Merged) jenkins-bot: mw-(web|api-ext): revert to multi-DC sizing [deployment-charts] - https://gerrit.wikimedia.org/r/1078481 (https://phabricator.wikimedia.org/T376519) (owner: Scott French)
[18:19:27] <wikibugs>	 (PS2) CDanis: final tweak [deployment-charts] - https://gerrit.wikimedia.org/r/1109473
[18:20:14] <wikibugs>	 (CR) CDanis: "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1109473 (owner: CDanis)
[18:20:44] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1094.eqiad.wmnet with reason: host reimage
[18:21:28] <wikibugs>	 (Merged) jenkins-bot: final tweak [deployment-charts] - https://gerrit.wikimedia.org/r/1109473 (owner: CDanis)
[18:21:33] <swfrench-wmf>	 coordinating with cdanis out of band, we'll be deploying both patches together to mw-web and mw-api-ext once 1109473 is merged
[18:21:37] <swfrench-wmf>	 aaaand there is is
[18:21:42] <swfrench-wmf>	 *it is
[18:21:58] <cdanis>	 swfrench-wmf: merged and ready for you on deploy2002
[18:22:19] <wikibugs>	 (CR) Krinkle: ClusterConfig: add support for dumps trait (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1109108 (https://phabricator.wikimedia.org/T382947) (owner: Giuseppe Lavagetto)
[18:23:20] <wikibugs>	 (CR) Krinkle: ClusterConfig: add support for dumps trait (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1109108 (https://phabricator.wikimedia.org/T382947) (owner: Giuseppe Lavagetto)
[18:23:40] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[18:24:56] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[18:25:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1095.eqiad.wmnet with reason: host reimage
[18:26:45] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[18:27:51] <logmsgbot>	 !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[18:28:46] <swfrench-wmf>	 cdanis: FYI, I'll be spacing things out by 5-10m between eqiad and codfw just to validate that my maths weren't wildly off
[18:28:52] <cdanis>	 ack!
[18:33:43] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[18:35:22] <wikibugs>	 (CR) Cwhite: [C:+2] prometheus: add ttl option to statsd-exporter, set to 30d [puppet] - https://gerrit.wikimedia.org/r/1105971 (https://phabricator.wikimedia.org/T359497) (owner: Cwhite)
[18:35:23] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[18:36:14] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[18:36:24] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1093.eqiad.wmnet with OS bookworm
[18:37:35] <logmsgbot>	 !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[18:38:11] <swfrench-wmf>	 cdanis: all yours! mw-api-int, mw-parsoid, mw-wikifunctions remain among those updated in 1109473
[18:38:33] <swfrench-wmf>	 whoops I mean jobrunner, not parsoid :)
[18:38:43] <cdanis>	 yep :D thanks Scott!
[18:39:50] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[18:40:03] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1094.eqiad.wmnet with OS bookworm
[18:41:27] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[18:41:28] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[18:42:45] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[18:42:46] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
[18:44:02] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
[18:44:03] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
[18:44:47] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1095.eqiad.wmnet with OS bookworm
[18:45:32] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
[18:45:34] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[18:46:39] <logmsgbot>	 !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[18:46:41] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply
[18:49:58] <logmsgbot>	 !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply
[19:00:05] <jouncebot>	 dduvall and dancy: Your horoscope predicts another MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1900).
[19:05:16] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:05:22] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:07:23] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874#10445725 (VRiley-WMF) We have received the part. Will update when this is completed
[19:13:40] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10445755 (Jhancock.wm) Service Request Number: 203753434
[19:18:22] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:19:43] <wikibugs>	 (PS1) TrainBranchBot: group2 to 1.44.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/1109480 (https://phabricator.wikimedia.org/T382362)
[19:19:44] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group2 to 1.44.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/1109480 (https://phabricator.wikimedia.org/T382362) (owner: TrainBranchBot)
[19:20:28] <wikibugs>	 (Merged) jenkins-bot: group2 to 1.44.0-wmf.11 [mediawiki-config] - https://gerrit.wikimedia.org/r/1109480 (https://phabricator.wikimedia.org/T382362) (owner: TrainBranchBot)
[19:22:22] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:25:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:34:05] <logmsgbot>	 !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.11  refs T382362
[19:34:09] <stashbot>	 T382362: 1.44.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T382362
[19:38:38] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10445801 (cmooney) @Andrew I've updated the switch config for this host to also trunk the //cloud-private-b1-codfw// vlan, so should be ok on that front n...
[19:43:32] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[19:44:53] <wikibugs>	 (PS1) CDanis: bump jaeger resources [deployment-charts] - https://gerrit.wikimedia.org/r/1109481
[19:45:06] <wikibugs>	 (PS2) CDanis: bump jaeger resources [deployment-charts] - https://gerrit.wikimedia.org/r/1109481
[19:46:57] <wikibugs>	 (PS3) CDanis: bump jaeger resources [deployment-charts] - https://gerrit.wikimedia.org/r/1109481
[19:48:04] <wikibugs>	 (CR) CDanis: [C:+2] bump jaeger resources [deployment-charts] - https://gerrit.wikimedia.org/r/1109481 (owner: CDanis)
[19:49:01] <wikibugs>	 (Merged) jenkins-bot: bump jaeger resources [deployment-charts] - https://gerrit.wikimedia.org/r/1109481 (owner: CDanis)
[19:49:13] <logmsgbot>	 !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[19:50:35] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10445805 (Jhancock.wm) it is cabled up and connected to port 43 on the cloud switch
[19:51:28] <wikibugs>	 (PS1) Bking: cloudelastic: add cloudelastic10[12] into production [puppet] - https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368)
[19:52:04] <wikibugs>	 (PS1) CDanis: jaeger: 3Gi instead, 4Gi disallowed [deployment-charts] - https://gerrit.wikimedia.org/r/1109484
[19:52:17] <wikibugs>	 (PS2) Bking: cloudelastic: add cloudelastic10[12] into production [puppet] - https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368)
[19:52:25] <wikibugs>	 (CR) CDanis: [C:+2] jaeger: 3Gi instead, 4Gi disallowed [deployment-charts] - https://gerrit.wikimedia.org/r/1109484 (owner: CDanis)
[19:53:33] <wikibugs>	 (Merged) jenkins-bot: jaeger: 3Gi instead, 4Gi disallowed [deployment-charts] - https://gerrit.wikimedia.org/r/1109484 (owner: CDanis)
[19:54:13] <logmsgbot>	 !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[19:54:29] <logmsgbot>	 !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[20:01:18] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:05:34] <logmsgbot>	 !log dcausse@deploy2002 Started deploy [airflow-dags/search@718e870]: search: switch query_clicks to SparkSqlOperator
[20:05:54] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) (owner: Bking)
[20:06:01] <logmsgbot>	 !log dcausse@deploy2002 Finished deploy [airflow-dags/search@718e870]: search: switch query_clicks to SparkSqlOperator (duration: 00m 27s)
[20:10:30] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:29:36] <logmsgbot>	 !log aqu@deploy2002 Started deploy [airflow-dags/analytics@0e4370e]: Canary event fix
[20:30:59] <logmsgbot>	 !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@0e4370e]: Canary event fix (duration: 01m 23s)
[20:38:30] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[20:39:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, and 2 others: Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10445898 (bking) We are currently getting [[ https://puppet-compiler.wmflabs.org/output/1109483/2700/cloudelastic1011.eqiad.wmnet/change.cloudelastic1011.e...
[20:40:26] <wikibugs>	 (PS3) Dzahn: certificates: add wiki[m|p]edia.ro to ncredir Letsencrypt cert 7 [puppet] - https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080)
[20:40:32] <wikibugs>	 (CR) Dzahn: "ugh, yea, I did. fixed!" [puppet] - https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080) (owner: Dzahn)
[20:40:47] <wikibugs>	 SRE, Domains, Traffic, Patch-For-Review: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080#10445900 (BCornwall) @CRoslof I'm noticing that both wikimedia.org and wikipedia.ro have duplicate MarkMonitor entries - Could you please remove the inactive second one? Thanks!
[20:51:12] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:54:15] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1104732 (https://phabricator.wikimedia.org/T381379) (owner: Pppery)
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T2100)
[21:00:05] <jouncebot>	 Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:10] <Pppery>	 here
[21:07:30] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:13:22] <icinga-wm>	 RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 78.50 ms
[21:18:59] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) (owner: Bking)
[21:19:46] <icinga-wm>	 PROBLEM - Host mr1-esams.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[21:25:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:27:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:32:32] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:33:34] <wikibugs>	 SRE, observability, Observability-Logging, Wikimedia-Logstash, Patch-For-Review: Reduce the number of fields declared in OpenSearch by logstash - https://phabricator.wikimedia.org/T180051#10446087 (andrea.denisse)
[21:35:26] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:36:22] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:39:16] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[21:40:28] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:43:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Sync cloudelastic1011 status change after Netbox update - bking@cumin2002 - T378368"
[21:43:10] <stashbot>	 T378368: Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368
[21:44:24] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Sync cloudelastic1011 status change after Netbox update - bking@cumin2002 - T378368"
[21:44:45] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) (owner: Bking)
[21:48:59] <wikibugs>	 (CR) Bartosz Dziewoński: [C:+1] Update French wikinews license to CC-BY-SA 4.0 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: Dreamrimmer)
[21:53:13] <wikibugs>	 SRE, Observability-Logging, serviceops, WMF-General-or-Unknown: Re-consider ` >/dev/null 2>&1` as output of many cron'd MW maintenance scripts - https://phabricator.wikimedia.org/T187078#10446147 (andrea.denisse) I think that having a list of the MW maintenance scripts that have this behavior wou...
[22:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T2200)
[22:02:17] <wikibugs>	 SRE, Observability-Metrics: Include apache_exporter in puppet module httpd (was: apache) - https://phabricator.wikimedia.org/T187434#10446170 (andrea.denisse) Hi,  I’m having trouble understanding the goal of this task. Could you clarify if it involves adding an include profile::prometheus::apache_export...
[22:03:57] <wikibugs>	 SRE, Observability-Logging, Wikimedia-Apache-configuration: Gain visibility into httpd mod_proxy actions - https://phabricator.wikimedia.org/T188601#10446174 (andrea.denisse) Is this related to T187434 ?
[22:05:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:09:32] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:14:16] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:22:22] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:34:30] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) (owner: Bking)
[22:38:45] <wikibugs>	 SRE, SRE-Access-Requests: Requesting shell access to analytics-privatedata for Katherine Graessle - https://phabricator.wikimedia.org/T383241#10446369 (Dzahn) Hello @Kgraessle   it looks to me like you already have shell access, an SSH key and membership in analytics-privatedata-users.  Could you share d...
[22:53:11] <wikibugs>	 (PS3) Bking: cloudelastic: add cloudelastic10[12] into production [puppet] - https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368)
[22:55:22] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) (owner: Bking)
[22:58:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:00:09] <inflatador>	 !log bking@puppetserver1001:~$ sudo /usr/local/sbin/puppet-facts-upload --proxy http://webproxy.eqiad.wmnet:8080 T378368
[23:00:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:12] <stashbot>	 T378368: Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368
[23:00:23] <inflatador>	 !log bking@pcc-db1002.puppet-diffs.eqiad1.wikimedia.cloud sudo -u jenkins-deploy /usr/local/sbin/pcc_facts_processor T378368
[23:00:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:22] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:02:32] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:22:22] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:29:16] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:31:32] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:44:16] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[23:55:32] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process