[00:01:02] cdanis: Do you want me to deploy? [00:05:21] PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief2002 is CRITICAL: PROCS CRITICAL: 1 process with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [00:06:19] RECOVERY - Ensure acme-chief-backend is running only in the active node on acmechief2002 is OK: PROCS OK: 0 processes with args acme-chief-backend https://wikitech.wikimedia.org/wiki/Acme-chief [00:10:49] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:15:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:18:21] PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:18:59] (PS1) Cwhite: logstash: update codfw jobs host to logging-sd2001 [puppet] - https://gerrit.wikimedia.org/r/1109188 (https://phabricator.wikimedia.org/T353912) [00:19:01] (PS1) Cwhite: logstash: update eqiad jobs host to logging-sd1001 [puppet] - https://gerrit.wikimedia.org/r/1109189 (https://phabricator.wikimedia.org/T353912) [00:25:51] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:30:06] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1109149|filerepo: Fix schema compatibility constant usage (T383269)]] [00:30:09] T383269: Wikimedia\Rdbms\DBQueryError: Error 1146: Table
'commonswiki.file' doesn't exist - https://phabricator.wikimedia.org/T383269 [00:32:21] RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:36:29] !log ladsgroup@deploy2002 ladsgroup, cdanis: Backport for [[gerrit:1109149|filerepo: Fix schema compatibility constant usage (T383269)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [00:36:32] !log ladsgroup@deploy2002 ladsgroup, cdanis: Continuing with sync [00:36:32] T383269: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'commonswiki.file' doesn't exist - https://phabricator.wikimedia.org/T383269 [00:39:04] SRE, Commons, MediaWiki-Uploading, UploadWizard: Unable to upload images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10443249 (Bugreporter) [00:39:13] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1109190 [00:39:13] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1109190 (owner: TrainBranchBot) [00:39:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443250 (phaultfinder) [00:43:15] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 210991960 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:43:40] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109149|filerepo: Fix schema compatibility constant usage (T383269)]] (duration: 13m 34s) [00:43:43] T383269: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'commonswiki.file' doesn't exist - https://phabricator.wikimedia.org/T383269 [00:45:15]
RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 25488 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:59:24] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1109190 (owner: TrainBranchBot) [01:00:10] (PS1) Gergő Tisza: Create auth.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1109193 (https://phabricator.wikimedia.org/T377187) [01:09:14] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1109194 [01:09:14] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1109194 (owner: TrainBranchBot) [01:13:23] PROBLEM - snapshot of s6 in codfw on backupmon1001 is CRITICAL: snapshot for s6 at codfw (db2197) taken more than 3 days ago: Most recent backup 2025-01-06 01:09:15 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:13:26] (CR) Ssingh: [C:+1] "Let us know when you want to deploy it and if there is anything else required from Traffic around this."
[dns] - https://gerrit.wikimedia.org/r/1109193 (https://phabricator.wikimedia.org/T377187) (owner: Gergő Tisza) [01:15:49] (PS1) Gergő Tisza: Add Apache configuration for auth.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1109196 (https://phabricator.wikimedia.org/T377187) [01:15:57] (PS2) Gergő Tisza: Create auth.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1109193 (https://phabricator.wikimedia.org/T377187) [01:16:08] (CR) CI reject: [V:-1] Add Apache configuration for auth.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1109196 (https://phabricator.wikimedia.org/T377187) (owner: Gergő Tisza) [01:19:26] sukhe: I was thinking of adding https://gerrit.wikimedia.org/r/c/operations/dns/+/1109193 to the puppet window, does that work? [01:19:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443273 (phaultfinder) [01:20:21] tgr|away: you could, or you can simply ping us when you want to roll it out. rolling this out is quite trivial. [01:20:24] whatever works for you [01:20:43] there is no Puppet window for DNS changes if that is what you were asking [01:21:09] (PS2) Gergő Tisza: Add Apache configuration for auth.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1109196 (https://phabricator.wikimedia.org/T377187) [01:22:04] is it ok to deploy it before the Apache changes?
[01:24:00] as long as we are OK with the domain existing but not pointing to anything functional without the backend in place [01:24:54] there is one more thing we can do is that you can add the apache patch to the Puppet window and simply ping us to merge this around that time [01:25:27] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1109194 (owner: TrainBranchBot) [01:26:00] the Puppet window SRE will have rights to merge this out and most SREs have rolled out DNS changes, so no issues there at all [01:26:10] s/merge this out/merge and roll this out [01:26:56] thanks, I'll do that [01:27:27] ok. please feel free to ping me if I can help (and if I am not around, just ping us in #wikimedia-traffic) [01:28:21] PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:32:21] RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [01:37:01] (PS2) Gergő Tisza: SUL3: Add auth domain to httpbb URL tests [puppet] - https://gerrit.wikimedia.org/r/1099339 (https://phabricator.wikimedia.org/T380574) [01:37:23] (PS3) Gergő Tisza: SUL3: Add auth domain to httpbb URL tests [puppet] - https://gerrit.wikimedia.org/r/1099339 (https://phabricator.wikimedia.org/T380574) [01:38:56] (PS2) Gergő Tisza: SUL3: Add auth domain to URL tests [mediawiki-config] - https://gerrit.wikimedia.org/r/1099338 (https://phabricator.wikimedia.org/T380574) [01:44:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443305 (phaultfinder) [01:46:17] PROBLEM -
Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/a6ecd352cffc949f5c1f2cf2f34cbfc392034a833edcf512caaf68d4b9d9c117/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:59:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443319 (phaultfinder) [02:04:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443320 (phaultfinder) [02:06:17] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:07:05] PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:22:05] RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:24:03] FIRING: KubernetesCalicoDown: wikikube-worker2022.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2022.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:24:43] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443344 (phaultfinder) [02:27:07] PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes
with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:27:21] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:37:21] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [02:38:07] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [03:08:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [03:09:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443374 (phaultfinder) [03:18:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [03:24:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues -
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2008 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:29:06] FIRING: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [03:29:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker2008 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:34:06] RESOLVED: MediaWikiEditFailures: Elevated MediaWiki edit failures (conflict) for cluster - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000208/edit-count?orgId=1&viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiEditFailures [03:50:15] PROBLEM - snapshot of s2 in codfw on backupmon1001 is CRITICAL: snapshot for s2 at codfw (db2197) taken more than 3 days ago: Most recent backup 2025-01-06 03:31:24 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [04:06:21] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:13:21] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:15:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:19:21] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:23:21] PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:25:21] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:25:28] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:27:27] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:30:23] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 14418MiB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [04:32:21] RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args 
org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:34:21] PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:39:21] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:40:21] RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [04:43:21] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:00:21] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:07:21] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:14:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443430 (phaultfinder) [05:22:01] PROBLEM - snapshot of x1 in codfw
on backupmon1001 is CRITICAL: snapshot for x1 at codfw (db2197) taken more than 3 days ago: Most recent backup 2025-01-06 05:08:09 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [05:25:28] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:27:27] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:28:21] PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [05:45:55] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:49:57] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:02:21] RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:09:45] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [06:11:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P71891 and 
previous config saved to /var/cache/conftool/dbconfig/20250109-061142-root.json [06:14:39] (PS1) Marostegui: instances.yaml: Add es1042 [puppet] - https://gerrit.wikimedia.org/r/1109302 (https://phabricator.wikimedia.org/T382569) [06:15:23] (CR) Marostegui: [C:+2] instances.yaml: Add es1042 [puppet] - https://gerrit.wikimedia.org/r/1109302 (https://phabricator.wikimedia.org/T382569) (owner: Marostegui) [06:17:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es1042 depooled T382569', diff saved to https://phabricator.wikimedia.org/P71892 and previous config saved to /var/cache/conftool/dbconfig/20250109-061724-marostegui.json [06:17:28] T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569 [06:18:07] (PS1) Marostegui: es1042: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1109303 [06:18:50] (CR) Marostegui: [C:+2] es1042: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1109303 (owner: Marostegui) [06:21:21] PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:24:03] FIRING: KubernetesCalicoDown: wikikube-worker2022.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2022.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:26:00] Deploying cxserver..
[06:26:28] (CR) KartikMistry: [C:+2] Update cxserver to 2025-01-07-045930-production [deployment-charts] - https://gerrit.wikimedia.org/r/1108544 (https://phabricator.wikimedia.org/T377966) (owner: KartikMistry) [06:26:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P71893 and previous config saved to /var/cache/conftool/dbconfig/20250109-062647-root.json [06:27:13] (PS1) Marostegui: es1042: Remove note [puppet] - https://gerrit.wikimedia.org/r/1109304 [06:27:38] (Merged) jenkins-bot: Update cxserver to 2025-01-07-045930-production [deployment-charts] - https://gerrit.wikimedia.org/r/1108544 (https://phabricator.wikimedia.org/T377966) (owner: KartikMistry) [06:27:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 1%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71894 and previous config saved to /var/cache/conftool/dbconfig/20250109-062749-root.json [06:27:53] (CR) Marostegui: [C:+2] es1042: Remove note [puppet] - https://gerrit.wikimedia.org/r/1109304 (owner: Marostegui) [06:30:35] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [06:31:02] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:31:07] (PS1) Marostegui: mariadb: Productionize es1043 [puppet] - https://gerrit.wikimedia.org/r/1109305 (https://phabricator.wikimedia.org/T382569) [06:32:21] RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:36:05] PROBLEM - Disk space on dbprov2004 is CRITICAL: DISK CRITICAL - free space: /srv 449399 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dbprov2004&var-datasource=codfw+prometheus/ops [06:38:28] (CR) Marostegui: [C:+2] mariadb: Productionize es1043 [puppet] - https://gerrit.wikimedia.org/r/1109305 (https://phabricator.wikimedia.org/T382569) (owner: Marostegui) [06:40:17] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:40:48] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:41:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1022 T382569', diff saved to https://phabricator.wikimedia.org/P71895 and previous config saved to /var/cache/conftool/dbconfig/20250109-064117-marostegui.json [06:41:21] T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569 [06:41:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es1022.eqiad.wmnet with reason: cloning es1042 [06:41:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es1022.eqiad.wmnet with reason: cloning es1042 [06:41:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P71896 and previous config saved to /var/cache/conftool/dbconfig/20250109-064153-root.json [06:42:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71897 and previous config saved to /var/cache/conftool/dbconfig/20250109-064254-root.json [06:44:54] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [06:45:29] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:45:41] (PS1) Marostegui: mariadb: Productionize db2231 [puppet] - https://gerrit.wikimedia.org/r/1109306 (https://phabricator.wikimedia.org/T373579) [06:45:57] !log marostegui@cumin1002 dbctl commit
(dc=all): 'Depool db2131 T373579', diff saved to https://phabricator.wikimedia.org/P71898 and previous config saved to /var/cache/conftool/dbconfig/20250109-064556-marostegui.json [06:46:00] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [06:46:08] (CR) Marostegui: [C:+2] mariadb: Productionize db2231 [puppet] - https://gerrit.wikimedia.org/r/1109306 (https://phabricator.wikimedia.org/T373579) (owner: Marostegui) [06:47:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: cloning db2231 [06:47:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2131.codfw.wmnet with reason: cloning db2231 [06:47:21] PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [06:49:08] (PS1) Marostegui: instances.yaml: Add db2231 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1109307 (https://phabricator.wikimedia.org/T373579) [06:49:29] (CR) Marostegui: [C:+2] instances.yaml: Add db2231 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1109307 (https://phabricator.wikimedia.org/T373579) (owner: Marostegui) [06:50:23] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 14180MiB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [06:51:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db2231 to dbctl depooled T373579', diff saved to https://phabricator.wikimedia.org/P71899 and previous config saved to /var/cache/conftool/dbconfig/20250109-065114-marostegui.json [06:51:18] T373579: Productionize db22[21-40] -
https://phabricator.wikimedia.org/T373579 [06:52:16] !log root@cumin1002 START - Cookbook sre.mysql.clone of db2131.codfw.wmnet onto db2231.codfw.wmnet [06:53:34] !log Updated cxserver to 2025-01-07-045930-production (T377966, T377813, T381379) [06:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:40] T377966: cxserver: Logstash entries seems difficult to read - https://phabricator.wikimedia.org/T377966 [06:53:41] T377813: Migrate cxserver code from CommonJS to ESM / ECMAScript - https://phabricator.wikimedia.org/T377813 [06:53:41] T381379: Post-creation work for tigwiki - https://phabricator.wikimedia.org/T381379 [06:53:43] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:54:29] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:56:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P71900 and previous config saved to /var/cache/conftool/dbconfig/20250109-065658-root.json [06:58:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71901 and previous config saved to /var/cache/conftool/dbconfig/20250109-065759-root.json [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T0700) [07:00:05] marostegui, Amir1, and arnaudb: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T0700). 
[07:02:22] RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:09:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443573 (phaultfinder) [07:12:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P71902 and previous config saved to /var/cache/conftool/dbconfig/20250109-071203-root.json [07:13:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71903 and previous config saved to /var/cache/conftool/dbconfig/20250109-071305-root.json [07:20:28] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 129, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:20:48] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:27:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P71904 and previous config saved to /var/cache/conftool/dbconfig/20250109-072709-root.json [07:28:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71905 and previous config saved to /var/cache/conftool/dbconfig/20250109-072809-root.json [07:29:12] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2014,2017].codfw.wmnet [07:32:58] !log
jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2014,2017].codfw.wmnet [07:34:45] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2014.codfw.wmnet with OS bookworm [07:34:45] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2017.codfw.wmnet with OS bookworm [07:34:53] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2014 [07:34:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2014 [07:34:54] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2017 [07:34:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2017 [07:39:02] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:42:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P71906 and previous config saved to /var/cache/conftool/dbconfig/20250109-074214-root.json [07:42:42] SRE, Infrastructure-Foundations, Mail: Message sizes exceeding limits - https://phabricator.wikimedia.org/T383271#10443606 (Aklapper) [07:43:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 6%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71907 and previous config saved to /var/cache/conftool/dbconfig/20250109-074314-root.json [07:44:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443608 (phaultfinder) [07:44:53] SRE, Commons, MediaWiki-Uploading, UploadWizard: HTTP 503 error when
uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10443609 (Aklapper) [07:45:56] SRE, Commons, MediaWiki-Uploading: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10443610 (Aklapper) [07:46:45] (PS1) Marostegui: wmnet: Switchover m1-master proxy [dns] - https://gerrit.wikimedia.org/r/1109374 [07:52:21] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2014.codfw.wmnet with reason: host reimage [07:52:39] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2017.codfw.wmnet with reason: host reimage [07:55:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2014.codfw.wmnet with reason: host reimage [07:58:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 7%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71908 and previous config saved to /var/cache/conftool/dbconfig/20250109-075820-root.json [08:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T0800). nyaa~ [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:00:28] SRE, Commons, MediaWiki-Uploading: HTTP 503 error when uploading images on Wikimedia Commons - https://phabricator.wikimedia.org/T383274#10443634 (Underbar_dk) This is really weird: I tried uploading some other image instead, that went through fine (https://commons.wikimedia.org/wiki/File:San_yan_cho...
[08:01:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2017.codfw.wmnet with reason: host reimage [08:06:05] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109117 (owner: Muehlenhoff) [08:13:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 8%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71909 and previous config saved to /var/cache/conftool/dbconfig/20250109-081324-root.json [08:15:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2014.codfw.wmnet with OS bookworm [08:18:07] sre-alert-triage, SRE-swift-storage: Alert in need of triage: Dell PowerEdge RAID Controller (instance ms-be1091) - https://phabricator.wikimedia.org/T383300 (LSobanski) NEW [08:18:29] sre-alert-triage, SRE-swift-storage: Alert in need of triage: Dell PowerEdge RAID Controller (instance thanos-be1005) - https://phabricator.wikimedia.org/T383301 (LSobanski) NEW [08:19:07] sre-alert-triage, Infrastructure-Foundations: Alert in need of triage: BGP status (instance cr2-eqord) - https://phabricator.wikimedia.org/T383302 (LSobanski) NEW [08:19:19] sre-alert-triage, Infrastructure-Foundations: Alert in need of triage: Netbox report network (instance netbox1003) - https://phabricator.wikimedia.org/T383303 (LSobanski) NEW [08:20:57] (PS1) JMeybohm: aptrepo: Add bookworm components calico329 and kubernetes131 [puppet] - https://gerrit.wikimedia.org/r/1109379 (https://phabricator.wikimedia.org/T341984) [08:21:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2017.codfw.wmnet with OS bookworm [08:22:04] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:22:09] !log jelto@cumin1002 START -
Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2017.codfw.wmnet [08:22:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2017.codfw.wmnet [08:22:18] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2014.codfw.wmnet [08:22:20] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2014.codfw.wmnet [08:22:57] sre-alert-triage, SRE-swift-storage: Alert in need of triage: Dell PowerEdge RAID Controller (instance thanos-be1005) - https://phabricator.wikimedia.org/T383301#10443704 (MatthewVernon) [08:22:58] Puppet, SRE-swift-storage, SRE-tools, DC-Ops, and 2 others: RAID monitoring on new hardware spec requires new or updated user space cli tool - https://phabricator.wikimedia.org/T377853#10443703 (MatthewVernon) [08:22:59] sre-alert-triage, SRE-swift-storage: Alert in need of triage: Dell PowerEdge RAID Controller (instance ms-be1091) - https://phabricator.wikimedia.org/T383300#10443705 (MatthewVernon) [08:23:47] sre-alert-triage, SRE-swift-storage: Alert in need of triage: Dell PowerEdge RAID Controller (instance ms-be1091) - https://phabricator.wikimedia.org/T383300#10443709 (MatthewVernon) This (and the thanos one) are casualties of us still not having working tooling on the Supermicro Config J systems (see T3...
[08:24:26] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2011-2013].codfw.wmnet [08:24:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443712 (phaultfinder) [08:26:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2011-2013].codfw.wmnet [08:28:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 9%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71910 and previous config saved to /var/cache/conftool/dbconfig/20250109-082829-root.json [08:31:19] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2011.codfw.wmnet with OS bookworm [08:31:38] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2011 [08:31:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2011 [08:31:49] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2013.codfw.wmnet with OS bookworm [08:31:50] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2012.codfw.wmnet with OS bookworm [08:32:08] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2013 [08:32:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2013 [08:32:09] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2012 [08:32:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2012 [08:33:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on es1043.eqiad.wmnet with reason: cloning [08:33:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es1043.eqiad.wmnet with reason: cloning [08:35:18] PROBLEM -
BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:35:52] (CR) Marostegui: [C:+2] wmnet: Switchover m1-master proxy [dns] - https://gerrit.wikimedia.org/r/1109374 (owner: Marostegui) [08:36:04] PROBLEM - BGP status on lsw1-a5-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:43:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71911 and previous config saved to /var/cache/conftool/dbconfig/20250109-084335-root.json [08:44:58] (CR) Volans: [C:+2] ownership: Traffic cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) (owner: Volans) [08:49:17] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2011.codfw.wmnet with reason: host reimage [08:49:37] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2012.codfw.wmnet with reason: host reimage [08:49:44] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2013.codfw.wmnet with reason: host reimage [08:50:20] (CR) Gmodena: [C:+1] eventstreams: add wikidata & commons RDF update streams [deployment-charts] - https://gerrit.wikimedia.org/r/1105919 (https://phabricator.wikimedia.org/T374921) (owner: DCausse) [08:50:54] (Merged) jenkins-bot: ownership: Traffic cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) (owner: Volans) [08:52:41] !log jelto@cumin1002 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2011.codfw.wmnet with reason: host reimage [08:55:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2012.codfw.wmnet with reason: host reimage [08:57:31] !log root@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2131.codfw.wmnet onto db2231.codfw.wmnet [08:57:46] (CR) Volans: [C:+2] ownership: Collaboration Services cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104954 (https://phabricator.wikimedia.org/T379258) (owner: Volans) [08:58:29] (CR) Brouberol: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1109117 (owner: Muehlenhoff) [08:58:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71912 and previous config saved to /var/cache/conftool/dbconfig/20250109-085840-root.json [08:59:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2013.codfw.wmnet with reason: host reimage [09:02:16] !log update to haproxy 2.8.13 on component thirdparty/haproxy28 bullseye-wikimedia (apt.wm.o) - T383111 [09:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:19] T383111: Upgrade haproxy to 2.8.13 on cp hosts - https://phabricator.wikimedia.org/T383111 [09:04:11] (Merged) jenkins-bot: ownership: Collaboration Services cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104954 (https://phabricator.wikimedia.org/T379258) (owner: Volans) [09:12:20] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2011.codfw.wmnet with OS bookworm [09:12:24] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:13:13] (PS1) Vgutierrez: hiera: remove
redundant cp hosts definitions [puppet] - https://gerrit.wikimedia.org/r/1109383 [09:13:27] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4052.ulsfo.wmnet} and A:cp [09:13:43] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109383 (owner: Vgutierrez) [09:13:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71913 and previous config saved to /var/cache/conftool/dbconfig/20250109-091345-root.json [09:15:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2012.codfw.wmnet with OS bookworm [09:16:08] RECOVERY - BGP status on lsw1-a5-codfw.mgmt is OK: BGP OK - up: 32, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:17:39] (CR) Filippo Giunchedi: [C:+1] P:tcpircbot: add DNS hosts to allowed CIDRs for tcpircbot [puppet] - https://gerrit.wikimedia.org/r/1109120 (owner: Ssingh) [09:17:58] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4052.ulsfo.wmnet} and A:cp [09:18:59] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4052.ulsfo.wmnet} and A:cp [09:19:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2013.codfw.wmnet with OS bookworm [09:19:43] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2011.codfw.wmnet [09:19:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2011.codfw.wmnet [09:19:51] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2012.codfw.wmnet [09:19:53]
!log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2012.codfw.wmnet [09:19:59] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2013.codfw.wmnet [09:20:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2013.codfw.wmnet [09:20:28] SRE, SRE-tools, Infrastructure-Foundations, Python3-Porting: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#10443781 (Volans) @elukey good question. Surely we're not working on this but we still have python2 code around, not too much but there is. I'm... [09:21:10] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve2001.codfw.wmnet [09:23:17] (PS2) Vgutierrez: hiera: remove cp4052 deprecated hiera values [puppet] - https://gerrit.wikimedia.org/r/1109383 [09:23:44] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp4044.ulsfo.wmnet,cp4052.ulsfo.wmnet} and A:cp [09:25:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:26:09] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109383 (owner: Vgutierrez) [09:28:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71914 and previous config saved to /var/cache/conftool/dbconfig/20250109-092850-root.json [09:31:55] (CR) Vgutierrez: [C:+2] hiera: remove cp4052 deprecated hiera values [puppet] - https://gerrit.wikimedia.org/r/1109383 (owner: Vgutierrez) [09:33:13]
ACKNOWLEDGEMENT - MD RAID on ml-serve2001 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 2, Failed: 0, Spare: 1 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T383307 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [09:33:18] ops-codfw, SRE, DC-Ops: Degraded RAID on ml-serve2001 - https://phabricator.wikimedia.org/T383307 (ops-monitoring-bot) NEW [09:36:32] (CR) Muehlenhoff: [C:+2] Make profile::presto::server::ferm_srange optional [puppet] - https://gerrit.wikimedia.org/r/1109117 (owner: Muehlenhoff) [09:40:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P71915 and previous config saved to /var/cache/conftool/dbconfig/20250109-094022-root.json [09:42:59] (PS3) Muehlenhoff: Switch presto in test cluster to nftables-compatible firewall settings [puppet] - https://gerrit.wikimedia.org/r/1109095 [09:43:51] (PS3) TChin: mw-content-history-reconcile-enrich: Enable K8 HA [deployment-charts] - https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) [09:43:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1042 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71916 and previous config saved to /var/cache/conftool/dbconfig/20250109-094355-root.json [09:44:02] (CR) TChin: mw-content-history-reconcile-enrich: Enable K8 HA (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: TChin) [09:44:03] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1109379 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm) [09:44:03] FIRING: [2x] KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod -
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:44:13] (PS1) Jelto: Rename kubernetes20[53,56,58] to wikikube-worker[2192-2194] [puppet] - https://gerrit.wikimedia.org/r/1109389 (https://phabricator.wikimedia.org/T377877) [09:44:56] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109095 (owner: Muehlenhoff) [09:46:48] (CR) Jelto: [C:+1] "lgtm, thanks for updating the probe!" [deployment-charts] - https://gerrit.wikimedia.org/r/1109043 (https://phabricator.wikimedia.org/T382617) (owner: DDesouza) [09:47:52] (CR) Jelto: [C:+1] aptrepo: Add bookworm components calico329 and kubernetes131 [puppet] - https://gerrit.wikimedia.org/r/1109379 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm) [09:48:27] (PS1) Marostegui: db2231: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1109390 (https://phabricator.wikimedia.org/T373579) [09:50:56] (CR) Marostegui: [C:+2] db2231: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1109390 (https://phabricator.wikimedia.org/T373579) (owner: Marostegui) [09:51:19] (PS4) Muehlenhoff: Switch presto in test cluster to nftables-compatible firewall settings [puppet] - https://gerrit.wikimedia.org/r/1109095 [09:52:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 1%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71917 and previous config saved to /var/cache/conftool/dbconfig/20250109-095207-root.json [09:52:18] (Abandoned) Btullis: Add conftool-data for dbstore hosts to dbctl [puppet] - https://gerrit.wikimedia.org/r/1108071 (https://phabricator.wikimedia.org/T382947) (owner: Btullis) [09:52:46] (CR) JMeybohm: [C:+1] Rename kubernetes20[53,56,58] to wikikube-worker[2192-2194] [puppet] - https://gerrit.wikimedia.org/r/1109389 (https://phabricator.wikimedia.org/T377877)
(owner: Jelto) [09:53:37] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109095 (owner: Muehlenhoff) [09:55:18] RECOVERY - MD RAID on ml-serve2001 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [09:55:18] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2001.codfw.wmnet [09:55:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P71918 and previous config saved to /var/cache/conftool/dbconfig/20250109-095527-root.json [09:55:36] !log klausman@cumin2002 START - Cookbook sre.hosts.reboot-single for host ml-serve2001.codfw.wmnet [09:56:04] RECOVERY - Disk space on dbprov2004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dbprov2004&var-datasource=codfw+prometheus/ops [09:58:04] !log installing glibc bugfix updates for Bookworm [09:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:29] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - https://gerrit.wikimedia.org/r/1099725 (https://phabricator.wikimedia.org/T379892) (owner: Abijeet Patro) [10:00:17] (CR) Muehlenhoff: [C:+2] matomo: Restrict access to Envoy port [puppet] - https://gerrit.wikimedia.org/r/1103313 (owner: Muehlenhoff) [10:01:34] (CR) Btullis: [C:+1] "Looks good to me. Thanks."
[puppet] - https://gerrit.wikimedia.org/r/1109095 (owner: Muehlenhoff) [10:03:08] (CR) JMeybohm: [C:+2] aptrepo: Add bookworm components calico329 and kubernetes131 [puppet] - https://gerrit.wikimedia.org/r/1109379 (https://phabricator.wikimedia.org/T341984) (owner: JMeybohm) [10:04:07] (CR) Nikerabbit: [C:+1] Enable Translate message bundle Scribunto library on MetaWiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1099725 (https://phabricator.wikimedia.org/T379892) (owner: Abijeet Patro) [10:05:51] !log jynus@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbprov2004.codfw.wmnet with reason: os upgrade [10:06:07] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbprov2004.codfw.wmnet with reason: os upgrade [10:06:49] (CR) David Caro: [C:+1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/1102887 (https://phabricator.wikimedia.org/T379259) (owner: Volans) [10:07:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71919 and previous config saved to /var/cache/conftool/dbconfig/20250109-100712-root.json [10:09:40] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2001.codfw.wmnet [10:10:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P71920 and previous config saved to /var/cache/conftool/dbconfig/20250109-101033-root.json [10:11:09] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[2053,2056,2058].codfw.wmnet [10:11:20] (CR) Ladsgroup: [C:+2] db_maint_mapper_sal.py: Update list of nicks [software] - https://gerrit.wikimedia.org/r/1108879 (owner: Marostegui) [10:11:55] (Merged) jenkins-bot: db_maint_mapper_sal.py: Update list of nicks [software] -
https://gerrit.wikimedia.org/r/1108879 (owner: Marostegui) [10:12:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[2053,2056,2058].codfw.wmnet [10:15:41] (CR) Volans: [C:+2] sre.hosts.upgrade-and-reboot: remove cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1102887 (https://phabricator.wikimedia.org/T379259) (owner: Volans) [10:17:19] (CR) Jelto: [C:+2] Rename kubernetes20[53,56,58] to wikikube-worker[2192-2194] [puppet] - https://gerrit.wikimedia.org/r/1109389 (https://phabricator.wikimedia.org/T377877) (owner: Jelto) [10:20:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2126 db2226 T373579', diff saved to https://phabricator.wikimedia.org/P71921 and previous config saved to /var/cache/conftool/dbconfig/20250109-102010-marostegui.json [10:20:14] T373579: Productionize db22[21-40] - https://phabricator.wikimedia.org/T373579 [10:20:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2126,2187,2226].codfw.wmnet with reason: maintenance [10:20:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2126,2187,2226].codfw.wmnet with reason: maintenance [10:21:07] (Merged) jenkins-bot: sre.hosts.upgrade-and-reboot: remove cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1102887 (https://phabricator.wikimedia.org/T379259) (owner: Volans) [10:21:27] !log Move db2187:3312 under db2226 s2 codfw dbmaint T373579 [10:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71922 and previous config saved to /var/cache/conftool/dbconfig/20250109-102218-root.json [10:23:25] FIRING: [2x] SystemdUnitFailed: user@0.service on ganeti6001:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:25:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 10%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71923 and previous config saved to /var/cache/conftool/dbconfig/20250109-102512-root.json [10:25:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 10%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71924 and previous config saved to /var/cache/conftool/dbconfig/20250109-102522-root.json [10:25:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P71925 and previous config saved to /var/cache/conftool/dbconfig/20250109-102538-root.json [10:27:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling pc4', diff saved to https://phabricator.wikimedia.org/P71926 and previous config saved to /var/cache/conftool/dbconfig/20250109-102700-ladsgroup.json [10:30:34] (CR) Btullis: modules+hiera: Add module to do Ceph mounts and mount ml-lab /home (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1109044 (owner: Klausman) [10:31:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on kubernetes2053:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:34:14] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.upgrade for pc1016.eqiad.wmnet [10:37:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71927 and previous config saved to /var/cache/conftool/dbconfig/20250109-103723-root.json [10:38:20] !log ladsgroup@cumin1002 END (PASS) -
Cookbook sre.mysql.upgrade (exit_code=0) for pc1016.eqiad.wmnet [10:40:07] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.upgrade for pc2015.codfw.wmnet [10:40:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 25%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71928 and previous config saved to /var/cache/conftool/dbconfig/20250109-104017-root.json [10:40:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 25%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71929 and previous config saved to /var/cache/conftool/dbconfig/20250109-104027-root.json [10:40:33] (CR) Fabfur: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1108875 (owner: Vgutierrez) [10:40:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443973 (phaultfinder) [10:40:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2131 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P71930 and previous config saved to /var/cache/conftool/dbconfig/20250109-104043-root.json [10:40:45] (CR) Muehlenhoff: [C:+2] Switch presto in test cluster to nftables-compatible firewall settings [puppet] - https://gerrit.wikimedia.org/r/1109095 (owner: Muehlenhoff) [10:42:36] PROBLEM - MariaDB Replica IO: pc4 on pc1016 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl2024@pc2015.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on pc2015.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:42:48] Amir1: ^ [10:43:06] the cookbook should have downtimed it [10:43:15] it's the replica, sigh [10:43:21] Yeah I was going to say [10:43:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on
pc1016.eqiad.wmnet with reason: Reboot [10:43:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on pc1016.eqiad.wmnet with reason: Reboot [10:45:11] !log root@cumin2002 START - Cookbook sre.puppet.renew-cert for dbprov2004.codfw.wmnet: Renew puppet certificate - root@cumin2002 [10:45:56] !log root@cumin2002 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for dbprov2004.codfw.wmnet: Renew puppet certificate - root@cumin2002 [10:45:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for pc2015.codfw.wmnet [10:47:36] RECOVERY - MariaDB Replica IO: pc4 on pc1016 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:48:14] (PS1) Muehlenhoff: Switch Presto access for coordinators to nftables-compatible firewall settings [puppet] - https://gerrit.wikimedia.org/r/1109393 [10:51:29] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109393 (owner: Muehlenhoff) [10:52:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71931 and previous config saved to /var/cache/conftool/dbconfig/20250109-105228-root.json [10:52:36] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1109393 (owner: Muehlenhoff) [10:55:13] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) (owner: Fabfur) [10:55:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 50%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71932 and previous config saved to /var/cache/conftool/dbconfig/20250109-105523-root.json [10:55:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 50%: Repooling after
moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71933 and previous config saved to /var/cache/conftool/dbconfig/20250109-105533-root.json [10:57:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repool pc4', diff saved to https://phabricator.wikimedia.org/P71934 and previous config saved to /var/cache/conftool/dbconfig/20250109-105708-ladsgroup.json [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1100) [11:05:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:07:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 6%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71935 and previous config saved to /var/cache/conftool/dbconfig/20250109-110734-root.json [11:10:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 75%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71936 and previous config saved to /var/cache/conftool/dbconfig/20250109-111029-root.json [11:10:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 75%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71937 and previous config saved to /var/cache/conftool/dbconfig/20250109-111038-root.json [11:10:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10443999 (10phaultfinder) [11:11:22] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:18:02] (03PS2) 10Giuseppe Lavagetto: ClusterConfig: add support for dumps trait 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109108 (https://phabricator.wikimedia.org/T382947) [11:18:02] (03PS2) 10Giuseppe Lavagetto: Use a bespoke database configuration for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109109 (https://phabricator.wikimedia.org/T382947) [11:19:19] (03CR) 10CI reject: [V:04-1] Use a bespoke database configuration for dumps [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109109 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto) [11:22:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 7%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71938 and previous config saved to /var/cache/conftool/dbconfig/20250109-112239-root.json [11:23:23] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10444031 (10hnowlan) [11:23:59] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2022.codfw.wmnet with OS bookworm [11:24:04] !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022 [11:24:04] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022 [11:24:16] 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10444033 (10MatthewVernon) There is no row in the `object` table with rowid 423322 (in any copy), the other complained-of row is extant: ` 2701219|f/f8/Kuro... 
[11:25:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 100%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71939 and previous config saved to /var/cache/conftool/dbconfig/20250109-112534-root.json [11:25:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2226 (re)pooling @ 100%: Repooling after moving sanitarium', diff saved to https://phabricator.wikimedia.org/P71940 and previous config saved to /var/cache/conftool/dbconfig/20250109-112543-root.json [11:29:49] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker2022.codfw.wmnet with OS bookworm [11:34:36] 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10444067 (10MatthewVernon) Attempting the recovery operation (i.e. `sqlite3 4077d9164732d6587761ef101bcbc280.db .recover >recovered.sql`) gives us 3 files w... [11:35:01] (03CR) 10Hnowlan: [C:03+2] changeprop-jobqueue: remove support for video transcoding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108737 (https://phabricator.wikimedia.org/T355292) (owner: 10Hnowlan) [11:36:32] (03Merged) 10jenkins-bot: changeprop-jobqueue: remove support for video transcoding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1108737 (https://phabricator.wikimedia.org/T355292) (owner: 10Hnowlan) [11:37:22] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:37:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 8%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71942 and previous config saved to /var/cache/conftool/dbconfig/20250109-113744-root.json [11:45:12] (03CR) 10Fabfur: 
[C:03+2] haproxy:benthos: produce msg compatible with our schema guidelines [puppet] - 10https://gerrit.wikimedia.org/r/1101166 (https://phabricator.wikimedia.org/T329332) (owner: 10Fabfur) [11:52:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 9%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71943 and previous config saved to /var/cache/conftool/dbconfig/20250109-115250-root.json [11:53:07] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [11:53:19] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [11:53:45] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [11:53:56] (03CR) 10Btullis: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1109393 (owner: 10Muehlenhoff) [11:54:15] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [11:57:51] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [11:58:35] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [11:59:13] hnowlan: \o/ [12:03:58] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [12:04:09] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [12:05:11] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2053 to wikikube-worker2192 [12:05:32] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [12:05:37] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [12:06:49] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [12:06:51] (03CR) 10Muehlenhoff: [C:03+2] Switch Presto access for coordinators to nftables-compatible firewall settings [puppet] - 10https://gerrit.wikimedia.org/r/1109393 (owner: 10Muehlenhoff) [12:07:25] !log hnowlan@deploy1003 
helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [12:07:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71944 and previous config saved to /var/cache/conftool/dbconfig/20250109-120755-root.json [12:08:41] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [12:12:21] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2053 to wikikube-worker2192 - jelto@cumin1002" [12:12:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2053 to wikikube-worker2192 - jelto@cumin1002" [12:12:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:12:58] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2192 [12:13:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2192 [12:13:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2053 to wikikube-worker2192 [12:14:38] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2056 to wikikube-worker2193 [12:15:02] (03PS1) 10Muehlenhoff: Remove access for piccardi [puppet] - 10https://gerrit.wikimedia.org/r/1109401 [12:15:05] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [12:18:09] (03CR) 10Muehlenhoff: [C:03+2] Remove access for piccardi [puppet] - 10https://gerrit.wikimedia.org/r/1109401 (owner: 10Muehlenhoff) [12:18:53] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2056 to wikikube-worker2193 - jelto@cumin1002" [12:19:38] !log 
jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2056 to wikikube-worker2193 - jelto@cumin1002" [12:19:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:19:38] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2193 [12:23:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71945 and previous config saved to /var/cache/conftool/dbconfig/20250109-122301-root.json [12:24:11] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Piccardi out of all services on: 2310 hosts [12:24:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2193 [12:25:03] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Piccardi out of all services on: 2310 hosts [12:25:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2056 to wikikube-worker2193 [12:25:49] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2058 to wikikube-worker2194 [12:26:11] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [12:28:25] RESOLVED: [2x] SystemdUnitFailed: user@0.service on ganeti6001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:29:32] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2058 to wikikube-worker2194 - jelto@cumin1002" [12:30:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Renaming kubernetes2058 to wikikube-worker2194 - jelto@cumin1002" [12:30:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:30:06] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2194 [12:30:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2194 [12:30:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2058 to wikikube-worker2194 [12:31:35] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2192.codfw.wmnet wikikube-worker2193.codfw.wmnet wikikube-worker2194.codfw.wmnet on all recursors [12:31:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2192.codfw.wmnet wikikube-worker2193.codfw.wmnet wikikube-worker2194.codfw.wmnet on all recursors [12:34:00] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2192.codfw.wmnet with OS bookworm [12:34:10] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2192 [12:34:25] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [12:37:26] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2193.codfw.wmnet with OS bookworm [12:37:36] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2193 [12:37:47] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2192 - jelto@cumin1002" [12:37:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2192 - jelto@cumin1002" [12:37:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:37:51] !log 
jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2192.codfw.wmnet 221.48.192.10.in-addr.arpa 1.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:37:54] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [12:37:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2192.codfw.wmnet 221.48.192.10.in-addr.arpa 1.2.2.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:37:55] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2192 [12:38:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2192 [12:38:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2192 [12:38:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71946 and previous config saved to /var/cache/conftool/dbconfig/20250109-123806-root.json [12:40:36] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2194.codfw.wmnet with OS bookworm [12:40:46] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2194 [12:41:14] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2193 - jelto@cumin1002" [12:41:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2193 - jelto@cumin1002" [12:41:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:41:19] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2193.codfw.wmnet 62.48.192.10.in-addr.arpa 
2.6.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:41:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2193.codfw.wmnet 62.48.192.10.in-addr.arpa 2.6.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:41:23] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2193 [12:41:31] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [12:41:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2193 [12:41:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2193 [12:43:22] PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [12:43:24] RECOVERY - snapshot of s6 in codfw on backupmon1001 is OK: Last snapshot for s6 at codfw (db2197) taken on 2025-01-09 11:52:22 (527 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:43:30] !log dcaro@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [12:44:56] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2194 - jelto@cumin1002" [12:45:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2194 - jelto@cumin1002" [12:45:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:45:01] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache 
wikikube-worker2194.codfw.wmnet 224.32.192.10.in-addr.arpa 4.2.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:45:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2194.codfw.wmnet 224.32.192.10.in-addr.arpa 4.2.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:45:04] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2194 [12:45:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2194 [12:45:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2194 [12:53:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71947 and previous config saved to /var/cache/conftool/dbconfig/20250109-125313-root.json [12:54:28] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@9073e46]: Refine refactoring [12:54:49] !log aqu@deploy2002 deploy aborted: Refine refactoring (duration: 00m 20s) [12:55:51] (03CR) 10David Caro: prometheus-node-kernel-panic: use prom labels (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1300) [13:02:23] RECOVERY - Hadoop NodeManager on an-worker1154 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:04:10] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10444250 (10akosiaris) [13:04:40] 06SRE, 
10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10444253 (10akosiaris) I 've updated the list of servers to mark out some that are to be decommissioned, namely the ones in {T383226} [13:08:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2231 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P71948 and previous config saved to /var/cache/conftool/dbconfig/20250109-130818-root.json [13:09:30] (03PS3) 10Máté Szabó: Unify IPInfo access levels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) [13:10:09] (03CR) 10Máté Szabó: Unify IPInfo access levels (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081370 (https://phabricator.wikimedia.org/T375086) (owner: 10Máté Szabó) [13:11:20] !log installing sqlparse security updates [13:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:38] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@9073e46]: Refine refactoring [13:21:29] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@9073e46]: Refine refactoring (duration: 02m 51s) [13:21:50] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1088-1092].eqiad.wmnet [13:21:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1088-1092].eqiad.wmnet [13:23:23] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:25:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:25:58] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10Infrastructure Security, 10LDAP-Access-Requests: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10444361 (10MoritzMuehlenhoff) 05In progress→03Resolved a:03Dzahn This looks all complete and Da... [13:26:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383213#10444364 (10kamila) [13:28:23] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:29:09] (03PS1) 10Muehlenhoff: Switch Presto access to nftables-compatible firewall settings [puppet] - 10https://gerrit.wikimedia.org/r/1109411 [13:29:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10444382 (10phaultfinder) [13:31:34] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109411 (owner: 10Muehlenhoff) [13:33:05] (03PS1) 10Muehlenhoff: Presto: Remove ferm support [puppet] - 10https://gerrit.wikimedia.org/r/1109412 [13:36:01] (03CR) 10Xcollazo: "LGTM." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [13:37:23] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:38:00] (03CR) 10Gmodena: [C:03+1] mw-content-history-reconcile-enrich: Enable K8 HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [13:38:22] (03PS1) 10Kamila Součková: kubernetes: rename mw145[7-9] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1109413 (https://phabricator.wikimedia.org/T365571) [13:39:54] !log installing jinja2 security updates [13:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:23] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [13:44:03] FIRING: KubernetesCalicoDown: wikikube-worker2022.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2022.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:44:44] 06SRE, 10SRE-Access-Requests, 06Gerrit-Privilege-Requests, 10Infrastructure Security, 10LDAP-Access-Requests: Offboard Muhammad Jazirahly from WMF systems - https://phabricator.wikimedia.org/T383056#10444423 (10WMDE-leszek) thank you all. 
[13:51:11] (03PS1) 10DCausse: Add gmodena to analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/1109417 [13:58:24] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2192.codfw.wmnet with OS bookworm [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1400) [14:00:05] abijeet: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:14] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2193.codfw.wmnet with OS bookworm [14:01:11] o/ [14:01:15] I can probably deploy in a few minutes [14:02:23] PROBLEM - Hadoop NodeManager on an-worker1156 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [14:04:26] Lucas_WMDE: let me ping abijeet [14:05:35] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2194.codfw.wmnet with OS bookworm [14:06:12] (I can deploy now btw) [14:07:57] cool. 
[14:11:12] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2193.codfw.wmnet with OS bookworm [14:11:16] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2193 [14:11:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2193 [14:12:05] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2194.codfw.wmnet with OS bookworm [14:12:08] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2194 [14:12:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2194 [14:14:49] (03CR) 10FNegri: prometheus-node-kernel-panic: use prom labels (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [14:15:36] maybe Nikerabbit could accompany the deployment of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1099725 if abijeet isn’t around? [14:16:01] or we wait / postpone [14:16:11] (03CR) 10Btullis: "Is there a relevant ticket that we can link to, for the record?" [puppet] - 10https://gerrit.wikimedia.org/r/1109417 (owner: 10DCausse) [14:17:54] let's wait. Nikerabbit is in meeting and seems abijeet having internet issues. [14:18:02] (03CR) 10Alexandros Kosiaris: [C:03+2] profile::tlsproxy::envoy: Explicitly configure retries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1109070 (https://phabricator.wikimedia.org/T380958) (owner: 10Alexandros Kosiaris) [14:18:39] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1109411 (owner: 10Muehlenhoff) [14:18:59] ok sure [14:19:05] thanks for checking [14:19:52] (03PS2) 10DCausse: Add gmodena to analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/1109417 (https://phabricator.wikimedia.org/T383333) [14:19:57] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1100166 (https://phabricator.wikimedia.org/T381389) (owner: 10Cathal Mooney) [14:19:58] (03CR) 10Alexandros Kosiaris: [C:03+2] Remove Tech News feed URL from Planet [puppet] - 10https://gerrit.wikimedia.org/r/1109131 (owner: 10Amire80) [14:21:44] (03CR) 10DCausse: "sure, filed T383333" [puppet] - 10https://gerrit.wikimedia.org/r/1109417 (https://phabricator.wikimedia.org/T383333) (owner: 10DCausse) [14:23:22] (03CR) 10Btullis: [C:03+2] Add gmodena to analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/1109417 (https://phabricator.wikimedia.org/T383333) (owner: 10DCausse) [14:24:30] !log dcaro@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [14:26:53] 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10444603 (10MatthewVernon) So we have 3 database files with at least similar contents (same number of rows, inspecting differences by hand it seems to be th... [14:27:31] hello, am I too late for the deployment window? [14:27:38] (03PS1) 10David Caro: cloudceph: move cloudcephosd1012 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1109424 (https://phabricator.wikimedia.org/T309789) [14:28:52] (03CR) 10David Caro: [C:03+2] cloudceph: move cloudcephosd1012 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1109424 (https://phabricator.wikimedia.org/T309789) (owner: 10David Caro) [14:29:02] Poke Lucas_WMDE :-) [14:29:08] hi! [14:29:09] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2193.codfw.wmnet with reason: host reimage [14:29:11] we can deploy now :) [14:29:37] Thank you, and sorry for not being here on time! 
[14:29:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099725 (https://phabricator.wikimedia.org/T379892) (owner: 10Abijeet Patro) [14:30:32] (03Merged) 10jenkins-bot: Enable Translate message bundle Scribunto library on MetaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1099725 (https://phabricator.wikimedia.org/T379892) (owner: 10Abijeet Patro) [14:30:55] (03PS3) 10Stevemunene: Make WikibaseQualityConstraints use split-graph query service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105879 (https://phabricator.wikimedia.org/T374021) [14:31:01] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2194.codfw.wmnet with reason: host reimage [14:31:35] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1099725|Enable Translate message bundle Scribunto library on MetaWiki (T379892)]] [14:31:37] T379892: Initial roll-out of Scribunto library for accessing message bundles - https://phabricator.wikimedia.org/T379892 [14:31:39] (03CR) 10CI reject: [V:04-1] Make WikibaseQualityConstraints use split-graph query service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105879 (https://phabricator.wikimedia.org/T374021) (owner: 10Stevemunene) [14:32:06] !log dcaro@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [14:32:41] (03CR) 10Volans: "thanks for the patch, minor nit inline, the rest looks good." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney) [14:33:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2193.codfw.wmnet with reason: host reimage [14:33:54] (03PS4) 10Stevemunene: Make WikibaseQualityConstraints use split-graph query service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105879 (https://phabricator.wikimedia.org/T374021) [14:37:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2194.codfw.wmnet with reason: host reimage [14:37:54] !log dcaro@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [14:38:51] !log lucaswerkmeister-wmde@deploy2002 abi, lucaswerkmeister-wmde: Backport for [[gerrit:1099725|Enable Translate message bundle Scribunto library on MetaWiki (T379892)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:38:54] T379892: Initial roll-out of Scribunto library for accessing message bundles - https://phabricator.wikimedia.org/T379892 [14:38:58] abijeet: please test :) [14:39:02] Lucas_WMDE, on it [14:39:08] (03PS4) 10Cathal Mooney: QoS rules for cloudcephosd [puppet] - 10https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501) [14:39:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1081.eqiad.wmnet - https://phabricator.wikimedia.org/T381878#10444634 (10Jelto) >>! In T381878#10441783, @Jclark-ctr wrote: > @Jelto i performed flea power drain and looks to im... 
[14:40:47] (CR) Ssingh: [C:+2] P:tcpircbot: add DNS hosts to allowed CIDRs for tcpircbot [puppet] - https://gerrit.wikimedia.org/r/1109120 (owner: Ssingh) [14:41:21] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2192.codfw.wmnet with OS bookworm [14:41:25] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2192 [14:41:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2192 [14:41:42] (CR) Cathal Mooney: [C:+2] QoS rules for cloudcephosd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1058612 (https://phabricator.wikimedia.org/T371501) (owner: Cathal Mooney) [14:47:43] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1081.eqiad.wmnet - https://phabricator.wikimedia.org/T381878#10444684 (Jclark-ctr) @Jelto i am going to start flea power draining them and reimaging them wanted to try to reso... 
[14:48:02] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1081.eqiad.wmnet - https://phabricator.wikimedia.org/T381878#10444685 (Jclark-ctr) Open→Resolved [14:50:15] RECOVERY - snapshot of s2 in codfw on backupmon1001 is OK: Last snapshot for s2 at codfw (db2197) taken on 2025-01-09 13:49:20 (888 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [14:51:02] (CR) Hnowlan: [C:+1] kubernetes: rename mw145[7-9] to wikikube-worker* [puppet] - https://gerrit.wikimedia.org/r/1109413 (https://phabricator.wikimedia.org/T365571) (owner: Kamila Součková) [14:52:29] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[1457-1459].eqiad.wmnet [14:52:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2193.codfw.wmnet with OS bookworm [14:53:30] Lucas_WMDE, all good. [14:54:06] !log lucaswerkmeister-wmde@deploy2002 abi, lucaswerkmeister-wmde: Continuing with sync [14:54:09] alright, thanks! 
[14:54:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[1457-1459].eqiad.wmnet [14:54:15] (03CR) 10Kamila Součková: [C:03+2] kubernetes: rename mw145[7-9] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1109413 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [14:56:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2194.codfw.wmnet with OS bookworm [14:58:12] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1073.eqiad.wmnet with OS bookworm [14:58:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1073.eqiad.wmnet - https://phabricator.wikimedia.org/T381789#10444725 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube... [15:03:09] (03CR) 10TChin: [C:03+2] mw-content-history-reconcile-enrich: Enable K8 HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [15:03:46] (03CR) 10JMeybohm: [C:03+1] "🚢" [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [15:04:28] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1099725|Enable Translate message bundle Scribunto library on MetaWiki (T379892)]] (duration: 32m 53s) [15:04:30] (03Merged) 10jenkins-bot: mw-content-history-reconcile-enrich: Enable K8 HA [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105667 (https://phabricator.wikimedia.org/T375176) (owner: 10TChin) [15:04:32] T379892: Initial roll-out of Scribunto library for accessing message bundles - https://phabricator.wikimedia.org/T379892 [15:04:46] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10444746 
(10phaultfinder) [15:04:50] (03CR) 10Kamila Součková: [C:03+2] create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [15:04:57] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:05:49] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:06:06] !log UTC afternoon backport+config window done [15:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:37] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1457 to wikikube-worker1093 [15:06:57] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:07:54] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1458 to wikikube-worker1094 [15:08:17] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1459 to wikikube-worker1095 [15:09:02] !log kamila@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:09:03] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:09:16] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [15:09:21] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [15:10:27] 10ops-codfw, 06DC-Ops, 07Kubernetes: hw 
troubleshooting: Comm Error: backplane 0 for wikikube-worker2192.codfw.wmnet - https://phabricator.wikimedia.org/T383339 (10Jelto) 03NEW [15:10:55] (03PS1) 10Cathal Mooney: WMCS: Modify QoS marking for Ceph OSD heartbeat traffic [puppet] - 10https://gerrit.wikimedia.org/r/1109434 (https://phabricator.wikimedia.org/T371501) [15:10:57] 10ops-codfw, 06DC-Ops, 07Kubernetes: hw troubleshooting: Comm Error: backplane 0 for wikikube-worker2192.codfw.wmnet - https://phabricator.wikimedia.org/T383339#10444802 (10Jelto) [15:11:10] (03Merged) 10jenkins-bot: create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [15:11:26] 10ops-codfw, 06DC-Ops, 07Kubernetes: hw troubleshooting: Comm Error: backplane 0 for wikikube-worker2192.codfw.wmnet - https://phabricator.wikimedia.org/T383339#10444809 (10Jelto) Similar to issues in eqiad, like T381878 [15:12:42] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1458 to wikikube-worker1094 - kamila@cumin1002" [15:12:51] (03PS2) 10Cathal Mooney: WMCS: Modify QoS marking for Ceph OSD heartbeat traffic [puppet] - 10https://gerrit.wikimedia.org/r/1109434 (https://phabricator.wikimedia.org/T371501) [15:13:33] jelto: caught your wikikube-worker2192.mgmt.codfw.wmnet in sync-netbox-hiera, proceeding unless you stop me [15:13:35] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109434 (https://phabricator.wikimedia.org/T371501) (owner: 10Cathal Mooney) [15:14:07] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:14:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1458 to wikikube-worker1094 - kamila@cumin1002" [15:14:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [15:14:22] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1094 [15:14:43] kmila_: yes please proceed. Thank you! [15:15:08] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1057.eqiad.wmnet with OS bookworm [15:15:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1073.eqiad.wmnet - https://phabricator.wikimedia.org/T381789#10444826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube... [15:15:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1094 [15:16:28] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:16:29] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1095 [15:16:33] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:16:38] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1458 to wikikube-worker1094 [15:16:42] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1073.eqiad.wmnet with reason: host reimage [15:17:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1095 [15:18:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1459 to wikikube-worker1095 [15:18:58] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:18:59] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1093 [15:19:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10444840 (10phaultfinder) [15:19:59] !log jelto@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host 
wikikube-worker2192.codfw.wmnet with OS bookworm [15:20:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1073.eqiad.wmnet with reason: host reimage [15:20:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1093 [15:21:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1457 to wikikube-worker1093 [15:21:15] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1093.eqiad.wmnet wikikube-worker1094.eqiad.wmnet wikikube-worker1095.eqiad.wmnet on all recursors [15:21:17] 06SRE, 10Ceph, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10444844 (10dcaro) Our current version of ceph does not support the `mon_use_min_delay_socket=true` option :/, so only for osds then. To set... [15:21:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1093.eqiad.wmnet wikikube-worker1094.eqiad.wmnet wikikube-worker1095.eqiad.wmnet on all recursors [15:21:54] !log homer 'lsw1-d3-codfw*' commit 'T377877' [15:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:57] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [15:22:08] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1109434 (https://phabricator.wikimedia.org/T371501) (owner: 10Cathal Mooney) [15:22:11] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [15:22:17] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [15:22:52] (03CR) 10Cathal Mooney: [C:03+2] WMCS: Modify QoS marking for Ceph OSD heartbeat traffic [puppet] - 
10https://gerrit.wikimedia.org/r/1109434 (https://phabricator.wikimedia.org/T371501) (owner: 10Cathal Mooney) [15:23:17] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1093.eqiad.wmnet with OS bookworm [15:23:20] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1093 [15:23:20] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1093 [15:23:32] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1094.eqiad.wmnet with OS bookworm [15:23:35] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1094 [15:23:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1094 [15:23:44] ���� [15:23:45] ���� [15:23:52] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1095.eqiad.wmnet with OS bookworm [15:23:55] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1095 [15:23:55] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1095 [15:23:56] 10ops-codfw, 06DC-Ops, 07Kubernetes: hw troubleshooting: Comm Error: backplane 0 for wikikube-worker2192.codfw.wmnet - https://phabricator.wikimedia.org/T383339#10444858 (10Jelto) The following commands have to be executed when the host is back (just noting it down so I don't forget it): ` cookbook sre.host... 
[15:24:13] (03PS1) 10Bking: dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109436 (https://phabricator.wikimedia.org/T380615) [15:24:20] that was me trying to see why I can't send messages to logmsgbot from the DNS host :) [15:24:35] !log homer 'lsw1-c5-codfw*' commit 'T377877' [15:24:37] sukhe: ok, glad I didn't break something :D [15:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:24] !log homer 'cr*codfw*' commit 'T377877' [15:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:34] (03PS2) 10Bking: dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109436 (https://phabricator.wikimedia.org/T380615) [15:27:20] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2193-2194].codfw.wmnet [15:27:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2193-2194].codfw.wmnet [15:28:02] (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109436 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking) [15:28:25] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383341 (10Jelto) 03NEW [15:28:29] (03CR) 10Stevemunene: [C:03+1] dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109436 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking) [15:28:44] !log testing update from dns host [15:29:35] (03PS1) 10Muehlenhoff: Track LDAP access for fceratto [puppet] - 10https://gerrit.wikimedia.org/r/1109439 [15:29:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - 
https://phabricator.wikimedia.org/T382535#10444906 (10phaultfinder) [15:29:49] thanks, Lucas_WMDE [15:30:09] !log sukhe@dns1004: START - running authdns-update [15:30:35] (03CR) 10Muehlenhoff: [C:03+2] Track LDAP access for fceratto [puppet] - 10https://gerrit.wikimedia.org/r/1109439 (owner: 10Muehlenhoff) [15:31:22] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1069.eqiad.wmnet with OS bookworm [15:31:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T381770#10444916 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube... [15:31:47] !log sukhe@dns1004: END - running authdns-update [15:32:23] (03PS3) 10Ssingh: P:dns::auth::update: log authdns-update run to SAL [puppet] - 10https://gerrit.wikimedia.org/r/1109141 [15:33:03] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4776/co" [puppet] - 10https://gerrit.wikimedia.org/r/1109141 (owner: 10Ssingh) [15:33:33] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1057.eqiad.wmnet with reason: host reimage [15:33:48] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [15:33:54] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [15:37:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1057.eqiad.wmnet with reason: host reimage [15:38:27] PROBLEM - Hadoop NodeManager on an-worker1154 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:38:52] (PS1) Bking: dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s, part 2 [deployment-charts] - https://gerrit.wikimedia.org/r/1109441 (https://phabricator.wikimedia.org/T380615) [15:39:26] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:39:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:39:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1073.eqiad.wmnet with OS bookworm [15:39:52] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1073.eqiad.wmnet - https://phabricator.wikimedia.org/T381789#10444937 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-wor... 
[15:40:21] ops-eqiad, SRE, DC-Ops, Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1073.eqiad.wmnet - https://phabricator.wikimedia.org/T381789#10444954 (Jclark-ctr) Open→Resolved Reimaged passed with no issues [15:44:27] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:45:53] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [15:46:00] !log tchin@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [15:46:37] (PS4) Ssingh: P:dns::auth::update: log authdns-update run to SAL [puppet] - https://gerrit.wikimedia.org/r/1109141 [15:46:39] !log bking@an-airflow1005 stopping airflow-search services as part of k8s migration T380615 [15:46:40] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1243.eqiad.wmnet with OS bookworm [15:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:42] T380615: Migrate the airflow-search database to Kubernetes - https://phabricator.wikimedia.org/T380615 [15:46:48] ops-eqiad, SRE, collaboration-services, DC-Ops, and 3 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T383051#10445015 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host w... [15:47:31] (CR) BBlack: [C:+1] "LGTM!" 
[puppet] - https://gerrit.wikimedia.org/r/1109141 (owner: Ssingh) [15:48:56] (CR) Ssingh: [C:+2] P:dns::auth::update: log authdns-update run to SAL [puppet] - https://gerrit.wikimedia.org/r/1109141 (owner: Ssingh) [15:49:11] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1069.eqiad.wmnet with reason: host reimage [15:49:49] PROBLEM - Checks that the local airflow scheduler for airflow @search is working properly on an-airflow1005 is CRITICAL: CRITICAL: /usr/bin/env PYTHONPATH=/srv/deployment/airflow-dags/search AIRFLOW_HOME=/srv/airflow-search /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1005.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [15:50:09] SRE, SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10445019 (MatthewVernon) As perhaps expected, the final transaction before the incident is a DELETE of the various thumbnails of 300px-Gascones,_molino_(1... 
[15:50:18] (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109436 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking) [15:50:23] PROBLEM - Disk space on ml-lab1001 is CRITICAL: DISK CRITICAL - free space: /srv 14831MiB (3% inode=94%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [15:52:01] RECOVERY - snapshot of x1 in codfw on backupmon1001 is OK: Last snapshot for x1 at codfw (db2197) taken on 2025-01-09 15:14:09 (381 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [15:52:25] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [15:52:30] !log sukhe@dns1004: START - running authdns-update [15:52:40] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [15:53:18] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1069.eqiad.wmnet with reason: host reimage [15:54:07] !log sukhe@dns1004: END - running authdns-update [15:55:28] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:55:46] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:56:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:56:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host wikikube-worker1057.eqiad.wmnet with OS bookworm [15:56:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1073.eqiad.wmnet - https://phabricator.wikimedia.org/T381789#10445031 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube... [15:57:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1057.eqiad.wmnet - https://phabricator.wikimedia.org/T381676#10445032 (10Jclark-ctr) 05Open→03Resolved Reimaged server without issues. it was posted onto T381789 ticket by mistake [15:58:04] 06SRE, 10Observability-Metrics: Add slabinfo prometheus exporter - https://phabricator.wikimedia.org/T160071#10445038 (10tappof) https://github.com/prometheus/node_exporter/pull/2376 [15:58:33] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10445039 (10MoritzMuehlenhoff) [15:59:54] (03PS1) 10TChin: mw-content-history-reconcile-enrich: Add HA storageDir [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109448 (https://phabricator.wikimedia.org/T375176) [16:00:05] dduvall and dancy: May I have your attention please! Train log triage. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1600) [16:01:31] (03CR) 10Volans: [C:03+2] ownership: Data Platform cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104950 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [16:04:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10445058 (10phaultfinder) [16:05:52] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1243.eqiad.wmnet with reason: host reimage [16:07:34] (03Merged) 10jenkins-bot: ownership: Data Platform cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104950 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [16:08:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1243.eqiad.wmnet with reason: host reimage [16:10:08] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10445069 (10Jclark-ctr) Rebalanced AA breaker and BB breaker [16:11:48] jouncebot: nowandnext [16:11:48] For the next 0 hour(s) and 48 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1600) [16:11:49] In 0 hour(s) and 48 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1700) [16:11:51] (03PS1) 10Bking: dse-k8s-eqiad: empty out values-postgresql-airflow-search.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109449 (https://phabricator.wikimedia.org/T380615) [16:12:00] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:12:14] (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: empty out values-postgresql-airflow-search.yaml [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109449 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking) 
[16:13:27] (03PS1) 10Jelto: Rename kubernetes20[49-52] to wikikube-worker219[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/1109453 (https://phabricator.wikimedia.org/T377877) [16:14:53] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes20[49-52] to wikikube-worker219[5-8] [puppet] - 10https://gerrit.wikimedia.org/r/1109453 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [16:15:30] (03PS1) 10David Caro: ceph::conf: allow passing min_delay option [puppet] - 10https://gerrit.wikimedia.org/r/1109454 (https://phabricator.wikimedia.org/T371501) [16:15:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10445106 (10phaultfinder) [16:15:47] (03PS2) 10Bking: dse-k8s-eqiad: increase airflow-search pg instance disk size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109449 (https://phabricator.wikimedia.org/T380615) [16:16:15] (03CR) 10Stevemunene: [C:03+1] dse-k8s-eqiad: increase airflow-search pg instance disk size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109449 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking) [16:16:19] (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: increase airflow-search pg instance disk size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109449 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking) [16:16:21] (03CR) 10CI reject: [V:04-1] ceph::conf: allow passing min_delay option [puppet] - 10https://gerrit.wikimedia.org/r/1109454 (https://phabricator.wikimedia.org/T371501) (owner: 10David Caro) [16:19:27] (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: increase airflow-search pg instance disk size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109449 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking) [16:20:21] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [16:21:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: 
"Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:21:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1069.eqiad.wmnet with OS bookworm [16:21:15] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [16:21:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T381770#10445122 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-wor... [16:21:24] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1093.eqiad.wmnet with OS bookworm [16:21:24] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1094.eqiad.wmnet with OS bookworm [16:21:28] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1095.eqiad.wmnet with OS bookworm [16:21:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1069.eqiad.wmnet - https://phabricator.wikimedia.org/T381770#10445123 (10Jclark-ctr) 05Open→03Resolved flea power drain and Reimaged server [16:21:31] heads up, i'm going to use the remainder of this window to get wmf.11 back to group1 [16:23:26] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109455 (https://phabricator.wikimedia.org/T382362) [16:23:28] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109455 (https://phabricator.wikimedia.org/T382362) (owner: 10TrainBranchBot) [16:23:39] (03PS4) 10Fabfur: varnish: pass WME HEAD reqs to pass for ATS [puppet] - 10https://gerrit.wikimedia.org/r/1101909 
(https://phabricator.wikimedia.org/T381771) [16:24:14] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109455 (https://phabricator.wikimedia.org/T382362) (owner: 10TrainBranchBot) [16:26:26] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101909 (https://phabricator.wikimedia.org/T381771) (owner: 10Fabfur) [16:27:04] (03PS3) 10Brouberol: dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s, part 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109441 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking) [16:27:09] (03CR) 10Brouberol: [C:03+1] dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s, part 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109441 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking) [16:27:40] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:27:52] (03CR) 10Bking: [C:03+2] dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s, part 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109441 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking) [16:27:57] (03CR) 10Stevemunene: [C:03+1] dse-k8s-eqiad: prepare to migrate airflow-search instance to k8s, part 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109441 (https://phabricator.wikimedia.org/T380615) (owner: 10Bking) [16:28:32] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [16:29:30] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [16:32:42] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382535#10445199 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [16:33:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [16:33:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1243.eqiad.wmnet with OS bookworm [16:33:16] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T383051#10445203 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikik... [16:33:33] 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10445205 (10MatthewVernon) I've copied proxy-access and server logs from the frontends and serverlog from the backends onto cumin1002 to give myself a littl... [16:33:40] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: hw troubleshooting: "Comm Error: backplane 0" for wikikube-worker1243.eqiad.wmnet - https://phabricator.wikimedia.org/T383051#10445206 (10Jclark-ctr) 05Open→03Resolved Reimaged passed with no issues [16:35:07] (03PS1) 10JMeybohm: Support multiple kubernetes-client versions [debs/kubernetes] (v1.23) - 10https://gerrit.wikimedia.org/r/1109458 (https://phabricator.wikimedia.org/T341984) [16:40:00] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.11 refs T382362 [16:40:03] T382362: 1.44.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T382362 [16:43:24] dduvall: looking good? [16:43:31] 06SRE, 10SRE-swift-storage: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053#10445230 (10MatthewVernon) To summarise: - 07:19:14 - final successful PUT - 07:19:50 - final successful DELETE (recorded in databases OK) - 07:20:28... 
[16:43:38] cdanis: so far so good [16:44:31] jouncebot: nowandnext [16:44:32] For the next 0 hour(s) and 15 minute(s): Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1600) [16:44:32] In 0 hour(s) and 15 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1700) [16:44:43] dduvall: mind if I do a config deploy now? [16:46:54] yeah, no problem [16:47:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [16:47:16] (03CR) 10CDanis: [C:03+2] group1: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [16:47:22] The change '1109133' has been rejected (Code-Review -2) by 'CDanis' [16:47:24] lol [16:47:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cdanis@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [16:47:28] (03PS2) 10David Caro: ceph::conf: allow passing min_delay option [puppet] - 10https://gerrit.wikimedia.org/r/1109454 (https://phabricator.wikimedia.org/T371501) [16:48:02] (03PS1) 10Hnowlan: shellbox-video: scale down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109459 (https://phabricator.wikimedia.org/T383317) [16:48:03] (03Merged) 10jenkins-bot: group1: enable OpenTelemetry exports [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109133 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [16:48:33] !log cdanis@deploy2002 Started scap sync-world: Backport for [[gerrit:1109133|group1: enable OpenTelemetry exports (T340552)]] [16:48:37] T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - 
https://phabricator.wikimedia.org/T340552 [16:50:27] !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022 [16:50:43] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [16:53:42] !log cdanis@deploy2002 cdanis: Backport for [[gerrit:1109133|group1: enable OpenTelemetry exports (T340552)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:53:46] T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552 [16:53:57] !log cdanis@deploy2002 cdanis: Continuing with sync [16:54:07] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2022 - elukey@cumin1002" [16:54:11] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2022 - elukey@cumin1002" [16:54:12] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:54:12] !log elukey@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2022.codfw.wmnet 212.32.192.10.in-addr.arpa 2.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:54:15] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2022.codfw.wmnet 212.32.192.10.in-addr.arpa 2.1.2.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [16:54:15] !log elukey@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2022 [16:54:17] !log elukey@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2022 [16:54:17] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022 [16:54:42] !log 
elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022 [16:54:42] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022 [16:55:39] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2022.codfw.wmnet with OS bookworm [16:55:43] !log elukey@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2022 [16:55:43] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2022 [17:00:04] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1700). [17:00:05] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:27] o/ [17:02:56] !log cdanis@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109133|group1: enable OpenTelemetry exports (T340552)]] (duration: 14m 22s) [17:02:59] T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model - https://phabricator.wikimedia.org/T340552 [17:03:30] o/ [17:04:05] jhathaway: one of the patches is for DNS rather than puppet, I hope that's OK [17:05:41] that is fine, however on the two puppet patches I don't see any reviews, I'm not sure if I have enough context to review them [17:09:13] jhathaway: do you know who should review them? [17:10:13] tgr|away: I'd suggest pinging somebody from #wikimedia-serviceops for those, so that they are aware and can provide assistance. I guess that the new vhost also needs to be deployed to k8s pods, right? [17:10:28] (03PS1) 10CDanis: tweak down sample rates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109461 [17:10:56] yeah, would that involve a different piece of code?
[17:11:29] (03PS2) 10CDanis: tweak down sample rates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109461 [17:11:31] lemme check [17:12:34] thanks elukey, I'll ask in serviceops [17:12:42] IIUC the config needs to run on the deployment servers via puppet run, so the corresponding YAML files for helmfile are updated [17:13:02] and after that, a deploy would need to be kicked off to refresh the httpd config [17:14:12] (03CR) 10CDanis: [C:03+2] tweak down sample rates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109461 (owner: 10CDanis) [17:14:16] tgr|away: not sure how urgent this is but maybe we could follow up with serviceops to gather +1s and then deploy early next week? This seems like something that needs to happen during a mediawiki maintenance window [17:14:30] Cc: jhathaway: --^ [17:14:43] early next week would be fine [17:14:50] just to be sure [17:15:19] (03Merged) 10jenkins-bot: tweak down sample rates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109461 (owner: 10CDanis) [17:15:20] ack super, going afk but ping me in the next few days if anything is needed [17:15:25] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [17:15:29] do you mean a mediawiki infrastructure window? or should I schedule a custom one? [17:16:01] infra window yes, it seems a good one in my opinion [17:16:06] since we are adding a vhost etc.. [17:16:22] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [17:16:23] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [17:16:23] so in there we can pack puppet + mw deploy [17:16:31] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2022.codfw.wmnet with reason: host reimage [17:16:53] thx, I'll reschedule [17:16:57] np!
[17:17:02] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [17:17:03] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:17:07] thanks, sorry for the delay [17:18:34] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:18:35] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:18:38] cdanis@deploy2002: Failed to log message to wiki. Somebody should check the error logs. [17:19:13] 06SRE, 10Ceph, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Configure DSCP marking for cloudceph* hosts - https://phabricator.wikimedia.org/T371501#10445361 (10cmooney) This seems to be working ok following the merge. Packets are being properly matched in the iptables rules and the DSCP m... [17:19:47] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:19:49] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:20:17] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2022.codfw.wmnet with reason: host reimage [17:20:56] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1093.eqiad.wmnet with OS bookworm [17:20:59] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1093 [17:20:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1093 [17:21:32] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:21:33] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:22:45] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:22:47] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:25:02] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply 
[17:25:03] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:25:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:26:29] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:35:17] (03PS1) 10CDanis: mw-*: trace sampling rate: another tweak [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109466 [17:35:47] (03CR) 10CDanis: [C:03+2] mw-*: trace sampling rate: another tweak [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109466 (owner: 10CDanis) [17:37:06] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1094.eqiad.wmnet with OS bookworm [17:37:07] (03Merged) 10jenkins-bot: mw-*: trace sampling rate: another tweak [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109466 (owner: 10CDanis) [17:37:09] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1094 [17:37:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1094 [17:37:29] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [17:37:31] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1095.eqiad.wmnet with OS bookworm [17:37:39] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1095 [17:37:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1095 [17:38:42] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [17:38:43] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [17:39:49] !log cdanis@deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/mw-jobrunner: apply [17:39:50] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:41:03] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2022.codfw.wmnet with OS bookworm [17:41:17] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:41:19] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:42:06] (03PS1) 10Ladsgroup: mariadb: Add file tables and OAuthRateLimiter table to the catalog [puppet] - 10https://gerrit.wikimedia.org/r/1109467 (https://phabricator.wikimedia.org/T363581) [17:42:28] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:42:29] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [17:44:03] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [17:44:04] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [17:45:16] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [17:45:17] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:47:19] (03PS1) 10Kamila Součková: kubernetes: reclaim eqiad videoscaler hosts [puppet] - 10https://gerrit.wikimedia.org/r/1109469 (https://phabricator.wikimedia.org/T354791) [17:47:32] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:47:33] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:47:39] (03CR) 10CI reject: [V:04-1] kubernetes: reclaim eqiad videoscaler hosts [puppet] - 10https://gerrit.wikimedia.org/r/1109469 (https://phabricator.wikimedia.org/T354791) (owner: 10Kamila Součková) [17:48:57] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:48:58] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [17:49:38] !log 
cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [17:49:39] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [17:50:17] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [17:51:23] (03PS1) 10Kamila Součková: kubernetes: fix my previous host rename CR [puppet] - 10https://gerrit.wikimedia.org/r/1109471 (https://phabricator.wikimedia.org/T365571) [17:53:00] (03CR) 10Hnowlan: [C:03+1] kubernetes: fix my previous host rename CR [puppet] - 10https://gerrit.wikimedia.org/r/1109471 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [17:53:33] (03CR) 10Kamila Součková: [C:03+2] kubernetes: fix my previous host rename CR [puppet] - 10https://gerrit.wikimedia.org/r/1109471 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [17:55:34] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1093.eqiad.wmnet with OS bookworm [17:55:39] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1094.eqiad.wmnet with OS bookworm [17:55:48] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1095.eqiad.wmnet with OS bookworm [17:58:10] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1093.eqiad.wmnet with OS bookworm [17:58:13] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1093 [17:58:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1093 [17:58:17] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1094.eqiad.wmnet with OS bookworm [17:58:21] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1094 [17:58:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host 
wikikube-worker1094 [17:58:26] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1095.eqiad.wmnet with OS bookworm [17:58:30] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1095 [17:58:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1095 [18:00:05] bd808: #bothumor My software never has bugs. It just develops random features. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1800). [18:00:05] swfrench-wmf: OwO what's this, a deployment window?? MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1800). nyaa~ [18:00:16] (03CR) 10BCornwall: [C:03+2] ncredir: Add wikimedia.ro/wikipedia.ro [puppet] - 10https://gerrit.wikimedia.org/r/1109123 (https://phabricator.wikimedia.org/T222080) (owner: 10BCornwall) [18:01:12] cdanis: how are things looking in terms of the trace sampling tuning you've been doing? 
[18:01:31] I have a change planned for the infra window, but can hold off for a bit if you need time [18:03:19] (03CR) 10BCornwall: [C:04-1] "Looks like you forgot to remove the original entries to -01 :)" [puppet] - 10https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080) (owner: 10Dzahn) [18:08:48] (03PS1) 10CDanis: final tweak [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109473 [18:13:45] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1093.eqiad.wmnet with reason: host reimage [18:13:59] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1095.eqiad.wmnet with reason: host reimage [18:14:10] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1094.eqiad.wmnet with reason: host reimage [18:14:46] (03PS3) 10Scott French: mw-(web|api-ext): revert to multi-DC sizing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078481 (https://phabricator.wikimedia.org/T376519) [18:17:02] (03CR) 10Scott French: [C:03+2] mw-(web|api-ext): revert to multi-DC sizing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078481 (https://phabricator.wikimedia.org/T376519) (owner: 10Scott French) [18:17:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1093.eqiad.wmnet with reason: host reimage [18:17:37] (03CR) 10CDanis: [C:03+2] final tweak [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109473 (owner: 10CDanis) [18:18:10] (03Merged) 10jenkins-bot: mw-(web|api-ext): revert to multi-DC sizing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078481 (https://phabricator.wikimedia.org/T376519) (owner: 10Scott French) [18:19:27] (03PS2) 10CDanis: final tweak [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109473 [18:20:14] (03CR) 10CDanis: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109473 (owner: 10CDanis) [18:20:44] !log 
kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1094.eqiad.wmnet with reason: host reimage [18:21:28] (03Merged) 10jenkins-bot: final tweak [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109473 (owner: 10CDanis) [18:21:33] coordinating with cdanis out of band, we'll be deploying both patches together to mw-web and mw-api-ext once 1109473 is merged [18:21:37] aaaand there is is [18:21:42] *it is [18:21:58] swfrench-wmf: merged and ready for you on deploy2002 [18:22:19] (03CR) 10Krinkle: ClusterConfig: add support for dumps trait (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109108 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto) [18:23:20] (03CR) 10Krinkle: ClusterConfig: add support for dumps trait (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109108 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto) [18:23:40] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [18:24:56] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [18:25:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1095.eqiad.wmnet with reason: host reimage [18:26:45] !log swfrench@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [18:27:51] !log swfrench@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [18:28:46] cdanis: FYI, I'll be spacing things out by 5-10m between eqiad and codfw just to validate that my maths weren't wildly off [18:28:52] ack! 
[18:33:43] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [18:35:22] (03CR) 10Cwhite: [C:03+2] prometheus: add ttl option to statsd-exporter, set to 30d [puppet] - 10https://gerrit.wikimedia.org/r/1105971 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [18:35:23] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [18:36:14] !log swfrench@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [18:36:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1093.eqiad.wmnet with OS bookworm [18:37:35] !log swfrench@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [18:38:11] cdanis: all yours! mw-api-int, mw-parsoid, mw-wikifunctions remain among those updated in 1109473 [18:38:33] whoops I mean jobrunner, not parsoid :) [18:38:43] yep :D thanks Scott! [18:39:50] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [18:40:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1094.eqiad.wmnet with OS bookworm [18:41:27] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [18:41:28] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [18:42:45] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [18:42:46] !log cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [18:44:02] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [18:44:03] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [18:44:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1095.eqiad.wmnet with OS bookworm [18:45:32] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [18:45:34] !log 
cdanis@deploy2002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [18:46:39] !log cdanis@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [18:46:41] !log cdanis@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [18:49:58] !log cdanis@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [19:00:05] dduvall and dancy: Your horoscope predicts another MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T1900). [19:05:16] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:05:22] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:07:23] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdp) failed in ms-be1090 - https://phabricator.wikimedia.org/T382874#10445725 (10VRiley-WMF) We have received the part.
Will update when this is completed [19:13:40] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10445755 (10Jhancock.wm) Service Request Number: 203753434 [19:18:22] PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:19:43] (03PS1) 10TrainBranchBot: group2 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109480 (https://phabricator.wikimedia.org/T382362) [19:19:44] (03CR) 10TrainBranchBot: [C:03+2] group2 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109480 (https://phabricator.wikimedia.org/T382362) (owner: 10TrainBranchBot) [19:20:28] (03Merged) 10jenkins-bot: group2 to 1.44.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109480 (https://phabricator.wikimedia.org/T382362) (owner: 10TrainBranchBot) [19:22:22] RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:25:32] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:34:05] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.44.0-wmf.11 refs T382362 [19:34:09] T382362: 1.44.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T382362 [19:38:38] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - 
https://phabricator.wikimedia.org/T378825#10445801 (10cmooney) @Andrew I've updated the switch config for this host to also trunk the //cloud-pirvate-b1-codfw// vlan, so should be ok on that front n... [19:43:32] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [19:44:53] (03PS1) 10CDanis: bump jaeger resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109481 [19:45:06] (03PS2) 10CDanis: bump jaeger resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109481 [19:46:57] (03PS3) 10CDanis: bump jaeger resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109481 [19:48:04] (03CR) 10CDanis: [C:03+2] bump jaeger resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109481 (owner: 10CDanis) [19:49:01] (03Merged) 10jenkins-bot: bump jaeger resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109481 (owner: 10CDanis) [19:49:13] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [19:50:35] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10445805 (10Jhancock.wm) it is cabled up and connected to port 43 on the cloud switch [19:51:28] (03PS1) 10Bking: cloudelastic: add cloudelastic10[12] into production [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) [19:52:04] (03PS1) 10CDanis: jaeger: 3Gi instead, 4Gi disallowed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109484 [19:52:17] (03PS2) 10Bking: cloudelastic: add cloudelastic10[12] into production [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) [19:52:25] (03CR) 10CDanis: [C:03+2] jaeger: 3Gi instead, 4Gi 
disallowed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109484 (owner: 10CDanis) [19:53:33] (03Merged) 10jenkins-bot: jaeger: 3Gi instead, 4Gi disallowed [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109484 (owner: 10CDanis) [19:54:13] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [19:54:29] !log cdanis@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [20:01:18] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:05:34] !log dcausse@deploy2002 Started deploy [airflow-dags/search@718e870]: search: switch query_clicks to SparkSqlOperator [20:05:54] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [20:06:01] !log dcausse@deploy2002 Finished deploy [airflow-dags/search@718e870]: search: switch query_clicks to SparkSqlOperator (duration: 00m 27s) [20:10:30] PROBLEM - Hadoop NodeManager on an-worker1117 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:29:36] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@0e4370e]: Canary event fix [20:30:59] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@0e4370e]: Canary event fix (duration: 01m 23s) [20:38:30] RECOVERY - Hadoop NodeManager on an-worker1117 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [20:39:20] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 2 others: Q2:rack/setup/install cloudelastic101[12] - 
https://phabricator.wikimedia.org/T378368#10445898 (10bking) We are currently getting [[ https://puppet-compiler.wmflabs.org/output/1109483/2700/cloudelastic1011.eqiad.wmnet/change.cloudelastic1011.e... [20:40:26] (03PS3) 10Dzahn: certificates: add wiki[m|p]edia.ro to ncredir Letsencrypt cert 7 [puppet] - 10https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080) [20:40:32] (03CR) 10Dzahn: "ugh, yea, I did. fixed!" [puppet] - 10https://gerrit.wikimedia.org/r/1108859 (https://phabricator.wikimedia.org/T222080) (owner: 10Dzahn) [20:40:47] 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080#10445900 (10BCornwall) @CRoslof I'm noticing that both wikimedia.org and wikipedia.ro have duplicate MarkMonitor entries - Could you please remove the inactive second one, please? Thanks! [20:51:12] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Idle - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:54:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 09 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1104732 (https://phabricator.wikimedia.org/T381379) (owner: 10Pppery) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T2100) [21:00:05] Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:10] here [21:07:30] RECOVERY - Hadoop NodeManager on an-worker1156 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:13:22] RECOVERY - Host mr1-esams.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 78.50 ms [21:18:59] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [21:19:46] PROBLEM - Host mr1-esams.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [21:25:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:27:32] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:32:32] RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:33:34] 06SRE, 10observability, 10Observability-Logging, 10Wikimedia-Logstash, 13Patch-For-Review: Reduce the number of fields declared in OpenSearch by logstash - https://phabricator.wikimedia.org/T180051#10446087 (10andrea.denisse) [21:35:26] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.196 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:36:22] PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 
0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:39:16] PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:40:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:43:02] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Sync cloudelastic1011 status change after Netbox update - bking@cumin2002 - T378368" [21:43:10] T378368: Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368 [21:44:24] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Sync cloudelastic1011 status change after Netbox update - bking@cumin2002 - T378368" [21:44:45] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [21:48:59] (03CR) 10Bartosz Dziewoński: [C:03+1] Update French wikinews license to CC-BY-SA 4.0 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: 10Dreamrimmer) [21:53:13] 06SRE, 10Observability-Logging, 06serviceops, 10WMF-General-or-Unknown: Re-consider ` >/dev/null 2>&1` as output of many cron'd MW maintenance scripts - https://phabricator.wikimedia.org/T187078#10446147 (10andrea.denisse) I think that having a list of the MW maintenance scripts that have this behavior wou... 
[22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250109T2200) [22:02:17] 06SRE, 10Observability-Metrics: Include apache_exporter in puppet module httpd (was: apache) - https://phabricator.wikimedia.org/T187434#10446170 (10andrea.denisse) Hi, I’m having trouble understanding the goal of this task. Could you clarify if it involves adding an include profile::prometheus::apache_export... [22:03:57] 06SRE, 10Observability-Logging, 10Wikimedia-Apache-configuration: Gain visibility into httpd mod_proxy actions - https://phabricator.wikimedia.org/T188601#10446174 (10andrea.denisse) Is this related to T187434 ? [22:05:32] PROBLEM - Hadoop NodeManager on an-worker1090 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:09:32] RECOVERY - Hadoop NodeManager on an-worker1090 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:14:16] RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:22:22] RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:34:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [22:38:45] 06SRE, 10SRE-Access-Requests: Requesting shell 
access to analytics-privatedata for Katherine Graessle - https://phabricator.wikimedia.org/T383241#10446369 (10Dzahn) Hello @Kgraessle it looks to me like you already have shell access, an SSH key and membership in analytics-privatedata-users. Could you share d... [22:53:11] (03PS3) 10Bking: cloudelastic: add cloudelastic10[12] into production [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) [22:55:22] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1109483 (https://phabricator.wikimedia.org/T378368) (owner: 10Bking) [22:58:32] PROBLEM - Hadoop NodeManager on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:00:09] !log bking@puppetserver1001:~$ sudo /usr/local/sbin/puppet-facts-upload --proxy http://webproxy.eqiad.wmnet:8080 T378368 [23:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:12] T378368: Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368 [23:00:23] !log bking@pcc-db1002.puppet-diffs.eqiad1.wikimedia.cloud sudo -u jenkins-deploy /usr/local/sbin/pcc_facts_processor T378368 [23:00:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:22] PROBLEM - Hadoop NodeManager on an-worker1147 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:02:32] RECOVERY - Hadoop NodeManager on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:22:22] 
RECOVERY - Hadoop NodeManager on an-worker1147 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:29:16] PROBLEM - Hadoop NodeManager on an-worker1124 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:31:32] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:44:16] RECOVERY - Hadoop NodeManager on an-worker1124 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:55:32] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process