[00:00:18] (03CR) 10Dzahn: "When looking at incident history I see that the page was only a manual page created by urbanecm with the text "Nearly complete Gerrit outa" [puppet] - 10https://gerrit.wikimedia.org/r/1113163 (owner: 10Jelto) [00:00:25] FIRING: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:00:56] (03CR) 10Dzahn: [C:04-1] gerrit: change blackbox checks to collaboration-services/task [puppet] - 10https://gerrit.wikimedia.org/r/1113163 (owner: 10Jelto) [00:05:25] RESOLVED: SystemdUnitFailed: prometheus_ferm_mss.service on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:05:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:06:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on wikikube-worker1121:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:09:54] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10482404 (10Jhancock.wm) yeah, found the original order. doesn't seem to be the case. T348059 i'm asking them if it's possibly the backplane or cabling issues. 
[00:10:27] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 618.55 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:11:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on wikikube-worker1121:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:15:19] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 559MiB (3% inode=39%): /tmp 559MiB (3% inode=39%): /var/tmp 559MiB (3% inode=39%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [00:26:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1121:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:31:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on wikikube-worker1121:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:35:19] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [00:36:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on wikikube-worker1121:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:38:26] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1113236 [00:38:26] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for 
wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1113236 (owner: 10TrainBranchBot) [00:46:13] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 69, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:51:40] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on wikikube-worker1121:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:57:56] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1113236 (owner: 10TrainBranchBot) [01:08:30] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1113240 [01:08:30] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1113240 (owner: 10TrainBranchBot) [01:19:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10482567 (10phaultfinder) [01:28:05] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1113240 (owner: 10TrainBranchBot) [01:31:29] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:31:39] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:31:53] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:38:47] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53368 bytes in 4.409 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:39:21] RECOVERY - mailman list 
info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:39:29] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:55:19] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 557MiB (3% inode=39%): /tmp 557MiB (3% inode=39%): /var/tmp 557MiB (3% inode=39%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [01:55:57] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [02:00:57] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [02:06:13] RECOVERY - Host ripe-atlas-eqiad is UP: PING OK - Packet loss = 0%, RTA = 30.66 ms [02:12:29] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 12.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:12:37] PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [02:15:19] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [02:23:52] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [02:33:52] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:50:33] (03CR) 10Bartosz Dziewoński: "Commented on the task. I'm sure we could make it work too, but I don't really want to restart the process of discovering and fixing new ex" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112219 (https://phabricator.wikimedia.org/T383916) (owner: 10Bartosz Dziewoński) [02:50:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 22 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1112219 (https://phabricator.wikimedia.org/T383916) (owner: 10Bartosz Dziewoński) [02:51:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:52:09] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [02:58:58] (03PS1) 10Andrea Denisse: ipmi: Remove absented check_procs Icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/1113245 (https://phabricator.wikimedia.org/T357099) [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:48] (03PS1) 10Andrea Denisse: clientbucket: Remove absented check_procs Icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/1113246 (https://phabricator.wikimedia.org/T357099) [03:14:53] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:15:31] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:18:21] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:18:43] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:23:39] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:23:53] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10482650 (10phaultfinder) [03:26:33] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:28:37] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 09 Apr 2025 10:34:17 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:28:43] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:29:23] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:32:09] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:43:46] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:58:46] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:58:48] FIRING: PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:03:34] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:03:53] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:04:24] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:04:43] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:05:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:11:48] FIRING: SLOMetricAbsent: - 
https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:16:48] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [06:19:08] (03PS1) 10Marostegui: db2175: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113252 (https://phabricator.wikimedia.org/T384376) [06:19:59] (03CR) 10Marostegui: [C:03+2] db2175: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113252 (https://phabricator.wikimedia.org/T384376) (owner: 10Marostegui) [06:20:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2175.codfw.wmnet with reason: rebuilding index [06:32:02] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T384415 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [06:32:12] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384415 (10ops-monitoring-bot) 03NEW [06:35:19] PROBLEM - Disk space on grafana2001 is CRITICAL: DISK CRITICAL - free space: / 549MiB (3% inode=39%): /tmp 549MiB (3% inode=39%): /var/tmp 549MiB (3% inode=39%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [06:42:56] FIRING: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:47:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:51:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:51:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1021 T384418', diff saved to https://phabricator.wikimedia.org/P72212 and previous config saved to /var/cache/conftool/dbconfig/20250122-065157-marostegui.json [06:52:02] T384418: decommission es1021.eqiad.wmnet - https://phabricator.wikimedia.org/T384418 [06:52:06] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T384419 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [06:52:10] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384419 (10ops-monitoring-bot) 03NEW [06:52:45] (03PS1) 10Marostegui: es1021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113353 (https://phabricator.wikimedia.org/T384418) [06:52:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:53:10] (03CR) 10Marostegui: [C:03+2] es1021: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113353 (https://phabricator.wikimedia.org/T384418) (owner: 10Marostegui) [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T0700) [07:09:59] 06SRE, 
06Data-Engineering, 10Dumps 2.0, 10Dumps-Generation: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098#10482785 (10Nemo_bis) Thanks for the update on XML data dumps list. I see there's progress on the other side: https://phabricator.wikimedia... [07:15:18] RECOVERY - Disk space on grafana2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=grafana2001&var-datasource=codfw+prometheus/ops [07:17:32] (03PS1) 10Marostegui: es1021: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1113355 (https://phabricator.wikimedia.org/T384418) [07:22:17] ACKNOWLEDGEMENT - Dell PowerEdge RAID Controller on an-presto1014 is CRITICAL: communication: 0 OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T384420 https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [07:22:26] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384420 (10ops-monitoring-bot) 03NEW [07:40:39] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384420#10482803 (10Kizule) →14Duplicate dup:03T384419 [07:40:42] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384419#10482805 (10Kizule) [07:41:49] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384419#10482809 (10Kizule) →14Duplicate dup:03T384415 [07:41:50] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-presto1014 - https://phabricator.wikimedia.org/T384415#10482811 (10Kizule) [07:55:08] (03CR) 10Slyngshede: [V:03+1 C:03+2] P:idp theme support [puppet] - 10https://gerrit.wikimedia.org/r/1112699 (https://phabricator.wikimedia.org/T372892) (owner: 10Slyngshede) [07:55:40] (03CR) 10Slyngshede: [C:03+2] Escape filter character [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1112750 
(owner: 10Slyngshede) [07:56:59] (03PS1) 10Raymond Ndibe: [toolforge] persist target logs in /var/log/pods in journald [puppet] - 10https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) [07:57:23] (03Merged) 10jenkins-bot: Escape filter character [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/1112750 (owner: 10Slyngshede) [07:58:49] FIRING: PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:00:05] Amir1, Urbanecm, and awight: Time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:05:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:11:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet [08:16:02] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Add DSantamaria to WMF group for access to https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10482859 (10jcrespo) @DSantamaria Could you please check access to Superset? Otherwise, we will consider the ticket as resolved. 
[08:20:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet [08:20:47] 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#10482861 (10jcrespo) Probably related: T384414 [08:22:31] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2019.codfw.wmnet to cluster codfw and group B [08:23:19] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10482863 (10jcrespo) [08:24:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2019.codfw.wmnet to cluster codfw and group B [08:24:38] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10482865 (10jcrespo) @SuzanneWood-WMDE Feel free to reach me in private or through IRC (jynus) if you need further clarification. 
[08:34:19] (03CR) 10Filippo Giunchedi: [C:03+2] thanos: further reduce trace sampling [puppet] - 10https://gerrit.wikimedia.org/r/1112700 (https://phabricator.wikimedia.org/T378190) (owner: 10Filippo Giunchedi) [08:40:59] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10482867 (10SuzanneWood-WMDE) @jcrespo - I added my public key here https://meta.wikimedia.org/wiki/User:Suzanne_Wood_(WMDE) [08:42:18] (03CR) 10Filippo Giunchedi: "See inline, LGTM otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/1113245 (https://phabricator.wikimedia.org/T357099) (owner: 10Andrea Denisse) [08:42:28] (03CR) 10Filippo Giunchedi: [C:03+1] clientbucket: Remove absented check_procs Icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/1113246 (https://phabricator.wikimedia.org/T357099) (owner: 10Andrea Denisse) [08:44:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10482870 (10phaultfinder) [08:47:41] (03CR) 10Filippo Giunchedi: [C:03+2] chartmuseum: remove icinga-based http checks [puppet] - 10https://gerrit.wikimedia.org/r/1113146 (https://phabricator.wikimedia.org/T384324) (owner: 10Filippo Giunchedi) [08:47:49] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10482877 (10jcrespo) [08:48:50] 06SRE, 10SRE-Access-Requests: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10482880 (10jcrespo) Thank you, compared and checked edit, proceeding with the patch. [08:48:59] (03CR) 10Vgutierrez: [C:03+2] acme_chief: Allow specifying an account per certificate [puppet] - 10https://gerrit.wikimedia.org/r/1113187 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [08:49:36] godog: please merge my CR if it's showing in your puppet-merge run [08:50:07] vgutierrez: it didn't! 
[08:50:16] cool, merging it now :D [08:59:16] (03PS1) 10Vgutierrez: hiera: Issue unified cert with pki.goog on acmechief-test [puppet] - 10https://gerrit.wikimedia.org/r/1113417 (https://phabricator.wikimedia.org/T384195) [09:00:02] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113417 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [09:00:04] brennen and jeena: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T0900). [09:08:11] (03PS1) 10Jcrespo: admin: Add suzannewood to the 'restricted' group [puppet] - 10https://gerrit.wikimedia.org/r/1113418 (https://phabricator.wikimedia.org/T384018) [09:09:53] (03CR) 10Jcrespo: [C:03+1] admin: Add suzannewood to the 'restricted' group [puppet] - 10https://gerrit.wikimedia.org/r/1113418 (https://phabricator.wikimedia.org/T384018) (owner: 10Jcrespo) [09:10:50] jouncebot: nowandnext [09:10:50] For the next 1 hour(s) and 49 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T0900) [09:10:50] In 1 hour(s) and 49 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T1100) [09:12:15] (03CR) 10AOkoth: "Ack. I'll abandon this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1108112 (https://phabricator.wikimedia.org/T382964) (owner: 10AOkoth) [09:12:32] (03PS1) 10Muehlenhoff: Move ganeti2021 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1113419 [09:12:33] (03Abandoned) 10AOkoth: docs: alert only on active host [puppet] - 10https://gerrit.wikimedia.org/r/1108112 (https://phabricator.wikimedia.org/T382964) (owner: 10AOkoth) [09:15:50] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2021.codfw.wmnet [09:16:05] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10482918 (10ops-monitoring-bot) Draining ganeti2021.codfw.wmnet of running VMs [09:16:29] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: os upgrade [09:18:13] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10482925 (10MoritzMuehlenhoff) [09:19:54] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1113418 (https://phabricator.wikimedia.org/T384018) (owner: 10Jcrespo) [09:20:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2021.codfw.wmnet [09:20:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2021.codfw.wmnet [09:21:04] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10482946 (10ops-monitoring-bot) Draining ganeti2021.codfw.wmnet of running VMs [09:21:20] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1240.eqiad.wmnet with OS bookworm [09:25:13] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#10482995 (10Jelto) [09:38:13] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1240.eqiad.wmnet with reason: host reimage [09:39:19] (03PS1) 10Jcrespo: admin: Deploying approved policy change to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/1113420 (https://phabricator.wikimedia.org/T381824) [09:40:49] (03PS2) 10Jcrespo: admin: Deploying WMDE privatedata policy change to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/1113420 (https://phabricator.wikimedia.org/T381824) [09:40:49] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1240.eqiad.wmnet with reason: host reimage [09:41:08] (03PS3) 10Jcrespo: admin: Deploy WMDE privatedata policy change to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/1113420 (https://phabricator.wikimedia.org/T381824) [09:41:36] (03CR) 10Volans: [C:03+1] "I haven't tested it but the code LGTM, optional nit inline." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1092895 (https://phabricator.wikimedia.org/T379553) (owner: 10Cathal Mooney) [09:41:36] (03CR) 10Jcrespo: [C:04-1] "Not changed the wiki yet." [puppet] - 10https://gerrit.wikimedia.org/r/1113420 (https://phabricator.wikimedia.org/T381824) (owner: 10Jcrespo) [09:55:25] (03PS2) 10Clément Goubert: statsd-exporter: set ttl to 30d [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105972 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [09:56:37] (03PS2) 10Vgutierrez: hiera: Issue unified cert with pki.goog on acmechief-test [puppet] - 10https://gerrit.wikimedia.org/r/1113417 (https://phabricator.wikimedia.org/T384195) [09:56:37] (03PS1) 10Vgutierrez: acme_chief: Allow setting key_types per certificate [puppet] - 10https://gerrit.wikimedia.org/r/1113422 (https://phabricator.wikimedia.org/T370837) [09:57:09] (03PS1) 10Urbanecm: ValidatorFactory: Allow extensions to register validators [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113423 (https://phabricator.wikimedia.org/T384246) [09:57:17] (03PS1) 10Urbanecm: ValidatorFactory: Allow extensions to register validators [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1113424 (https://phabricator.wikimedia.org/T384246) [09:58:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113423 (https://phabricator.wikimedia.org/T384246) (owner: 10Urbanecm) [09:58:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1113424 (https://phabricator.wikimedia.org/T384246) (owner: 10Urbanecm) [09:58:24] (03PS2) 10Vgutierrez: acme_chief: Allow setting key_types per certificate [puppet] - 
10https://gerrit.wikimedia.org/r/1113422 (https://phabricator.wikimedia.org/T370837) [09:58:24] (03PS3) 10Vgutierrez: hiera: Issue unified cert with pki.goog on acmechief-test [puppet] - 10https://gerrit.wikimedia.org/r/1113417 (https://phabricator.wikimedia.org/T384195) [10:03:27] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1240.eqiad.wmnet with OS bookworm [10:04:11] (03CR) 10Clément Goubert: [C:03+1] statsd-exporter: set ttl to 30d [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105972 (https://phabricator.wikimedia.org/T359497) (owner: 10Cwhite) [10:05:26] (03CR) 10Jcrespo: [C:03+1] "Feel free to review wiki changes documenting this:" [puppet] - 10https://gerrit.wikimedia.org/r/1113420 (https://phabricator.wikimedia.org/T381824) (owner: 10Jcrespo) [10:06:10] (03PS4) 10Jcrespo: admin: Deploy WMDE privatedata policy change to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/1113420 (https://phabricator.wikimedia.org/T381824) [10:06:29] (03CR) 10Jcrespo: [C:03+2] admin: Add suzannewood to the 'restricted' group [puppet] - 10https://gerrit.wikimedia.org/r/1113418 (https://phabricator.wikimedia.org/T384018) (owner: 10Jcrespo) [10:09:40] PROBLEM - Hadoop NodeManager on an-worker1172 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:14:25] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for administrators of Indonesian projects - https://phabricator.wikimedia.org/T384135#10483227 (10SpartacksCompatriot) Thanks @Ladsgroup! I'm trying to get access to the list using my existing lists.wikimedia account, and I just realized that I put the wrong... 
[10:16:09] 06SRE, 06Infrastructure-Foundations, 10observability: LibreNMS changes on every puppet run since upgrade to 24.12 - https://phabricator.wikimedia.org/T384440 (10cmooney) 03NEW p:05Triage→03Medium [10:16:27] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113422 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez) [10:16:37] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113417 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [10:27:40] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 'restricted' for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T384018#10483263 (10jcrespo) 05Open→03Resolved a:03jcrespo @SuzanneWood-WMDE your access has been deployed, although it may take a few minutes to replicate to... [10:29:17] 06SRE, 10Wikimedia-Mailing-lists: Mailing list for administrators of Indonesian projects - https://phabricator.wikimedia.org/T384135#10483271 (10Ladsgroup) {{done}} [10:29:40] RECOVERY - Hadoop NodeManager on an-worker1172 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:32:39] (03PS1) 10Marostegui: Revert "db2175: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1113428 [10:33:05] (03CR) 10Marostegui: [C:03+2] Revert "db2175: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1113428 (owner: 10Marostegui) [10:33:25] (03CR) 10Jcrespo: "What's the status of this? I see @Thecipriani approved, but maybe I (or Chris) can setup a ticket so there is the right paperwork trace?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1110867 (owner: 10CDanis) [10:33:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72213 and previous config saved to /var/cache/conftool/dbconfig/20250122-103342-root.json [10:38:24] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7001.magru.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd [10:38:31] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10483284 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fe2806ef-4f5c-4485-981c-52b89f9e3154) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th... [10:38:52] !log disable-puppet on netflow7001 to run gnmic in foreground for debug/development T369384 [10:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:57] T369384: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384 [10:45:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:48:37] (03PS1) 10Jelto: gerrit: add gerrit_abusers to block IPs [puppet] - 10https://gerrit.wikimedia.org/r/1113429 (https://phabricator.wikimedia.org/T348734) [10:48:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72214 and previous config saved to /var/cache/conftool/dbconfig/20250122-104848-root.json [10:50:15] (03PS1) 10Mvolz: Fix CSS in Docker registry builder [puppet] - 10https://gerrit.wikimedia.org/r/1113430 [10:51:25] FIRING: SystemdUnitFailed: 
debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:51:37] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4847/co" [puppet] - 10https://gerrit.wikimedia.org/r/1113429 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [10:51:40] PROBLEM - Hadoop NodeManager on an-worker1175 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [10:52:10] (03PS2) 10Mvolz: Fix CSS in Docker registry builder [puppet] - 10https://gerrit.wikimedia.org/r/1113430 [10:58:56] (03CR) 10CI reject: [V:04-1] ValidatorFactory: Allow extensions to register validators [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113423 (https://phabricator.wikimedia.org/T384246) (owner: 10Urbanecm) [10:59:56] !log Deploy schema change in codfw x1 with replication on the master dbmaint T381759 [10:59:57] (03CR) 10Clément Goubert: [C:03+1] shellbox-constraints: 1 eqiad replica on 8.1 (change 1/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113217 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [11:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:00] T381759: Add "event_is_test_event" field to "campaign_events" table - https://phabricator.wikimedia.org/T381759 [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T1100) [11:00:07] (03CR) 10Clément Goubert: [C:03+1] shellbox-constraints: all eqiad replicas on 8.1 (change 2/3) [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/1113218 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [11:00:22] (03CR) 10Clément Goubert: [C:03+1] shellbox-constraints: all replicas on PHP 8.1 (change 3/3) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113219 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [11:00:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:00:45] 06SRE, 10Wikimedia-Mailing-lists: https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ won't load when logged in - https://phabricator.wikimedia.org/T381980#10483396 (10jcrespo) The page loaded for me while logged in, but it took almost 3 minutes: {F58247275} I believe this is j... [11:01:32] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#10483407 (10jcrespo) [11:01:36] 06SRE, 10Wikimedia-Mailing-lists: https://lists.wikimedia.org/postorius/lists/mediawiki-announce.lists.wikimedia.org/ won't load when logged in - https://phabricator.wikimedia.org/T381980#10483410 (10jcrespo) →14Duplicate dup:03T353891 [11:03:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72215 and previous config saved to /var/cache/conftool/dbconfig/20250122-110354-root.json [11:07:40] RECOVERY - Hadoop NodeManager on an-worker1175 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [11:08:03] 06SRE, 10Wikimedia-Mailing-lists: Hang up on daily 
article lists - https://phabricator.wikimedia.org/T349406#10483425 (10jcrespo) I think the issue was real, just happened before only at certain times. Merging into the latest mailman slowness ticket. [11:08:19] 06SRE, 10Wikimedia-Mailing-lists: Hang up on daily article lists - https://phabricator.wikimedia.org/T349406#10483430 (10jcrespo) →14Duplicate dup:03T353891 [11:08:20] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#10483427 (10jcrespo) [11:08:49] RESOLVED: PuppetFailure: Puppet has failed on ml-lab1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:10:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2166 T383709', diff saved to https://phabricator.wikimedia.org/P72216 and previous config saved to /var/cache/conftool/dbconfig/20250122-111019-marostegui.json [11:10:24] T383709: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709 [11:10:50] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2166.codfw.wmnet with reason: Onsite work [11:11:19] (03PS2) 10Marostegui: es1021: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1113355 (https://phabricator.wikimedia.org/T384418) [11:11:19] (03PS1) 10Marostegui: db2166: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113432 (https://phabricator.wikimedia.org/T383709) [11:11:53] (03CR) 10Marostegui: [C:03+2] db2166: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1113432 (https://phabricator.wikimedia.org/T383709) (owner: 10Marostegui) [11:11:57] (03CR) 10Marostegui: [C:03+2] es1021: Remove from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1113355 
(https://phabricator.wikimedia.org/T384418) (owner: 10Marostegui) [11:13:03] (03CR) 10Fabfur: [C:03+1] geo-maps: put eqiad at lowest priority for T380858 [dns] - 10https://gerrit.wikimedia.org/r/1113205 (https://phabricator.wikimedia.org/T380858) (owner: 10Ssingh) [11:13:54] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 3 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10483483 (10Marostegui) @Papaul db2166 can be moved anytime. The host has been powered off. [11:14:16] 06SRE, 10Wikimedia-Mailing-lists: Undelivered mail posted to wikimediacz-l - https://phabricator.wikimedia.org/T348158#10483488 (10jcrespo) 05Open→03Declined I have to say that I have suffered similar cases where gmail "eats" an email without trace. Sadly, after 1 and a half years, we are unlikely to... [11:14:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove es1021 from dbctl T384418', diff saved to https://phabricator.wikimedia.org/P72217 and previous config saved to /var/cache/conftool/dbconfig/20250122-111428-root.json [11:14:33] (03PS4) 10Effie Mouzeli: php8.1-cli: introduce opcache and JIT [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113124 (https://phabricator.wikimedia.org/T384294) [11:14:33] T384418: decommission es1021.eqiad.wmnet - https://phabricator.wikimedia.org/T384418 [11:14:46] (03PS2) 10Effie Mouzeli: php8.1: introduce JIT [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113138 (https://phabricator.wikimedia.org/T384294) [11:16:20] 06SRE, 10Wikimedia-Mailing-lists: Add custom footer linking to Privacy Policy in Postorious and Hyperkitty - https://phabricator.wikimedia.org/T340375#10483496 (10jcrespo) I am merging T344000 here- suggesting having a contact method on the footer, too. 
[11:16:30] 06SRE, 10Wikimedia-Mailing-lists: Add custom footer linking to Privacy Policy in Postorious and Hyperkitty - https://phabricator.wikimedia.org/T340375#10483500 (10jcrespo) [11:16:32] 06SRE, 10Wikimedia-Mailing-lists: lists.wikimedia.org pages should have a "who to contact" link - https://phabricator.wikimedia.org/T344000#10483503 (10jcrespo) →14Duplicate dup:03T340375 [11:18:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72218 and previous config saved to /var/cache/conftool/dbconfig/20250122-111859-root.json [11:19:00] (03Merged) 10jenkins-bot: ValidatorFactory: Allow extensions to register validators [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1113424 (https://phabricator.wikimedia.org/T384246) (owner: 10Urbanecm) [11:19:26] (03CR) 10Urbanecm: [V:03+2] "passed on master and in wmf.12" [extensions/CommunityConfiguration] (wmf/1.44.0-wmf.13) - 10https://gerrit.wikimedia.org/r/1113423 (https://phabricator.wikimedia.org/T384246) (owner: 10Urbanecm) [11:20:32] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1113423|ValidatorFactory: Allow extensions to register validators (T384246)]], [[gerrit:1113424|ValidatorFactory: Allow extensions to register validators (T384246)]] [11:20:36] T384246: Allow extensions to register their own validators - https://phabricator.wikimedia.org/T384246 [11:20:55] (03PS3) 10Effie Mouzeli: php8.1: introduce JIT [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113138 (https://phabricator.wikimedia.org/T384294) [11:21:27] (03CR) 10Effie Mouzeli: php8.1: introduce JIT (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113138 (https://phabricator.wikimedia.org/T384294) (owner: 10Effie Mouzeli) [11:22:17] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10LPL Essential (LPL Essential 2024 Nov-Jan), 
07Unplanned-Sprint-Work: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10483521 (10KartikMistry) 05Open→03In progress [11:25:58] (03CR) 10Effie Mouzeli: [C:03+1] "makes sense, thanks for pointing it out" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080388 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French) [11:27:27] 06SRE, 10Wikimedia-Mailing-lists: [[mail:]] should redirect to the main page https://lists.wikimedia.org/postorius/lists/ - https://phabricator.wikimedia.org/T309558#10483543 (10jcrespo) PrimeHunter: is the suggestion that we add, as a workaround, a 301 permanent redirect https://lists.wikimedia.org/postorius/... [11:27:37] 06SRE, 10Wikimedia-Mailing-lists: [[mail:]] should redirect to the main page https://lists.wikimedia.org/postorius/lists/ - https://phabricator.wikimedia.org/T309558#10483545 (10jcrespo) p:05Medium→03Low [11:28:21] (03PS3) 10Jelto: gerrit: add gerrit_abusers to block IPs [puppet] - 10https://gerrit.wikimedia.org/r/1113429 (https://phabricator.wikimedia.org/T348734) [11:30:59] 06SRE, 10Wikimedia-Mailing-lists: Postorius (held and) reported full headers get mangled somewhere in the system - https://phabricator.wikimedia.org/T309492#10483554 (10jcrespo) Hello, @grin is this still happening in current versions of mailman installed at WMF servers? If not fixed, it may be and #upstream p... 
[11:32:27] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113423|ValidatorFactory: Allow extensions to register validators (T384246)]], [[gerrit:1113424|ValidatorFactory: Allow extensions to register validators (T384246)]] (duration: 11m 55s) [11:32:31] T384246: Allow extensions to register their own validators - https://phabricator.wikimedia.org/T384246 [11:34:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72220 and previous config saved to /var/cache/conftool/dbconfig/20250122-113404-root.json [11:40:52] PROBLEM - MariaDB Replica Lag: s1 on db1240 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 8530.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:42:11] ^that's me [11:42:15] downtime expired, extending it [11:43:20] !log testing acme-chief 0.38 in acmechief-test1001 [11:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:39] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10483614 (10Jelto) >>! In T383709#10480819, @Papaul wrote: > @Jelto thanks please let us know when best works fo... [11:43:49] (03PS1) 10Federico Ceratto: site.pp remove "Future" as db2233 is already the master [puppet] - 10https://gerrit.wikimedia.org/r/1113436 [11:45:13] (03CR) 10Federico Ceratto: "Updated" [puppet] - 10https://gerrit.wikimedia.org/r/1113183 (https://phabricator.wikimedia.org/T384343) (owner: 10Federico Ceratto) [11:46:16] (03CR) 10Marostegui: [C:03+1] "This looks good, but reminder: we have to merge this AFTER we've finished with the decommissioning script." 
[puppet] - 10https://gerrit.wikimedia.org/r/1113183 (https://phabricator.wikimedia.org/T384343) (owner: 10Federico Ceratto) [11:46:53] (03PS3) 10Vgutierrez: acme_chief: Allow setting key_types per certificate [puppet] - 10https://gerrit.wikimedia.org/r/1113422 (https://phabricator.wikimedia.org/T370837) [11:46:56] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113422 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez) [11:47:14] (03CR) 10Muehlenhoff: gerrit: add gerrit_abusers to block IPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113429 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [11:48:35] (03PS4) 10Jelto: gerrit: add gerrit_abusers to block IPs [puppet] - 10https://gerrit.wikimedia.org/r/1113429 (https://phabricator.wikimedia.org/T348734) [11:48:55] (03CR) 10Jelto: gerrit: add gerrit_abusers to block IPs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1113429 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [11:49:43] (03CR) 10Vgutierrez: [C:03+2] acme_chief: Allow setting key_types per certificate [puppet] - 10https://gerrit.wikimedia.org/r/1113422 (https://phabricator.wikimedia.org/T370837) (owner: 10Vgutierrez) [11:50:25] (03PS2) 10Hnowlan: rest-gateway: add page/lint rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112797 (https://phabricator.wikimedia.org/T384216) [11:53:01] (03CR) 10Vgutierrez: [C:03+1] trafficserver: reoute testwiki citoid calls to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1113178 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [11:54:08] (03CR) 10Vgutierrez: [C:03+2] hiera: Issue unified cert with pki.goog on acmechief-test [puppet] - 10https://gerrit.wikimedia.org/r/1113417 (https://phabricator.wikimedia.org/T384195) (owner: 10Vgutierrez) [11:57:27] (03CR) 10Clément Goubert: [C:03+1] rest-gateway: add page/lint rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112797 
(https://phabricator.wikimedia.org/T384216) (owner: 10Hnowlan) [11:59:15] (03PS1) 10Vgutierrez: Revert "hiera: Issue unified cert with pki.goog on acmechief-test" [puppet] - 10https://gerrit.wikimedia.org/r/1113439 [11:59:59] (03PS2) 10Federico Ceratto: site.pp, db2133.yaml: Remove db2133 [puppet] - 10https://gerrit.wikimedia.org/r/1113183 (https://phabricator.wikimedia.org/T384343) [12:00:03] (03CR) 10Vgutierrez: [C:03+2] Revert "hiera: Issue unified cert with pki.goog on acmechief-test" [puppet] - 10https://gerrit.wikimedia.org/r/1113439 (owner: 10Vgutierrez) [12:00:04] mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T1200). [12:01:21] (03CR) 10Marostegui: [C:03+1] "This looks good, but reminder: we have to merge this AFTER we've finished with the decommissioning script." [puppet] - 10https://gerrit.wikimedia.org/r/1113183 (https://phabricator.wikimedia.org/T384343) (owner: 10Federico Ceratto) [12:01:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:02:07] !log jynus@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1205.eqiad.wmnet with reason: os upgrade [12:05:28] !log root@cumin1002 START - Cookbook sre.hosts.reimage for host db1205.eqiad.wmnet with OS bookworm [12:05:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:44] (03PS1) 10Muehlenhoff: Manage tile invalidation with a separate Hiera setting [puppet] - 
10https://gerrit.wikimedia.org/r/1113442 (https://phabricator.wikimedia.org/T381565) [12:07:06] (03CR) 10CI reject: [V:04-1] Manage tile invalidation with a separate Hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/1113442 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [12:10:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2021.codfw.wmnet [12:10:47] (03PS1) 10JMeybohm: Update coredns to 1.11.3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113445 (https://phabricator.wikimedia.org/T341984) [12:11:29] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti2021.codfw.wmnet with reason: remove from cluster for reimage [12:12:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10483696 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=00d48b0c-86e6-471d-a6ad-c116ef597e9d) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(... 
[12:13:39] (03CR) 10Mvolz: [C:03+2] Update Zotero translators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113137 (https://phabricator.wikimedia.org/T384165) (owner: 10Mvolz) [12:14:01] 10SRE-tools, 06Infrastructure-Foundations: Reimaging with a cookbook this kind of server with a wrong management password leads to exception - https://phabricator.wikimedia.org/T384462 (10jcrespo) 03NEW [12:15:08] (03Merged) 10jenkins-bot: Update Zotero translators [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113137 (https://phabricator.wikimedia.org/T384165) (owner: 10Mvolz) [12:16:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:16:53] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/zotero: apply [12:17:15] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/zotero: apply [12:17:42] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/zotero: apply [12:18:08] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/zotero: apply [12:19:31] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/zotero: apply [12:20:00] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [12:21:42] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:22:18] !log root@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1205.eqiad.wmnet with reason: host reimage [12:24:02] !log disabling puppet on A:cp to test r/1113178 [12:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:46] (03CR) 10Hnowlan: 
[C:03+2] trafficserver: reoute testwiki citoid calls to rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/1113178 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [12:25:48] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1205.eqiad.wmnet with reason: host reimage [12:28:04] (03PS2) 10Slyngshede: Add OIDC support to development environment [software/bitu] - 10https://gerrit.wikimedia.org/r/1112221 [12:28:57] (03CR) 10Muehlenhoff: [C:03+2] Move ganeti2021 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1113419 (owner: 10Muehlenhoff) [12:30:22] (03PS3) 10Slyngshede: Add OIDC support to development environment [software/bitu] - 10https://gerrit.wikimedia.org/r/1112221 [12:33:59] !log creating new schema of file tables everywhere (T368113) [12:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:04] T368113: Design and merge the new tables of file tables - https://phabricator.wikimedia.org/T368113 [12:34:59] (03PS1) 10Hnowlan: Revert "trafficserver: reoute testwiki citoid calls to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1113446 [12:35:53] (03CR) 10Hnowlan: [C:03+2] rest-gateway: add page/lint rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112797 (https://phabricator.wikimedia.org/T384216) (owner: 10Hnowlan) [12:37:14] (03CR) 10Hnowlan: [C:03+2] Revert "trafficserver: reoute testwiki citoid calls to rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/1113446 (owner: 10Hnowlan) [12:37:31] (03Merged) 10jenkins-bot: rest-gateway: add page/lint rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1112797 (https://phabricator.wikimedia.org/T384216) (owner: 10Hnowlan) [12:38:06] (03CR) 10Mvolz: "When will this get deployed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1113178 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [12:41:20] (03CR) 10Hnowlan: [C:03+2] "This has already been deployed and since reverted - I'll follow up in the phab issue" [puppet] - 10https://gerrit.wikimedia.org/r/1113178 (https://phabricator.wikimedia.org/T361576) (owner: 10Hnowlan) [12:41:46] 10SRE-tools, 06Infrastructure-Foundations: Reimaging with a cookbook this kind of server with a wrong management password leads to exception - https://phabricator.wikimedia.org/T384462#10483791 (10Volans) p:05Triage→03Low If we want to catch the specific error of unauthorized that should be done in Spicera... [12:42:01] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Reimaging with a cookbook this kind of server with a wrong management password leads to exception - https://phabricator.wikimedia.org/T384462#10483794 (10Volans) [12:42:52] (03CR) 10Muehlenhoff: [C:03+2] sre.hosts.reimage: Skip the vlan migration reminder for ganeti nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1113167 (owner: 10Muehlenhoff) [12:44:35] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890#10483800 (10MoritzMuehlenhoff) [12:47:53] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2021.codfw.wmnet with OS bookworm [12:48:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10483804 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2021.codfw.wmnet with OS bookworm [12:48:30] !log root@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1205.eqiad.wmnet with OS bookworm [12:51:34] (03PS3) 10Ladsgroup: Add new file tables to WMCS views [puppet] - 10https://gerrit.wikimedia.org/r/1110046 (https://phabricator.wikimedia.org/T383491) [12:51:41] 
(03CR) 10Ladsgroup: [V:03+2 C:03+2] Add new file tables to WMCS views [puppet] - 10https://gerrit.wikimedia.org/r/1110046 (https://phabricator.wikimedia.org/T383491) (owner: 10Ladsgroup) [12:52:39] (03PS1) 10Federico Ceratto: Revert "db2189: Disable notications" [puppet] - 10https://gerrit.wikimedia.org/r/1113447 [12:54:42] (03CR) 10Federico Ceratto: [C:03+2] "Reverting to enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1113447 (owner: 10Federico Ceratto) [12:57:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1112221 (owner: 10Slyngshede) [12:58:11] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1113429 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [12:58:54] (03PS4) 10Slyngshede: Add OIDC support to development environment [software/bitu] - 10https://gerrit.wikimedia.org/r/1112221 [13:00:13] 06SRE, 10Wikimedia-Mailing-lists: [[mail:]] should redirect to the main page https://lists.wikimedia.org/postorius/lists/ - https://phabricator.wikimedia.org/T309558#10483835 (10PrimeHunter) >>! In T309558#10483543, @jcrespo wrote: > PrimeHunter: is the suggestion that we add, as a workaround, a 301 permanent... 
[13:00:15] (03CR) 10Slyngshede: Add OIDC support to development environment (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1112221 (owner: 10Slyngshede) [13:00:42] FIRING: JobUnavailable: Reduced availability for job gnmic in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:01:17] (03CR) 10Muehlenhoff: [C:03+1] Add OIDC support to development environment [software/bitu] - 10https://gerrit.wikimedia.org/r/1112221 (owner: 10Slyngshede) [13:01:29] (03PS2) 10Muehlenhoff: Manage tile invalidation with a separate Hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/1113442 (https://phabricator.wikimedia.org/T381565) [13:01:52] (03CR) 10CI reject: [V:04-1] Manage tile invalidation with a separate Hiera setting [puppet] - 10https://gerrit.wikimedia.org/r/1113442 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [13:03:27] (03PS1) 10Cathal Mooney: Add BGP data collection from network devices over GNMI [puppet] - 10https://gerrit.wikimedia.org/r/1113449 (https://phabricator.wikimedia.org/T369384) [13:05:41] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113449 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney) [13:05:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:05:56] (03CR) 10Slyngshede: [C:03+2] Add OIDC support to development environment [software/bitu] - 10https://gerrit.wikimedia.org/r/1112221 (owner: 10Slyngshede) [13:06:06] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Productionize gnmic network telemetry pipeline - 
https://phabricator.wikimedia.org/T369384#10483855 (10cmooney) The above patch adds BGP stats collection to our current setup. Tested in Magru and working well, albeit with a few quirks disc... [13:08:44] !log repooling db2189 as per T384202 [13:08:47] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2189 slowly with 10 steps - Repool host after fixing indexes and performing OS updates [13:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:48] T384202: db2189 replication broken - https://phabricator.wikimedia.org/T384202 [13:10:03] (03PS1) 10STran: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113450 (https://phabricator.wikimedia.org/T383892) [13:10:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:11:01] (03Merged) 10jenkins-bot: Add OIDC support to development environment [software/bitu] - 10https://gerrit.wikimedia.org/r/1112221 (owner: 10Slyngshede) [13:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:11:39] (03PS2) 10Cathal Mooney: WIP: Add BGP data collection from network devices over GNMI [puppet] - 10https://gerrit.wikimedia.org/r/1113449 (https://phabricator.wikimedia.org/T369384) [13:12:00] (03CR) 10Cathal Mooney: "Guys hold off on any review I realised an error in my thinking." 
[puppet] - 10https://gerrit.wikimedia.org/r/1113449 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney) [13:12:57] (03PS1) 10Andrew Bogott: configure_cephosd_disks(): don't give up if scsi drive count is wrong [puppet] - 10https://gerrit.wikimedia.org/r/1113451 (https://phabricator.wikimedia.org/T383817) [13:15:52] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7001.magru.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd [13:16:09] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10483868 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ba072b6c-6957-428b-a932-dfcf0b3f8103) set by cmooney@cumin1002 for 2:00:... [13:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:22:19] (03CR) 10Máté Szabó: [C:03+1] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113450 (https://phabricator.wikimedia.org/T383892) (owner: 10STran) [13:24:11] jouncebot: nowandnext [13:24:11] No deployments scheduled for the next 0 hour(s) and 35 minute(s) [13:24:11] In 0 hour(s) and 35 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T1400) [13:28:53] (03PS1) 10JMeybohm: Pin coredns version on all clustes to 0.3.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113453 (https://phabricator.wikimedia.org/T341984) [13:28:54] (03PS1) 10JMeybohm: Update coredns to 1.11.3 / coredns helm chart 1.37.3 [deployment-charts] - 
https://gerrit.wikimedia.org/r/1113454 (https://phabricator.wikimedia.org/T341984)
[13:29:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2133.codfw.wmnet
[13:29:42] !log Deploying security patch for T384244
[13:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:29] RECOVERY - Disk space on ml-lab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[13:30:42] (PS10) JMeybohm: Update staging-codfw to k8s 1.31 [deployment-charts] - https://gerrit.wikimedia.org/r/1112059 (https://phabricator.wikimedia.org/T341984)
[13:32:06] (CR) Marostegui: [C:+1] "This can go ahead and get merged" [puppet] - https://gerrit.wikimedia.org/r/1113183 (https://phabricator.wikimedia.org/T384343) (owner: Federico Ceratto)
[13:34:10] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox
[13:34:29] (CR) STran: [C:+2] ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1113450 (https://phabricator.wikimedia.org/T383892) (owner: STran)
[13:35:30] (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1113450 (https://phabricator.wikimedia.org/T383892) (owner: STran)
[13:36:09] (CR) Federico Ceratto: [C:+1] site.pp, db2133.yaml: Remove db2133 [puppet] - https://gerrit.wikimedia.org/r/1113183 (https://phabricator.wikimedia.org/T384343) (owner: Federico Ceratto)
[13:37:28] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[13:37:33] !log marostegui@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2133.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[13:37:56] !log marostegui@cumin1002 END (PASS) - Cookbook
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2133.codfw.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1002"
[13:37:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:37:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2133.codfw.wmnet
[13:37:59] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2021.codfw.wmnet with reason: host reimage
[13:38:15] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[13:38:40] !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[13:39:23] !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[13:39:46] !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[13:40:20] !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[13:41:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2021.codfw.wmnet with reason: host reimage
[13:44:07] (CR) Federico Ceratto: [C:+2] site.pp, db2133.yaml: Remove db2133 [puppet] - https://gerrit.wikimedia.org/r/1113183 (https://phabricator.wikimedia.org/T384343) (owner: Federico Ceratto)
[13:46:24] !log Removing db2133 from zarcillo T384343
[13:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:28] T384343: decommission db2133.codfw.wmnet - https://phabricator.wikimedia.org/T384343
[13:51:35] (PS3) Muehlenhoff: Manage tile invalidation with a separate Hiera setting [puppet] - https://gerrit.wikimedia.org/r/1113442 (https://phabricator.wikimedia.org/T381565)
[13:51:58] (CR) CI reject: [V:-1] Manage tile invalidation with a separate Hiera setting [puppet] - https://gerrit.wikimedia.org/r/1113442 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[13:52:42] FIRING: JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:53:30] ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10483996 (Jhancock.wm) @Marostegui i'll move it this morning. @Jelto That time works for us. thank you both!
[13:54:17] (PS4) Muehlenhoff: Manage tile invalidation with a separate Hiera setting [puppet] - https://gerrit.wikimedia.org/r/1113442 (https://phabricator.wikimedia.org/T381565)
[13:56:44] !log Deployed security patch for T384244
[13:56:46] (CR) David Caro: [toolforge] persist target logs in /var/log/pods in journald (7 comments) [puppet] - https://gerrit.wikimedia.org/r/1113412 (https://phabricator.wikimedia.org/T383081) (owner: Raymond Ndibe)
[13:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:42] RESOLVED: JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:00:04] ops-codfw, SRE, collaboration-services, Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10484020 (Jhancock.wm)
[14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T1400).
[14:00:05] MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed.
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:31] i can deploy today
[14:00:32] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1113442 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[14:00:34] MatmaRex: around?
[14:00:41] hi
[14:02:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2021.codfw.wmnet with OS bookworm
[14:02:05] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10484025 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2021.codfw.wmnet with OS bookworm completed: - ganeti202...
[14:03:30] ops-codfw, DC-Ops, decommission-hardware: decommission db2133.codfw.wmnet - https://phabricator.wikimedia.org/T384343#10484031 (FCeratto-WMF)
[14:05:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2021.codfw.wmnet
[14:06:19] (Abandoned) Bking: Fixing an improper merge of values.yaml [deployment-charts] - https://gerrit.wikimedia.org/r/1092340 (owner: Aleksandar Mastilovic)
[14:07:08] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1113442 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[14:10:44] (PS1) JMeybohm: Import upstream release 1.24.2 [debs/istioctl] - https://gerrit.wikimedia.org/r/1113460 (https://phabricator.wikimedia.org/T341984)
[14:13:04] (PS1) DCausse: cirrus: drop cirrus_saneitize_jobs periodic job [puppet] - https://gerrit.wikimedia.org/r/1113461
[14:13:21] (PS1) DCausse: cirrus-streaming-updater: index wikitech [deployment-charts] - https://gerrit.wikimedia.org/r/1113462
[14:13:21] MatmaRex: oh, sorry, i missed the response
[14:13:21] let's start
[14:13:21] (CR) Urbanecm: [C:+2] Disable sidebar cache on the
auth domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1112219 (https://phabricator.wikimedia.org/T383916) (owner: Bartosz Dziewoński)
[14:13:21] (PS1) DCausse: cirrus: stop writing to wikitech index from the MW JobQueue [mediawiki-config] - https://gerrit.wikimedia.org/r/1113463
[14:13:27] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1112219 (https://phabricator.wikimedia.org/T383916) (owner: Bartosz Dziewoński)
[14:13:55] (PS2) JMeybohm: Import upstream release 1.24.2 [debs/istioctl] - https://gerrit.wikimedia.org/r/1113460 (https://phabricator.wikimedia.org/T341984)
[14:14:00] (Merged) jenkins-bot: Disable sidebar cache on the auth domain [mediawiki-config] - https://gerrit.wikimedia.org/r/1112219 (https://phabricator.wikimedia.org/T383916) (owner: Bartosz Dziewoński)
[14:14:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2021.codfw.wmnet
[14:14:30] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1112219|Disable sidebar cache on the auth domain (T383916)]]
[14:14:34] T383916: Sidebar links are broken on shared domain - https://phabricator.wikimedia.org/T383916
[14:16:33] SRE-Sprint-Week-Sustainability-March2023, Data-Persistence-Automations, DBA, Patch-For-Review, Sustainability (Incident Followup): Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366#10484085 (FCeratto-WMF) a:FCeratto...
[14:17:11] SRE, Infrastructure-Foundations, netops: Enable BGP multipath at internet edge - https://phabricator.wikimedia.org/T384473 (cmooney) NEW p:Triage→Low
[14:17:17] SRE, Infrastructure-Foundations, netops: Enable BGP multipath at internet edge - https://phabricator.wikimedia.org/T384473#10484099 (cmooney)
[14:17:34] SRE, Infrastructure-Foundations, netops: Enable BGP multipath at internet edge - https://phabricator.wikimedia.org/T384473#10484103 (cmooney)
[14:18:45] SRE, Infrastructure-Foundations, netops: Enable BGP multipath at internet edge - https://phabricator.wikimedia.org/T384473#10484104 (cmooney)
[14:19:06] (CR) Bking: [C:+1] airflow: DRY extra volume mounts [deployment-charts] - https://gerrit.wikimedia.org/r/1113198 (https://phabricator.wikimedia.org/T380619) (owner: Brouberol)
[14:19:16] SRE, Infrastructure-Foundations, netops: Enable BGP multipath at internet edge - https://phabricator.wikimedia.org/T384473#10484107 (cmooney)
[14:20:54] !log urbanecm@deploy2002 urbanecm, matmarex: Backport for [[gerrit:1112219|Disable sidebar cache on the auth domain (T383916)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:20:59] T383916: Sidebar links are broken on shared domain - https://phabricator.wikimedia.org/T383916
[14:21:02] MatmaRex: can you test, please?
[14:21:49] urbanecm: yep.
looks fixed
[14:21:53] !log urbanecm@deploy2002 urbanecm, matmarex: Continuing with sync
[14:21:59] ty, proceeding
[14:22:45] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10484115 (MoritzMuehlenhoff)
[14:22:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2021.codfw.wmnet to cluster codfw and group B
[14:23:34] (CR) Jelto: [C:+2] gerrit: add gerrit_abusers to block IPs [puppet] - https://gerrit.wikimedia.org/r/1113429 (https://phabricator.wikimedia.org/T348734) (owner: Jelto)
[14:23:40] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2021.codfw.wmnet to cluster codfw and group B
[14:25:44] (CR) Elukey: [C:+1] Manage tile invalidation with a separate Hiera setting [puppet] - https://gerrit.wikimedia.org/r/1113442 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[14:28:36] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1112219|Disable sidebar cache on the auth domain (T383916)]] (duration: 14m 06s)
[14:28:40] T383916: Sidebar links are broken on shared domain - https://phabricator.wikimedia.org/T383916
[14:28:44] MatmaRex: should be live! anything else?
[14:28:56] thanks, that's all for now
[14:33:00] any time
[14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:56] (CR) Andrea Denisse: [C:+2] clientbucket: Remove absented check_procs Icinga alert [puppet] - https://gerrit.wikimedia.org/r/1113246 (https://phabricator.wikimedia.org/T357099) (owner: Andrea Denisse)
[14:40:52] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10484175 (MoritzMuehlenhoff)
[14:41:47] (CR) David Caro: [wmcs::kubeadm::core] remove kubeadm-flags.env (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1113194 (https://phabricator.wikimedia.org/T370245) (owner: Raymond Ndibe)
[14:43:25] SRE, SRE-tools, Infrastructure-Foundations, IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890#10484178 (MoritzMuehlenhoff)
[14:44:46] (PS1) Muehlenhoff: Switch ganeti2032 to nftables [puppet] - https://gerrit.wikimedia.org/r/1113465
[14:45:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
[14:45:51] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10484191 (ops-monitoring-bot) Draining ganeti2032.codfw.wmnet of running VMs
[14:45:57] (PS3) Cathal Mooney: Add BGP data collection from network devices over GNMI [puppet] - https://gerrit.wikimedia.org/r/1113449 (https://phabricator.wikimedia.org/T369384)
[14:46:42] FIRING: [3x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:47:01] (PS1) DCausse: wdqs: bump to 0.3.154 and enable event utilities APIs [deployment-charts] - https://gerrit.wikimedia.org/r/1113466 (https://phabricator.wikimedia.org/T374919)
[14:47:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
[14:47:35] (CR) Cathal Mooney: "Ok patchset 3 fixes my mistake should be good now. Phab task has more detail on what is going on and what output it'll produce." [puppet] - https://gerrit.wikimedia.org/r/1113449 (https://phabricator.wikimedia.org/T369384) (owner: Cathal Mooney)
[14:48:44] (PS2) Andrea Denisse: ipmi: Remove absented check_procs Icinga alert [puppet] - https://gerrit.wikimedia.org/r/1113245 (https://phabricator.wikimedia.org/T357099)
[14:49:43] (CR) Andrea Denisse: ipmi: Remove absented check_procs Icinga alert (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1113245 (https://phabricator.wikimedia.org/T357099) (owner: Andrea Denisse)
[14:51:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:52:32] SRE-swift-storage, CX-deployments, MinT, LPL Essential (LPL Essential 2024 Nov-Jan), Unplanned-Sprint-Work: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10484254 (Nikerabbit)
[14:55:37] (PS1) Andrew Bogott: deployment-prep hiera: remove uses of .eqiad.wmflabs tld [puppet] - https://gerrit.wikimedia.org/r/1113468 (https://phabricator.wikimedia.org/T380679)
[14:56:23] (PS2) Andrew Bogott: deployment-prep hiera: remove uses of .eqiad.wmflabs tld [puppet] - https://gerrit.wikimedia.org/r/1113468
(https://phabricator.wikimedia.org/T380679)
[14:56:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2004.codfw.wmnet to drbd
[14:56:58] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10484295 (ops-monitoring-bot) VM aux-k8s-etcd2004.codfw.wmnet switching disk type to drbd
[14:58:01] (CR) Filippo Giunchedi: [C:+1] ipmi: Remove absented check_procs Icinga alert [puppet] - https://gerrit.wikimedia.org/r/1113245 (https://phabricator.wikimedia.org/T357099) (owner: Andrea Denisse)
[14:58:22] (CR) Andrea Denisse: [C:+2] ipmi: Remove absented check_procs Icinga alert [puppet] - https://gerrit.wikimedia.org/r/1113245 (https://phabricator.wikimedia.org/T357099) (owner: Andrea Denisse)
[15:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T1500)
[15:00:31] (PS1) Ladsgroup: file migration: Set group0 to write both [mediawiki-config] - https://gerrit.wikimedia.org/r/1113469 (https://phabricator.wikimedia.org/T384481)
[15:03:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1243:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1243 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:04:57] (CR) Muehlenhoff: [C:+2] Manage tile invalidation with a separate Hiera setting [puppet] - https://gerrit.wikimedia.org/r/1113442 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[15:06:00] (PS1) Kamila Součková: wikikube: rename mw14[76-81] to wikikube-worker* [puppet] - https://gerrit.wikimedia.org/r/1113470 (https://phabricator.wikimedia.org/T365571)
[15:06:14] (PS1) Slyngshede: Implement dialog for requesting permission [software/bitu] -
https://gerrit.wikimedia.org/r/1113471
[15:06:14] (PS1) Slyngshede: Alternative SSH key management [software/bitu] - https://gerrit.wikimedia.org/r/1113472
[15:06:42] FIRING: [3x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:51] SRE, Data-Engineering, Data-Engineering-Radar, Data-Platform-SRE (2025.01.11 - 2025.01.31), Patch-For-Review: Data Platform access streamlining for WMDE staff - https://phabricator.wikimedia.org/T381824#10484390 (jcrespo) Feel free to review the patch above as well as the wiki changes documen...
[15:07:44] (CR) CI reject: [V:-1] Alternative SSH key management [software/bitu] - https://gerrit.wikimedia.org/r/1113472 (owner: Slyngshede)
[15:11:17] (PS1) JMeybohm: Create a copy of the wikikube istio config [deployment-charts] - https://gerrit.wikimedia.org/r/1113473 (https://phabricator.wikimedia.org/T341984)
[15:11:18] (PS1) JMeybohm: Update wikikube istio 1.24.2 config [deployment-charts] - https://gerrit.wikimedia.org/r/1113474 (https://phabricator.wikimedia.org/T341984)
[15:11:51] !log switched dbctl pc section objects to flavor "parsercache" - T383324
[15:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:55] T383324: Prevent too many parsercache sections from being depooled - https://phabricator.wikimedia.org/T383324
[15:12:51] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10484428 (joanna_borun) p:Triage→High
[15:13:03] (PS1) Bartosz Dziewoński: Use full URLs for wgUploadNavigationUrl [mediawiki-config] - https://gerrit.wikimedia.org/r/1113476 (https://phabricator.wikimedia.org/T383916)
[15:15:21] (PS1)
Elukey: drivers.py: add container_limits to the Docker driver [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/1113477
[15:15:56] (PS2) Bartosz Dziewoński: Use full URLs for wgUploadNavigationUrl [mediawiki-config] - https://gerrit.wikimedia.org/r/1113476 (https://phabricator.wikimedia.org/T383916)
[15:19:45] (PS1) Vgutierrez: site,hiera: Reimage lvs4010 as a liberica LB [puppet] - https://gerrit.wikimedia.org/r/1113478 (https://phabricator.wikimedia.org/T384477)
[15:20:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2004.codfw.wmnet to drbd
[15:21:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
[15:21:32] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10484461 (ops-monitoring-bot) Draining ganeti2032.codfw.wmnet of running VMs
[15:21:38] (PS1) Jelto: gerrit: make sure to not drop empty sets [puppet] - https://gerrit.wikimedia.org/r/1113479 (https://phabricator.wikimedia.org/T348734)
[15:21:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
[15:22:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of aux-k8s-etcd2004.codfw.wmnet to plain
[15:22:35] SRE, Ganeti, Infrastructure-Foundations: Update remaining Ganeti servers in codfw to Bookworm - https://phabricator.wikimedia.org/T382508#10484482 (ops-monitoring-bot) VM aux-k8s-etcd2004.codfw.wmnet switching disk type to plain
[15:22:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of aux-k8s-etcd2004.codfw.wmnet to plain
[15:23:12] SRE, Traffic-Icebox: clean up testlb services - https://phabricator.wikimedia.org/T384486 (Vgutierrez) NEW
[15:23:56] SRE, Traffic-Icebox: Create
a second text-lb IP address for test purposes - https://phabricator.wikimedia.org/T237492#10484504 (Vgutierrez) Open→Resolved a:BBlack
[15:24:17] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1113479 (https://phabricator.wikimedia.org/T348734) (owner: Jelto)
[15:24:45] (PS2) Vgutierrez: site,hiera: Reimage lvs4010 as a liberica LB [puppet] - https://gerrit.wikimedia.org/r/1113478 (https://phabricator.wikimedia.org/T384477)
[15:24:46] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2189 slowly with 10 steps - Repool host after fixing indexes and performing OS updates
[15:25:08] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1113478 (https://phabricator.wikimedia.org/T384477) (owner: Vgutierrez)
[15:25:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10484510 (phaultfinder)
[15:26:46] (CR) Jelto: [V:+1] "Unfortunately another fix is needed, because nftables has problems dropping empty sets if they are not created with the flag "dynamic".
Th" [puppet] - https://gerrit.wikimedia.org/r/1113479 (https://phabricator.wikimedia.org/T348734) (owner: Jelto)
[15:29:00] (PS1) Vgutierrez: hieradata: Remove testlb and testlb6 from text svc [puppet] - https://gerrit.wikimedia.org/r/1113481 (https://phabricator.wikimedia.org/T384486)
[15:29:38] (PS2) Vgutierrez: hieradata: Remove testlb and testlb6 from text svc [puppet] - https://gerrit.wikimedia.org/r/1113481 (https://phabricator.wikimedia.org/T384486)
[15:30:30] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1113481 (https://phabricator.wikimedia.org/T384486) (owner: Vgutierrez)
[15:32:20] ops-codfw, ops-eqiad, SRE, SRE-swift-storage, and 3 others: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903#10484543 (Papaul) @Matthew_Clemente We tested some options yesterday in codfw. When we use 6ft power cables and 3m DAC cable for network we are able to...
[15:34:28] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1113479 (https://phabricator.wikimedia.org/T348734) (owner: Jelto)
[15:35:05] (CR) Ssingh: [C:+1] hieradata: Remove testlb and testlb6 from text svc [puppet] - https://gerrit.wikimedia.org/r/1113481 (https://phabricator.wikimedia.org/T384486) (owner: Vgutierrez)
[15:36:26] (CR) Jelto: [V:+1 C:+2] gerrit: make sure to not drop empty sets [puppet] - https://gerrit.wikimedia.org/r/1113479 (https://phabricator.wikimedia.org/T348734) (owner: Jelto)
[15:38:12] (CR) Vgutierrez: [C:+2] hieradata: Remove testlb and testlb6 from text svc [puppet] - https://gerrit.wikimedia.org/r/1113481 (https://phabricator.wikimedia.org/T384486) (owner: Vgutierrez)
[15:41:10] (CR) DLynch: [C:+1] "Sure thing" [puppet] - https://gerrit.wikimedia.org/r/1110867 (owner: CDanis)
[15:42:18] (PS1) Federico Ceratto: site.pp, db2134.yaml: db2134 [puppet] - https://gerrit.wikimedia.org/r/1113482
(https://phabricator.wikimedia.org/T384476)
[15:42:57] FIRING: ProbeDown: Service text:80 has failed probes (http_text_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:43:08] oh boy
[15:43:09] what? :)
[15:43:27] !incidents
[15:43:28] 5625 (UNACKED) ProbeDown sre (185.15.59.225 ip4 text:80 probes/service http_text_ip4 esams)
[15:43:28] 5624 (RESOLVED) db2175 (paged)/MariaDB Replica SQL: s2 (paged)
[15:43:30] * Emperor here
[15:43:31] !ack 5625
[15:43:31] 5625 (ACKED) ProbeDown sre (185.15.59.225 ip4 text:80 probes/service http_text_ip4 esams)
[15:43:34] (PS2) Andrew Bogott: configure_cephosd_disks(): Assume os drives are less than 1.5TB [puppet] - https://gerrit.wikimedia.org/r/1113451 (https://phabricator.wikimedia.org/T383817)
[15:43:50] vgutierrez: seems like we should add a silence
[15:43:53] how is that being triggered?
[15:43:59] oh damn.. it's removing the IPs from the LVS?
[15:44:08] I do wish more of these runbooks had _any_ content
[15:44:08] yes
[15:44:32] ...but is this in fact a false alarm?
[15:44:43] Emperor: yep sorry it is, we are removing it so expected
[15:44:46] adding a silence
[15:44:51] Thanks.
[15:44:57] thanks sukhe vgutierrez
[15:45:04] (CR) David Caro: [C:+1] configure_cephosd_disks(): Assume os drives are less than 1.5TB [puppet] - https://gerrit.wikimedia.org/r/1113451 (https://phabricator.wikimedia.org/T383817) (owner: Andrew Bogott)
[15:45:20] PROBLEM - PyBal IPVS diff check on lvs1017 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:45:24] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:45:25] sukhe: not sure why it failed though
[15:45:30] ^^ that's me :)
[15:45:36] totally expected
[15:46:28] PROBLEM - PyBal IPVS diff check on lvs2011 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:47:04] jouncebot: nowandnext
[15:47:04] For the next 0 hour(s) and 12 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T1500)
[15:47:04] In 2 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T1800)
[15:47:19] I take a break
[15:47:57] FIRING: [18x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:48:11] !incidents
[15:48:12] 5625 (ACKED) ProbeDown sre (185.15.59.225 ip4 text:80 probes/service http_text_ip4 esams)
[15:48:12] 5624 (RESOLVED) db2175 (paged)/MariaDB Replica SQL: s2 (paged)
[15:48:20] sorry about the noise :)
[15:49:04] (CR) Muehlenhoff: [C:+1] "With the approvals all in Gerrit, this doesn't need the creation of a tracking task in Phab."
[puppet] - https://gerrit.wikimedia.org/r/1110867 (owner: CDanis)
[15:49:44] PROBLEM - PyBal IPVS diff check on lvs5006 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:49:44] PROBLEM - PyBal IPVS diff check on lvs6001 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:50:22] PROBLEM - PyBal IPVS diff check on lvs3010 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:50:22] PROBLEM - PyBal IPVS diff check on lvs5004 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:52:56] PROBLEM - PyBal IPVS diff check on lvs3008 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:52:57] FIRING: [24x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:53:17] apparently silencing it is not enough
[15:54:00] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary (T384486)
[15:54:04] T384486: clean up testlb services - https://phabricator.wikimedia.org/T384486
[15:55:21] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[1476-1481].eqiad.wmnet
[15:55:22] PROBLEM - PyBal IPVS diff check on lvs4010 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:55:22] PROBLEM - PyBal IPVS diff check on lvs7001 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:55:36] (CR) JMeybohm: [C:+1] wikikube: rename mw14[76-81] to wikikube-worker* [puppet] - https://gerrit.wikimedia.org/r/1113470 (https://phabricator.wikimedia.org/T365571) (owner: Kamila
Součková)
[15:55:40] (PS1) Jelto: gerrit: enable gerrit_abusers also on gerrit production [puppet] - https://gerrit.wikimedia.org/r/1113486 (https://phabricator.wikimedia.org/T348734)
[15:55:50] (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1113486 (https://phabricator.wikimedia.org/T348734) (owner: Jelto)
[15:55:57] ops-codfw, ops-eqiad, SRE, SRE-swift-storage, and 3 others: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903#10484788 (MatthewVernon) @Papaul did you mean to tag me? In any case, thanks for the update. I think we'll still need T384003 to be confident that we can...
[15:56:08] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:56:16] PROBLEM - PyBal IPVS diff check on lvs4008 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:56:21] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1017 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal
[15:56:22] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal
[15:56:22] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2011 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal
[15:56:22] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal
[15:56:22] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs3008 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir
Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:56:22] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs3010 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:56:22] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs4008 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:56:23] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs4010 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:56:23] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs5004 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:56:24] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs5006 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:56:24] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs6001 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:56:25] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs7001 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:57:12] PROBLEM - PyBal IPVS diff check on lvs7003 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:57:57] RESOLVED: [20x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:58:03] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1017 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* 
https://wikitech.wikimedia.org/wiki/PyBal [15:58:03] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:58:04] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2011 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:58:04] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:58:04] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs3008 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:58:04] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs3010 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:58:04] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs4008 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:58:05] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs4010 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:58:05] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs5004 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:58:06] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs5006 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:58:06] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs6001 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:58:07] ACKNOWLEDGEMENT - PyBal IPVS diff 
check on lvs6003 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:58:07] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs7001 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:58:08] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs7003 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal Sukhbir Singh removing testlb* https://wikitech.wikimedia.org/wiki/PyBal [15:58:20] !incidents [15:58:20] 5625 (RESOLVED) ProbeDown sre (185.15.59.225 ip4 text:80 probes/service http_text_ip4 esams) [15:58:21] 5624 (RESOLVED) db2175 (paged)/MariaDB Replica SQL: s2 (paged) [15:58:32] sorry for the noise folks, didn't expect it to be this noisy :] [15:58:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1243:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1243 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:58:54] [24x] no problem [15:59:05] :P [15:59:22] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:59:43] (03CR) 10Kamila Součková: [C:03+2] wikikube: rename mw14[76-81] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1113470 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [15:59:55] (03PS1) 10Muehlenhoff: Extend comment [puppet] - 10https://gerrit.wikimedia.org/r/1113487 (https://phabricator.wikimedia.org/T381565) [15:59:59] (03CR) 10Jelto: "I think the alertmanager config is missing a team definition and routing for the team `collaboration-services-releng`. 
At least there are " [puppet] - 10https://gerrit.wikimedia.org/r/1113163 (owner: 10Jelto) [16:01:16] (03PS8) 10Muehlenhoff: Make maps-test2001 a bookworm maps master node [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) [16:02:14] (03CR) 10Andrew Bogott: [C:03+2] configure_cephosd_disks(): Assume os drives are less than 1.5TB [puppet] - 10https://gerrit.wikimedia.org/r/1113451 (https://phabricator.wikimedia.org/T383817) (owner: 10Andrew Bogott) [16:02:29] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.restart-pybal (exit_code=99) rolling-restart of pybal on A:lvs-secondary (T384486) [16:02:33] 06SRE, 10SRE-Access-Requests: Add kemayo to the deployment group - https://phabricator.wikimedia.org/T384493 (10jcrespo) 03NEW [16:02:33] T384486: clean up testlb services - https://phabricator.wikimedia.org/T384486 [16:03:44] (03CR) 10Elukey: [C:03+1] Extend comment [puppet] - 10https://gerrit.wikimedia.org/r/1113487 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:03:56] (03PS2) 10Jcrespo: admin: Add kemayo to the deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1110867 (https://phabricator.wikimedia.org/T384493) (owner: 10CDanis) [16:03:57] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[1020].eqiad.wmnet,lvs5006.eqsin.wmnet,lvs3010.esams.wmnet,lvs7003.magru.wmnet,lvs4010.ulsfo.wmnet} and A:lvs (T384486) [16:03:58] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[1476-1481].eqiad.wmnet [16:04:08] (03CR) 10CI reject: [V:04-1] admin: Add kemayo to the deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1110867 (https://phabricator.wikimedia.org/T384493) (owner: 10CDanis) [16:04:41] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1476 to wikikube-worker1129 [16:04:58] (03CR) 10Jcrespo: "I had created the ticket (but not pushed submit) by the time 
Moritz commented. Can some of you sign as a "sponsor" to deploy. CDanis, mayb" [puppet] - 10https://gerrit.wikimedia.org/r/1110867 (https://phabricator.wikimedia.org/T384493) (owner: 10CDanis) [16:05:01] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:05:31] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:07:33] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host db2166 [16:07:39] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [16:07:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2166 [16:07:58] (03PS5) 10Jcrespo: admin: Deploy WMDE privatedata policy change to puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/1113420 (https://phabricator.wikimedia.org/T381824) [16:07:58] (03PS3) 10Jcrespo: admin: Add kemayo to the deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1110867 (https://phabricator.wikimedia.org/T384493) (owner: 10CDanis) [16:08:02] RECOVERY - MariaDB Replica Lag: s1 on db1240 is OK: OK slave_sql_lag Replication lag: 42.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:08:03] 10ops-codfw, 10ops-eqiad, 06SRE, 10SRE-swift-storage, and 3 others: Supermicro's Config J hot swap behavior - https://phabricator.wikimedia.org/T383903#10484952 (10Papaul) @Matthew_Clemente yes the tag was for you. 
I will leave your guys work on the software side T384003 [16:08:12] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:08:18] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs[1020].eqiad.wmnet,lvs5006.eqsin.wmnet,lvs3010.esams.wmnet,lvs7003.magru.wmnet,lvs4010.ulsfo.wmnet} and A:lvs (T384486) [16:08:19] (03PS1) 10David Caro: ceph-pacific: add ceph-pacific packages to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1113489 (https://phabricator.wikimedia.org/T306820) [16:08:21] T384486: clean up testlb services - https://phabricator.wikimedia.org/T384486 [16:08:46] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1476 to wikikube-worker1129 - kamila@cumin1002" [16:09:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1476 to wikikube-worker1129 - kamila@cumin1002" [16:09:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:09:06] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1477 to wikikube-worker1130 [16:09:06] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1129 [16:09:22] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:09:26] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:09:52] (03PS2) 10David Caro: ceph-pacific: add ceph-pacific packages to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1113489 (https://phabricator.wikimedia.org/T306820) [16:10:12] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Add kemayo to the deployment group - 
https://phabricator.wikimedia.org/T384493#10484964 (10jcrespo) @CDanis Wanna sign off as a sponsor? [16:10:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1129 [16:10:22] RECOVERY - PyBal IPVS diff check on lvs5006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:10:28] !log installing rsync regression updates on bullseye [16:10:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:58] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1476 to wikikube-worker1129 [16:11:22] (03CR) 10Jcrespo: [C:03+1] admin: Add kemayo to the deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1110867 (https://phabricator.wikimedia.org/T384493) (owner: 10CDanis) [16:11:22] RECOVERY - PyBal IPVS diff check on lvs3010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:11:23] (03CR) 10Andrew Bogott: [C:03+1] ceph-pacific: add ceph-pacific packages to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1113489 (https://phabricator.wikimedia.org/T306820) (owner: 10David Caro) [16:12:22] (03CR) 10David Caro: [C:03+2] ceph-pacific: add ceph-pacific packages to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1113489 (https://phabricator.wikimedia.org/T306820) (owner: 10David Caro) [16:13:16] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1477 to wikikube-worker1130 - kamila@cumin1002" [16:13:32] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1477 to wikikube-worker1130 - kamila@cumin1002" [16:13:32] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:13:32] !log kamila@cumin1002 START - Cookbook 
sre.network.configure-switch-interfaces for host wikikube-worker1130 [16:13:54] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1478 to wikikube-worker1131 [16:14:13] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:14:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1130 [16:15:02] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv [16:15:02] e - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:15:04] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv [16:15:04] e - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:15:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1477 to wikikube-worker1130 [16:15:53] (03CR) 10Ottomata: [C:03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1113420 (https://phabricator.wikimedia.org/T381824) (owner: 10Jcrespo) [16:17:30] RECOVERY - PyBal IPVS diff check on lvs7003 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:17:39] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1478 to wikikube-worker1131 - kamila@cumin1002" [16:17:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1478 to wikikube-worker1131 - kamila@cumin1002" [16:17:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:17:54] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1131 [16:18:07] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1479 to wikikube-worker1132 [16:18:08] RECOVERY - PyBal IPVS diff check on lvs4010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:18:28] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:18:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:19:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1131 [16:19:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1478 to wikikube-worker1131 [16:20:00] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10485024 (10Jhancock.wm) @Marostegui db2166 is moved, updated, and pinging. 
[16:20:07] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [16:20:17] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-high-traffic1 (T384486) [16:20:21] T384486: clean up testlb services - https://phabricator.wikimedia.org/T384486 [16:20:25] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10485035 (10Jhancock.wm) [16:20:35] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [16:22:19] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1479 to wikikube-worker1132 - kamila@cumin1002" [16:22:29] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1480 to wikikube-worker1133 [16:22:33] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1479 to wikikube-worker1132 - kamila@cumin1002" [16:22:33] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:22:34] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1132 [16:22:49] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:23:05] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1481 to wikikube-worker1134 [16:23:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1132 [16:23:42] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [16:23:57] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [16:24:20] !log 
kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1479 to wikikube-worker1132 [16:25:20] RECOVERY - PyBal IPVS diff check on lvs1017 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:25:36] RECOVERY - PyBal IPVS diff check on lvs2011 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:25:37] (03PS14) 10FNegri: prometheus-node-kernel-panic: use prom labels [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) [16:25:37] (03PS9) 10FNegri: prometheus-node-kernel-panic: rename to "messages" [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T382961) [16:25:42] (03PS2) 10JMeybohm: Update wikikube istio 1.24.2 config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113474 (https://phabricator.wikimedia.org/T341984) [16:26:20] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-high-traffic1 (T384486) [16:26:24] T384486: clean up testlb services - https://phabricator.wikimedia.org/T384486 [16:26:29] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1480 to wikikube-worker1133 - kamila@cumin1002" [16:26:36] RECOVERY - PyBal IPVS diff check on lvs6001 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:26:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1480 to wikikube-worker1133 - kamila@cumin1002" [16:26:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:26:48] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1133 [16:27:11] !log kamila@cumin1002 
START - Cookbook sre.dns.netbox [16:27:37] (03CR) 10FNegri: prometheus-node-kernel-panic: use prom labels (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1088602 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [16:28:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1133 [16:28:35] RECOVERY - PyBal IPVS diff check on lvs5004 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:28:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1480 to wikikube-worker1133 [16:29:05] RECOVERY - PyBal IPVS diff check on lvs4008 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:29:05] RECOVERY - PyBal IPVS diff check on lvs3008 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:29:05] RECOVERY - PyBal IPVS diff check on lvs7001 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:30:58] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1481 to wikikube-worker1134 - kamila@cumin1002" [16:31:02] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1481 to wikikube-worker1134 - kamila@cumin1002" [16:31:03] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:31:03] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1134 [16:31:35] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10485113 (10elukey) @Jhancock.wm @Papaul do you think that this test would be enough to 
"simulate" a usual hot swap for a broken disk? N... [16:32:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1134 [16:33:00] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1481 to wikikube-worker1134 [16:33:06] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1129.eqiad.wmnet wikikube-worker1130.eqiad.wmnet wikikube-worker1131.eqiad.wmnet wikikube-worker1132.eqiad.wmnet wikikube-worker1133.eqiad.wmnet wikikube-worker1134.eqiad.wmnet on all recursors [16:33:09] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1129.eqiad.wmnet wikikube-worker1130.eqiad.wmnet wikikube-worker1131.eqiad.wmnet wikikube-worker1132.eqiad.wmnet wikikube-worker1133.eqiad.wmnet wikikube-worker1134.eqiad.wmnet on all recursors [16:35:55] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1129.eqiad.wmnet with OS bookworm [16:35:59] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1129 [16:35:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1129 [16:36:04] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1130.eqiad.wmnet with OS bookworm [16:36:08] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1130 [16:36:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1130 [16:36:11] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1131.eqiad.wmnet with OS bookworm [16:36:14] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1131 [16:36:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1131 [16:36:16] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host 
wikikube-worker1132.eqiad.wmnet with OS bookworm [16:36:25] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1132 [16:36:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1132 [16:36:27] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1133.eqiad.wmnet with OS bookworm [16:36:30] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1133 [16:36:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1133 [16:36:31] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1134.eqiad.wmnet with OS bookworm [16:36:39] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1134 [16:36:39] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1134 [16:37:27] !log vgutierrez@cumin1002 START - Cookbook sre.dns.netbox [16:37:31] (03PS1) 10Kamila Součková: wikikube: rename mw14[82-88] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1113497 (https://phabricator.wikimedia.org/T365571) [16:39:42] (03CR) 10CI reject: [V:04-1] wikikube: rename mw14[82-88] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1113497 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [16:40:50] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Add DSantamaria to WMF group for access to https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10485166 (10DSantamaria) Solved! 
Thanks @jcrespo [16:41:06] (03PS10) 10FNegri: prometheus-node-kernel-panic: rename to "messages" [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T382961) [16:41:07] (03PS1) 10FNegri: prometheus-node-kernel-panic: remove "absent" lines [puppet] - 10https://gerrit.wikimedia.org/r/1113498 [16:41:18] !log vgutierrez@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: clean up test-lb IPs - vgutierrez@cumin1002" [16:41:22] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: clean up test-lb IPs - vgutierrez@cumin1002" [16:41:23] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:42:21] (03CR) 10FNegri: prometheus-node-kernel-panic: rename to "messages" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1108091 (https://phabricator.wikimedia.org/T382961) (owner: 10FNegri) [16:42:36] (03CR) 10Jelto: [V:03+1 C:03+2] gerrit: enable gerrit_abusers also on gerrit production [puppet] - 10https://gerrit.wikimedia.org/r/1113486 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [16:42:45] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Add DSantamaria to WMF group for access to https://superset.wikimedia.org - https://phabricator.wikimedia.org/T384169#10485172 (10jcrespo) 05Open→03Resolved [16:46:53] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission mw2259,mw225[3-6] - https://phabricator.wikimedia.org/T384043#10485237 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [16:47:40] (03CR) 10Gergő Tisza: [C:03+1] Use full URLs for wgUploadNavigationUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113476 (https://phabricator.wikimedia.org/T383916) (owner: 10Bartosz Dziewoński) [16:49:17] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host 
cloudcephosd1012.eqiad.wmnet with OS bullseye [16:49:42] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [16:49:58] (03PS1) 10Vgutierrez: cdn: Add roll-restart-tcp-mss-clamper cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1113501 (https://phabricator.wikimedia.org/T384486) [16:51:31] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1129.eqiad.wmnet with reason: host reimage [16:52:00] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1131.eqiad.wmnet with reason: host reimage [16:52:01] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1132.eqiad.wmnet with reason: host reimage [16:52:07] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1133.eqiad.wmnet with reason: host reimage [16:52:45] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1134.eqiad.wmnet with OS bookworm [16:52:51] (03PS1) 10Elukey: services: update kartotherian's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113502 (https://phabricator.wikimedia.org/T384435) [16:53:04] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1134.eqiad.wmnet with OS bookworm [16:53:08] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1134 [16:53:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1134 [16:55:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1129.eqiad.wmnet with reason: host reimage [16:57:34] (03CR) 10Elukey: [C:03+2] services: update kartotherian's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113502 (https://phabricator.wikimedia.org/T384435) (owner: 10Elukey) [16:58:18] (03PS2) 10FNegri: 
prometheus-node-kernel-panic: remove "absent" lines [puppet] - 10https://gerrit.wikimedia.org/r/1113498 (https://phabricator.wikimedia.org/T382961) [16:58:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1131.eqiad.wmnet with reason: host reimage [16:59:40] (03CR) 10Ssingh: "Looks good! Two minor comments around pool/depool thresholds but feel free to resolve and merge." [cookbooks] - 10https://gerrit.wikimedia.org/r/1113501 (https://phabricator.wikimedia.org/T384486) (owner: 10Vgutierrez) [17:01:44] (03PS2) 10Vgutierrez: cdn: Add roll-restart-tcp-mss-clamper cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1113501 (https://phabricator.wikimedia.org/T384486) [17:02:12] (03CR) 10Vgutierrez: "thx for the review sukhe" [cookbooks] - 10https://gerrit.wikimedia.org/r/1113501 (https://phabricator.wikimedia.org/T384486) (owner: 10Vgutierrez) [17:02:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1133.eqiad.wmnet with reason: host reimage [17:03:10] (03CR) 10Ssingh: [C:03+1] "🚢" [cookbooks] - 10https://gerrit.wikimedia.org/r/1113501 (https://phabricator.wikimedia.org/T384486) (owner: 10Vgutierrez) [17:06:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1132.eqiad.wmnet with reason: host reimage [17:08:17] (03CR) 10Vgutierrez: [C:03+2] cdn: Add roll-restart-tcp-mss-clamper cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1113501 (https://phabricator.wikimedia.org/T384486) (owner: 10Vgutierrez) [17:08:51] (03PS1) 10Btullis: Raise the weight of all analytics mariadb replica srv records [dns] - 10https://gerrit.wikimedia.org/r/1113505 (https://phabricator.wikimedia.org/T382947) [17:08:57] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1134.eqiad.wmnet with reason: host reimage [17:12:50] !log kamila@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1134.eqiad.wmnet with reason: host reimage [17:14:14] (03CR) 10Btullis: [C:03+1] airflow-wmde: remove extra network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109926 (https://phabricator.wikimedia.org/T380613) (owner: 10Brouberol) [17:14:14] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1129.eqiad.wmnet with OS bookworm [17:14:47] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1130.eqiad.wmnet with OS bookworm [17:15:52] 06SRE, 06Infrastructure-Foundations, 10observability: LibreNMS changes on every puppet run since upgrade to 24.12 - https://phabricator.wikimedia.org/T384440#10485405 (10andrea.denisse) a:03andrea.denisse [17:16:18] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1130.eqiad.wmnet with OS bookworm [17:16:21] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1130 [17:16:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1130 [17:17:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1131.eqiad.wmnet with OS bookworm [17:21:33] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1133.eqiad.wmnet with OS bookworm [17:23:20] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [17:23:42] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [17:24:30] (03PS1) 10JMeybohm: Update istio to 1.24.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1113507 (https://phabricator.wikimedia.org/T341984) [17:25:25] (03PS2) 10JMeybohm: Update istio to 1.24.2 [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/1113507 (https://phabricator.wikimedia.org/T373526) [17:26:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1132.eqiad.wmnet with OS bookworm [17:29:20] (03PS1) 10FNegri: wmcs: update kernel alerts [alerts] - 10https://gerrit.wikimedia.org/r/1113508 (https://phabricator.wikimedia.org/T382961) [17:30:07] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission dbprov2001, dbprov2002 - https://phabricator.wikimedia.org/T383894#10485496 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:30:30] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10485502 (10andrea.denisse) Looking at the changelog I wonder if this issue could be related to this [[ https://github... [17:30:45] (03CR) 10JMeybohm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1113497 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [17:30:45] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2130.codfw.wmnet - https://phabricator.wikimedia.org/T383766#10485504 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:31:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1134.eqiad.wmnet with OS bookworm [17:31:44] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2131.codfw.wmnet - https://phabricator.wikimedia.org/T384001#10485518 (10Jhancock.wm) 05In progress→03Resolved a:03Jhancock.wm [17:32:15] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2132.codfw.wmnet - https://phabricator.wikimedia.org/T383697#10485527 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:32:24] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1130.eqiad.wmnet with reason: host reimage 
[17:32:30] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission db2133.codfw.wmnet - https://phabricator.wikimedia.org/T384343#10485535 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [17:35:34] (03CR) 10JMeybohm: [C:03+1] "not sure what CI is unhappy about exactly - but the change lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1113497 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [17:35:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1130.eqiad.wmnet with reason: host reimage [17:40:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10485640 (10phaultfinder) [17:42:05] (03PS1) 10David Caro: conftool: use unique names [puppet] - 10https://gerrit.wikimedia.org/r/1113509 [17:44:31] 10ops-codfw, 06SRE, 06collaboration-services, 06Data-Persistence, and 2 others: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10485666 (10Jhancock.wm) [17:47:14] (03CR) 10Giuseppe Lavagetto: [C:03+1] conftool: use unique names [puppet] - 10https://gerrit.wikimedia.org/r/1113509 (owner: 10David Caro) [17:47:19] (03CR) 10David Caro: [C:03+2] conftool: use unique names [puppet] - 10https://gerrit.wikimedia.org/r/1113509 (owner: 10David Caro) [17:51:27] (03PS1) 10Clare Ming: Enable ExLab test 1 experiment to wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113511 (https://phabricator.wikimedia.org/T373715) [17:55:29] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [17:55:52] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [17:56:03] (03PS1) 10Clare Ming: Add a few more contextual attributes to web base [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113512 (https://phabricator.wikimedia.org/T373715) 
[17:56:34] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1130.eqiad.wmnet with OS bookworm [17:58:13] !log tchin@deploy2002 Started deploy [airflow-dags/analytics@07104ff]: Deploying latest dags for analytics airflow instance T357684 [17:58:17] T357684: Dashboard and alerting of data quality metrics for wmf_content.mediawiki_content_history_v1 - https://phabricator.wikimedia.org/T357684 [17:58:50] !log tchin@deploy2002 Finished deploy [airflow-dags/analytics@07104ff]: Deploying latest dags for analytics airflow instance T357684 (duration: 01m 53s) [17:59:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10485738 (10kamila) [17:59:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10485740 (10phaultfinder) [18:00:05] swfrench-wmf: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T1800). [18:00:19] o/ [18:00:28] I'll get started shortly [18:01:21] 10ops-magru, 06DC-Ops: hw troubleshooting: Power supply failure (PSU) for cp7001.magru.wmnet and cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446#10485748 (10RobH) cp7001 cleared up, but cp7006 still in effect with a psu failure and voltage issue. opening case. 
[18:01:48] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [18:02:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080388 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French) [18:02:35] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [18:03:18] (03Merged) 10jenkins-bot: Add variables for incremental enrollment in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080388 (https://phabricator.wikimedia.org/T377042) (owner: 10Scott French) [18:03:47] !log swfrench@deploy2002 Started scap sync-world: Backport for [[gerrit:1080388|Add variables for incremental enrollment in PHP 8.1 (T377042)]] [18:03:52] T377042: Support cookie-driven fractional migration to PHP 8.1 deployments of mw-web and mw-api-ext - https://phabricator.wikimedia.org/T377042 [18:04:11] 06SRE, 06Infrastructure-Foundations, 10netops, 10observability: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10485755 (10cmooney) >>! In T384258#10485502, @andrea.denisse wrote: > Looking at the changelog I wonder if this issue... [18:08:32] !log swfrench@deploy2002 swfrench: Backport for [[gerrit:1080388|Add variables for incremental enrollment in PHP 8.1 (T377042)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:09:42] (03CR) 10Dzahn: [C:03+1] "thanks for this" [puppet] - 10https://gerrit.wikimedia.org/r/1113429 (https://phabricator.wikimedia.org/T348734) (owner: 10Jelto) [18:11:38] (03CR) 10Dzahn: [C:04-1] "That might be true but regardless I don't think the Gerrit slowness triggered an alert since it was slow but not down." 
[puppet] - 10https://gerrit.wikimedia.org/r/1113163 (owner: 10Jelto) [18:12:08] (03CR) 10Santiago Faci: [C:03+1] "Looks good!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113511 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [18:12:33] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1129-1134].eqiad.wmnet [18:12:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1129-1134].eqiad.wmnet [18:12:40] (03CR) 10Santiago Faci: [C:03+1] "Looks good!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113512 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [18:13:59] (03PS1) 10Andrew Bogott: cephosd partman: add some debug lines to the partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1113515 [18:14:41] (03CR) 10Andrew Bogott: [C:03+2] cephosd partman: add some debug lines to the partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1113515 (owner: 10Andrew Bogott) [18:15:41] !log verified PHP_ENGINE / PHP_ENGINE_STICKY enrollment behavior in mwdebug - T377042 [18:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:44] T377042: Support cookie-driven fractional migration to PHP 8.1 deployments of mw-web and mw-api-ext - https://phabricator.wikimedia.org/T377042 [18:16:00] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [18:16:15] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [18:16:26] 10ops-magru, 06DC-Ops: hw troubleshooting: Power supply failure (PSU) for cp7001.magru.wmnet and cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446#10485851 (10RobH) 204411747 case opened for dell tech dispatch with part/psu to sp3. 
[18:16:34] !log swfrench@deploy2002 swfrench: Continuing with sync [18:17:12] 10ops-magru, 06DC-Ops: hw troubleshooting: Power supply failure (PSU) for cp7001.magru.wmnet and cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446#10485852 (10RobH) [18:17:15] 10ops-magru, 06DC-Ops: hw troubleshooting: Power supply failure (PSU) for cp7001.magru.wmnet and cp7006.magru.wmnet - https://phabricator.wikimedia.org/T381446#10485855 (10RobH) a:03RobH [18:22:14] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1113449 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney) [18:22:26] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [18:22:42] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [18:23:14] (03CR) 10Clare Ming: "@phuedx@wikimedia.org @sfaci@wikimedia.org i'll revert this after dogfooding is done -- and later if we decide on some default contextual " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113512 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [18:23:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113512 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [18:24:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 22 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113511 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [18:25:51] !log swfrench@deploy2002 Finished scap sync-world: Backport for [[gerrit:1080388|Add variables for incremental enrollment in PHP 8.1 (T377042)]] (duration: 22m 03s) 
[18:25:56] T377042: Support cookie-driven fractional migration to PHP 8.1 deployments of mw-web and mw-api-ext - https://phabricator.wikimedia.org/T377042 [18:26:07] (03CR) 10BCornwall: [C:03+1] Raise the weight of all analytics mariadb replica srv records [dns] - 10https://gerrit.wikimedia.org/r/1113505 (https://phabricator.wikimedia.org/T382947) (owner: 10Btullis) [18:30:26] I'm finished with the infra window [18:32:05] (03PS1) 10Andrew Bogott: Revert "partman: change recipe for cloudcephosd1012" [puppet] - 10https://gerrit.wikimedia.org/r/1113518 [18:34:17] (03CR) 10CI reject: [V:04-1] Revert "partman: change recipe for cloudcephosd1012" [puppet] - 10https://gerrit.wikimedia.org/r/1113518 (owner: 10Andrew Bogott) [18:35:13] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [18:37:47] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1113518 (owner: 10Andrew Bogott) [18:43:48] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudnet1007-dev.eqiad.wmnet [18:43:50] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudnet1008-dev.eqiad.wmnet [18:46:44] (03CR) 10Jelto: "From my irc backlog the alert fired multiple times in `#wikimedia-operations` but not in `#wikimedia-sre-collab` or `#wikimedia-releng`:" [puppet] - 10https://gerrit.wikimedia.org/r/1113163 (owner: 10Jelto) [18:47:40] (03PS1) 10Andrew Bogott: remove refs to renamed cloudnet100[78]-dev [puppet] - 10https://gerrit.wikimedia.org/r/1113520 (https://phabricator.wikimedia.org/T382412) [18:47:42] (03PS1) 10Andrew Bogott: Initial setup for cloudnet100[34] [puppet] - 10https://gerrit.wikimedia.org/r/1113521 (https://phabricator.wikimedia.org/T382412) [18:48:13] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10485972 (10CDanis) > All of this does 
suggest we should probably look at running distributed collectors as we move to productionize this, potentiall... [18:49:39] (03CR) 10Andrew Bogott: [C:03+2] remove refs to renamed cloudnet100[78]-dev [puppet] - 10https://gerrit.wikimedia.org/r/1113520 (https://phabricator.wikimedia.org/T382412) (owner: 10Andrew Bogott) [18:50:07] (03CR) 10CI reject: [V:04-1] Initial setup for cloudnet100[34] [puppet] - 10https://gerrit.wikimedia.org/r/1113521 (https://phabricator.wikimedia.org/T382412) (owner: 10Andrew Bogott) [18:51:15] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [18:51:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:54:16] (03PS2) 10Andrew Bogott: Initial setup for cloudnet100[34] [puppet] - 10https://gerrit.wikimedia.org/r/1113521 (https://phabricator.wikimedia.org/T382412) [18:54:16] (03PS1) 10Andrew Bogott: profile_lvs_realserver_spec.rb: remove refs to vanished hosts [puppet] - 10https://gerrit.wikimedia.org/r/1113522 [18:56:02] (03CR) 10Ssingh: [C:03+1] "Both the IPs were removed today and the existing text-lb IPs are there so no concerns." 
[puppet] - 10https://gerrit.wikimedia.org/r/1113522 (owner: 10Andrew Bogott) [18:56:47] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [18:57:04] !log andrew@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudnet1007-dev.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [18:57:09] !log andrew@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudnet1007-dev.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1002" [18:57:09] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:57:10] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudnet1007-dev.eqiad.wmnet [18:57:27] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, and 2 others: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10486011 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1002 for hosts: `cloudnet1007-dev.eqiad.wmnet`... 
[18:57:43] (03PS1) 10AOkoth: apt: update gitlab-ce & gitlab-runner [puppet] - 10https://gerrit.wikimedia.org/r/1113523 (https://phabricator.wikimedia.org/T384525) [18:57:59] (03CR) 10Andrew Bogott: [C:03+2] profile_lvs_realserver_spec.rb: remove refs to vanished hosts [puppet] - 10https://gerrit.wikimedia.org/r/1113522 (owner: 10Andrew Bogott) [18:58:07] (03CR) 10Andrew Bogott: [C:03+2] Initial setup for cloudnet100[34] [puppet] - 10https://gerrit.wikimedia.org/r/1113521 (https://phabricator.wikimedia.org/T382412) (owner: 10Andrew Bogott) [18:58:36] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1113518 (owner: 10Andrew Bogott) [18:59:08] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:59:08] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudnet1008-dev.eqiad.wmnet [18:59:26] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, and 2 others: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10486026 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1002 for hosts: `cloudnet1008-dev.eqiad.wmnet`... [18:59:33] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudnet1007-dev.eqiad.wmnet [19:00:05] brennen and jeena: gettimeofday() says it's time for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T1900) [19:00:59] (03CR) 10Andrew Bogott: [C:03+2] Revert "partman: change recipe for cloudcephosd1012" [puppet] - 10https://gerrit.wikimedia.org/r/1113518 (owner: 10Andrew Bogott) [19:03:57] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [19:04:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10486041 (10phaultfinder) [19:05:28] o/ [19:05:54] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [19:06:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:08:27] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:08:28] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudnet1007-dev.eqiad.wmnet [19:08:43] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, and 2 others: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10486077 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1002 for hosts: `cloudnet1007-dev.eqiad.wmnet`... [19:09:00] !log andrew@cumin1002 START - Cookbook sre.hosts.decommission for hosts cloudnet1008-dev.eqiad.wmnet [19:09:46] (03CR) 10RLazarus: "Originally it was just waiting for your review. :) If that question means it looks good to you, I can rebase and merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007026 (https://phabricator.wikimedia.org/T357595) (owner: 10RLazarus) [19:13:17] jouncebot: nowandnext [19:13:17] For the next 1 hour(s) and 46 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T1900) [19:13:17] In 1 hour(s) and 46 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T2100) [19:13:29] okay, I wait then [19:13:54] Amir1: doing some checking on a potential blocker (T384254) and will likely go ahead shortly. [19:13:54] T384254: Beta cluster log spam: MediaWiki\Extension\CommunityConfiguration\Access\MediaWikiConfigReader was unable to find GEInfoboxTemplatesTest in community configuration, returning configuration from the fallback config - https://phabricator.wikimedia.org/T384254 [19:14:18] no worries, mine can wait [19:15:04] !log andrew@cumin1002 START - Cookbook sre.dns.netbox [19:16:25] (03PS1) 10Cathal Mooney: Delegate WMCS Eqiad ranges to OpenStack auth dns [dns] - 10https://gerrit.wikimedia.org/r/1113527 (https://phabricator.wikimedia.org/T380746) [19:17:23] !log andrew@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:17:24] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudnet1008-dev.eqiad.wmnet [19:17:38] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, and 2 others: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10486105 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1002 for hosts: `cloudnet1008-dev.eqiad.wmnet`... 
[19:22:37] (03CR) 10Cathal Mooney: [C:04-1] Delegate WMCS Eqiad ranges to OpenStack auth dns [dns] - 10https://gerrit.wikimedia.org/r/1113527 (https://phabricator.wikimedia.org/T380746) (owner: 10Cathal Mooney) [19:23:12] (03PS1) 10Andrew Bogott: partman_early_command: fix cephosd recipe [puppet] - 10https://gerrit.wikimedia.org/r/1113529 (https://phabricator.wikimedia.org/T383817) [19:25:16] (03CR) 10Andrew Bogott: [C:03+2] partman_early_command: fix cephosd recipe [puppet] - 10https://gerrit.wikimedia.org/r/1113529 (https://phabricator.wikimedia.org/T383817) (owner: 10Andrew Bogott) [19:26:07] (03PS4) 10Jcrespo: admin: Add kemayo to the deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1110867 (https://phabricator.wikimedia.org/T384493) (owner: 10CDanis) [19:26:21] (03CR) 10CDanis: [C:03+2] admin: Add kemayo to the deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1110867 (https://phabricator.wikimedia.org/T384493) (owner: 10CDanis) [19:28:03] (03CR) 10CDanis: [C:03+2] "Merged and will be live in half an hour at most :)" [puppet] - 10https://gerrit.wikimedia.org/r/1110867 (https://phabricator.wikimedia.org/T384493) (owner: 10CDanis) [19:30:35] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10486139 (10phaultfinder) [19:31:33] Amir1: any idea how long yours would take? 
if it's relatively quick, you can probably go ahead [19:31:45] (03CR) 10AOkoth: miscweb: support os-reports deployment (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1098486 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [19:33:33] brennen: It's quick but I might need a rollback, let's see [19:35:25] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113469 (https://phabricator.wikimedia.org/T384481) (owner: 10Ladsgroup) [19:36:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 23.44% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:36:18] (03Merged) 10jenkins-bot: file migration: Set group0 to write both [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113469 (https://phabricator.wikimedia.org/T384481) (owner: 10Ladsgroup) [19:36:49] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1113469|file migration: Set group0 to write both (T384481)]] [19:36:53] T384481: Set new file tables to write both in production - https://phabricator.wikimedia.org/T384481 [19:37:44] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [19:38:06] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [19:39:19] (03CR) 10CDanis: [C:03+1] drivers.py: add container_limits to the Docker driver (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1113477 (owner: 10Elukey) [19:39:48] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1113469|file migration: Set group0 to write both (T384481)]] synced to the 
testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:41:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 23.44% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:41:26] (03CR) 10CDanis: [C:03+1] Add BGP data collection from network devices over GNMI [puppet] - 10https://gerrit.wikimedia.org/r/1113449 (https://phabricator.wikimedia.org/T369384) (owner: 10Cathal Mooney) [19:42:20] "SQL query did not specify the caller (guessed caller: LocalFile" [19:42:31] I will fix this [19:42:33] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [19:44:20] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10486200 (10VRiley-WMF) a:03VRiley-WMF [19:49:30] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113469|file migration: Set group0 to write both (T384481)]] (duration: 12m 41s) [19:49:34] T384481: Set new file tables to write both in production - https://phabricator.wikimedia.org/T384481 [19:50:44] brennen: I'm done, it might add tiny bit of warnings but that's fine (the warnings are something like "SQL query did not specify the caller"). I'm fixing it but it's in no way problematic (user-facing) [19:52:44] Amir1: thanks. 
[19:54:56] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [19:55:14] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [20:05:37] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:08:38] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10486305 (10phaultfinder) [20:10:57] (03CR) 10Dzahn: [C:03+2] apt: update gitlab-ce & gitlab-runner [puppet] - 10https://gerrit.wikimedia.org/r/1113523 (https://phabricator.wikimedia.org/T384525) (owner: 10AOkoth) [20:12:45] !log ebysans@deploy2002 Started deploy [analytics/refinery@28dce47]: Temp accounts deployment [analytics/refinery@28dce471] [20:13:15] (03PS1) 10Santiago Faci: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113538 [20:14:28] (03PS1) 10Santiago Faci: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113540 [20:15:00] !log ebysans@deploy2002 Finished deploy [analytics/refinery@28dce47]: Temp accounts deployment [analytics/refinery@28dce471] (duration: 02m 16s) [20:15:26] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [20:15:44] !log ebysans@deploy2002 Started deploy [analytics/refinery@28dce47] (thin): Temp accounts deployment THIN [analytics/refinery@28dce471] [20:15:44] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [20:16:22] !log ebysans@deploy2002 Finished deploy 
[analytics/refinery@28dce47] (thin): Temp accounts deployment THIN [analytics/refinery@28dce471] (duration: 00m 37s) [20:16:25] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:17:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.18% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:17:28] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[1482-1488].eqiad.wmnet [20:17:28] (03PS3) 10CDanis: allow k8s service-runner apps to opt-in to auto-ECS [puppet] - 10https://gerrit.wikimedia.org/r/1112295 [20:17:33] (03CR) 10Clare Ming: [C:03+2] "lgtm!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1113538 (owner: 10Santiago Faci) [20:17:41] !log ebysans@deploy2002 Started deploy [analytics/refinery@28dce47] (hadoop-test): Temp accounts deployment TEST [analytics/refinery@28dce471] [20:18:13] !log ebysans@deploy2002 Finished deploy [analytics/refinery@28dce47] (hadoop-test): Temp accounts deployment TEST [analytics/refinery@28dce471] (duration: 00m 32s) [20:18:33] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113538 (owner: 10Santiago Faci) [20:19:26] (03CR) 10Clare Ming: [C:03+2] "\o/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113540 (owner: 10Santiago Faci) [20:19:43] (03CR) 10CI reject: [V:04-1] allow k8s service-runner apps to opt-in to auto-ECS [puppet] - 10https://gerrit.wikimedia.org/r/1112295 (owner: 10CDanis) [20:20:55] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113540 (owner: 10Santiago Faci) [20:20:55] !log amastilovic@deploy2002 Started deploy [airflow-dags/analytics@7a540d7]: (no justification provided) [20:21:25] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:28] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[1482-1488].eqiad.wmnet [20:22:03] (03PS4) 10CDanis: allow k8s service-runner apps to opt-in to auto-ECS [puppet] - 10https://gerrit.wikimedia.org/r/1112295 [20:22:10] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [20:22:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 24.18% idle - 
https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:22:44] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [20:24:12] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [20:24:15] (03CR) 10Kamila Součková: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1113497 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [20:24:15] (03CR) 10CI reject: [V:04-1] allow k8s service-runner apps to opt-in to auto-ECS [puppet] - 10https://gerrit.wikimedia.org/r/1112295 (owner: 10CDanis) [20:24:26] !log sfaci@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [20:25:30] (03PS5) 10CDanis: allow k8s service-runner apps to opt-in to auto-ECS [puppet] - 10https://gerrit.wikimedia.org/r/1112295 [20:26:04] !log amastilovic@deploy2002 Finished deploy [airflow-dags/analytics@7a540d7]: (no justification provided) (duration: 05m 15s) [20:26:32] !log amastilovic@deploy2002 Started deploy [airflow-dags/analytics@d7abfe2]: (no justification provided) [20:27:33] !log amastilovic@deploy2002 Finished deploy [airflow-dags/analytics@d7abfe2]: (no justification provided) (duration: 01m 02s) [20:28:06] (03CR) 10Kamila Součková: [C:03+2] wikikube: rename mw14[82-88] to wikikube-worker* [puppet] - 10https://gerrit.wikimedia.org/r/1113497 (https://phabricator.wikimedia.org/T365571) (owner: 10Kamila Součková) [20:28:08] !log dzahn@cumin2002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release 20250122 [20:30:24] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1482 to wikikube-worker1135 [20:30:28] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - 
kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv [20:30:28] e - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:30:28] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv [20:30:28] e - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:30:44] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [20:31:21] !log aokoth@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Update [20:31:21] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Security Update [20:31:36] !log aokoth@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Update [20:31:36] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Security Update 
[20:31:55] !log aokoth@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Update [20:31:55] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) on GitLab host gitlab1003.wikimedia.org with reason: Security Update [20:32:56] !log aokoth@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: Security Update [20:33:12] dzahn@cumin2002 dzahn: The backup on gitlab1004 is complete, ready to proceed with upgrade. [20:34:26] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1482 to wikikube-worker1135 - kamila@cumin1002" [20:34:37] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1483 to wikikube-worker1136 [20:34:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1482 to wikikube-worker1135 - kamila@cumin1002" [20:34:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:34:42] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1135 [20:34:57] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [20:35:47] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1135 [20:36:26] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1482 to wikikube-worker1135 [20:37:37] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1484 to wikikube-worker1137 [20:38:32] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1483 to wikikube-worker1136 - kamila@cumin1002" [20:38:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1483 to wikikube-worker1136 - kamila@cumin1002" [20:38:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:38:49] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1136 [20:39:14] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [20:40:04] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1136 [20:40:17] !log dzahn@cumin2002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release 20250122 [20:40:22] !log aokoth@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: Security Update [20:40:43] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1483 to wikikube-worker1136 [20:42:47] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1484 to wikikube-worker1137 - kamila@cumin1002" [20:42:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1484 to wikikube-worker1137 - kamila@cumin1002" [20:42:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:42:52] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1137 [20:42:58] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1485 to wikikube-worker1138 [20:43:19] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [20:44:12] !log kamila@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-worker1137 [20:44:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on 
mw1486:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:45:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1484 to wikikube-worker1137 [20:45:33] (03PS6) 10CDanis: allow k8s service-runner apps to opt-in to auto-ECS [puppet] - 10https://gerrit.wikimedia.org/r/1112295 [20:47:21] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1485 to wikikube-worker1138 - kamila@cumin1002" [20:47:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1485 to wikikube-worker1138 - kamila@cumin1002" [20:47:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:47:37] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1138 [20:47:39] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1486 to wikikube-worker1139 [20:47:40] (03CR) 10CI reject: [V:04-1] allow k8s service-runner apps to opt-in to auto-ECS [puppet] - 10https://gerrit.wikimedia.org/r/1112295 (owner: 10CDanis) [20:48:00] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [20:48:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1138 [20:49:16] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1485 to wikikube-worker1138 [20:49:56] (03PS7) 10CDanis: allow k8s service-runner apps to opt-in to auto-ECS [puppet] - 10https://gerrit.wikimedia.org/r/1112295 [20:51:29] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1486 to wikikube-worker1139 - kamila@cumin1002" 
[20:51:43] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1487 to wikikube-worker1140 [20:51:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1486 to wikikube-worker1139 - kamila@cumin1002" [20:51:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:51:49] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1139 [20:52:03] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [20:52:05] (03CR) 10CI reject: [V:04-1] allow k8s service-runner apps to opt-in to auto-ECS [puppet] - 10https://gerrit.wikimedia.org/r/1112295 (owner: 10CDanis) [20:52:38] (03PS1) 10Scott French: shellbox-video: 3 codfw replicas on 8.1 (change 1/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113213 (https://phabricator.wikimedia.org/T377038) [20:52:39] (03PS1) 10Scott French: shellbox-video: 50% of codfw replicas to 8.1 (change 2/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113214 (https://phabricator.wikimedia.org/T377038) [20:52:40] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10486429 (10phaultfinder) [20:52:40] (03PS1) 10Scott French: shellbox-video: all codfw replicas to 8.1 (change 3/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113215 (https://phabricator.wikimedia.org/T377038) [20:52:41] (03PS1) 10Scott French: shellbox-video: all replicas on PHP 8.1 (change 4/4) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113216 (https://phabricator.wikimedia.org/T377038) [20:53:11] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1139 [20:53:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1486 to wikikube-worker1139 [20:56:25] !log kamila@cumin1002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1487 to wikikube-worker1140 - kamila@cumin1002" [20:56:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1487 to wikikube-worker1140 - kamila@cumin1002" [20:56:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:56:40] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1488 to wikikube-worker1141 [20:56:41] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1140 [20:57:01] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [20:58:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1140 [20:58:15] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1137 [20:58:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1487 to wikikube-worker1140 [20:59:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1137 [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T2100). nyaa~ [21:00:05] cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:29] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1488 to wikikube-worker1141 - kamila@cumin1002" [21:00:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1488 to wikikube-worker1141 - kamila@cumin1002" [21:00:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:00:51] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1141 [21:01:55] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1141 [21:02:33] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1488 to wikikube-worker1141 [21:02:47] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1135.eqiad.wmnet wikikube-worker1136.eqiad.wmnet wikikube-worker1137.eqiad.wmnet wikikube-worker1138.eqiad.wmnet wikikube-worker1139.eqiad.wmnet wikikube-worker1140.eqiad.wmnet wikikube-worker1141.eqiad.wmnet on all recursors [21:02:51] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1135.eqiad.wmnet wikikube-worker1136.eqiad.wmnet wikikube-worker1137.eqiad.wmnet wikikube-worker1138.eqiad.wmnet wikikube-worker1139.eqiad.wmnet wikikube-worker1140.eqiad.wmnet wikikube-worker1141.eqiad.wmnet on all recursors [21:03:24] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1136.eqiad.wmnet with OS bookworm [21:03:28] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1136 [21:03:28] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1136 [21:03:32] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1137.eqiad.wmnet with OS 
bookworm [21:03:35] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1137 [21:03:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1137 [21:03:42] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1138.eqiad.wmnet with OS bookworm [21:03:45] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1138 [21:03:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1138 [21:03:56] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1140.eqiad.wmnet with OS bookworm [21:03:59] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1140 [21:03:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1140 [21:04:16] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1139.eqiad.wmnet with OS bookworm [21:04:19] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1139 [21:04:19] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1139 [21:04:21] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1141.eqiad.wmnet with OS bookworm [21:04:24] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1141 [21:04:24] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1141 [21:04:32] o/ [21:04:38] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1135.eqiad.wmnet with OS bookworm [21:04:39] i'll self-deploy [21:04:41] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1135 [21:04:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1135 [21:06:02] 
(03PS2) 10Clare Ming: Enable ExLab test 1 experiment to wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113511 (https://phabricator.wikimedia.org/T373715) [21:06:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113511 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [21:07:03] (03Merged) 10jenkins-bot: Enable ExLab test 1 experiment to wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113511 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [21:07:34] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1113511|Enable ExLab test 1 experiment to wikitech (T373715)]] [21:07:38] T373715: MPIC (aka EPIC): Create a plan for dogfooding the alpha release - https://phabricator.wikimedia.org/T373715 [21:10:40] (03PS1) 10Dzahn: create spiderpig.wikimedia.org for releng's scap API server [dns] - 10https://gerrit.wikimedia.org/r/1113559 (https://phabricator.wikimedia.org/T383946) [21:11:12] (03PS2) 10Dzahn: create spiderpig.wikimedia.org for releng's scap API server [dns] - 10https://gerrit.wikimedia.org/r/1113559 (https://phabricator.wikimedia.org/T383946) [21:11:54] (03PS3) 10Dzahn: create spiderpig.wikimedia.org for releng's scap API server [dns] - 10https://gerrit.wikimedia.org/r/1113559 (https://phabricator.wikimedia.org/T383946) [21:12:12] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [21:12:44] (03PS1) 10Andrew Bogott: Revert "Revert "partman: change recipe for cloudcephosd1012"" [puppet] - 10https://gerrit.wikimedia.org/r/1113560 [21:13:00] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10486522 (10VRiley-WMF) Ran through decommission on both servers and moved them to 
the corresponding locations cl... [21:13:52] !log cjming@deploy2002 cjming: Backport for [[gerrit:1113511|Enable ExLab test 1 experiment to wikitech (T373715)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:13:57] T373715: MPIC (aka EPIC): Create a plan for dogfooding the alpha release - https://phabricator.wikimedia.org/T373715 [21:14:15] !log cjming@deploy2002 cjming: Continuing with sync [21:14:49] (03PS2) 10Clare Ming: Add a few more contextual attributes to web base [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113512 (https://phabricator.wikimedia.org/T373715) [21:18:55] (03PS1) 10Dzahn: trafficserver: point spiderpig.wikimedia.org to deployment.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1113562 (https://phabricator.wikimedia.org/T383946) [21:19:17] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1136.eqiad.wmnet with reason: host reimage [21:19:22] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1137.eqiad.wmnet with reason: host reimage [21:19:30] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1138.eqiad.wmnet with reason: host reimage [21:19:51] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1140.eqiad.wmnet with reason: host reimage [21:20:05] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1139.eqiad.wmnet with reason: host reimage [21:20:13] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1141.eqiad.wmnet with reason: host reimage [21:20:34] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1135.eqiad.wmnet with reason: host reimage [21:20:52] (03CR) 10Andrew Bogott: [C:03+2] Revert "Revert "partman: change recipe for cloudcephosd1012"" [puppet] - 10https://gerrit.wikimedia.org/r/1113560 (owner: 10Andrew Bogott) [21:20:56] !log 
cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113511|Enable ExLab test 1 experiment to wikitech (T373715)]] (duration: 13m 22s) [21:21:00] T373715: MPIC (aka EPIC): Create a plan for dogfooding the alpha release - https://phabricator.wikimedia.org/T373715 [21:21:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113512 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [21:22:01] (03Merged) 10jenkins-bot: Add a few more contextual attributes to web base [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113512 (https://phabricator.wikimedia.org/T373715) (owner: 10Clare Ming) [21:22:30] !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1113512|Add a few more contextual attributes to web base (T373715)]] [21:22:48] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1136.eqiad.wmnet with reason: host reimage [21:26:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1141.eqiad.wmnet with reason: host reimage [21:26:21] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye [21:27:12] !log cjming@deploy2002 cjming: Backport for [[gerrit:1113512|Add a few more contextual attributes to web base (T373715)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:27:16] T373715: MPIC (aka EPIC): Create a plan for dogfooding the alpha release - https://phabricator.wikimedia.org/T373715 [21:27:33] !log cjming@deploy2002 cjming: Continuing with sync [21:29:17] (03CR) 10Ssingh: [C:03+1] create spiderpig.wikimedia.org for releng's scap API server [dns] - 10https://gerrit.wikimedia.org/r/1113559 (https://phabricator.wikimedia.org/T383946) (owner: 10Dzahn) [21:29:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on wikikube-worker1135.eqiad.wmnet with reason: host reimage [21:31:57] (03CR) 10Ssingh: [C:03+1] "[only commenting on the trafficserver part, will leave the other to serviceops as you indicated]" [puppet] - 10https://gerrit.wikimedia.org/r/1113562 (https://phabricator.wikimedia.org/T383946) (owner: 10Dzahn) [21:33:11] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1140.eqiad.wmnet with reason: host reimage [21:33:23] (03CR) 10Dzahn: [C:03+2] create spiderpig.wikimedia.org for releng's scap API server [dns] - 10https://gerrit.wikimedia.org/r/1113559 (https://phabricator.wikimedia.org/T383946) (owner: 10Dzahn) [21:33:51] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on netflow7001.magru.wmnet with reason: disabling alerts as I'm running gnmic manually rather than with systemd [21:34:02] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10486590 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fe40d399-fce9-41c4-b12a-4bcb36770f4b) set by cmooney@cumin1002 for 1:00:... 
[21:34:11] !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1113512|Add a few more contextual attributes to web base (T373715)]] (duration: 11m 41s) [21:34:16] T373715: MPIC (aka EPIC): Create a plan for dogfooding the alpha release - https://phabricator.wikimedia.org/T373715 [21:34:32] !log dzahn@dns1004 START - running authdns-update [21:34:53] !log end of UTC late backport window [21:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:21] !log dzahn@dns1004 END - running authdns-update [21:36:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:37:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1137.eqiad.wmnet with reason: host reimage [21:40:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1138.eqiad.wmnet with reason: host reimage [21:42:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1136.eqiad.wmnet with OS bookworm [21:44:37] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1139.eqiad.wmnet with reason: host reimage [21:45:09] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1141.eqiad.wmnet with OS bookworm [21:47:18] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10486643 (10cmooney) >>! In T369384#10485972, @CDanis wrote: > The aux clusters are waiting for us :D and we do have one in codfw as well now. Yep i... 
[21:47:42] FIRING: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:49:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1135.eqiad.wmnet with OS bookworm [21:52:28] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1140.eqiad.wmnet with OS bookworm [21:54:43] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-01-15-052609 to 2025-01-22-203140 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113571 (https://phabricator.wikimedia.org/T383785) [21:54:46] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2025-01-08-143723 to 2025-01-22-212306 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113572 (https://phabricator.wikimedia.org/T379331) [21:55:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1137.eqiad.wmnet with OS bookworm [21:57:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:59:44] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1138.eqiad.wmnet with OS bookworm [22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T2200) [22:02:56] (03CR) 10David Martin: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-01-15-052609 to 2025-01-22-203140 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113571 
(https://phabricator.wikimedia.org/T383785) (owner: 10Jforrester) [22:03:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1139.eqiad.wmnet with OS bookworm [22:04:32] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-01-15-052609 to 2025-01-22-203140 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113571 (https://phabricator.wikimedia.org/T383785) (owner: 10Jforrester) [22:06:04] !log dmartin@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [22:06:45] !log dmartin@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [22:07:42] RESOLVED: [2x] JobUnavailable: Reduced availability for job gnmi in ops@magru - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:07:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:11:11] !log dmartin@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [22:12:13] !log dmartin@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [22:12:48] !log dmartin@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [22:13:45] !log dmartin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [22:17:52] (03CR) 10David Martin: [C:03+2] wikifunctions: Upgrade evaluators from 2025-01-08-143723 to 2025-01-22-212306 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113572 (https://phabricator.wikimedia.org/T379331) (owner: 10Jforrester) [22:19:03] (03Merged) 10jenkins-bot: 
wikifunctions: Upgrade evaluators from 2025-01-08-143723 to 2025-01-22-212306 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113572 (https://phabricator.wikimedia.org/T379331) (owner: 10Jforrester) [22:20:06] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1135-1141].eqiad.wmnet [22:20:09] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1135-1141].eqiad.wmnet [22:20:22] !log dmartin@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [22:20:56] !log dmartin@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [22:22:21] !log dmartin@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [22:23:12] !log dmartin@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [22:23:31] !log dmartin@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [22:24:33] !log dmartin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [22:35:38] (03PS1) 10Eevans: Add data-gateway listener to mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1113581 (https://phabricator.wikimedia.org/T368096) [22:50:18] (03PS1) 10Scott French: Enroll 0.1% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113566 (https://phabricator.wikimedia.org/T383845) [22:50:19] (03PS1) 10Scott French: Enroll 1% of client sessions in PHP 8.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1113567 (https://phabricator.wikimedia.org/T383845) [23:00:06] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250122T2300) [23:07:44] (03PS1) 10Andrea Denisse: librenms: Ensure the cache/data directory belongs to librenms [puppet] - 10https://gerrit.wikimedia.org/r/1113587 (https://phabricator.wikimedia.org/T384440) [23:07:44] (03CR) 10Andrea Denisse: "This is similar to the issue 
with the `sessions` directory." [puppet] - 10https://gerrit.wikimedia.org/r/1113587 (https://phabricator.wikimedia.org/T384440) (owner: 10Andrea Denisse) [23:20:02] (03CR) 10Scott French: [C:03+1] "Thanks, Eric!" [puppet] - 10https://gerrit.wikimedia.org/r/1113581 (https://phabricator.wikimedia.org/T368096) (owner: 10Eevans) [23:25:44] (03CR) 10Scott French: "Thanks for the reviews!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113217 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [23:31:02] (03CR) 10Scott French: "Thanks in advance for the reviews!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1113213 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [23:38:09] PROBLEM - Hadoop NodeManager on an-worker1163 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:40:09] RECOVERY - Hadoop NodeManager on an-worker1163 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [23:46:53] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10487032 (10cmooney) [23:57:38] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1012.eqiad.wmnet with OS bullseye [23:58:14] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bullseye