[00:04:53] <wikibugs>	 (CR) Cwhite: [C:+1] Netbox alerting: Add remaining netbox alert. [alerts] - https://gerrit.wikimedia.org/r/1092779 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[00:17:09] <wikibugs>	 (PS1) EggRoll97: enwiki: Add abusefilter-access-protected-vars to EFH/EFM [mediawiki-config] - https://gerrit.wikimedia.org/r/1092956 (https://phabricator.wikimedia.org/T380332)
[00:38:30] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1092960
[00:38:30] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1092960 (owner: TrainBranchBot)
[00:47:05] <jinxer-wm>	 FIRING: [17x] ProbeDown: Service restbase2036-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:47:43] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:54:10] <jinxer-wm>	 FIRING: [17x] ProbeDown: Service restbase2036-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:08:26] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1092964
[01:08:26] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1092964 (owner: TrainBranchBot)
[01:09:16] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1092960 (owner: TrainBranchBot)
[01:23:21] <wikibugs>	 (CR) Pppery: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1092956 (https://phabricator.wikimedia.org/T380332) (owner: EggRoll97)
[01:43:15] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1092964 (owner: TrainBranchBot)
[01:45:03] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1172 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 13 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[01:47:05] <jinxer-wm>	 FIRING: [15x] ProbeDown: Service restbase2036-b:9042 has failed probes (tcp_cassandra_b_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:05:03] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on an-worker1172 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[02:19:03] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1172 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 14 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[02:24:10] <jinxer-wm>	 FIRING: [13x] ProbeDown: Service restbase2036-c:9042 has failed probes (tcp_cassandra_c_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:35:03] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on an-worker1172 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:29] <wikibugs>	 (CR) Andrew Bogott: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1092416 (https://phabricator.wikimedia.org/T326373) (owner: Andrew Bogott)
[02:55:56] <wikibugs>	 (PS6) Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers [puppet] - https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927)
[02:55:58] <wikibugs>	 (CR) Andrew Bogott: resolvconf: don't update resolv.conf with 0 nameservers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: Andrew Bogott)
[02:56:11] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: Andrew Bogott)
[03:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:13] <wikibugs>	 (PS1) Andrew Bogott: Initial insetup role for cloudcephosd2004-dev [puppet] - https://gerrit.wikimedia.org/r/1092983 (https://phabricator.wikimedia.org/T378825)
[03:16:26] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Initial insetup role for cloudcephosd2004-dev [puppet] - https://gerrit.wikimedia.org/r/1092983 (https://phabricator.wikimedia.org/T378825) (owner: Andrew Bogott)
[03:18:02] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10338946 (Andrew) a:Andrew→None puppet is updated (although untested, for obvious reasons)
[03:44:54] <wikibugs>	 (PS1) AikoChou: ml-services: update articlequality in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1092994 (https://phabricator.wikimedia.org/T374034)
[03:53:27] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, November 21 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - https://gerrit.wikimedia.org/r/1092956 (https://phabricator.wikimedia.org/T380332) (owner: EggRoll97)
[03:55:31] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 4730 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[04:18:58] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update article-country response schema [deployment-charts] - https://gerrit.wikimedia.org/r/1093006 (https://phabricator.wikimedia.org/T371897)
[05:31:14] <wikibugs>	 (CR) Raymond Ndibe: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1090520 (https://phabricator.wikimedia.org/T358225) (owner: Raymond Ndibe)
[05:36:21] <wikibugs>	 (PS2) Raymond Ndibe: profile::manifests::toolforge::bastion: harbor to /etc/toolforge/common.yaml [puppet] - https://gerrit.wikimedia.org/r/1090520 (https://phabricator.wikimedia.org/T358225)
[05:54:31] <icinga-wm>	 RECOVERY - Disk space on archiva1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[06:27:05] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:38:28] <wikibugs>	 (PS1) Thiemo Kreuz (WMDE): EditionLookup: Update EntityLookup calls [extensions/Wikisource] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1093172 (https://phabricator.wikimedia.org/T380304)
[06:39:07] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C:+1] EditionLookup: Update EntityLookup calls [extensions/Wikisource] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1093172 (https://phabricator.wikimedia.org/T380304) (owner: Thiemo Kreuz (WMDE))
[06:46:53] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on analytics1076 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T0700)
[07:00:51] <wikibugs>	 (PS1) Brouberol: airflow-analytics-test: deploy the scheduler and kerberos components [deployment-charts] - https://gerrit.wikimedia.org/r/1093177 (https://phabricator.wikimedia.org/T380284)
[07:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:05:57] <wikibugs>	 (CR) Pppery: [C:-1] "Stale, need to redo" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1089939 (owner: Pppery)
[07:07:37] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:10:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:13:23] <wikibugs>	 ops-codfw, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T380341 (phaultfinder) NEW
[07:18:29] <wikibugs>	 ops-codfw, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T380341#10339044 (phaultfinder)
[07:27:50] <wikibugs>	 (PS2) Muehlenhoff: Remove legacy logstash IDP service [puppet] - https://gerrit.wikimedia.org/r/1088323
[07:28:38] <wikibugs>	 (PS3) Muehlenhoff: Remove legacy logstash IDP service [puppet] - https://gerrit.wikimedia.org/r/1088323
[07:28:42] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1088323 (owner: Muehlenhoff)
[07:49:45] <wikibugs>	 (CR) Slyngshede: [C:+1] profile::ldap::bitu: Remove obsolete check [puppet] - https://gerrit.wikimedia.org/r/1092845 (owner: Muehlenhoff)
[07:56:43] <wikibugs>	 (PS1) Slyngshede: Release v0.1.2 [software/bitu] - https://gerrit.wikimedia.org/r/1093250
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:01:33] <wikibugs>	 ops-magru, SRE, Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339063 (MoritzMuehlenhoff) For Ganeti I propose the following plan. It allows us to keep all misc services running on magru, so no need to fiddle with...
[08:03:00] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove legacy logstash IDP service [puppet] - https://gerrit.wikimedia.org/r/1088323 (owner: Muehlenhoff)
[08:05:53] <wikibugs>	 (PS2) Brouberol: airflow-analytics-test: deploy the scheduler and kerberos components [deployment-charts] - https://gerrit.wikimedia.org/r/1093177 (https://phabricator.wikimedia.org/T380284)
[08:07:37] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:10:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:18:33] <hashar>	 !log Restarted CI Jenkins to upgrade Leastload plugin and remove the SSH server plugin
[08:18:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:17] <wikibugs>	 (CR) Muehlenhoff: [C:+2] profile::ldap::bitu: Remove obsolete check [puppet] - https://gerrit.wikimedia.org/r/1092845 (owner: Muehlenhoff)
[08:21:09] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1093250 (owner: Slyngshede)
[08:21:42] <wikibugs>	 (CR) Slyngshede: [C:+2] Release v0.1.2 [software/bitu] - https://gerrit.wikimedia.org/r/1093250 (owner: Slyngshede)
[08:24:24] <wikibugs>	 (CR) David Caro: profile::manifests::toolforge::bastion: harbor to /etc/toolforge/common.yaml (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1090520 (https://phabricator.wikimedia.org/T358225) (owner: Raymond Ndibe)
[08:24:56] <wikibugs>	 (Merged) jenkins-bot: Release v0.1.2 [software/bitu] - https://gerrit.wikimedia.org/r/1093250 (owner: Slyngshede)
[08:26:03] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "Oops, yes of course :-) This should be good to merge now." [puppet] - https://gerrit.wikimedia.org/r/1089716 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[08:26:20] <wikibugs>	 (PS1) Muehlenhoff: Revert "Remove obsolete package::builder role" [puppet] - https://gerrit.wikimedia.org/r/1093265
[08:26:29] <wikibugs>	 (PS2) Muehlenhoff: Revert "Remove obsolete package::builder role" [puppet] - https://gerrit.wikimedia.org/r/1093265
[08:27:08] <wikibugs>	 (CR) CI reject: [V:-1] Revert "Remove obsolete package::builder role" [puppet] - https://gerrit.wikimedia.org/r/1093265 (owner: Muehlenhoff)
[08:30:07] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[08:30:40] <wikibugs>	 (PS3) Muehlenhoff: Revert "Remove obsolete package::builder role" [puppet] - https://gerrit.wikimedia.org/r/1093265
[08:33:21] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Revert "Remove obsolete package::builder role" [puppet] - https://gerrit.wikimedia.org/r/1093265 (owner: Muehlenhoff)
[08:34:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7004.magru.wmnet
[08:35:14] <wikibugs>	 ops-magru, SRE, Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339088 (ops-monitoring-bot) Draining ganeti7004.magru.wmnet of running VMs
[08:35:31] <wikibugs>	 (PS1) Slyngshede: Bitu version 0.1.2 [dns] - https://gerrit.wikimedia.org/r/1093266
[08:35:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7004.magru.wmnet
[08:35:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7004.magru.wmnet
[08:35:57] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7004.magru.wmnet
[08:36:00] <wikibugs>	 ops-magru, SRE, Traffic: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339089 (ops-monitoring-bot) Draining ganeti7004.magru.wmnet of running VMs
[08:37:01] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1093266 (owner: Slyngshede)
[08:37:57] <wikibugs>	 (CR) Slyngshede: [C:+2] Bitu version 0.1.2 [dns] - https://gerrit.wikimedia.org/r/1093266 (owner: Slyngshede)
[08:42:11] <wikibugs>	 (PS3) Brouberol: airflow-analytics-test: deploy the scheduler and kerberos components [deployment-charts] - https://gerrit.wikimedia.org/r/1093177 (https://phabricator.wikimedia.org/T380284)
[08:43:40] <wikibugs>	 (CR) Arnaudb: [C:+1] "native methods have been extensively tested, some bugs have been fixed by @rcoccioli@wikimedia.org along the way." [cookbooks] - https://gerrit.wikimedia.org/r/1087860 (owner: Volans)
[08:44:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of bast7001.wikimedia.org to plain
[08:46:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of bast7001.wikimedia.org to plain
[08:48:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of durum7002.magru.wmnet to plain
[08:51:18] <jayme>	 !log disabling puppet on all k8s control planes for rollout of T380142
[08:51:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:22] <stashbot>	 T380142: Reimaging a kubernetes control-plane invalidates service-account tokens issued by it - https://phabricator.wikimedia.org/T380142
[08:52:24] <wikibugs>	 (PS4) Brouberol: airflow-analytics-test: deploy the scheduler and kerberos components [deployment-charts] - https://gerrit.wikimedia.org/r/1093177 (https://phabricator.wikimedia.org/T380284)
[08:53:57] <wikibugs>	 (CR) JMeybohm: [C:+2] kubernetes::master: Don't override sa certificates on reimage [puppet] - https://gerrit.wikimedia.org/r/1092803 (https://phabricator.wikimedia.org/T380142) (owner: JMeybohm)
[08:55:22] <wikibugs>	 SRE, SRE-swift-storage, Ceph, collaboration-services, GitLab (Infrastructure): Migrate gitlab storage to apus - https://phabricator.wikimedia.org/T378922#10339149 (LSobanski)
[08:55:23] <wikibugs>	 (Abandoned) LSobanski: Switch alerts deployment source to GitLab [puppet] - https://gerrit.wikimedia.org/r/982086 (https://phabricator.wikimedia.org/T349626) (owner: LSobanski)
[08:56:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7002.magru.wmnet to plain
[08:57:01] <wikibugs>	 ops-codfw, SRE, collaboration-services, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T380341#10339152 (LSobanski)
[08:58:15] <wikibugs>	 (CR) Aklapper: [C:+2] EditionLookup: Update EntityLookup calls [extensions/Wikisource] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1093172 (https://phabricator.wikimedia.org/T380304) (owner: Thiemo Kreuz (WMDE))
[08:58:24] <wikibugs>	 (CR) Aklapper: [C:+2] "As this very code change has already been merged into the master branch in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikisourc" [extensions/Wikisource] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1093172 (https://phabricator.wikimedia.org/T380304) (owner: Thiemo Kreuz (WMDE))
[08:58:27] <icinga-wm>	 PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:58:35] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[08:58:35] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on durum7002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[08:58:35] <icinga-wm>	 PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:00:05] <jouncebot>	 andre and brennen: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T0900).
[09:00:27] <icinga-wm>	 RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:00:33] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum7002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[09:00:35] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on durum7002 is OK: OK: UP (pid=2382) and all threads (8) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[09:00:35] <icinga-wm>	 RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:13:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7002.magru.wmnet to plain
[09:13:25] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] ml-services: update article-country response schema [deployment-charts] - https://gerrit.wikimedia.org/r/1093006 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[09:13:33] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] ml-services: update reference-quality docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1092866 (https://phabricator.wikimedia.org/T378939) (owner: AikoChou)
[09:13:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7002.magru.wmnet to plain
[09:14:17] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1087 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[09:15:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of doh7002.wikimedia.org to plain
[09:15:29] <wikibugs>	 (Merged) jenkins-bot: EditionLookup: Update EntityLookup calls [extensions/Wikisource] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1093172 (https://phabricator.wikimedia.org/T380304) (owner: Thiemo Kreuz (WMDE))
[09:18:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7002.wikimedia.org to plain
[09:19:32] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[09:19:33] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on doh7002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[09:19:35] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on doh7002 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[09:19:35] <icinga-wm>	 PROBLEM - BFD status on asw1-b4-magru.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:20:16] <logmsgbot>	 !log aklapper@deploy2002 Started scap sync-world: Backport for [[gerrit:1093172|EditionLookup: Update EntityLookup calls (T380304)]]
[09:20:20] <stashbot>	 T380304: Wikisource extension: Error: Call to undefined method Wikibase\Client\WikibaseClient::getRestrictedEntityLookup() - https://phabricator.wikimedia.org/T380304
[09:20:27] <icinga-wm>	 PROBLEM - BGP status on asw1-b4-magru.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:20:30] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[09:20:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of prometheus7001.magru.wmnet to plain
[09:21:27] <icinga-wm>	 RECOVERY - BGP status on asw1-b4-magru.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:21:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of prometheus7001.magru.wmnet to plain
[09:21:33] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on doh7002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[09:21:35] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on doh7002 is OK: OK: UP (pid=2373) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[09:21:35] <icinga-wm>	 RECOVERY - BFD status on asw1-b4-magru.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:23:54] <wikibugs>	 (PS7) DCausse: Migrate package to opensearch [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1080749 (https://phabricator.wikimedia.org/T372769) (owner: Ebernhardson)
[09:23:55] <wikibugs>	 (PS1) DCausse: Update README and gitreview [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1093270
[09:24:58] <wikibugs>	 (CR) DCausse: [C:+1] "nice!" [software/opensearch/plugins] - https://gerrit.wikimedia.org/r/1080749 (https://phabricator.wikimedia.org/T372769) (owner: Ebernhardson)
[09:26:41] <logmsgbot>	 !log aklapper@deploy2002 aklapper, thiemowmde: Backport for [[gerrit:1093172|EditionLookup: Update EntityLookup calls (T380304)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:26:52] <stashbot>	 T380304: Wikisource extension: Error: Call to undefined method Wikibase\Client\WikibaseClient::getRestrictedEntityLookup() - https://phabricator.wikimedia.org/T380304
[09:27:15] <logmsgbot>	 !log aklapper@deploy2002 aklapper, thiemowmde: Continuing with sync
[09:27:58] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[09:28:42] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[09:30:35] <wikibugs>	 (CR) Kevin Bazira: [C:+2] ml-services: update article-country response schema [deployment-charts] - https://gerrit.wikimedia.org/r/1093006 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[09:31:43] <wikibugs>	 (Merged) jenkins-bot: ml-services: update article-country response schema [deployment-charts] - https://gerrit.wikimedia.org/r/1093006 (https://phabricator.wikimedia.org/T371897) (owner: Kevin Bazira)
[09:32:55] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[09:33:17] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams
[09:33:31] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams
[09:33:49] <logmsgbot>	 !log aklapper@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093172|EditionLookup: Update EntityLookup calls (T380304)]] (duration: 13m 33s)
[09:33:53] <stashbot>	 T380304: Wikisource extension: Error: Call to undefined method Wikibase\Client\WikibaseClient::getRestrictedEntityLookup() - https://phabricator.wikimedia.org/T380304
[09:34:44] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[09:35:27] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[09:37:56] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 99.34% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[09:38:15] <akosiaris>	 !log decommission cxserver endpoints /api/rest_v1/list/(pair|tool|languagepairs) from RESTBase T375616
[09:38:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:19] <stashbot>	 T375616: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616
[09:40:44] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.44.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1093271 (https://phabricator.wikimedia.org/T375663)
[09:40:46] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 to 1.44.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1093271 (https://phabricator.wikimedia.org/T375663) (owner: TrainBranchBot)
[09:41:05] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:41:33] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.44.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/1093271 (https://phabricator.wikimedia.org/T375663) (owner: TrainBranchBot)
[09:43:29] <wikibugs>	 (PS1) Muehlenhoff: Add cn=wmf as managed group for idm-test [puppet] - https://gerrit.wikimedia.org/r/1093272
[09:44:00] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[09:44:37] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[09:49:24] <wikibugs>	 (CR) Btullis: [C:+1] "Let's go!" [deployment-charts] - https://gerrit.wikimedia.org/r/1093177 (https://phabricator.wikimedia.org/T380284) (owner: Brouberol)
[09:51:45] <wikibugs>	 (CR) Btullis: [C:+2] Canary cephosd1001 to use nftables instead of iptables [puppet] - https://gerrit.wikimedia.org/r/1089716 (https://phabricator.wikimedia.org/T327259) (owner: Btullis)
[09:52:12] <logmsgbot>	 !log aklapper@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.4  refs T375663
[09:52:16] <stashbot>	 T375663: 1.44.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T375663
[09:55:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: kube-publish-sa-cert.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:55:43] <wikibugs>	 (03CR) 10Fabfur: cache: install lshw from bullseye-backports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur)
[09:55:57] <wikibugs>	 (03PS4) 10Fabfur: cache: install lshw from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295)
[09:58:14] <wikibugs>	 (03PS1) 10Slyngshede: P:idm enable bitu-account-manager permission request. [puppet] - 10https://gerrit.wikimedia.org/r/1093275
[10:04:57] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur)
[10:05:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: kube-publish-sa-cert.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:06:04] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1092914 (https://phabricator.wikimedia.org/T370745) (owner: 10CDanis)
[10:06:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one nit inline (feel free to ignore)" [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur)
[10:10:23] <wikibugs>	 (03CR) 10Fabfur: cache: install lshw from bullseye-backports (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur)
[10:10:36] <wikibugs>	 (03PS5) 10Fabfur: cache: install lshw from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295)
[10:10:45] <wikibugs>	 (03CR) 10Slyngshede: Add cn=wmf as managed group for idm-test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093272 (owner: 10Muehlenhoff)
[10:11:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: kube-publish-sa-cert.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:13:50] <wikibugs>	 (03PS1) 10JMeybohm: Fix permissions and notify of kube-publish-sa-cert [puppet] - 10https://gerrit.wikimedia.org/r/1093280 (https://phabricator.wikimedia.org/T380142)
[10:14:19] <wikibugs>	 (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093280 (https://phabricator.wikimedia.org/T380142) (owner: 10JMeybohm)
[10:16:10] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: kube-publish-sa-cert.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:17:20] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] Fix permissions and notify of kube-publish-sa-cert [puppet] - 10https://gerrit.wikimedia.org/r/1093280 (https://phabricator.wikimedia.org/T380142) (owner: 10JMeybohm)
[10:17:21] <wikibugs>	 (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071610 (https://phabricator.wikimedia.org/T363210) (owner: 10JMeybohm)
[10:18:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1014.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1014.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs10
[10:18:19] <icinga-wm>	 .wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:18:21] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1014.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1015.eqiad.wmnet, wdqs10
[10:18:21] <icinga-wm>	 .wmnet, wdqs1019.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:18:28] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[10:19:46] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[10:21:06] <wikibugs>	 (03PS2) 10Muehlenhoff: Add cn=wmf as managed group for idm-test [puppet] - 10https://gerrit.wikimedia.org/r/1093272
[10:21:16] <jynus>	 is the wdqs expected?
[10:21:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:21:21] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:21:27] <jynus>	 ok
[10:21:51] <dcausse>	 jynus: no, most likely due to a single client abusing the service, looking
[10:22:01] <jynus>	 I see
[10:22:20] <jynus>	 let me know if I can help in any way
[10:22:40] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1091249 (https://phabricator.wikimedia.org/T379927) (owner: 10Andrew Bogott)
[10:22:46] <effie>	 !log removing leadership from kafka-main1001 - T363214
[10:22:48] <dcausse>	 jynus: sure, thanks for the offer!
[10:22:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:51] <stashbot>	 T363214: kafka-main100[6789] and kafka-main1010 implementation tracking - https://phabricator.wikimedia.org/T363214
[10:22:58] <wikibugs>	 (03CR) 10Muehlenhoff: Add cn=wmf as managed group for idm-test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093272 (owner: 10Muehlenhoff)
[10:23:09] <dcausse>	 going to wait a bit before investigating sparql query logs
[10:24:03] <dcausse>	 seems like it's recovering... (crossing fingers)
[10:26:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur)
[10:27:05] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:28:30] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache: install lshw from bullseye-backports [puppet] - 10https://gerrit.wikimedia.org/r/1092887 (https://phabricator.wikimedia.org/T380295) (owner: 10Fabfur)
[10:32:05] <jinxer-wm>	 FIRING: [14x] ProbeDown: Service kubestagemaster2005:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:33:13] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on kafka-main[1001,1006].eqiad.wmnet with reason: Hardware refresh
[10:33:21] <jayme>	 !log re-enabled puppet on all k8s control planes for rollout of T380142
[10:33:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:25] <stashbot>	 T380142: Reimaging a kubernetes control-plane invalidates service-account tokens issued by it - https://phabricator.wikimedia.org/T380142
[10:33:28] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on kafka-main[1001,1006].eqiad.wmnet with reason: Hardware refresh
[10:33:30] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on P{cephosd1001.eqiad.wmnet} and (A:cephosd)
[10:34:32] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams
[10:35:17] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] ml-services: update reference-quality docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092866 (https://phabricator.wikimedia.org/T378939) (owner: 10AikoChou)
[10:35:53] <icinga-wm>	 PROBLEM - BFD status on lsw1-e1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:36:18] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update reference-quality docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092866 (https://phabricator.wikimedia.org/T378939) (owner: 10AikoChou)
[10:36:19] <icinga-wm>	 PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:37:04] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams
[10:38:26] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_magru
[10:38:41] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_magru
[10:39:55] <icinga-wm>	 RECOVERY - BFD status on lsw1-e1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:42:19] <icinga-wm>	 RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:43:40] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on P{cephosd1001.eqiad.wmnet} and (A:cephosd)
[10:50:21] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1020.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1019.eqiad.wmnet are mar
[10:50:21] <icinga-wm>	  but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:50:49] <dcausse>	 sigh...
[10:51:12] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service aux-k8s-ctrl1003:6443 has failed probes (http_aux_k8s_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:51:19] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1019.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1018.eqiad.wmnet, wdqs1019.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1020.eqiad.wmnet are marked down but pooled: aux-k8s-ctrl_6443: Serv
[10:51:19] <icinga-wm>	 k8s-ctrl1002.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1014.eqiad.wmnet, wdqs1018.eqiad.wmnet, wdqs1020.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:53:02] <jayme>	 !incidents
[10:53:02] <sirenbot>	 5464 (ACKED)  [2x] ProbeDown sre (aux-k8s-ctrl1003:6443 probes/custom eqiad)
[10:53:02] <XioNoX>	 dcausse: hey hey, you know what's up or should we start investigating?
[10:53:06] <jynus>	 I am here
[10:53:09] <jynus>	 I was acking it
[10:53:16] <jayme>	 probe down is potentially me
[10:54:08] <jynus>	 I am checking traffic impact
[10:54:44] <jynus>	 lots of probes at 50%
[10:54:54] <jayme>	 then it's not me 
[10:55:15] <jynus>	 this one is the only one that goes at 60%
[10:55:19] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:55:21] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:55:52] <jynus>	 issues started around 10:15
[10:56:09] <jynus>	 potentially before, around 8:48
[10:56:12] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service aux-k8s-ctrl1003:6443 has failed probes (http_aux_k8s_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl1003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:56:33] <jynus>	 is this a network availability issue, a host availability?
[10:56:58] <dcausse>	 XioNoX: re wdqs, most probably a single client, investigation might take time since I have to go through query logs :/
[10:57:05] <jynus>	 or a monitoring issue?
[10:58:17] <jynus>	 does anyone see service problems other than the failing probes?
[10:59:10] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:59:40] <XioNoX>	 jayme: ^ is that you?
[10:59:46] <jayme>	 I don't think so
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1100)
[11:00:06] <jayme>	 all apiservers respond to queries just fine
[11:00:49] <jynus>	 XioNoX: on the layer that you know, do you see any connectivity issue between prometheus and those reported failed services?
[11:00:59] <jynus>	 I will check prometheus meanwhile
[11:02:05] <jinxer-wm>	 FIRING: [16x] ProbeDown: Service kubestagemaster1004:6443 has failed probes (http_staging_eqiad_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:02:52] <XioNoX>	 jynus: looking
[11:03:06] <jynus>	 if the services themselves look fine, it is either the connectivity or the source host/service
[11:03:42] <jynus>	 XioNoX: there is an increase in network activity on prometheus codfw hosts, maybe a clue
[11:03:57] <jayme>	 there are a ton of things in the probedown alert (36) - but even when expanding the card in the UI I don't see a kube apiserver there
[11:04:46] <jynus>	 maybe prometheus is getting saturated
[11:05:45] <jynus>	 there was a spike of writes on prometheus too
[11:05:50] <XioNoX>	 https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?orgId=1&from=now-12h&to=now
[11:05:57] <XioNoX>	 yeah something is up
[11:06:49] <XioNoX>	 but started a while ago
[11:06:57] <claime>	 I'm getting intermittent connection issues between deploy and the kubemaster
[11:06:58] <jynus>	 "socket: permission denied"
[11:07:03] <XioNoX>	 and slowly went up
[11:07:52] <claime>	 cgoubert@deploy2002:~$ kubectl get job mw-script.codfw.kftdqx7r -o yaml
[11:07:54] <claime>	 The connection to the server kubemaster.svc.codfw.wmnet:6443 was refused - did you specify the right host or port?
[11:07:58] <jynus>	 was there any recent firewall update?
[11:08:03] <claime>	 next invocation went through
[11:08:41] <topranks>	 there was an update to the envoy firewall rules yesterday 
[11:08:54] <topranks>	 not sure if it's relevant, I'd expect it to have bitten us right away if so 
[11:08:55] <jynus>	 this seems more recent, a few hours at most
[11:09:04] <topranks>	 yeah
[11:09:30] <XioNoX>	 from the monitoring, stuff started between 7:30 and 8:30 UTC
[11:09:36] <jynus>	 got a signal here, unsure if relevant: https://logstash.wikimedia.org/goto/987de189fd67c42d8080ae3ab22dbf54
[11:09:39] <XioNoX>	 maybe a bit before
[11:10:39] <wikibugs>	 (03PS1) 10Tiziano Fogli: opensearch: reduce noise of PrometheusRuleEvaluationFailures [alerts] - 10https://gerrit.wikimedia.org/r/1093302 (https://phabricator.wikimedia.org/T374178)
[11:10:44] <jynus>	 nah, it happens all the time, it is noise
[11:11:00] <XioNoX>	 no smoking gun on the network metrics so far
[11:11:26] <jayme>	 claime: I did trigger a restart of all kube-apiserver processes via a puppet change ~30min ago
[11:18:39] <XioNoX>	 looking at https://logstash.wikimedia.org/app/dashboards#/view/f3e709c0-a5f8-11ec-bf8e-43f1807d5bc2 (logstash logs) there is no smoking gun either
[11:22:39] <akosiaris>	 !log decommission cxserver endpoints /api/rest_v1/transform/html/from, /api/rest_v1/transform/word/from from RESTBase T375616
[11:22:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:44] <stashbot>	 T375616: Block RESTBase cxserver v1 endpoints in favor of the new endpoints - https://phabricator.wikimedia.org/T375616
[11:24:41] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_magru
[11:27:32] <jynus>	 So the only anomalies I can see are small spikes of wdqs errors, but those happened before
[11:28:22] <jynus>	 the only issues I see ongoing are wdqs and etherpad not near 100% availability
[11:28:44] <wikibugs>	 (03CR) 10Máté Szabó: [C:03+1] prometheus-mcrouter-exporter: update to v0.4.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092338 (https://phabricator.wikimedia.org/T380212) (owner: 10Effie Mouzeli)
[11:28:46] <jynus>	 but issues with those services wouldn't explain the probe failures
[11:30:02] <XioNoX>	 jynus: from there https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?orgId=1&from=now-24h&to=now&viewPanel=2 it looks like something is funky with prometheus itself
[11:30:19] <XioNoX>	 both eqiad and codfw prom endpoints
[11:30:33] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_magru
[11:31:06] <jynus>	 XioNoX: I think you are right, some kind of overload
[11:31:17] <jynus>	 so probes could have failed from client side
[11:31:38] <jynus>	 let's ping observability and keep an eye on it
[11:32:10] <XioNoX>	 jynus: +1
[11:32:33] <XioNoX>	 jynus: who from o11y is on the closest timezone?
[11:32:51] <jynus>	 I am talking on their channel
[11:36:46] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2143.codfw.wmnet with OS bookworm
[11:37:14] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2144.codfw.wmnet with OS bookworm
[11:38:17] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2145.codfw.wmnet with OS bookworm
[11:38:44] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2146.codfw.wmnet with OS bookworm
[11:39:11] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2147.codfw.wmnet with OS bookworm
[11:39:43] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2148.codfw.wmnet with OS bookworm
[11:40:03] <jinxer-wm>	 FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[11:40:16] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2149.codfw.wmnet with OS bookworm
[11:55:20] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] haproxykafka: working on TLS client authentication to kafka [puppet] - 10https://gerrit.wikimedia.org/r/1090915 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[11:56:09] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2144.codfw.wmnet with reason: host reimage
[11:56:13] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2143.codfw.wmnet with reason: host reimage
[11:57:25] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2145.codfw.wmnet with reason: host reimage
[11:57:38] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2146.codfw.wmnet with reason: host reimage
[11:58:20] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2147.codfw.wmnet with reason: host reimage
[11:59:13] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2148.codfw.wmnet with reason: host reimage
[11:59:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2149.codfw.wmnet with reason: host reimage
[12:00:04] <jouncebot>	 mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1200).
[12:00:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:01:42] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2144.codfw.wmnet with reason: host reimage
[12:02:21] <wikibugs>	 (03PS1) 10Fabfur: haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776)
[12:04:32] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2149.codfw.wmnet with reason: host reimage
[12:05:52] <wikibugs>	 (03PS1) 10Fabfur: Revert "haproxykafka: working on TLS client authentication to kafka" [puppet] - 10https://gerrit.wikimedia.org/r/1093321
[12:06:36] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "you need to pass the chosen user from haproxykafka profile to the haproxykafka class, where right now is taking the default hardcoded valu" [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[12:06:51] <wikibugs>	 (03PS1) 10Cathal Mooney: Temporarily change cumin installserver alias to not include mgaru [puppet] - 10https://gerrit.wikimedia.org/r/1093322 (https://phabricator.wikimedia.org/T376737)
[12:07:55] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2148.codfw.wmnet with reason: host reimage
[12:08:12] <sukhe>	 !log disable puppet on cumin2002 to test cumin alias for A:installserver
[12:08:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:19] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] Revert "haproxykafka: working on TLS client authentication to kafka" [puppet] - 10https://gerrit.wikimedia.org/r/1093321 (owner: 10Fabfur)
[12:09:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job haproxykafka in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:10:09] <wikibugs>	 (03CR) 10Slyngshede: Add cn=wmf as managed group for idm-test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093272 (owner: 10Muehlenhoff)
[12:10:45] <jinxer-wm>	 FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:11:49] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2147.codfw.wmnet with reason: host reimage
[12:14:43] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.hosts.dhcp for host cp7007.magru.wmnet
[12:15:14] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2145.codfw.wmnet with reason: host reimage
[12:15:45] <jinxer-wm>	 FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:16:32] <logmsgbot>	 !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cp7007.magru.wmnet
[12:16:59] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.dhcp for host cp7007.magru.wmnet
[12:18:53] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2146.codfw.wmnet with reason: host reimage
[12:19:09] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host cp7007.magru.wmnet
[12:19:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job haproxykafka in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:20:57] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp7007.magru.wmnet with OS bullseye
[12:21:13] <wikibugs>	 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339739 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp7007.magru.wmnet with...
[12:21:22] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2144.codfw.wmnet with OS bookworm
[12:22:15] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2143.codfw.wmnet with reason: host reimage
[12:22:21] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[12:23:33] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[12:23:56] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2149.codfw.wmnet with OS bookworm
[12:26:13] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] prometheus-mcrouter-exporter: update to v0.4.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092338 (https://phabricator.wikimedia.org/T380212) (owner: 10Effie Mouzeli)
[12:26:29] <wikibugs>	 (03CR) 10Effie Mouzeli: [V:03+2 C:03+2] prometheus-mcrouter-exporter: update to v0.4.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1092338 (https://phabricator.wikimedia.org/T380212) (owner: 10Effie Mouzeli)
[12:26:58] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2148.codfw.wmnet with OS bookworm
[12:28:23] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093323
[12:31:09] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2147.codfw.wmnet with OS bookworm
[12:33:35] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[12:33:59] <icinga-wm>	 RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.57 ms
[12:34:11] <icinga-wm>	 PROBLEM - ps1-c3-codfw-infeed-load-tower-B-phase-Y on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:34:11] <icinga-wm>	 PROBLEM - ps1-c3-codfw-infeed-load-tower-A-phase-Y on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:34:11] <icinga-wm>	 PROBLEM - ps1-c3-codfw-infeed-load-tower-B-phase-X on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:34:11] <icinga-wm>	 PROBLEM - ps1-c3-codfw-infeed-load-tower-A-phase-X on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:34:11] <icinga-wm>	 PROBLEM - ps1-c3-codfw-infeed-load-tower-A-phase-Z on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:34:11] <icinga-wm>	 PROBLEM - ps1-c3-codfw-infeed-load-tower-B-phase-Z on ps1-c3-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:34:30] <jynus>	 ?
[12:34:40] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2145.codfw.wmnet with OS bookworm
[12:34:51] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[12:36:27] <wikibugs>	 (03PS3) 10Muehlenhoff: Add cn=wmf as managed group for idm-test [puppet] - 10https://gerrit.wikimedia.org/r/1093272
[12:36:47] <wikibugs>	 (03CR) 10Muehlenhoff: Add cn=wmf as managed group for idm-test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093272 (owner: 10Muehlenhoff)
[12:38:25] <wikibugs>	 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339847 (10RobH) Work rescheduled after conversation with both @ssingh and @MoritzMuehlenhoff regarding ganeti host cadence and swa...
[12:38:27] <sukhe>	 !log re-enable puppet on cumin2002
[12:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:19] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2146.codfw.wmnet with OS bookworm
[12:39:28] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093325
[12:40:23] <icinga-wm>	 PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[12:40:45] <jinxer-wm>	 RESOLVED: [3x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:41:03] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2150.codfw.wmnet with OS bookworm
[12:41:30] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2151.codfw.wmnet with OS bookworm
[12:41:31] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2143.codfw.wmnet with OS bookworm
[12:42:12] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2152.codfw.wmnet with OS bookworm
[12:42:46] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2153.codfw.wmnet with OS bookworm
[12:43:31] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2154.codfw.wmnet with OS bookworm
[12:44:02] <wikibugs>	 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10339856 (10ssingh) Thanks @RobH , sounds good!  >>! In T376737#10339847, @RobH wrote: > Work rescheduled after conversation with bo...
[12:44:06] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2155.codfw.wmnet with OS bookworm
[12:46:10] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7007.magru.wmnet with reason: host reimage
[12:49:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1017.eqiad.wmnet
[12:49:54] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10339906 (10ops-monitoring-bot) Draining ganeti1017.eqiad.wmnet of running VMs
[12:50:01] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[12:50:05] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7007.magru.wmnet with reason: host reimage
[12:51:18] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[12:53:45] <Niharika>	 Is enwiki down for anyone else?
[12:53:57] <Niharika>	 Can't access it from Singapore atm
[12:54:14] <Niharika>	 Our servers are currently under maintenance or experiencing a technical problem. Please try again in a few minutes.
[12:54:36] <Niharika>	 Nvm, back up
[12:54:57] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:55:17] <jynus>	 acked
[12:55:21] * Emperor here
[12:55:22] <XioNoX>	 hey hey
[12:55:23] <jynus>	 Niharika: on it
[12:55:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1017.eqiad.wmnet
[12:55:43] <Emperor>	 !incidents
[12:55:43] <sirenbot>	 5465 (ACKED)  [4x] ProbeDown sre (text-https:443 probes/service)
[12:55:43] <sirenbot>	 5464 (RESOLVED)  [2x] ProbeDown sre (aux-k8s-ctrl1003:6443 probes/custom eqiad)
[12:56:45] <Emperor>	 looks like that p.aged everyone immediately?
[12:57:05] <jinxer-wm>	 RESOLVED: [8x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:57:15] <moritzm>	 seems so, yes
[12:57:16] <elukey>	 I got a page yes
[12:58:12] <Amir1>	 me too
[12:58:15] <elukey>	 was it a temp blip?
[12:58:18] <akosiaris>	 Yup
[12:58:32] <akosiaris>	 Yup to the "paged everyone"
[12:58:46] <jynus>	 Niharika: can you confirm it works now?
[12:59:01] <jynus>	 I acked it relatively fast
[12:59:13] <jynus>	 could be what I said about routing issues the other day?
[12:59:14] <Emperor>	 I'm going to update T371244
[12:59:14] <stashbot>	 T371244: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244
[12:59:32] <Emperor>	 I see in the VO log "User escalator_sysuser routed incident #5465 from SRE:SRE Business Hours (Escalation) to SRE:SRE Batphone (Escalation)"
[12:59:34] <jynus>	 which I attributed to a UI bug, but may be something ongoing
[12:59:52] <elukey>	 the VO app tells me that the Batphone is on-call now
[12:59:57] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:00:16] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2150.codfw.wmnet with reason: host reimage
[13:00:23] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2151.codfw.wmnet with reason: host reimage
[13:00:42] <Emperor>	 !oncall-now
[13:00:43] <sirenbot>	 Oncall now for team SRE, rotation business_hours:
[13:00:43] <sirenbot>	 X.ioNoX, j.ynus
[13:01:20] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2152.codfw.wmnet with reason: host reimage
[13:01:28] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2153.codfw.wmnet with reason: host reimage
[13:01:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1017.eqiad.wmnet
[13:01:47] <wikibugs>	 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10339960 (10ops-monitoring-bot) Draining ganeti1017.eqiad.wmnet of running VMs
[13:01:55] <elukey>	 weird, the VO website doesn't
[13:02:14] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2154.codfw.wmnet with reason: host reimage
[13:02:21] <elukey>	 ok maybe the very nice and intuitive UI of the app is fooling me, not sure
[13:02:58] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2155.codfw.wmnet with reason: host reimage
[13:03:40] <Emperor>	 elukey: I've added your observation to my note on T371244
[13:03:58] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2150.codfw.wmnet with reason: host reimage
[13:04:27] <wikibugs>	 06SRE, 06SRE-OnFire, 06SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244#10339967 (10MatthewVernon) I think this happened again today, with [[ https://portal.victorops.com/ui/wikimedia/incident/5465/details | incident 5465 ]] - everyone w...
[13:04:46] <claime>	 elukey: I think batphone is always "on-call" for escalation purposes but I may be wrong
[13:04:59] <claime>	 VO is so clear on what's happening /s
[13:06:46] <wikibugs>	 (03PS2) 10Anzx: knwiki: update portal namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093328 (https://phabricator.wikimedia.org/T380366)
[13:07:13] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2153.codfw.wmnet with reason: host reimage
[13:07:48] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 20 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093328 (https://phabricator.wikimedia.org/T380366) (owner: 10Anzx)
[13:11:08] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2155.codfw.wmnet with reason: host reimage
[13:12:22] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] idm: Set Envoy firewall config in nftables-compatible syntax [puppet] - 10https://gerrit.wikimedia.org/r/1092842 (owner: 10Muehlenhoff)
[13:14:15] <wikibugs>	 (03PS1) 10Effie Mouzeli: kafka-main: Replace kafka-main1001 with kafka-main1006 [puppet] - 10https://gerrit.wikimedia.org/r/1093330 (https://phabricator.wikimedia.org/T363214)
[13:14:15] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2154.codfw.wmnet with reason: host reimage
[13:16:01] <wikibugs>	 (03PS2) 10Effie Mouzeli: kafka-main: Replace kafka-main1001 with kafka-main1006 [puppet] - 10https://gerrit.wikimedia.org/r/1093330 (https://phabricator.wikimedia.org/T363214)
[13:17:23] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2152.codfw.wmnet with reason: host reimage
[13:17:36] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7007.magru.wmnet with OS bullseye
[13:17:42] <wikibugs>	 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10340002 (10RobH)
[13:17:45] <wikibugs>	 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10340017 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp7007.magru.wmnet with OS...
[13:18:45] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] kafka-main: Replace kafka-main1001 with kafka-main1006 [puppet] - 10https://gerrit.wikimedia.org/r/1093330 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli)
[13:19:34] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] kafka-main: Replace kafka-main1001 with kafka-main1006 [puppet] - 10https://gerrit.wikimedia.org/r/1093330 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli)
[13:20:19] <wikibugs>	 (03CR) 10Btullis: [C:03+2] "I'm happy that this works as expected now." [puppet] - 10https://gerrit.wikimedia.org/r/1089716 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[13:20:47] <wikibugs>	 (03PS1) 10Muehlenhoff: idp-test: Set Envoy firewall config in nftables-compatible syntax [puppet] - 10https://gerrit.wikimedia.org/r/1093332
[13:21:01] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2151.codfw.wmnet with reason: host reimage
[13:23:07] <elukey>	 claime: yes yes definitely, the escalation is always "on-call", I thought I'd seen EMEA as well for batphone, but probably got fooled by the UI
[13:23:20] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2150.codfw.wmnet with OS bookworm
[13:24:39] <wikibugs>	 (03PS1) 10Btullis: Upgrade the remainder of the cephosd cluster to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1093333 (https://phabricator.wikimedia.org/T327259)
[13:26:00] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2153.codfw.wmnet with OS bookworm
[13:26:00] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1093333 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[13:26:02] <wikibugs>	 (03PS57) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129)
[13:26:16] <elukey>	 besides the VO issue, do we know what happened?
[13:26:26] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-main-eqiad
[13:28:28] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[13:28:41] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade.
[13:29:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1093333 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[13:29:51] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Upgrade the remainder of the cephosd cluster to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1093333 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis)
[13:29:52] <logmsgbot>	 !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[13:30:32] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093332 (owner: 10Muehlenhoff)
[13:31:18] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2155.codfw.wmnet with OS bookworm
[13:33:08] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2154.codfw.wmnet with OS bookworm
[13:36:21] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-main-eqiad
[13:38:00] <Niharika>	 jynus: sorry, I got pulled into a meeting. Yeah, it worked after a minute or so
[13:38:10] <Niharika>	 Thanks for tackling it
[13:38:41] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2152.codfw.wmnet with OS bookworm
[13:38:46] <effie>	 !log putting kafka-main1006.eqiad.wmnet in production 
[13:38:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:22] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2151.codfw.wmnet with OS bookworm
[13:44:35] <claime>	 !log homer 'lsw1-d1-codfw*' commit 'T377028'
[13:44:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:39] <stashbot>	 T377028: wikikube-worker21[36-55] implementation tracking - https://phabricator.wikimedia.org/T377028
[13:45:34] <claime>	 !log homer 'lsw1-b2-codfw*' commit 'T377028'
[13:45:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:47] <claime>	 !log homer 'lsw1-d6-codfw*' commit 'T377028'
[13:46:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:55] <claime>	 !log homer 'lsw1-c7-codfw*' commit 'T377028'
[13:47:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:37] <claime>	 !log homer 'lsw1-b7-codfw*' commit 'T377028'
[13:48:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:21] <claime>	 !log homer 'lsw1-d5-codfw*' commit 'T377028'
[13:49:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:03] <claime>	 !log homer 'lsw1-c4-codfw*' commit 'T377028'
[13:50:03] <jinxer-wm>	 RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[13:50:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:07] <stashbot>	 T377028: wikikube-worker21[36-55] implementation tracking - https://phabricator.wikimedia.org/T377028
[13:50:43] <claime>	 !log homer 'lsw1-d7-codfw*' commit 'T377028'
[13:50:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:25] <claime>	 !log homer 'lsw1-c2-codfw*' commit 'T377028'
[13:51:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:13] <claime>	 !log homer 'lsw1-d2-codfw*' commit 'T377028'
[13:52:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:52:58] <claime>	 !log homer 'lsw1-b4-codfw*' commit 'T377028'
[13:53:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:34] <wikibugs>	 (03PS58) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129)
[13:53:39] <claime>	 !log homer 'lsw1-d4-codfw*' commit 'T377028'
[13:53:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:54] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2136-2139,2141-2155].codfw.wmnet
[13:56:02] <logmsgbot>	 !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2136-2139,2141-2155].codfw.wmnet
[13:57:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:57:34] <wikibugs>	 (03PS1) 10Effie Mouzeli: Update various kafka-main connection strings for kafka-main1006 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093337 (https://phabricator.wikimedia.org/T363214)
[13:58:10] <wikibugs>	 (03PS1) 10David Caro: toolforge:haproxy: add api gateway health check [puppet] - 10https://gerrit.wikimedia.org/r/1093339 (https://phabricator.wikimedia.org/T348633)
[13:58:48] <wikibugs>	 (03CR) 10CI reject: [V:04-1] toolforge:haproxy: add api gateway health check [puppet] - 10https://gerrit.wikimedia.org/r/1093339 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro)
[13:59:34] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T377129) (owner: 10Arnaudb)
[14:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1400).
[14:00:05] <jouncebot>	 albertoleoncio and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:14] <wikibugs>	 (03PS2) 10David Caro: toolforge:haproxy: add api gateway health check [puppet] - 10https://gerrit.wikimedia.org/r/1093339 (https://phabricator.wikimedia.org/T348633)
[14:00:24] <albertoleoncio>	 Hi!
[14:00:31] <wikibugs>	 (03CR) 10David Caro: "Note that this depends on https://gitlab.wikimedia.org/repos/cloud/toolforge/api-gateway/-/merge_requests/53" [puppet] - 10https://gerrit.wikimedia.org/r/1093339 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro)
[14:02:18] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[14:02:45] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[14:02:47] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[14:02:56] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[14:02:58] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[14:03:32] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[14:03:34] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[14:03:49] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[14:03:51] <logmsgbot>	 !log jiji@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[14:04:28] <logmsgbot>	 !log jiji@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[14:04:30] <logmsgbot>	 !log jiji@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[14:04:46] <logmsgbot>	 !log jiji@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[14:04:47] <logmsgbot>	 !log jiji@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[14:04:56] <wikibugs>	 (03PS2) 10Fabfur: haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776)
[14:05:05] <logmsgbot>	 !log jiji@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[14:05:06] <logmsgbot>	 !log jiji@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:05:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[14:05:44] <logmsgbot>	 !log jiji@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:05:45] <logmsgbot>	 !log jiji@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[14:06:23] <logmsgbot>	 !log jiji@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[14:07:47] <albertoleoncio>	 Ping...
[14:12:53] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] haproxykafka: fix permissions on ssl files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[14:15:38] <wikibugs>	 (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1093339 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro)
[14:16:18] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.8 point update - https://phabricator.wikimedia.org/T379600#10340289 (10MoritzMuehlenhoff)
[14:17:04] <wikibugs>	 (03PS3) 10Fabfur: haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776)
[14:17:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[14:17:31] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1093272 (owner: 10Muehlenhoff)
[14:17:45] <wikibugs>	 (03CR) 10Ssingh: "I think we should keep this if we need it next week but most likely we will use install7001." [puppet] - 10https://gerrit.wikimedia.org/r/1093322 (https://phabricator.wikimedia.org/T376737) (owner: 10Cathal Mooney)
[14:17:52] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1093332 (owner: 10Muehlenhoff)
[14:18:03] <wikibugs>	 (03PS1) 10Ssingh: Revert "Change insrallserver in magru to point to eqiad insrall server" [homer/public] - 10https://gerrit.wikimedia.org/r/1093340
[14:18:21] <wikibugs>	 (03PS1) 10Ssingh: Revert "magru: use eqiad's installserver temporarily for testing" [puppet] - 10https://gerrit.wikimedia.org/r/1093341
[14:18:38] <wikibugs>	 (03CR) 10Fabfur: haproxykafka: fix permissions on ssl files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[14:19:20] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Revert "magru: use eqiad's installserver temporarily for testing" [puppet] - 10https://gerrit.wikimedia.org/r/1093341 (owner: 10Ssingh)
[14:20:38] <wikibugs>	 (03PS4) 10Fabfur: haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776)
[14:20:40] <wikibugs>	 (03PS2) 10Papaul: Revert "Change insallserver in magru to point to eqiad install server" [homer/public] - 10https://gerrit.wikimedia.org/r/1093340 (owner: 10Ssingh)
[14:21:11] <wikibugs>	 (03CR) 10Papaul: [C:03+1] Revert "Change insallserver in magru to point to eqiad install server" [homer/public] - 10https://gerrit.wikimedia.org/r/1093340 (owner: 10Ssingh)
[14:21:46] <wikibugs>	 (03PS3) 10Papaul: Revert "Change installserver in magru to point to eqiad install server" [homer/public] - 10https://gerrit.wikimedia.org/r/1093340 (owner: 10Ssingh)
[14:22:38] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] Revert "Change installserver in magru to point to eqiad install server" [homer/public] - 10https://gerrit.wikimedia.org/r/1093340 (owner: 10Ssingh)
[14:23:05] <sukhe>	 !log running homer on asw*magru*
[14:23:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:26] <jynus>	 !log starting resharding of commons backup files into new host backup1010  T376892
[14:23:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:23:29] <stashbot>	 T376892: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892
[14:23:55] <jynus>	 ^ XioNoX expect increase internal traffic for a few days (probably unnoticed, but FYI)
[14:24:01] <jynus>	 *increased
[14:24:32] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on P{cephosd100[2-4].eqiad.wmnet} and (A:cephosd)
[14:24:43] <XioNoX>	 ok
[14:25:14] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-11-12-161156 to 2024-11-19-132736 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093347 (https://phabricator.wikimedia.org/T377547)
[14:25:18] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-11-13-145636 to 2024-11-18-142635 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093348 (https://phabricator.wikimedia.org/T376938)
[14:25:19] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-11-18-142635 to 2024-11-19-132736 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093349 (https://phabricator.wikimedia.org/T378044)
[14:25:25] <cdanis>	 jouncebot: nowandnext
[14:25:25] <jouncebot>	 For the next 0 hour(s) and 34 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1400)
[14:25:26] <jouncebot>	 In 0 hour(s) and 34 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1500)
[14:25:43] <logmsgbot>	 !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7007.magru.wmnet [reason: host reimaged]
[14:26:12] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[14:26:56] <cdanis>	 Lucas_WMDE: urbanecm: TheresNoTime: is the backport window being used?
[14:27:36] <icinga-wm>	 PROBLEM - BFD status on lsw1-e2-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:27:38] <icinga-wm>	 PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:28:36] <wikibugs>	 (03PS1) 10Muehlenhoff: debmonitor: Restrict access to Envoy port [puppet] - 10https://gerrit.wikimedia.org/r/1093350
[14:29:46] <cdanis>	 !log T380226 💙cdanis@mwmaint2002.codfw.wmnet ~ 🕤☕ mwscript sql.php --wiki=commonswiki  --cluster=extension1  /srv/mediawiki/php-1.44.0-wmf.4/extensions/JsonConfig/sql/mysql/tables-generated.sql
[14:29:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:29:52] <stashbot>	 T380226: Install globaljsonlinks* tables on X1 for use with commons commons for Charts deployment - https://phabricator.wikimedia.org/T380226
[14:30:36] <icinga-wm>	 RECOVERY - BFD status on lsw1-e2-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:30:36] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1093350 (owner: 10Muehlenhoff)
[14:30:38] <icinga-wm>	 RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:35:01] <wikibugs>	 (03CR) 10Klausman: [C:03+1] ml-services: update articlequality in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1092994 (https://phabricator.wikimedia.org/T374034) (owner: 10AikoChou)
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:41:44] <icinga-wm>	 PROBLEM - BGP status on lsw1-e3-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:42:36] <icinga-wm>	 PROBLEM - BFD status on lsw1-e3-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:43:44] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:45:36] <icinga-wm>	 RECOVERY - BFD status on lsw1-e3-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:45:44] <icinga-wm>	 RECOVERY - BGP status on lsw1-e3-eqiad.mgmt is OK: BGP OK - up: 26, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:48:19] <wikibugs>	 (03PS1) 10Klausman: ml-staging/experimental: Fix wrongly-scoped limitranges [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093354
[14:48:37] <wikibugs>	 (03PS1) 10Btullis: Failover analytics-hive to standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/1093355 (https://phabricator.wikimedia.org/T377938)
[14:49:20] <wikibugs>	 (03PS4) 10Elukey: sre.hosts.{dhcp,reimage}: force tftp as default option [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576)
[14:49:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add cn=wmf as managed group for idm-test [puppet] - 10https://gerrit.wikimedia.org/r/1093272 (owner: 10Muehlenhoff)
[14:49:30] <wikibugs>	 (03PS1) 10Arlolra: Bump wikimedia/parsoid to 0.21.0-a7 [vendor] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093356 (https://phabricator.wikimedia.org/T373776)
[14:49:32] <wikibugs>	 (03CR) 10Klausman: [V:03+1] "Verified by hotpatching:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093354 (owner: 10Klausman)
[14:50:53] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.hadoop.roll-restart-workers (exit_code=99) restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade.
[14:53:53] <JennH>	 !log power cycling unresponsive mgmt switch in codfw: msw-c3-codfw 
[14:53:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:20] <wikibugs>	 (03PS1) 10Arlolra: Bump wikimedia/parsoid to 0.21.0-a7 [vendor] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093358 (https://phabricator.wikimedia.org/T373776)
[14:56:36] <wikibugs>	 (03PS1) 10Arlolra: Bump wikimedia/parsoid to 0.21.0-a7 [core] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093359 (https://phabricator.wikimedia.org/T380333)
[14:57:18] <wikibugs>	 (03Abandoned) 10Arlolra: Bump wikimedia/parsoid to 0.21.0-a7 [vendor] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093356 (https://phabricator.wikimedia.org/T373776) (owner: 10Arlolra)
[14:57:23] <logmsgbot>	 !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[14:57:34] <icinga-wm>	 PROBLEM - BGP status on lsw1-f1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:57:40] <icinga-wm>	 PROBLEM - BFD status on lsw1-f1-eqiad.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:58:51] <wikibugs>	 (03PS1) 10Muehlenhoff: idm-test: Fix syntax for wmf group config [puppet] - 10https://gerrit.wikimedia.org/r/1093360
[14:59:19] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#10340445 (10elukey) @Jclark-ctr the host is provisioned, next step is the number 2 in T370453#10326159, lemme know if you want me to do it or not!
[14:59:50] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1093360 (owner: 10Muehlenhoff)
[15:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1500)
[15:00:20] <wikibugs>	 06SRE, 06SRE-OnFire, 06SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244#10340460 (10herron) escalator_sysuser is our account for the vo-escalate service which runs from the active alert host.  vo-escalate checks every 15 seconds looking...
[15:00:40] <icinga-wm>	 RECOVERY - BFD status on lsw1-f1-eqiad.mgmt is OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:00:42] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] haproxykafka: fix permissions on ssl files [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[15:03:34] <icinga-wm>	 RECOVERY - BGP status on lsw1-f1-eqiad.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:04:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] idm-test: Fix syntax for wmf group config [puppet] - 10https://gerrit.wikimedia.org/r/1093360 (owner: 10Muehlenhoff)
[15:04:22] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on P{cephosd100[2-4].eqiad.wmnet} and (A:cephosd)
[15:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:37] <wikibugs>	 (03CR) 10Fabfur: [C:04-1] "To be merged after https://phabricator.wikimedia.org/T380373" [puppet] - 10https://gerrit.wikimedia.org/r/1093317 (https://phabricator.wikimedia.org/T379776) (owner: 10Fabfur)
[15:07:44] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093359 (https://phabricator.wikimedia.org/T380333) (owner: 10Arlolra)
[15:07:51] <wikibugs>	 (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade evaluators from 2024-11-12-161156 to 2024-11-19-132736 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093347 (https://phabricator.wikimedia.org/T377547) (owner: 10Jforrester)
[15:08:52] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-11-12-161156 to 2024-11-19-132736 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093347 (https://phabricator.wikimedia.org/T377547) (owner: 10Jforrester)
[15:09:24] <urandom>	 !log bootstrapping cassandra, restbase2037-{a,b,c} — T380236
[15:09:26] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:09:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:29] <stashbot>	 T380236: Refresh restbase202[1-3] w/ restbase203[6-8] - https://phabricator.wikimedia.org/T380236
[15:09:50] <icinga-wm>	 RECOVERY - ps1-c3-codfw-infeed-load-tower-A-phase-Z on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-A-phase-Z 432 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:10:14] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:12:30] <icinga-wm>	 PROBLEM - Host lsw1-c4-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:12:40] <icinga-wm>	 PROBLEM - Host ps1-c4-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:13:00] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:13:30] <wikibugs>	 06SRE, 10Bitu, 06Infrastructure-Foundations: Add feature for a user to request removal of an LDAP group - https://phabricator.wikimedia.org/T380382 (10MoritzMuehlenhoff) 03NEW
[15:13:36] <wikibugs>	 06SRE, 10Bitu, 06Infrastructure-Foundations: Add feature for a user to request removal of an LDAP group - https://phabricator.wikimedia.org/T380382#10340537 (10MoritzMuehlenhoff) p:05Triage→03Low
[15:13:41] <wikibugs>	 07sre-alert-triage, 10SRE Observability (FY2024/2025-Q2): Alert in need of triage: JobUnavailable - https://phabricator.wikimedia.org/T380022#10340525 (10lmata) a:03tappof
[15:13:51] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:14:28] <wikibugs>	 (03PS4) 10Ssingh: wmflib: add facter data for lshw -class memory [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724)
[15:14:48] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:15:55] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:17:19] <wikibugs>	 (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-11-13-145636 to 2024-11-18-142635 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093348 (https://phabricator.wikimedia.org/T376938) (owner: 10Jforrester)
[15:17:52] <wikibugs>	 (03CR) 10David Caro: [C:03+2] toolforge:haproxy: add api gateway health check [puppet] - 10https://gerrit.wikimedia.org/r/1093339 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro)
[15:18:03] <wikibugs>	 (03PS5) 10Elukey: sre.hosts.{dhcp,reimage}: force tftp as default option [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576)
[15:18:29] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-11-13-145636 to 2024-11-18-142635 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093348 (https://phabricator.wikimedia.org/T376938) (owner: 10Jforrester)
[15:18:49] <wikibugs>	 (03CR) 10Ssingh: "Some updates:" [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh)
[15:19:20] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:19:55] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:21:24] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] Update various kafka-main connection strings for kafka-main1006 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093337 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli)
[15:21:28] <wikibugs>	 (03CR) 10Elukey: "test-cookbooked, it seems to be working fine, lemme know!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey)
[15:22:03] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:22:56] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:23:01] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:23:48] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:24:37] <wikibugs>	 (03PS5) 10Brouberol: airflow-analytics-test: deploy the scheduler and kerberos components [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093177 (https://phabricator.wikimedia.org/T380284)
[15:24:37] <wikibugs>	 (03PS1) 10Brouberol: Airflow: add missing hive connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093364 (https://phabricator.wikimedia.org/T380284)
[15:24:39] <wikibugs>	 (03PS1) 10Brouberol: airflow: upgrade base image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093365 (https://phabricator.wikimedia.org/T380284)
[15:24:40] <wikibugs>	 (03PS1) 10Brouberol: airflow: allow multiple DAG folders to be pulled in [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093366 (https://phabricator.wikimedia.org/T380284)
[15:25:04] <wikibugs>	 (03CR) 10JMeybohm: [C:04-1] sre.hosts.{dhcp,reimage}: force tftp as default option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey)
[15:25:22] <wikibugs>	 (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-11-18-142635 to 2024-11-19-132736 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093349 (https://phabricator.wikimedia.org/T378044) (owner: 10Jforrester)
[15:26:15] <wikibugs>	 (03PS6) 10Ssingh: P:hardware::check: add profile to check HW configuration [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724)
[15:26:35] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-11-18-142635 to 2024-11-19-132736 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093349 (https://phabricator.wikimedia.org/T378044) (owner: 10Jforrester)
[15:26:39] <wikibugs>	 (03CR) 10Elukey: sre.hosts.{dhcp,reimage}: force tftp as default option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey)
[15:26:55] <wikibugs>	 (03CR) 10Ssingh: P:hardware::check: add profile to check HW configuration (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1092324 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh)
[15:27:10] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:28:42] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10340624 (10Papaul) @Jhancock.wm @Clement_Goubert the interface on the switch side is up ` xe-0/0/26       up    up   wikikube-worker2140
[15:28:52] <icinga-wm>	 PROBLEM - Hadoop DataNode on an-worker1135 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[15:29:11] <wikibugs>	 (03CR) 10Subramanya Sastry: [C:03+1] Bump wikimedia/parsoid to 0.21.0-a7 [vendor] (wmf/1.44.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1093358 (https://phabricator.wikimedia.org/T373776) (owner: 10Arlolra)
[15:31:52] <icinga-wm>	 RECOVERY - Host ps1-c4-codfw is UP: PING OK - Packet loss = 0%, RTA = 30.95 ms
[15:31:53] <jynus>	 !log starting resharding of commons backup files into new host backup2010  T376892
[15:31:54] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db1206 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 325.16 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:31:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:57] <stashbot>	 T376892: Expand media backup storage available space to 960 TB per datacenter - https://phabricator.wikimedia.org/T376892
[15:32:04] <icinga-wm>	 RECOVERY - Host lsw1-c4-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms
[15:32:32] <arnaudb>	 I'll downtime the host, it serves a very tiny fraction of "legit" mysql traffic
[15:32:41] <arnaudb>	 and is impacted by running dumps
[15:33:22] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] hiera: set do_ipv6_primary_ra for all LVS in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1091243 (https://phabricator.wikimedia.org/T358260) (owner: 10Ssingh)
[15:33:38] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: host overworked by dumps - T368098
[15:33:42] <stashbot>	 T368098: Dumps generation cause disruption to the production environment - https://phabricator.wikimedia.org/T368098
[15:33:51] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: host overworked by dumps - T368098
[15:35:46] <wikibugs>	 (03CR) 10Ebernhardson: [V:03+2 C:03+2] Repoint .gitreview at new repo [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1092936 (owner: 10Ebernhardson)
[15:37:22] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:37:40] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:39:18] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10340666 (10Clement_Goubert) I just managed to mount the IP addresses on the other interface `eno12399np0` and the link is up. Looks like the wrong one go...
[15:39:56] <icinga-wm>	 PROBLEM - Host lsw1-c4-codfw.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:40:09] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release wikifunctions/main-orchestrator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:40:31] <wikibugs>	 (03PS1) 10Brouberol: an-test-client1002: ensure that airflow services are absent [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284)
[15:40:40] <icinga-wm>	 PROBLEM - Host ps1-c4-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[15:41:44] <wikibugs>	 (03PS2) 10Brouberol: an-test-client1002: ensure that airflow services are absent [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284)
[15:41:57] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-11-19-132736 to 2024-11-19-140330 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093369
[15:42:19] <wikibugs>	 (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-11-19-132736 to 2024-11-19-140330 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093369 (owner: 10Jforrester)
[15:43:26] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-11-19-132736 to 2024-11-19-140330 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093369 (owner: 10Jforrester)
[15:43:44] <icinga-wm>	 RECOVERY - ps1-c3-codfw-infeed-load-tower-A-phase-Y on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-A-phase-Y 502 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:43:46] <icinga-wm>	 RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.20 ms
[15:43:52] <icinga-wm>	 RECOVERY - ps1-c3-codfw-infeed-load-tower-B-phase-Z on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-B-phase-Z 434 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:43:52] <icinga-wm>	 RECOVERY - ps1-c3-codfw-infeed-load-tower-A-phase-X on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-A-phase-X 473 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:43:52] <icinga-wm>	 RECOVERY - ps1-c3-codfw-infeed-load-tower-B-phase-Y on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-B-phase-Y 523 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:43:52] <icinga-wm>	 RECOVERY - ps1-c3-codfw-infeed-load-tower-B-phase-X on ps1-c3-codfw is OK: SNMP OK - ps1-c3-codfw-infeed-load-tower-B-phase-X 483 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:43:52] <icinga-wm>	 RECOVERY - Hadoop DataNode on an-worker1135 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process
[15:43:56] <icinga-wm>	 RECOVERY - Host lsw1-c3-codfw.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.76 ms
[15:44:02] <icinga-wm>	 RECOVERY - BGP status on lsw1-c3-codfw.mgmt is OK: BGP OK - up: 2, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:44:02] <icinga-wm>	 RECOVERY - Juniper alarms on lsw1-c3-codfw.mgmt is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[15:44:05] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:44:34] <logmsgbot>	 !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:44:47] <logmsgbot>	 !log dancy@deploy2002 Started scap sync-world: no-op deployment for testing.
[15:45:00] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (DIFF 6 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[15:45:09] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release wikifunctions/main-orchestrator on k8s-staging@eqiad in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:46:30] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10340713 (10Papaul) @Clement_Goubert in your output below you were looking at the second interface (eno12409np1) ` root@wikikube-worker2140:~# ethtool en...
[15:47:51] <wikibugs>	 (03CR) 10Btullis: an-test-client1002: ensure that airflow services are absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[15:48:08] <logmsgbot>	 !log dancy@deploy2002 Finished scap sync-world: no-op deployment for testing. (duration: 03m 21s)
[15:49:22] <wikibugs>	 (03CR) 10Btullis: [C:03+1] Airflow: add missing hive connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093364 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[15:49:34] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:49:36] <wikibugs>	 (03CR) 10Btullis: [C:03+1] airflow: upgrade base image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093365 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[15:49:36] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10340717 (10Clement_Goubert) Yes, `eno12409np1` was the one where the IPs were originally mounted when I encountered the issue. In order to troubleshoot,...
[15:49:40] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] an-test-client1002: ensure that airflow services are absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[15:50:09] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] Airflow: add missing hive connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093364 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[15:50:13] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] airflow: upgrade base image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093365 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[15:50:21] <James_F>	 brouberol: We're in our window right now.
[15:50:26] <logmsgbot>	 !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:50:35] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:51:21] <wikibugs>	 (03CR) 10Btullis: airflow: allow multiple DAG folders to be pulled in (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093366 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[15:51:23] <logmsgbot>	 !log apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:51:32] <brouberol>	 James_F: oh so I should hold off all deployments, even to dse-k8s-eqiad?
[15:51:32] <wikibugs>	 (03Merged) 10jenkins-bot: Airflow: add missing hive connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093364 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[15:51:34] <wikibugs>	 (03Merged) 10jenkins-bot: airflow: upgrade base image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093365 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[15:51:43] <James_F>	 brouberol: Ideally. We'll be done in 8 mins.
[15:51:48] <brouberol>	 no worries
[15:51:52] <James_F>	 Well, we'll be done in a few seconds actually. :-)
[15:51:54] <wikibugs>	 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10340716 (10RobH) Confirmed new window with Willy and sent update to ticket:    > Support, Can we shift this to work on Monday, Nove...
[15:52:29] <James_F>	 brouberol: Over to you.
[15:52:46] <wikibugs>	 (03CR) 10Btullis: an-test-client1002: ensure that airflow services are absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[15:53:20] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10340723 (10Papaul) 05Open→03Resolved  Glad all is working. I am resolving this task. Thank you
[15:53:27] <brouberol>	 👍 thanks!
[15:55:31] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10340731 (10Clement_Goubert) 05Resolved→03Open @Papaul sorry for the misunderstanding, but it's not resolved. The interface that is supposed to have...
[15:55:45] <wikibugs>	 (03CR) 10Brouberol: airflow: allow multiple DAG folders to be pulled in (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093366 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[15:56:51] <wikibugs>	 (03PS3) 10Brouberol: an-test-client1002: ensure that airflow services are absent [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284)
[15:56:55] <wikibugs>	 (03CR) 10Brouberol: an-test-client1002: ensure that airflow services are absent (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[15:57:35] <wikibugs>	 (03CR) 10Btullis: [C:03+1] an-test-client1002: ensure that airflow services are absent [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[15:57:48] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] an-test-client1002: ensure that airflow services are absent [puppet] - 10https://gerrit.wikimedia.org/r/1093368 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[15:58:05] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: Link down for wikikube-worker2140.codfw.wmnet - https://phabricator.wikimedia.org/T380265#10340746 (10Papaul) @Clement_Goubert got you now, I will fix it in Netbox. Sorry, I misunderstood you.
[16:01:47] <wikibugs>	 (03PS1) 10Brouberol: an-test-client1002: disable puppet management of airflow services [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284)
[16:02:06] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: db1246 temporary insetup [puppet] - 10https://gerrit.wikimedia.org/r/1093372 (https://phabricator.wikimedia.org/T374215)
[16:02:07] <wikibugs>	 (03CR) 10Arnaudb: "as discussed on SRE foundation https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-sre-foundations/20241120.txt" [puppet] - 10https://gerrit.wikimedia.org/r/1093372 (https://phabricator.wikimedia.org/T374215) (owner: 10Arnaudb)
[16:04:14] <wikibugs>	 (03CR) 10Btullis: "I don't think we need to do this. We can just manually delete the services." [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[16:04:49] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] ml-staging/experimental: Fix wrongly-scoped limitranges [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093354 (owner: 10Klausman)
[16:05:00] <wikibugs>	 (03CR) 10Brouberol: "I think we do, as puppet is currently broken on the host:" [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[16:06:22] <wikibugs>	 (03CR) 10Klausman: [V:03+1 C:03+2] ml-staging/experimental: Fix wrongly-scoped limitranges [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093354 (owner: 10Klausman)
[16:07:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1093372 (https://phabricator.wikimedia.org/T374215) (owner: 10Arnaudb)
[16:08:02] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] wmflib: add facter data for lshw -class memory [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh)
[16:09:59] <wikibugs>	 (03Merged) 10jenkins-bot: ml-staging/experimental: Fix wrongly-scoped limitranges [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093354 (owner: 10Klausman)
[16:10:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1017.eqiad.wmnet
[16:10:41] <wikibugs>	 (03CR) 10Btullis: [C:03+1] an-test-client1002: disable puppet management of airflow services [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[16:11:42] <wikibugs>	 (03PS2) 10Brouberol: an-test-client1002: disable puppet management of airflow services [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284)
[16:12:11] <wikibugs>	 (03PS3) 10Brouberol: an-test-client1002: disable puppet management of airflow services [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284)
[16:12:56] <wikibugs>	 (03CR) 10Brouberol: "I've made it so that we can have an empty hash of airflow instances, which will sidestep the associated puppet resource creation." [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[16:14:49] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] an-test-client1002: disable puppet management of airflow services [puppet] - 10https://gerrit.wikimedia.org/r/1093373 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[16:15:49] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[16:17:55] <wikibugs>	 (03CR) 10JHathaway: sre.hosts.{dhcp,reimage}: force tftp as default option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1092802 (https://phabricator.wikimedia.org/T363576) (owner: 10Elukey)
[16:17:58] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] Update various kafka-main connection strings for kafka-main1006 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093337 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli)
[16:19:33] <wikibugs>	 (03Merged) 10jenkins-bot: Update various kafka-main connection strings for kafka-main1006 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093337 (https://phabricator.wikimedia.org/T363214) (owner: 10Effie Mouzeli)
[16:19:36] <wikibugs>	 (03PS1) 10Jcrespo: mediabackup: Setup backup1010 as the 6th media backup host in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1093377 (https://phabricator.wikimedia.org/T376892)
[16:20:42] <wikibugs>	 (03CR) 10Jcrespo: [C:04-2] "Do not merge: transfer is ongoing, and the database and software package need to be updated before merging." [puppet] - 10https://gerrit.wikimedia.org/r/1093377 (https://phabricator.wikimedia.org/T376892) (owner: 10Jcrespo)
[16:21:03] <wikibugs>	 (03PS1) 10Brouberol: hotfix: prevent puppet resource creation when no airflow instances are specified [puppet] - 10https://gerrit.wikimedia.org/r/1093378 (https://phabricator.wikimedia.org/T380284)
[16:21:14] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/benthos-cache-invalidator: apply
[16:21:42] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/benthos-cache-invalidator: apply
[16:21:59] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[16:22:19] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[16:22:32] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[16:22:54] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[16:23:28] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[16:23:38] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] hotfix: prevent puppet resource creation when no airflow instances are specified [puppet] - 10https://gerrit.wikimedia.org/r/1093378 (https://phabricator.wikimedia.org/T380284) (owner: 10Brouberol)
[16:23:47] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: apply
[16:25:00] <wikibugs>	 (03PS2) 10Jcrespo: mediabackup: Setup backup1010 as the 6th media backup host in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1093377 (https://phabricator.wikimedia.org/T376892)
[16:25:19] <wikibugs>	 (03PS1) 10Jcrespo: mediabackup: Setup backup2010 as the 6th media backup host in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1093379 (https://phabricator.wikimedia.org/T376892)
[16:25:33] <wikibugs>	 (03CR) 10Jcrespo: [C:04-2] mediabackup: Setup backup2010 as the 6th media backup host in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1093379 (https://phabricator.wikimedia.org/T376892) (owner: 10Jcrespo)
[16:25:51] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[16:26:01] <wikibugs>	 (03PS2) 10Jcrespo: mediabackup: Setup backup2010 as the 6th media backup host in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1093379 (https://phabricator.wikimedia.org/T376892)
[16:26:07] <wikibugs>	 (03CR) 10Jcrespo: "Do not merge- transfer is ongoing, and database and software package needs to be updated before merging." [puppet] - 10https://gerrit.wikimedia.org/r/1093379 (https://phabricator.wikimedia.org/T376892) (owner: 10Jcrespo)
[16:27:19] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply
[16:27:20] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply
[16:28:37] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply
[16:28:54] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db1206 is OK: OK slave_sql_lag Replication lag: 55.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:30:55] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[16:34:50] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[16:35:02] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[16:35:11] <logmsgbot>	 !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[16:35:24] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[16:35:25] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[16:35:35] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[16:35:48] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[16:35:55] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 98.77% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[16:35:57] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply
[16:36:38] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply
[16:37:25] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[16:37:48] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: apply
[16:38:10] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: apply
[16:38:11] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[16:38:42] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[16:48:16] <dcaro>	 idp.wikimedia.org is returning 503, does anyone know if there's an outage going on? (two people are known to have issues so far)
[16:49:03] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-debug: update prometheus-mcrouter-exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093382 (https://phabricator.wikimedia.org/T380212)
[16:55:19] <wikibugs>	 (03PS1) 10David Caro: toolforge:haproxy: monitor the https port, not the internal one [puppet] - 10https://gerrit.wikimedia.org/r/1093384 (https://phabricator.wikimedia.org/T348633)
[16:55:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] toolforge:haproxy: monitor the https port, not the internal one [puppet] - 10https://gerrit.wikimedia.org/r/1093384 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro)
[16:56:38] <wikibugs>	 (03PS2) 10David Caro: toolforge:haproxy: monitor the https port, not the internal one [puppet] - 10https://gerrit.wikimedia.org/r/1093384 (https://phabricator.wikimedia.org/T348633)
[16:56:59] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mw-debug: update prometheus-mcrouter-exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093382 (https://phabricator.wikimedia.org/T380212) (owner: 10Effie Mouzeli)
[16:58:28] <wikibugs>	 06SRE, 10envoy, 06serviceops, 06Traffic: Upgrade Envoy to >= 1.24 - https://phabricator.wikimedia.org/T380211#10340940 (10jijiki) p:05Triage→03Medium
[16:59:42] <wikibugs>	 (03Merged) 10jenkins-bot: mw-debug: update prometheus-mcrouter-exporter image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093382 (https://phabricator.wikimedia.org/T380212) (owner: 10Effie Mouzeli)
[17:00:09] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[17:00:41] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[17:01:02] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[17:02:23] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[17:02:26] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[17:02:58] <wikibugs>	 (03PS1) 10Reedy: UcfirstOverrides: Fix indenting of comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093387
[17:03:24] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[17:04:06] <sukhe>	 !log restart tomcat on idp2004
[17:05:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:05:54] <wikibugs>	 (03PS1) 10Máté Szabó: Configure instrument for the Incident Reporting System [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823)
[17:07:04] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 128, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:07:06] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:07:39] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-debug: enable mcrouter container in pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093390
[17:08:06] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] wmflib: add facter data for lshw -class memory [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh)
[17:09:41] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1092844 (https://phabricator.wikimedia.org/T378724) (owner: 10Ssingh)
[17:12:46] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mw-debug: enable mcrouter container in pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093390 (owner: 10Effie Mouzeli)
[17:12:53] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@295d5a4]: Regular analytics weekly train BIS [analytics/refinery@295d5a44]
[17:14:01] <wikibugs>	 (03Merged) 10jenkins-bot: mw-debug: enable mcrouter container in pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1093390 (owner: 10Effie Mouzeli)
[17:16:34] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@295d5a4]: Regular analytics weekly train BIS [analytics/refinery@295d5a44] (duration: 03m 41s)
[17:16:42] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[17:17:01] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@295d5a4] (thin): Regular analytics weekly train BIS THIN [analytics/refinery@295d5a44]
[17:19:41] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[17:19:58] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[17:20:02] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[17:20:13] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[17:22:03] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@295d5a4] (thin): Regular analytics weekly train BIS THIN [analytics/refinery@295d5a44] (duration: 05m 02s)
[17:22:58] <wikibugs>	 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be1005 - https://phabricator.wikimedia.org/T370453#10341044 (10elukey) Quick note about the reimage step - due to a bug in Supermicro's BMC firmware (at least, this is what we suspect) the first reimage ru...
[17:27:17] <effie>	 !jouncebot  now
[17:27:17] <wm-bot>	 a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot
[17:27:31] <effie>	 jouncebot:  now
[17:27:31] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 32 minute(s)
[17:27:45] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[17:28:15] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[17:28:49] <logmsgbot>	 !log joal@deploy2002 Started deploy [analytics/refinery@295d5a4] (hadoop-test): Regular analytics weekly train BIS TEST [analytics/refinery@295d5a44]
[17:28:54] <wikibugs>	 (03CR) 10David Caro: [C:03+2] toolforge:haproxy: monitor the https port, not the internal one [puppet] - 10https://gerrit.wikimedia.org/r/1093384 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro)
[17:32:25] <logmsgbot>	 !log joal@deploy2002 Finished deploy [analytics/refinery@295d5a4] (hadoop-test): Regular analytics weekly train BIS TEST [analytics/refinery@295d5a44] (duration: 03m 36s)
[17:38:45] <wikibugs>	 (03CR) 10Máté Szabó: [C:04-2] "DNM due to pending L3SC" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093389 (https://phabricator.wikimedia.org/T372823) (owner: 10Máté Szabó)
[17:43:06] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1091868 (https://phabricator.wikimedia.org/T379317) (owner: 10NMW03)
[17:45:29] <wikibugs>	 (03PS1) 10Btullis: Update spark shufflers on the test cluster to deploy version 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/1093394 (https://phabricator.wikimedia.org/T380040)
[17:47:34] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1093394 (https://phabricator.wikimedia.org/T380040) (owner: 10Btullis)
[17:49:07] <jinxer-wm>	 FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[17:50:13] <wikibugs>	 (03PS1) 10David Caro: toolforge:haproxy: use the external name and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633)
[17:50:50] <wikibugs>	 (03CR) 10CI reject: [V:04-1] toolforge:haproxy: use the external name and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro)
[17:52:08] <wikibugs>	 (03PS1) 10Joal: Bump gobblin-wmf jar to newest version [puppet] - 10https://gerrit.wikimedia.org/r/1093396 (https://phabricator.wikimedia.org/T376144)
[17:52:16] <wikibugs>	 (03PS2) 10David Caro: toolforge:haproxy: use the external name and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633)
[17:52:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] toolforge:haproxy: use the external name and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro)
[17:54:15] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Bump gobblin-wmf jar to newest version [puppet] - 10https://gerrit.wikimedia.org/r/1093396 (https://phabricator.wikimedia.org/T376144) (owner: 10Joal)
[17:55:20] <wikibugs>	 (03CR) 10BCornwall: [C:03+1] Failover analytics-hive to standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/1093355 (https://phabricator.wikimedia.org/T377938) (owner: 10Btullis)
[17:55:50] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Failover analytics-hive to standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/1093355 (https://phabricator.wikimedia.org/T377938) (owner: 10Btullis)
[17:59:07] <jinxer-wm>	 RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures
[17:59:25] <wikibugs>	 (03PS3) 10David Caro: toolforge:haproxy: use the external name and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633)
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1800)
[18:03:53] <wikibugs>	 (03PS4) 10David Caro: toolforge:haproxy: use the external name and ip and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633)
[18:04:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] toolforge:haproxy: use the external name and ip and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro)
[18:05:07] <wikibugs>	 (03PS5) 10David Caro: toolforge:haproxy: use the external name and ip and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633)
[18:06:25] <wikibugs>	 (03PS6) 10David Caro: toolforge:haproxy: use the external name and ip and force tls [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633)
[18:07:12] <wikibugs>	 (03PS1) 10C. Scott Ananian: Turn on Parsoid fragment support everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093399 (https://phabricator.wikimedia.org/T374661)
[18:09:47] <wikibugs>	 (03PS1) 10Greg Grossmeier: CSP for banner preview: allow remind me later SMS host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093401 (https://phabricator.wikimedia.org/T380232)
[18:09:54] <wikibugs>	 (03CR) 10David Caro: [C:03+2] "I should add a test to this :/, tested in toolsbeta now" [puppet] - 10https://gerrit.wikimedia.org/r/1093395 (https://phabricator.wikimedia.org/T348633) (owner: 10David Caro)
[18:26:57] <wikibugs>	 (03CR) 10Herron: [C:03+2] "self-merging to get these VM builds started" [puppet] - 10https://gerrit.wikimedia.org/r/1092922 (https://phabricator.wikimedia.org/T378986) (owner: 10Herron)
[18:30:28] <wikibugs>	 (03PS1) 10C. Scott Ananian: Deploy Parsoid Read Views to de/ja/ru wikivoyage, incubatorwiki and dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093405 (https://phabricator.wikimedia.org/T375394)
[18:50:25] <wikibugs>	 (03PS1) 10Herron: site: fix site prefix typo in hostname [puppet] - 10https://gerrit.wikimedia.org/r/1093407
[18:51:34] <wikibugs>	 (03CR) 10Herron: [C:03+2] site: fix site prefix typo in hostname [puppet] - 10https://gerrit.wikimedia.org/r/1093407 (owner: 10Herron)
[18:52:25] <wikibugs>	 06SRE, 06SRE-OnFire, 06SRE Observability: VictorOps paged batphone immediately rather than after 5m - https://phabricator.wikimedia.org/T371244#10341512 (10lmata) Case 3622388 created! on splunk support added @andrea.denisse @colewhite and @herron as watchers in case support responds while i'm away.
[18:52:44] <wikibugs>	 (03PS1) 10Jdlrobson: Temporarily disable dark mode for anonymous users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093408 (https://phabricator.wikimedia.org/T379765)
[18:53:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, November 20 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093408 (https://phabricator.wikimedia.org/T379765) (owner: 10Jdlrobson)
[18:58:44] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-ctrl2002.codfw.wmnet
[18:58:45] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.dns.netbox
[18:59:17] <wikibugs>	 10ops-magru, 06SRE, 06Traffic, 13Patch-For-Review: magru: Incorrect racking for magru hosts (F-25G and Custom Config interchanged) - https://phabricator.wikimedia.org/T376737#10341561 (10RobH) New work window confirmed by ascenty:  Comentário gerado em Smart Hands: Hello,    > We received the Ticket and sc...
[19:00:05] <jouncebot>	 andre and brennen: Time to do the MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1900).
[19:00:32] <brennen>	 nothing for this window.
[19:03:16] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-ctrl2002.codfw.wmnet - herron@cumin1002"
[19:03:20] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-ctrl2002.codfw.wmnet - herron@cumin1002"
[19:03:20] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:03:20] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-ctrl2002.codfw.wmnet on all recursors
[19:03:23] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-ctrl2002.codfw.wmnet on all recursors
[19:03:49] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-ctrl2002.codfw.wmnet - herron@cumin1002"
[19:03:53] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-ctrl2002.codfw.wmnet - herron@cumin1002"
[19:04:33] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-ctrl2002.codfw.wmnet with OS bookworm
[19:04:43] <wikibugs>	 06SRE, 10vm-requests, 07Kubernetes: codfw: (2x) aux-k8s-ctrl nodes - https://phabricator.wikimedia.org/T378986#10341601 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-ctrl2002.codfw.wmnet with OS bookworm
[19:08:28] <inflatador>	 !log bking@krb1001 add kerberos keytab for blunderbuss https://phabricator.wikimedia.org/P71106 T371994
[19:08:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:36] <stashbot>	 T371994: Deploy the HDFS synchronizer (blunderbuss) service to the dse-k8s cluster - https://phabricator.wikimedia.org/T371994
[19:12:15] <urandom>	 !log bootstrapping cassandra, restbase2038-{a,b,c} — T380236
[19:12:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:21] <stashbot>	 T380236: Refresh restbase202[1-3] w/ restbase203[6-8] - https://phabricator.wikimedia.org/T380236
[19:14:35] <wikibugs>	 (03CR) 10Wfan: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1093401 (https://phabricator.wikimedia.org/T380232) (owner: 10Greg Grossmeier)
[19:17:55] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-ctrl2002.codfw.wmnet with reason: host reimage
[19:20:42] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-ctrl2002.codfw.wmnet with reason: host reimage
[19:35:01] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-ctrl2002.codfw.wmnet with OS bookworm
[19:35:02] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-ctrl2002.codfw.wmnet
[19:35:08] <wikibugs>	 06SRE, 10vm-requests, 07Kubernetes: codfw: (2x) aux-k8s-ctrl nodes - https://phabricator.wikimedia.org/T378986#10341673 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-ctrl2002.codfw.wmnet with OS bookworm completed: - aux-k8s-ctrl2002 (**PASS**)   - R...
[19:41:24] <dancy>	 jouncebot nowandnext
[19:41:24] <jouncebot>	 For the next 1 hour(s) and 18 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T1900)
[19:41:24] <jouncebot>	 In 1 hour(s) and 18 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T2100)
[19:42:06] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.126.0" for 209 hosts
[19:47:17] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye
[19:47:32] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10341702 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye
[19:51:08] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.126.0" for 1 hosts
[19:52:32] <logmsgbot>	 !log hashar@deploy2002 Started deploy [integration/docroot@1627206]: build: update mediawiki-codesniffer to 45.0.0 & prevent LibUp from removing a phpcs rule
[19:52:43] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [integration/docroot@1627206]: build: update mediawiki-codesniffer to 45.0.0 & prevent LibUp from removing a phpcs rule (duration: 00m 10s)
[19:59:53] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2025 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8699 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[20:03:29] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es2041.codfw.wmnet with OS bookworm
[20:03:40] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10341785 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm
[20:04:53] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2025 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8681 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[20:05:39] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2005.codfw.wmnet with OS bullseye
[20:05:47] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10341789 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye ex...
[20:08:33] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye
[20:08:39] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10341806 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye
[20:10:13] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.126.0" for 1 hosts
[20:11:50] <wikibugs>	 (03PS1) 10Fabfur: benthos: WIP for haproxy debug functions [puppet] - 10https://gerrit.wikimedia.org/r/1093413 (https://phabricator.wikimedia.org/T329332)
[20:13:13] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.ganeti.makevm for new host aux-k8s-ctrl2003.codfw.wmnet
[20:13:15] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.dns.netbox
[20:13:53] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2025 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8670 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[20:14:53] <icinga-wm>	 RECOVERY - Cassandra instance data free space on restbase2025 is OK: DISK OK - free space: /srv/cassandra/instance-data 14024 MB (33% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[20:26:48] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-ctrl2003.codfw.wmnet - herron@cumin1002"
[20:28:02] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aux-k8s-ctrl2003.codfw.wmnet - herron@cumin1002"
[20:28:02] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:28:02] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.dns.wipe-cache aux-k8s-ctrl2003.codfw.wmnet on all recursors
[20:28:05] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aux-k8s-ctrl2003.codfw.wmnet on all recursors
[20:28:32] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-ctrl2003.codfw.wmnet - herron@cumin1002"
[20:28:36] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM aux-k8s-ctrl2003.codfw.wmnet - herron@cumin1002"
[20:30:01] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2005.codfw.wmnet with OS bullseye
[20:30:25] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10341887 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye ex...
[20:30:28] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye
[20:30:42] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10341889 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye
[20:32:47] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.hosts.reimage for host aux-k8s-ctrl2003.codfw.wmnet with OS bookworm
[20:32:53] <wikibugs>	 06SRE, 10vm-requests, 07Kubernetes: codfw: (2x) aux-k8s-ctrl nodes - https://phabricator.wikimedia.org/T378986#10341893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by herron@cumin1002 for host aux-k8s-ctrl2003.codfw.wmnet with OS bookworm
[20:39:42] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.126.0" for 1 hosts
[20:40:38] <logmsgbot>	 !log dancy@deploy2002 Installation of scap version "4.126.0" completed for 1 hosts
[20:44:29] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[20:47:15] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2041.codfw.wmnet with OS bookworm
[20:47:33] <wikibugs>	 10ops-codfw, 06SRE, 06Data-Persistence, 06Data-Persistence-SRE, and 3 others: Q2:rack/setup/install es204[1-6] - https://phabricator.wikimedia.org/T378146#10341929 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es2041.codfw.wmnet with OS bookworm executed...
[20:48:00] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[20:48:02] <logmsgbot>	 !log herron@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-ctrl2003.codfw.wmnet with reason: host reimage
[20:48:05] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[20:48:42] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[20:49:07] <Nemoralis>	 o/
[20:49:41] <Nemoralis>	 jouncebot: next
[20:49:41] <jouncebot>	 In 0 hour(s) and 10 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T2100)
[20:49:49] * anzx 👋
[20:51:41] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-ctrl2003.codfw.wmnet with reason: host reimage
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T2100).
[21:00:05] <jouncebot>	 albertoleoncio, arlolra, Nemoralis, anzx, and bwang: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:10] <anzx>	 o/
[21:00:20] <albertoleoncio>	 Hi!
[21:00:21] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2005.codfw.wmnet with OS bullseye
[21:00:26] <Nemoralis>	 o/
[21:00:35] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10341947 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye ex...
[21:01:11] <Nemoralis>	 quick question, my patch depends on another patch (on the WikimediaMessages repo). Should I add that to the deployment window?
[21:01:22] <Nemoralis>	 https://phabricator.wikimedia.org/T379317
[21:03:39] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye
[21:03:49] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10341965 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye
[21:05:08] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-ctrl2003.codfw.wmnet with OS bookworm
[21:05:08] <logmsgbot>	 !log herron@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aux-k8s-ctrl2003.codfw.wmnet
[21:05:14] <wikibugs>	 SRE, vm-requests, Kubernetes: codfw: (2x) aux-k8s-ctrl nodes - https://phabricator.wikimedia.org/T378986#10341979 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by herron@cumin1002 for host aux-k8s-ctrl2003.codfw.wmnet with OS bookworm completed: - aux-k8s-ctrl2003 (**PASS**)   - R...
[21:05:18] <albertoleoncio>	 Nemoralis: I think you need to ask someone to merge the messages first, wait about a week, and then deploy the config change
[21:05:42] <albertoleoncio>	 But I'm not really sure
[21:06:00] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.124.0" for 209 hosts
[21:07:01] <cscott>	 i'm here for the arlolra patch (arlo will probably also be around)
[21:08:29] <cjming>	 hi - pardon lateness - i can deploy if needed
[21:08:47] <logmsgbot>	 !log dancy@deploy2002 Installing scap version "4.124.0" for 209 hosts
[21:08:56] <albertoleoncio>	 Thanks!
[21:10:09] <wikibugs>	 (PS3) Albertoleoncio: [ptwiki] Enable the CampaignEvents extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1091810 (https://phabricator.wikimedia.org/T380090)
[21:10:15] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10342002 (phaultfinder)
[21:10:56] <cjming>	 cscott: presumably the 2 backports can go out together?
[21:11:31] <cscott>	 cjming: yes, arlo/i are backporting a mediawiki-vendor patch to bump parsoid. i believe that's done by putting the two patches together on the same `scap backport` command
[21:12:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1091810 (https://phabricator.wikimedia.org/T380090) (owner: Albertoleoncio)
[21:12:21] <wikibugs>	 (CR) TrainBranchBot: "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1091810 (https://phabricator.wikimedia.org/T380090) (owner: Albertoleoncio)
[21:12:53] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2025 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8673 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[21:13:25] <wikibugs>	 (Merged) jenkins-bot: [ptwiki] Enable the CampaignEvents extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1091810 (https://phabricator.wikimedia.org/T380090) (owner: Albertoleoncio)
[21:13:54] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1091810|[ptwiki] Enable the CampaignEvents extension (T380090)]]
[21:13:58] <stashbot>	 T380090: Enable CampaignEvents Extension on ptwiki - https://phabricator.wikimedia.org/T380090
[21:14:13] <cjming>	 albertoleoncio: on test servers if you'd like to verify
[21:14:18] <cjming>	 lmk if/when to sync
[21:14:29] <albertoleoncio>	 Let me check
[21:14:44] <cjming>	 oh whoops - sorry - just a sec
[21:14:57] <bwang>	 hello! sorry im late, but im ready to deploy
[21:15:27] <cjming>	 albertoleoncio: hold on a sec - should be ready soon
[21:15:45] <albertoleoncio>	 ok
[21:15:53] * cjming waves to bwang
[21:16:39] <cjming>	 bwang: no worries - i'm running thru the queue in order
[21:19:20] <albertoleoncio>	 cjming: Working now!
[21:19:30] <cjming>	 really?
[21:19:37] <albertoleoncio>	 Yep
[21:19:54] <logmsgbot>	 !log cjming@deploy2002 cjming, albertoleoncio: Backport for [[gerrit:1091810|[ptwiki] Enable the CampaignEvents extension (T380090)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:19:55] <cjming>	 cool - so i will sync!
[21:19:58] <stashbot>	 T380090: Enable CampaignEvents Extension on ptwiki - https://phabricator.wikimedia.org/T380090
[21:20:13] <logmsgbot>	 !log cjming@deploy2002 cjming, albertoleoncio: Continuing with sync
[21:20:14] <albertoleoncio>	 On k8s-mwdebug, I mean =D
[21:23:44] <cjming>	 Nemoralis: regarding your dependent patch - i'm not sure either - hopefully someone here can confirm
[21:28:58] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1091810|[ptwiki] Enable the CampaignEvents extension (T380090)]] (duration: 15m 04s)
[21:29:02] <stashbot>	 T380090: Enable CampaignEvents Extension on ptwiki - https://phabricator.wikimedia.org/T380090
[21:29:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T379668#10342045 (phaultfinder)
[21:30:10] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [vendor] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1093358 (https://phabricator.wikimedia.org/T373776) (owner: Arlolra)
[21:30:11] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1093359 (https://phabricator.wikimedia.org/T380333) (owner: Arlolra)
[21:30:14] <Nemoralis>	 cjming: I mean, we can deploy it but only the message names (keys) will be visible until the messages are synchronized
[21:30:22] <Nemoralis>	 unless we backport messages too lol
[21:30:51] <cjming>	 Nemoralis: do you want to add the messages patch to the queue?
[21:31:06] <cjming>	 albertoleoncio: should be live!
[21:31:06] <Nemoralis>	 I can if it is possible
[21:31:30] <logmsgbot>	 !log jhathaway@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2005.codfw.wmnet with OS bullseye
[21:31:34] <cjming>	 np - i'm doing cscott's/arlo's backports now - we can do yours after
[21:31:40] <albertoleoncio>	 cjming: It's live already, since some minutes ago :-)
[21:31:41] <cscott>	 thanks!
[21:31:43] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10342060 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye ex...
[21:32:12] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye
[21:32:13] <Nemoralis>	 cjming: I am adding then
[21:32:20] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10342063 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye
[21:32:21] <cjming>	 sure - np
[21:33:39] <Nemoralis>	 done
[21:34:24] <cjming>	 cool
[21:38:00] <bwang>	 just lmk when its my turn :)
[21:38:17] <cjming>	 will do!
[21:40:28] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[21:41:09] <cjming>	 oof - cscott your patches are going to take a while to merge -- sorry everyone - i should have merged them at the top of the hour
[21:41:55] <cscott>	 no worries i'm here
[21:42:53] <icinga-wm>	 RECOVERY - Cassandra instance data free space on restbase2025 is OK: DISK OK - free space: /srv/cassandra/instance-data 13401 MB (31% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[21:43:40] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2005.codfw.wmnet with reason: host reimage
[21:44:16] <Nemoralis1>	 is it because it is a dependency update?
[21:46:12] <cjming>	 not sure what you're asking but the vendor/core backports merge estimates are 20+ minutes (shorter now)
[21:46:37] <Nemoralis>	 .
[21:47:04] <cjming>	 once they're thru tho, the rest of the config patches in the queue should be zippy
[21:47:30] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2005.codfw.wmnet with reason: host reimage
[21:50:38] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[21:52:49] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[21:55:37] <wikibugs>	 (Merged) jenkins-bot: Bump wikimedia/parsoid to 0.21.0-a7 [vendor] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1093358 (https://phabricator.wikimedia.org/T373776) (owner: Arlolra)
[21:55:54] <arlolra>	 boom
[21:56:00] <cjming>	 lol
[21:56:50] <cjming>	 the other one still has a bit to go
[21:57:53] <cjming>	 fwiw i'm happy to go long and do the rest of the patches if no one needs the window after this
[22:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241120T2200)
[22:00:53] <cjming>	 UTC late backport window is running a little over - is that ok?
[22:01:05] <Nemoralis>	 no problem for me
[22:01:05] <cscott>	 ok with me
[22:02:30] <wikibugs>	 (Merged) jenkins-bot: Bump wikimedia/parsoid to 0.21.0-a7 [core] (wmf/1.44.0-wmf.4) - https://gerrit.wikimedia.org/r/1093359 (https://phabricator.wikimedia.org/T380333) (owner: Arlolra)
[22:02:31] <cjming>	 Abstract Wikipedia team - lmk if you are waiting -- otherwise i'll continue onward
[22:02:38] <cjming>	 finally
[22:02:59] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[22:03:02] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1093358|Bump wikimedia/parsoid to 0.21.0-a7 (T373776 T380333)]], [[gerrit:1093359|Bump wikimedia/parsoid to 0.21.0-a7 (T380333)]]
[22:03:08] <stashbot>	 T373776: Parsoid does not correctly render <noinclude> if used with templates - https://phabricator.wikimedia.org/T373776
[22:03:09] <stashbot>	 T380333: CTT midweek deploy - https://phabricator.wikimedia.org/T380333
[22:03:48] <Jdlrobson>	 cjming: the patch that bwang is deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1093408 is pretty important to go out today - dark mode for anonymous users is rendering the wrong colors in certain places which is important we fix. I was in meetings but can also take over for bwang if he needs to run
[22:04:34] <cjming>	 Jdlrobson: sounds good
[22:05:54] <Jdlrobson>	 thanks cjming  :)
[22:06:12] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[22:08:47] <logmsgbot>	 !log cjming@deploy2002 arlolra, cjming: Backport for [[gerrit:1093358|Bump wikimedia/parsoid to 0.21.0-a7 (T373776 T380333)]], [[gerrit:1093359|Bump wikimedia/parsoid to 0.21.0-a7 (T380333)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:08:49] <cjming>	 arlolra, cscott - on mwdebug - lmk when i can sync
[22:08:51] <stashbot>	 T373776: Parsoid does not correctly render <noinclude> if used with templates - https://phabricator.wikimedia.org/T373776
[22:08:52] <stashbot>	 T380333: CTT midweek deploy - https://phabricator.wikimedia.org/T380333
[22:09:01] <arlolra>	 ok, thanks, testing
[22:09:34] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhathaway@cumin2002"
[22:11:20] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhathaway@cumin2002"
[22:11:21] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2005.codfw.wmnet with OS bullseye
[22:11:28] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10342161 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye co...
[22:11:55] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2025 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8702 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[22:12:16] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2005.codfw.wmnet with OS bullseye
[22:12:24] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10342174 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye
[22:13:02] <arlolra>	 cjming: ok, lgtm
[22:13:07] <cjming>	 oh good
[22:13:11] <logmsgbot>	 !log cjming@deploy2002 arlolra, cjming: Continuing with sync
[22:14:00] <wikibugs>	 (PS3) NMW03: Add contact form for U4C [mediawiki-config] - https://gerrit.wikimedia.org/r/1091868 (https://phabricator.wikimedia.org/T379317)
[22:16:15] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[22:16:24] <bwang>	 im still here as well to help test
[22:17:22] <cjming>	 great - i'm just going to keep plowing thru the queue - should be relatively quick
[22:17:29] <anzx>	 cscott: could you review https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Scribunto/+/1093422?forceReload=true should this be removed after namespace is live on wiki https://github.com/wikimedia/operations-mediawiki-config/blob/ebe7b5ea3a09cd6e334dda4128df5c7e9f45e2b3/wmf-config/core-Namespaces.php#L2906
[22:18:45] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[22:18:49] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release blunderbuss/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=blunderbuss - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:20:14] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093358|Bump wikimedia/parsoid to 0.21.0-a7 (T373776 T380333)]], [[gerrit:1093359|Bump wikimedia/parsoid to 0.21.0-a7 (T380333)]] (duration: 17m 11s)
[22:20:19] <stashbot>	 T373776: Parsoid does not correctly render <noinclude> if used with templates - https://phabricator.wikimedia.org/T373776
[22:20:19] <stashbot>	 T380333: CTT midweek deploy - https://phabricator.wikimedia.org/T380333
[22:20:55] <icinga-wm>	 PROBLEM - Cassandra instance data free space on restbase2025 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 8715 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[22:21:28] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1091868 (https://phabricator.wikimedia.org/T379317) (owner: NMW03)
[22:21:47] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[22:22:13] <arlolra>	 thanks cjming
[22:22:24] <wikibugs>	 (Merged) jenkins-bot: Add contact form for U4C [mediawiki-config] - https://gerrit.wikimedia.org/r/1091868 (https://phabricator.wikimedia.org/T379317) (owner: NMW03)
[22:22:36] <cjming>	 arlolra: yw! glad it worked out
[22:22:51] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1091868|Add contact form for U4C (T379317)]]
[22:22:51] <Nemoralis>	 cjming: don't forget the strings lol
[22:23:01] <cscott>	 anzx: yes, i think the mediawiki-config clause can be deleted once scribunto is defining the namespace itself.
[22:23:03] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[22:23:05] <stashbot>	 T379317: Contact form requested - U4C - https://phabricator.wikimedia.org/T379317
[22:23:26] <cjming>	 Nemoralis: i think your other patch just needs a merge -- and then if you want to backport, you need to set those patches up
[22:23:49] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release blunderbuss/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=blunderbuss - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:24:16] <anzx>	 cscott: thanks 
[22:24:53] <logmsgbot>	 !log jhathaway@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2005.codfw.wmnet with reason: host reimage
[22:25:08] <Nemoralis>	 cjming: really? I don't think so. messages are usually updated during the mediawiki train. I don't think it will work with just merging 
[22:25:09] <cjming>	 Nemoralis: i'm going to go ahead and sync your config patch and move on -- lmk if you need backports for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1091869
[22:25:42] <cjming>	 if you do ^^, please create those patches and i can do those when they're ready
[22:25:42] <Nemoralis>	 I think I do, otherwise contact form will display message keys instead of its actual content 
[22:25:58] <Nemoralis>	 what patches?
[22:26:06] <Reedy>	 That config patch shouldn't have been merged without the strings being merged...
[22:26:22] <Reedy>	 as it's completely useless standalone
[22:26:38] <Nemoralis>	 yes, that's what I was asking at first
[22:27:03] <cjming>	 oh whoops - can we just merge https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1091869?  and do backports for 1.44.0-wmf.4 and 3?
[22:27:36] <Reedy>	 Well, no one has even reviewed the master patch yet
[22:27:37] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2005.codfw.wmnet with reason: host reimage
[22:28:08] <logmsgbot>	 !log cjming@deploy2002 nmw03, cjming: Backport for [[gerrit:1091868|Add contact form for U4C (T379317)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:28:13] <stashbot>	 T379317: Contact form requested - U4C - https://phabricator.wikimedia.org/T379317
[22:28:51] <cjming>	 ok - then i'm going to not sync the config patch
[22:29:28] <cjming>	 Nemoralis: if you can get your master patch merged and backports set up, i'm happy to do those after i finish up the rest of the queue
[22:29:38] <Nemoralis>	 contact form works fine btw, we just need strings https://i.imgur.com/BrmouqZ.png
[22:29:49] <Nemoralis>	 cjming: do you mean strings by master patch?
[22:31:23] <cjming>	 Nemoralis: master patch is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMessages/+/1091869 << this needs merging and then if you want 1.44.0-wmf.4 and 1.44.0-wmf.3 to have the changes, we need those patches too
[22:31:35] <Nemoralis>	 alright
[22:31:37] <logmsgbot>	 !log cjming@deploy2002 Sync cancelled.
[22:32:45] <wikibugs>	 (PS1) TrainBranchBot: Revert "Add contact form for U4C" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093446
[22:32:46] <wikibugs>	 (CR) TrainBranchBot: "cjming@deploy2002 created a revert of this change as I91382012f2d2d4cc23f4c8f6699d7bffc0be2462" [mediawiki-config] - https://gerrit.wikimedia.org/r/1091868 (https://phabricator.wikimedia.org/T379317) (owner: NMW03)
[22:33:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093446 (owner: TrainBranchBot)
[22:33:26] <cjming>	 anzx: are you still around? 
[22:33:39] <anzx>	 cjming: yes
[22:34:23] <cjming>	 cool - will do yours now, then bwang's, then Nemoralis' if they're ready
[22:34:26] <wikibugs>	 (Merged) jenkins-bot: Revert "Add contact form for U4C" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093446 (owner: TrainBranchBot)
[22:34:44] <wikibugs>	 (PS3) Anzx: knwiki: update portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1093328 (https://phabricator.wikimedia.org/T380366)
[22:34:57] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1093446|Revert "Add contact form for U4C"]]
[22:35:53] <Nemoralis>	 I am not sure yet, let me check if I can find someone from the LPL
[22:37:06] <cjming>	 Nemoralis: apologies if i led you astray
[22:37:16] <Nemoralis>	 no worries :)
[22:37:55] <icinga-wm>	 RECOVERY - Cassandra instance data free space on restbase2025 is OK: DISK OK - free space: /srv/cassandra/instance-data 14918 MB (35% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data
[22:39:12] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[22:39:20] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[22:40:37] <logmsgbot>	 !log cjming@deploy2002 trainbranchbot, cjming: Backport for [[gerrit:1093446|Revert "Add contact form for U4C"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:40:42] <logmsgbot>	 !log cjming@deploy2002 trainbranchbot, cjming: Continuing with sync
[22:41:19] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[22:41:55] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[22:49:15] <logmsgbot>	 !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2005.codfw.wmnet with OS bullseye
[22:49:20] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093446|Revert "Add contact form for U4C"]] (duration: 14m 22s)
[22:49:53] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093328 (https://phabricator.wikimedia.org/T380366) (owner: Anzx)
[22:50:21] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10342292 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin2002 for host thanos-be2005.codfw.wmnet with OS bullseye co...
[22:50:35] <wikibugs>	 (Merged) jenkins-bot: knwiki: update portal namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1093328 (https://phabricator.wikimedia.org/T380366) (owner: Anzx)
[22:51:00] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1093328|knwiki: update portal namespace (T380366)]]
[22:51:04] <stashbot>	 T380366: knwiki: update portal namespace - https://phabricator.wikimedia.org/T380366
[22:52:29] <wikibugs>	 (PS6) Bking: dse-k8s-services: introduce Blunderbuss config [deployment-charts] - https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994)
[22:52:58] <brett>	 !log Import libvmod-querysort 0.4-3 into varnish-staging apt component
[22:53:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:55:09] <wikibugs>	 (PS2) Jdlrobson: Temporarily disable dark mode for anonymous users [mediawiki-config] - https://gerrit.wikimedia.org/r/1093408 (https://phabricator.wikimedia.org/T379765)
[22:55:25] <logmsgbot>	 !log cjming@deploy2002 cjming, anzx: Backport for [[gerrit:1093328|knwiki: update portal namespace (T380366)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:55:28] <anzx>	 cjming: checking
[22:55:36] <cjming>	 anzx: ty
[22:55:41] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q1:rack/setup/install thanos-be2005 - https://phabricator.wikimedia.org/T370452#10342303 (jhathaway) @elukey thanos-be2005 is now re-imaging without any user intervention. It wasn't quite as easy as just running the re-image script...
[22:55:55] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 100% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[22:56:22] <anzx>	 cjming: looks good 
[22:56:25] <logmsgbot>	 !log cjming@deploy2002 cjming, anzx: Continuing with sync
[22:56:41] <icinga-wm>	 PROBLEM - Check size of conntrack table on krb1001 is CRITICAL: CRITICAL: nf_conntrack is 90 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[22:57:41] <icinga-wm>	 RECOVERY - Check size of conntrack table on krb1001 is OK: OK: nf_conntrack is 74 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[23:00:56] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 91.24% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[23:03:18] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093328|knwiki: update portal namespace (T380366)]] (duration: 12m 17s)
[23:03:22] <stashbot>	 T380366: knwiki: update portal namespace - https://phabricator.wikimedia.org/T380366
[23:03:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1093408 (https://phabricator.wikimedia.org/T379765) (owner: Jdlrobson)
[23:03:41] <anzx>	 cjming: thank you for deploy
[23:03:50] <bwang>	 @jdlrobson, can you take over the deploy validation?
[23:03:55] <cjming>	 anzx: yw!
[23:04:06] <wikibugs>	 (Merged) jenkins-bot: Temporarily disable dark mode for anonymous users [mediawiki-config] - https://gerrit.wikimedia.org/r/1093408 (https://phabricator.wikimedia.org/T379765) (owner: Jdlrobson)
[23:04:36] <logmsgbot>	 !log cjming@deploy2002 Started scap sync-world: Backport for [[gerrit:1093408|Temporarily disable dark mode for anonymous users (T379765)]]
[23:04:40] <stashbot>	 T379765: Nov 19: Vector 2022 Deployments - https://phabricator.wikimedia.org/T379765
[23:04:49] <cjming>	 bwang: sorry it's taken so long - should be verifiable soon on test servers - probably in a minute or 2 cc Jdlrobson
[23:04:58] <bwang>	 sounds good i can stay on then
[23:05:31] <Jdlrobson>	 👍
[23:08:32] <logmsgbot>	 !log cjming@deploy2002 jdlrobson, cjming: Backport for [[gerrit:1093408|Temporarily disable dark mode for anonymous users (T379765)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:08:35] <cjming>	 bwang: please check on test servers and lmk when to sync
[23:09:58] <bwang>	 which server?
[23:10:14] <cjming>	 mwdebug
[23:10:30] <cjming>	 do you have the browser extension?
[23:10:41] <bwang>	 ok i got it! thank you 
[23:10:43] <bwang>	 it looks good
[23:10:53] <cjming>	 awesome
[23:10:57] <cjming>	 syncing
[23:10:59] <logmsgbot>	 !log cjming@deploy2002 jdlrobson, cjming: Continuing with sync
[23:11:01] <Jdlrobson>	 cjming: yep lgtm too
[23:11:15] <cjming>	 should be live shortly
[23:14:09] <cjming>	 Nemoralis: looks like the master patch is still being reviewed - is it safe to say it's not going to be ready?
[23:16:19] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:17:29] <cjming>	 i can stay on for a bit longer if you think it (and backports) will be ready -- otherwise i might suggest getting the backports set up for the next available window...  i am also unclear if there needs to be time for the strings to propagate before the config is merged
[23:17:42] <logmsgbot>	 !log cjming@deploy2002 Finished scap sync-world: Backport for [[gerrit:1093408|Temporarily disable dark mode for anonymous users (T379765)]] (duration: 13m 06s)
[23:17:47] <stashbot>	 T379765: Nov 19: Vector 2022 Deployments - https://phabricator.wikimedia.org/T379765
[23:19:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:22:02] <cjming>	 !log end of UTC late backport window
[23:22:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:55] <Jdlrobson>	 thanks cjming 
[23:49:47] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1143 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration