[00:04:25] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[00:06:10] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[00:29:56] <Tamzin>	 hello friends! getting a lot of replag on enwiki right now. since wikimediastatus.net isn't showing any issues, just checking in to see if that's something y'all're aware of :) 
[00:38:17] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1105092
[00:38:17] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1105092 (owner: TrainBranchBot)
[00:39:55] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:49:02] <wikibugs>	 (PS1) Tim Starling: Revert "Use PHP type declarations" [extensions/TimedMediaHandler] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105097 (https://phabricator.wikimedia.org/T382385)
[00:55:43] <wikibugs>	 (CR) Tim Starling: [C:+2] Revert "Use PHP type declarations" [extensions/TimedMediaHandler] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105097 (https://phabricator.wikimedia.org/T382385) (owner: Tim Starling)
[00:58:17] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1105092 (owner: TrainBranchBot)
[01:09:07] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1105099
[01:09:07] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1105099 (owner: TrainBranchBot)
[01:14:36] <AntiComposite>	 created https://phabricator.wikimedia.org/T382388 for the problem Tamzin mentioned since it's user-visible
[01:15:30] <wikibugs>	 (Merged) jenkins-bot: Revert "Use PHP type declarations" [extensions/TimedMediaHandler] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105097 (https://phabricator.wikimedia.org/T382385) (owner: Tim Starling)
[01:17:40] <wikibugs>	 (Abandoned) Cwhite: dashboards: sudo set noninteractive flag [puppet] - https://gerrit.wikimedia.org/r/888740 (https://phabricator.wikimedia.org/T329688) (owner: Cwhite)
[01:18:07] <logmsgbot>	 !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1105097|Revert "Use PHP type declarations" (T382385)]]
[01:18:11] <stashbot>	 T382385: Typed property MediaWiki\\TimedMediaHandler\\WebVideoTranscode\\WebVideoTranscodeJob::$targetEncodeFile must not be accessed before initialization - https://phabricator.wikimedia.org/T382385
[01:25:08] <wikibugs>	 (CR) Cwhite: "These are CI test fixtures.  The host number does not need to be a real host, but should be close enough to adequately exercise the logsta" [puppet] - https://gerrit.wikimedia.org/r/1105015 (owner: Eevans)
[01:26:32] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1105097|Revert "Use PHP type declarations" (T382385)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[01:26:36] <stashbot>	 T382385: Typed property MediaWiki\\TimedMediaHandler\\WebVideoTranscode\\WebVideoTranscodeJob::$targetEncodeFile must not be accessed before initialization - https://phabricator.wikimedia.org/T382385
[01:27:06] <logmsgbot>	 !log tstarling@deploy2002 tstarling: Continuing with sync
[01:27:14] <wikibugs>	 SRE, Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10411469 (cmooney)
[01:30:25] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1105099 (owner: TrainBranchBot)
[01:32:42] <logmsgbot>	 !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105097|Revert "Use PHP type declarations" (T382385)]] (duration: 14m 35s)
[01:32:47] <stashbot>	 T382385: Typed property MediaWiki\\TimedMediaHandler\\WebVideoTranscode\\WebVideoTranscodeJob::$targetEncodeFile must not be accessed before initialization - https://phabricator.wikimedia.org/T382385
[01:55:12] <jinxer-wm>	 FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[02:17:38] <wikibugs>	 (PS10) Aleksandar Mastilovic: dse-k8s-services: introduce Blunderbuss config [deployment-charts] - https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) (owner: Bking)
[02:18:24] <wikibugs>	 (CR) Aleksandar Mastilovic: "Thanks for the comments brouberol! I've updated the merge request." [deployment-charts] - https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) (owner: Bking)
[02:19:43] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[02:19:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[02:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:43:19] <icinga-wm>	 PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[03:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:10:19] <icinga-wm>	 RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process
[03:24:31] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:12:33] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[04:13:54] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[04:59:13] <logmsgbot>	 !log tstarling@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[05:00:43] <logmsgbot>	 !log tstarling@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[05:00:55] <logmsgbot>	 !log tstarling@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[05:01:29] <logmsgbot>	 !log tstarling@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[05:13:47] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release mw-videoscaler/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-videoscaler - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[05:30:27] <wikibugs>	 ops-codfw, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T382392 (phaultfinder) NEW
[05:40:01] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:42:34] <wikibugs>	 (CR) Giuseppe Lavagetto: "While the patch might be correct, I should note that we need the same patch in the mediawiki helm chart for it to have any effect in produc" [puppet] - https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T382357) (owner: Bartosz Dziewoński)
[05:55:12] <jinxer-wm>	 FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[06:19:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[06:19:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[06:21:23] <icinga-wm>	 RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 69, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T0700)
[07:24:31] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:34:27] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Deprecate system::role for remaining WMCS roles [puppet] - https://gerrit.wikimedia.org/r/1104945 (owner: Muehlenhoff)
[07:34:31] <wikibugs>	 (PS1) Slyngshede: Release v0.1.6 [software/bitu] - https://gerrit.wikimedia.org/r/1105269
[07:39:12] <wikibugs>	 (CR) Arnaudb: [C:+1] ownership: Data Persistence cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104951 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[07:39:21] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1105269 (owner: Slyngshede)
[07:42:06] <wikibugs>	 (PS1) Muehlenhoff: Deprecate system::role for Ceph roles [puppet] - https://gerrit.wikimedia.org/r/1105270
[07:42:09] <wikibugs>	 (CR) Slyngshede: [C:+2] Release v0.1.6 [software/bitu] - https://gerrit.wikimedia.org/r/1105269 (owner: Slyngshede)
[07:44:49] <wikibugs>	 (PS1) Muehlenhoff: Remove unused role [puppet] - https://gerrit.wikimedia.org/r/1105271
[07:46:33] <wikibugs>	 (Merged) jenkins-bot: Release v0.1.6 [software/bitu] - https://gerrit.wikimedia.org/r/1105269 (owner: Slyngshede)
[07:55:37] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105271 (owner: Muehlenhoff)
[07:57:46] <wikibugs>	 (PS1) Slyngshede: IDM - 0.1.6 update [dns] - https://gerrit.wikimedia.org/r/1105273
[07:59:09] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/1105273 (owner: Slyngshede)
[08:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T0800). Please do the needful.
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:00:41] <wikibugs>	 (CR) Slyngshede: [C:+2] IDM - 0.1.6 update [dns] - https://gerrit.wikimedia.org/r/1105273 (owner: Slyngshede)
[08:03:28] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2070-2071].codfw.wmnet
[08:04:37] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2070-2071].codfw.wmnet
[08:09:02] <wikibugs>	 (PS3) Volans: ownership: Traffic cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258)
[08:09:16] <wikibugs>	 (CR) Volans: ownership: Traffic cookbooks (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[08:10:00] <wikibugs>	 (CR) Volans: [C:+2] ownership: Data Persistence cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104951 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[08:10:43] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2071.codfw.wmnet with OS bookworm
[08:10:43] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2070.codfw.wmnet with OS bookworm
[08:11:02] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2070
[08:11:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2070
[08:11:03] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2071
[08:11:03] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2071
[08:11:27] <jinxer-wm>	 FIRING: [5x] SystemdUnitCrashLoop: logstash.service crashloop on elastic2055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:14:27] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:15:54] <wikibugs>	 (Merged) jenkins-bot: ownership: Data Persistence cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104951 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[08:16:27] <jinxer-wm>	 RESOLVED: [7x] SystemdUnitCrashLoop: logstash.service crashloop on elastic2055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:28:33] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2070.codfw.wmnet with reason: host reimage
[08:28:52] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2071.codfw.wmnet with reason: host reimage
[08:29:41] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[08:30:10] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[08:31:49] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2070.codfw.wmnet with reason: host reimage
[08:33:47] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release mw-videoscaler/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-videoscaler - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[08:35:18] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2071.codfw.wmnet with reason: host reimage
[08:48:48] <wikibugs>	 (CR) JMeybohm: [C:+1] ownership: ServiceOps cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104953 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[08:51:13] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2070.codfw.wmnet with OS bookworm
[08:54:16] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:54:52] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2071.codfw.wmnet with OS bookworm
[08:55:20] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2070.codfw.wmnet
[08:55:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2070.codfw.wmnet
[08:56:50] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] thanos-store: enable caching bucket [puppet] - https://gerrit.wikimedia.org/r/1105037 (https://phabricator.wikimedia.org/T368953) (owner: Herron)
[08:57:48] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2068-2069].codfw.wmnet
[08:58:57] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2068-2069].codfw.wmnet
[08:59:11] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Enable management of cn=wmf for production IDMs [puppet] - https://gerrit.wikimedia.org/r/1104970 (owner: Muehlenhoff)
[08:59:37] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2069.codfw.wmnet with OS bookworm
[08:59:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2068.codfw.wmnet with OS bookworm
[08:59:56] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2069
[08:59:57] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2069
[08:59:57] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2068
[08:59:57] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2068
[09:03:16] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:05:02] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "Untested but LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1105034 (https://phabricator.wikimedia.org/T216826) (owner: Elukey)
[09:10:10] <icinga-wm>	 RECOVERY - Disk space on ml-lab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops
[09:12:49] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396 (fgiunchedi) NEW
[09:13:36] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10411780 (fgiunchedi) No worries at all @cmooney, I've opened {T382396} to investigate/followup on the two issues you mentioned
[09:15:17] <wikibugs>	 (CR) Elukey: [C:+2] charts: improve Kartotherian metrics and monitoring config [deployment-charts] - https://gerrit.wikimedia.org/r/1105034 (https://phabricator.wikimedia.org/T216826) (owner: Elukey)
[09:15:54] <Emperor>	 !log restart wedged swift stats jobs on ms-fe2009
[09:15:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:34] <wikibugs>	 (CR) Volans: [C:+2] ownership: ServiceOps cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104953 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[09:17:13] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2069.codfw.wmnet with reason: host reimage
[09:17:24] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2068.codfw.wmnet with reason: host reimage
[09:18:08] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync
[09:20:06] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2069.codfw.wmnet with reason: host reimage
[09:20:12] <jinxer-wm>	 RESOLVED: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[09:23:16] <wikibugs>	 (Merged) jenkins-bot: ownership: ServiceOps cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104953 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[09:23:25] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2068.codfw.wmnet with reason: host reimage
[09:28:27] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync
[09:29:41] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10411821 (phaultfinder)
[09:32:57] <wikibugs>	 (CR) Hashar: [V:+1] "I have cherry picked the change on `puppetmaster-1003.devtools.eqiad1.wikimedia.cloud` and Puppet is passing now." [puppet] - https://gerrit.wikimedia.org/r/1104957 (https://phabricator.wikimedia.org/T363415) (owner: Hashar)
[09:33:05] <wikibugs>	 (CR) Hashar: [C:+1] devtools: fix hiera after host renaming [puppet] - https://gerrit.wikimedia.org/r/1104957 (https://phabricator.wikimedia.org/T363415) (owner: Hashar)
[09:40:25] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2069.codfw.wmnet with OS bookworm
[09:41:24] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:43:43] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2068.codfw.wmnet with OS bookworm
[09:44:05] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2068.codfw.wmnet
[09:44:08] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2068.codfw.wmnet
[09:54:34] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2069.codfw.wmnet
[09:54:37] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2069.codfw.wmnet
[09:54:52] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2071.codfw.wmnet
[09:54:55] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2071.codfw.wmnet
[09:55:55] <wikibugs>	 (PS12) JMeybohm: create sre.k8s.roll-reimage-nodes [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[09:57:01] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[2036-2039].codfw.wmnet
[09:58:38] <wikibugs>	 (PS2) Abijeet Patro: Translate: Enable message group subscription by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386)
[09:59:21] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[2036-2039].codfw.wmnet
[09:59:32] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[10:02:24] <wikibugs>	 (CR) Jelto: [C:+2] Rename kubernetes20[36-39] to wikikube-worker20(47|66|85|86) [puppet] - https://gerrit.wikimedia.org/r/1102710 (https://phabricator.wikimedia.org/T379788) (owner: Jelto)
[10:04:30] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2036 to wikikube-worker2047
[10:04:49] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from kubernetes2036 to wikikube-worker2047
[10:05:46] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Idle - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%
[10:05:46] <icinga-wm>	 atus
[10:07:27] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin
[10:07:27] <icinga-wm>	 status
[10:07:59] <hnowlan>	 jouncebot: nowandnext
[10:07:59] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 52 minute(s)
[10:07:59] <jouncebot>	 In 0 hour(s) and 52 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1100)
[10:08:30] <hnowlan>	 I'm going to do a sync-world to build new images and test mw-videoscaler rollout logic
[10:08:48] <wikibugs>	 (CR) Hnowlan: [C:+2] Revert^2 "kubernetes: add mw-videoscaler to scap deployments" [puppet] - https://gerrit.wikimedia.org/r/1104985 (owner: Hnowlan)
[10:09:10] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1050.eqiad.wmnet
[10:09:11] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1050.eqiad.wmnet
[10:09:34] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1275.eqiad.wmnet
[10:09:35] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1275.eqiad.wmnet
[10:10:09] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1290.eqiad.wmnet
[10:10:10] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1290.eqiad.wmnet
[10:10:24] <wikibugs>	 (CR) FNegri: "I'm reading the discussion for upstream BUG #18349 [0] and it looks like their workaround was to "increase work_mem to 16MB", which is a s" [puppet] - https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: Andrew Bogott)
[10:11:02] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1050.eqiad.wmnet
[10:11:03] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1050.eqiad.wmnet
[10:11:32] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1275.eqiad.wmnet
[10:11:33] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1275.eqiad.wmnet
[10:12:42] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1290.eqiad.wmnet
[10:13:31] <logmsgbot>	 !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker1290.eqiad.wmnet
[10:15:34] <wikibugs>	 (CR) FNegri: "(ignore my previous "Look" comment, I didn't mean to send it)" [puppet] - https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: Andrew Bogott)
[10:15:40] <jinxer-wm>	 FIRING: [3x] KubernetesRsyslogDown: rsyslog on kubernetes2037:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[10:19:26] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10411938 (cmooney) > What is the dashboard and the underlying expression in the graph above?  That one came from here I think:  https://grafana.wikimedia.org/g...
[10:19:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[10:19:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[10:20:01] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, and 4 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10411941 (JMeybohm)
[10:20:15] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, and 4 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10411942 (JMeybohm)
[10:22:30] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1007,1021,1080,1287].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[10:22:44] <logmsgbot>	 !log hnowlan@deploy2002 Started scap sync-world: Rebuild and deploy to test mw-videoscaler integration
[10:24:15] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1007.eqiad.wmnet with OS bookworm
[10:28:00] <icinga-wm>	 PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:28:37] <wikibugs>	 (PS1) Wangombe: Event logging: update schemaId [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105283 (https://phabricator.wikimedia.org/T364460)
[10:29:00] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105283 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[10:29:51] <wikibugs>	 (PS1) Jelto: Rename kubernetes20[36-39] to wikikube-worker20[88-91] [puppet] - https://gerrit.wikimedia.org/r/1105284 (https://phabricator.wikimedia.org/T379788)
[10:41:05] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1007.eqiad.wmnet with reason: host reimage
[10:44:51] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1007.eqiad.wmnet with reason: host reimage
[10:47:44] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for LorenMora - https://phabricator.wikimedia.org/T382377#10412012 (MoritzMuehlenhoff) Earlier today we merged a patch which enables the request of cn=wmf within Wikimedia IDM, so in the future for such requests we no longer need a Phabricator task, but...
[10:56:04] <wikibugs>	 SRE, Add-Link, Growth-Structured-Tasks, Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404 (Michael) NEW
[10:58:01] <wikibugs>	 SRE, Add-Link, Growth-Structured-Tasks, Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404#10412028 (Michael) (This is technically not a "service-deployment-request" because the ser...
[10:58:10] <logmsgbot>	 !log hnowlan@deploy2002 Finished scap sync-world: Rebuild and deploy to test mw-videoscaler integration (duration: 36m 40s)
[10:58:40] <hnowlan>	 I'll be doing another sync-world
[10:58:45] <wikibugs>	 (PS2) Jelto: Rename kubernetes20[36-39] to wikikube-worker21[88-91] [puppet] - https://gerrit.wikimedia.org/r/1105284 (https://phabricator.wikimedia.org/T379788)
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1100)
[11:02:55] <wikibugs>	 (CR) Nikerabbit: [C:+1] Translate: Enable message group subscription by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[11:03:02] <icinga-wm>	 RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:03:20] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1007.eqiad.wmnet with OS bookworm
[11:04:22] <wikibugs>	 (PS3) Jelto: Rename kubernetes20[36-39] to wikikube-worker21[88-91] [puppet] - https://gerrit.wikimedia.org/r/1105284 (https://phabricator.wikimedia.org/T379788)
[11:04:31] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:05:10] <wikibugs>	 (CR) JMeybohm: [C:+1] Rename kubernetes20[36-39] to wikikube-worker21[88-91] [puppet] - https://gerrit.wikimedia.org/r/1105284 (https://phabricator.wikimedia.org/T379788) (owner: Jelto)
[11:06:38] <wikibugs>	 (CR) Jelto: [C:+2] Rename kubernetes20[36-39] to wikikube-worker21[88-91] [puppet] - https://gerrit.wikimedia.org/r/1105284 (https://phabricator.wikimedia.org/T379788) (owner: Jelto)
[11:07:44] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1021.eqiad.wmnet with OS bookworm
[11:08:12] <wikibugs>	 (CR) Santiago Faci: [C:+2] "Looks good!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101898 (https://phabricator.wikimedia.org/T356939) (owner: Clare Ming)
[11:08:56] <wikibugs>	 (Merged) jenkins-bot: Remove extraneous config for Metrics Platform instruments [mediawiki-config] - https://gerrit.wikimedia.org/r/1101898 (https://phabricator.wikimedia.org/T356939) (owner: Clare Ming)
[11:09:31] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:10:40] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2036 to wikikube-worker2188
[11:10:50] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[11:11:02] <icinga-wm>	 PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:14:41] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2036 to wikikube-worker2188 - jelto@cumin1002"
[11:15:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2036 to wikikube-worker2188 - jelto@cumin1002"
[11:15:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:15:02] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2188
[11:15:16] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2188
[11:15:55] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2036 to wikikube-worker2188
[11:19:59] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2037 to wikikube-worker2189
[11:20:19] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[11:23:54] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2037 to wikikube-worker2189 - jelto@cumin1002"
[11:24:24] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2037 to wikikube-worker2189 - jelto@cumin1002"
[11:24:24] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:24:24] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2189
[11:25:01] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1021.eqiad.wmnet with reason: host reimage
[11:25:25] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2189
[11:26:04] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2037 to wikikube-worker2189
[11:27:25] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2038 to wikikube-worker2190
[11:27:46] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[11:28:36] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1021.eqiad.wmnet with reason: host reimage
[11:31:24] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2038 to wikikube-worker2190 - jelto@cumin1002"
[11:32:41] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2038 to wikikube-worker2190 - jelto@cumin1002"
[11:32:41] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:32:42] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2190
[11:33:47] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2190
[11:34:26] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2038 to wikikube-worker2190
[11:35:03] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2039 to wikikube-worker2191
[11:35:24] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[11:39:06] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2039 to wikikube-worker2191 - jelto@cumin1002"
[11:39:52] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10412135 (cmooney) @fgiunchedi yeah I'm pretty sure it's only gaps in the data we are seeing, for instance here:  https://grafana.wikimedia.org/goto/_GSV1TIHR...
[11:40:11] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2039 to wikikube-worker2191 - jelto@cumin1002"
[11:40:12] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:40:12] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2191
[11:40:23] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2191
[11:41:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2039 to wikikube-worker2191
[11:41:17] <hashar>	 jouncebot: now
[11:41:17] <jouncebot>	 For the next 0 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1100)
[11:41:20] <hashar>	 jouncebot: nextandnow
[11:41:23] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2188.codfw.wmnet wikikube-worker2189.codfw.wmnet wikikube-worker2190.codfw.wmnet wikikube-worker2191.codfw.wmnet on all recursors
[11:41:26] <hashar>	 jouncebot: nowandnext
[11:41:27] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2188.codfw.wmnet wikikube-worker2189.codfw.wmnet wikikube-worker2190.codfw.wmnet wikikube-worker2191.codfw.wmnet on all recursors
[11:41:27] <jouncebot>	 For the next 0 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1100)
[11:41:27] <jouncebot>	 In 0 hour(s) and 18 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1200)
[11:42:13] <hnowlan>	 hashar: if you're planning to run scap, for the next hour or so there will be timestamp diffs for mw-videoscaler that can be ignored. I'm working on a fix
[11:42:29] <hashar>	 I am going to restart Gerrit a couple times to shrink some H2 caches before the holidays ( T323754 )
[11:42:30] <stashbot>	 T323754: Investigate Gerrit h2 cache being way too large - https://phabricator.wikimedia.org/T323754
[11:42:40] <hashar>	 hnowlan: +1 :)
[11:45:58] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2189.codfw.wmnet with OS bookworm
[11:45:59] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2188.codfw.wmnet with OS bookworm
[11:46:09] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2188
[11:46:34] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[11:47:04] <icinga-wm>	 RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:47:32] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1021.eqiad.wmnet with OS bookworm
[11:49:15] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1080.eqiad.wmnet with OS bookworm
[11:50:29] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2188 - jelto@cumin1002"
[11:50:33] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2188 - jelto@cumin1002"
[11:50:33] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:50:33] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2188.codfw.wmnet 169.32.192.10.in-addr.arpa 9.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[11:50:36] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2188.codfw.wmnet 169.32.192.10.in-addr.arpa 9.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[11:50:37] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2188
[11:52:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2188
[11:52:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2188
[11:52:41] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2189
[11:52:53] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[11:56:33] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2189 - jelto@cumin1002"
[11:56:37] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2189 - jelto@cumin1002"
[11:56:37] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:56:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2189.codfw.wmnet 170.32.192.10.in-addr.arpa 0.7.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[11:56:41] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2189.codfw.wmnet 170.32.192.10.in-addr.arpa 0.7.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[11:56:41] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2189
[11:57:05] <wikibugs>	 (PS1) Elukey: charts: improve Kartotherian's statsd config (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408)
[11:57:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2189
[11:57:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2189
[11:58:19] <wikibugs>	 (PS2) Elukey: charts: improve Kartotherian's statsd config (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408)
[11:58:53] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Translate: Enable message group subscription by default (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[12:00:05] <jouncebot>	 mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1200).
[12:01:13] <wikibugs>	 (CR) Abijeet Patro: Translate: Enable message group subscription by default (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[12:02:14] <wikibugs>	 SRE, Add-Link, Growth-Structured-Tasks, Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404#10412181 (akosiaris) Thanks for this writeup. Couple of comments below.  * If not already,...
[12:02:35] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Translate: Enable message group subscription by default (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[12:03:01] <wikibugs>	 (PS1) Hnowlan: mw-videoscaler: add dummy value for timestamp [deployment-charts] - https://gerrit.wikimedia.org/r/1105297 (https://phabricator.wikimedia.org/T371700)
[12:03:26] <wikibugs>	 SRE, Add-Link, Growth-Structured-Tasks, Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404#10412183 (akosiaris) Moving to #serviceops-radar since there isn't something specific acti...
[12:04:52] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Blacklist squashfs [puppet] - https://gerrit.wikimedia.org/r/1104968 (owner: Muehlenhoff)
[12:09:49] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers wikikube-ctrl2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:09:50] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1080.eqiad.wmnet with reason: host reimage
[12:10:01] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers wikikube-ctrl2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:10:49] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:10:53] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2188.codfw.wmnet with reason: host reimage
[12:11:01] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:12:46] <wikibugs>	 (CR) Btullis: [C:+1] ownership: Data Platform cookbooks (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1104950 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[12:12:56] <wikibugs>	 (CR) Mvolz: [C:+2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1104975 (owner: PipelineBot)
[12:13:08] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good to me. Thanks." [cookbooks] - https://gerrit.wikimedia.org/r/1104950 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[12:13:09] <wikibugs>	 (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1101494 (owner: PipelineBot)
[12:14:14] <wikibugs>	 (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1104975 (owner: PipelineBot)
[12:14:51] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1080.eqiad.wmnet with reason: host reimage
[12:15:47] <wikibugs>	 (CR) Volans: "Thanks, reply inline" [cookbooks] - https://gerrit.wikimedia.org/r/1104950 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[12:16:33] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2189.codfw.wmnet with reason: host reimage
[12:17:34] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2188.codfw.wmnet with reason: host reimage
[12:17:59] <wikibugs>	 (CR) Giuseppe Lavagetto: [C:+1] mw-videoscaler: add dummy value for timestamp [deployment-charts] - https://gerrit.wikimedia.org/r/1105297 (https://phabricator.wikimedia.org/T371700) (owner: Hnowlan)
[12:18:15] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply
[12:20:08] <logmsgbot>	 !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply
[12:20:39] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2189.codfw.wmnet with reason: host reimage
[12:21:04] <wikibugs>	 (PS1) Urbanecm: [Growth] Disable Surfacing Add Link tasks on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1105302 (https://phabricator.wikimedia.org/T382037)
[12:21:52] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply
[12:22:26] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[12:22:28] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply
[12:22:30] <logmsgbot>	 !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply
[12:23:37] <logmsgbot>	 !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply
[12:25:12] <logmsgbot>	 !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply
[12:27:47] <wikibugs>	 (CR) Hnowlan: [C:+2] mw-videoscaler: add dummy value for timestamp [deployment-charts] - https://gerrit.wikimedia.org/r/1105297 (https://phabricator.wikimedia.org/T371700) (owner: Hnowlan)
[12:28:38] <wikibugs>	 (PS1) Muehlenhoff: Deprecate system::role for Cloud VPS-specific Puppet roles [puppet] - https://gerrit.wikimedia.org/r/1105304
[12:28:48] <wikibugs>	 (Merged) jenkins-bot: mw-videoscaler: add dummy value for timestamp [deployment-charts] - https://gerrit.wikimedia.org/r/1105297 (https://phabricator.wikimedia.org/T371700) (owner: Hnowlan)
[12:31:21] <wikibugs>	 (PS1) Muehlenhoff: Deprecate system::role for phab/mariadb [puppet] - https://gerrit.wikimedia.org/r/1105305
[12:31:31] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[12:31:45] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[12:35:30] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Deprecate system::role for phab/mariadb [puppet] - https://gerrit.wikimedia.org/r/1105305 (owner: Muehlenhoff)
[12:36:04] <wikibugs>	 (PS1) Hnowlan: mw-videoscaler: correct dummy timestamp [deployment-charts] - https://gerrit.wikimedia.org/r/1105306 (https://phabricator.wikimedia.org/T371700)
[12:36:18] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1080.eqiad.wmnet with OS bookworm
[12:36:36] <hashar>	 I am restarting Gerrit now
[12:36:46] <hashar>	 it is quite fast to come back
[12:37:21] <hashar>	 Dec 18 12:37:08 gerrit1003 systemd[1]: gerrit.service: Consumed 4month 1d 18h 44min 12.301s CPU time.
[12:37:57] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2188.codfw.wmnet with OS bookworm
[12:38:05] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1287.eqiad.wmnet with OS bookworm
[12:38:55] <hashar>	 that was since October 22
[12:41:10] <wikibugs>	 (PS1) Muehlenhoff: Deprecate system::role for Druid roles [puppet] - https://gerrit.wikimedia.org/r/1105326
[12:41:18] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2189.codfw.wmnet with OS bookworm
[12:41:45] <hashar>	 !log Restarted Gerrit at 12:37:08 UTC
[12:41:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:56] <icinga-wm>	 PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:42:43] <wikibugs>	 (PS3) Abijeet Patro: Translate: Enable message group subscription by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386)
[12:43:55] <wikibugs>	 (CR) Abijeet Patro: Translate: Enable message group subscription by default (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[12:44:24] <wikibugs>	 (PS2) Hnowlan: mw-videoscaler: correct dummy timestamp [deployment-charts] - https://gerrit.wikimedia.org/r/1105306 (https://phabricator.wikimedia.org/T371700)
[12:45:03] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2190.codfw.wmnet with OS bookworm
[12:45:08] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2191.codfw.wmnet with OS bookworm
[12:45:14] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2190
[12:45:19] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[12:48:51] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2190 - jelto@cumin1002"
[12:48:56] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2190 - jelto@cumin1002"
[12:48:56] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:48:56] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2190.codfw.wmnet 171.32.192.10.in-addr.arpa 1.7.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:48:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2190.codfw.wmnet 171.32.192.10.in-addr.arpa 1.7.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:49:00] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2190
[12:49:24] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2190
[12:49:24] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2190
[12:49:53] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2191
[12:51:09] <wikibugs>	 SRE, Add-Link, Growth-Structured-Tasks, Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404#10412293 (Michael) Thank you for the very quick response!    >>! In T382404#10412181, @ako...
[12:53:53] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[12:54:14] <wikibugs>	 (PS1) Muehlenhoff: Deprecate remaining uses of system::role [puppet] - https://gerrit.wikimedia.org/r/1105329
[12:54:26] <wikibugs>	 (PS1) Btullis: Add an-worker106[5-9] to puppet [puppet] - https://gerrit.wikimedia.org/r/1105330 (https://phabricator.wikimedia.org/T382410)
[12:54:34] <wikibugs>	 (CR) CI reject: [V:-1] Deprecate remaining uses of system::role [puppet] - https://gerrit.wikimedia.org/r/1105329 (owner: Muehlenhoff)
[12:54:35] <wikibugs>	 (CR) Hnowlan: [C:+2] mw-videoscaler: correct dummy timestamp [deployment-charts] - https://gerrit.wikimedia.org/r/1105306 (https://phabricator.wikimedia.org/T371700) (owner: Hnowlan)
[12:55:57] <wikibugs>	 (Merged) jenkins-bot: mw-videoscaler: correct dummy timestamp [deployment-charts] - https://gerrit.wikimedia.org/r/1105306 (https://phabricator.wikimedia.org/T371700) (owner: Hnowlan)
[12:57:08] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[12:57:21] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[12:57:23] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2191 - jelto@cumin1002"
[12:57:27] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2191 - jelto@cumin1002"
[12:57:28] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:57:28] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2191.codfw.wmnet 172.32.192.10.in-addr.arpa 2.7.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:57:31] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2191.codfw.wmnet 172.32.192.10.in-addr.arpa 2.7.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[12:57:31] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2191
[12:57:46] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2191
[12:57:46] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2191
[12:58:16] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1287.eqiad.wmnet with reason: host reimage
[13:00:53] <wikibugs>	 (CR) Btullis: [C:+2] Add an-worker106[5-9] to puppet [puppet] - https://gerrit.wikimedia.org/r/1105330 (https://phabricator.wikimedia.org/T382410) (owner: Btullis)
[13:00:55] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[13:01:00] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[13:01:05] <wikibugs>	 (PS2) Muehlenhoff: Deprecate remaining uses of system::role [puppet] - https://gerrit.wikimedia.org/r/1105329
[13:01:20] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply
[13:01:24] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply
[13:01:44] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: sync
[13:01:54] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: sync
[13:02:00] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1287.eqiad.wmnet with reason: host reimage
[13:04:58] <wikibugs>	 ops-eqiad, cloud-services-team, Cloud-VPS, DC-Ops: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks - https://phabricator.wikimedia.org/T382412 (Andrew) NEW
[13:07:23] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply
[13:07:28] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply
[13:12:49] <moritzm>	 !log installing curl security updates
[13:12:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:04] <icinga-wm>	 RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:14:24] <wikibugs>	 (PS1) Andrew Bogott: cloud-vps dns recursors: increase # of threads x 3 [puppet] - https://gerrit.wikimedia.org/r/1105332 (https://phabricator.wikimedia.org/T374830)
[13:16:22] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2191.codfw.wmnet with reason: host reimage
[13:17:01] <hnowlan>	 jouncebot: nowandnext
[13:17:01] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 42 minute(s)
[13:17:01] <jouncebot>	 In 0 hour(s) and 42 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1400)
[13:17:06] <icinga-wm>	 PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:17:10] <hnowlan>	 I'll be doing just one more scap
[13:17:51] <logmsgbot>	 !log hnowlan@deploy2002 Started scap sync-world: Rebuild and deploy to test mw-videoscaler integration one last time
[13:19:06] <icinga-wm>	 RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:19:51] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2191.codfw.wmnet with reason: host reimage
[13:21:25] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1287.eqiad.wmnet with OS bookworm
[13:21:27] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1007,1021,1080,1287].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[13:21:49] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105332 (https://phabricator.wikimedia.org/T374830) (owner: Andrew Bogott)
[13:23:24] <wikibugs>	 (CR) Andrew Bogott: [C:+2] cloud-vps dns recursors: increase # of threads x 3 [puppet] - https://gerrit.wikimedia.org/r/1105332 (https://phabricator.wikimedia.org/T374830) (owner: Andrew Bogott)
[13:24:33] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1280-1284].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[13:24:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10412364 (phaultfinder)
[13:26:12] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1280.eqiad.wmnet with OS bookworm
[13:26:41] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1291-1295].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[13:28:21] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1291.eqiad.wmnet with OS bookworm
[13:30:04] <icinga-wm>	 PROBLEM - BGP status on lsw1-e6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:32:04] <icinga-wm>	 PROBLEM - BGP status on lsw1-f5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:36:10] <wikibugs>	 (PS1) KartikMistry: CX3 Build 0.2.0+20241218 [extensions/ContentTranslation] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105336 (https://phabricator.wikimedia.org/T380702)
[13:36:36] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/ContentTranslation] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105336 (https://phabricator.wikimedia.org/T380702) (owner: KartikMistry)
[13:37:11] <jinxer-wm>	 FIRING: ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:37:30] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10412405 (fgiunchedi) Indeed the underlying data/samples are there as expected: I tested this theory by removing all functions and look at the raw data, which...
[13:37:46] <moritzm>	 !log installing waitress security updates
[13:37:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:04] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2191.codfw.wmnet with OS bookworm
[13:39:09] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] Deprecate system::role for Cloud VPS-specific Puppet roles [puppet] - https://gerrit.wikimedia.org/r/1105304 (owner: Muehlenhoff)
[13:39:16] <jinxer-wm>	 RESOLVED: ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:44:51] <wikibugs>	 (PS1) Wangombe: Event logging: pass empty object to translation property [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105341 (https://phabricator.wikimedia.org/T364460)
[13:44:55] <moritzm>	 !log installing jinja2 security updates
[13:44:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:34] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105341 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[13:45:39] <wikibugs>	 (CR) Abijeet Patro: [C:+1] Event logging: pass empty object to translation property [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105341 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[13:46:16] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1280.eqiad.wmnet with reason: host reimage
[13:48:32] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1291.eqiad.wmnet with reason: host reimage
[13:49:09] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1280.eqiad.wmnet with reason: host reimage
[13:51:16] <James_F>	 hnowlan: Congratulations on the k8s videoscalers completion.
[13:52:36] <hnowlan>	 James_F: thanks! 
[13:52:45] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1291.eqiad.wmnet with reason: host reimage
[13:53:56] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105341 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[13:55:38] <wikibugs>	 (PS1) Cathal Mooney: Validators: Allow an interface to be called just "irb" on a device [software/netbox-extras] - https://gerrit.wikimedia.org/r/1105346 (https://phabricator.wikimedia.org/T371088)
[13:57:15] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/Translate] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105283 (https://phabricator.wikimedia.org/T364460) (owner: Wangombe)
[14:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1400).
[14:00:04] <jouncebot>	 abijeet and kart_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:24] <Lucas_WMDE>	 o/
[14:00:41] <kart_>	 here
[14:00:44] <abijeet>	 hi Lucas_WMDE 
[14:00:54] <Lucas_WMDE>	 I can deploy :)
[14:00:55] <kart_>	 I can deploy both changes..
[14:00:59] <Lucas_WMDE>	 or that ^^
[14:01:05] <abijeet>	 :-)
[14:01:05] <kart_>	 :)
[14:01:09] <icinga-wm>	 RECOVERY - BGP status on lsw1-e6-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:01:12] <kart_>	 Let me start.. :)
[14:01:19] <Lucas_WMDE>	 sure!
[14:01:37] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Translate: Enable message group subscription by default (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[14:01:48] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "(LGTM otherwise)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[14:01:56] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[14:02:20] <hnowlan>	 just a heads-up, I hit a timeout when syncing to the canaries earlier. Hopefully a transient thing but just wanted to warn
[14:02:39] <wikibugs>	 (Merged) jenkins-bot: Translate: Enable message group subscription by default [mediawiki-config] - https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[14:03:52] <kart_>	 ah. undeployed change!
[14:03:57] <kart_>	 14:02:58 The following are unexpected commits pulled from origin for /srv/mediawiki-staging:
[14:03:58] <kart_>	 commit 9fdd7062ecc9fdd006d5a8291da3db623f1a219e
[14:03:58] <kart_>	 Author: Clare Ming <cming@wikimedia.org>
[14:03:58] <kart_>	 Date:   Tue Dec 10 08:52:12 2024 -0700
[14:03:58] <kart_>	 Remove extraneous config for Metrics Platform instruments
[14:03:58] <kart_>	 - AgentData properties are required by the client library
[14:03:59] <kart_>	 so they can be removed from producer config
[14:03:59] <kart_>	 Bug: T356939
[14:04:00] <stashbot>	 T356939: [Java] Make all AgentData properties required - https://phabricator.wikimedia.org/T356939
[14:04:00] <kart_>	 Change-Id: Ibf10b59135bc2f95ac55b5cb43cb5c3a79c6c910
[14:04:21] <Lucas_WMDE>	 hm
[14:04:29] <kart_>	 Anyone aware about this?
[14:04:29] <Lucas_WMDE>	 was +2ed normally, not via TrainBranchBot
[14:04:45] <Lucas_WMDE>	 pinging sfaci 
[14:05:11] <icinga-wm>	 PROBLEM - BGP status on lsw1-e6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:06:11] <icinga-wm>	 RECOVERY - BGP status on lsw1-e6-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:06:44] <kart_>	 sfaci: OK to go with this?
[14:08:14] <wikibugs>	 (PS3) Elukey: charts: improve Kartotherian's statsd config (part 2) [deployment-charts] - https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408)
[14:09:22] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1280.eqiad.wmnet with OS bookworm
[14:09:27] <kart_>	 ah. We need to decide quickly. Lucas_WMDE, what else can we do in such cases?
[14:09:43] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2190.codfw.wmnet with OS bookworm
[14:10:07] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2190.codfw.wmnet with OS bookworm
[14:10:09] <Lucas_WMDE>	 I’ve pinged them on slack, let’s see if that works better
[14:10:11] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2190
[14:10:11] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2190
[14:10:19] <Lucas_WMDE>	 though I’m not quite sure why we need to decide quickly, anything specific?
[14:11:06] <Lucas_WMDE>	 if we don’t hear back from them, I’d go for reverting the undeployed change
[14:11:09] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1281.eqiad.wmnet with OS bookworm
[14:11:11] <icinga-wm>	 RECOVERY - BGP status on lsw1-f5-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:11:14] <Lucas_WMDE>	 seems safer than rolling it out when we don’t know how to test it
[14:11:14] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Deprecate system::role for Cloud VPS-specific Puppet roles [puppet] - https://gerrit.wikimedia.org/r/1105304 (owner: Muehlenhoff)
[14:11:25] <kart_>	 Lucas_WMDE: because we've one more backport patch ahead, which will take time as well ;)
[14:11:41] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1291.eqiad.wmnet with OS bookworm
[14:12:08] <kart_>	 Lucas_WMDE: window is of one hour, and there is a life after the deployment (ie dinner ;))
[14:12:59] <Lucas_WMDE>	 okay, but it’s not “production will explode” urgent, just wanted to check that ;)
[14:13:25] <Lucas_WMDE>	 what do you think about reverting vs. rolling out the undeployed change?
[14:13:28] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1292.eqiad.wmnet with OS bookworm
[14:14:27] <wikibugs>	 (CR) Herron: [C:+2] thanos-store: enable caching bucket [puppet] - https://gerrit.wikimedia.org/r/1105037 (https://phabricator.wikimedia.org/T368953) (owner: Herron)
[14:15:11] <icinga-wm>	 PROBLEM - BGP status on lsw1-e6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:17:01] <Lucas_WMDE>	 kart_: it’s been 12 minutes since we pinged them on IRC, I’d say let’s go ahead with deploying
[14:17:08] <Lucas_WMDE>	 and, unless you disagree, let’s do that by reverting their change
[14:17:11] <icinga-wm>	 PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:17:15] <Lucas_WMDE>	 and then they’ll have to deploy it correctly later
[14:17:34] <kart_>	 Let's go ahead. If something goes wrong we can revert.
[14:18:42] <Lucas_WMDE>	 so you’re saying don’t revert it?
[14:18:53] <logmsgbot>	 !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1105279|Translate: Enable message group subscription by default (T372386)]]
[14:18:57] <stashbot>	 T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[14:19:14] <wikibugs>	 (CR) KartikMistry: [C:+2] CX3 Build 0.2.0+20241218 [extensions/ContentTranslation] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105336 (https://phabricator.wikimedia.org/T380702) (owner: KartikMistry)
[14:19:27] <kart_>	 I've +2 my patch ahead ^^
[14:19:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[14:19:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[14:19:51] <kart_>	 Lucas_WMDE: yes
[14:20:51] <Lucas_WMDE>	 ok
[14:21:15] <wikibugs>	 (CR) Elukey: "Left some questions just to understand, if those are no concerns feel free to proceed :)" [puppet] - https://gerrit.wikimedia.org/r/1105329 (owner: Muehlenhoff)
[14:22:04] <wikibugs>	 SRE, Add-Link, Growth-Structured-Tasks, Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404#10412551 (Urbanecm_WMF) > Since I am on the guesstimations part, same things for requests...
[14:23:13] <wikibugs>	 (CR) Elukey: [C:+1] "LGTM! Optional: I am wondering if our future-selves will benefit of a one line explanation before the if, so that no git blame/etc.. is ne" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1105346 (https://phabricator.wikimedia.org/T371088) (owner: Cathal Mooney)
[14:23:58] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1256.eqiad.wmnet with OS bookworm
[14:24:22] <wikibugs>	 SRE, Add-Link, Growth-Structured-Tasks, Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404#10412553 (Urbanecm_WMF) > Related to that, are there any very rough guesstimations about w...
[14:28:33] <wikibugs>	 (PS1) Volans: api: allow to abort before run() [software/spicerack] - https://gerrit.wikimedia.org/r/1105351 (https://phabricator.wikimedia.org/T365454)
[14:30:56] <kart_>	 https://www.irccloud.com/pastebin/0JBvWT57/
[14:31:07] <kart_>	 hnowlan: ^^ is this known?
[14:31:23] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1281.eqiad.wmnet with reason: host reimage
[14:31:46] <Lucas_WMDE>	 any more output above that?
[14:31:57] <Lucas_WMDE>	 “exit status 1” isn’t super helpful :/
[14:32:33] <hnowlan>	 kart_: no, and that shouldn't be related to my changes :/
[14:32:35] <hnowlan>	 I'll have a look
[14:32:45] <wikibugs>	 (PS1) Hnowlan: mediawiki: configure job history limits [deployment-charts] - https://gerrit.wikimedia.org/r/1105354 (https://phabricator.wikimedia.org/T371700)
[14:33:07] <kart_>	 :/
[14:33:16] <kart_>	 looks like bad day for the deployments..
[14:33:29] <kart_>	 abijeet: sorry - we still need to wait more..
[14:33:38] <wikibugs>	 (CR) CI reject: [V:-1] mediawiki: configure job history limits [deployment-charts] - https://gerrit.wikimedia.org/r/1105354 (https://phabricator.wikimedia.org/T371700) (owner: Hnowlan)
[14:34:15] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1292.eqiad.wmnet with reason: host reimage
[14:34:57] <hnowlan>	 "failed to sync configm
[14:35:07] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1281.eqiad.wmnet with reason: host reimage
[14:35:11] <hnowlan>	 "failed to sync configmap cache: timed out waiting for the condition"
[14:36:17] <kart_>	 OK. It failed finally.
[14:36:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:37:51] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1292.eqiad.wmnet with reason: host reimage
[14:39:26] <wikibugs>	 (Merged) jenkins-bot: CX3 Build 0.2.0+20241218 [extensions/ContentTranslation] (wmf/1.44.0-wmf.8) - https://gerrit.wikimedia.org/r/1105336 (https://phabricator.wikimedia.org/T380702) (owner: KartikMistry)
[14:39:28] <hnowlan>	 retry for now? I am still looking into it
[14:39:50] * kamila_ looking too
[14:40:23] <abijeet>	 kart_, ok
[14:40:31] <kamila_>	 where did the etcd's all go?
[14:40:39] <kart_>	 hnowlan: OK. Let me retry.
[14:41:24] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Patch-For-Review: Add an ownership field to cookbooks. - https://phabricator.wikimedia.org/T379258#10412580 (BTullis) >>! In T379258#10409987, @Volans wrote: > In an early draft I had thought of adding working groups to the list of possible groups but talking wi...
[14:41:35] <kart_>	 ah. Sad. my other patch also got merged :/
[14:41:37] <logmsgbot>	 !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1105279|Translate: Enable message group subscription by default (T372386)]]
[14:41:41] <stashbot>	 T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[14:41:42] <kamila_>	 I see only etcd-0 in `kubectl get cs` in codfw, is that expected?
[14:44:21] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1256.eqiad.wmnet with reason: host reimage
[14:44:39] <jayme>	 kamila_: I did not know about get cs :o ...and given it says healthy I think we're good
[14:44:47] <kamila_>	 https://www.irccloud.com/pastebin/xMbmY9ja/
[14:45:28] <jayme>	 was the deployment failure in codfw only?
[14:45:59] <kart_>	 hnowlan: seems to be going fine with the retry now..
[14:46:01] <wikibugs>	 (CR) JHathaway: [C:+1] "looks good, aside from the comments from @ltoscano@wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/1105329 (owner: Muehlenhoff)
[14:46:08] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Cloud-VPS, DC-Ops: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10412589 (Andrew)
[14:46:10] <kart_>	 abijeet: around, right?
[14:46:19] <hnowlan>	 jayme: I believe I had it in eqiad earlier but can't confirm
[14:46:26] <logmsgbot>	 !log kartik@deploy2002 kartik, abi: Backport for [[gerrit:1105279|Translate: Enable message group subscription by default (T372386)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:46:28] <jayme>	 kamila_: ETCDCTL_API=3 etcdctl --endpoints https://$(hostname -f):2379 member list
[14:46:36] <kamila_>	 ah! sorry!
[14:46:42] <kart_>	 abijeet: can you test the patch?
[14:46:52] <elukey>	 I was about to say, nothing weird in https://grafana.wikimedia.org/d/Ku6V7QYGz/etcd3?orgId=1&var-site=codfw&var-cluster=kubernetes&var-instance_prefix=wikikube-ctrl
[14:46:59] <akosiaris>	 yeah that. Plus get componentstatuses is deprecated since 1.19+. Don't rely much on it
[14:47:08] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Cloud-VPS, DC-Ops: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10412591 (Andrew)
[14:47:12] <wikibugs>	 (CR) Muehlenhoff: Deprecate remaining uses of system::role (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1105329 (owner: Muehlenhoff)
[14:47:27] <kamila_>	 thanks jayme, I'm done panicking for now '^^
[14:47:42] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1256.eqiad.wmnet with reason: host reimage
[14:47:42] <wikibugs>	 (PS3) Muehlenhoff: Deprecate remaining uses of system::role [puppet] - https://gerrit.wikimedia.org/r/1105329
[14:47:45] <jayme>	 akosiaris: funny... a thing I did not know of that has already been deprecated :D
[14:47:50] <elukey>	 ignorant question - where do I find the mw-on-k8s deployment logs? 
[14:48:28] <elukey>	 the "failed to sync etc.." that Hugh mentioned earlier on
[14:48:29] <akosiaris>	 elukey: helm/k8s stuff easiest way is kube_env mw-web <site>; kubectl get events
[14:48:37] <akosiaris>	 or logstash for the kubernetes events alternatively
[14:48:49] <akosiaris>	 also easier if you want to drill down on various namespaces etc
[14:48:56] <elukey>	 akosiaris: ah ok via get events, there is nothing more specific that scap saves from calling helmfile etc.. 
[14:49:01] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, Cloud-VPS, DC-Ops: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10412597 (fnegri) a:fnegri→None
[14:49:02] <akosiaris>	 same data, different medium
[14:49:05] <abijeet>	 kart_, on it
[14:49:08] <elukey>	 okok thanks
[14:49:15] <akosiaris>	 elukey: scap does log to logstash too
[14:49:19] <akosiaris>	 let me find the dashboard
[14:49:39] <wikibugs>	 ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10412609 (fnegri)
[14:49:40] <dancy>	 https://logstash.wikimedia.org/app/dashboards#/view/f7e31de0-9f0d-11eb-863c-3588009e4dd9 
[14:49:41] <hnowlan>	 elukey: that error is in fact from events :D 
[14:49:46] <hnowlan>	 but is a bit useless 
[14:49:49] <dancy>	 ^ scap logs
[14:50:06] <akosiaris>	 dancy: thanks! I was about to grumble about having to click on share etc
[14:50:15] <akosiaris>	 sigh kibana...
[14:50:20] <dancy>	 haha.. I feel ya
[14:50:30] <elukey>	 thanks!
[14:50:47] <dancy>	 Enjoy reading messages in reverse order.
[14:50:50] <akosiaris>	 elukey: the rest, which is arguably not deployment logs is still on mwlog hosts
[14:51:15] <moritzm>	 !log installing gstreamer1.0 security updates
[14:51:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:58] <abijeet>	 kart_, looks ok.
[14:52:02] <elukey>	 akosiaris: got it, in the end scap just calls helmfile, which is usually not very telling, so it can't know much.. I'll remember mw-web etc.. to check when these things happen
[14:52:13] <kart_>	 abijeet: cool. Going ahead.
[14:52:17] <abijeet>	 kart_, thanks for getting the patch through
[14:52:17] <logmsgbot>	 !log kartik@deploy2002 kartik, abi: Continuing with sync
[14:52:49] <akosiaris>	 elukey: the list is at https://gerrit.wikimedia.org/g/operations/puppet/+/6e296f27e8f019645c06e5f47a693d1100adcb85/hieradata/common/profile/kubernetes/deployment_server.yaml#161
[14:53:06] <akosiaris>	 every mw-* thing is 1 namespace in wikikube
[14:53:20] <akosiaris>	 and mostly a MediaWiki deployment
[14:53:30] <wikibugs>	 (CR) Bking: [C:+2] wdqs categories: ship lastUpdated metric [puppet] - https://gerrit.wikimedia.org/r/1073529 (https://phabricator.wikimedia.org/T374916) (owner: Ryan Kemper)
[14:53:33] <akosiaris>	 there are a couple of exceptions, e.g. mw-mcrouter (which is ... duh mcrouter)
[14:54:13] <icinga-wm>	 RECOVERY - BGP status on lsw1-e6-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:54:14] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1281.eqiad.wmnet with OS bookworm
[14:54:39] <wikibugs>	 (PS1) Muehlenhoff: Add gstreamer1.0 library hint [puppet] - https://gerrit.wikimedia.org/r/1105358
[14:55:27] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.dns.netbox
[14:55:57] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1282.eqiad.wmnet with OS bookworm
[14:56:15] <icinga-wm>	 RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:56:39] <_joe_>	 elukey: if you want to see which thing scap deploys to, https://gerrit.wikimedia.org/g/operations/puppet/+/6e296f27e8f019645c06e5f47a693d1100adcb85/hieradata/role/common/deployment_server/kubernetes.yaml#267
[14:56:54] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1292.eqiad.wmnet with OS bookworm
[14:57:01] <_joe_>	 the value of the hiera label "profile::kubernetes::deployment_server::mediawiki::release::mw_releases"
[14:57:48] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:57:58] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add gstreamer1.0 library hint [puppet] - https://gerrit.wikimedia.org/r/1105358 (owner: Muehlenhoff)
[14:58:04] <wikibugs>	 ops-eqiad, SRE, Cloud-VPS, DC-Ops, cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10412648 (aborrero)
[14:58:40] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1293.eqiad.wmnet with OS bookworm
[14:59:13] <icinga-wm>	 PROBLEM - BGP status on lsw1-e6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:00:04] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1500)
[15:00:19] <jayme>	 hnowlan: the 'failed to sync configmap cache' can happen from time to time and is transparent (kubelet giving up and retrying) ... but it can ofc delay deployments
[15:00:55] <jayme>	 but it happens quite regularly
[15:01:10] <_joe_>	 I can confirm
[15:01:13] <hnowlan>	 ah I was worried that would be the case
[15:01:19] <_joe_>	 and also confirm the first time I saw it I was super worried
[15:01:23] <kamila_>	 jayme: so that's not expected to break things and thus is not the problem we're looking for?
[15:01:27] <wikibugs>	 06SRE, 10Add-Link, 10Growth-Structured-Tasks, 06Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404#10412657 (10Ladsgroup) I think we need an overarching or at least some best practices on int...
[15:01:35] <hnowlan>	 so we have zero signal about what actually happened other than a "timeout waiting for the condition" message
[15:01:48] <kamila_>	 that's annoying
[15:02:04] <jayme>	 kamila_: if it stays like that for 5min then it can ofc make the deployment fail as readiness is never reached
[15:02:15] <icinga-wm>	 PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:02:19] <kamila_>	 right, thanks jayme 
[15:05:03] <logmsgbot>	 !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105279|Translate: Enable message group subscription by default (T372386)]] (duration: 23m 26s)
[15:05:08] <stashbot>	 T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386
[15:05:12] <kart_>	 ah. Finally!
[15:05:45] <jayme>	 I also don't see anything surrounding that in kubelet logs
[15:06:13] <logmsgbot>	 !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1105336|CX3 Build 0.2.0+20241218 (T380702)]]
[15:06:17] <stashbot>	 T380702: Consider length of Collection names on different views - https://phabricator.wikimedia.org/T380702
[15:06:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:06:54] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye
[15:06:57] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1256.eqiad.wmnet with OS bookworm
[15:07:05] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10412682 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudcephosd2004-dev.codfw.wmnet with OS bul...
[15:09:31] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:10:33] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es1043.eqiad.wmnet with OS bookworm
[15:10:46] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10412689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm
[15:14:09] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-11-27-074306 to 2024-12-17-184905 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105362 (https://phabricator.wikimedia.org/T378785)
[15:14:25] <wikibugs>	 (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-11-26-193226 to 2024-12-16-202347 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105363 (https://phabricator.wikimedia.org/T377020)
[15:14:45] <logmsgbot>	 !log kartik@deploy2002 kartik: Backport for [[gerrit:1105336|CX3 Build 0.2.0+20241218 (T380702)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:15:23] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-11-27-074306 to 2024-12-17-184905 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105362 (https://phabricator.wikimedia.org/T378785) (owner: 10Jforrester)
[15:16:20] <kamila_>	 WF deployers, we're still in the middle of MW deploy due to a problem earlier, can you please wait?
[15:16:20] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1282.eqiad.wmnet with reason: host reimage
[15:16:28] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-11-27-074306 to 2024-12-17-184905 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105362 (https://phabricator.wikimedia.org/T378785) (owner: 10Jforrester)
[15:16:34] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.dns.netbox
[15:16:34] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[15:16:38] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.dns.netbox
[15:16:51] <wikibugs>	 (03Abandoned) 10Hnowlan: mediawiki: configure job history limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105354 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan)
[15:16:59] <James_F>	 kamila_: Oh, sure, do you expect it to break services?
[15:17:14] <James_F>	 Normally they're unrelated.
[15:17:41] <kamila_>	 no, but we don't know what happened, so I don't want to get more confused :D
[15:18:02] <James_F>	 We don't even deploy with scap…
[15:18:38] <kamila_>	 James_F: yes, but the problem was in k8s
[15:18:46] <James_F>	 Fun.
[15:18:46] <kamila_>	 but if hnowlan or jayme think it's fine to do in parallel, feel free to say
[15:18:59] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1293.eqiad.wmnet with reason: host reimage
[15:19:27] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1282.eqiad.wmnet with reason: host reimage
[15:19:48] <logmsgbot>	 !log kartik@deploy2002 kartik: Continuing with sync
[15:19:54] <wikibugs>	 (03PS1) 10CDanis: chart-renderer: probe: increase timeout [puppet] - 10https://gerrit.wikimedia.org/r/1105366 (https://phabricator.wikimedia.org/T372081)
[15:20:08] <hnowlan>	 I don't think it should be an issue to go in parallel
[15:20:17] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Preparing an-presto1001 for renaming to an-worker1065 - btullis@cumin1002"
[15:20:22] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Preparing an-presto1001 for renaming to an-worker1065 - btullis@cumin1002"
[15:20:22] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:20:23] <James_F>	 Ack.
[15:20:25] <kamila_>	 ok, thanks hnowlan!
[15:20:30] <wikibugs>	 10ops-codfw, 06collaboration-services, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Comm Error: backplane 0 when reimaging wikikube-worker2190 - https://phabricator.wikimedia.org/T382420 (10Jelto) 03NEW
[15:20:31] <hnowlan>	 although take that with a grain of salt in that we can't find what caused it :P 
[15:20:31] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:21:23] <wikibugs>	 (03CR) 10CDanis: [C:03+2] chart-renderer: probe: increase timeout [puppet] - 10https://gerrit.wikimedia.org/r/1105366 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis)
[15:21:28] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet
[15:21:32] <kamila_>	 yeah, that's why I wasn't sure :D 
[15:21:36] <James_F>	 hnowlan: Aren't computers fantastic?
[15:21:45] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1065
[15:22:40] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1293.eqiad.wmnet with reason: host reimage
[15:23:04] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1065
[15:23:26] <hnowlan>	 wouldn't trust them too much 
[15:23:37] <wikibugs>	 (03PS1) 10Krinkle: Enable $wgWMEStatsBeaconUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105367 (https://phabricator.wikimedia.org/T355837)
[15:24:06] <logmsgbot>	 !log gengh@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:24:11] <wikibugs>	 (03PS1) 10Bking: team-data-platform: remove misconfigured alert [alerts] - 10https://gerrit.wikimedia.org/r/1105368 (https://phabricator.wikimedia.org/T374916)
[15:24:48] <logmsgbot>	 !log gengh@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:24:50] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1065.eqiad.wmnet with OS bullseye
[15:25:37] <wikibugs>	 (03CR) 10DCausse: [C:03+1] team-data-platform: remove misconfigured alert [alerts] - 10https://gerrit.wikimedia.org/r/1105368 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking)
[15:27:12] <logmsgbot>	 !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet
[15:27:12] <logmsgbot>	 !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105336|CX3 Build 0.2.0+20241218 (T380702)]] (duration: 20m 58s)
[15:27:30] <logmsgbot>	 !log gengh@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:27:34] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:28:54] <wikibugs>	 (03CR) 10Bking: [C:03+2] team-data-platform: remove misconfigured alert [alerts] - 10https://gerrit.wikimedia.org/r/1105368 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking)
[15:29:21] <logmsgbot>	 !log gengh@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:29:29] <logmsgbot>	 !log gengh@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:29:59] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:30:06] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2004-dev
[15:30:16] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2004-dev
[15:30:20] <logmsgbot>	 !log gengh@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:30:29] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2190.codfw.wmnet with OS bookworm
[15:31:31] <jelto>	 !log homer 'lsw1-c1-codfw*' commit 'T377877'
[15:31:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:36] <stashbot>	 T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877
[15:32:37] <jelto>	 !log homer 'lsw1-c3-codfw*' commit 'T377877'
[15:32:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:33:35] <wikibugs>	 (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade evaluators from 2024-11-26-193226 to 2024-12-16-202347 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105363 (https://phabricator.wikimedia.org/T377020) (owner: 10Jforrester)
[15:34:08] <jelto>	 !log homer 'cr*codfw*' commit 'T377877'
[15:34:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:34:52] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-11-26-193226 to 2024-12-16-202347 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105363 (https://phabricator.wikimedia.org/T377020) (owner: 10Jforrester)
[15:35:34] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 188, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:36:14] <icinga-wm>	 PROBLEM - BGP status on lsw1-c3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:36:21] <logmsgbot>	 !log gengh@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[15:38:10] <logmsgbot>	 !log gengh@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[15:38:22] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2188-2189,2191].codfw.wmnet
[15:38:25] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2188-2189,2191].codfw.wmnet
[15:38:26] <icinga-wm>	 RECOVERY - BGP status on lsw1-e6-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:38:28] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1282.eqiad.wmnet with OS bookworm
[15:39:46] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T382422 (10Jelto) 03NEW
[15:40:14] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1283.eqiad.wmnet with OS bookworm
[15:40:52] <wikibugs>	 (03PS13) 10Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857)
[15:41:17] <logmsgbot>	 !log gengh@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[15:41:26] <icinga-wm>	 RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:41:30] <jinxer-wm>	 FIRING: Primary outbound port utilisation over 80%  #page: Alert for device cr3-ulsfo.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[15:41:47] <volans>	 !incidents
[15:41:47] <sirenbot>	 5546 (UNACKED)  Primary outbound port utilisation over 80%  (paged) global noc (cr3-ulsfo.wikimedia.org)
[15:41:50] <volans>	 !ack 5546
[15:41:51] <sirenbot>	 5546 (ACKED)  Primary outbound port utilisation over 80%  (paged) global noc (cr3-ulsfo.wikimedia.org)
[15:41:58] <swfrench-wmf>	 here as well o/
[15:41:59] <volans>	 topranks: you were saying? :D
[15:42:04] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1293.eqiad.wmnet with OS bookworm
[15:42:06] <volans>	 we will be alerted if it goes over
[15:42:17] <volans>	 elukey: is this you?
[15:42:59] <topranks>	 the transport from codfw is ok (and out to singapore) so not impacting that 
[15:43:03] <topranks>	 but yes massive surge 
[15:43:03] <topranks>	 https://grafana.wikimedia.org/goto/daFI6oIHg?orgId=1
[15:43:05] <logmsgbot>	 !log gengh@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[15:43:23] <logmsgbot>	 !log gengh@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[15:43:27] <topranks>	 we're maxing outbound at SF-MIX exchange 
[15:43:46] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1294.eqiad.wmnet with OS bookworm
[15:44:13] <logmsgbot>	 !log gengh@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[15:44:26] <icinga-wm>	 PROBLEM - BGP status on lsw1-e6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:44:49] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2065,2067].codfw.wmnet
[15:45:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková)
[15:45:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2065,2067].codfw.wmnet
[15:46:30] <jinxer-wm>	 RESOLVED: Primary outbound port utilisation over 80%  #page: Device cr3-ulsfo.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[15:46:53] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2065.codfw.wmnet with OS bookworm
[15:46:54] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2067.codfw.wmnet with OS bookworm
[15:47:12] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2065
[15:47:12] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2065
[15:47:13] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2067
[15:47:13] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2067
[15:47:26] <icinga-wm>	 PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:49:54] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[15:50:38] <icinga-wm>	 PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:50:51] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[15:51:02] <icinga-wm>	 PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:51:11] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[15:51:30] <jinxer-wm>	 FIRING: Primary outbound port utilisation over 80%  #page: Alert for device cr3-ulsfo.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[15:51:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[15:51:53] <volans>	 !incidents
[15:51:53] <sirenbot>	 5547 (UNACKED)  Primary outbound port utilisation over 80%  (paged) global noc (cr3-ulsfo.wikimedia.org)
[15:51:54] <sirenbot>	 5546 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) global noc (cr3-ulsfo.wikimedia.org)
[15:52:04] <volans>	 !ack 5547
[15:52:04] <sirenbot>	 5547 (ACKED)  Primary outbound port utilisation over 80%  (paged) global noc (cr3-ulsfo.wikimedia.org)
[15:52:06] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:52:48] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:52:51] <wikibugs>	 (03PS1) 10Btullis: Configure the correct role for reimaging installing an-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1105371 (https://phabricator.wikimedia.org/T382410)
[15:52:52] <wikibugs>	 (03PS14) 10Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857)
[15:53:01] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[15:53:23] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[15:53:32] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[15:53:34] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Configure the correct role for reimaging installing an-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1105371 (https://phabricator.wikimedia.org/T382410) (owner: 10Btullis)
[15:53:54] <wikibugs>	 (03PS1) 10Krinkle: webperf: Enable --dogstatsd on statsv.py [puppet] - 10https://gerrit.wikimedia.org/r/1105372 (https://phabricator.wikimedia.org/T355837)
[15:54:10] <wikibugs>	 (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105372 (https://phabricator.wikimedia.org/T355837) (owner: 10Krinkle)
[15:54:21] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[15:54:24] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1043.eqiad.wmnet with OS bookworm
[15:54:29] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1065.eqiad.wmnet with OS bullseye
[15:54:29] <wikibugs>	 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10412921 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm...
[15:54:31] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:55:17] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:55:18] <wikibugs>	 (03PS15) 10Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857)
[15:55:25] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[15:56:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:56:30] <jinxer-wm>	 RESOLVED: Primary outbound port utilisation over 80%  #page: Device cr3-ulsfo.wikimedia.org recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[15:57:29] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[15:57:36] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[15:58:26] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[15:58:50] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T382392#10412927 (10Jhancock.wm) a:03Jhancock.wm
[15:59:19] <wikibugs>	 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T382422#10412942 (10Jhancock.wm) a:03Jhancock.wm
[15:59:37] <wikibugs>	 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Comm Error: backplane 0 when reimaging wikikube-worker2190 - https://phabricator.wikimedia.org/T382420#10412944 (10Jhancock.wm) a:03Jhancock.wm
[16:00:20] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1065.eqiad.wmnet with OS bullseye
[16:00:39] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1283.eqiad.wmnet with reason: host reimage
[16:00:59] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10412950 (10Jhancock.wm) @Andrew what kind of partition should this server have? I keep getting an error in that part of the installer. my first thought was...
[16:04:11] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1294.eqiad.wmnet with reason: host reimage
[16:04:13] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1283.eqiad.wmnet with reason: host reimage
[16:06:48] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2065.codfw.wmnet with reason: host reimage
[16:06:56] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2067.codfw.wmnet with reason: host reimage
[16:07:24] <wikibugs>	 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425 (10RobH) 03NEW
[16:07:54] <wikibugs>	 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425#10412972 (10RobH) a:03Marostegui Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers.  T...
[16:08:09] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1294.eqiad.wmnet with reason: host reimage
[16:11:14] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2067.codfw.wmnet with reason: host reimage
[16:15:06] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2065.codfw.wmnet with reason: host reimage
[16:17:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "As discussed on IRC; let's merge this in the first week of January" [puppet] - 10https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: 10BryanDavis)
[16:17:14] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.021e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[16:17:43] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.dns.netbox
[16:20:26] <icinga-wm>	 RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:21:20] <wikibugs>	 10ops-codfw, 06Data-Persistence, 06DC-Ops: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425#10412995 (10RobH)
[16:22:01] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-provisioning an-presto1002 and an-worker1066 - btullis@cumin1002"
[16:22:06] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-provisioning an-presto1002 and an-worker1066 - btullis@cumin1002"
[16:22:06] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:22:47] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1066
[16:23:13] <icinga-wm>	 RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:23:26] <icinga-wm>	 PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:23:26] <icinga-wm>	 RECOVERY - BGP status on lsw1-e6-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:23:28] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1283.eqiad.wmnet with OS bookworm
[16:24:28] <icinga-wm>	 RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:24:31] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1066
[16:25:05] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1065.eqiad.wmnet with reason: host reimage
[16:25:16] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1284.eqiad.wmnet with OS bookworm
[16:26:12] <icinga-wm>	 PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:27:12] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1294.eqiad.wmnet with OS bookworm
[16:27:14] <icinga-wm>	 RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:27:49] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1065.eqiad.wmnet with reason: host reimage
[16:28:57] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1295.eqiad.wmnet with OS bookworm
[16:29:39] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2067.codfw.wmnet with OS bookworm
[16:34:25] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2065.codfw.wmnet with OS bookworm
[16:37:07] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2065,2067].codfw.wmnet
[16:37:09] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2065,2067].codfw.wmnet
[16:37:46] <icinga-wm>	 RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:39:40] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2063-2064].codfw.wmnet
[16:40:52] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2063-2064].codfw.wmnet
[16:41:44] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2063.codfw.wmnet with OS bookworm
[16:41:48] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2064.codfw.wmnet with OS bookworm
[16:41:54] <icinga-wm>	 PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:42:07] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2064
[16:42:07] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2064
[16:42:44] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2063
[16:42:44] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2063
[16:42:55] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002"
[16:45:46] <icinga-wm>	 PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:45:46] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1284.eqiad.wmnet with reason: host reimage
[16:46:52] <icinga-wm>	 PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:48:03] <wikibugs>	 (CR) Xcollazo: [C:+1] dse-k8s: content-history: add kafka cluster domain [deployment-charts] - https://gerrit.wikimedia.org/r/1102913 (https://phabricator.wikimedia.org/T381322) (owner: Gmodena)
[16:48:35] <wikibugs>	 (CR) MSantos: [C:+1] "LGTM. I don't have a strong opinion about this and I will wait for Yiannis opinion." [deployment-charts] - https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408) (owner: Elukey)
[16:49:08] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1284.eqiad.wmnet with reason: host reimage
[16:49:20] <logmsgbot>	 !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1295.eqiad.wmnet with reason: host reimage
[16:49:50] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#10413071 (cmooney) Resolved→Open >>! In T294845#8758882, @ayounsi wrote: > This is completed in drmrs, the same will be applied to the other sites when we bring L3...
[16:49:58] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002"
[16:49:59] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1065.eqiad.wmnet with OS bullseye
[16:50:41] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1066.eqiad.wmnet with OS bullseye
[16:52:53] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1295.eqiad.wmnet with reason: host reimage
[16:56:14] <icinga-wm>	 PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:57:14] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1065.eqiad.wmnet
[16:58:15] <wikibugs>	 (PS1) Eevans: sessionstore: Upgrade Cassandra to v4.1.7 [puppet] - https://gerrit.wikimedia.org/r/1105386 (https://phabricator.wikimedia.org/T380420)
[16:58:44] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105386 (https://phabricator.wikimedia.org/T380420) (owner: Eevans)
[16:58:58] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1065.eqiad.wmnet
[17:00:18] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2063.codfw.wmnet with reason: host reimage
[17:01:38] <wikibugs>	 (PS2) Eevans: sessionstore: Upgrade Cassandra to v4.1.7 [puppet] - https://gerrit.wikimedia.org/r/1105386 (https://phabricator.wikimedia.org/T380420)
[17:01:55] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2064.codfw.wmnet with reason: host reimage
[17:02:12] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1066.eqiad.wmnet with reason: host reimage
[17:02:44] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105386 (https://phabricator.wikimedia.org/T380420) (owner: Eevans)
[17:03:06] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2063.codfw.wmnet with reason: host reimage
[17:06:08] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.dns.netbox
[17:06:45] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1066.eqiad.wmnet with reason: host reimage
[17:07:44] <wikibugs>	 ops-codfw, SRE, DC-Ops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788#10413120 (Papaul) @bking hello do you have any update on @Jhancock.wm above? Thank you
[17:07:59] <icinga-wm>	 RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:08:29] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1284.eqiad.wmnet with OS bookworm
[17:08:31] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1280-1284].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[17:09:48] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2064.codfw.wmnet with reason: host reimage
[17:10:04] <wikibugs>	 (PS1) Herron: thanos-store: manage and increase chunk-pool-size setting [puppet] - https://gerrit.wikimedia.org/r/1105389 (https://phabricator.wikimedia.org/T368953)
[17:10:45] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-commissioning an-presto1003 as an-worker1067 - btullis@cumin1002"
[17:10:49] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-commissioning an-presto1003 as an-worker1067 - btullis@cumin1002"
[17:10:49] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:10:58] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1067
[17:12:02] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1295.eqiad.wmnet with OS bookworm
[17:12:04] <logmsgbot>	 !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1291-1295].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad)
[17:12:17] <icinga-wm>	 RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:12:30] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1067
[17:17:43] <wikibugs>	 (PS2) Herron: thanos-store: manage and increase chunk-pool-size setting [puppet] - https://gerrit.wikimedia.org/r/1105389 (https://phabricator.wikimedia.org/T368953)
[17:19:17] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.dns.netbox
[17:19:45] <wikibugs>	 (PS1) Herron: thanos-store: increase store cache size to 24GB [puppet] - https://gerrit.wikimedia.org/r/1105395 (https://phabricator.wikimedia.org/T368953)
[17:19:57] <wikibugs>	 (CR) Herron: [V:+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4718/co" [puppet] - https://gerrit.wikimedia.org/r/1105389 (https://phabricator.wikimedia.org/T368953) (owner: Herron)
[17:21:19] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002"
[17:22:38] <wikibugs>	 ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10413175 (Andrew) The two small drives should be mirrored (raid 1) and used for the OS, the larger drives left unformatted for Ceph to manage.  I believe...
[17:24:01] <icinga-wm>	 RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:24:32] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2063.codfw.wmnet with OS bookworm
[17:24:53] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002"
[17:24:54] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1066.eqiad.wmnet with OS bullseye
[17:25:33] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-commissioning an-presto1004 as an-worker1068 - btullis@cumin1002"
[17:25:38] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-commissioning an-presto1004 as an-worker1068 - btullis@cumin1002"
[17:25:38] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:27:57] <Emperor>	 !log depool, restart, repool ms-fe2009
[17:27:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:38] <wikibugs>	 (CR) Eevans: [C:+2] sessionstore: Upgrade Cassandra to v4.1.7 [puppet] - https://gerrit.wikimedia.org/r/1105386 (https://phabricator.wikimedia.org/T380420) (owner: Eevans)
[17:28:47] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2064.codfw.wmnet with OS bookworm
[17:28:53] <icinga-wm>	 RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:30:23] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1068
[17:31:42] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1068
[17:32:30] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1066.eqiad.wmnet
[17:32:32] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2*.codfw.wmnet: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[17:32:37] <stashbot>	 T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[17:34:13] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1066.eqiad.wmnet
[17:39:43] <wikibugs>	 (CR) Kamila Součková: "@jmeybohm@wikimedia.org Assuming I create tasks for (and start working on) the incomplete TODOs inline, is there anything blocking merging" [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[17:41:52] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1067.eqiad.wmnet with OS bullseye
[17:44:29] <wikibugs>	 (PS1) Btullis: Add dummy tokens for new temporary Hadoop workers [labs/private] - https://gerrit.wikimedia.org/r/1105404 (https://phabricator.wikimedia.org/T382410)
[17:44:47] <wikibugs>	 (PS2) Eevans: restbase: cleanup decommissioned hosts [puppet] - https://gerrit.wikimedia.org/r/1105015
[17:46:46] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1068.eqiad.wmnet with OS bullseye
[17:48:41] <wikibugs>	 (CR) Eevans: "I changed these entries to corresponding values of the form restbase9xxx. This seems close to "real", while also guarding against future `" [puppet] - https://gerrit.wikimedia.org/r/1105015 (owner: Eevans)
[17:49:04] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2063-2064].codfw.wmnet
[17:49:07] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2063-2064].codfw.wmnet
[17:50:21] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2*.codfw.wmnet: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[17:50:25] <stashbot>	 T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[17:51:35] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.dns.netbox
[17:53:10] <wikibugs>	 (CR) Btullis: [V:+2 C:+2] Add dummy tokens for new temporary Hadoop workers [labs/private] - https://gerrit.wikimedia.org/r/1105404 (https://phabricator.wikimedia.org/T382410) (owner: Btullis)
[17:53:45] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:55:19] <logmsgbot>	 !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1067.eqiad.wmnet with OS bullseye
[17:57:02] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1067.eqiad.wmnet with OS bullseye
[17:57:21] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-commissioning an-presto1005 as an-worker1069 - btullis@cumin1002"
[17:57:26] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-commissioning an-presto1005 as an-worker1069 - btullis@cumin1002"
[17:57:26] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:58:17] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1069
[17:58:35] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:58:54] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1068.eqiad.wmnet with reason: host reimage
[17:59:35] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1069
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1800)
[18:00:17] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1*.eqiad.wmnet: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[18:00:21] <stashbot>	 T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[18:01:08] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1069.eqiad.wmnet with OS bullseye
[18:01:52] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1068.eqiad.wmnet with reason: host reimage
[18:04:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:05:47] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1067.eqiad.wmnet with OS bullseye
[18:06:30] <wikibugs>	 (CR) MVernon: [C:+1] "LGTM, thanks, apologies for delays due to my pedantry!" [puppet] - https://gerrit.wikimedia.org/r/1105015 (owner: Eevans)
[18:09:23] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1067.eqiad.wmnet with OS bullseye
[18:09:26] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:09:50] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:09:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:13:22] <wikibugs>	 (PS16) Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857)
[18:13:37] <wikibugs>	 (CR) Kamila Součková: "Done" [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[18:13:43] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:13:48] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1069.eqiad.wmnet with OS bullseye
[18:14:12] <wikibugs>	 (CR) Kamila Součková: create sre.k8s.roll-reimage-nodes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[18:14:18] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53069 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:15:49] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1067.eqiad.wmnet with OS bullseye
[18:16:28] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1067.eqiad.wmnet with OS bullseye
[18:16:39] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002"
[18:18:09] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1*.eqiad.wmnet: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[18:18:13] <stashbot>	 T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[18:18:35] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002"
[18:18:36] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1068.eqiad.wmnet with OS bullseye
[18:19:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[18:19:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[18:20:29] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1069.eqiad.wmnet with OS bullseye
[18:21:56] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1068.eqiad.wmnet
[18:23:03] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10413490 (cmooney) >>! In T382396#10412404, @fgiunchedi wrote: > Indeed the underlying data/samples are there as expected: I tested this theory by removing all...
[18:23:41] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1068.eqiad.wmnet
[18:25:00] <logmsgbot>	 !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1069.eqiad.wmnet with OS bullseye
[18:25:35] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1069.eqiad.wmnet with OS bullseye
[18:28:38] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10413543 (CDanis) >>! In T382396#10413490, @cmooney wrote: > But we can deal with that if that is the cause.  The goal of the "irate" is that we want as much g...
[18:31:35] <wikibugs>	 ops-codfw, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T382392#10413558 (Jhancock.wm) Open→Resolved probably came loose yesterday while cleaning up the cable management in that rack. reseated. came up.
[18:34:36] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10413565 (cmooney) >>! In T382396#10413543, @CDanis wrote: > It's fine to make the time window longer with `irate()` -- it will always pick the two most-recent...
[18:37:25] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1069.eqiad.wmnet with reason: host reimage
[18:40:36] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1069.eqiad.wmnet with reason: host reimage
[18:52:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:55:33] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002"
[18:57:17] <wikibugs>	 ops-codfw, SRE, collaboration-services, DC-Ops, and 3 others: Comm Error: backplane 0 when reimaging wikikube-worker2190 - https://phabricator.wikimedia.org/T382420#10413692 (Jhancock.wm) Open→Resolved reseated all cables connected to the backplane and the connection on the main board...
[19:00:05] <jouncebot>	 dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1900).
[19:07:46] <dancy>	 o/
[19:08:20] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.44.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1105418 (https://phabricator.wikimedia.org/T375667)
[19:08:22] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 to 1.44.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1105418 (https://phabricator.wikimedia.org/T375667) (owner: TrainBranchBot)
[19:09:04] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.44.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1105418 (https://phabricator.wikimedia.org/T375667) (owner: TrainBranchBot)
[19:09:31] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:13:54] <wikibugs>	 (PS1) Michael Große: Growth: Remove temporary config for clearing link recommendations [mediawiki-config] - https://gerrit.wikimedia.org/r/1105420 (https://phabricator.wikimedia.org/T379522)
[19:13:54] <wikibugs>	 (CR) Michael Große: [C:-1] "Id70d05b05ebd5d8a1650208b28b435da1f89d49e needs to be merged and in production and sure to not be reverted first before this change should" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105420 (https://phabricator.wikimedia.org/T379522) (owner: Michael Große)
[19:25:27] <wikibugs>	 (CR) Bking: [C:+2] dse-k8s-services: introduce Blunderbuss config [deployment-charts] - https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) (owner: Bking)
[19:28:29] <logmsgbot>	 !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.8  refs T375667
[19:28:34] <stashbot>	 T375667: 1.44.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T375667
[19:31:11] <icinga-wm>	 PROBLEM - Docker registry HTTPS interface on registry1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[19:32:01] <icinga-wm>	 RECOVERY - Docker registry HTTPS interface on registry1005 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Docker
[19:36:35] <brennen>	 dancy: looks pretty chill
[19:36:40] <dancy>	 Agreed
[19:36:54] <dancy>	 The best type of train vibe
[19:37:01] <brennen>	 ^
[19:38:22] <wikibugs>	 (CR) Eevans: [C:+2] restbase: cleanup decommissioned hosts [puppet] - https://gerrit.wikimedia.org/r/1105015 (owner: Eevans)
[19:43:05] <wikibugs>	 ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T382422#10413878 (Jhancock.wm) Open→Resolved
[19:48:59] <wikibugs>	 (CR) Cwhite: [C:+1] Enable $wgWMEStatsBeaconUri [mediawiki-config] - https://gerrit.wikimedia.org/r/1105367 (https://phabricator.wikimedia.org/T355837) (owner: Krinkle)
[19:50:43] <wikibugs>	 (CR) Cwhite: [C:+1] "Looks good!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105367 (https://phabricator.wikimedia.org/T355837) (owner: Krinkle)
[19:56:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:24:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10414003 (phaultfinder)
[20:26:51] <ottomata>	 !log restarting eventgate-analytics-external to clear schema cache - T382113 |  https://phabricator.wikimedia.org/T382113#10414005 
[20:26:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:56] <stashbot>	 T382113: Invalid EventGate errors with content_translation_event 1.7.0 - https://phabricator.wikimedia.org/T382113
[20:27:04] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: sync
[20:27:10] <wikibugs>	 (PS1) Clare Ming: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1105431 (https://phabricator.wikimedia.org/T374957)
[20:27:18] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: sync
[20:27:31] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: sync
[20:28:17] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: sync
[20:28:35] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: sync
[20:29:24] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: sync
[20:29:28] <wikibugs>	 (PS1) Clare Ming: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - https://gerrit.wikimedia.org/r/1105433 (https://phabricator.wikimedia.org/T374957)
[20:32:29] <wikibugs>	 (CR) Santiago Faci: [C:+2] Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1105431 (https://phabricator.wikimedia.org/T374957) (owner: Clare Ming)
[20:32:30] <wikibugs>	 (CR) Santiago Faci: [C:+2] Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - https://gerrit.wikimedia.org/r/1105433 (https://phabricator.wikimedia.org/T374957) (owner: Clare Ming)
[20:33:31] <wikibugs>	 (Merged) jenkins-bot: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1105431 (https://phabricator.wikimedia.org/T374957) (owner: Clare Ming)
[20:33:58] <wikibugs>	 (Merged) jenkins-bot: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - https://gerrit.wikimedia.org/r/1105433 (https://phabricator.wikimedia.org/T374957) (owner: Clare Ming)
[20:36:00] <logmsgbot>	 !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[20:36:23] <logmsgbot>	 !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[20:44:10] <logmsgbot>	 !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
[20:44:29] <logmsgbot>	 !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
[20:45:33] <wikibugs>	 (PS30) CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - https://gerrit.wikimedia.org/r/1102860
[20:47:35] <wikibugs>	 (CR) CI reject: [V:-1] P:hardware::check: add profile to check HW configuration [puppet] - https://gerrit.wikimedia.org/r/1102860 (owner: CDobbins)
[20:51:08] <wikibugs>	 (PS31) CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - https://gerrit.wikimedia.org/r/1102860
[20:53:09] <wikibugs>	 (CR) CI reject: [V:-1] P:hardware::check: add profile to check HW configuration [puppet] - https://gerrit.wikimedia.org/r/1102860 (owner: CDobbins)
[20:57:36] <wikibugs>	 (PS32) CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - https://gerrit.wikimedia.org/r/1102860
[21:00:07] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T2100). nyaa~
[21:00:07] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[21:01:58] <wikibugs>	 (CR) CDobbins: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4719/co" [puppet] - https://gerrit.wikimedia.org/r/1102860 (owner: CDobbins)
[21:03:23] <wikibugs>	 (CR) CDobbins: "PCC: https://puppet-compiler.wmflabs.org/output/1102860/4719/dns4003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1102860 (owner: CDobbins)
[21:07:08] <wikibugs>	 (PS1) Andrew Bogott: pdns recursor: support injecting extra hostnames into recursor config [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129)
[21:07:10] <wikibugs>	 (PS1) Andrew Bogott: cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129)
[21:08:25] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[21:08:29] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 8565 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[21:09:41] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[21:10:24] <wikibugs>	 (CR) BCornwall: [C:+1] ownership: Traffic cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[21:14:11] <wikibugs>	 (PS2) Andrew Bogott: pdns recursor: support injecting extra hostnames into recursor config [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129)
[21:14:11] <wikibugs>	 (PS2) Andrew Bogott: cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129)
[21:15:11] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[21:15:15] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[21:19:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10414190 (phaultfinder)
[21:24:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10414210 (phaultfinder)
[21:25:42] <wikibugs>	 (PS3) Andrew Bogott: cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129)
[21:26:03] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[21:30:10] <logmsgbot>	 !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@a43cacf]: bump image suggestions, section topics, and SEAL
[21:31:21] <logmsgbot>	 !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@a43cacf]: bump image suggestions, section topics, and SEAL (duration: 01m 43s)
[21:33:57] <wikibugs>	 (PS1) Brennen Bearnes: dockerpkg-builder: add to docker group [puppet] - https://gerrit.wikimedia.org/r/1105449 (https://phabricator.wikimedia.org/T382285)
[21:39:14] <wikibugs>	 ops-eqiad, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10414252 (Andrew) a:Andrew→cmooney
[21:42:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:44:13] <wikibugs>	 (PS1) Bking: team-search-platform: Add alert for wdqs-categories lag [alerts] - https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916)
[21:45:27] <wikibugs>	 (CR) CI reject: [V:-1] team-search-platform: Add alert for wdqs-categories lag [alerts] - https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: Bking)
[21:58:53] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:59:43] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T2200)
[22:09:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10414308 (phaultfinder)
[22:14:50] <wikibugs>	 (CR) Ebernhardson: "recheck" [alerts] - https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: Bking)
[22:19:44] <jinxer-wm>	 FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[22:19:44] <jinxer-wm>	 FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[22:58:09] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es1043.eqiad.wmnet with OS bookworm
[22:58:18] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10414371 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm
[23:09:31] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:41:54] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1043.eqiad.wmnet with OS bookworm
[23:42:05] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10414497 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm...
[23:56:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed