[00:04:25] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [00:06:10] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [00:29:56] hello friends! getting a lot of replag on enwiki right now. since wikimediastatus.net isn't showing any issues, just checking in to see if that's something y'all're aware of :) [00:38:17] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1105092 [00:38:17] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1105092 (owner: 10TrainBranchBot) [00:39:55] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:49:02] (03PS1) 10Tim Starling: Revert "Use PHP type declarations" [extensions/TimedMediaHandler] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105097 (https://phabricator.wikimedia.org/T382385) [00:55:43] (03CR) 10Tim Starling: [C:03+2] Revert "Use PHP type declarations" [extensions/TimedMediaHandler] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105097 (https://phabricator.wikimedia.org/T382385) (owner: 10Tim Starling) [00:58:17] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1105092 (owner: 10TrainBranchBot) [01:09:07] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1105099 [01:09:07] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1105099 (owner: 10TrainBranchBot) [01:14:36] created https://phabricator.wikimedia.org/T382388 for the problem Tamzin mentioned since it's user-visible 
[01:15:30] (03Merged) 10jenkins-bot: Revert "Use PHP type declarations" [extensions/TimedMediaHandler] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105097 (https://phabricator.wikimedia.org/T382385) (owner: 10Tim Starling) [01:17:40] (03Abandoned) 10Cwhite: dashboards: sudo set noninteractive flag [puppet] - 10https://gerrit.wikimedia.org/r/888740 (https://phabricator.wikimedia.org/T329688) (owner: 10Cwhite) [01:18:07] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1105097|Revert "Use PHP type declarations" (T382385)]] [01:18:11] T382385: Typed property MediaWiki\\TimedMediaHandler\\WebVideoTranscode\\WebVideoTranscodeJob::$targetEncodeFile must not be accessed before initialization - https://phabricator.wikimedia.org/T382385 [01:25:08] (03CR) 10Cwhite: "These are CI test fixtures. The host number does not need to be a real host, but should be close enough to adequately exercise the logsta" [puppet] - 10https://gerrit.wikimedia.org/r/1105015 (owner: 10Eevans) [01:26:32] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1105097|Revert "Use PHP type declarations" (T382385)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [01:26:36] T382385: Typed property MediaWiki\\TimedMediaHandler\\WebVideoTranscode\\WebVideoTranscodeJob::$targetEncodeFile must not be accessed before initialization - https://phabricator.wikimedia.org/T382385 [01:27:06] !log tstarling@deploy2002 tstarling: Continuing with sync [01:27:14] 06SRE, 06Infrastructure-Foundations: Improve how we generate DNS entries from Netbox - https://phabricator.wikimedia.org/T362985#10411469 (10cmooney) [01:30:25] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1105099 (owner: 10TrainBranchBot) [01:32:42] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105097|Revert "Use PHP type declarations" (T382385)]] (duration: 14m 35s) [01:32:47] T382385: Typed property 
MediaWiki\\TimedMediaHandler\\WebVideoTranscode\\WebVideoTranscodeJob::$targetEncodeFile must not be accessed before initialization - https://phabricator.wikimedia.org/T382385 [01:55:12] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [02:17:38] (03PS10) 10Aleksandar Mastilovic: dse-k8s-services: introduce Blunderbuss config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [02:18:24] (03CR) 10Aleksandar Mastilovic: "Thanks for the comments brouberol! I've updated the merge request." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) (owner: 10Bking) [02:19:43] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [02:19:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:19] PROBLEM - Hadoop HistoryServer on an-master1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [03:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:10:19] RECOVERY - Hadoop HistoryServer on an-master1003 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Mapreduce_Historyserver_process [03:24:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:12:33] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [04:13:54] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [04:59:13] !log tstarling@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [05:00:43] !log tstarling@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [05:00:55] !log tstarling@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [05:01:29] !log tstarling@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [05:13:47] FIRING: HelmReleaseBadStatus: Helm release mw-videoscaler/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-videoscaler - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus 
[05:30:27] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T382392 (10phaultfinder) 03NEW [05:40:01] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:42:34] (03CR) 10Giuseppe Lavagetto: "While the patch might be correct, I should note that we need the same patch in the mediawiki helm chart for it to have any effect in produc" [puppet] - 10https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T382357) (owner: 10Bartosz Dziewoński) [05:55:12] FIRING: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [06:19:44] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [06:19:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [06:21:23] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 69, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T0700) [07:24:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:34:27] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for remaining WMCS roles [puppet] - 10https://gerrit.wikimedia.org/r/1104945 (owner: 10Muehlenhoff) [07:34:31] (03PS1) 10Slyngshede: Release v0.1.6 [software/bitu] - 10https://gerrit.wikimedia.org/r/1105269 [07:39:12] (03CR) 10Arnaudb: [C:03+1] ownership: Data Persistence cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104951 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [07:39:21] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1105269 (owner: 10Slyngshede) [07:42:06] (03PS1) 10Muehlenhoff: Deprecate system::role for Ceph roles [puppet] - 10https://gerrit.wikimedia.org/r/1105270 [07:42:09] (03CR) 10Slyngshede: [C:03+2] Release v0.1.6 [software/bitu] - 10https://gerrit.wikimedia.org/r/1105269 (owner: 10Slyngshede) [07:44:49] (03PS1) 10Muehlenhoff: Remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/1105271 [07:46:33] (03Merged) 10jenkins-bot: Release v0.1.6 [software/bitu] - 10https://gerrit.wikimedia.org/r/1105269 (owner: 10Slyngshede) [07:55:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105271 (owner: 10Muehlenhoff) [07:57:46] (03PS1) 10Slyngshede: IDM - 0.1.6 update [dns] - 10https://gerrit.wikimedia.org/r/1105273 [07:59:09] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1105273 (owner: 10Slyngshede) [08:00:05] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T0800). Please do the needful. [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:00:41] (03CR) 10Slyngshede: [C:03+2] IDM - 0.1.6 update [dns] - 10https://gerrit.wikimedia.org/r/1105273 (owner: 10Slyngshede) [08:03:28] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2070-2071].codfw.wmnet [08:04:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2070-2071].codfw.wmnet [08:09:02] (03PS3) 10Volans: ownership: Traffic cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) [08:09:16] (03CR) 10Volans: ownership: Traffic cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [08:10:00] (03CR) 10Volans: [C:03+2] ownership: Data Persistence cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104951 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [08:10:43] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2071.codfw.wmnet with OS bookworm [08:10:43] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2070.codfw.wmnet with OS bookworm [08:11:02] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2070 [08:11:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2070 [08:11:03] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2071 [08:11:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2071 [08:11:27] FIRING: [5x] SystemdUnitCrashLoop: logstash.service crashloop on elastic2055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [08:14:27] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - 
kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:15:54] (03Merged) 10jenkins-bot: ownership: Data Persistence cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104951 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [08:16:27] RESOLVED: [7x] SystemdUnitCrashLoop: logstash.service crashloop on elastic2055:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [08:28:33] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2070.codfw.wmnet with reason: host reimage [08:28:52] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2071.codfw.wmnet with reason: host reimage [08:29:41] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [08:30:10] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [08:31:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2070.codfw.wmnet with reason: host reimage [08:33:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-videoscaler/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-videoscaler - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:35:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2071.codfw.wmnet with reason: host reimage [08:48:48] (03CR) 10JMeybohm: [C:03+1] ownership: ServiceOps cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104953 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [08:51:13] !log jelto@cumin1002 END (PASS) - Cookbook 
sre.hosts.reimage (exit_code=0) for host wikikube-worker2070.codfw.wmnet with OS bookworm [08:54:16] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:54:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2071.codfw.wmnet with OS bookworm [08:55:20] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2070.codfw.wmnet [08:55:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2070.codfw.wmnet [08:56:50] (03CR) 10Filippo Giunchedi: [C:03+1] thanos-store: enable caching bucket [puppet] - 10https://gerrit.wikimedia.org/r/1105037 (https://phabricator.wikimedia.org/T368953) (owner: 10Herron) [08:57:48] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2068-2069].codfw.wmnet [08:58:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2068-2069].codfw.wmnet [08:59:11] (03CR) 10Muehlenhoff: [C:03+2] Enable management of cn=wmf for production IDMs [puppet] - 10https://gerrit.wikimedia.org/r/1104970 (owner: 10Muehlenhoff) [08:59:37] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2069.codfw.wmnet with OS bookworm [08:59:38] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2068.codfw.wmnet with OS bookworm [08:59:56] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2069 [08:59:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2069 [08:59:57] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2068 [08:59:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2068 [09:03:16] PROBLEM 
- BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:05:02] (03CR) 10Filippo Giunchedi: [C:03+1] "Untested but LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105034 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [09:10:10] RECOVERY - Disk space on ml-lab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ml-lab1001&var-datasource=eqiad+prometheus/ops [09:12:49] 06SRE, 06Infrastructure-Foundations, 10netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396 (10fgiunchedi) 03NEW [09:13:36] 06SRE, 06Infrastructure-Foundations, 10netops: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10411780 (10fgiunchedi) No worries at all @cmooney, I've opened {T382396} to investigate/followup on the two issues you mentioned [09:15:17] (03CR) 10Elukey: [C:03+2] charts: improve Kartotherian metrics and monitoring config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105034 (https://phabricator.wikimedia.org/T216826) (owner: 10Elukey) [09:15:54] !log restart wedged swift stats jobs on ms-fe2009 [09:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:34] (03CR) 10Volans: [C:03+2] ownership: ServiceOps cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104953 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [09:17:13] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2069.codfw.wmnet with reason: host reimage [09:17:24] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2068.codfw.wmnet with 
reason: host reimage [09:18:08] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: sync [09:20:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2069.codfw.wmnet with reason: host reimage [09:20:12] RESOLVED: [4x] SwiftObjectCountSiteDisparity: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [09:23:16] (03Merged) 10jenkins-bot: ownership: ServiceOps cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1104953 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [09:23:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2068.codfw.wmnet with reason: host reimage [09:28:27] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: sync [09:29:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10411821 (10phaultfinder) [09:32:57] (03CR) 10Hashar: [V:03+1] "I have cherry picked the change on `puppetmaster-1003.devtools.eqiad1.wikimedia.cloud` and Puppet is passing now." 
[puppet] - 10https://gerrit.wikimedia.org/r/1104957 (https://phabricator.wikimedia.org/T363415) (owner: 10Hashar) [09:33:05] (03CR) 10Hashar: [C:03+1] devtools: fix hiera after host renaming [puppet] - 10https://gerrit.wikimedia.org/r/1104957 (https://phabricator.wikimedia.org/T363415) (owner: 10Hashar) [09:40:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2069.codfw.wmnet with OS bookworm [09:41:24] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:43:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2068.codfw.wmnet with OS bookworm [09:44:05] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2068.codfw.wmnet [09:44:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2068.codfw.wmnet [09:54:34] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2069.codfw.wmnet [09:54:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2069.codfw.wmnet [09:54:52] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2071.codfw.wmnet [09:54:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2071.codfw.wmnet [09:55:55] (03PS12) 10JMeybohm: create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [09:57:01] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[2036-2039].codfw.wmnet [09:58:38] (03PS2) 10Abijeet Patro: Translate: Enable message group subscription by default [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) [09:59:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[2036-2039].codfw.wmnet [09:59:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [10:02:24] (03CR) 10Jelto: [C:03+2] Rename kubernetes20[36-39] to wikikube-worker20(47|66|85|86) [puppet] - 10https://gerrit.wikimedia.org/r/1102710 (https://phabricator.wikimedia.org/T379788) (owner: 10Jelto) [10:04:30] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2036 to wikikube-worker2047 [10:04:49] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.rename (exit_code=93) from kubernetes2036 to wikikube-worker2047 [10:05:46] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Idle - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:07:27] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:07:59] jouncebot: nowandnext [10:07:59] No deployments 
scheduled for the next 0 hour(s) and 52 minute(s) [10:07:59] In 0 hour(s) and 52 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1100) [10:08:30] I'm going to do a sync-world to build new images and test mw-videoscaler rollout logic [10:08:48] (03CR) 10Hnowlan: [C:03+2] Revert^2 "kubernetes: add mw-videoscaler to scap deployments" [puppet] - 10https://gerrit.wikimedia.org/r/1104985 (owner: 10Hnowlan) [10:09:10] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1050.eqiad.wmnet [10:09:11] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1050.eqiad.wmnet [10:09:34] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1275.eqiad.wmnet [10:09:35] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1275.eqiad.wmnet [10:10:09] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node check for host wikikube-worker1290.eqiad.wmnet [10:10:10] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) check for host wikikube-worker1290.eqiad.wmnet [10:10:24] (03CR) 10FNegri: "I'm reading the discussion for upstream BUG #18349 [0] and it looks like their workaround was to "increase work_mem to 16MB", which is a s" [puppet] - 10https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: 10Andrew Bogott) [10:11:02] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1050.eqiad.wmnet [10:11:03] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1050.eqiad.wmnet [10:11:32] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1275.eqiad.wmnet [10:11:33] !log jayme@cumin1002 END (PASS) - Cookbook 
sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker1275.eqiad.wmnet [10:12:42] !log jayme@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker1290.eqiad.wmnet [10:13:31] !log jayme@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) pool for host wikikube-worker1290.eqiad.wmnet [10:15:34] (03CR) 10FNegri: "(ignore my previous "Look" comment, I didn't mean to send it)" [puppet] - 10https://gerrit.wikimedia.org/r/1105020 (https://phabricator.wikimedia.org/T381548) (owner: 10Andrew Bogott) [10:15:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on kubernetes2037:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:19:26] 06SRE, 06Infrastructure-Foundations, 10netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10411938 (10cmooney) > What is the dashboard and the underlying expression in the graph above? That one came from here I think: https://grafana.wikimedia.org/g... 
[10:19:44] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [10:19:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [10:20:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 4 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10411941 (10JMeybohm) [10:20:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 4 others: Reimage wikikube-worker1290 in eqiad as a replacement for wikikube-ctrl1001 - https://phabricator.wikimedia.org/T379790#10411942 (10JMeybohm) [10:22:30] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1007,1021,1080,1287].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [10:22:44] !log hnowlan@deploy2002 Started scap sync-world: Rebuild and deploy to test mw-videoscaler integration [10:24:15] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1007.eqiad.wmnet with OS bookworm [10:28:00] PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:28:37] (03PS1) 10Wangombe: Event logging: update schemaId [extensions/Translate] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105283 (https://phabricator.wikimedia.org/T364460) [10:29:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for 
deployment in the [Wednesday, December 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/Translate] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105283 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [10:29:51] (03PS1) 10Jelto: Rename kubernetes20[36-39] to wikikube-worker20[88-91] [puppet] - 10https://gerrit.wikimedia.org/r/1105284 (https://phabricator.wikimedia.org/T379788) [10:41:05] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1007.eqiad.wmnet with reason: host reimage [10:44:51] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1007.eqiad.wmnet with reason: host reimage [10:47:44] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for LorenMora - https://phabricator.wikimedia.org/T382377#10412012 (10MoritzMuehlenhoff) Earlier today we merged a patch which enables the request of cn=wmf within Wikimedia IDM, so in the future for such requests we no longer need a Phabricator task, but... [10:56:04] 06SRE, 10Add-Link, 10Growth-Structured-Tasks, 06Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404 (10Michael) 03NEW [10:58:01] 06SRE, 10Add-Link, 10Growth-Structured-Tasks, 06Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404#10412028 (10Michael) (This is technically not a "service-deployment-request" because the ser... 
[10:58:10] !log hnowlan@deploy2002 Finished scap sync-world: Rebuild and deploy to test mw-videoscaler integration (duration: 36m 40s) [10:58:40] I'll be doing another sync-world [10:58:45] (03PS2) 10Jelto: Rename kubernetes20[36-39] to wikikube-worker21[88-91] [puppet] - 10https://gerrit.wikimedia.org/r/1105284 (https://phabricator.wikimedia.org/T379788) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1100) [11:02:55] (03CR) 10Nikerabbit: [C:03+1] Translate: Enable message group subscription by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [11:03:02] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:03:20] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1007.eqiad.wmnet with OS bookworm [11:04:22] (03PS3) 10Jelto: Rename kubernetes20[36-39] to wikikube-worker21[88-91] [puppet] - 10https://gerrit.wikimedia.org/r/1105284 (https://phabricator.wikimedia.org/T379788) [11:04:31] FIRING: [4x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:05:10] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes20[36-39] to wikikube-worker21[88-91] [puppet] - 10https://gerrit.wikimedia.org/r/1105284 (https://phabricator.wikimedia.org/T379788) (owner: 10Jelto) [11:06:38] (03CR) 10Jelto: [C:03+2] Rename kubernetes20[36-39] to wikikube-worker21[88-91] [puppet] - 10https://gerrit.wikimedia.org/r/1105284 (https://phabricator.wikimedia.org/T379788) (owner: 10Jelto) [11:07:44] !log jayme@cumin1002 START - Cookbook 
sre.hosts.reimage for host wikikube-worker1021.eqiad.wmnet with OS bookworm [11:08:12] (03CR) 10Santiago Faci: [C:03+2] "Looks good!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101898 (https://phabricator.wikimedia.org/T356939) (owner: 10Clare Ming) [11:08:56] (03Merged) 10jenkins-bot: Remove extraneous config for Metrics Platform instruments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101898 (https://phabricator.wikimedia.org/T356939) (owner: 10Clare Ming) [11:09:31] FIRING: [4x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:10:40] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2036 to wikikube-worker2188 [11:10:50] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [11:11:02] PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:14:41] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2036 to wikikube-worker2188 - jelto@cumin1002" [11:15:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2036 to wikikube-worker2188 - jelto@cumin1002" [11:15:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:15:02] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2188 [11:15:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2188 [11:15:55] 
!log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2036 to wikikube-worker2188 [11:19:59] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2037 to wikikube-worker2189 [11:20:19] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [11:23:54] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2037 to wikikube-worker2189 - jelto@cumin1002" [11:24:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2037 to wikikube-worker2189 - jelto@cumin1002" [11:24:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:24:24] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2189 [11:25:01] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1021.eqiad.wmnet with reason: host reimage [11:25:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2189 [11:26:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2037 to wikikube-worker2189 [11:27:25] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2038 to wikikube-worker2190 [11:27:46] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [11:28:36] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1021.eqiad.wmnet with reason: host reimage [11:31:24] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2038 to wikikube-worker2190 - jelto@cumin1002" [11:32:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox 
hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2038 to wikikube-worker2190 - jelto@cumin1002" [11:32:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:32:42] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2190 [11:33:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2190 [11:34:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2038 to wikikube-worker2190 [11:35:03] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes2039 to wikikube-worker2191 [11:35:24] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [11:39:06] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2039 to wikikube-worker2191 - jelto@cumin1002" [11:39:52] 06SRE, 06Infrastructure-Foundations, 10netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10412135 (10cmooney) @fgiunchedi yeah I'm pretty sure it's only gaps in the data we are seeing, for instance here: https://grafana.wikimedia.org/goto/_GSV1TIHR... 
[11:40:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes2039 to wikikube-worker2191 - jelto@cumin1002" [11:40:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:40:12] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2191 [11:40:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2191 [11:41:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes2039 to wikikube-worker2191 [11:41:17] jouncebot: now [11:41:17] For the next 0 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1100) [11:41:20] jouncebot: nextandnow [11:41:23] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2188.codfw.wmnet wikikube-worker2189.codfw.wmnet wikikube-worker2190.codfw.wmnet wikikube-worker2191.codfw.wmnet on all recursors [11:41:26] jouncebot: nowandnext [11:41:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2188.codfw.wmnet wikikube-worker2189.codfw.wmnet wikikube-worker2190.codfw.wmnet wikikube-worker2191.codfw.wmnet on all recursors [11:41:27] For the next 0 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1100) [11:41:27] In 0 hour(s) and 18 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1200) [11:42:13] hashar: if you're planning to run scap, for the next hour or so there will be timestamp diffs for mw-videoscaler that can be ignored. 
I'm working on a fix [11:42:29] I am going to restart Gerrit a couple times to shrink some H2 caches before the holidays ( T323754 ) [11:42:30] T323754: Investigate Gerrit h2 cache being way too large - https://phabricator.wikimedia.org/T323754 [11:42:40] hnowlan: +1 :) [11:45:58] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2189.codfw.wmnet with OS bookworm [11:45:59] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2188.codfw.wmnet with OS bookworm [11:46:09] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2188 [11:46:34] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [11:47:04] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:47:32] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1021.eqiad.wmnet with OS bookworm [11:49:15] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1080.eqiad.wmnet with OS bookworm [11:50:29] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2188 - jelto@cumin1002" [11:50:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2188 - jelto@cumin1002" [11:50:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:50:33] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2188.codfw.wmnet 169.32.192.10.in-addr.arpa 9.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:50:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2188.codfw.wmnet 
169.32.192.10.in-addr.arpa 9.6.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:50:37] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2188 [11:52:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2188 [11:52:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2188 [11:52:41] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2189 [11:52:53] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [11:56:33] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2189 - jelto@cumin1002" [11:56:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2189 - jelto@cumin1002" [11:56:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:56:38] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2189.codfw.wmnet 170.32.192.10.in-addr.arpa 0.7.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:56:41] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2189.codfw.wmnet 170.32.192.10.in-addr.arpa 0.7.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [11:56:41] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2189 [11:57:05] (03PS1) 10Elukey: charts: improve Kartotherian's statsd config (part 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408) [11:57:22] !log jelto@cumin1002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2189 [11:57:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2189 [11:58:19] (03PS2) 10Elukey: charts: improve Kartotherian's statsd config (part 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408) [11:58:53] (03CR) 10Lucas Werkmeister (WMDE): Translate: Enable message group subscription by default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [12:00:05] mvolz: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Citoid / Zotero . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1200). [12:01:13] (03CR) 10Abijeet Patro: Translate: Enable message group subscription by default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [12:02:14] 06SRE, 10Add-Link, 10Growth-Structured-Tasks, 06Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404#10412181 (10akosiaris) Thanks for this writeup. Couple of comments below. * If not already,... 
[12:02:35] (03CR) 10Lucas Werkmeister (WMDE): Translate: Enable message group subscription by default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [12:03:01] (03PS1) 10Hnowlan: mw-videoscaler: add dummy value for timestamp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105297 (https://phabricator.wikimedia.org/T371700) [12:03:26] 06SRE, 10Add-Link, 10Growth-Structured-Tasks, 06Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404#10412183 (10akosiaris) Moving to #serviceops-radar since there isn't something specific acti... [12:04:52] (03CR) 10Muehlenhoff: [C:03+2] Blacklist squashfs [puppet] - 10https://gerrit.wikimedia.org/r/1104968 (owner: 10Muehlenhoff) [12:09:49] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers wikikube-ctrl2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:09:50] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1080.eqiad.wmnet with reason: host reimage [12:10:01] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers wikikube-ctrl2003.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:10:49] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:10:53] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2188.codfw.wmnet with reason: host reimage [12:11:01] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:12:46] (03CR) 10Btullis: [C:03+1] ownership: Data Platform cookbooks (031 comment) 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1104950 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [12:12:56] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104975 (owner: 10PipelineBot) [12:13:08] (03CR) 10Btullis: [C:03+1] "Looks good to me. Thanks." [cookbooks] - 10https://gerrit.wikimedia.org/r/1104950 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [12:13:09] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101494 (owner: 10PipelineBot) [12:14:14] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1104975 (owner: 10PipelineBot) [12:14:51] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1080.eqiad.wmnet with reason: host reimage [12:15:47] (03CR) 10Volans: "Thanks, reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1104950 (https://phabricator.wikimedia.org/T379258) (owner: 10Volans) [12:16:33] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2189.codfw.wmnet with reason: host reimage [12:17:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2188.codfw.wmnet with reason: host reimage [12:17:59] (03CR) 10Giuseppe Lavagetto: [C:03+1] mw-videoscaler: add dummy value for timestamp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105297 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [12:18:15] !log mvolz@deploy2002 helmfile [staging] START helmfile.d/services/citoid: apply [12:20:08] !log mvolz@deploy2002 helmfile [staging] DONE helmfile.d/services/citoid: apply [12:20:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2189.codfw.wmnet with reason: host reimage [12:21:04] (03PS1) 10Urbanecm: [Growth] Disable 
Surfacing Add Link tasks on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105302 (https://phabricator.wikimedia.org/T382037) [12:21:52] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:22:26] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:22:28] !log mvolz@deploy2002 helmfile [codfw] START helmfile.d/services/citoid: apply [12:22:30] !log mvolz@deploy2002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [12:23:37] !log mvolz@deploy2002 helmfile [eqiad] START helmfile.d/services/citoid: apply [12:25:12] !log mvolz@deploy2002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [12:27:47] (03CR) 10Hnowlan: [C:03+2] mw-videoscaler: add dummy value for timestamp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105297 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [12:28:38] (03PS1) 10Muehlenhoff: Deprecate system::role for Cloud VPS-specific Puppet roles [puppet] - 10https://gerrit.wikimedia.org/r/1105304 [12:28:48] (03Merged) 10jenkins-bot: mw-videoscaler: add dummy value for timestamp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105297 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [12:31:21] (03PS1) 10Muehlenhoff: Deprecate system::role for phab/mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1105305 [12:31:31] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [12:31:45] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [12:35:30] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for phab/mariadb [puppet] - 10https://gerrit.wikimedia.org/r/1105305 (owner: 10Muehlenhoff) [12:36:04] (03PS1) 10Hnowlan: mw-videoscaler: correct dummy timestamp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105306 (https://phabricator.wikimedia.org/T371700) [12:36:18] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
wikikube-worker1080.eqiad.wmnet with OS bookworm [12:36:36] I am restarting Gerrit now [12:36:46] it is quite fast to come back [12:37:21] Dec 18 12:37:08 gerrit1003 systemd[1]: gerrit.service: Consumed 4month 1d 18h 44min 12.301s CPU time. [12:37:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2188.codfw.wmnet with OS bookworm [12:38:05] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1287.eqiad.wmnet with OS bookworm [12:38:55] that was since October 22 [12:41:10] (03PS1) 10Muehlenhoff: Deprecate system::role for Druid roles [puppet] - 10https://gerrit.wikimedia.org/r/1105326 [12:41:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2189.codfw.wmnet with OS bookworm [12:41:45] !log Restarted Gerrit at 12:37:08 UTC [12:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:56] PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:42:43] (03PS3) 10Abijeet Patro: Translate: Enable message group subscription by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) [12:43:55] (03CR) 10Abijeet Patro: Translate: Enable message group subscription by default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [12:44:24] (03PS2) 10Hnowlan: mw-videoscaler: correct dummy timestamp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105306 (https://phabricator.wikimedia.org/T371700) [12:45:03] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2190.codfw.wmnet with OS bookworm [12:45:08] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host 
wikikube-worker2191.codfw.wmnet with OS bookworm [12:45:14] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2190 [12:45:19] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [12:48:51] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2190 - jelto@cumin1002" [12:48:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2190 - jelto@cumin1002" [12:48:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:48:56] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2190.codfw.wmnet 171.32.192.10.in-addr.arpa 1.7.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:48:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2190.codfw.wmnet 171.32.192.10.in-addr.arpa 1.7.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:49:00] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2190 [12:49:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2190 [12:49:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2190 [12:49:53] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2191 [12:51:09] 06SRE, 10Add-Link, 10Growth-Structured-Tasks, 06Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404#10412293 (10Michael) Thank you for the very quick response! >>! In T382404#10412181, @ako... 
[12:53:53] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [12:54:14] (03PS1) 10Muehlenhoff: Deprecate remaining uses of system::role [puppet] - 10https://gerrit.wikimedia.org/r/1105329 [12:54:26] (03PS1) 10Btullis: Add an-worker106[5-9] to puppet [puppet] - 10https://gerrit.wikimedia.org/r/1105330 (https://phabricator.wikimedia.org/T382410) [12:54:34] (03CR) 10CI reject: [V:04-1] Deprecate remaining uses of system::role [puppet] - 10https://gerrit.wikimedia.org/r/1105329 (owner: 10Muehlenhoff) [12:54:35] (03CR) 10Hnowlan: [C:03+2] mw-videoscaler: correct dummy timestamp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105306 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [12:55:57] (03Merged) 10jenkins-bot: mw-videoscaler: correct dummy timestamp [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105306 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [12:57:08] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [12:57:21] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [12:57:23] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2191 - jelto@cumin1002" [12:57:27] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2191 - jelto@cumin1002" [12:57:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:57:28] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2191.codfw.wmnet 172.32.192.10.in-addr.arpa 2.7.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:57:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2191.codfw.wmnet 172.32.192.10.in-addr.arpa 
2.7.1.0.2.3.0.0.2.9.1.0.0.1.0.0.3.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [12:57:31] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2191 [12:57:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2191 [12:57:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2191 [12:58:16] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1287.eqiad.wmnet with reason: host reimage [13:00:53] (03CR) 10Btullis: [C:03+2] Add an-worker106[5-9] to puppet [puppet] - 10https://gerrit.wikimedia.org/r/1105330 (https://phabricator.wikimedia.org/T382410) (owner: 10Btullis) [13:00:55] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [13:01:00] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [13:01:05] (03PS2) 10Muehlenhoff: Deprecate remaining uses of system::role [puppet] - 10https://gerrit.wikimedia.org/r/1105329 [13:01:20] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: apply [13:01:24] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: apply [13:01:44] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-videoscaler: sync [13:01:54] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-videoscaler: sync [13:02:00] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1287.eqiad.wmnet with reason: host reimage [13:04:58] 10ops-eqiad, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks - https://phabricator.wikimedia.org/T382412 (10Andrew) 03NEW [13:07:23] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/mw-videoscaler: apply [13:07:28] !log hnowlan@deploy2002 
helmfile [codfw] DONE helmfile.d/services/mw-videoscaler: apply [13:12:49] !log installing curl security updates [13:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:04] RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:14:24] (03PS1) 10Andrew Bogott: cloud-vps dns recursors: increase # of threads x 3 [puppet] - 10https://gerrit.wikimedia.org/r/1105332 (https://phabricator.wikimedia.org/T374830) [13:16:22] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2191.codfw.wmnet with reason: host reimage [13:17:01] jouncebot: nowandnext [13:17:01] No deployments scheduled for the next 0 hour(s) and 42 minute(s) [13:17:01] In 0 hour(s) and 42 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1400) [13:17:06] PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:17:10] I'll be doing just one more scap [13:17:51] !log hnowlan@deploy2002 Started scap sync-world: Rebuild and deploy to test mw-videoscaler integration one last time [13:19:06] RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:19:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2191.codfw.wmnet with reason: host reimage [13:21:25] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1287.eqiad.wmnet with OS bookworm [13:21:27] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1007,1021,1080,1287].eqiad.wmnet} and 
(A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [13:21:49] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105332 (https://phabricator.wikimedia.org/T374830) (owner: 10Andrew Bogott) [13:23:24] (03CR) 10Andrew Bogott: [C:03+2] cloud-vps dns recursors: increase # of threads x 3 [puppet] - 10https://gerrit.wikimedia.org/r/1105332 (https://phabricator.wikimedia.org/T374830) (owner: 10Andrew Bogott) [13:24:33] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1280-1284].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [13:24:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10412364 (10phaultfinder) [13:26:12] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1280.eqiad.wmnet with OS bookworm [13:26:41] !log jayme@cumin1002 START - Cookbook sre.k8s.roll-reimage-nodes rolling reimage on P{wikikube-worker[1291-1295].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [13:28:21] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1291.eqiad.wmnet with OS bookworm [13:30:04] PROBLEM - BGP status on lsw1-e6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:32:04] PROBLEM - BGP status on lsw1-f5-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:36:10] (03PS1) 10KartikMistry: CX3 Build 0.2.0+20241218 [extensions/ContentTranslation] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105336 (https://phabricator.wikimedia.org/T380702) [13:36:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 18 UTC afternoon 
backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/ContentTranslation] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105336 (https://phabricator.wikimedia.org/T380702) (owner: 10KartikMistry) [13:37:11] FIRING: ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:37:30] 06SRE, 06Infrastructure-Foundations, 10netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10412405 (10fgiunchedi) Indeed the underlying data/samples are there as expected: I tested this theory by removing all functions and look at the raw data, which... [13:37:46] !log installing waitress security updates [13:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2191.codfw.wmnet with OS bookworm [13:39:09] (03CR) 10Filippo Giunchedi: [C:03+1] Deprecate system::role for Cloud VPS-specific Puppet roles [puppet] - 10https://gerrit.wikimedia.org/r/1105304 (owner: 10Muehlenhoff) [13:39:16] RESOLVED: ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:44:51] (03PS1) 10Wangombe: Event logging: pass empty object to translation property [extensions/Translate] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105341 (https://phabricator.wikimedia.org/T364460) [13:44:55] !log installing jinja2 security 
updates [13:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [extensions/Translate] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105341 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [13:45:39] (03CR) 10Abijeet Patro: [C:03+1] Event logging: pass empty object to translation property [extensions/Translate] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105341 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [13:46:16] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1280.eqiad.wmnet with reason: host reimage [13:48:32] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1291.eqiad.wmnet with reason: host reimage [13:49:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1280.eqiad.wmnet with reason: host reimage [13:51:16] hnowlan: Congratulations on the k8s videoscalers completion. [13:52:36] James_F: thanks! 
[13:52:45] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1291.eqiad.wmnet with reason: host reimage [13:53:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/Translate] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105341 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [13:55:38] (03PS1) 10Cathal Mooney: Validators: Allow an interface to be called just "irb" on a device [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1105346 (https://phabricator.wikimedia.org/T371088) [13:57:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/Translate] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105283 (https://phabricator.wikimedia.org/T364460) (owner: 10Wangombe) [14:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1400). [14:00:04] abijeet and kart_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:24] o/ [14:00:41] here [14:00:44] hi Lucas_WMDE [14:00:54] I can deploy :) [14:00:55] I can deploy both changes.. [14:00:59] or that ^^ [14:01:05] :-) [14:01:05] :) [14:01:09] RECOVERY - BGP status on lsw1-e6-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:01:12] Let me start.. :) [14:01:19] sure! 
[14:01:37] (03CR) 10Lucas Werkmeister (WMDE): Translate: Enable message group subscription by default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:01:48] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "(LGTM otherwise)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:01:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:02:20] just a heads-up, I hit a timeout when syncing to the canaries earlier. Hopefully a transient thing but just wanted to warn [14:02:39] (03Merged) 10jenkins-bot: Translate: Enable message group subscription by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105279 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [14:03:52] ah. undeployed change! [14:03:57] 14:02:58 The following are unexpected commits pulled from origin for /srv/mediawiki-staging: [14:03:58] commit 9fdd7062ecc9fdd006d5a8291da3db623f1a219e [14:03:58] Author: Clare Ming [14:03:58] Date: Tue Dec 10 08:52:12 2024 -0700 [14:03:58] Remove extraneous config for Metrics Platform instruments [14:03:58] - AgentData properties are required by the client library [14:03:59] so they can be removed from producer config [14:03:59] Bug: T356939 [14:04:00] T356939: [Java] Make all AgentData properties required - https://phabricator.wikimedia.org/T356939 [14:04:00] Change-Id: Ibf10b59135bc2f95ac55b5cb43cb5c3a79c6c910 [14:04:21] hm [14:04:29] Anyone aware about this? 
[14:04:29] was +2ed normally, not via TrainBranchBot [14:04:45] pinging sfaci [14:05:11] PROBLEM - BGP status on lsw1-e6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:06:11] RECOVERY - BGP status on lsw1-e6-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:06:44] sfaci: OK to go with this? [14:08:14] (03PS3) 10Elukey: charts: improve Kartotherian's statsd config (part 2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408) [14:09:22] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1280.eqiad.wmnet with OS bookworm [14:09:27] ah. We need to decide quick. Lucas_WMDE what else we can do in such cases? [14:09:43] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2190.codfw.wmnet with OS bookworm [14:10:07] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2190.codfw.wmnet with OS bookworm [14:10:09] I’ve pinged them on slack, let’s see if that works better [14:10:11] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2190 [14:10:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2190 [14:10:19] though I’m not quite sure why we need to decide quickly, anything specific? 
[14:11:06] if we don’t hear back from them, I’d go for reverting the undeployed change [14:11:09] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1281.eqiad.wmnet with OS bookworm [14:11:11] RECOVERY - BGP status on lsw1-f5-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:11:14] seems safer than rolling it out when we don’t know how to test it [14:11:14] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for Cloud VPS-specific Puppet roles [puppet] - 10https://gerrit.wikimedia.org/r/1105304 (owner: 10Muehlenhoff) [14:11:25] Lucas_WMDE: because we've one more backport patch ahead, which will take time as well ;) [14:11:41] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1291.eqiad.wmnet with OS bookworm [14:12:08] Lucas_WMDE: window is of one hour, and there is a life after the deployment (ie dinner ;)) [14:12:59] okay, but it’s not “production will explode” urgent, just wanted to check that ;) [14:13:25] what do you think about reverting vs. rolling out the undeployed change? 
[14:13:28] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1292.eqiad.wmnet with OS bookworm [14:14:27] (03CR) 10Herron: [C:03+2] thanos-store: enable caching bucket [puppet] - 10https://gerrit.wikimedia.org/r/1105037 (https://phabricator.wikimedia.org/T368953) (owner: 10Herron) [14:15:11] PROBLEM - BGP status on lsw1-e6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:17:01] kart_: it’s been 12 minutes since we pinged them on IRC, I’d say let’s go ahead with deploying [14:17:08] and, unless you disagree, let’s do that by reverting their change [14:17:11] PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:17:15] and then they’ll have to deploy it correctly later [14:17:34] Let's go ahead. If something goes wrong we can revert. [14:18:42] so you’re saying don’t revert it? 
[14:18:53] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1105279|Translate: Enable message group subscription by default (T372386)]] [14:18:57] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [14:19:14] (03CR) 10KartikMistry: [C:03+2] CX3 Build 0.2.0+20241218 [extensions/ContentTranslation] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105336 (https://phabricator.wikimedia.org/T380702) (owner: 10KartikMistry) [14:19:27] I've +2 my patch ahead ^^ [14:19:44] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable [14:19:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable [14:19:51] Lucas_WMDE: yes [14:20:51] ok [14:21:15] (03CR) 10Elukey: "Left some questions just to understand, if those are no concerns feel free to proceed :)" [puppet] - 10https://gerrit.wikimedia.org/r/1105329 (owner: 10Muehlenhoff) [14:22:04] 06SRE, 10Add-Link, 10Growth-Structured-Tasks, 06Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404#10412551 (10Urbanecm_WMF) > Since I am on the guesstimations part, same things for requests... [14:23:13] (03CR) 10Elukey: [C:03+1] "LGTM! Optional: I am wondering if our future-selves will benefit of a one line explanation before the if, so that no git blame/etc.. 
is ne" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1105346 (https://phabricator.wikimedia.org/T371088) (owner: 10Cathal Mooney) [14:23:58] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1256.eqiad.wmnet with OS bookworm [14:24:22] 06SRE, 10Add-Link, 10Growth-Structured-Tasks, 06Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404#10412553 (10Urbanecm_WMF) > Related to that, are there any very rough guesstimations about w... [14:28:33] (03PS1) 10Volans: api: allow to abort before run() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1105351 (https://phabricator.wikimedia.org/T365454) [14:30:56] https://www.irccloud.com/pastebin/0JBvWT57/ [14:31:07] hnowlan: ^^ is this known? [14:31:23] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1281.eqiad.wmnet with reason: host reimage [14:31:46] any more output above that? [14:31:57] “exit status 1” isn’t super helpful :/ [14:32:33] kart_: no, and that shouldn't be related to my changes :/ [14:32:35] I'll have a look [14:32:45] (03PS1) 10Hnowlan: mediawiki: configure job history limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105354 (https://phabricator.wikimedia.org/T371700) [14:33:07] :/ [14:33:16] looks like bad day for the deployments.. [14:33:29] abijeet: sorry - we still need to wait more.. 
[14:33:38] (03CR) 10CI reject: [V:04-1] mediawiki: configure job history limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105354 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [14:34:15] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1292.eqiad.wmnet with reason: host reimage [14:34:57] "failed to sync configm [14:35:07] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1281.eqiad.wmnet with reason: host reimage [14:35:11] "failed to sync configmap cache: timed out waiting for the condition" [14:36:17] OK. It failed finally. [14:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:51] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1292.eqiad.wmnet with reason: host reimage [14:39:26] (03Merged) 10jenkins-bot: CX3 Build 0.2.0+20241218 [extensions/ContentTranslation] (wmf/1.44.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1105336 (https://phabricator.wikimedia.org/T380702) (owner: 10KartikMistry) [14:39:28] retry for now? I am still looking into it [14:39:50] * kamila_ looking too [14:40:23] kart_, ok [14:40:31] where did the etcd's all go? [14:40:39] hnowlan: OK. Let me retry. [14:41:24] 10SRE-tools, 06Infrastructure-Foundations, 13Patch-For-Review: Add an ownership field to cookbooks. - https://phabricator.wikimedia.org/T379258#10412580 (10BTullis) >>! In T379258#10409987, @Volans wrote: > In an early draft I had thought of adding working groups to the list of possible groups but talking wi... [14:41:35] ah. Sad. 
my other patch also got merged :/ [14:41:37] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1105279|Translate: Enable message group subscription by default (T372386)]] [14:41:41] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [14:41:42] I see only etcd-0 in `kubectl get cs` in codfw, is that expected? [14:44:21] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1256.eqiad.wmnet with reason: host reimage [14:44:39] kamila_: I did not know about get cs :o ...and given it says healthy I think we're good [14:44:47] https://www.irccloud.com/pastebin/xMbmY9ja/ [14:45:28] was the deployment failure in codfw only? [14:45:59] hnowlan: seems going fine with retry now.. [14:46:01] (03CR) 10JHathaway: [C:03+1] "looks good, aside from the comments from @ltoscano@wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/1105329 (owner: 10Muehlenhoff) [14:46:08] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10412589 (10Andrew) [14:46:10] abijeet: around, right? [14:46:19] jayme: I believe I had it in eqiad earlier but can't confirm [14:46:26] !log kartik@deploy2002 kartik, abi: Backport for [[gerrit:1105279|Translate: Enable message group subscription by default (T372386)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:46:28] kamila_: ETCDCTL_API=3 etcdctl --endpoints https://$(hostname -f):2379 member list [14:46:36] ah! sorry! [14:46:42] abijeet: can you test the patch? [14:46:52] I was about to say, nothing weird in https://grafana.wikimedia.org/d/Ku6V7QYGz/etcd3?orgId=1&var-site=codfw&var-cluster=kubernetes&var-instance_prefix=wikikube-ctrl [14:46:59] yeah that. Plus get componentstatuses is deprecated since 1.19+. 
Don't rely much on it [14:47:08] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10412591 (10Andrew) [14:47:12] (03CR) 10Muehlenhoff: Deprecate remaining uses of system::role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1105329 (owner: 10Muehlenhoff) [14:47:27] thanks jayme, I'm done panicking for now '^^ [14:47:42] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1256.eqiad.wmnet with reason: host reimage [14:47:42] (03PS3) 10Muehlenhoff: Deprecate remaining uses of system::role [puppet] - 10https://gerrit.wikimedia.org/r/1105329 [14:47:45] akosiaris: funny... a thing I did not know of that has already been deprecated :D [14:47:50] ignorant question - where do I find the mw-on-k8s deployment logs? [14:48:28] the "failed to sync etc.." that Hugh mentioned earlier on [14:48:29] elukey: helm/k8s stuff easiest way is kube_env mw-web ; kubectl get events [14:48:37] or logstash for the kubernetes events alternatively [14:48:49] also easier if you want to drill down on various namespaces etc [14:48:56] akosiaris: ah ok via get events, there is nothing more specific that scap saves from calling helmfile etc.. 
[14:49:01] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10412597 (10fnegri) a:05fnegri→03None [14:49:02] same data, different medium [14:49:05] kart_, on it [14:49:08] okok thanks [14:49:15] elukey: scap does log to logstash too [14:49:19] let me find the dashboard [14:49:39] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10412609 (10fnegri) [14:49:40] https://logstash.wikimedia.org/app/dashboards#/view/f7e31de0-9f0d-11eb-863c-3588009e4dd9 [14:49:41] elukey: that error is in fact from events :D [14:49:46] but is a bit useless [14:49:49] ^ scap logs [14:50:06] dancy: thanks! I was about to grumble about having to click on share etc [14:50:15] sigh kibana... [14:50:20] haha.. I feel ya [14:50:30] thanks! [14:50:47] Enjoy reading messages in reverse order. [14:50:50] elukey: the rest, which is arguably not deployment logs is still on mwlog hosts [14:51:15] !log installing gstreamer1.0 security updates [14:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:58] kart_, looks ok. [14:52:02] akosiaris: got it, at the end scap just calls helmfile that is usually not very telling, so it can't know much.. I'll remember mw-web etc.. to check when these things happens [14:52:13] abijeet: cool. Going ahead. 
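(Editor's sketch, not part of the channel log.) The exchange above points at `kubectl get events` as the quickest place to look for messages such as "failed to sync configmap cache". As a rough illustration only, the Python below filters the JSON form of that output down to Warning events and prints them oldest first. The function names `event_time`, `warning_events` and `format_event` are hypothetical helpers invented for this sketch; the field names (`type`, `reason`, `message`, `lastTimestamp`, `involvedObject`, `metadata.creationTimestamp`) are assumed to follow the core/v1 Event object, and nothing here reflects how scap or helmfile actually collect logs.

```python
import json
import sys


def event_time(ev):
    """Best-effort timestamp for an event (hypothetical helper).

    lastTimestamp can be null on some events, so fall back to the
    object's creation time, and to an empty string if neither is set.
    """
    return (
        ev.get("lastTimestamp")
        or ev.get("metadata", {}).get("creationTimestamp")
        or ""
    )


def warning_events(event_list):
    """Return only the Warning events from a parsed EventList, oldest first.

    RFC 3339 UTC timestamps sort correctly as plain strings, so no
    datetime parsing is needed here.
    """
    warnings = [ev for ev in event_list.get("items", []) if ev.get("type") == "Warning"]
    return sorted(warnings, key=event_time)


def format_event(ev):
    """Render one event as a single line: time, object, reason, message."""
    obj = ev.get("involvedObject", {})
    return (
        f"{event_time(ev)} {obj.get('kind', '?')}/{obj.get('name', '?')} "
        f"{ev.get('reason', '')}: {ev.get('message', '')}"
    )


if __name__ == "__main__":
    # Expects the output of `kubectl get events -o json` on stdin.
    payload = json.load(sys.stdin)
    for ev in warning_events(payload):
        print(format_event(ev))
```

Assuming the file is saved as `warning_events.py` (a name chosen here), it could be fed with `kubectl get events -o json | python3 warning_events.py` after selecting the namespace. Sorting oldest first is a deliberate choice so that the messages do not have to be read in reverse order.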
[14:52:17] kart_, thanks for getting the patch through [14:52:17] !log kartik@deploy2002 kartik, abi: Continuing with sync [14:52:49] elukey: the list is at https://gerrit.wikimedia.org/g/operations/puppet/+/6e296f27e8f019645c06e5f47a693d1100adcb85/hieradata/common/profile/kubernetes/deployment_server.yaml#161 [14:53:06] every mw-* thing is 1 namespace in wikikube [14:53:20] and mostly a MediaWiki deployment [14:53:30] (03CR) 10Bking: [C:03+2] wdqs categories: ship lastUpdated metric [puppet] - 10https://gerrit.wikimedia.org/r/1073529 (https://phabricator.wikimedia.org/T374916) (owner: 10Ryan Kemper) [14:53:33] there are a couple of exceptions, e.g. mw-mcrouter (which is ... duh mcrouter) [14:54:13] RECOVERY - BGP status on lsw1-e6-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:54:14] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1281.eqiad.wmnet with OS bookworm [14:54:39] (03PS1) 10Muehlenhoff: Add gstreamer1.0 library hint [puppet] - 10https://gerrit.wikimedia.org/r/1105358 [14:55:27] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [14:55:57] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1282.eqiad.wmnet with OS bookworm [14:56:15] RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:56:39] <_joe_> elukey: if you want to see which thing scap deploys to, https://gerrit.wikimedia.org/g/operations/puppet/+/6e296f27e8f019645c06e5f47a693d1100adcb85/hieradata/role/common/deployment_server/kubernetes.yaml#267 [14:56:54] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1292.eqiad.wmnet with OS bookworm [14:57:01] <_joe_> the value of the hiera label "profile::kubernetes::deployment_server::mediawiki::release::mw_releases" [14:57:48] !log 
btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:57:58] (03CR) 10Muehlenhoff: [C:03+2] Add gstreamer1.0 library hint [puppet] - 10https://gerrit.wikimedia.org/r/1105358 (owner: 10Muehlenhoff) [14:58:04] 10ops-eqiad, 06SRE, 10Cloud-VPS, 06DC-Ops, 10cloud-services-team (Hardware): Relocate cloudnet1007-dev and cloudnet1008-dev to new racks and rename - https://phabricator.wikimedia.org/T382412#10412648 (10aborrero) [14:58:40] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1293.eqiad.wmnet with OS bookworm [14:59:13] PROBLEM - BGP status on lsw1-e6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:00:04] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1500) [15:00:19] hnowlan: the 'failed to sync configmap cache' can happen from time to time and is transparent (kubelet giving up and retrying) ... but it can ofc delay deployments [15:00:55] but it happens quite regularly [15:01:10] <_joe_> I can confirm [15:01:13] ah I was worried that would be the case [15:01:19] <_joe_> and also confirm the first time I saw it I was super worried [15:01:23] jayme: so that's not expected to break things and thus is not the problem we're looking for? [15:01:27] 06SRE, 10Add-Link, 10Growth-Structured-Tasks, 06Growth-Team, and 2 others: Architecture Conversation: linkrecommendations service - how to handle user-initiated requests? - https://phabricator.wikimedia.org/T382404#10412657 (10Ladsgroup) I think we need an overarching or at least some best practices on int... 
[15:01:35] so we have zero signal about what actually happened other than a timeout waiting for the condition message [15:01:48] that's annoying [15:02:04] kamila_: if it stays like that for 5min then it can ofc make the deployment fail as readiness is never reached [15:02:15] PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:02:19] right, thanks jayme [15:05:03] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105279|Translate: Enable message group subscription by default (T372386)]] (duration: 23m 26s) [15:05:08] T372386: Enable message group subscription feature on Wikimedia wikis - https://phabricator.wikimedia.org/T372386 [15:05:12] ah. Finally! [15:05:45] I also don't see anything surrounding that in kubelet logs [15:06:13] !log kartik@deploy2002 Started scap sync-world: Backport for [[gerrit:1105336|CX3 Build 0.2.0+20241218 (T380702)]] [15:06:17] T380702: Consider length of Collection names on different views - https://phabricator.wikimedia.org/T380702 [15:06:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye [15:06:57] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1256.eqiad.wmnet with OS bookworm [15:07:05] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10412682 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 
for host cloudcephosd2004-dev.codfw.wmnet with OS bul... [15:09:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:10:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es1043.eqiad.wmnet with OS bookworm [15:10:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 06Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10412689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm [15:14:09] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-11-27-074306 to 2024-12-17-184905 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105362 (https://phabricator.wikimedia.org/T378785) [15:14:25] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-11-26-193226 to 2024-12-16-202347 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105363 (https://phabricator.wikimedia.org/T377020) [15:14:45] !log kartik@deploy2002 kartik: Backport for [[gerrit:1105336|CX3 Build 0.2.0+20241218 (T380702)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:15:23] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-11-27-074306 to 2024-12-17-184905 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105362 (https://phabricator.wikimedia.org/T378785) (owner: 10Jforrester) [15:16:20] WF deployers, we're still in the middle of MW deploy due to a problem earlier, can you please wait? 
[15:16:20] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1282.eqiad.wmnet with reason: host reimage [15:16:28] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-11-27-074306 to 2024-12-17-184905 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105362 (https://phabricator.wikimedia.org/T378785) (owner: 10Jforrester) [15:16:34] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [15:16:34] !log btullis@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [15:16:38] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [15:16:51] (03Abandoned) 10Hnowlan: mediawiki: configure job history limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105354 (https://phabricator.wikimedia.org/T371700) (owner: 10Hnowlan) [15:16:59] kamila_: Oh, sure, do you expect it to break services? [15:17:14] Normally they're unrelated. [15:17:41] no, but we don't know what happened, so I don't want to get more confused :D [15:18:02] We don't even deploy with scap… [15:18:38] James_F: yes, but the problem was in k8s [15:18:46] Fun. 
[15:18:46] but if hnowlan or jayme think it's fine to do in parallel, feel free to say [15:18:59] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1293.eqiad.wmnet with reason: host reimage [15:19:27] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1282.eqiad.wmnet with reason: host reimage [15:19:48] !log kartik@deploy2002 kartik: Continuing with sync [15:19:54] (03PS1) 10CDanis: chart-renderer: probe: increase timeout [puppet] - 10https://gerrit.wikimedia.org/r/1105366 (https://phabricator.wikimedia.org/T372081) [15:20:08] I don't think it should be an issue to go in parallel [15:20:17] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Preparing an-presto1001 for renaming to an-worker1065 - btullis@cumin1002" [15:20:22] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Preparing an-presto1001 for renaming to an-worker1065 - btullis@cumin1002" [15:20:22] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:20:23] Ack. [15:20:25] ok, thanks hnowlan! 
[15:20:30] 10ops-codfw, 06collaboration-services, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Comm Error: backplane 0 when reimaging wikikube-worker2190 - https://phabricator.wikimedia.org/T382420 (10Jelto) 03NEW [15:20:31] although take that with a grain of salt in that we can't find what caused it :P [15:20:31] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:21:23] (03CR) 10CDanis: [C:03+2] chart-renderer: probe: increase timeout [puppet] - 10https://gerrit.wikimedia.org/r/1105366 (https://phabricator.wikimedia.org/T372081) (owner: 10CDanis) [15:21:28] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet [15:21:32] yeah, that's why I wasn't sure :D [15:21:36] hnowlan: Aren't computers fantastic? [15:21:45] !log btullis@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1065 [15:22:40] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1293.eqiad.wmnet with reason: host reimage [15:23:04] !log btullis@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1065 [15:23:26] wouldn't trust them too much [15:23:37] (03PS1) 10Krinkle: Enable $wgWMEStatsBeaconUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1105367 (https://phabricator.wikimedia.org/T355837) [15:24:06] !log gengh@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:24:11] (03PS1) 10Bking: team-data-platform: remove misconfigured alert [alerts] - 10https://gerrit.wikimedia.org/r/1105368 (https://phabricator.wikimedia.org/T374916) [15:24:48] !log gengh@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:24:50] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1065.eqiad.wmnet with OS bullseye [15:25:37] (03CR) 10DCausse: [C:03+1] team-data-platform: remove misconfigured alert [alerts] - 
10https://gerrit.wikimedia.org/r/1105368 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking) [15:27:12] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet [15:27:12] !log kartik@deploy2002 Finished scap sync-world: Backport for [[gerrit:1105336|CX3 Build 0.2.0+20241218 (T380702)]] (duration: 20m 58s) [15:27:30] !log gengh@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:27:34] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:28:54] (03CR) 10Bking: [C:03+2] team-data-platform: remove misconfigured alert [alerts] - 10https://gerrit.wikimedia.org/r/1105368 (https://phabricator.wikimedia.org/T374916) (owner: 10Bking) [15:29:21] !log gengh@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:29:29] !log gengh@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:29:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:30:06] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2004-dev [15:30:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2004-dev [15:30:20] !log gengh@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:30:29] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker2190.codfw.wmnet with OS bookworm [15:31:31] !log homer 'lsw1-c1-codfw*' commit 'T377877' [15:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:36] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [15:32:37] !log homer 'lsw1-c3-codfw*' commit 'T377877' [15:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:35] (03CR) 10Genoveva Galarza: [C:03+2] wikifunctions: Upgrade evaluators from 
2024-11-26-193226 to 2024-12-16-202347 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105363 (https://phabricator.wikimedia.org/T377020) (owner: 10Jforrester) [15:34:08] !log homer 'cr*codfw*' commit 'T377877' [15:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:52] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-11-26-193226 to 2024-12-16-202347 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105363 (https://phabricator.wikimedia.org/T377020) (owner: 10Jforrester) [15:35:34] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 188, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:36:14] PROBLEM - BGP status on lsw1-c3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:36:21] !log gengh@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:38:10] !log gengh@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:38:22] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2188-2189,2191].codfw.wmnet [15:38:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2188-2189,2191].codfw.wmnet [15:38:26] RECOVERY - BGP status on lsw1-e6-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:38:28] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1282.eqiad.wmnet with OS bookworm [15:39:46] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T382422 (10Jelto) 03NEW [15:40:14] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host 
wikikube-worker1283.eqiad.wmnet with OS bookworm [15:40:52] (03PS13) 10Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) [15:41:17] !log gengh@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:41:26] RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:41:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr3-ulsfo.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [15:41:47] !incidents [15:41:47] 5546 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cr3-ulsfo.wikimedia.org) [15:41:50] !ack 5546 [15:41:51] 5546 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr3-ulsfo.wikimedia.org) [15:41:58] here as well o/ [15:41:59] topranks: you were saying? :D [15:42:04] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1293.eqiad.wmnet with OS bookworm [15:42:06] we will be alerted if it goes over [15:42:17] elukey: is this you? 
[15:42:59] the transport from codfw is ok (and out to singapore) so not impacting that [15:43:03] but yes massive surge [15:43:03] https://grafana.wikimedia.org/goto/daFI6oIHg?orgId=1 [15:43:05] !log gengh@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:43:23] !log gengh@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:43:27] we're maxing outbound at SF-MIX exchange [15:43:46] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1294.eqiad.wmnet with OS bookworm [15:44:13] !log gengh@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:44:26] PROBLEM - BGP status on lsw1-e6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:44:49] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2065,2067].codfw.wmnet [15:45:45] (03CR) 10CI reject: [V:04-1] create sre.k8s.roll-reimage-nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: 10Kamila Součková) [15:45:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2065,2067].codfw.wmnet [15:46:30] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr3-ulsfo.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [15:46:53] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2065.codfw.wmnet with OS bookworm [15:46:54] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2067.codfw.wmnet with OS bookworm [15:47:12] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2065 [15:47:12] !log 
jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2065 [15:47:13] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2067 [15:47:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2067 [15:47:26] PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:49:54] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:50:38] PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:50:51] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:51:02] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:51:11] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:51:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr3-ulsfo.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [15:51:41] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[15:51:53] !incidents
[15:51:53] 5547 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cr3-ulsfo.wikimedia.org)
[15:51:54] 5546 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr3-ulsfo.wikimedia.org)
[15:52:04] !ack 5547
[15:52:04] 5547 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr3-ulsfo.wikimedia.org)
[15:52:06] !log cgoubert@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[15:52:48] !log cgoubert@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[15:52:51] (PS1) Btullis: Configure the correct role for reimaging installing an-worker nodes [puppet] - https://gerrit.wikimedia.org/r/1105371 (https://phabricator.wikimedia.org/T382410)
[15:52:52] (PS14) Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857)
[15:53:01] !log cgoubert@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[15:53:23] !log cgoubert@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[15:53:32] !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[15:53:34] (CR) Btullis: [C:+2] Configure the correct role for reimaging installing an-worker nodes [puppet] - https://gerrit.wikimedia.org/r/1105371 (https://phabricator.wikimedia.org/T382410) (owner: Btullis)
[15:53:54] (PS1) Krinkle: webperf: Enable --dogstatsd on statsv.py [puppet] - https://gerrit.wikimedia.org/r/1105372 (https://phabricator.wikimedia.org/T355837)
[15:54:10] (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105372 (https://phabricator.wikimedia.org/T355837) (owner: Krinkle)
[15:54:21] !log cgoubert@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[15:54:24] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1043.eqiad.wmnet with OS bookworm
[15:54:29] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1065.eqiad.wmnet with OS bullseye
[15:54:29] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10412921 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm...
[15:54:31] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[15:55:17] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[15:55:18] (PS15) Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857)
[15:55:25] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[15:56:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:56:30] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr3-ulsfo.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[15:57:29] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[15:57:36] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[15:58:26] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[15:58:50] ops-codfw, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T382392#10412927 (Jhancock.wm) a:Jhancock.wm
[15:59:19] ops-codfw, DC-Ops, Prod-Kubernetes, serviceops, Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T382422#10412942 (Jhancock.wm) a:Jhancock.wm
[15:59:37] ops-codfw, SRE, collaboration-services, DC-Ops, and 3 others: Comm Error: backplane 0 when reimaging wikikube-worker2190 - https://phabricator.wikimedia.org/T382420#10412944 (Jhancock.wm) a:Jhancock.wm
[16:00:20] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1065.eqiad.wmnet with OS bullseye
[16:00:39] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1283.eqiad.wmnet with reason: host reimage
[16:00:59] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10412950 (Jhancock.wm) @Andrew what kind of partition should this server have? I keep getting an error in that part of the installer. my first thought was...
[16:04:11] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1294.eqiad.wmnet with reason: host reimage
[16:04:13] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1283.eqiad.wmnet with reason: host reimage
[16:06:48] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2065.codfw.wmnet with reason: host reimage
[16:06:56] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2067.codfw.wmnet with reason: host reimage
[16:07:24] ops-codfw, Data-Persistence, DC-Ops: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425 (RobH) NEW
[16:07:54] ops-codfw, Data-Persistence, DC-Ops: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425#10412972 (RobH) a:Marostegui Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new servers. T...
[16:08:09] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1294.eqiad.wmnet with reason: host reimage
[16:11:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2067.codfw.wmnet with reason: host reimage
[16:15:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2065.codfw.wmnet with reason: host reimage
[16:17:14] (CR) Muehlenhoff: [C:+1] "As discussed on IRC; let's merge this in the first week of January" [puppet] - https://gerrit.wikimedia.org/r/1085471 (https://phabricator.wikimedia.org/T378752) (owner: BryanDavis)
[16:17:14] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.021e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[16:17:43] !log btullis@cumin1002 START - Cookbook sre.dns.netbox
[16:20:26] RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:21:20] ops-codfw, Data-Persistence, DC-Ops: Q2:rack/setup/install db2243 - https://phabricator.wikimedia.org/T382425#10412995 (RobH)
[16:22:01] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-provisioning an-presto1002 and an-worker1066 - btullis@cumin1002"
[16:22:06] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-provisioning an-presto1002 and an-worker1066 - btullis@cumin1002"
[16:22:06] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:22:47] !log btullis@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1066 [16:23:13] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:23:26] PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:23:26] RECOVERY - BGP status on lsw1-e6-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:23:28] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1283.eqiad.wmnet with OS bookworm [16:24:28] RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:24:31] !log btullis@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1066 [16:25:05] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1065.eqiad.wmnet with reason: host reimage [16:25:16] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1284.eqiad.wmnet with OS bookworm [16:26:12] PROBLEM - BGP status on lsw1-a6-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:27:12] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1294.eqiad.wmnet with OS bookworm [16:27:14] RECOVERY - BGP status on lsw1-a6-codfw.mgmt is OK: BGP OK - up: 40, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:27:49] !log btullis@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1065.eqiad.wmnet with reason: host reimage [16:28:57] !log jayme@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1295.eqiad.wmnet with OS bookworm [16:29:39] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2067.codfw.wmnet with OS bookworm [16:34:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2065.codfw.wmnet with OS bookworm [16:37:07] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2065,2067].codfw.wmnet [16:37:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2065,2067].codfw.wmnet [16:37:46] RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:39:40] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2063-2064].codfw.wmnet [16:40:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[2063-2064].codfw.wmnet [16:41:44] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2063.codfw.wmnet with OS bookworm [16:41:48] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2064.codfw.wmnet with OS bookworm [16:41:54] PROBLEM - BGP status on lsw1-e7-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:42:07] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2064 [16:42:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2064 [16:42:44] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan 
for host wikikube-worker2063 [16:42:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2063 [16:42:55] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002" [16:45:46] PROBLEM - BGP status on lsw1-b3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:45:46] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1284.eqiad.wmnet with reason: host reimage [16:46:52] PROBLEM - BGP status on lsw1-a3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:48:03] (03CR) 10Xcollazo: [C:03+1] dse-k8s: content-history: add kafka cluster domain [deployment-charts] - 10https://gerrit.wikimedia.org/r/1102913 (https://phabricator.wikimedia.org/T381322) (owner: 10Gmodena) [16:48:35] (03CR) 10MSantos: [C:03+1] "LGTM. I don't have a strong opinion about this and I will wait for Yiannis opinion." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1105296 (https://phabricator.wikimedia.org/T382408) (owner: 10Elukey) [16:49:08] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1284.eqiad.wmnet with reason: host reimage [16:49:20] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1295.eqiad.wmnet with reason: host reimage [16:49:50] 06SRE, 06Infrastructure-Foundations, 10netops: Management routers: use BGP instead of OSPF - https://phabricator.wikimedia.org/T294845#10413071 (10cmooney) 05Resolved→03Open >>! 
In T294845#8758882, @ayounsi wrote: > This is completed in drmrs, the same will be applied to the other sites when we bring L3... [16:49:58] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002" [16:49:59] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1065.eqiad.wmnet with OS bullseye [16:50:41] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1066.eqiad.wmnet with OS bullseye [16:52:53] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1295.eqiad.wmnet with reason: host reimage [16:56:14] PROBLEM - BGP status on lsw1-f6-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:57:14] !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1065.eqiad.wmnet [16:58:15] (03PS1) 10Eevans: sessionstore: Upgrade Cassandra to v4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1105386 (https://phabricator.wikimedia.org/T380420) [16:58:44] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105386 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [16:58:58] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1065.eqiad.wmnet [17:00:18] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2063.codfw.wmnet with reason: host reimage [17:01:38] (03PS2) 10Eevans: sessionstore: Upgrade Cassandra to v4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1105386 (https://phabricator.wikimedia.org/T380420) [17:01:55] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2064.codfw.wmnet with 
reason: host reimage [17:02:12] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1066.eqiad.wmnet with reason: host reimage [17:02:44] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1105386 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [17:03:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2063.codfw.wmnet with reason: host reimage [17:06:08] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [17:06:45] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1066.eqiad.wmnet with reason: host reimage [17:07:44] 10ops-codfw, 06SRE, 06DC-Ops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788#10413120 (10Papaul) @bking hello do you have any update on @Jhancock.wm above? Thank you [17:07:59] RECOVERY - BGP status on lsw1-e7-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:08:29] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1284.eqiad.wmnet with OS bookworm [17:08:31] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1280-1284].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [17:09:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2064.codfw.wmnet with reason: host reimage [17:10:04] (03PS1) 10Herron: thanos-store: manage and increase chunk-pool-size setting [puppet] - 10https://gerrit.wikimedia.org/r/1105389 (https://phabricator.wikimedia.org/T368953) [17:10:45] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-commissioning an-presto1003 as an-worker1067 - 
btullis@cumin1002" [17:10:49] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-commissioning an-presto1003 as an-worker1067 - btullis@cumin1002" [17:10:49] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:10:58] !log btullis@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1067 [17:12:02] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1295.eqiad.wmnet with OS bookworm [17:12:04] !log jayme@cumin1002 END (PASS) - Cookbook sre.k8s.roll-reimage-nodes (exit_code=0) rolling reimage on P{wikikube-worker[1291-1295].eqiad.wmnet} and (A:wikikube-master-eqiad or A:wikikube-worker-eqiad) [17:12:17] RECOVERY - BGP status on lsw1-f6-eqiad.mgmt is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:12:30] !log btullis@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1067 [17:17:43] (03PS2) 10Herron: thanos-store: manage and increase chunk-pool-size setting [puppet] - 10https://gerrit.wikimedia.org/r/1105389 (https://phabricator.wikimedia.org/T368953) [17:19:17] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [17:19:45] (03PS1) 10Herron: thanos-store: increase store cache size to 24GB [puppet] - 10https://gerrit.wikimedia.org/r/1105395 (https://phabricator.wikimedia.org/T368953) [17:19:57] (03CR) 10Herron: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4718/co" [puppet] - 10https://gerrit.wikimedia.org/r/1105389 (https://phabricator.wikimedia.org/T368953) (owner: 10Herron) [17:21:19] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - 
btullis@cumin1002" [17:22:38] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10413175 (10Andrew) The two small drives should be mirrored (raid 1) and used for the OS, the larger drives left unformatted for Ceph to manage. I believe... [17:24:01] RECOVERY - BGP status on lsw1-a3-codfw.mgmt is OK: BGP OK - up: 68, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:24:32] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2063.codfw.wmnet with OS bookworm [17:24:53] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002" [17:24:54] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1066.eqiad.wmnet with OS bullseye [17:25:33] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-commissioning an-presto1004 as an-worker1068 - btullis@cumin1002" [17:25:38] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-commissioning an-presto1004 as an-worker1068 - btullis@cumin1002" [17:25:38] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:27:57] !log depool, restart, repool ms-fe2009 [17:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:38] (03CR) 10Eevans: [C:03+2] sessionstore: Upgrade Cassandra to v4.1.7 [puppet] - 10https://gerrit.wikimedia.org/r/1105386 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [17:28:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2064.codfw.wmnet with OS 
bookworm
[17:28:53] RECOVERY - BGP status on lsw1-b3-codfw.mgmt is OK: BGP OK - up: 34, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:30:23] !log btullis@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1068
[17:31:42] !log btullis@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1068
[17:32:30] !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1066.eqiad.wmnet
[17:32:32] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2*.codfw.wmnet: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[17:32:37] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[17:34:13] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1066.eqiad.wmnet
[17:39:43] (CR) Kamila Součková: "@jmeybohm@wikimedia.org Assuming I create tasks for (and start working on) the incomplete TODOs inline, is there anything blocking merging" [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[17:41:52] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1067.eqiad.wmnet with OS bullseye
[17:44:29] (PS1) Btullis: Add dummy tokens for new temporary Hadoop workers [labs/private] - https://gerrit.wikimedia.org/r/1105404 (https://phabricator.wikimedia.org/T382410)
[17:44:47] (PS2) Eevans: restbase: cleanup decommissioned hosts [puppet] - https://gerrit.wikimedia.org/r/1105015
[17:46:46] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1068.eqiad.wmnet with OS bullseye
[17:48:41] (CR) Eevans: "I changed these entries to corresponding values of the form restbase9xxx. This seems close to "real", while also guarding against future `" [puppet] - https://gerrit.wikimedia.org/r/1105015 (owner: Eevans)
[17:49:04] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2063-2064].codfw.wmnet
[17:49:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2063-2064].codfw.wmnet
[17:50:21] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2*.codfw.wmnet: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[17:50:25] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[17:51:35] !log btullis@cumin1002 START - Cookbook sre.dns.netbox
[17:53:10] (CR) Btullis: [V:+2 C:+2] Add dummy tokens for new temporary Hadoop workers [labs/private] - https://gerrit.wikimedia.org/r/1105404 (https://phabricator.wikimedia.org/T382410) (owner: Btullis)
[17:53:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:55:19] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1067.eqiad.wmnet with OS bullseye
[17:57:02] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1067.eqiad.wmnet with OS bullseye
[17:57:21] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-commissioning an-presto1005 as an-worker1069 - btullis@cumin1002"
[17:57:26] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Re-commissioning an-presto1005 as an-worker1069 - btullis@cumin1002"
[17:57:26] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:58:17] !log btullis@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1069
[17:58:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.194 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:58:54] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1068.eqiad.wmnet with reason: host reimage
[17:59:35] !log btullis@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1069
[18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1800)
[18:00:17] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1*.eqiad.wmnet: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[18:00:21] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[18:01:08] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1069.eqiad.wmnet with OS bullseye
[18:01:52] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1068.eqiad.wmnet with reason: host reimage
[18:04:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:05:47] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1067.eqiad.wmnet with OS bullseye
[18:06:30] (CR) MVernon: [C:+1] "LGTM, thanks, apologies for delays due to my pedantry!" [puppet] - https://gerrit.wikimedia.org/r/1105015 (owner: Eevans)
[18:09:23] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1067.eqiad.wmnet with OS bullseye
[18:09:26] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:09:50] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:09:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:13:22] (PS16) Kamila Součková: create sre.k8s.roll-reimage-nodes [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857)
[18:13:37] (CR) Kamila Součková: "Done" [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[18:13:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.198 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:13:48] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1069.eqiad.wmnet with OS bullseye
[18:14:12] (CR) Kamila Součková: create sre.k8s.roll-reimage-nodes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1094494 (https://phabricator.wikimedia.org/T377857) (owner: Kamila Součková)
[18:14:18] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53069 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:15:49] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1067.eqiad.wmnet with OS bullseye
[18:16:28] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1067.eqiad.wmnet with OS bullseye
[18:16:39] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002"
[18:18:09] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1*.eqiad.wmnet: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002
[18:18:13] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420
[18:18:35] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002"
[18:18:36] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1068.eqiad.wmnet with OS bullseye
[18:19:44] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[18:19:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[18:20:29] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1069.eqiad.wmnet with OS bullseye
[18:21:56] !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1068.eqiad.wmnet
[18:23:03] SRE, Infrastructure-Foundations, netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10413490 (cmooney) >>! In T382396#10412404, @fgiunchedi wrote: > Indeed the underlying data/samples are there as expected: I tested this theory by removing all...
[18:23:41] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1068.eqiad.wmnet
[18:25:00] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1069.eqiad.wmnet with OS bullseye
[18:25:35] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1069.eqiad.wmnet with OS bullseye
[18:28:38] SRE, Infrastructure-Foundations, netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10413543 (CDanis) >>! In T382396#10413490, @cmooney wrote: > But we can deal with that if that is the cause. The goal of the "irate" is that we want as much g...
[18:31:35] ops-codfw, SRE, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T382392#10413558 (Jhancock.wm) Open→Resolved probably came loose yesterday while cleaning up the cable management in that rack. reseated. came up.
[18:34:36] SRE, Infrastructure-Foundations, netops: Investigate gnmic metric gaps and counters going to zero - https://phabricator.wikimedia.org/T382396#10413565 (cmooney) >>! In T382396#10413543, @CDanis wrote: > It's fine to make the time window longer with `irate()` -- it will always pick the two most-recent...
[18:37:25] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1069.eqiad.wmnet with reason: host reimage
[18:40:36] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1069.eqiad.wmnet with reason: host reimage
[18:52:42] FIRING: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:55:33] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002"
[18:57:17] ops-codfw, SRE, collaboration-services, DC-Ops, and 3 others: Comm Error: backplane 0 when reimaging wikikube-worker2190 - https://phabricator.wikimedia.org/T382420#10413692 (Jhancock.wm) Open→Resolved reseated all cables connected to the backplane and the connection on the main board...
[19:00:05] dancy: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T1900).
[19:07:46] o/
[19:08:20] (PS1) TrainBranchBot: group1 to 1.44.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1105418 (https://phabricator.wikimedia.org/T375667)
[19:08:22] (CR) TrainBranchBot: [C:+2] group1 to 1.44.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1105418 (https://phabricator.wikimedia.org/T375667) (owner: TrainBranchBot)
[19:09:04] (Merged) jenkins-bot: group1 to 1.44.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/1105418 (https://phabricator.wikimedia.org/T375667) (owner: TrainBranchBot)
[19:09:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:13:54] (PS1) Michael Große: Growth: Remove temporary config for clearing link recommendations [mediawiki-config] - https://gerrit.wikimedia.org/r/1105420 (https://phabricator.wikimedia.org/T379522)
[19:13:54] (CR) Michael Große: [C:-1] "Id70d05b05ebd5d8a1650208b28b435da1f89d49e needs to be merged and in production and sure to not be reverted first before this change should" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105420 (https://phabricator.wikimedia.org/T379522) (owner: Michael Große)
[19:25:27] (CR) Bking: [C:+2] dse-k8s-services: introduce Blunderbuss config [deployment-charts] - https://gerrit.wikimedia.org/r/1091827 (https://phabricator.wikimedia.org/T371994) (owner: Bking)
[19:28:29] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.8 refs T375667
[19:28:34] T375667: 1.44.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T375667
[19:31:11] PROBLEM - Docker registry HTTPS interface on registry1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker
[19:32:01] RECOVERY - Docker registry HTTPS interface on registry1005 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Docker
[19:36:35] dancy: looks pretty chill
[19:36:40] Agreed
[19:36:54] The best type of train vibe
[19:37:01] ^
[19:38:22] (CR) Eevans: [C:+2] restbase: cleanup decommissioned hosts [puppet] - https://gerrit.wikimedia.org/r/1105015 (owner: Eevans)
[19:43:05] ops-codfw, SRE, DC-Ops, Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T382422#10413878 (Jhancock.wm) Open→Resolved
[19:48:59] (CR) Cwhite: [C:+1] Enable $wgWMEStatsBeaconUri [mediawiki-config] - https://gerrit.wikimedia.org/r/1105367 (https://phabricator.wikimedia.org/T355837) (owner: Krinkle)
[19:50:43] (CR) Cwhite: [C:+1] "Looks good!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1105367 (https://phabricator.wikimedia.org/T355837) (owner: Krinkle)
[19:56:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:24:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10414003 (phaultfinder)
[20:26:51] !log restarting eventgate-analytics-external to clear schema cache - T382113 | https://phabricator.wikimedia.org/T382113#10414005
[20:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:56] T382113: Invalid EventGate errors with content_translation_event 1.7.0 - https://phabricator.wikimedia.org/T382113
[20:27:04] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: sync
[20:27:10] (PS1) Clare Ming: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1105431 (https://phabricator.wikimedia.org/T374957)
[20:27:18] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: sync
[20:27:31] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: sync
[20:28:17] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: sync
[20:28:35] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: sync
[20:29:24] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: sync
[20:29:28] (PS1) Clare Ming: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - https://gerrit.wikimedia.org/r/1105433 (https://phabricator.wikimedia.org/T374957)
[20:32:29] (CR) Santiago Faci: [C:+2] Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1105431 (https://phabricator.wikimedia.org/T374957) (owner: Clare Ming)
[20:32:30] (CR) Santiago Faci: [C:+2] Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - https://gerrit.wikimedia.org/r/1105433 (https://phabricator.wikimedia.org/T374957) (owner: Clare Ming)
[20:33:31] (Merged) jenkins-bot: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1105431 (https://phabricator.wikimedia.org/T374957) (owner: Clare Ming)
[20:33:58] (Merged) jenkins-bot: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - https://gerrit.wikimedia.org/r/1105433 (https://phabricator.wikimedia.org/T374957) (owner: Clare Ming)
[20:36:00] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[20:36:23] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[20:44:10] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
[20:44:29] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
[20:45:33] (PS30) CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - https://gerrit.wikimedia.org/r/1102860
[20:47:35] (CR) CI reject: [V:-1] P:hardware::check: add profile to check HW configuration [puppet] - https://gerrit.wikimedia.org/r/1102860 (owner: CDobbins)
[20:51:08] (PS31) CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - https://gerrit.wikimedia.org/r/1102860
[20:53:09] (CR) CI reject: [V:-1] P:hardware::check: add profile to check HW configuration [puppet] - https://gerrit.wikimedia.org/r/1102860 (owner: CDobbins)
[20:57:36] (PS32) CDobbins: P:hardware::check: add profile to check HW configuration [puppet] - https://gerrit.wikimedia.org/r/1102860
[21:00:07] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T2100). nyaa~
[21:00:07] No Gerrit patches in the queue for this window AFAICS.
[21:01:58] (CR) CDobbins: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4719/co" [puppet] - https://gerrit.wikimedia.org/r/1102860 (owner: CDobbins)
[21:03:23] (CR) CDobbins: "PCC: https://puppet-compiler.wmflabs.org/output/1102860/4719/dns4003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/1102860 (owner: CDobbins)
[21:07:08] (PS1) Andrew Bogott: pdns recursor: support injecting extra hostnames into recursor config [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129)
[21:07:10] (PS1) Andrew Bogott: cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129)
[21:08:25] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[21:08:29] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 8565 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[21:09:41] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[21:10:24] (CR) BCornwall: [C:+1] ownership: Traffic cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1104952 (https://phabricator.wikimedia.org/T379258) (owner: Volans)
[21:14:11] (PS2) Andrew Bogott: pdns recursor: support injecting extra hostnames into recursor config [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129)
[21:14:11] (PS2) Andrew Bogott: cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129)
[21:15:11] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105442 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[21:15:15] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[21:19:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10414190 (phaultfinder)
[21:24:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10414210 (phaultfinder)
[21:25:42] (PS3) Andrew Bogott: cloud-vps dns recursor: remove labs-ip-aliaser [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129)
[21:26:03] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1105443 (https://phabricator.wikimedia.org/T374129) (owner: Andrew Bogott)
[21:30:10] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@a43cacf]: bump image suggestions, section topics, and SEAL
[21:31:21] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@a43cacf]: bump image suggestions, section topics, and SEAL (duration: 01m 43s)
[21:33:57] (PS1) Brennen Bearnes: dockerpkg-builder: add to docker group [puppet] - https://gerrit.wikimedia.org/r/1105449 (https://phabricator.wikimedia.org/T382285)
[21:39:14] ops-eqiad, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudcephmon100[1-3].eqiad.wmnet - https://phabricator.wikimedia.org/T380893#10414252 (Andrew) a:Andrew→cmooney
[21:42:42] RESOLVED: JobUnavailable: Reduced availability for job cloud_dev_pdns_rec in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:44:13] (PS1) Bking: team-search-platform: Add alert for wdqs-categories lag [alerts] - https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916)
[21:45:27] (CR) CI reject: [V:-1] team-search-platform: Add alert for wdqs-categories lag [alerts] - https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: Bking)
[21:58:53] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:59:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241218T2200)
[22:09:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T382002#10414308 (phaultfinder)
[22:14:50] (CR) Ebernhardson: "recheck" [alerts] - https://gerrit.wikimedia.org/r/1105451 (https://phabricator.wikimedia.org/T374916) (owner: Bking)
[22:19:44] FIRING: [4x] IPv4AnchorUnreachable: ipv4 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv4AnchorUnreachable
[22:19:44] FIRING: [4x] IPv6AnchorUnreachable: ipv6 ping to eqsin RIPE Atlas anchor: failures over threshold - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DIPv6AnchorUnreachable
[22:58:09] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host es1043.eqiad.wmnet with OS bookworm
[22:58:18] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10414371 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm
[23:09:31] FIRING: [3x] SystemdUnitFailed: mediawiki_job_translationnotifications-mediawikiwiki.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:41:54] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1043.eqiad.wmnet with OS bookworm
[23:42:05] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Automations, and 2 others: Q2:rack/setup/install es104[1-6] - https://phabricator.wikimedia.org/T378143#10414497 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host es1043.eqiad.wmnet with OS bookworm...
[23:56:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed