[00:01:45] <jinxer-wm>	 FIRING: KubernetesDeploymentUnavailableReplicas: ...
[00:01:45] <jinxer-wm>	 Deployment linkrecommendation-internal in linkrecommendation at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=linkrecommendation&var-deployment=linkrecommendation-internal - ...
[00:01:45] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[00:05:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:15:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:16:17] <wikibugs>	 (CR) Tim Starling: "Should be deployed ASAP to avoid breaking the next train, now that the core patch is merged." [mediawiki-config] - https://gerrit.wikimedia.org/r/1100217 (https://phabricator.wikimedia.org/T33951) (owner: Tim Starling)
[00:19:36] <urbanecm>	 !log Delete previously-started mwscript-k8s instances of revalidateLinkRecommendations.php (T380455)
[00:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:40] <stashbot>	 T380455: Run revalidateLinkRecommendations.php for wikis with more than 25 excluded sections - https://phabricator.wikimedia.org/T380455
[00:19:49] <urbanecm>	 !log mwmaint2002: foreachwikiindblist growthexperiments extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php --all --verbose # T380455
[00:19:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:45] <jinxer-wm>	 RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[00:21:45] <jinxer-wm>	 Deployment linkrecommendation-internal in linkrecommendation at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=linkrecommendation&var-deployment=linkrecommendation-internal - ...
[00:21:45] <jinxer-wm>	 https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[00:38:20] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1100878
[00:38:20] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1100878 (owner: TrainBranchBot)
[00:45:49] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:59:59] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1100878 (owner: TrainBranchBot)
[01:04:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: mediawiki_job_purge_parsercache_pc4.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:08:22] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1100879
[01:08:22] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1100879 (owner: TrainBranchBot)
[01:15:25] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:15:42] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[01:16:16] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[01:27:47] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1100879 (owner: TrainBranchBot)
[01:32:10] <TimStarling>	 !log on mwmaint2002: deleting [[MediaWiki:Sitesupport-url]] pages per T379205
[01:32:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:14] <stashbot>	 T379205: Donate sidebar link consistency (sitesupport-url) - https://phabricator.wikimedia.org/T379205
[01:36:33] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Inbound interface errors - fasw2-c1b-eqiad.mgmt.eqiad - https://phabricator.wikimedia.org/T381543#10385480 (Dzahn)
[01:37:11] <wikibugs>	 ops-eqiad, SRE, DC-Ops: PDU sensor over limit - ps1-b4-eqiad.mgmt.eqiad - https://phabricator.wikimedia.org/T381540#10385481 (Dzahn)
[01:47:23] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[01:47:26] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[01:47:52] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[01:47:55] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[01:48:43] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[01:49:10] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[01:49:38] <wikibugs>	 ops-eqiad, DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T381635 (phaultfinder) NEW
[01:49:53] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[01:50:47] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[02:29:49] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:40:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:28] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:05:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:15:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:04:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:40:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P71622 and previous config saved to /var/cache/conftool/dbconfig/20241206-054010-root.json
[05:41:41] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add es2044 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1100887 (https://phabricator.wikimedia.org/T381259)
[05:42:29] <wikibugs>	 (CR) Marostegui: [C:+2] instances.yaml: Add es2044 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1100887 (https://phabricator.wikimedia.org/T381259) (owner: Marostegui)
[05:43:41] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: es2045 went down: CPU error - https://phabricator.wikimedia.org/T381549#10385626 (Marostegui) p:Triage→Medium
[05:44:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es2044 to dbctl depooled T381259', diff saved to https://phabricator.wikimedia.org/P71623 and previous config saved to /var/cache/conftool/dbconfig/20241206-054457-marostegui.json
[05:45:01] <stashbot>	 T381259: Productionize es204[1-6] - https://phabricator.wikimedia.org/T381259
[05:47:09] <wikibugs>	 (PS1) Marostegui: wmnet: Update es4 and es5 CNAME [dns] - https://gerrit.wikimedia.org/r/1100888 (https://phabricator.wikimedia.org/T381259)
[05:48:21] <wikibugs>	 (PS1) Marostegui: es2044: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1100889 (https://phabricator.wikimedia.org/T381259)
[05:49:53] <wikibugs>	 (CR) Marostegui: [C:+2] es2044: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1100889 (https://phabricator.wikimedia.org/T381259) (owner: Marostegui)
[05:50:09] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update es4 and es5 CNAME [dns] - https://gerrit.wikimedia.org/r/1100888 (https://phabricator.wikimedia.org/T381259) (owner: Marostegui)
[05:50:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 1%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71624 and previous config saved to /var/cache/conftool/dbconfig/20241206-055047-root.json
[05:53:23] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1154.eqiad.wmnet with reason: Alter table
[05:53:26] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1154.eqiad.wmnet with reason: Alter table
[05:55:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P71625 and previous config saved to /var/cache/conftool/dbconfig/20241206-055516-root.json
[06:00:39] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:00:57] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:01:01] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:01:09] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:01:21] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:01:41] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:01:43] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:01:43] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:01:47] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:02:07] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:02:25] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:02:32] <wikibugs>	 (PS1) Marostegui: create_pc_tables.sh: Create table in parsercache [software] - https://gerrit.wikimedia.org/r/1100890 (https://phabricator.wikimedia.org/T378068)
[06:03:01] <marostegui>	 Fixed the DNS issue
[06:03:23] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:05:39] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:05:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 5%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71626 and previous config saved to /var/cache/conftool/dbconfig/20241206-060552-root.json
[06:05:55] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:06:01] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:06:09] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:06:21] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:06:35] <wikibugs>	 (PS2) Marostegui: create_pc_tables.sh: Create table in parsercache [software] - https://gerrit.wikimedia.org/r/1100890 (https://phabricator.wikimedia.org/T378068)
[06:06:39] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:06:41] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:06:41] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:06:47] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:07:06] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:07:25] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:08:23] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:09:47] <wikibugs>	 (CR) Marostegui: [C:+2] create_pc_tables.sh: Create table in parsercache [software] - https://gerrit.wikimedia.org/r/1100890 (https://phabricator.wikimedia.org/T378068) (owner: Marostegui)
[06:10:15] <wikibugs>	 (Merged) jenkins-bot: create_pc_tables.sh: Create table in parsercache [software] - https://gerrit.wikimedia.org/r/1100890 (https://phabricator.wikimedia.org/T378068) (owner: Marostegui)
[06:10:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P71627 and previous config saved to /var/cache/conftool/dbconfig/20241206-061021-root.json
[06:20:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71628 and previous config saved to /var/cache/conftool/dbconfig/20241206-062058-root.json
[06:25:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P71629 and previous config saved to /var/cache/conftool/dbconfig/20241206-062527-root.json
[06:36:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 25%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71630 and previous config saved to /var/cache/conftool/dbconfig/20241206-063603-root.json
[06:51:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 50%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71631 and previous config saved to /var/cache/conftool/dbconfig/20241206-065109-root.json
[06:52:03] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[06:52:37] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241206T0700)
[07:04:28] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:04:47] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[07:05:20] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[07:05:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:06:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71632 and previous config saved to /var/cache/conftool/dbconfig/20241206-070614-root.json
[07:06:21] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[07:07:21] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[07:19:27] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[07:20:01] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[07:21:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71633 and previous config saved to /var/cache/conftool/dbconfig/20241206-072120-root.json
[07:36:53] <wikibugs>	 (CR) JMeybohm: [C:+1] Rename mw143[0-5] to wikikube-worker105[2-7] [puppet] - https://gerrit.wikimedia.org/r/1100842 (https://phabricator.wikimedia.org/T377876) (owner: Kamila Součková)
[07:45:55] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538#10385696 (taavi)
[07:46:04] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538#10385697 (taavi) This seems to have caused {T381538}
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241206T0800)
[08:08:25] <wikibugs>	 (CR) Elukey: [C:+1] style: a pass of black on all files [software/spicerack] - https://gerrit.wikimedia.org/r/1100772 (owner: Volans)
[08:11:59] <wikibugs>	 (CR) Elukey: [C:+1] "The change looks good, I do see some changes not related to firewalling in PCC but I have no idea why they are there." [puppet] - https://gerrit.wikimedia.org/r/1100788 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:12:27] <wikibugs>	 (CR) Elukey: [C:+1] "Nevermind, change before this one, got it :)" [puppet] - https://gerrit.wikimedia.org/r/1100788 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:14:16] <wikibugs>	 (CR) Elukey: [C:+1] maps: Remove support for osm2pgsql as OSM engine [puppet] - https://gerrit.wikimedia.org/r/1100784 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:15:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: user@499.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:16:32] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:17:04] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:17:23] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:17:25] <wikibugs>	 (CR) Hashar: [C:+2] "+1 !! Thanks Esuvat for the review!!" [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1100163 (owner: Hashar)
[08:17:32] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:17:59] <wikibugs>	 (Merged) jenkins-bot: Reinstate the banner for the developer survey [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1100163 (owner: Hashar)
[08:18:10] <logmsgbot>	 !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:18:19] <logmsgbot>	 !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:19:36] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538#10385717 (MoritzMuehlenhoff) >>! In T381538#10385696, @taavi wrote: > This seems to have caused {T381639}  Sorry for that!   I debugged the issue an...
[08:22:49] <wikibugs>	 (CR) Jelto: [C:+1] Rename mw143[0-5] to wikikube-worker105[2-7] [puppet] - https://gerrit.wikimedia.org/r/1100842 (https://phabricator.wikimedia.org/T377876) (owner: Kamila Součková)
[08:26:06] <logmsgbot>	 !log hashar@deploy2002 Started deploy [gerrit/gerrit@ac50ebe]: Reinstate the banner for the developer survey
[08:26:17] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [gerrit/gerrit@ac50ebe]: Reinstate the banner for the developer survey (duration: 00m 11s)
[08:28:44] <wikibugs>	 (PS1) Elukey: TEST: dump bios changes to be applied [cookbooks] - https://gerrit.wikimedia.org/r/1100996
[08:28:53] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10385736 (LSobanski) p:Triage→High
[08:29:23] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10385737 (LSobanski) a:Dzahn
[08:30:45] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:30:58] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:33:16] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:33:41] <moritzm>	 !log uploaded ruby-sys-filesystem 1.4.3-1~wmf11u1 to component/puppet7 for Bullseye (needed by the mountpoints fact in facter 4) T381538
[08:33:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:45] <stashbot>	 T381538: Backport facter to bullseye - https://phabricator.wikimedia.org/T381538
[08:43:41] <logmsgbot>	 !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:54:36] <wikibugs>	 (PS1) Jelto: Rename kubernetes[1033-1034] to wikikube-worker[1052-1053] [puppet] - https://gerrit.wikimedia.org/r/1100998 (https://phabricator.wikimedia.org/T377876)
[08:54:55] <wikibugs>	 (PS1) Muehlenhoff: Install updated ruby-sys-filesystem on bulleye systems running Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1100999 (https://phabricator.wikimedia.org/T381538)
[08:55:56] <logmsgbot>	 !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:56:36] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] blackbox/icmp: deployment sites controlled by input parameter instead of ::site [puppet] - https://gerrit.wikimedia.org/r/1100782 (https://phabricator.wikimedia.org/T381561) (owner: Tiziano Fogli)
[08:57:08] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "LGTM, though let's merge on Monday" [puppet] - https://gerrit.wikimedia.org/r/1100838 (https://phabricator.wikimedia.org/T381561) (owner: Tiziano Fogli)
[08:57:21] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM, let's merge on Monday" [puppet] - https://gerrit.wikimedia.org/r/1100839 (https://phabricator.wikimedia.org/T381561) (owner: Tiziano Fogli)
[08:59:41] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1100999 (https://phabricator.wikimedia.org/T381538) (owner: Muehlenhoff)
[08:59:44] <wikibugs>	 (CR) JMeybohm: [C:+1] Rename kubernetes[1033-1034] to wikikube-worker[1052-1053] [puppet] - https://gerrit.wikimedia.org/r/1100998 (https://phabricator.wikimedia.org/T377876) (owner: Jelto)
[09:00:57] <logmsgbot>	 !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:01:33] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: revamp llm model server with aya-8B [deployment-charts] - https://gerrit.wikimedia.org/r/1101000 (https://phabricator.wikimedia.org/T379052)
[09:02:18] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1033-1034].eqiad.wmnet
[09:02:59] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1100999 (https://phabricator.wikimedia.org/T381538) (owner: Muehlenhoff)
[09:03:24] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1033-1034].eqiad.wmnet
[09:04:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:07:07] <wikibugs>	 (CR) Jelto: [C:+2] Rename kubernetes[1033-1034] to wikikube-worker[1052-1053] [puppet] - https://gerrit.wikimedia.org/r/1100998 (https://phabricator.wikimedia.org/T377876) (owner: Jelto)
[09:07:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10385815 (elukey) The error seems to be related to a specific network card:  ` PATCH https://10.65.4.200/redfish/v1/Syst...
[09:09:46] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1033 to wikikube-worker1052
[09:09:55] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[09:10:40] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:11:12] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:13:32] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1033 to wikikube-worker1052 - jelto@cumin1002"
[09:13:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1033 to wikikube-worker1052 - jelto@cumin1002"
[09:13:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:13:58] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1052
[09:14:10] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538#10385843 (taavi)
[09:14:14] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538#10385845 (taavi)
[09:15:28] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1052
[09:16:07] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1033 to wikikube-worker1052
[09:16:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1034 to wikikube-worker1053
[09:16:58] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[09:18:34] <wikibugs>	 (CR) Filippo Giunchedi: "I just read your audit/comments on the related task, ok to proceed whenever!" [puppet] - https://gerrit.wikimedia.org/r/1100839 (https://phabricator.wikimedia.org/T381561) (owner: Tiziano Fogli)
[09:18:41] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] blackbox/tcp: deployment sites controlled by input parameter instead of ::site [puppet] - https://gerrit.wikimedia.org/r/1100839 (https://phabricator.wikimedia.org/T381561) (owner: Tiziano Fogli)
[09:18:48] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "I just read your audit/comments on the related task, ok to proceed whenever!" [puppet] - https://gerrit.wikimedia.org/r/1100838 (https://phabricator.wikimedia.org/T381561) (owner: Tiziano Fogli)
[09:19:27] <wikibugs>	 (PS2) Ilias Sarantopoulos: ml-services: revamp llm model server with aya-8B [deployment-charts] - https://gerrit.wikimedia.org/r/1101000 (https://phabricator.wikimedia.org/T379052)
[09:20:33] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1034 to wikikube-worker1053 - jelto@cumin1002"
[09:21:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1034 to wikikube-worker1053 - jelto@cumin1002"
[09:21:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:21:02] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1053
[09:22:35] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1053
[09:23:14] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1034 to wikikube-worker1053
[09:24:52] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1052.eqiad.wmnet wikikube-worker1053.eqiad.wmnet on all recursors
[09:24:56] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1052.eqiad.wmnet wikikube-worker1053.eqiad.wmnet on all recursors
[09:28:01] <wikibugs>	 (PS1) Brouberol: flink: upgrade to 1.20.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1101004 (https://phabricator.wikimedia.org/T377134)
[09:28:06] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1052.eqiad.wmnet with OS bookworm
[09:28:31] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1053.eqiad.wmnet with OS bookworm
[09:32:10] <wikibugs>	 (CR) Elukey: [C:+1] Install updated ruby-sys-filesystem on bulleye systems running Puppet 7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1100999 (https://phabricator.wikimedia.org/T381538) (owner: Muehlenhoff)
[09:32:50] <wikibugs>	 (PS1) Filippo Giunchedi: tests: assert page severity and summary match [alerts] - https://gerrit.wikimedia.org/r/1101005
[09:33:05] <wikibugs>	 (PS2) Muehlenhoff: Install updated ruby-sys-filesystem on bullseye systems running Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1100999 (https://phabricator.wikimedia.org/T381538)
[09:36:59] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Install updated ruby-sys-filesystem on bullseye systems running Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1100999 (https://phabricator.wikimedia.org/T381538) (owner: Muehlenhoff)
[09:41:05] <wikibugs>	 (PS1) Filippo Giunchedi: tests: fix alertname whitespace check [alerts] - https://gerrit.wikimedia.org/r/1101006
[09:41:49] <wikibugs>	 (CR) Filippo Giunchedi: "CI should have complained on I575c8c5e692 and didn't, thus fix the tests" [alerts] - https://gerrit.wikimedia.org/r/1101006 (owner: Filippo Giunchedi)
[09:44:07] <wikibugs>	 (PS3) Stang: zhwiki: Allow local securepoll setup [mediawiki-config] - https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020)
[09:44:50] <wikibugs>	 (PS1) Slyngshede: Django Admin: Disable admin interface in production [software/bitu] - https://gerrit.wikimedia.org/r/1101007 (https://phabricator.wikimedia.org/T381637)
[09:45:41] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1052.eqiad.wmnet with reason: host reimage
[09:46:04] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1053.eqiad.wmnet with reason: host reimage
[09:48:53] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1052.eqiad.wmnet with reason: host reimage
[09:51:15] <wikibugs>	 (PS2) Abijeet Patro: Translate: Enable message group subscription for 6 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1101008 (https://phabricator.wikimedia.org/T372386)
[09:52:36] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1101008 (https://phabricator.wikimedia.org/T372386) (owner: Abijeet Patro)
[09:52:54] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1053.eqiad.wmnet with reason: host reimage
[09:53:56] <wikibugs>	 (PS1) Slyngshede: P:idm disable index listings for Bitu media and static content. [puppet] - https://gerrit.wikimedia.org/r/1101009 (https://phabricator.wikimedia.org/T381637)
[10:04:20] <wikibugs>	 (CR) Muehlenhoff: P:idm disable index listings for Bitu media and static content. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1101009 (https://phabricator.wikimedia.org/T381637) (owner: Slyngshede)
[10:05:22] <wikibugs>	 (PS2) Slyngshede: P:idm disable index listings for Bitu media and static content. [puppet] - https://gerrit.wikimedia.org/r/1101009 (https://phabricator.wikimedia.org/T381637)
[10:05:30] <wikibugs>	 (CR) Slyngshede: P:idm disable index listings for Bitu media and static content. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1101009 (https://phabricator.wikimedia.org/T381637) (owner: Slyngshede)
[10:07:43] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1101009 (https://phabricator.wikimedia.org/T381637) (owner: Slyngshede)
[10:08:18] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1052.eqiad.wmnet with OS bookworm
[10:11:25] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1053.eqiad.wmnet with OS bookworm
[10:11:58] <jelto>	 !log homer 'cr*eqiad*' commit 'T377876'
[10:12:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:02] <stashbot>	 T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876
[10:13:23] <wikibugs>	 (PS2) Brouberol: flink: upgrade to 1.20.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1101004 (https://phabricator.wikimedia.org/T377134)
[10:15:22] <wikibugs>	 (CR) Slyngshede: [C:+2] P:idm disable index listings for Bitu media and static content. [puppet] - https://gerrit.wikimedia.org/r/1101009 (https://phabricator.wikimedia.org/T381637) (owner: Slyngshede)
[10:16:58] <wikibugs>	 (CR) DCausse: [C:+1] "lgtm, thanks!" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1101004 (https://phabricator.wikimedia.org/T377134) (owner: Brouberol)
[10:21:06] <wikibugs>	 (CR) Brouberol: [C:+2] flink: upgrade to 1.20.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1101004 (https://phabricator.wikimedia.org/T377134) (owner: Brouberol)
[10:21:18] <wikibugs>	 (CR) Brouberol: [V:+2 C:+2] flink: upgrade to 1.20.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/1101004 (https://phabricator.wikimedia.org/T377134) (owner: Brouberol)
[10:21:32] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:25:23] <wikibugs>	 (PS1) Filippo Giunchedi: tests: validate deploy-tag values [alerts] - https://gerrit.wikimedia.org/r/1101019
[10:27:28] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1052-1053].eqiad.wmnet
[10:27:29] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1052-1053].eqiad.wmnet
[10:28:17] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10385961 (Jelto)
[10:30:58] <wikibugs>	 (PS1) Muehlenhoff: Fix wdqs-all alias [puppet] - https://gerrit.wikimedia.org/r/1101020
[10:34:28] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Fix wdqs-all alias [puppet] - https://gerrit.wikimedia.org/r/1101020 (owner: Muehlenhoff)
[10:34:55] <wikibugs>	 (CR) Muehlenhoff: cumin: add aliases for net-new wdqs services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1100465 (https://phabricator.wikimedia.org/T376150) (owner: Bking)
[10:35:15] <wikibugs>	 (PS1) Jelto: Rename kubernetes[1035-1036] to wikikube-worker[1054-1055] [puppet] - https://gerrit.wikimedia.org/r/1101022 (https://phabricator.wikimedia.org/T377876)
[10:37:14] <wikibugs>	 (CR) JMeybohm: [C:+1] Rename kubernetes[1035-1036] to wikikube-worker[1054-1055] [puppet] - https://gerrit.wikimedia.org/r/1101022 (https://phabricator.wikimedia.org/T377876) (owner: Jelto)
[10:39:28] <wikibugs>	 (CR) Jelto: [C:+2] Rename kubernetes[1035-1036] to wikikube-worker[1054-1055] [puppet] - https://gerrit.wikimedia.org/r/1101022 (https://phabricator.wikimedia.org/T377876) (owner: Jelto)
[10:39:56] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1035-1036].eqiad.wmnet
[10:41:02] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1035-1036].eqiad.wmnet
[10:43:36] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1035 to wikikube-worker1054
[10:43:56] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[10:44:26] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:44:44] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:47:27] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1035 to wikikube-worker1054 - jelto@cumin1002"
[10:47:48] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1035 to wikikube-worker1054 - jelto@cumin1002"
[10:47:49] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:47:49] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1054
[10:48:57] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1054
[10:49:36] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1035 to wikikube-worker1054
[10:49:55] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1101007 (https://phabricator.wikimedia.org/T381637) (owner: Slyngshede)
[10:52:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2042.codfw.wmnet to cluster codfw and group D
[10:53:00] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1036 to wikikube-worker1055
[10:53:20] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[10:53:46] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2042.codfw.wmnet to cluster codfw and group D
[10:57:14] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1036 to wikikube-worker1055 - jelto@cumin1002"
[10:57:30] <wikibugs>	 (CR) FNegri: [C:+1] "LGTM! Sorry for not following the "no whitespace" convention, did something break because of the space?" [alerts] - https://gerrit.wikimedia.org/r/1101006 (owner: Filippo Giunchedi)
[10:57:37] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1036 to wikikube-worker1055 - jelto@cumin1002"
[10:57:37] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:57:38] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1055
[10:58:44] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1055
[10:58:55] <wikibugs>	 (CR) FNegri: [C:+1] "Nice one." [alerts] - https://gerrit.wikimedia.org/r/1101019 (owner: Filippo Giunchedi)
[10:59:23] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1036 to wikikube-worker1055
[11:00:15] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1054.eqiad.wmnet wikikube-worker1055.eqiad.wmnet on all recursors
[11:00:19] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1054.eqiad.wmnet wikikube-worker1055.eqiad.wmnet on all recursors
[11:03:11] <wikibugs>	 (CR) Tiziano Fogli: [C:+1] tests: assert page severity and summary match [alerts] - https://gerrit.wikimedia.org/r/1101005 (owner: Filippo Giunchedi)
[11:03:32] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] tests: assert page severity and summary match [alerts] - https://gerrit.wikimedia.org/r/1101005 (owner: Filippo Giunchedi)
[11:04:28] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:05:12] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1054.eqiad.wmnet with OS bookworm
[11:05:30] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1055.eqiad.wmnet with OS bookworm
[11:05:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:05:55] <wikibugs>	 (CR) Tiziano Fogli: [C:+1] tests: validate deploy-tag values [alerts] - https://gerrit.wikimedia.org/r/1101019 (owner: Filippo Giunchedi)
[11:06:34] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] "Nothing broke no, it is a naming convention though no automated process relies on it AFAIK" [alerts] - https://gerrit.wikimedia.org/r/1101006 (owner: Filippo Giunchedi)
[11:09:26] <wikibugs>	 (CR) Tiziano Fogli: [C:+1] tests: fix alertname whitespace check [alerts] - https://gerrit.wikimedia.org/r/1101006 (owner: Filippo Giunchedi)
[11:12:56] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:13:01] <wikibugs>	 (CR) Filippo Giunchedi: "Thank you for the reviews, since the wmcs alerts are paging I'm holding off until Monday to avoid surprises. Please let me know if you'd l" [alerts] - https://gerrit.wikimedia.org/r/1101019 (owner: Filippo Giunchedi)
[11:22:56] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:24:09] <wikibugs>	 (CR) FNegri: [C:+1] "I don't expect any issues, but waiting til Monday sounds good!" [alerts] - https://gerrit.wikimedia.org/r/1101019 (owner: Filippo Giunchedi)
[11:26:29] <wikibugs>	 ops-codfw, SRE, DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10386086 (MoritzMuehlenhoff) I've readded ganeti2042 to the cluster and moved on VM to the node. I'll report back if there's any issues, otherwise I think you can send back...
[11:30:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host build2002.codfw.wmnet with OS bookworm
[11:30:22] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Create bookworm-based build host - https://phabricator.wikimedia.org/T379343#10386092 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host build2002.codfw.wmnet with OS bookworm
[11:30:33] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1054.eqiad.wmnet with reason: host reimage
[11:31:56] <wikibugs>	 (CR) Ladsgroup: [C:+1] Prepare for migration of the Interwiki extension to core [mediawiki-config] - https://gerrit.wikimedia.org/r/1100217 (https://phabricator.wikimedia.org/T33951) (owner: Tim Starling)
[11:32:40] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1055.eqiad.wmnet with reason: host reimage
[11:34:12] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1054.eqiad.wmnet with reason: host reimage
[11:38:17] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1055.eqiad.wmnet with reason: host reimage
[11:41:02] <wikibugs>	 SRE, decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652 (MoritzMuehlenhoff) NEW
[11:45:12] <wikibugs>	 SRE, decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386122 (MoritzMuehlenhoff)
[11:45:42] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10386123 (MoritzMuehlenhoff)
[11:48:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on build2002.codfw.wmnet with reason: host reimage
[11:48:39] <wikibugs>	 (PS1) Muehlenhoff: Add ganeti1053/1054 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1101031 (https://phabricator.wikimedia.org/T381576)
[11:51:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on build2002.codfw.wmnet with reason: host reimage
[11:52:55] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1054.eqiad.wmnet with OS bookworm
[11:53:25] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add ganeti1053/1054 to site.pp [puppet] - https://gerrit.wikimedia.org/r/1101031 (https://phabricator.wikimedia.org/T381576) (owner: Muehlenhoff)
[11:54:29] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, Patch-For-Review: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10386137 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→None site.pp has been updated
[11:56:45] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1055.eqiad.wmnet with OS bookworm
[11:58:06] <jelto>	 !log homer 'cr*eqiad*' commit 'T377876'
[11:58:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:09] <stashbot>	 T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876
[11:58:45] <wikibugs>	 (CR) Slyngshede: [C:+2] Django Admin: Disable admin interface in production [software/bitu] - https://gerrit.wikimedia.org/r/1101007 (https://phabricator.wikimedia.org/T381637) (owner: Slyngshede)
[12:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241206T0800)
[12:00:05] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: Time to snap out of that daydream and deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241206T1200).
[12:01:14] <wikibugs>	 (Merged) jenkins-bot: Django Admin: Disable admin interface in production [software/bitu] - https://gerrit.wikimedia.org/r/1101007 (https://phabricator.wikimedia.org/T381637) (owner: Slyngshede)
[12:09:54] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host build2002.codfw.wmnet with OS bookworm
[12:10:06] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Create bookworm-based build host - https://phabricator.wikimedia.org/T379343#10386230 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host build2002.codfw.wmnet with OS bookworm completed: - build2002 (**PASS**)...
[12:15:09] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1054-1055].eqiad.wmnet
[12:15:11] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1054-1055].eqiad.wmnet
[12:15:51] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10386261 (Jelto)
[12:18:26] <wikibugs>	 (PS1) Jelto: Rename kubernetes[1037-1038] to wikikube-worker[1056-1057] [puppet] - https://gerrit.wikimedia.org/r/1101036 (https://phabricator.wikimedia.org/T377876)
[12:21:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1009.eqiad.wmnet
[12:23:53] <wikibugs>	 (PS1) Hnowlan: mediawiki: pass raw input to mercurius [deployment-charts] - https://gerrit.wikimedia.org/r/1101038 (https://phabricator.wikimedia.org/T371701)
[12:30:54] <wikibugs>	 (CR) Clément Goubert: [C:+1] mediawiki: pass raw input to mercurius [deployment-charts] - https://gerrit.wikimedia.org/r/1101038 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[12:36:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[12:38:16] <wikibugs>	 (CR) JMeybohm: [C:+1] Rename kubernetes[1037-1038] to wikikube-worker[1056-1057] [puppet] - https://gerrit.wikimedia.org/r/1101036 (https://phabricator.wikimedia.org/T377876) (owner: Jelto)
[12:39:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1009.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[12:40:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1009.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[12:40:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:40:22] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti1009.eqiad.wmnet
[12:40:28] <wikibugs>	 SRE, decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386359 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1009.eqiad.wmnet` - ganeti1009.eqiad.wmnet (**FAIL...
[12:43:26] <wikibugs>	 SRE, decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386373 (MoritzMuehlenhoff)
[12:47:10] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1037-1038].eqiad.wmnet
[12:48:23] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1037-1038].eqiad.wmnet
[12:50:16] <wikibugs>	 (CR) Hnowlan: [C:+2] mediawiki: pass raw input to mercurius [deployment-charts] - https://gerrit.wikimedia.org/r/1101038 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[12:52:16] <wikibugs>	 (Merged) jenkins-bot: mediawiki: pass raw input to mercurius [deployment-charts] - https://gerrit.wikimedia.org/r/1101038 (https://phabricator.wikimedia.org/T371701) (owner: Hnowlan)
[12:54:27] <wikibugs>	 (CR) Jelto: [C:+2] Rename kubernetes[1037-1038] to wikikube-worker[1056-1057] [puppet] - https://gerrit.wikimedia.org/r/1101036 (https://phabricator.wikimedia.org/T377876) (owner: Jelto)
[12:56:30] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1037 to wikikube-worker1056
[12:56:50] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[12:57:51] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:57:51] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[12:58:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1016.eqiad.wmnet
[13:01:36] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1037 to wikikube-worker1056 - jelto@cumin1002"
[13:04:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:07:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:10:35] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1037 to wikikube-worker1056 - jelto@cumin1002"
[13:10:35] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:10:35] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1056
[13:11:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on kubernetes1038:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1038 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:11:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[13:11:47] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1056
[13:11:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[13:11:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:11:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1016.eqiad.wmnet
[13:11:59] <wikibugs>	 SRE, decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386438 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1016.eqiad.wmnet` - ganeti1016.eqiad.wmnet (**PASS...
[13:12:26] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1037 to wikikube-worker1056
[13:13:10] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1038 to wikikube-worker1057
[13:13:30] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[13:17:13] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1038 to wikikube-worker1057 - jelto@cumin1002"
[13:17:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1038 to wikikube-worker1057 - jelto@cumin1002"
[13:17:59] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:17:59] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1057
[13:19:11] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1057
[13:19:50] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1038 to wikikube-worker1057
[13:21:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1017.eqiad.wmnet
[13:25:48] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1056.eqiad.wmnet wikikube-worker1057.eqiad.wmnet on all recursors
[13:25:51] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1056.eqiad.wmnet wikikube-worker1057.eqiad.wmnet on all recursors
[13:28:54] <wikibugs>	 (CR) ZhaoFJx: [C:+1] "Not sure about the scrutineer, but sysop LGTM :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: Stang)
[13:29:16] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1056.eqiad.wmnet with OS bookworm
[13:29:36] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1057.eqiad.wmnet with OS bookworm
[13:31:00] <wikibugs>	 (CR) Stang: "Referenced T377531#10369860" [mediawiki-config] - https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: Stang)
[13:31:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:32:04] <wikibugs>	 SRE, decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386486 (MoritzMuehlenhoff)
[13:34:55] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10386500 (SuzanneWood-WMDE) I signed that : )
[13:35:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[13:35:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[13:35:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:35:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1017.eqiad.wmnet
[13:35:42] <wikibugs>	 SRE, decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386501 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1017.eqiad.wmnet` - ganeti1017.eqiad.wmnet (**PASS...
[13:36:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1018.eqiad.wmnet
[13:43:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:46:51] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1056.eqiad.wmnet with reason: host reimage
[13:47:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1018.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[13:50:31] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1056.eqiad.wmnet with reason: host reimage
[13:50:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:55:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:55:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1018.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[13:55:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:55:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1018.eqiad.wmnet
[13:55:49] <wikibugs>	 SRE, decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386552 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1018.eqiad.wmnet` - ganeti1018.eqiad.wmnet (**PASS...
[13:56:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1020.eqiad.wmnet
[14:10:12] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1056.eqiad.wmnet with OS bookworm
[14:11:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[14:12:51] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: es2045 went down: CPU error - https://phabricator.wikimedia.org/T381549#10386587 (Jhancock.wm) they denied the request. gonna resubmit. didn't see any errors this morning since draining power and reseating the CPU.   could you try to get it to fail again...
[14:14:10] <wikibugs>	 (PS2) Kamila Součková: Rename mw143[0-5] to wikikube-worker10[58-63] [puppet] - https://gerrit.wikimedia.org/r/1100842 (https://phabricator.wikimedia.org/T377876)
[14:15:08] <wikibugs>	 ops-codfw, SRE, Data-Persistence, DC-Ops: es2045 went down: CPU error - https://phabricator.wikimedia.org/T381549#10386588 (Marostegui) I will do it on Monday, as I need to stop another server to clone this one and I don't want to leave it stopped before the weekend.  I'll keep you posted. Thank...
[14:15:16] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission an-presto1005.eqiad.wmnet - https://phabricator.wikimedia.org/T381491#10386590 (BTullis) Open→Declined We have decided to postpone the de-racking, just in case we decide to re-commission these five servers as Hadoop workers,...
[14:15:19] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission an-presto1004.eqiad.wmnet - https://phabricator.wikimedia.org/T381490#10386594 (BTullis) Open→Declined We have decided to postpone the de-racking, just in case we decide to re-commission these five servers as Hadoop workers,...
[14:15:23] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission an-presto1003.eqiad.wmnet - https://phabricator.wikimedia.org/T381489#10386598 (BTullis) Open→Declined We have decided to postpone the de-racking, just in case we decide to re-commission these five servers as Hadoop workers,...
[14:15:24] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission an-presto1002.eqiad.wmnet - https://phabricator.wikimedia.org/T381488#10386602 (BTullis) Open→Declined We have decided to postpone the de-racking, just in case we decide to re-commission these five servers as Hadoop workers,...
[14:15:26] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission an-presto1001.eqiad.wmnet - https://phabricator.wikimedia.org/T381487#10386606 (BTullis) Open→Declined We have decided to postpone the de-racking, just in case we decide to re-commission these five servers as Hadoop workers,...
[14:17:37] <wikibugs>	 (PS3) Kamila Součková: Rename mw143[0-5] to wikikube-worker10[58-63] [puppet] - https://gerrit.wikimedia.org/r/1100842 (https://phabricator.wikimedia.org/T377876)
[14:19:29] <wikibugs>	 (CR) Kamila Součková: "re-did this due to new numbers clashes" [puppet] - https://gerrit.wikimedia.org/r/1100842 (https://phabricator.wikimedia.org/T377876) (owner: Kamila Součková)
[14:19:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1020.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[14:20:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1020.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[14:20:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:20:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1020.eqiad.wmnet
[14:20:28] <wikibugs>	 SRE, decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386609 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1020.eqiad.wmnet` - ganeti1020.eqiad.wmnet (**PASS...
[14:21:36] <wikibugs>	 SRE, decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386613 (MoritzMuehlenhoff)
[14:27:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: ganeti1009.eqiad.wmnet
[14:27:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: ganeti1009.eqiad.wmnet
[14:27:16] <wikibugs>	 SRE, decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386615 (ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: ganeti1009.eqiad.wmnet
[14:29:40] <wikibugs>	 (PS1) Muehlenhoff: Remove site.pp entries of decommed Ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/1101065 (https://phabricator.wikimedia.org/T381652)
[14:29:56] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10386619 (elukey) @Jclark-ctr if those are not urgent I'd ask you to leave them to me for some tests, I'll ping you when...
[14:32:15] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove site.pp entries of decommed Ganeti nodes [puppet] - https://gerrit.wikimedia.org/r/1101065 (https://phabricator.wikimedia.org/T381652) (owner: Muehlenhoff)
[14:33:17] <wikibugs>	 SRE, decommission-hardware, Patch-For-Review: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386626 (MoritzMuehlenhoff)
[14:33:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386627 (MoritzMuehlenhoff)
[14:34:17] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10386630 (MoritzMuehlenhoff) Open→Resolved All new servers added, all old server decommissioned and clusters rebalanced.
[14:35:47] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to <wmf group> for <cpetrillo> - https://phabricator.wikimedia.org/T381464#10386635 (HShaikh) For some clarity. The request is for Chris to be able to eventually run jupyter notebooks.  So he is requesting access to the analytics-privatedata-users group in the anal...
[14:40:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:42:27] <wikibugs>	 (PS1) Muehlenhoff: Deprecate system::role for wikireplicas roles [puppet] - https://gerrit.wikimedia.org/r/1101068
[14:43:05] <wikibugs>	 (CR) CI reject: [V:-1] Deprecate system::role for wikireplicas roles [puppet] - https://gerrit.wikimedia.org/r/1101068 (owner: Muehlenhoff)
[14:46:58] <wikibugs>	 (Abandoned) Btullis: Revert "Upgrade the remainder of the cephosd cluster to nftables" [puppet] - https://gerrit.wikimedia.org/r/1099669 (https://phabricator.wikimedia.org/T381264) (owner: Btullis)
[14:49:41] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1057.eqiad.wmnet with OS bookworm
[14:50:08] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1057.eqiad.wmnet with OS bookworm
[14:51:27] <wikibugs>	 (PS2) Muehlenhoff: Deprecate system::role for wikireplicas roles [puppet] - https://gerrit.wikimedia.org/r/1101068
[14:53:14] <wikibugs>	 (PS1) Máté Szabó: dialog: Fix wrong title on Types of unacceptable behavior step [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1101069 (https://phabricator.wikimedia.org/T381529)
[14:53:37] <wikibugs>	 (PS1) Máté Szabó: dialog: Fix spacing between buttons in the dialog footer [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1101070 (https://phabricator.wikimedia.org/T381530)
[14:54:09] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1101069 (https://phabricator.wikimedia.org/T381529) (owner: Máté Szabó)
[14:54:32] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1101070 (https://phabricator.wikimedia.org/T381530) (owner: Máté Szabó)
[15:00:38] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1100101 (owner: Máté Szabó)
[15:02:38] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[1430-1435].eqiad.wmnet
[15:03:06] <wikibugs>	 (CR) Kamila Součková: [C:+2] "proceeding, as the changes after the +1s were trivial" [puppet] - https://gerrit.wikimedia.org/r/1100842 (https://phabricator.wikimedia.org/T377876) (owner: Kamila Součková)
[15:04:28] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:05:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:05:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[1430-1435].eqiad.wmnet
[15:08:18] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1430 to wikikube-worker1058
[15:08:39] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:10:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1431 to wikikube-worker1059
[15:13:05] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1430 to wikikube-worker1058 - kamila@cumin1002"
[15:13:38] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1430 to wikikube-worker1058 - kamila@cumin1002"
[15:13:39] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:13:39] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1058
[15:14:06] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:14:50] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1058
[15:15:29] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1430 to wikikube-worker1058
[15:16:32] <wikibugs>	 (PS2) Máté Szabó: dialog: Fix spacing between buttons in the dialog footer [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - https://gerrit.wikimedia.org/r/1101070 (https://phabricator.wikimedia.org/T381530)
[15:18:03] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1431 to wikikube-worker1059 - kamila@cumin1002"
[15:18:04] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1432 to wikikube-worker1060
[15:18:07] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1431 to wikikube-worker1059 - kamila@cumin1002"
[15:18:07] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:18:08] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1059
[15:18:25] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:18:40] <jinxer-wm>	 FIRING: [2x] KubernetesRsyslogDown: rsyslog on mw1433:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:19:16] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1059
[15:19:55] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1431 to wikikube-worker1059
[15:20:42] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1433 to wikikube-worker1061
[15:20:49] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1434 to wikikube-worker1062
[15:20:53] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1435 to wikikube-worker1063
[15:21:28] <wikibugs>	 ops-esams, ops-magru, SRE, DC-Ops, Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10386737 (RobH) >>! In T373993#10385350, @BCornwall wrote: > Some observations: >  > * [[ https://grafana.wikimedia.org/goto/_53fKoVHR?orgId=1 | magru has the highest ave...
[15:22:15] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1432 to wikikube-worker1060 - kamila@cumin1002"
[15:22:45] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1432 to wikikube-worker1060 - kamila@cumin1002"
[15:22:45] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:22:46] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1060
[15:22:47] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:23:52] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1060
[15:24:31] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1432 to wikikube-worker1060
[15:26:51] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1433 to wikikube-worker1061 - kamila@cumin1002"
[15:26:56] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1433 to wikikube-worker1061 - kamila@cumin1002"
[15:26:57] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:26:57] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1061
[15:27:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:28:10] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1061
[15:28:49] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1433 to wikikube-worker1061
[15:29:53] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:29:54] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1063
[15:29:56] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.netbox
[15:30:59] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1063
[15:31:38] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1435 to wikikube-worker1063
[15:32:14] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:32:15] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1062
[15:33:15] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1062
[15:33:53] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1434 to wikikube-worker1062
[15:34:35] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1058.eqiad.wmnet wikikube-worker1059.eqiad.wmnet wikikube-worker1060.eqiad.wmnet wikikube-worker1061.eqiad.wmnet wikikube-worker1062.eqiad.wmnet wikikube-worker1063.eqiad.wmnet on all recursors
[15:34:38] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1058.eqiad.wmnet wikikube-worker1059.eqiad.wmnet wikikube-worker1060.eqiad.wmnet wikikube-worker1061.eqiad.wmnet wikikube-worker1062.eqiad.wmnet wikikube-worker1063.eqiad.wmnet on all recursors
[15:36:11] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker1058.eqiad.wmnet
[15:36:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1058.eqiad.wmnet with OS bullseye
[15:36:49] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1058.eqiad.wmnet with OS bullseye
[15:36:50] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker1058.eqiad.wmnet
[15:39:09] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1058.eqiad.wmnet with OS bookworm
[15:41:25] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1059.eqiad.wmnet with OS bookworm
[15:41:46] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1060.eqiad.wmnet with OS bookworm
[15:42:15] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1061.eqiad.wmnet with OS bookworm
[15:43:05] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1062.eqiad.wmnet with OS bookworm
[15:43:25] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1063.eqiad.wmnet with OS bookworm
[15:45:00] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:45:52] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.221 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:50:02] <wikibugs>	 ops-eqiad, SRE, collaboration-services, DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10386830 (kamila)
[15:52:54] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Discovery-Search, Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10386835 (elukey) I am reviewing the quote of these nodes to figure out what the item is, afaics it seems a 10G network...
[15:54:52] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1058.eqiad.wmnet with reason: host reimage
[15:57:22] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1060.eqiad.wmnet with reason: host reimage
[15:58:12] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1061.eqiad.wmnet with reason: host reimage
[15:58:45] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1058.eqiad.wmnet with reason: host reimage
[15:58:55] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1062.eqiad.wmnet with reason: host reimage
[15:59:14] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1063.eqiad.wmnet with reason: host reimage
[16:01:57] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1060.eqiad.wmnet with reason: host reimage
[16:05:33] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1062.eqiad.wmnet with reason: host reimage
[16:08:49] <wikibugs>	 (CR) FNegri: [C:+1] "LGTM, thanks for cleaning this up." [puppet] - https://gerrit.wikimedia.org/r/1101068 (owner: Muehlenhoff)
[16:09:42] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1063.eqiad.wmnet with reason: host reimage
[16:10:22] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1057.eqiad.wmnet with OS bookworm
[16:11:20] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1057.eqiad.wmnet with OS bookworm
[16:12:10] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1061.eqiad.wmnet with reason: host reimage
[16:17:13] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1058.eqiad.wmnet with OS bookworm
[16:20:24] <wikibugs>	 (PS1) Hnowlan: mediawiki: add debug flag for mercurius [deployment-charts] - https://gerrit.wikimedia.org/r/1101081 (https://phabricator.wikimedia.org/T371701)
[16:20:35] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1060.eqiad.wmnet with OS bookworm
[16:24:42] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1062.eqiad.wmnet with OS bookworm
[16:28:09] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1063.eqiad.wmnet with OS bookworm
[16:29:05] <logmsgbot>	 !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1059.eqiad.wmnet with OS bookworm
[16:29:37] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1059.eqiad.wmnet with OS bookworm
[16:30:14] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1061.eqiad.wmnet with OS bookworm
[16:32:54] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:35:10] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] maps: Remove support for osm2pgsql as OSM engine [puppet] - 10https://gerrit.wikimedia.org/r/1100784 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff)
[16:37:54] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:40:47] <wikibugs>	 (03PS1) 10Btullis: Add ahoelzl to analytics-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1101082 (https://phabricator.wikimedia.org/T345959)
[16:43:09] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Add ahoelzl to analytics-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1101082 (https://phabricator.wikimedia.org/T345959) (owner: 10Btullis)
[16:45:45] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1059.eqiad.wmnet with reason: host reimage
[16:47:28] <wikibugs>	 (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1101083
[16:48:45] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1059.eqiad.wmnet with reason: host reimage
[16:48:58] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.121e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[16:50:05] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Comm Error: backplane 0 when reimaging wikikube-worker1057 - https://phabricator.wikimedia.org/T381676 (10Jelto) 03NEW
[17:04:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:08:12] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1059.eqiad.wmnet with OS bookworm
[17:08:33] <logmsgbot>	 !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1058-1063].eqiad.wmnet
[17:08:36] <logmsgbot>	 !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1058-1063].eqiad.wmnet
[17:25:45] <jinxer-wm>	 FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:26:07] <sukhe>	 ^ I will get a lot of @sukhe for this but this should be a paging alert
[17:26:39] <sukhe>	 a widespread puppet failure that's simply a scroll in the alerts channel is not enough
[17:27:48] <wikibugs>	 (03PS1) 10Andrea Denisse: ldap: Grant access to the wmf group for cpetrillo [puppet] - 10https://gerrit.wikimedia.org/r/1101090 (https://phabricator.wikimedia.org/T381464)
[17:28:05] <jhathaway>	 sukhe: that is probably me, looking
[17:28:46] <sukhe>	 jhathaway: I am not fully sure, I think there are some network issues at play that are unrelated to you
[17:28:50] <sukhe>	 topranks: ^
[17:29:10] <topranks>	 !log splitting codfw -> eqsin traffic over path via ulsfo as direct link is saturated 
[17:29:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:27] <jhathaway>	 could be, I was running puppet with batches of 75 in codfw, perhaps that was too much
[17:29:29] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+2] ldap: Grant access to the wmf group for cpetrillo [puppet] - 10https://gerrit.wikimedia.org/r/1101090 (https://phabricator.wikimedia.org/T381464) (owner: 10Andrea Denisse)
[17:30:03] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to <wmf group> for <cpetrillo> - https://phabricator.wikimedia.org/T381464#10387234 (10andrea.denisse) I've added cpetrillo to the `wmf` group and to the `WMF-NDA` Phabricator group. Please reopen the task if there's anything else I can assist w...
[17:30:13] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to <wmf group> for <cpetrillo> - https://phabricator.wikimedia.org/T381464#10387235 (10andrea.denisse) 05In progress→03Resolved
[17:31:35] <logmsgbot>	 !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1057.eqiad.wmnet with OS bookworm
[17:40:26] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101092
[17:44:42] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] "mostly harmless, but it can give an alert that too many facts are being persisted to puppetdb" [puppet] - 10https://gerrit.wikimedia.org/r/1099748 (https://phabricator.wikimedia.org/T381293) (owner: 10Andrew Bogott)
[17:45:45] <jinxer-wm>	 RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:53:46] <wikibugs>	 (03PS4) 10Ottomata: mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817)
[17:54:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[17:55:13] <wikibugs>	 (03PS5) 10Ottomata: mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817)
[17:55:31] <wikibugs>	 (03CR) 10Scott French: "Thanks for the review, Hugh!" [puppet] - 10https://gerrit.wikimedia.org/r/1084247 (owner: 10Scott French)
[17:55:39] <wikibugs>	 (03PS6) 10Ottomata: mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817)
[17:56:20] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[17:57:34] <wikibugs>	 (03PS7) 10Ottomata: mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817)
[18:07:09] <wikibugs>	 (03CR) 10Bking: [C:03+2] partman: add recipe for UEFI 4-disk SW RAID-10 [puppet] - 10https://gerrit.wikimedia.org/r/1099740 (https://phabricator.wikimedia.org/T373519) (owner: 10Bking)
[18:10:05] <wikibugs>	 (03PS1) 10JHathaway: hadoop: sort local-dirs [puppet] - 10https://gerrit.wikimedia.org/r/1101093 (https://phabricator.wikimedia.org/T381538)
[18:10:22] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101093 (https://phabricator.wikimedia.org/T381538) (owner: 10JHathaway)
[18:17:19] <wikibugs>	 (03CR) 10Jdlrobson: Enable Empty search A/B test on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100869 (https://phabricator.wikimedia.org/T378115) (owner: 10Jdlrobson)
[18:18:16] <wikibugs>	 (03PS1) 10Jdlrobson: Fixes A/B test for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101094 (https://phabricator.wikimedia.org/T378115)
[18:21:19] <wikibugs>	 (03PS1) 10Bking: wdqs1025: Configure partitions for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1101095 (https://phabricator.wikimedia.org/T378030)
[18:33:30] <wikibugs>	 (03PS1) 10AntiCompositeNumber: entrypoint.sh: use full thumbor path [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101097
[18:34:57] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Interesting." [puppet] - 10https://gerrit.wikimedia.org/r/1101095 (https://phabricator.wikimedia.org/T378030) (owner: 10Bking)
[18:37:24] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] entrypoint.sh: use full thumbor path [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101097 (owner: 10AntiCompositeNumber)
[18:42:18] <wikibugs>	 (03PS2) 10Herron: pyrra: onboard wdqs-availability [puppet] - 10https://gerrit.wikimedia.org/r/1101083 (https://phabricator.wikimedia.org/T302995)
[18:42:18] <wikibugs>	 (03CR) 10Herron: [C:03+2] "self merge for initial onboarding" [puppet] - 10https://gerrit.wikimedia.org/r/1101083 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[18:42:28] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to <wmf group> for <cpetrillo> - https://phabricator.wikimedia.org/T381464#10387466 (10Dzahn) >>! In T381464#10386635, @HShaikh wrote: > For some clarity. The request is for Chris to be able to eventually run jupyter notebooks.  > So he is requesting access to th...
[18:48:27] <wikibugs>	 (03CR) 10Bking: [C:03+2] wdqs1025: Configure partitions for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1101095 (https://phabricator.wikimedia.org/T378030) (owner: 10Bking)
[18:53:02] <wikibugs>	 (03PS1) 10Herron: pyrra: switch wdqs-availability ratio type [puppet] - 10https://gerrit.wikimedia.org/r/1101099 (https://phabricator.wikimedia.org/T302995)
[18:54:08] <wikibugs>	 (03PS2) 10Herron: pyrra: switch wdqs-availability ratio type [puppet] - 10https://gerrit.wikimedia.org/r/1101099 (https://phabricator.wikimedia.org/T302995)
[18:55:25] <inflatador>	 uefi
[18:56:57] <wikibugs>	 (03CR) 10Herron: [C:03+2] pyrra: switch wdqs-availability ratio type [puppet] - 10https://gerrit.wikimedia.org/r/1101099 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[19:00:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.provision for host wdqs1025.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[19:01:46] <wikibugs>	 06SRE, 06Editing-team, 10MediaWiki-Debug-Logger, 10observability, and 5 others: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10387562 (10Urbanecm_WMF) a:03Urbanecm_WMF Thanks @Michael. I think the best course of action is to revert that change, as it is making...
[19:02:16] <wikibugs>	 06SRE, 06Editing-team, 10MediaWiki-Debug-Logger, 10observability, and 5 others: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10387566 (10Urbanecm_WMF) @kharlan @catrope As the engineers who made the original change, CCing you, in case you have any concerns with...
[19:04:28] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:05:00] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1025.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[19:05:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:22:28] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:22:34] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:26:02] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 8392 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[19:40:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye
[19:40:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10387631 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye
[20:08:30] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1025.eqiad.wmnet with reason: host reimage
[20:12:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1025.eqiad.wmnet with reason: host reimage
[20:29:01] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1025.eqiad.wmnet with OS bullseye
[20:29:18] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10387829 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye completed: - wdqs...
[20:35:27] <wikibugs>	 (03CR) 10Krinkle: mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[21:04:28] <jinxer-wm>	 FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:04:32] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10387912 (10KFrancis) Waiting on legal counsel to countersign. I just pinged him.
[21:07:38] <wikibugs>	 (03PS1) 10Herron: pyrra: wdqs-availability invert query [puppet] - 10https://gerrit.wikimedia.org/r/1101113 (https://phabricator.wikimedia.org/T302995)
[21:07:53] <wikibugs>	 (03CR) 10CI reject: [V:04-1] pyrra: wdqs-availability invert query [puppet] - 10https://gerrit.wikimedia.org/r/1101113 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[21:08:27] <wikibugs>	 (03PS2) 10Herron: pyrra: wdqs-availability invert query [puppet] - 10https://gerrit.wikimedia.org/r/1101113 (https://phabricator.wikimedia.org/T302995)
[21:13:48] <wikibugs>	 (03CR) 10Herron: [C:03+2] "self merge sorting out pyrra onboarding" [puppet] - 10https://gerrit.wikimedia.org/r/1101113 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[21:17:45] <wikibugs>	 (03PS1) 10Herron: add pyrra note for wdqs-availability [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1101114 (https://phabricator.wikimedia.org/T302995)
[21:18:02] <wikibugs>	 (03CR) 10Herron: [V:03+2 C:03+2] add pyrra note for wdqs-availability [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1101114 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron)
[22:33:16] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[22:33:50] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[22:40:08] <wikibugs>	 (03PS1) 10Scott French: mw-(apt-ext|api-int|jobrunner|parsoid|web): set php.version to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101121 (https://phabricator.wikimedia.org/T377040)
[22:40:33] <wikibugs>	 (03PS4) 10Scott French: hieradata: add "migration" release of mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1081451 (https://phabricator.wikimedia.org/T377040)
[22:40:34] <wikibugs>	 (03PS3) 10Scott French: hieradata: add remaining "migration" releases [puppet] - 10https://gerrit.wikimedia.org/r/1082865 (https://phabricator.wikimedia.org/T377040)
[22:40:34] <wikibugs>	 (03PS1) 10Scott French: hieradata: switch all "migration" releases to 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1101122 (https://phabricator.wikimedia.org/T377040)
[23:04:28] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:05:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:18:23] <mutante>	 !log clouddumps1001/clouddumps1002: rm /srv/dumps/xmldatadumps/public/other/misc/phabricator_public.dump  - an uncompressed old file from Sep 2023 - normal dumps are gzipped and current
[23:18:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:46] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:29:28] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:49:13] <wikibugs>	 (03PS2) 10Wziko: feat(cfssl-issuer): change default value for external_services in cfssl issuer helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099837