[00:01:45] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[00:01:45] Deployment linkrecommendation-internal in linkrecommendation at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=linkrecommendation&var-deployment=linkrecommendation-internal - ...
[00:01:45] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[00:05:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:15:25] FIRING: SystemdUnitFailed: user@499.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:16:17] (CR) Tim Starling: "Should be deployed ASAP to avoid breaking the next train, now that the core patch is merged."
[mediawiki-config] - https://gerrit.wikimedia.org/r/1100217 (https://phabricator.wikimedia.org/T33951) (owner: Tim Starling)
[00:19:36] !log Delete previously-started mwscript-k8s instances of revalidateLinkRecommendations.php (T380455)
[00:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:19:40] T380455: Run revalidateLinkRecommendations.php for wikis with more than 25 excluded sections - https://phabricator.wikimedia.org/T380455
[00:19:49] !log mwmaint2002: foreachwikiindblist growthexperiments extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php --all --verbose # T380455
[00:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:45] RESOLVED: KubernetesDeploymentUnavailableReplicas: ...
[00:21:45] Deployment linkrecommendation-internal in linkrecommendation at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=linkrecommendation&var-deployment=linkrecommendation-internal - ...
[00:21:45] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[00:38:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1100878
[00:38:20] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1100878 (owner: TrainBranchBot)
[00:45:49] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:59:59] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1100878 (owner: TrainBranchBot)
[01:04:28] FIRING: [2x] SystemdUnitFailed: mediawiki_job_purge_parsercache_pc4.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:08:22] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1100879
[01:08:22] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1100879 (owner: TrainBranchBot)
[01:15:25] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:15:42] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[01:16:16] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[01:27:47] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1100879 (owner: TrainBranchBot)
[01:32:10] !log on
mwmaint2002: deleting [[MediaWiki:Sitesupport-url]] pages per T379205
[01:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:32:14] T379205: Donate sidebar link consistency (sitesupport-url) - https://phabricator.wikimedia.org/T379205
[01:36:33] ops-eqiad, SRE, DC-Ops: Inbound interface errors - fasw2-c1b-eqiad.mgmt.eqiad - https://phabricator.wikimedia.org/T381543#10385480 (Dzahn)
[01:37:11] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - ps1-b4-eqiad.mgmt.eqiad - https://phabricator.wikimedia.org/T381540#10385481 (Dzahn)
[01:47:23] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[01:47:26] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[01:47:52] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[01:47:55] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[01:48:43] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[01:49:10] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[01:49:38] ops-eqiad, DC-Ops: Inbound interface errors - https://phabricator.wikimedia.org/T381635 (phaultfinder) NEW
[01:49:53] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[01:50:47] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[02:29:49] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:40:42] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:05:42] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:15:25] FIRING: SystemdUnitFailed: user@499.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:04:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:40:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P71622 and previous config saved to /var/cache/conftool/dbconfig/20241206-054010-root.json
[05:41:41] (PS1) Marostegui: instances.yaml: Add es2044 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1100887 (https://phabricator.wikimedia.org/T381259)
[05:42:29] (CR) Marostegui: [C:+2] instances.yaml: Add es2044 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1100887 (https://phabricator.wikimedia.org/T381259) (owner: Marostegui)
[05:43:41] ops-codfw, SRE, Data-Persistence, DC-Ops: es2045 went down: CPU error - https://phabricator.wikimedia.org/T381549#10385626 (Marostegui)
p:Triage→Medium
[05:44:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es2044 to dbctl depooled T381259', diff saved to https://phabricator.wikimedia.org/P71623 and previous config saved to /var/cache/conftool/dbconfig/20241206-054457-marostegui.json
[05:45:01] T381259: Productionize es204[1-6] - https://phabricator.wikimedia.org/T381259
[05:47:09] (PS1) Marostegui: wmnet: Update es4 and es5 CNAME [dns] - https://gerrit.wikimedia.org/r/1100888 (https://phabricator.wikimedia.org/T381259)
[05:48:21] (PS1) Marostegui: es2044: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1100889 (https://phabricator.wikimedia.org/T381259)
[05:49:53] (CR) Marostegui: [C:+2] es2044: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1100889 (https://phabricator.wikimedia.org/T381259) (owner: Marostegui)
[05:50:09] (CR) Marostegui: [C:+2] wmnet: Update es4 and es5 CNAME [dns] - https://gerrit.wikimedia.org/r/1100888 (https://phabricator.wikimedia.org/T381259) (owner: Marostegui)
[05:50:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 1%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71624 and previous config saved to /var/cache/conftool/dbconfig/20241206-055047-root.json
[05:53:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1154.eqiad.wmnet with reason: Alter table
[05:53:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1154.eqiad.wmnet with reason: Alter table
[05:55:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P71625 and previous config saved to /var/cache/conftool/dbconfig/20241206-055516-root.json
[06:00:39] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Local zone files are NOT in sync with
operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:00:57] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:01:01] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:01:09] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:01:21] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:01:41] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:01:43] PROBLEM - check if authdns-update
was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:01:43] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:01:47] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:02:07] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:02:25] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:02:32] (PS1) Marostegui: create_pc_tables.sh: Create table in parsercache [software] - https://gerrit.wikimedia.org/r/1100890 (https://phabricator.wikimedia.org/T378068)
[06:03:01] Fixed the DNS issue
[06:03:23] PROBLEM - check if authdns-update was run after a change
was merged to operations/dns.git on dns5004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 8ba89a6115f0b32932e3987d3086840bf5504502, dns.git is 1a098c0a58f3dbf237834094d3d48f38c9105dc7) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:05:39] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:05:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 5%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71626 and previous config saved to /var/cache/conftool/dbconfig/20241206-060552-root.json
[06:05:55] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:06:01] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:06:09] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:06:21] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:06:35] (PS2) Marostegui: create_pc_tables.sh: Create table in parsercache [software] - https://gerrit.wikimedia.org/r/1100890 (https://phabricator.wikimedia.org/T378068)
[06:06:39] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK:
Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:06:41] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:06:41] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:06:47] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:07:06] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:07:25] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:08:23] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[06:09:47] (CR) Marostegui: [C:+2] create_pc_tables.sh: Create table in parsercache [software] - https://gerrit.wikimedia.org/r/1100890 (https://phabricator.wikimedia.org/T378068) (owner: Marostegui)
[06:10:15] (Merged) jenkins-bot: create_pc_tables.sh: Create table in parsercache [software] - https://gerrit.wikimedia.org/r/1100890 (https://phabricator.wikimedia.org/T378068) (owner: Marostegui)
[06:10:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2023
(re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P71627 and previous config saved to /var/cache/conftool/dbconfig/20241206-061021-root.json
[06:20:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71628 and previous config saved to /var/cache/conftool/dbconfig/20241206-062058-root.json
[06:25:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2023 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P71629 and previous config saved to /var/cache/conftool/dbconfig/20241206-062527-root.json
[06:36:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 25%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71630 and previous config saved to /var/cache/conftool/dbconfig/20241206-063603-root.json
[06:51:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 50%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71631 and previous config saved to /var/cache/conftool/dbconfig/20241206-065109-root.json
[06:52:03] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[06:52:37] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241206T0700)
[07:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:04:47] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[07:05:20] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE
helmfile.d/dse-k8s-services/blunderbuss: apply
[07:05:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:06:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71632 and previous config saved to /var/cache/conftool/dbconfig/20241206-070614-root.json
[07:06:21] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[07:07:21] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[07:19:27] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[07:20:01] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[07:21:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Pooling in production', diff saved to https://phabricator.wikimedia.org/P71633 and previous config saved to /var/cache/conftool/dbconfig/20241206-072120-root.json
[07:36:53] (CR) JMeybohm: [C:+1] Rename mw143[0-5] to wikikube-worker105[2-7] [puppet] - https://gerrit.wikimedia.org/r/1100842 (https://phabricator.wikimedia.org/T377876) (owner: Kamila Součková)
[07:45:55] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538#10385696 (taavi)
[07:46:04] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538#10385697 (taavi) This seems to have caused {T381639}
[08:00:05] Deploy window No deploys all day!
See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241206T0800)
[08:08:25] (CR) Elukey: [C:+1] style: a pass of black on all files [software/spicerack] - https://gerrit.wikimedia.org/r/1100772 (owner: Volans)
[08:11:59] (CR) Elukey: [C:+1] "The change looks good, I do see some changes not related to firewalling in PCC but I have no idea why they are there." [puppet] - https://gerrit.wikimedia.org/r/1100788 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:12:27] (CR) Elukey: [C:+1] "Nevermind, change before this one, got it :)" [puppet] - https://gerrit.wikimedia.org/r/1100788 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:14:16] (CR) Elukey: [C:+1] maps: Remove support for osm2pgsql as OSM engine [puppet] - https://gerrit.wikimedia.org/r/1100784 (https://phabricator.wikimedia.org/T381565) (owner: Muehlenhoff)
[08:15:25] FIRING: SystemdUnitFailed: user@499.service on build2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:16:32] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:17:04] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:17:23] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:17:25] (CR) Hashar: [C:+2] "+1 !! Thanks Esuvat for the review!!"
[software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1100163 (owner: Hashar)
[08:17:32] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:17:59] (Merged) jenkins-bot: Reinstate the banner for the developer survey [software/gerrit] (deploy/wmf/stable-3.10) - https://gerrit.wikimedia.org/r/1100163 (owner: Hashar)
[08:18:10] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:18:19] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:19:36] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538#10385717 (MoritzMuehlenhoff) >>! In T381538#10385696, @taavi wrote: > This seems to have caused {T381639} Sorry for that! I debugged the issue an...
[08:22:49] (CR) Jelto: [C:+1] Rename mw143[0-5] to wikikube-worker105[2-7] [puppet] - https://gerrit.wikimedia.org/r/1100842 (https://phabricator.wikimedia.org/T377876) (owner: Kamila Součková)
[08:26:06] !log hashar@deploy2002 Started deploy [gerrit/gerrit@ac50ebe]: Reinstate the banner for the developer survey
[08:26:17] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@ac50ebe]: Reinstate the banner for the developer survey (duration: 00m 11s)
[08:28:44] (PS1) Elukey: TEST: dump bios changes to be applied [cookbooks] - https://gerrit.wikimedia.org/r/1100996
[08:28:53] SRE, collaboration-services, Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10385736 (LSobanski) p:Triage→High
[08:29:23] SRE, collaboration-services, Wikimedia-Mailing-lists: Message content lost when mailing list is the only recipient - https://phabricator.wikimedia.org/T377045#10385737 (LSobanski) a:Dzahn
[08:30:45] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:30:58] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:33:16] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:33:41] !log uploaded ruby-sys-filesystem 1.4.3-1~wmf11u1 to component/puppet7 for Bullseye (needed by the mountpoints fact in facter 4) T381538
[08:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:45] T381538: Backport facter to bullseye - https://phabricator.wikimedia.org/T381538
[08:43:41] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:54:36]
(PS1) Jelto: Rename kubernetes[1033-1034] to wikikube-worker[1052-1053] [puppet] - https://gerrit.wikimedia.org/r/1100998 (https://phabricator.wikimedia.org/T377876)
[08:54:55] (PS1) Muehlenhoff: Install updated ruby-sys-filesystem on bulleye systems running Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1100999 (https://phabricator.wikimedia.org/T381538)
[08:55:56] !log elukey@cumin2002 START - Cookbook sre.hosts.provision for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[08:56:36] (CR) Filippo Giunchedi: [C:+1] blackbox/icmp: deployment sites controlled by input parameter instead of ::site [puppet] - https://gerrit.wikimedia.org/r/1100782 (https://phabricator.wikimedia.org/T381561) (owner: Tiziano Fogli)
[08:57:08] (CR) Filippo Giunchedi: [C:+1] "LGTM, though let's merge on Monday" [puppet] - https://gerrit.wikimedia.org/r/1100838 (https://phabricator.wikimedia.org/T381561) (owner: Tiziano Fogli)
[08:57:21] (CR) Filippo Giunchedi: "LGTM, let's merge on Monday" [puppet] - https://gerrit.wikimedia.org/r/1100839 (https://phabricator.wikimedia.org/T381561) (owner: Tiziano Fogli)
[08:59:41] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1100999 (https://phabricator.wikimedia.org/T381538) (owner: Muehlenhoff)
[08:59:44] (CR) JMeybohm: [C:+1] Rename kubernetes[1033-1034] to wikikube-worker[1052-1053] [puppet] - https://gerrit.wikimedia.org/r/1100998 (https://phabricator.wikimedia.org/T377876) (owner: Jelto)
[09:00:57] !log elukey@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1011.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART
[09:01:33] (PS1) Ilias Sarantopoulos: ml-services: revamp llm model server with aya-8B [deployment-charts] - https://gerrit.wikimedia.org/r/1101000 (https://phabricator.wikimedia.org/T379052)
[09:02:18] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node
depool for host kubernetes[1033-1034].eqiad.wmnet
[09:02:59] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1100999 (https://phabricator.wikimedia.org/T381538) (owner: Muehlenhoff)
[09:03:24] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1033-1034].eqiad.wmnet
[09:04:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:07:07] (CR) Jelto: [C:+2] Rename kubernetes[1033-1034] to wikikube-worker[1052-1053] [puppet] - https://gerrit.wikimedia.org/r/1100998 (https://phabricator.wikimedia.org/T377876) (owner: Jelto)
[09:07:30] ops-eqiad, SRE, DC-Ops, Discovery-Search, Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10385815 (elukey) The error seems to be related to a specific network card: ` PATCH https://10.65.4.200/redfish/v1/Syst...
[09:09:46] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1033 to wikikube-worker1052
[09:09:55] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[09:10:40] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:11:12] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:13:32] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1033 to wikikube-worker1052 - jelto@cumin1002"
[09:13:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1033 to wikikube-worker1052 - jelto@cumin1002"
[09:13:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:13:58] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1052
[09:14:10] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538#10385843 (taavi)
[09:14:14] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Backport facter to bullseye - https://phabricator.wikimedia.org/T381538#10385845 (taavi)
[09:15:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1052
[09:16:07] !log
jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1033 to wikikube-worker1052 [09:16:38] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1034 to wikikube-worker1053 [09:16:58] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [09:18:34] (03CR) 10Filippo Giunchedi: "I just read your audit/comments on the related task, ok to proceed whenever!" [puppet] - 10https://gerrit.wikimedia.org/r/1100839 (https://phabricator.wikimedia.org/T381561) (owner: 10Tiziano Fogli) [09:18:41] (03CR) 10Filippo Giunchedi: [C:03+1] blackbox/tcp: deployment sites controlled by input parameter instead of ::site [puppet] - 10https://gerrit.wikimedia.org/r/1100839 (https://phabricator.wikimedia.org/T381561) (owner: 10Tiziano Fogli) [09:18:48] (03CR) 10Filippo Giunchedi: [C:03+1] "I just read your audit/comments on the related task, ok to proceed whenever!" [puppet] - 10https://gerrit.wikimedia.org/r/1100838 (https://phabricator.wikimedia.org/T381561) (owner: 10Tiziano Fogli) [09:19:27] (03PS2) 10Ilias Sarantopoulos: ml-services: revamp llm model server with aya-8B [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101000 (https://phabricator.wikimedia.org/T379052) [09:20:33] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1034 to wikikube-worker1053 - jelto@cumin1002" [09:21:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1034 to wikikube-worker1053 - jelto@cumin1002" [09:21:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:21:02] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1053 [09:22:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1053 
[09:23:14] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1034 to wikikube-worker1053 [09:24:52] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1052.eqiad.wmnet wikikube-worker1053.eqiad.wmnet on all recursors [09:24:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1052.eqiad.wmnet wikikube-worker1053.eqiad.wmnet on all recursors [09:28:01] (03PS1) 10Brouberol: flink: upgrade to 1.20.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1101004 (https://phabricator.wikimedia.org/T377134) [09:28:06] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1052.eqiad.wmnet with OS bookworm [09:28:31] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1053.eqiad.wmnet with OS bookworm [09:32:10] (03CR) 10Elukey: [C:03+1] Install updated ruby-sys-filesystem on bulleye systems running Puppet 7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100999 (https://phabricator.wikimedia.org/T381538) (owner: 10Muehlenhoff) [09:32:50] (03PS1) 10Filippo Giunchedi: tests: assert page severity and summary match [alerts] - 10https://gerrit.wikimedia.org/r/1101005 [09:33:05] (03PS2) 10Muehlenhoff: Install updated ruby-sys-filesystem on bullseye systems running Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1100999 (https://phabricator.wikimedia.org/T381538) [09:36:59] (03CR) 10Muehlenhoff: [C:03+2] Install updated ruby-sys-filesystem on bullseye systems running Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1100999 (https://phabricator.wikimedia.org/T381538) (owner: 10Muehlenhoff) [09:41:05] (03PS1) 10Filippo Giunchedi: tests: fix alertname whitespace check [alerts] - 10https://gerrit.wikimedia.org/r/1101006 [09:41:49] (03CR) 10Filippo Giunchedi: "CI should have complained on I575c8c5e692 and didn't, thus fix the tests" [alerts] - 10https://gerrit.wikimedia.org/r/1101006 (owner: 
10Filippo Giunchedi) [09:44:07] (03PS3) 10Stang: zhwiki: Allow local securepoll setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) [09:44:50] (03PS1) 10Slyngshede: Django Admin: Disable admin interface in production [software/bitu] - 10https://gerrit.wikimedia.org/r/1101007 (https://phabricator.wikimedia.org/T381637) [09:45:41] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1052.eqiad.wmnet with reason: host reimage [09:46:04] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1053.eqiad.wmnet with reason: host reimage [09:48:53] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1052.eqiad.wmnet with reason: host reimage [09:51:15] (03PS2) 10Abijeet Patro: Translate: Enable message group subscription for 6 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101008 (https://phabricator.wikimedia.org/T372386) [09:52:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101008 (https://phabricator.wikimedia.org/T372386) (owner: 10Abijeet Patro) [09:52:54] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1053.eqiad.wmnet with reason: host reimage [09:53:56] (03PS1) 10Slyngshede: P:idm disable index listings for Bitu media and static content. [puppet] - 10https://gerrit.wikimedia.org/r/1101009 (https://phabricator.wikimedia.org/T381637) [10:04:20] (03CR) 10Muehlenhoff: P:idm disable index listings for Bitu media and static content. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1101009 (https://phabricator.wikimedia.org/T381637) (owner: 10Slyngshede) [10:05:22] (03PS2) 10Slyngshede: P:idm disable index listings for Bitu media and static content. [puppet] - 10https://gerrit.wikimedia.org/r/1101009 (https://phabricator.wikimedia.org/T381637) [10:05:30] (03CR) 10Slyngshede: P:idm disable index listings for Bitu media and static content. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1101009 (https://phabricator.wikimedia.org/T381637) (owner: 10Slyngshede) [10:07:43] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1101009 (https://phabricator.wikimedia.org/T381637) (owner: 10Slyngshede) [10:08:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1052.eqiad.wmnet with OS bookworm [10:11:25] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1053.eqiad.wmnet with OS bookworm [10:11:58] !log homer 'cr*eqiad*' commit 'T377876' [10:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:02] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [10:13:23] (03PS2) 10Brouberol: flink: upgrade to 1.20.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1101004 (https://phabricator.wikimedia.org/T377134) [10:15:22] (03CR) 10Slyngshede: [C:03+2] P:idm disable index listings for Bitu media and static content. [puppet] - 10https://gerrit.wikimedia.org/r/1101009 (https://phabricator.wikimedia.org/T381637) (owner: 10Slyngshede) [10:16:58] (03CR) 10DCausse: [C:03+1] "lgtm, thanks!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1101004 (https://phabricator.wikimedia.org/T377134) (owner: 10Brouberol) [10:21:06] (03CR) 10Brouberol: [C:03+2] flink: upgrade to 1.20.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1101004 (https://phabricator.wikimedia.org/T377134) (owner: 10Brouberol) [10:21:18] (03CR) 10Brouberol: [V:03+2 C:03+2] flink: upgrade to 1.20.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1101004 (https://phabricator.wikimedia.org/T377134) (owner: 10Brouberol) [10:21:32] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:25:23] (03PS1) 10Filippo Giunchedi: tests: validate deploy-tag values [alerts] - 10https://gerrit.wikimedia.org/r/1101019 [10:27:28] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1052-1053].eqiad.wmnet [10:27:29] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1052-1053].eqiad.wmnet [10:28:17] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10385961 (10Jelto) [10:30:58] (03PS1) 10Muehlenhoff: Fix wdqs-all alias [puppet] - 10https://gerrit.wikimedia.org/r/1101020 [10:34:28] (03CR) 10Muehlenhoff: [C:03+2] Fix wdqs-all alias [puppet] - 10https://gerrit.wikimedia.org/r/1101020 (owner: 10Muehlenhoff) [10:34:55] (03CR) 10Muehlenhoff: cumin: add aliases for net-new wdqs services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1100465 (https://phabricator.wikimedia.org/T376150) (owner: 10Bking) [10:35:15] (03PS1) 10Jelto: Rename kubernetes[1035-1036] to wikikube-worker[1054-1055] [puppet] - 10https://gerrit.wikimedia.org/r/1101022 (https://phabricator.wikimedia.org/T377876) [10:37:14] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes[1035-1036] 
to wikikube-worker[1054-1055] [puppet] - 10https://gerrit.wikimedia.org/r/1101022 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [10:39:28] (03CR) 10Jelto: [C:03+2] Rename kubernetes[1035-1036] to wikikube-worker[1054-1055] [puppet] - 10https://gerrit.wikimedia.org/r/1101022 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [10:39:56] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1035-1036].eqiad.wmnet [10:41:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1035-1036].eqiad.wmnet [10:43:36] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1035 to wikikube-worker1054 [10:43:56] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:44:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:44:44] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:47:27] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1035 to wikikube-worker1054 - jelto@cumin1002" [10:47:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1035 to wikikube-worker1054 - jelto@cumin1002" [10:47:49] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:47:49] !log jelto@cumin1002 START - Cookbook 
sre.network.configure-switch-interfaces for host wikikube-worker1054 [10:48:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1054 [10:49:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1035 to wikikube-worker1054 [10:49:55] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1101007 (https://phabricator.wikimedia.org/T381637) (owner: 10Slyngshede) [10:52:41] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2042.codfw.wmnet to cluster codfw and group D [10:53:00] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1036 to wikikube-worker1055 [10:53:20] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [10:53:46] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2042.codfw.wmnet to cluster codfw and group D [10:57:14] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1036 to wikikube-worker1055 - jelto@cumin1002" [10:57:30] (03CR) 10FNegri: [C:03+1] "LGTM! Sorry for not following the "no whitespace" convention, did something break because of the space?" 
[alerts] - 10https://gerrit.wikimedia.org/r/1101006 (owner: 10Filippo Giunchedi) [10:57:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1036 to wikikube-worker1055 - jelto@cumin1002" [10:57:37] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:57:38] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1055 [10:58:44] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1055 [10:58:55] (03CR) 10FNegri: [C:03+1] "Nice one." [alerts] - 10https://gerrit.wikimedia.org/r/1101019 (owner: 10Filippo Giunchedi) [10:59:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1036 to wikikube-worker1055 [11:00:15] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1054.eqiad.wmnet wikikube-worker1055.eqiad.wmnet on all recursors [11:00:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1054.eqiad.wmnet wikikube-worker1055.eqiad.wmnet on all recursors [11:03:11] (03CR) 10Tiziano Fogli: [C:03+1] tests: assert page severity and summary match [alerts] - 10https://gerrit.wikimedia.org/r/1101005 (owner: 10Filippo Giunchedi) [11:03:32] (03CR) 10Filippo Giunchedi: [C:03+2] tests: assert page severity and summary match [alerts] - 10https://gerrit.wikimedia.org/r/1101005 (owner: 10Filippo Giunchedi) [11:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:05:12] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1054.eqiad.wmnet with OS bookworm [11:05:30] !log 
jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1055.eqiad.wmnet with OS bookworm [11:05:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:05:55] (03CR) 10Tiziano Fogli: [C:03+1] tests: validate deploy-tag values [alerts] - 10https://gerrit.wikimedia.org/r/1101019 (owner: 10Filippo Giunchedi) [11:06:34] (03CR) 10Filippo Giunchedi: [C:03+2] "Nothing broke no, it is a naming convention though no automated process relies on it AFAIK" [alerts] - 10https://gerrit.wikimedia.org/r/1101006 (owner: 10Filippo Giunchedi) [11:09:26] (03CR) 10Tiziano Fogli: [C:03+1] tests: fix alertname whitespace check [alerts] - 10https://gerrit.wikimedia.org/r/1101006 (owner: 10Filippo Giunchedi) [11:12:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:13:01] (03CR) 10Filippo Giunchedi: "Thank you for the reviews, since the wmcs alerts are paging I'm holding off until Monday to avoid surprises. Please let me know if you'd l" [alerts] - 10https://gerrit.wikimedia.org/r/1101019 (owner: 10Filippo Giunchedi) [11:22:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:24:09] (03CR) 10FNegri: [C:03+1] "I don't expect any issues, but waiting til Monday sounds good!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1101019 (owner: 10Filippo Giunchedi) [11:26:29] 10ops-codfw, 06SRE, 06DC-Ops: ganeti2042 seems to have a broken CPU? (new Supermicro node) - https://phabricator.wikimedia.org/T378358#10386086 (10MoritzMuehlenhoff) I've readded ganeti2042 to the cluster and moved on VM to the node. I'll report back if there's any issues, otherwise I think you can send back... [11:30:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host build2002.codfw.wmnet with OS bookworm [11:30:22] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Create bookworm-based build host - https://phabricator.wikimedia.org/T379343#10386092 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host build2002.codfw.wmnet with OS bookworm [11:30:33] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1054.eqiad.wmnet with reason: host reimage [11:31:56] (03CR) 10Ladsgroup: [C:03+1] Prepare for migration of the Interwiki extension to core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100217 (https://phabricator.wikimedia.org/T33951) (owner: 10Tim Starling) [11:32:40] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1055.eqiad.wmnet with reason: host reimage [11:34:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1054.eqiad.wmnet with reason: host reimage [11:38:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1055.eqiad.wmnet with reason: host reimage [11:41:02] 06SRE, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652 (10MoritzMuehlenhoff) 03NEW [11:45:12] 06SRE, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386122 
(10MoritzMuehlenhoff) [11:45:42] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10386123 (10MoritzMuehlenhoff) [11:48:16] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on build2002.codfw.wmnet with reason: host reimage [11:48:39] (03PS1) 10Muehlenhoff: Add ganeti1053/1054 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1101031 (https://phabricator.wikimedia.org/T381576) [11:51:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on build2002.codfw.wmnet with reason: host reimage [11:52:55] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1054.eqiad.wmnet with OS bookworm [11:53:25] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti1053/1054 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1101031 (https://phabricator.wikimedia.org/T381576) (owner: 10Muehlenhoff) [11:54:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 13Patch-For-Review: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10386137 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff→03None site.pp has been updated [11:56:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1055.eqiad.wmnet with OS bookworm [11:58:06] !log homer 'cr*eqiad*' commit 'T377876' [11:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:09] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [11:58:45] (03CR) 10Slyngshede: [C:03+2] Django Admin: Disable admin interface in production [software/bitu] - 10https://gerrit.wikimedia.org/r/1101007 (https://phabricator.wikimedia.org/T381637) (owner: 10Slyngshede) [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241206T0800) [12:00:05] eoghan, jelto, arnoldokoth, and mutante: Time to snap out of that daydream and deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241206T1200). [12:01:14] (03Merged) 10jenkins-bot: Django Admin: Disable admin interface in production [software/bitu] - 10https://gerrit.wikimedia.org/r/1101007 (https://phabricator.wikimedia.org/T381637) (owner: 10Slyngshede) [12:09:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host build2002.codfw.wmnet with OS bookworm [12:10:06] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Create bookworm-based build host - https://phabricator.wikimedia.org/T379343#10386230 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host build2002.codfw.wmnet with OS bookworm completed: - build2002 (**PASS**)... [12:15:09] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1054-1055].eqiad.wmnet [12:15:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1054-1055].eqiad.wmnet [12:15:51] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10386261 (10Jelto) [12:18:26] (03PS1) 10Jelto: Rename kubernetes[1037-1038] to wikikube-worker[1056-1057] [puppet] - 10https://gerrit.wikimedia.org/r/1101036 (https://phabricator.wikimedia.org/T377876) [12:21:23] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1009.eqiad.wmnet [12:23:53] (03PS1) 10Hnowlan: mediawiki: pass raw input to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101038 (https://phabricator.wikimedia.org/T371701) [12:30:54] (03CR) 10Clément Goubert: [C:03+1] mediawiki: pass raw input to mercurius [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1101038 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [12:36:19] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:38:16] (03CR) 10JMeybohm: [C:03+1] Rename kubernetes[1037-1038] to wikikube-worker[1056-1057] [puppet] - 10https://gerrit.wikimedia.org/r/1101036 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [12:39:48] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1009.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:40:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1009.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:40:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:40:22] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ganeti1009.eqiad.wmnet [12:40:28] 06SRE, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386359 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1009.eqiad.wmnet` - ganeti1009.eqiad.wmnet (**FAIL... 
[12:43:26] 06SRE, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386373 (10MoritzMuehlenhoff) [12:47:10] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host kubernetes[1037-1038].eqiad.wmnet [12:48:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host kubernetes[1037-1038].eqiad.wmnet [12:50:16] (03CR) 10Hnowlan: [C:03+2] mediawiki: pass raw input to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101038 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [12:52:16] (03Merged) 10jenkins-bot: mediawiki: pass raw input to mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101038 (https://phabricator.wikimedia.org/T371701) (owner: 10Hnowlan) [12:54:27] (03CR) 10Jelto: [C:03+2] Rename kubernetes[1037-1038] to wikikube-worker[1056-1057] [puppet] - 10https://gerrit.wikimedia.org/r/1101036 (https://phabricator.wikimedia.org/T377876) (owner: 10Jelto) [12:56:30] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1037 to wikikube-worker1056 [12:56:50] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [12:57:51] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:57:51] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:58:24] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1016.eqiad.wmnet [13:01:36] !log jelto@cumin1002 START - 
Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1037 to wikikube-worker1056 - jelto@cumin1002" [13:04:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:42] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:10:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1037 to wikikube-worker1056 - jelto@cumin1002" [13:10:35] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:10:35] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1056 [13:11:40] FIRING: KubernetesRsyslogDown: rsyslog on kubernetes1038:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes1038 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:11:46] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:11:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1056 [13:11:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:11:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:11:53] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1016.eqiad.wmnet [13:11:59] 06SRE, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386438 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1016.eqiad.wmnet` - ganeti1016.eqiad.wmnet (**PASS... [13:12:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1037 to wikikube-worker1056 [13:13:10] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from kubernetes1038 to wikikube-worker1057 [13:13:30] !log jelto@cumin1002 START - Cookbook sre.dns.netbox [13:17:13] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1038 to wikikube-worker1057 - jelto@cumin1002" [13:17:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming kubernetes1038 to wikikube-worker1057 - jelto@cumin1002" [13:17:59] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:17:59] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1057 [13:19:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1057 [13:19:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from kubernetes1038 to wikikube-worker1057 [13:21:53] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1017.eqiad.wmnet [13:25:48] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1056.eqiad.wmnet wikikube-worker1057.eqiad.wmnet on all recursors [13:25:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
wikikube-worker1056.eqiad.wmnet wikikube-worker1057.eqiad.wmnet on all recursors [13:28:54] (03CR) 10ZhaoFJx: [C:03+1] "Not sure about the scrutineer, but sysop LGTM :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang) [13:29:16] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1056.eqiad.wmnet with OS bookworm [13:29:36] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1057.eqiad.wmnet with OS bookworm [13:31:00] (03CR) 10Stang: "Referenced T377531#10369860" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100228 (https://phabricator.wikimedia.org/T380020) (owner: 10Stang) [13:31:23] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:32:04] 06SRE, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386486 (10MoritzMuehlenhoff) [13:34:55] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10386500 (10SuzanneWood-WMDE) I signed that : ) [13:35:02] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:35:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:35:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:35:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1017.eqiad.wmnet [13:35:42] 06SRE, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - 
https://phabricator.wikimedia.org/T381652#10386501 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1017.eqiad.wmnet` - ganeti1017.eqiad.wmnet (**PASS... [13:36:19] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1018.eqiad.wmnet [13:43:37] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:46:51] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1056.eqiad.wmnet with reason: host reimage [13:47:08] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1018.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:50:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1056.eqiad.wmnet with reason: host reimage [13:50:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:55:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:55:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1018.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [13:55:39] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [13:55:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1018.eqiad.wmnet [13:55:49] 06SRE, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386552 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1018.eqiad.wmnet` - ganeti1018.eqiad.wmnet (**PASS... [13:56:53] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts ganeti1020.eqiad.wmnet [14:10:12] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1056.eqiad.wmnet with OS bookworm [14:11:16] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [14:12:51] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: es2045 went down: CPU error - https://phabricator.wikimedia.org/T381549#10386587 (10Jhancock.wm) they denied the request. gonna resubmit. didn't see any errors this morning since draining power and reseating the CPU. could you try to get it to fail again... [14:14:10] (03PS2) 10Kamila Součková: Rename mw143[0-5] to wikikube-worker10[58-63] [puppet] - 10https://gerrit.wikimedia.org/r/1100842 (https://phabricator.wikimedia.org/T377876) [14:15:08] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: es2045 went down: CPU error - https://phabricator.wikimedia.org/T381549#10386588 (10Marostegui) I will do it on Monday, as I need to stop another server to clone this one and I don't want to leave it stopped before the weekend. I'll keep you posted. Thank... [14:15:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission an-presto1005.eqiad.wmnet - https://phabricator.wikimedia.org/T381491#10386590 (10BTullis) 05Open→03Declined We have decided to postpone the de-racking, just in case we decide to re-commission these five servers as Hadoop workers,... 
[14:15:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission an-presto1004.eqiad.wmnet - https://phabricator.wikimedia.org/T381490#10386594 (10BTullis) 05Open→03Declined We have decided to postpone the de-racking, just in case we decide to re-commission these five servers as Hadoop workers,... [14:15:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission an-presto1003.eqiad.wmnet - https://phabricator.wikimedia.org/T381489#10386598 (10BTullis) 05Open→03Declined We have decided to postpone the de-racking, just in case we decide to re-commission these five servers as Hadoop workers,... [14:15:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission an-presto1002.eqiad.wmnet - https://phabricator.wikimedia.org/T381488#10386602 (10BTullis) 05Open→03Declined We have decided to postpone the de-racking, just in case we decide to re-commission these five servers as Hadoop workers,... [14:15:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission an-presto1001.eqiad.wmnet - https://phabricator.wikimedia.org/T381487#10386606 (10BTullis) 05Open→03Declined We have decided to postpone the de-racking, just in case we decide to re-commission these five servers as Hadoop workers,... 
[14:17:37] (03PS3) 10Kamila Součková: Rename mw143[0-5] to wikikube-worker10[58-63] [puppet] - 10https://gerrit.wikimedia.org/r/1100842 (https://phabricator.wikimedia.org/T377876) [14:19:29] (03CR) 10Kamila Součková: "re-did this due to new numbers clashes" [puppet] - 10https://gerrit.wikimedia.org/r/1100842 (https://phabricator.wikimedia.org/T377876) (owner: 10Kamila Součková) [14:19:51] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1020.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:20:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ganeti1020.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [14:20:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:20:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ganeti1020.eqiad.wmnet [14:20:28] 06SRE, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386609 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `ganeti1020.eqiad.wmnet` - ganeti1020.eqiad.wmnet (**PASS... 
[14:21:36] 06SRE, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386613 (10MoritzMuehlenhoff) [14:27:04] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: ganeti1009.eqiad.wmnet [14:27:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: ganeti1009.eqiad.wmnet [14:27:16] 06SRE, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386615 (10ops-monitoring-bot) Cookbook cookbooks.sre.debmonitor.remove-hosts run by jmm: for 1 hosts: ganeti1009.eqiad.wmnet [14:29:40] (03PS1) 10Muehlenhoff: Remove site.pp entries of decommed Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1101065 (https://phabricator.wikimedia.org/T381652) [14:29:56] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10386619 (10elukey) @Jclark-ctr if those are not urgent I'd ask you to leave them to me for some tests, I'll ping you when... 
[14:32:15] (03CR) 10Muehlenhoff: [C:03+2] Remove site.pp entries of decommed Ganeti nodes [puppet] - 10https://gerrit.wikimedia.org/r/1101065 (https://phabricator.wikimedia.org/T381652) (owner: 10Muehlenhoff) [14:33:17] 06SRE, 10decommission-hardware, 13Patch-For-Review: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386626 (10MoritzMuehlenhoff) [14:33:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission ganeti1009 / ganeti1016 / ganeti1017 / ganeti1018 / ganeti1020 - https://phabricator.wikimedia.org/T381652#10386627 (10MoritzMuehlenhoff) [14:34:17] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Add ganeti1039 to ganeti1052 and decom ganeti1009 to ganeti1022 - https://phabricator.wikimedia.org/T378921#10386630 (10MoritzMuehlenhoff) 05Open→03Resolved All new servers added, all old server decommissioned and clusters rebalanced. [14:35:47] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T381464#10386635 (10HShaikh) For some clarity. The request is for Chris to be able to eventually run jupyter notebooks. So he is requesting access to the analytics-privatedata-users group in the anal... 
[14:40:42] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:42:27] (03PS1) 10Muehlenhoff: Deprecate system::role for wikireplicas roles [puppet] - 10https://gerrit.wikimedia.org/r/1101068 [14:43:05] (03CR) 10CI reject: [V:04-1] Deprecate system::role for wikireplicas roles [puppet] - 10https://gerrit.wikimedia.org/r/1101068 (owner: 10Muehlenhoff) [14:46:58] (03Abandoned) 10Btullis: Revert "Upgrade the remainder of the cephosd cluster to nftables" [puppet] - 10https://gerrit.wikimedia.org/r/1099669 (https://phabricator.wikimedia.org/T381264) (owner: 10Btullis) [14:49:41] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1057.eqiad.wmnet with OS bookworm [14:50:08] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1057.eqiad.wmnet with OS bookworm [14:51:27] (03PS2) 10Muehlenhoff: Deprecate system::role for wikireplicas roles [puppet] - 10https://gerrit.wikimedia.org/r/1101068 [14:53:14] (03PS1) 10Máté Szabó: dialog: Fix wrong title on Types of unacceptable behavior step [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101069 (https://phabricator.wikimedia.org/T381529) [14:53:37] (03PS1) 10Máté Szabó: dialog: Fix spacing between buttons in the dialog footer [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101070 (https://phabricator.wikimedia.org/T381530) [14:54:09] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101069 (https://phabricator.wikimedia.org/T381529) (owner: 10Máté Szabó) 
[14:54:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101070 (https://phabricator.wikimedia.org/T381530) (owner: 10Máté Szabó) [15:00:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100101 (owner: 10Máté Szabó) [15:02:38] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[1430-1435].eqiad.wmnet [15:03:06] (03CR) 10Kamila Součková: [C:03+2] "proceeding, as the changes after the +1s were trivial" [puppet] - 10https://gerrit.wikimedia.org/r/1100842 (https://phabricator.wikimedia.org/T377876) (owner: 10Kamila Součková) [15:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:05:42] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[1430-1435].eqiad.wmnet [15:08:18] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1430 to wikikube-worker1058 [15:08:39] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:10:45] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1431 to wikikube-worker1059 [15:13:05] !log kamila@cumin1002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1430 to wikikube-worker1058 - kamila@cumin1002" [15:13:38] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1430 to wikikube-worker1058 - kamila@cumin1002" [15:13:39] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:13:39] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1058 [15:14:06] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:14:50] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1058 [15:15:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1430 to wikikube-worker1058 [15:16:32] (03PS2) 10Máté Szabó: dialog: Fix spacing between buttons in the dialog footer [extensions/ReportIncident] (wmf/1.44.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1101070 (https://phabricator.wikimedia.org/T381530) [15:18:03] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1431 to wikikube-worker1059 - kamila@cumin1002" [15:18:04] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1432 to wikikube-worker1060 [15:18:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1431 to wikikube-worker1059 - kamila@cumin1002" [15:18:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:18:08] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1059 [15:18:25] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:18:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on 
mw1433:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:19:16] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1059 [15:19:55] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1431 to wikikube-worker1059 [15:20:42] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1433 to wikikube-worker1061 [15:20:49] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1434 to wikikube-worker1062 [15:20:53] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1435 to wikikube-worker1063 [15:21:28] 10ops-esams, 10ops-magru, 06SRE, 06DC-Ops, 06Traffic: CPU temperature issues in cp hosts - https://phabricator.wikimedia.org/T373993#10386737 (10RobH) >>! In T373993#10385350, @BCornwall wrote: > Some observations: > > * [[ https://grafana.wikimedia.org/goto/_53fKoVHR?orgId=1 | magru has the highest ave... 
[15:22:15] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1432 to wikikube-worker1060 - kamila@cumin1002" [15:22:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1432 to wikikube-worker1060 - kamila@cumin1002" [15:22:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:22:46] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1060 [15:22:47] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:23:52] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1060 [15:24:31] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1432 to wikikube-worker1060 [15:26:51] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1433 to wikikube-worker1061 - kamila@cumin1002" [15:26:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1433 to wikikube-worker1061 - kamila@cumin1002" [15:26:57] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:26:57] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1061 [15:27:33] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:28:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1061 [15:28:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1433 to wikikube-worker1061 [15:29:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[15:29:54] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1063 [15:29:56] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:30:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1063 [15:31:38] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1435 to wikikube-worker1063 [15:32:14] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:32:15] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1062 [15:33:15] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1062 [15:33:53] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1434 to wikikube-worker1062 [15:34:35] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1058.eqiad.wmnet wikikube-worker1059.eqiad.wmnet wikikube-worker1060.eqiad.wmnet wikikube-worker1061.eqiad.wmnet wikikube-worker1062.eqiad.wmnet wikikube-worker1063.eqiad.wmnet on all recursors [15:34:38] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1058.eqiad.wmnet wikikube-worker1059.eqiad.wmnet wikikube-worker1060.eqiad.wmnet wikikube-worker1061.eqiad.wmnet wikikube-worker1062.eqiad.wmnet wikikube-worker1063.eqiad.wmnet on all recursors [15:36:11] !log kamila@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker1058.eqiad.wmnet [15:36:45] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1058.eqiad.wmnet with OS bullseye [15:36:49] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1058.eqiad.wmnet with OS bullseye [15:36:50] !log kamila@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host 
wikikube-worker1058.eqiad.wmnet [15:39:09] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1058.eqiad.wmnet with OS bookworm [15:41:25] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1059.eqiad.wmnet with OS bookworm [15:41:46] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1060.eqiad.wmnet with OS bookworm [15:42:15] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1061.eqiad.wmnet with OS bookworm [15:43:05] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1062.eqiad.wmnet with OS bookworm [15:43:25] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1063.eqiad.wmnet with OS bookworm [15:45:00] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:45:52] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.221 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:50:02] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, and 3 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T381504#10386830 (10kamila) [15:52:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, 10Data-Platform-SRE (2024.11.30 - 2024.12.20): Q2:rack/setup/install cloudelastic101[12] - https://phabricator.wikimedia.org/T378368#10386835 (10elukey) I am reviewing the quote of these nodes to figure out what the item is, afaics it seems a 10G network... 
[15:54:52] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1058.eqiad.wmnet with reason: host reimage [15:57:22] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1060.eqiad.wmnet with reason: host reimage [15:58:12] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1061.eqiad.wmnet with reason: host reimage [15:58:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1058.eqiad.wmnet with reason: host reimage [15:58:55] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1062.eqiad.wmnet with reason: host reimage [15:59:14] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1063.eqiad.wmnet with reason: host reimage [16:01:57] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1060.eqiad.wmnet with reason: host reimage [16:05:33] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1062.eqiad.wmnet with reason: host reimage [16:08:49] (03CR) 10FNegri: [C:03+1] "LGTM, thanks for cleaning this up." 
[puppet] - 10https://gerrit.wikimedia.org/r/1101068 (owner: 10Muehlenhoff) [16:09:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1063.eqiad.wmnet with reason: host reimage [16:10:22] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1057.eqiad.wmnet with OS bookworm [16:11:20] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1057.eqiad.wmnet with OS bookworm [16:12:10] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1061.eqiad.wmnet with reason: host reimage [16:17:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1058.eqiad.wmnet with OS bookworm [16:20:24] (03PS1) 10Hnowlan: mediawiki: add debug flag for mercurius [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101081 (https://phabricator.wikimedia.org/T371701) [16:20:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1060.eqiad.wmnet with OS bookworm [16:24:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1062.eqiad.wmnet with OS bookworm [16:28:09] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1063.eqiad.wmnet with OS bookworm [16:29:05] !log kamila@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1059.eqiad.wmnet with OS bookworm [16:29:37] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1059.eqiad.wmnet with OS bookworm [16:30:14] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1061.eqiad.wmnet with OS bookworm [16:32:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:35:10] (03CR) 10Hnowlan: [C:03+1] maps: Remove support for osm2pgsql as OSM engine [puppet] - 10https://gerrit.wikimedia.org/r/1100784 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:37:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST ipamblocks) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:40:47] (03PS1) 10Btullis: Add ahoelzl to analytics-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1101082 (https://phabricator.wikimedia.org/T345959) [16:43:09] (03CR) 10Btullis: [C:03+2] Add ahoelzl to analytics-admins group [puppet] - 10https://gerrit.wikimedia.org/r/1101082 (https://phabricator.wikimedia.org/T345959) (owner: 10Btullis) [16:45:45] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1059.eqiad.wmnet with reason: host reimage [16:47:28] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1101083 [16:48:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1059.eqiad.wmnet with reason: host reimage [16:48:58] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is CRITICAL: 1.121e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [16:50:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Comm Error: backplane 0 when reimaging wikikube-worker1057 - 
https://phabricator.wikimedia.org/T381676 (10Jelto) 03NEW [17:04:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:08:12] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1059.eqiad.wmnet with OS bookworm [17:08:33] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1058-1063].eqiad.wmnet [17:08:36] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1058-1063].eqiad.wmnet [17:25:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:26:07] ^ I will get a lot of @sukhe for this but this should be a paging alert [17:26:39] a widespread puppet failure that's simply a scroll in the alerts channel is not enough [17:27:48] (03PS1) 10Andrea Denisse: ldap: Grant access to the wmf group for cpetrillo [puppet] - 10https://gerrit.wikimedia.org/r/1101090 (https://phabricator.wikimedia.org/T381464) [17:28:05] sukhe: that is probably me, looking [17:28:46] jhathaway: I am not fully sure, I think there are some network issues at play that are unrelated to you [17:28:50] topranks: ^ [17:29:10] !log splitting codfw -> eqsin traffic over path via ulsfo as direct link is saturated [17:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:27] could be I was running puppet with batches of 75 in codfw, perhaps that was too much [17:29:29] (03CR) 10Andrea Denisse: [C:03+2] ldap: Grant access to the wmf group for cpetrillo [puppet] - 
10https://gerrit.wikimedia.org/r/1101090 (https://phabricator.wikimedia.org/T381464) (owner: 10Andrea Denisse) [17:30:03] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to for - https://phabricator.wikimedia.org/T381464#10387234 (10andrea.denisse) I've added cpetrillo to the `wmf` group and to the `WMF-NDA` Phabricator group. Please reopen the task if there's anything else I can assist w... [17:30:13] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to for  - https://phabricator.wikimedia.org/T381464#10387235 (10andrea.denisse) 05In progress→03Resolved [17:31:35] !log jelto@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1057.eqiad.wmnet with OS bookworm [17:40:26] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101092 [17:44:42] (03CR) 10JHathaway: [C:03+1] "mostly harmless, but it can give an alert that too many facts are being persisted to puppetdb" [puppet] - 10https://gerrit.wikimedia.org/r/1099748 (https://phabricator.wikimedia.org/T381293) (owner: 10Andrew Bogott) [17:45:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:53:46] (03PS4) 10Ottomata: mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) [17:54:27] (03CR) 10CI reject: [V:04-1] mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [17:55:13] (03PS5) 10Ottomata: mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) [17:55:31] (03CR) 10Scott French: "Thanks for the review, Hugh!" [puppet] - 10https://gerrit.wikimedia.org/r/1084247 (owner: 10Scott French) [17:55:39] (03PS6) 10Ottomata: mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) [17:56:20] (03CR) 10CI reject: [V:04-1] mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [17:57:34] (03PS7) 10Ottomata: mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) [18:07:09] (03CR) 10Bking: [C:03+2] partman: add recipe for UEFI 4-disk SW RAID-10 [puppet] - 10https://gerrit.wikimedia.org/r/1099740 (https://phabricator.wikimedia.org/T373519) (owner: 10Bking) [18:10:05] (03PS1) 10JHathaway: hadoop: sort local-dirs [puppet] - 10https://gerrit.wikimedia.org/r/1101093 (https://phabricator.wikimedia.org/T381538) [18:10:22] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1101093 (https://phabricator.wikimedia.org/T381538) (owner: 10JHathaway) [18:17:19] (03CR) 10Jdlrobson: Enable Empty search A/B test on beta cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1100869 (https://phabricator.wikimedia.org/T378115) (owner: 10Jdlrobson) [18:18:16] (03PS1) 10Jdlrobson: Fixes A/B test for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1101094 (https://phabricator.wikimedia.org/T378115) [18:21:19] (03PS1) 10Bking: wdqs1025: Configure partitions for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1101095 (https://phabricator.wikimedia.org/T378030) [18:33:30] 
(03PS1) 10AntiCompositeNumber: entrypoint.sh: use full thumbor path [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101097 [18:34:57] (03CR) 10Btullis: [C:03+1] "Interesting." [puppet] - 10https://gerrit.wikimedia.org/r/1101095 (https://phabricator.wikimedia.org/T378030) (owner: 10Bking) [18:37:24] (03CR) 10Hnowlan: [C:03+1] entrypoint.sh: use full thumbor path [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1101097 (owner: 10AntiCompositeNumber) [18:42:18] (03PS2) 10Herron: pyrra: onboard wdqs-availability [puppet] - 10https://gerrit.wikimedia.org/r/1101083 (https://phabricator.wikimedia.org/T302995) [18:42:18] (03CR) 10Herron: [C:03+2] "self merge for initial onboarding" [puppet] - 10https://gerrit.wikimedia.org/r/1101083 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [18:42:28] 06SRE, 10LDAP-Access-Requests: Grant Access to for  - https://phabricator.wikimedia.org/T381464#10387466 (10Dzahn) >>! In T381464#10386635, @HShaikh wrote: > For some clarity. The request is for Chris to be able to eventually run jupyter notebooks. > So he is requesting access to th... 
[18:48:27] (03CR) 10Bking: [C:03+2] wdqs1025: Configure partitions for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1101095 (https://phabricator.wikimedia.org/T378030) (owner: 10Bking) [18:53:02] (03PS1) 10Herron: pyrra: switch wdqs-availability ratio type [puppet] - 10https://gerrit.wikimedia.org/r/1101099 (https://phabricator.wikimedia.org/T302995) [18:54:08] (03PS2) 10Herron: pyrra: switch wdqs-availability ratio type [puppet] - 10https://gerrit.wikimedia.org/r/1101099 (https://phabricator.wikimedia.org/T302995) [18:55:25] uefi [18:56:57] (03CR) 10Herron: [C:03+2] pyrra: switch wdqs-availability ratio type [puppet] - 10https://gerrit.wikimedia.org/r/1101099 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [19:00:50] !log bking@cumin2002 START - Cookbook sre.hosts.provision for host wdqs1025.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:01:46] 06SRE, 06Editing-team, 10MediaWiki-Debug-Logger, 10observability, and 5 others: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10387562 (10Urbanecm_WMF) a:03Urbanecm_WMF Thanks @Michael. I think the best course of action is to revert that change, as it is making... [19:02:16] 06SRE, 06Editing-team, 10MediaWiki-Debug-Logger, 10observability, and 5 others: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10387566 (10Urbanecm_WMF) @kharlan @catrope As the engineers who made the original change, CCing you, in case you have any concerns with... 
[19:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:05:00] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1025.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [19:05:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:22:28] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 207, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:22:34] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:26:02] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1002 is OK: (C)1e+05 gt (W)1e+04 gt 8392 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [19:40:16] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1025.eqiad.wmnet with OS bullseye [19:40:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10387631 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for 
host wdqs1025.eqiad.wmnet with OS bullseye [20:08:30] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1025.eqiad.wmnet with reason: host reimage [20:12:02] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1025.eqiad.wmnet with reason: host reimage [20:29:01] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1025.eqiad.wmnet with OS bullseye [20:29:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06Discovery-Search, and 4 others: Q2:rack/setup/install wdqs102[567] - https://phabricator.wikimedia.org/T378030#10387829 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1025.eqiad.wmnet with OS bullseye completed: - wdqs... [20:35:27] (03CR) 10Krinkle: mediawiki.org/beacon/event/index.php - use EventLoggingLegacyConverter::submitEvent (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1063222 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [21:04:28] FIRING: SystemdUnitFailed: ifup@eno12399np0.service on wikikube-worker1290:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:32] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for SuzanneWood-WMDE - https://phabricator.wikimedia.org/T380487#10387912 (10KFrancis) Waiting on legal counsel to counter sign. I just pinged him. 
[21:07:38] (03PS1) 10Herron: pyrra: wdqs-availability invert query [puppet] - 10https://gerrit.wikimedia.org/r/1101113 (https://phabricator.wikimedia.org/T302995) [21:07:53] (03CR) 10CI reject: [V:04-1] pyrra: wdqs-availability invert query [puppet] - 10https://gerrit.wikimedia.org/r/1101113 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [21:08:27] (03PS2) 10Herron: pyrra: wdqs-availability invert query [puppet] - 10https://gerrit.wikimedia.org/r/1101113 (https://phabricator.wikimedia.org/T302995) [21:13:48] (03CR) 10Herron: [C:03+2] "self merge sorting out pyrra onboarding" [puppet] - 10https://gerrit.wikimedia.org/r/1101113 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [21:17:45] (03PS1) 10Herron: add pyrra note for wdqs-availability [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1101114 (https://phabricator.wikimedia.org/T302995) [21:18:02] (03CR) 10Herron: [V:03+2 C:03+2] add pyrra note for wdqs-availability [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1101114 (https://phabricator.wikimedia.org/T302995) (owner: 10Herron) [22:33:16] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [22:33:50] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [22:40:08] (03PS1) 10Scott French: mw-(apt-ext|api-int|jobrunner|parsoid|web): set php.version to 8.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1101121 (https://phabricator.wikimedia.org/T377040) [22:40:33] (03PS4) 10Scott French: hieradata: add "migration" release of mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/1081451 (https://phabricator.wikimedia.org/T377040) [22:40:34] (03PS3) 10Scott French: hieradata: add remaining "migration" releases [puppet] - 10https://gerrit.wikimedia.org/r/1082865 (https://phabricator.wikimedia.org/T377040) [22:40:34] (03PS1) 10Scott French: hieradata: switch all "migration" releases to 8.1 [puppet] - 
10https://gerrit.wikimedia.org/r/1101122 (https://phabricator.wikimedia.org/T377040) [23:04:28] FIRING: [4x] SystemdUnitFailed: load-dcatap-weekly.service on wdqs2026:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:05:42] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:18:23] !log clouddumps1001/clouddumps1002: rm /srv/dumps/xmldatadumps/public/other/misc/phabricator_public.dump - an uncompressed old file from Sep 2023 - normal dumps are gzipped and current [23:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:46] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:29:28] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:49:13] (03PS2) 10Wziko: feat(cfssl-issuer): change default value for external_services in cfssl issuer helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1099837