[00:05:29] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:09:55] !log removing 1 file for legal compliance
[00:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1111360
[00:38:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1111360 (owner: TrainBranchBot)
[00:56:44] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1111360 (owner: TrainBranchBot)
[01:08:46] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1111361
[01:08:47] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1111361 (owner: TrainBranchBot)
[01:35:39] (PS5) Scott French: shellbox-syntaxhighlight: 1 codfw replica on 8.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1087581 (https://phabricator.wikimedia.org/T377038)
[01:35:39] (PS5) Scott French: shellbox-syntaxhighlight: all replicas on PHP 8.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1087582 (https://phabricator.wikimedia.org/T377038)
[01:44:50] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1111361 (owner: TrainBranchBot)
[01:45:39] (CR) Bartosz Dziewoński: [C:+1] Yet more authentication domain overrides [mediawiki-config] - https://gerrit.wikimedia.org/r/1111343 (https://phabricator.wikimedia.org/T383729) (owner: Gergő Tisza)
[01:47:09] (CR) Bartosz Dziewoński: [C:+1] "Looks good, assuming that it's intended that you define several of them to the same name `'static'`." [mediawiki-config] - https://gerrit.wikimedia.org/r/1111344 (https://phabricator.wikimedia.org/T383729) (owner: Gergő Tisza)
[01:48:18] (CR) Bartosz Dziewoński: [C:+1] "I see that there's a note about that in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1111345." [mediawiki-config] - https://gerrit.wikimedia.org/r/1111344 (https://phabricator.wikimedia.org/T383729) (owner: Gergő Tisza)
[02:01:21] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/dc93be088aaa4b75dd2c125bf59f25a009e9f771d54e5e2e443062e2577b23a2/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:21:21] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:36:07] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[02:36:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:24:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10460931 (phaultfinder)
[03:30:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:05:29] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:05:55] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:06:45] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53367 bytes in 0.069 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:19:53] (PS1) KartikMistry: Update cxserver to 2025-01-13-044601-production [deployment-charts] - https://gerrit.wikimedia.org/r/1111373 (https://phabricator.wikimedia.org/T382294)
[04:52:52] (PS1) KartikMistry: Update MinT to [deployment-charts] - https://gerrit.wikimedia.org/r/1111374 (https://phabricator.wikimedia.org/T347929)
[04:54:27] (PS2) KartikMistry: Update MinT to 2025-01-07-122638-production [deployment-charts] - https://gerrit.wikimedia.org/r/1111374 (https://phabricator.wikimedia.org/T347929)
[05:03:46] Deploying cxserver/MinT.
[05:03:57] SRE, Infrastructure-Foundations, Mail: Message sizes exceeding limits after migrating from Exim to Postfix - https://phabricator.wikimedia.org/T383271#10460979 (Aklapper)
[05:05:35] (CR) KartikMistry: [C:+2] Update cxserver to 2025-01-13-044601-production [deployment-charts] - https://gerrit.wikimedia.org/r/1111373 (https://phabricator.wikimedia.org/T382294) (owner: KartikMistry)
[05:06:37] (Merged) jenkins-bot: Update cxserver to 2025-01-13-044601-production [deployment-charts] - https://gerrit.wikimedia.org/r/1111373 (https://phabricator.wikimedia.org/T382294) (owner: KartikMistry)
[05:14:53] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:15:19] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:20:04] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:20:33] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:22:07] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:22:41] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:23:23] !log Updated cxserver to 2025-01-13-044601-production (T382294)
[05:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:27] T382294: Use openapi compliant examples in swagger spec - https://phabricator.wikimedia.org/T382294
[05:25:22] (CR) KartikMistry: [C:+2] Update MinT to 2025-01-07-122638-production [deployment-charts] - https://gerrit.wikimedia.org/r/1111374 (https://phabricator.wikimedia.org/T347929) (owner: KartikMistry)
[05:26:28] (Merged) jenkins-bot: Update MinT to 2025-01-07-122638-production [deployment-charts] - https://gerrit.wikimedia.org/r/1111374 (https://phabricator.wikimedia.org/T347929) (owner: KartikMistry)
[05:30:13] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[05:47:53] Anything change recently with peopleweb.discovery.wmnet? Seems MinT can't download models from there.
[05:50:24] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:10:25] SRE-swift-storage, CX-deployments, LPL Essential, MinT: Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10461028 (KartikMistry)
[06:50:17] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[2369-2372].codfw.wmnet
[06:52:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[2369-2372].codfw.wmnet
[06:54:55] (CR) Jelto: [C:+2] Rename mw23[69-72] to wikikube-worker222[0-3] [puppet] - https://gerrit.wikimedia.org/r/1111271 (https://phabricator.wikimedia.org/T377877) (owner: Jelto)
[06:58:46] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2369 to wikikube-worker2220
[06:59:07] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[06:59:11] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin
[06:59:11] status
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250115T0700)
[07:00:49] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitorin
[07:00:49] status
[07:04:31] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2369 to wikikube-worker2220 - jelto@cumin1002"
[07:04:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2369 to wikikube-worker2220 - jelto@cumin1002"
[07:04:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:04:51] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2220
[07:05:31] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2220
[07:06:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2369 to wikikube-worker2220
[07:08:40] FIRING: [3x] KubernetesRsyslogDown: rsyslog on mw2370:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:29:52] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2370 to wikikube-worker2221
[07:30:12] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[07:30:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:36:38] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2370 to wikikube-worker2221 - jelto@cumin1002"
[07:37:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2370 to wikikube-worker2221 - jelto@cumin1002"
[07:37:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:37:26] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2221
[07:37:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2221
[07:38:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2370 to wikikube-worker2221
[07:39:21] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2371 to wikikube-worker2222
[07:39:42] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[07:43:10] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2371 to wikikube-worker2222 - jelto@cumin1002"
[07:45:45] (CR) Michael Große: "Agreed. This can now be deployed whenever." [mediawiki-config] - https://gerrit.wikimedia.org/r/1105420 (https://phabricator.wikimedia.org/T379522) (owner: Michael Große)
[07:49:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2371 to wikikube-worker2222 - jelto@cumin1002"
[07:49:56] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:49:56] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2222
[07:50:23] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2222
[07:51:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2371 to wikikube-worker2222
[07:55:02] !log jelto@cumin1002 START - Cookbook sre.hosts.rename from mw2372 to wikikube-worker2223
[07:55:23] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[07:56:25] (CR) Giuseppe Lavagetto: "What's stopping us from merging this?" [puppet] - https://gerrit.wikimedia.org/r/1007026 (https://phabricator.wikimedia.org/T357595) (owner: RLazarus)
[07:58:48] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2372 to wikikube-worker2223 - jelto@cumin1002"
[07:59:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw2372 to wikikube-worker2223 - jelto@cumin1002"
[07:59:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:59:10] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2223
[07:59:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2223
[07:59:58] (PS3) Giuseppe Lavagetto: Explicitly disable all local imagescaling on k8s [mediawiki-config] - https://gerrit.wikimedia.org/r/987432 (https://phabricator.wikimedia.org/T352515)
[08:00:01] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw2372 to wikikube-worker2223
[08:00:04] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250115T0800). Please do the needful.
[08:00:05] MatmaRex: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:32] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/987432 (https://phabricator.wikimedia.org/T352515) (owner: Giuseppe Lavagetto)
[08:00:35] hi. any deployers around at this unusual hour?
[08:01:59] (Abandoned) Giuseppe Lavagetto: Start using the ClusterConfig class [mediawiki-config] - https://gerrit.wikimedia.org/r/756016 (owner: Giuseppe Lavagetto)
[08:03:13] (Abandoned) Giuseppe Lavagetto: Simplify management of the request time limit [mediawiki-config] - https://gerrit.wikimedia.org/r/749718 (owner: Giuseppe Lavagetto)
[08:03:46] (Abandoned) Giuseppe Lavagetto: Do not use firejail on kubernetes [mediawiki-config] - https://gerrit.wikimedia.org/r/920213 (owner: Giuseppe Lavagetto)
[08:05:29] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:07] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2220.codfw.wmnet wikikube-worker2221.codfw.wmnet wikikube-worker2222.codfw.wmnet wikikube-worker2223.codfw.wmnet on all recursors
[08:08:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2220.codfw.wmnet wikikube-worker2221.codfw.wmnet wikikube-worker2222.codfw.wmnet wikikube-worker2223.codfw.wmnet on all recursors
[08:10:00] !log jelto@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2220.codfw.wmnet
[08:10:16] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2220.codfw.wmnet with OS bullseye
[08:10:26] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2220
[08:10:34] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[08:13:59] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2220 - jelto@cumin1002"
[08:14:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2220 - jelto@cumin1002"
[08:14:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:14:03] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2220.codfw.wmnet 19.48.192.10.in-addr.arpa 9.1.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:14:06] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2220.codfw.wmnet 19.48.192.10.in-addr.arpa 9.1.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:14:07] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2220
[08:14:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2220
[08:14:18] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2220
[08:18:27] !log jelto@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2221.codfw.wmnet
[08:18:45] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2221.codfw.wmnet with OS bullseye
[08:18:55] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2221
[08:19:03] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[08:19:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72049 and previous config saved to /var/cache/conftool/dbconfig/20250115-081922-root.json
[08:20:53] (PS1) Marostegui: instances.yaml: Add es1043 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1111567 (https://phabricator.wikimedia.org/T382569)
[08:21:04] MatmaRex: i'm here!
[08:22:08] oh, hi!
[08:22:28] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2221 - jelto@cumin1002"
[08:22:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2221 - jelto@cumin1002"
[08:22:33] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:22:33] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2221.codfw.wmnet 20.48.192.10.in-addr.arpa 0.2.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:22:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2221.codfw.wmnet 20.48.192.10.in-addr.arpa 0.2.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:22:36] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2221
[08:22:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2221
[08:22:48] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2221
[08:23:04] (CR) Urbanecm: [C:+2] Add license messages for new Wikinews licenses [extensions/WikimediaMessages] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1109756 (https://phabricator.wikimedia.org/T383338) (owner: Bartosz Dziewoński)
[08:23:27] (CR) Marostegui: [C:+2] instances.yaml: Add es1043 to dbctl [puppet] - https://gerrit.wikimedia.org/r/1111567 (https://phabricator.wikimedia.org/T382569) (owner: Marostegui)
[08:23:38] MatmaRex: i guess i need to wait with the license config for the WikimediaMessages backport
[08:24:21] that'd be ideal
[08:25:09] !log jelto@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2222.codfw.wmnet
[08:25:25] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2222.codfw.wmnet with OS bullseye
[08:25:35] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2222
[08:25:43] * MichaelG_WMF is here as well
[08:25:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add es1043 to dbctl depooled T382569', diff saved to https://phabricator.wikimedia.org/P72050 and previous config saved to /var/cache/conftool/dbconfig/20250115-082554-marostegui.json
[08:25:58] T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569
[08:26:04] (PS5) Filippo Giunchedi: thanos-rule: manage retention setting [puppet] - https://gerrit.wikimedia.org/r/1111241 (https://phabricator.wikimedia.org/T352756) (owner: Herron)
[08:26:18] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[08:26:35] (CR) Urbanecm: [C:+2] Enable abusefilter-log-detail for autoconfirmed users on en.wikibooks (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1109736 (https://phabricator.wikimedia.org/T383332) (owner: Dreamrimmer)
[08:26:50] (PS2) Michael Große: Growth: Remove temporary config for clearing link recommendations [mediawiki-config] - https://gerrit.wikimedia.org/r/1105420 (https://phabricator.wikimedia.org/T379522)
[08:26:52] (CR) Urbanecm: [C:+2] Growth: Remove temporary config for clearing link recommendations [mediawiki-config] - https://gerrit.wikimedia.org/r/1105420 (https://phabricator.wikimedia.org/T379522) (owner: Michael Große)
[08:27:18] (Merged) jenkins-bot: Enable abusefilter-log-detail for autoconfirmed users on en.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/1109736 (https://phabricator.wikimedia.org/T383332) (owner: Dreamrimmer)
[08:27:41] (Merged) jenkins-bot: Growth: Remove temporary config for clearing link recommendations [mediawiki-config] - https://gerrit.wikimedia.org/r/1105420 (https://phabricator.wikimedia.org/T379522) (owner: Michael Große)
[08:28:12] (CR) Filippo Giunchedi: "Please see PS4" [puppet] - https://gerrit.wikimedia.org/r/1111241 (https://phabricator.wikimedia.org/T352756) (owner: Herron)
[08:28:13] (PS1) Bartosz Dziewoński: htmlform: fix defaults for namespace and relative in titlesmultiselect [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1111568 (https://phabricator.wikimedia.org/T383133)
[08:28:18] (PS1) Bartosz Dziewoński: htmlform: fix defaults for namespace and relative in titlesmultiselect [core] (wmf/1.44.0-wmf.12) - https://gerrit.wikimedia.org/r/1111569 (https://phabricator.wikimedia.org/T383133)
[08:28:37] (PS6) Filippo Giunchedi: thanos-rule: manage retention setting [puppet] - https://gerrit.wikimedia.org/r/1111241 (https://phabricator.wikimedia.org/T352756) (owner: Herron)
[08:28:45] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1109736|Enable abusefilter-log-detail for autoconfirmed users on en.wikibooks (T383332)]], [[gerrit:1105420|Growth: Remove temporary config for clearing link recommendations (T379522)]]
[08:28:50] T383332: Enable abusefilter-log-detail for autoconfirmed users on en.wikibooks - https://phabricator.wikimedia.org/T383332
[08:28:51] T379522: Switch GETempLinkRecommendationSwitchTagClearHook to true at all wikis - https://phabricator.wikimedia.org/T379522
[08:29:00] just realized those core changes should probably be backported too. if we have the time
[08:29:28] sure
[08:29:40] MatmaRex: do you mind adding them to the calendar, please?
[08:29:43] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2222 - jelto@cumin1002"
[08:29:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2222 - jelto@cumin1002"
[08:29:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:29:47] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2222.codfw.wmnet 21.48.192.10.in-addr.arpa 1.2.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:29:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2222.codfw.wmnet 21.48.192.10.in-addr.arpa 1.2.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:29:50] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2222
[08:30:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2222
[08:30:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2222
[08:30:11] (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4802/co" [puppet] - https://gerrit.wikimedia.org/r/1111241 (https://phabricator.wikimedia.org/T352756) (owner: Herron)
[08:30:27] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2220.codfw.wmnet with reason: host reimage
[08:31:01] !log jelto@cumin1002 START - Cookbook sre.k8s.renumber-node Renumbering for host wikikube-worker2223.codfw.wmnet
[08:31:18] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2223.codfw.wmnet with OS bullseye
[08:31:29] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2223
[08:31:30] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1111568 (https://phabricator.wikimedia.org/T383133) (owner: Bartosz Dziewoński)
[08:31:34] !log jelto@cumin1002 START - Cookbook sre.dns.netbox
[08:31:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.44.0-wmf.12) - https://gerrit.wikimedia.org/r/1111569 (https://phabricator.wikimedia.org/T383133) (owner: Bartosz Dziewoński)
[08:31:58] ops-codfw, SRE, DC-Ops, serviceops: Move kafka-main2010 within the same rack - https://phabricator.wikimedia.org/T381788#10461157 (JMeybohm) >>! In T381788#10458970, @Jhancock.wm wrote: > We are off on the 20th in the US. but the rest of the week is good for me. Sorry, I wasn't aware. What abou...
[08:32:13] MatmaRex: i meant for this window, unless you deliberately want to wait with them for later?
[08:32:24] urbanecm: done (fixed the window)
[08:32:42] ty
[08:32:58] (CR) Urbanecm: [C:+2] htmlform: fix defaults for namespace and relative in titlesmultiselect [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1111568 (https://phabricator.wikimedia.org/T383133) (owner: Bartosz Dziewoński)
[08:32:59] (CR) Urbanecm: [C:+2] htmlform: fix defaults for namespace and relative in titlesmultiselect [core] (wmf/1.44.0-wmf.12) - https://gerrit.wikimedia.org/r/1111569 (https://phabricator.wikimedia.org/T383133) (owner: Bartosz Dziewoński)
[08:33:09] (PS1) Marostegui: production-m5.sql.erb: Add new grants to ipoid_rw [puppet] - https://gerrit.wikimedia.org/r/1111570 (https://phabricator.wikimedia.org/T383753)
[08:33:28] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2220.codfw.wmnet with reason: host reimage
[08:34:14] (CR) Marostegui: "This is a noop as grants need to be added to the DB" [puppet] - https://gerrit.wikimedia.org/r/1111570 (https://phabricator.wikimedia.org/T383753) (owner: Marostegui)
[08:34:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72051 and previous config saved to /var/cache/conftool/dbconfig/20250115-083427-root.json
[08:34:59] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2223 - jelto@cumin1002"
[08:35:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host wikikube-worker2223 - jelto@cumin1002"
[08:35:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:35:04] !log jelto@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker2223.codfw.wmnet 22.48.192.10.in-addr.arpa 2.2.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:35:07] !log jelto@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker2223.codfw.wmnet 22.48.192.10.in-addr.arpa 2.2.0.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[08:35:07] !log jelto@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker2223
[08:35:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker2223
[08:35:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2223
[08:35:45] !log urbanecm@deploy2002 dreamrimmer, urbanecm, migr: Backport for [[gerrit:1109736|Enable abusefilter-log-detail for autoconfirmed users on en.wikibooks (T383332)]], [[gerrit:1105420|Growth: Remove temporary config for clearing link recommendations (T379522)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:35:50] T383332: Enable abusefilter-log-detail for autoconfirmed users on en.wikibooks - https://phabricator.wikimedia.org/T383332
[08:35:50] T379522: Switch GETempLinkRecommendationSwitchTagClearHook to true at all wikis - https://phabricator.wikimedia.org/T379522
[08:36:13] MatmaRex: can you test the first one (r1109736)?
[08:37:04] urbanecm: i can look at Special:UserGroupRights, but not beyond that
[08:37:13] that's fair
[08:37:31] looks good then
[08:37:33] !log urbanecm@deploy2002 dreamrimmer, urbanecm, migr: Continuing with sync
[08:38:37] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2221.codfw.wmnet with reason: host reimage
[08:41:31] Puppet, SRE, Data-Engineering-Radar: modules/udp2log/manifests/instance/monitoring.pp has unreachable code - https://phabricator.wikimedia.org/T152104#10461173 (fgiunchedi) Open→Invalid Manifest doesn't contain unreachable code anymore ` define udp2log::instance::monitoring( $log_dir...
[08:42:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2221.codfw.wmnet with reason: host reimage
[08:44:55] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109736|Enable abusefilter-log-detail for autoconfirmed users on en.wikibooks (T383332)]], [[gerrit:1105420|Growth: Remove temporary config for clearing link recommendations (T379522)]] (duration: 16m 09s)
[08:45:00] T383332: Enable abusefilter-log-detail for autoconfirmed users on en.wikibooks - https://phabricator.wikimedia.org/T383332
[08:45:00] T379522: Switch GETempLinkRecommendationSwitchTagClearHook to true at all wikis - https://phabricator.wikimedia.org/T379522
[08:45:04] okay, first deployment done
[08:45:06] waiting on CI now
[08:45:22] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2222.codfw.wmnet with reason: host reimage
[08:45:27] (CR) Urbanecm: [C:+2] Update French wikinews license to CC-BY-SA 4.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: Dreamrimmer)
[08:46:08] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/WikimediaMessages] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1109756 (https://phabricator.wikimedia.org/T383338) (owner: Bartosz Dziewoński)
[08:46:08] (CR) TrainBranchBot: [C:+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: Dreamrimmer)
[08:46:14] (Merged) jenkins-bot: Update French wikinews license to CC-BY-SA 4.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/1106911 (https://phabricator.wikimedia.org/T381946) (owner: Dreamrimmer)
[08:47:26] (Merged) jenkins-bot: Add license messages for new Wikinews licenses [extensions/WikimediaMessages] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1109756 (https://phabricator.wikimedia.org/T383338) (owner: Bartosz Dziewoński)
[08:48:00] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1109756|Add license messages for new Wikinews licenses (T383338)]], [[gerrit:1106911|Update French wikinews license to CC-BY-SA 4.0 (T381946)]]
[08:48:06] T383338: Check/fix/cleanup licenses on Wikinewses january 2025 - https://phabricator.wikimedia.org/T383338
[08:48:06] T381946: Update license of dewikinews and frwikinews to CC-BY-SA 4.0 by January 1, 2025 - https://phabricator.wikimedia.org/T381946
[08:49:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2222.codfw.wmnet with reason: host reimage
[08:49:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.748s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:49:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72052 and previous config saved to /var/cache/conftool/dbconfig/20250115-084932-root.json
[08:51:15] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2223.codfw.wmnet with reason: host reimage
[08:52:55] (Merged) jenkins-bot: htmlform: fix defaults for namespace and relative in titlesmultiselect [core] (wmf/1.44.0-wmf.11) - https://gerrit.wikimedia.org/r/1111568 (https://phabricator.wikimedia.org/T383133) (owner: Bartosz Dziewoński)
[08:53:03] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp
[08:53:26] !log urbanecm@deploy2002 sync-world aborted: Backport for [[gerrit:1109756|Add license messages for new Wikinews licenses (T383338)]], [[gerrit:1106911|Update French wikinews license to CC-BY-SA 4.0 (T381946)]] (duration: 05m 25s)
[08:53:30] T383338: Check/fix/cleanup licenses on Wikinewses january 2025 - https://phabricator.wikimedia.org/T383338
[08:53:30] T381946: Update license of dewikinews and frwikinews to CC-BY-SA 4.0 by January 1, 2025 - https://phabricator.wikimedia.org/T381946
[08:53:34] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2220.codfw.wmnet with OS bullseye
[08:53:35] (Merged) jenkins-bot: htmlform: fix defaults for namespace and relative in titlesmultiselect [core] (wmf/1.44.0-wmf.12) - https://gerrit.wikimedia.org/r/1111569 (https://phabricator.wikimedia.org/T383133) (owner: Bartosz Dziewoński)
[08:53:52] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2223.codfw.wmnet with reason: host reimage
[08:54:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 1.324s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded -
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:54:23] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1109756|Add license messages for new Wikinews licenses (T383338)]], [[gerrit:1106911|Update French wikinews license to CC-BY-SA 4.0 (T381946)]], [[gerrit:1111568|htmlform: fix defaults for namespace and relative in titlesmultiselect (T383133)]], [[gerrit:1111569|htmlform: fix defaults for namespace and relative in titlesmultiselect (T383133)]] [08:54:28] T383133: Page restrictions menu not being populated correctly - https://phabricator.wikimedia.org/T383133 [08:54:49] !log !log homer cr*codfw* commit 'T377877' [08:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:52] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [08:57:02] !log homer lsw1-d3-codfw* commit 'T377877' [08:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:58] PROBLEM - BGP status on lsw1-d3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:02:00] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 104, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:02:57] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2221.codfw.wmnet with OS bullseye [09:04:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72053 and previous config saved to 
/var/cache/conftool/dbconfig/20250115-090437-root.json [09:07:58] RECOVERY - BGP status on lsw1-d3-codfw.mgmt is OK: BGP OK - up: 26, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:10:30] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2222.codfw.wmnet with OS bullseye [09:10:58] PROBLEM - BGP status on lsw1-d3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:11:00] (03PS1) 10Stevemunene: Kerberos access for Kgraessle [puppet] - 10https://gerrit.wikimedia.org/r/1111575 (https://phabricator.wikimedia.org/T383598) [09:11:13] !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2220.codfw.wmnet [09:12:56] RECOVERY - BGP status on lsw1-d3-codfw.mgmt is OK: BGP OK - up: 26, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:15:09] (03CR) 10Marostegui: [C:03+2] production-m5.sql.erb: Add new grants to ipoid_rw [puppet] - 10https://gerrit.wikimedia.org/r/1111570 (https://phabricator.wikimedia.org/T383753) (owner: 10Marostegui) [09:15:22] urbanecm: it's still in progress, right? 
[09:15:26] MatmaRex: correct [09:15:29] or did i miss it [09:15:30] alright [09:15:37] still in the pre-mwdebug stage [09:15:44] deploying i18n changes takes time :) [09:15:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2223.codfw.wmnet with OS bullseye [09:15:55] !log urbanecm@deploy2002 matmarex, urbanecm, dreamrimmer: Backport for [[gerrit:1109756|Add license messages for new Wikinews licenses (T383338)]], [[gerrit:1106911|Update French wikinews license to CC-BY-SA 4.0 (T381946)]], [[gerrit:1111568|htmlform: fix defaults for namespace and relative in titlesmultiselect (T383133)]], [[gerrit:1111569|htmlform: fix defaults for namespace and relative in titlesmultiselect (T383133)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:16:01] here we go! [09:16:01] T383338: Check/fix/cleanup licenses on Wikinewses january 2025 - https://phabricator.wikimedia.org/T383338 [09:16:02] T381946: Update license of dewikinews and frwikinews to CC-BY-SA 4.0 by January 1, 2025 - https://phabricator.wikimedia.org/T381946 [09:16:02] T383133: Page restrictions menu not being populated correctly - https://phabricator.wikimedia.org/T383133 [09:16:03] MatmaRex: can you test? [09:16:08] (all remaining patches should be there) [09:16:20] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for MSantos - https://phabricator.wikimedia.org/T382616#10461268 (10Bmueller) @Dzahn Approved, thank you! [09:16:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P72054 and previous config saved to /var/cache/conftool/dbconfig/20250115-091622-root.json [09:16:34] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp [09:16:37] yeah.
looking [09:17:26] (03PS1) 10Marostegui: es1043: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1111576 (https://phabricator.wikimedia.org/T382569) [09:18:03] (03CR) 10Marostegui: [C:03+2] es1043: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1111576 (https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui) [09:19:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72055 and previous config saved to /var/cache/conftool/dbconfig/20250115-091943-root.json [09:20:15] urbanecm: things look good [09:20:19] great! proceeding [09:20:20] !log urbanecm@deploy2002 matmarex, urbanecm, dreamrimmer: Continuing with sync [09:21:49] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp [09:24:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10461281 (10phaultfinder) [09:26:31] (03PS1) 10Filippo Giunchedi: kubernetes: enable selecting clusters for deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/1111577 (https://phabricator.wikimedia.org/T383699) [09:26:34] (03PS1) 10Filippo Giunchedi: ci: select k8s staging for deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/1111578 (https://phabricator.wikimedia.org/T383699) [09:28:31] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1110732 (https://phabricator.wikimedia.org/T383201) (owner: 10Slyngshede) [09:28:53] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1111575 (https://phabricator.wikimedia.org/T383598) (owner: 10Stevemunene) [09:30:33] !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2221.codfw.wmnet [09:31:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 2%: Repooling', diff saved 
to https://phabricator.wikimedia.org/P72056 and previous config saved to /var/cache/conftool/dbconfig/20250115-093127-root.json [09:32:23] (03CR) 10Stevemunene: [C:03+2] Kerberos access for Kgraessle [puppet] - 10https://gerrit.wikimedia.org/r/1111575 (https://phabricator.wikimedia.org/T383598) (owner: 10Stevemunene) [09:34:20] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1109756|Add license messages for new Wikinews licenses (T383338)]], [[gerrit:1106911|Update French wikinews license to CC-BY-SA 4.0 (T381946)]], [[gerrit:1111568|htmlform: fix defaults for namespace and relative in titlesmultiselect (T383133)]], [[gerrit:1111569|htmlform: fix defaults for namespace and relative in titlesmultiselect (T383133)]] (duration: 39m 57s) [09:34:26] finally [09:34:26] T383338: Check/fix/cleanup licenses on Wikinewses january 2025 - https://phabricator.wikimedia.org/T383338 [09:34:26] T381946: Update license of dewikinews and frwikinews to CC-BY-SA 4.0 by January 1, 2025 - https://phabricator.wikimedia.org/T381946 [09:34:27] T383133: Page restrictions menu not being populated correctly - https://phabricator.wikimedia.org/T383133 [09:34:28] 39mins [09:34:31] MatmaRex: anything else?
:) [09:34:45] thanks urbanecm [09:36:10] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:36:44] (03CR) 10Hashar: [C:03+1] ci: Install memcached for MediaWiki success cache [puppet] - 10https://gerrit.wikimedia.org/r/1111295 (https://phabricator.wikimedia.org/T383243) (owner: 10Dduvall) [09:37:26] argh [09:37:32] 39 minutes is way tooo lonng [09:38:36] !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2222.codfw.wmnet [09:42:25] (03PS1) 10Gkyziridis: ml-services: update articletopic outlink image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111580 (https://phabricator.wikimedia.org/T383312) [09:43:25] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 16 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1109108 (https://phabricator.wikimedia.org/T382947) (owner: 10Giuseppe Lavagetto) [09:44:55] !log jelto@cumin1002 END (FAIL) - Cookbook sre.k8s.renumber-node (exit_code=99) Renumbering for host wikikube-worker2223.codfw.wmnet [09:45:02] (03CR) 10David Caro: "Thanks very much for this!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [09:45:46] (03CR) 10David Caro: [C:03+1] "LGTM (once we have the others)" [puppet] - 10https://gerrit.wikimedia.org/r/1111340 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [09:46:25] (03CR) 10David Caro: [C:03+1] "/me being unclear" [puppet] - 10https://gerrit.wikimedia.org/r/1111340 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [09:46:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P72057 and previous config saved to /var/cache/conftool/dbconfig/20250115-094632-root.json [09:46:40] hashar: i'm jealously looking at deployments from few years ago, when it took less than a minute [09:46:47] (granting, _not_ when i'm changing i18n) [09:47:13] (03PS2) 10Stevemunene: Add linkeddata.cultureelerfgoed.nl to SPARQL allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1105882 (https://phabricator.wikimedia.org/T381717) [09:47:53] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp [09:49:56] (03PS2) 10Filippo Giunchedi: kubernetes: enable selecting clusters for deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/1111577 (https://phabricator.wikimedia.org/T383699) [09:49:56] (03PS2) 10Filippo Giunchedi: ci: select k8s staging for deployment_server [puppet] - 10https://gerrit.wikimedia.org/r/1111578 (https://phabricator.wikimedia.org/T383699) [09:50:16] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_magru and A:cp [09:51:35] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[2220-2223].codfw.wmnet [09:51:42] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host 
wikikube-worker[2220-2223].codfw.wmnet [09:53:19] (03PS1) 10Muehlenhoff: Add component/amd-gpu-firmware [puppet] - 10https://gerrit.wikimedia.org/r/1111582 (https://phabricator.wikimedia.org/T383557) [09:53:21] (03PS1) 10Muehlenhoff: amd_rocm: Switch to installing from component/amd-gpu-firmware [puppet] - 10https://gerrit.wikimedia.org/r/1111583 (https://phabricator.wikimedia.org/T383557) [09:54:45] !log disabling puppet on 543 nodes using k8s::package resource - T341984 [09:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:49] T341984: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984 [09:54:54] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2220.codfw.wmnet with OS bookworm [09:55:13] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2220 [09:55:13] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2220 [09:57:18] (03CR) 10JMeybohm: [C:03+2] k8s::package: Install version specific kubernetes-client package [puppet] - 10https://gerrit.wikimedia.org/r/1109704 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [09:58:57] PROBLEM - BGP status on lsw1-d3-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:59:07] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2221.codfw.wmnet with OS bookworm [09:59:26] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2221 [09:59:26] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2221 [09:59:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/1111344 (https://phabricator.wikimedia.org/T383729) (owner: 10Gergő Tisza) [09:59:57] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2222.codfw.wmnet with OS bookworm [10:00:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111343 (https://phabricator.wikimedia.org/T383729) (owner: 10Gergő Tisza) [10:00:16] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2222 [10:00:16] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2222 [10:01:00] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2223.codfw.wmnet with OS bookworm [10:01:19] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2223 [10:01:19] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2223 [10:01:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P72058 and previous config saved to /var/cache/conftool/dbconfig/20250115-100138-root.json [10:02:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1025 to eqiad es5 master dbmaint T382569', diff saved to https://phabricator.wikimedia.org/P72059 and previous config saved to /var/cache/conftool/dbconfig/20250115-100207-marostegui.json [10:02:11] T382569: Productionize es104[1-6] - https://phabricator.wikimedia.org/T382569 [10:02:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1024 T382569', diff saved to https://phabricator.wikimedia.org/P72060 and previous config saved to /var/cache/conftool/dbconfig/20250115-100228-marostegui.json [10:02:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 
days, 0:00:00 on es1024.eqiad.wmnet with reason: cloning [10:02:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on es1024.eqiad.wmnet with reason: cloning [10:04:20] (03CR) 10Máté Szabó: [C:03+1] urldownloader: scrub outbound privacy-sensitive hdrs [puppet] - 10https://gerrit.wikimedia.org/r/1111265 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [10:04:47] (03PS1) 10Marostegui: mariadb: Productionize es1045 [puppet] - 10https://gerrit.wikimedia.org/r/1111584 (https://phabricator.wikimedia.org/T382569) [10:05:08] (03CR) 10CI reject: [V:04-1] mariadb: Productionize es1045 [puppet] - 10https://gerrit.wikimedia.org/r/1111584 (https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui) [10:05:58] (03PS2) 10Marostegui: mariadb: Productionize es1045 [puppet] - 10https://gerrit.wikimedia.org/r/1111584 (https://phabricator.wikimedia.org/T382569) [10:06:19] (03CR) 10CI reject: [V:04-1] mariadb: Productionize es1045 [puppet] - 10https://gerrit.wikimedia.org/r/1111584 (https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui) [10:06:53] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:08:02] !log re-enabling puppet on nodes using k8s::package resource - T341984 [10:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:06] T341984: Update Kubernetes clusters to >1.25 - https://phabricator.wikimedia.org/T341984 [10:08:08] (03PS3) 10Marostegui: mariadb: Productionize es1045 [puppet] - 10https://gerrit.wikimedia.org/r/1111584 (https://phabricator.wikimedia.org/T382569) [10:09:03] (03CR) 10Marostegui: [C:03+2] mariadb: Productionize es1045 [puppet] - 10https://gerrit.wikimedia.org/r/1111584 (https://phabricator.wikimedia.org/T382569) (owner: 10Marostegui) [10:11:47] PROBLEM - Host 
restbase2037 is DOWN: PING CRITICAL - Packet loss = 100% [10:11:53] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:12:08] (03PS9) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) [10:12:29] (03PS1) 10Gergő Tisza: Enable SUL3 on test wikis, second attempt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111585 (https://phabricator.wikimedia.org/T383729) [10:12:39] (03CR) 10JMeybohm: Update staging-codfw to kubernetes 1.31, calico 3.29 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:13:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, January 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111585 (https://phabricator.wikimedia.org/T383729) (owner: 10Gergő Tisza) [10:13:29] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2220.codfw.wmnet with reason: host reimage [10:13:58] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_magru and A:cp [10:14:22] FIRING: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:14:42] (03PS6) 10Bartosz Dziewoński: Replace favicon.php with static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) [10:14:47] RECOVERY - Host restbase2037 is UP: PING OK - Packet loss = 0%, 
RTA = 30.27 ms [10:16:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2220.codfw.wmnet with reason: host reimage [10:16:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P72061 and previous config saved to /var/cache/conftool/dbconfig/20250115-101643-root.json [10:16:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:17:09] (03CR) 10Slyngshede: [C:03+2] Provide additional information about users [software/bitu] - 10https://gerrit.wikimedia.org/r/1110732 (https://phabricator.wikimedia.org/T383201) (owner: 10Slyngshede) [10:17:35] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2221.codfw.wmnet with reason: host reimage [10:18:07] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2222.codfw.wmnet with reason: host reimage [10:19:11] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2223.codfw.wmnet with reason: host reimage [10:21:24] (03Merged) 10jenkins-bot: Provide additional information about users [software/bitu] - 10https://gerrit.wikimedia.org/r/1110732 (https://phabricator.wikimedia.org/T383201) (owner: 10Slyngshede) [10:21:45] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2221.codfw.wmnet with reason: host reimage [10:21:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.codfw.wmnet:4101 - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [10:22:08] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.193 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:22:17] RESOLVED: [6x] ProbeDown: Service restbase2037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:22:55] (03PS5) 10Klausman: admin/data: Add user for Georgios Kyziridis (ML Team) [puppet] - 10https://gerrit.wikimedia.org/r/1109414 [10:22:55] (03CR) 10Klausman: [V:03+1] "Sorry for the wide approvers list." [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (owner: 10Klausman) [10:22:56] PROBLEM - Juniper alarms on cr2-codfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.193 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [10:23:46] RECOVERY - Juniper alarms on cr2-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [10:24:00] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 113, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:25:00] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2223.codfw.wmnet with reason: host reimage [10:31:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72062 and previous config saved to /var/cache/conftool/dbconfig/20250115-103149-root.json [10:32:19] !log jelto@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2222.codfw.wmnet with reason: host reimage [10:32:27] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [10:35:10] (03PS1) 10Jelto: sre.k8s.renumber-node: change default os to bookworm [cookbooks] - 10https://gerrit.wikimedia.org/r/1111588 (https://phabricator.wikimedia.org/T341984) [10:36:20] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2220.codfw.wmnet with OS bookworm [10:41:30] (03CR) 10FNegri: [C:03+2] Revert "Block PAWS workers nodes from all UDP traffic other than DNS & NTP" [puppet] - 10https://gerrit.wikimedia.org/r/1105036 (https://phabricator.wikimedia.org/T383261) (owner: 10FNegri) [10:42:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2221.codfw.wmnet with OS bookworm [10:44:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2223.codfw.wmnet with OS bookworm [10:46:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72063 and previous config saved to /var/cache/conftool/dbconfig/20250115-104654-root.json [10:52:50] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2222.codfw.wmnet with OS bookworm [10:53:04] RECOVERY - BGP status on lsw1-d3-codfw.mgmt is OK: BGP OK - up: 26, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:59:48] (03CR) 10Gkyziridis: "Thank you for working on this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1109414 (owner: 10Klausman) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250115T1100) [11:00:17] (03PS1) 10Marostegui: dbproxy2007.yaml: Replace m3 master [puppet] - 10https://gerrit.wikimedia.org/r/1111589 (https://phabricator.wikimedia.org/T373579) [11:02:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72064 and previous config saved to /var/cache/conftool/dbconfig/20250115-110159-root.json [11:03:05] (03CR) 10Marostegui: "root@cumin1002:~# host 10.192.31.6" [puppet] - 10https://gerrit.wikimedia.org/r/1111589 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [11:03:15] (03CR) 10JMeybohm: [C:03+1] shellbox-video: scale down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1109459 (https://phabricator.wikimedia.org/T383317) (owner: 10Hnowlan) [11:04:15] (03CR) 10Marostegui: "# db-mysql db2234 -e "show databases"" [puppet] - 10https://gerrit.wikimedia.org/r/1111589 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [11:08:38] (03CR) 10Jcrespo: [C:03+1] dbproxy2007.yaml: Replace m3 master [puppet] - 10https://gerrit.wikimedia.org/r/1111589 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [11:09:31] (03CR) 10Marostegui: [C:03+2] dbproxy2007.yaml: Replace m3 master [puppet] - 10https://gerrit.wikimedia.org/r/1111589 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [11:12:51] (03PS1) 10Marostegui: db2234: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1111590 (https://phabricator.wikimedia.org/T373579) [11:13:38] (03CR) 10Marostegui: [C:03+2] db2234: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1111590 (https://phabricator.wikimedia.org/T373579) (owner: 10Marostegui) [11:14:31] (03CR) 10Gkyziridis: [C:03+1] admin/data: Add user for Georgios 
Kyziridis (ML Team) [puppet] - 10https://gerrit.wikimedia.org/r/1109414 (owner: 10Klausman) [11:15:14] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2220-2223].codfw.wmnet [11:15:17] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2220-2223].codfw.wmnet [11:16:26] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383764 (10Jelto) 03NEW [11:17:02] (03CR) 10FNegri: "Thanks for this patch! I left a couple comments inline." [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [11:17:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72066 and previous config saved to /var/cache/conftool/dbconfig/20250115-111704-root.json [11:18:38] (03PS1) 10Filippo Giunchedi: hieradata: move k8s-mlstaging to new port [puppet] - 10https://gerrit.wikimedia.org/r/1111593 (https://phabricator.wikimedia.org/T383223) [11:19:20] (03CR) 10Bartosz Dziewoński: "They are URLs, but they are also paths to files in this repository. `wmfStaticParsePath` is for paths to files in `mediawiki/core`. 
I thin" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) (owner: 10Bartosz Dziewoński) [11:20:09] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1108772 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [11:21:52] (03CR) 10Tiziano Fogli: [C:03+1] hieradata: move k8s-mlstaging to new port [puppet] - 10https://gerrit.wikimedia.org/r/1111593 (https://phabricator.wikimedia.org/T383223) (owner: 10Filippo Giunchedi) [11:23:44] (03CR) 10Filippo Giunchedi: "I'd like to go ahead with this, acme-chief is the sole user ATM, what do you think Valentin ?" [puppet] - 10https://gerrit.wikimedia.org/r/1074409 (https://phabricator.wikimedia.org/T375271) (owner: 10Filippo Giunchedi) [11:24:34] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10461706 (10phaultfinder) [11:25:43] (03CR) 10Vgutierrez: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1074409 (https://phabricator.wikimedia.org/T375271) (owner: 10Filippo Giunchedi) [11:30:30] FIRING: [2x] ProbeDown: Service wdqs1013:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1013:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:31:38] (03PS7) 10Bartosz Dziewoński: Replace favicon.php with static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) [11:32:08] (03CR) 10Filippo Giunchedi: [C:04-1] "See inline, also thanks for this !" 
[alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [11:32:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72067 and previous config saved to /var/cache/conftool/dbconfig/20250115-113210-root.json [11:34:50] (03PS1) 10Jelto: Rename mw23[59|66|67|68] to wikikube-worker222[4-7] [puppet] - 10https://gerrit.wikimedia.org/r/1111597 (https://phabricator.wikimedia.org/T377877) [11:38:05] (03CR) 10Filippo Giunchedi: [C:04-1] "See inline, thank you for tackling this!" [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [11:38:35] (03PS3) 10Filippo Giunchedi: uwsgi: remove icinga-based monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1074409 (https://phabricator.wikimedia.org/T375271) [11:42:22] (03PS1) 10Bartosz Dziewoński: Move Beta Cluster favicons to this repository [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111598 [11:46:00] (03PS7) 10Cathal Mooney: Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) [11:47:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:51:11] (03PS1) 10KartikMistry: Update cxserver to 2025-01-15-103159-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111601 (https://phabricator.wikimedia.org/T377966) [11:53:04] Quick deploying of cxserver, shouldn't take much time. 
[11:53:42] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2025-01-15-103159-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111601 (https://phabricator.wikimedia.org/T377966) (owner: 10KartikMistry) [11:54:45] (03CR) 10Volans: [C:03+1] "LGTM, let's make sure to test it after deploy" [puppet] - 10https://gerrit.wikimedia.org/r/1091755 (https://phabricator.wikimedia.org/T379570) (owner: 10FNegri) [11:54:50] (03Merged) 10jenkins-bot: Update cxserver to 2025-01-15-103159-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111601 (https://phabricator.wikimedia.org/T377966) (owner: 10KartikMistry) [11:56:00] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [11:56:23] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [11:57:15] (03CR) 10Volans: [C:03+1] "Great, LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) (owner: 10Cathal Mooney) [11:57:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:58:14] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [11:58:43] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [11:59:10] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [11:59:43] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [11:59:49] 06SRE, 06Infrastructure-Foundations, 10netops: WMF RIPE Atlas probe in Eqiad offline - https://phabricator.wikimedia.org/T382518#10461757 (10cmooney) >>! 
In T382518#10455949, @VRiley-WMF wrote: > This has been rebooted > > @cmooney would you be able to check this when you have a chance? Thanks for doing th... [12:00:05] mvolz: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250115T1200). [12:02:09] !log Updated cxserver to 2025-01-15-103159-production (T377966) [12:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:13] T377966: Make cxserver Logstash logs readable and reliable - https://phabricator.wikimedia.org/T377966 [12:02:30] (03PS1) 10Stevemunene: Add FactGrid to WDQS allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1111605 (https://phabricator.wikimedia.org/T381649) [12:02:32] (03PS1) 10Stevemunene: Add api.finto.fi/sparql to Wikidata query service and WCQS whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1111606 (https://phabricator.wikimedia.org/T378561) [12:02:33] (03PS1) 10Stevemunene: whitelist kg.kunsten.be on wikidata query service [puppet] - 10https://gerrit.wikimedia.org/r/1111607 (https://phabricator.wikimedia.org/T380984) [12:05:29] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:34] (03PS8) 10Cathal Mooney: Spicerack: find true physical int if server primary IP is on a bridge [software/spicerack] - 10https://gerrit.wikimedia.org/r/1109037 (https://phabricator.wikimedia.org/T383207) [12:08:03] (03CR) 10Jelto: [C:03+1] "lgtm now" [puppet] - 10https://gerrit.wikimedia.org/r/1110813 (https://phabricator.wikimedia.org/T341984) (owner: 10JMeybohm) [12:09:09] (03CR) 10Gergő Tisza: "Hm, you are right. Maybe something to do with how the entry point is accessed via a symlink under `docroot/`? 
Although in theory `__DIR__`" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1075211 (https://phabricator.wikimedia.org/T374997) (owner: 10Bartosz Dziewoński) [15:22:02] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2225.codfw.wmnet with OS bookworm [15:22:42] (03PS3) 10Muehlenhoff: Make maps-test2001 a bookworm maps master node (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) [15:23:36] !log homer 'lsw1-d3-codfw*' commit 'T377877' [15:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:40] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [15:24:24] (03Merged) 10jenkins-bot: Upstream release v9.1.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1111644 (owner: 10Volans) [15:25:07] (03Abandoned) 10Hnowlan: similar-users: make max queries per account configurable [deployment-charts] - 10https://gerrit.wikimedia.org/r/808923 (https://phabricator.wikimedia.org/T310646) (owner: 10Hnowlan) [15:25:49] !log homer 'lsw1-c6-codfw*' commit 'T377877' [15:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:25] (03CR) 10Btullis: [C:03+1] airflow: define pod templates enabling creating Pods from a task [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111619 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [15:26:57] !log homer 'cr*codfw*' commit 'T377877' [15:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:34] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111634 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:27:45] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 96, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:28:52] !log uploaded spicerack_9.1.0 to 
apt.wikimedia.org bullseye-wikimedia [15:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:37] (03CR) 10Brouberol: [C:03+2] airflow: define pod templates enabling creating Pods from a task [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111619 (https://phabricator.wikimedia.org/T383430) (owner: 10Brouberol) [15:30:05] !log jelto@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2192.codfw.wmnet with OS bookworm [15:30:09] !log jelto@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker2192 [15:30:09] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker2192 [15:30:33] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[2224-2227].codfw.wmnet [15:30:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[2224-2227].codfw.wmnet [15:31:12] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383764#10462745 (10Jelto) [15:31:42] FIRING: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:32:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:33:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:35:24] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10462767 (10Jhancock.wm) this is what they sent me Steps on how to generate the SOS report: The 'sos' package provides the sos report command, which is typically installed by... 
[15:35:50] (03PS1) 10Jelto: Rename mw235[4-7] to wikikube-worker22[28-31] [puppet] - 10https://gerrit.wikimedia.org/r/1111646 (https://phabricator.wikimedia.org/T377877) [15:36:21] (03PS1) 10Muehlenhoff: Remove profile::java from maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/1111647 (https://phabricator.wikimedia.org/T381565) [15:36:53] (03PS1) 10Kamila Součková: kubernetes: rename mw142[1-5] -> kubernetes-worker110[2-6] [puppet] - 10https://gerrit.wikimedia.org/r/1111648 (https://phabricator.wikimedia.org/T377876) [15:36:54] !log installing python-django security updates [15:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:13] (03CR) 10Jelto: [C:03+1] "lgtm 👍" [puppet] - 10https://gerrit.wikimedia.org/r/1111648 (https://phabricator.wikimedia.org/T377876) (owner: 10Kamila Součková) [15:40:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111647 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:40:53] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host mw[1421-1425].eqiad.wmnet [15:41:05] (03CR) 10Kamila Součková: [C:03+2] kubernetes: rename mw142[1-5] -> kubernetes-worker110[2-6] [puppet] - 10https://gerrit.wikimedia.org/r/1111648 (https://phabricator.wikimedia.org/T377876) (owner: 10Kamila Součková) [15:41:55] (03CR) 10Elukey: [C:03+1] "I was confused at first, but I see that profile::java it is not even imported in the maps role:" [puppet] - 10https://gerrit.wikimedia.org/r/1111647 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [15:43:19] !log root@cumin1002 START - Cookbook sre.puppet.renew-cert for dbprov1004.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [15:43:41] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host mw[1421-1425].eqiad.wmnet [15:46:07] !log root@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for 
dbprov1004.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [15:46:07] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1421 to wikikube-worker1102 [15:46:17] !log root@cumin1002 START - Cookbook sre.puppet.renew-cert for dbprov1005.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [15:46:27] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:46:34] (03PS1) 10Marostegui: mariadb: Remove db2130 [puppet] - 10https://gerrit.wikimedia.org/r/1111650 (https://phabricator.wikimedia.org/T383766) [15:46:42] RESOLVED: JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:47:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2130.codfw.wmnet [15:47:29] (03CR) 10Marostegui: [C:03+2] mariadb: Remove db2130 [puppet] - 10https://gerrit.wikimedia.org/r/1111650 (https://phabricator.wikimedia.org/T383766) (owner: 10Marostegui) [15:48:53] !log root@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for dbprov1005.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [15:49:57] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv [15:49:57] e - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:50:04] marostegui: caught your db2130 removal in netbox cookbook, proceeding [15:50:07] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - 
kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv [15:50:07] e - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:50:22] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2192.codfw.wmnet with reason: host reimage [15:51:08] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1421 to wikikube-worker1102 - kamila@cumin1002" [15:51:46] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1421 to wikikube-worker1102 - kamila@cumin1002" [15:51:46] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:51:46] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1102 [15:51:55] !log marostegui@cumin1002 START - Cookbook sre.dns.netbox [15:52:25] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1422 to wikikube-worker1103 [15:53:11] !log installing libsoup2.4 security updates [15:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:17] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1102 [15:53:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1421 to wikikube-worker1102 [15:54:04] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2192.codfw.wmnet with reason: host reimage [15:54:13] !log 
marostegui@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:54:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2130.codfw.wmnet [15:54:19] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:56:33] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10462908 (10MatthewVernon) Hi, yes, those are Red-Hat specific instructions. On Debian & Ubuntu one has to install the sosreport package. Unfortunately, the root filesystem is no... [15:56:43] (03PS10) 10Tiziano Fogli: thanos-rule: manage retention setting [puppet] - 10https://gerrit.wikimedia.org/r/1111599 (https://phabricator.wikimedia.org/T352756) [15:57:11] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1423 to wikikube-worker1104 [15:58:06] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1422 to wikikube-worker1103 - kamila@cumin1002" [15:58:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1422 to wikikube-worker1103 - kamila@cumin1002" [15:58:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:58:27] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1103 [15:58:46] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [15:59:04] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2130.codfw.wmnet - https://phabricator.wikimedia.org/T383766#10462916 (10Marostegui) a:05Marostegui→03None [15:59:40] FIRING: [2x] KubernetesRsyslogDown: rsyslog on mw1424:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:59:42] !log 
kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1103 [15:59:51] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov1006.eqiad.wmnet with reason: os upgrade [16:00:06] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov1006.eqiad.wmnet with reason: os upgrade [16:00:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1422 to wikikube-worker1103 [16:00:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd2004-dev.codfw.wmnet with OS bullseye [16:00:39] 10ops-codfw, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcephosd2004-dev - https://phabricator.wikimedia.org/T378825#10462926 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudcephosd2004-dev.codfw.wmnet with OS bullsey... [16:01:08] 10ops-codfw, 06DBA, 06DC-Ops, 10decommission-hardware: decommission db2130.codfw.wmnet - https://phabricator.wikimedia.org/T383766#10462928 (10Marostegui) This is ready for #dc-ops [16:01:40] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1424 to wikikube-worker1105 [16:02:17] (03CR) 10Tiziano Fogli: "Thank you for the hints. Have a look at the comments to see if they're clear enough." 
[puppet] - 10https://gerrit.wikimedia.org/r/1111599 (https://phabricator.wikimedia.org/T352756) (owner: 10Tiziano Fogli) [16:02:29] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1423 to wikikube-worker1104 - kamila@cumin1002" [16:02:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1423 to wikikube-worker1104 - kamila@cumin1002" [16:02:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:02:49] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1104 [16:02:51] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:04:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1104 [16:04:45] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1423 to wikikube-worker1104 [16:05:08] (03CR) 10Volans: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/1111236 (owner: 10Ssingh) [16:05:29] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:06:57] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1424 to wikikube-worker1105 - kamila@cumin1002" [16:07:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1424 to wikikube-worker1105 - kamila@cumin1002" [16:07:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [16:07:02] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1105 [16:08:27] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1105 [16:08:41] (03PS1) 10Marostegui: rebuild_tables.sh: Add start and finish time [software] - 10https://gerrit.wikimedia.org/r/1111653 (https://phabricator.wikimedia.org/T382842) [16:08:47] !log kamila@cumin1002 START - Cookbook sre.hosts.rename from mw1425 to wikikube-worker1106 [16:09:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from mw1424 to wikikube-worker1105 [16:09:07] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [16:09:53] (03CR) 10Marostegui: [C:03+2] rebuild_tables.sh: Add start and finish time [software] - 10https://gerrit.wikimedia.org/r/1111653 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [16:10:22] (03Merged) 10jenkins-bot: rebuild_tables.sh: Add start and finish time [software] - 10https://gerrit.wikimedia.org/r/1111653 (https://phabricator.wikimedia.org/T382842) (owner: 10Marostegui) [16:11:16] (03CR) 10Muehlenhoff: "Yeah, this is just a leftover Hiera config I noticed when preparing a separate maps_bookworm role" [puppet] - 10https://gerrit.wikimedia.org/r/1111647 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:11:17] (03CR) 10Muehlenhoff: [C:03+2] Remove profile::java from maps hosts [puppet] - 10https://gerrit.wikimedia.org/r/1111647 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:11:53] (03CR) 10Kevin Bazira: "thank you for working on this, Georgios. 
we shall proceed with this patch after we've fixed the issue of CI/CD not building the new predic" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111580 (https://phabricator.wikimedia.org/T383312) (owner: 10Gkyziridis) [16:13:50] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1425 to wikikube-worker1106 - kamila@cumin1002" [16:14:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming mw1425 to wikikube-worker1106 - kamila@cumin1002" [16:14:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:23] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1106 [16:14:42] !log jelto@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jelto@cumin1002" [16:15:10] !log jelto@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jelto@cumin1002" [16:15:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2192.codfw.wmnet with OS bookworm [16:16:04] !log volans@cumin2002 START - Cookbook sre.dns.netbox [16:16:22] !log volans@cumin2002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [16:16:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1106 [16:17:25] !log homer 'lsw1-d8-codfw*' commit 'T377877' [16:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:28] T377877: Migrate wikikube-codfw to containerd - https://phabricator.wikimedia.org/T377877 [16:17:33] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.rename 
(exit_code=0) from mw1425 to wikikube-worker1106 [16:17:39] !log kamila@cumin1002 START - Cookbook sre.dns.wipe-cache wikikube-worker1102.eqiad.wmnet wikikube-worker1103.eqiad.wmnet wikikube-worker1104.eqiad.wmnet wikikube-worker1105.eqiad.wmnet wikikube-worker1106.eqiad.wmnet on all recursors [16:17:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) wikikube-worker1102.eqiad.wmnet wikikube-worker1103.eqiad.wmnet wikikube-worker1104.eqiad.wmnet wikikube-worker1105.eqiad.wmnet wikikube-worker1106.eqiad.wmnet on all recursors [16:18:35] (03PS1) 10Volans: tests: accept unowned as a valid owner [cookbooks] - 10https://gerrit.wikimedia.org/r/1111655 [16:19:02] (03CR) 10Volans: [C:03+2] "merging to unblock cookbook patches" [cookbooks] - 10https://gerrit.wikimedia.org/r/1111655 (owner: 10Volans) [16:20:01] !log jelto@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2192.codfw.wmnet [16:20:03] !log jelto@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2192.codfw.wmnet [16:20:41] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1102.eqiad.wmnet with OS bookworm [16:20:44] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1102 [16:20:44] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1102 [16:20:51] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1103.eqiad.wmnet with OS bookworm [16:20:54] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1103 [16:20:54] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1103 [16:20:58] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1104.eqiad.wmnet with OS bookworm [16:21:01] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host 
wikikube-worker1104 [16:21:01] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1104 [16:21:03] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1105.eqiad.wmnet with OS bookworm [16:21:06] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1105 [16:21:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1105 [16:21:08] (03PS1) 10Jcrespo: dbbackups: Migrate db2139 backup generation to db2239 [puppet] - 10https://gerrit.wikimedia.org/r/1111656 (https://phabricator.wikimedia.org/T373579) [16:21:10] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1106.eqiad.wmnet with OS bookworm [16:21:13] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1106 [16:21:14] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1106 [16:22:16] 10ops-codfw, 06SRE, 06DC-Ops, 07Kubernetes: hw troubleshooting: Comm Error: backplane 0 for wikikube-worker2192.codfw.wmnet - https://phabricator.wikimedia.org/T383339#10463043 (10Jelto) Thanks @Jhancock.wm for handling this hardware issue. The host is up and a reimage was successful. I added the hos... [16:23:24] (03PS2) 10Jcrespo: dbbackups: Migrate db2139 backup generation to db2239 [puppet] - 10https://gerrit.wikimedia.org/r/1111656 (https://phabricator.wikimedia.org/T373579) [16:24:20] (03CR) 10Jcrespo: [C:04-1] "Do not merge yet, data is not ready (rebuilding tables)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1111656 (https://phabricator.wikimedia.org/T373579) (owner: 10Jcrespo) [16:24:33] (03PS3) 10Ssingh: sre.dns.admin: update show to use CookbookInitSuccess [cookbooks] - 10https://gerrit.wikimedia.org/r/1111236 [16:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10463065 (10phaultfinder) [16:27:24] (03PS1) 10Muehlenhoff: Add separate maps master/replica roles for the new Bookworm setup (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565) [16:27:45] (03CR) 10CI reject: [V:04-1] Add separate maps master/replica roles for the new Bookworm setup (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [16:29:39] (03CR) 10Herron: [C:03+1] "Appreciate them thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1111599 (https://phabricator.wikimedia.org/T352756) (owner: 10Tiziano Fogli) [16:31:08] (03PS1) 10Volans: sre.hosts.downtime: skip START log to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1111660 [16:31:38] (03PS2) 10Muehlenhoff: Add separate maps master/replica roles for the new Bookworm setup (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565) [16:32:25] (03CR) 10Volans: "Using this cookbook as beta-tester for this new feature as it's almost always a quick one." [cookbooks] - 10https://gerrit.wikimedia.org/r/1111660 (owner: 10Volans) [16:33:51] 10ops-eqiad, 06cloud-services-team, 06DC-Ops: Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10463105 (10Andrew) DC people, is this anything? This same alert has popped up a few times in the last few days. [16:33:51] (03CR) 10Volans: [C:03+1] "Sukhbir, I'll ping you once the new release is deployed to all cumin hosts and this can be merged." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1111236 (owner: 10Ssingh) [16:33:55] (03PS3) 10Muehlenhoff: Add separate maps master/replica roles for the new Bookworm setup [puppet] - 10https://gerrit.wikimedia.org/r/1111659 (https://phabricator.wikimedia.org/T381565) [16:34:12] (03CR) 10Ssingh: "No worries and thanks!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1111236 (owner: 10Ssingh) [16:34:50] (03CR) 10Elukey: [C:03+1] sre.hosts.downtime: skip START log to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1111660 (owner: 10Volans) [16:36:42] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1104.eqiad.wmnet with reason: host reimage [16:36:52] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1105.eqiad.wmnet with reason: host reimage [16:39:41] (03CR) 10DLynch: "I mean, I imagine I would use it, though I'd need to go educate myself a bit first." [puppet] - 10https://gerrit.wikimedia.org/r/1110867 (owner: 10CDanis) [16:40:34] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1104.eqiad.wmnet with reason: host reimage [16:41:43] (03PS3) 10Anzx: Add dso and thq to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111661 (https://phabricator.wikimedia.org/T383785) [16:43:29] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1105.eqiad.wmnet with reason: host reimage [16:44:42] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (Hardware): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10463156 (10fnegri) [16:52:33] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1102.eqiad.wmnet with OS bookworm [16:52:56] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1102.eqiad.wmnet with OS bookworm [16:52:59] !log 
kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1102 [16:52:59] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1102 [16:53:04] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1103.eqiad.wmnet with OS bookworm [16:53:18] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1103.eqiad.wmnet with OS bookworm [16:53:20] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1106.eqiad.wmnet with OS bookworm [16:53:21] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1103 [16:53:22] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1103 [16:53:34] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1106.eqiad.wmnet with OS bookworm [16:53:38] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1106 [16:53:38] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1106 [16:54:31] (03CR) 10Eevans: [C:03+2] cassandra: set target_dev to 4.x (no-op) [puppet] - 10https://gerrit.wikimedia.org/r/1109768 (https://phabricator.wikimedia.org/T380420) (owner: 10Eevans) [16:54:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P72074 and previous config saved to /var/cache/conftool/dbconfig/20250115-165434-root.json [16:54:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111350 (https://phabricator.wikimedia.org/T380846) (owner: 10Chlod Alejandro) [16:55:28] (03PS1) 10Marostegui: db1163: Enable notifications [puppet] 
- 10https://gerrit.wikimedia.org/r/1111664 [16:56:24] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bookworm [16:57:24] (03CR) 10Jsn.sherman: [C:03+1] Increase Nuke max age to 90 days (attempt 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111350 (https://phabricator.wikimedia.org/T380846) (owner: 10Chlod Alejandro) [16:58:33] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [16:58:37] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [16:58:58] (03CR) 10Marostegui: [C:03+2] db1163: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1111664 (owner: 10Marostegui) [16:59:39] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1104.eqiad.wmnet with OS bookworm [17:01:59] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1103.eqiad.wmnet with OS bookworm [17:02:07] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1105.eqiad.wmnet with OS bookworm [17:02:17] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1106.eqiad.wmnet with OS bookworm [17:02:36] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-worker1102.eqiad.wmnet with OS bookworm [17:04:48] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1012.eqiad.wmnet with OS bookworm [17:05:09] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1105.eqiad.wmnet with OS bookworm [17:05:21] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1106.eqiad.wmnet with OS bookworm [17:05:25] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan 
for host wikikube-worker1106 [17:05:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1106 [17:05:35] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1105 [17:05:35] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1105 [17:05:39] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1102.eqiad.wmnet with OS bookworm [17:05:42] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1102 [17:05:42] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1102 [17:05:53] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bookworm [17:06:22] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1103.eqiad.wmnet with OS bookworm [17:06:25] !log kamila@cumin1002 START - Cookbook sre.hosts.move-vlan for host wikikube-worker1103 [17:06:25] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host wikikube-worker1103 [17:06:48] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1111660 (owner: 10Volans) [17:09:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P72075 and previous config saved to /var/cache/conftool/dbconfig/20250115-170940-root.json [17:12:13] (03CR) 10Kamila Součková: [C:03+1] Rename mw235[4-7] to wikikube-worker22[28-31] [puppet] - 10https://gerrit.wikimedia.org/r/1111646 (https://phabricator.wikimedia.org/T377877) (owner: 10Jelto) [17:18:44] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: Upgrading to Cassandra 4.1.7 — T380420 - eevans@cumin1002 [17:18:48] T380420: Upgrade Cassandra clusters to v4.1.7 - https://phabricator.wikimedia.org/T380420 [17:21:07] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1106.eqiad.wmnet with reason: host reimage [17:21:34] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1102.eqiad.wmnet with reason: host reimage [17:21:54] !log root@cumin1002 START - Cookbook sre.puppet.renew-cert for dbprov1006.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [17:21:57] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1105.eqiad.wmnet with reason: host reimage [17:22:09] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1103.eqiad.wmnet with reason: host reimage [17:23:29] (03PS1) 10Hnowlan: wikikube: reimage 5 former jobrunner/videoscaler hosts to workers [puppet] - 10https://gerrit.wikimedia.org/r/1111670 (https://phabricator.wikimedia.org/T354791) [17:24:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1106.eqiad.wmnet with reason: host reimage [17:24:44] !log running `decommission` for 5 codfw jobrunners [17:24:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P72076 and previous config saved to /var/cache/conftool/dbconfig/20250115-172445-root.json [17:25:10] !log root@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for dbprov1006.eqiad.wmnet: Renew puppet certificate - root@cumin1002 [17:26:15] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1012.eqiad.wmnet with OS bookworm [17:26:21] (03PS2) 10Hnowlan: wikikube: reimage 5 former jobrunner/videoscaler hosts to workers [puppet] - 10https://gerrit.wikimedia.org/r/1111670 (https://phabricator.wikimedia.org/T354791) [17:26:34] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bookworm [17:26:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1105.eqiad.wmnet with reason: host reimage [17:30:10] Lucas_WMDE sorry I scheduled the deploy for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1111166 but was sick today, so I wasn't present for the deployment [17:30:19] I'll re-schedule thanks [17:30:32] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1102.eqiad.wmnet with reason: host reimage [17:30:33] ok, I hope you’ll get better! [17:30:38] jouncebot: now [17:30:38] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [17:31:06] (I’d also be up for deploying it now if that’s okay with everyone else) [17:33:13] if you're ok we can do now [17:33:22] should I reschedule? 
[17:34:21] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1103.eqiad.wmnet with reason: host reimage [17:36:36] !log andrew@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1012.eqiad.wmnet with OS bookworm [17:39:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10463552 (10phaultfinder) [17:39:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P72077 and previous config saved to /var/cache/conftool/dbconfig/20250115-173951-root.json [17:41:31] sorry, I didn’t look at the channel for a few minutes [17:41:32] jouncebot: next [17:41:33] In 0 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250115T1800) [17:42:12] let’s try to get it in now [17:42:31] ok [17:42:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111166 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [17:42:48] will there be anything to test on WikimediaDebug for this change? [17:42:59] (I don’t remember if these new stream configs are usually testable or not) [17:43:16] mmm don't know, but I can easily test with a curl on a mwdebug [17:43:18] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1106.eqiad.wmnet with OS bookworm [17:44:33] ok! [17:44:40] while we still have the bare-metal mwdebugs ;) [17:44:49] do I need to do something other than testing? 
[17:44:58] sorry I don't usually deploy these kind of changes [17:45:02] no, I’ll let you know when you can test [17:45:08] (03Merged) 10jenkins-bot: Added new stream config for haproxy_requestctl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111166 (https://phabricator.wikimedia.org/T383392) (owner: 10Fabfur) [17:45:08] I just wanted to check in advance [17:45:16] ack! [17:45:20] since the timing will probably be pretty tight ^^ [17:45:40] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1111166|Added new stream config for haproxy_requestctl (T383392)]] [17:45:43] T383392: Define a schema for analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392 [17:47:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1105.eqiad.wmnet with OS bookworm [17:49:09] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1102.eqiad.wmnet with OS bookworm [17:51:07] !log homer cr*eqiad* commit T377876 [17:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:10] T377876: Migrate wikikube-eqiad to containerd - https://phabricator.wikimedia.org/T377876 [17:52:00] Lucas_WMDE I would say it's working! [17:52:06] thanks a lot! 
[17:52:12] it’s not quite ready for testing yet, in theory :P [17:52:19] but yeah ok it got synced already [17:52:23] scap is just still testing it [17:52:23] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1103.eqiad.wmnet with OS bookworm [17:52:28] ack [17:52:43] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, fabfur: Backport for [[gerrit:1111166|Added new stream config for haproxy_requestctl (T383392)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:52:45] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, fabfur: Continuing with sync [17:52:46] T383392: Define a schema for analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392 [17:53:11] !log kamila@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker[1102-1106].eqiad.wmnet [17:53:13] !log kamila@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker[1102-1106].eqiad.wmnet [17:53:38] !log cmooney@cumin1002 START - Cookbook sre.hosts.dhcp for host cloudcephosd1012.eqiad.wmnet [17:54:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad kubernetes nodes - https://phabricator.wikimedia.org/T383620#10463696 (10kamila) [17:54:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P72079 and previous config saved to /var/cache/conftool/dbconfig/20250115-175456-root.json [17:55:22] (03CR) 10Volans: [C:03+2] sre.hosts.downtime: skip START log to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1111660 (owner: 10Volans) [17:56:26] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host cloudcephosd1012.eqiad.wmnet [17:58:18] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bookworm [17:59:18] !log lucaswerkmeister-wmde@deploy2002 
Finished scap sync-world: Backport for [[gerrit:1111166|Added new stream config for haproxy_requestctl (T383392)]] (duration: 13m 38s) [17:59:21] T383392: Define a schema for analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392 [17:59:37] thanks again Lucas_WMDE [17:59:50] (03CR) 10Ssingh: [C:03+2] sre.dns.admin: update show to use CookbookInitSuccess [cookbooks] - 10https://gerrit.wikimedia.org/r/1111236 (owner: 10Ssingh) [17:59:57] np :) [18:00:01] just in time :D [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250115T1800) [18:00:15] * Lucas_WMDE done deploying [18:00:23] :) [18:02:33] (03Merged) 10jenkins-bot: sre.hosts.downtime: skip START log to SAL [cookbooks] - 10https://gerrit.wikimedia.org/r/1111660 (owner: 10Volans) [18:04:28] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for MSantos - https://phabricator.wikimedia.org/T382616#10463805 (10Dzahn) [18:05:30] 06SRE, 10SRE-tools, 06Infrastructure-Foundations: exception raised for "sre.dns.admin show" - https://phabricator.wikimedia.org/T378039#10463812 (10ssingh) 05Open→03Resolved a:03ssingh This has now been fixed, thanks to @Volans! ` sukhe@cumin1002:~$ sudo cookbook sre.dns.admin show => CURRENT STAT... 
[18:05:32] !log volans@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on sretest1001.eqiad.wmnet with reason: testing cookbook [18:06:02] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 0:05:00 on sretest1002.eqiad.wmnet with reason: testing cookbook [18:08:08] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:08:25] (03CR) 10Dzahn: Add myself to releasers-mediawiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1105952 (https://phabricator.wikimedia.org/T382616) (owner: 10MSantos) [18:09:48] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on sretest1002.eqiad.wmnet with reason: testing cookbook [18:10:29] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:12:12] PROBLEM - Disk space on ms-be2075 is CRITICAL: DISK CRITICAL - /srv/swift-storage/objects10 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be2075&var-datasource=codfw+prometheus/ops [18:12:40] (03PS2) 10Andrea Denisse: wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) [18:13:18] (03CR) 10Andrea Denisse: "Thanks for your review, I sent a new patch." 
[alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [18:13:55] (03CR) 10CI reject: [V:04-1] wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [18:15:03] (03PS3) 10Andrea Denisse: wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) [18:16:20] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance [18:16:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance [18:16:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T371742)', diff saved to https://phabricator.wikimedia.org/P72080 and previous config saved to /var/cache/conftool/dbconfig/20250115-181629-ladsgroup.json [18:16:33] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [18:19:08] (03PS1) 10Volans: sre.network.peering: use CookbookInitSuccess [cookbooks] - 10https://gerrit.wikimedia.org/r/1111677 [18:20:06] (03PS1) 10Ssingh: dns.admin: show descriptive text before calling admin_state [cookbooks] - 10https://gerrit.wikimedia.org/r/1111678 [18:21:47] (03PS4) 10Andrea Denisse: wmcs: Migrate iowait stalling alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111338 (https://phabricator.wikimedia.org/T328502) [18:22:33] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bookworm [18:22:50] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bookworm [18:24:05] (03PS2) 10Dzahn: Add myself to releasers-mediawiki 
[puppet] - 10https://gerrit.wikimedia.org/r/1105952 (https://phabricator.wikimedia.org/T382616) (owner: 10MSantos) [18:24:42] (03CR) 10Dzahn: [C:03+2] Add myself to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/1105952 (https://phabricator.wikimedia.org/T382616) (owner: 10MSantos) [18:24:57] (03CR) 10Dzahn: [C:03+2] "has approval from manager and group owner, amended, merging" [puppet] - 10https://gerrit.wikimedia.org/r/1105952 (https://phabricator.wikimedia.org/T382616) (owner: 10MSantos) [18:27:27] (03PS1) 10BPirkle: RevisionStore: No first revision of non-existing page [core] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111680 (https://phabricator.wikimedia.org/T380677) [18:27:54] (03PS1) 10CDanis: urldownloader: squid_exporter [puppet] - 10https://gerrit.wikimedia.org/r/1111681 [18:28:19] 10SRE-tools, 06Infrastructure-Foundations: Add an ownership field to cookbooks. - https://phabricator.wikimedia.org/T379258#10463938 (10Volans) 05Open→03Resolved This feature is now live and cookbook ownership can be clearly seen when listing cookbooks (`cookbook -l` or `cookbook -lv`) and at the botto... [18:29:54] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Spicerack: allow cookbooks to abort execution from __init__ - https://phabricator.wikimedia.org/T365454#10463940 (10Volans) 05Open→03Resolved This is now live, see the related documentation in https://doc.wikimedia.org/spicerack/master/api/spice... [18:30:24] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to releasers-mediawiki for MSantos - https://phabricator.wikimedia.org/T382616#10463942 (10Dzahn) 05In progress→03Resolved a:05Bmueller→03Dzahn @MSantos After this now had both needed approvals I took the liberty to slightly amen... 
[18:30:32] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655#10463945 (10Volans) 05Open→03Resolved This is now live, see the related documentation in https://doc.wikimedia.org/spicerack/master/api/spicerack.cookboo... [18:32:05] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111681 (owner: 10CDanis) [18:32:16] (03CR) 10Majavah: "The external services that MW talks to (hopefully) use HTTPS, so Squid only sees a CONNECT request and then only the encrypted version of " [puppet] - 10https://gerrit.wikimedia.org/r/1111265 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [18:33:27] 06SRE, 06Infrastructure-Foundations: Spicerack fails to find host physical interface for ganeti nodes - https://phabricator.wikimedia.org/T383207#10463971 (10Volans) 05Open→03Resolved a:03Volans This is now live and works as expected: ` >>> spicerack.netbox_server("ganeti2027").access_vlan 'private1-... 
[18:34:59] (03CR) 10Volans: [C:03+1] "LGTM if you prefer this format for the output :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1111678 (owner: 10Ssingh) [18:35:08] (03CR) 10Dzahn: [C:03+2] ci: Install memcached for MediaWiki success cache [puppet] - 10https://gerrit.wikimedia.org/r/1111295 (https://phabricator.wikimedia.org/T383243) (owner: 10Dduvall) [18:35:44] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.197 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:36:04] (03CR) 10Dzahn: [C:03+2] "it seemed pointless to check a puppet run on the relevant VM in cloud since it was already cherry-picked anyways" [puppet] - 10https://gerrit.wikimedia.org/r/1111295 (https://phabricator.wikimedia.org/T383243) (owner: 10Dduvall) [18:37:40] (03CR) 10Ssingh: [C:03+2] dns.admin: show descriptive text before calling admin_state [cookbooks] - 10https://gerrit.wikimedia.org/r/1111678 (owner: 10Ssingh) [18:38:08] (03CR) 10Dzahn: [C:04-1] "after double checking the yaml that is created for the blackbox checks I am now voting against this and would say we should keep it as is" [puppet] - 10https://gerrit.wikimedia.org/r/1108112 (https://phabricator.wikimedia.org/T382964) (owner: 10AOkoth) [18:39:11] (03CR) 10CDanis: "Yes, of course you are right, most of the traffic is indeed CONNECT -- although there is some plaintext as well, especially to archive.org" [puppet] - 10https://gerrit.wikimedia.org/r/1111265 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [18:40:24] (03CR) 10Dzahn: [C:03+2] gerrit: restore IP addresses in ssh_known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/1111163 (https://phabricator.wikimedia.org/T303828) (owner: 10Hashar) [18:43:27] (03Merged) 10jenkins-bot: dns.admin: show descriptive text before calling admin_state [cookbooks] - 10https://gerrit.wikimedia.org/r/1111678 (owner: 10Ssingh) [18:45:05] (03PS2) 10CDanis: urldownloader: squid_exporter [puppet] - 
10https://gerrit.wikimedia.org/r/1111681 [18:45:20] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1111681 (owner: 10CDanis) [18:47:07] (03PS1) 10Ssingh: Revert "dns.admin: show descriptive text before calling admin_state" [cookbooks] - 10https://gerrit.wikimedia.org/r/1111684 [18:49:50] (03Abandoned) 10Ssingh: Revert "dns.admin: show descriptive text before calling admin_state" [cookbooks] - 10https://gerrit.wikimedia.org/r/1111684 (owner: 10Ssingh) [18:49:55] (03PS1) 10Ssingh: dns.admin: clarify show being called and dump admin_state [cookbooks] - 10https://gerrit.wikimedia.org/r/1111685 [18:50:19] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bookworm [18:51:22] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bookworm [18:53:48] 10ops-codfw, 06SRE, 06DC-Ops: Tracking List: Relocating servers to free up 10G switch space in codfw - https://phabricator.wikimedia.org/T383709#10463997 (10Jhancock.wm) [18:56:57] (03CR) 10Ssingh: [C:03+2] dns.admin: clarify show being called and dump admin_state [cookbooks] - 10https://gerrit.wikimedia.org/r/1111685 (owner: 10Ssingh) [18:57:38] (03CR) 10Majavah: "Huh, I shouldn't have assumed practically everything is encrypted these days then. That approach sounds fine to me." [puppet] - 10https://gerrit.wikimedia.org/r/1111265 (https://phabricator.wikimedia.org/T340552) (owner: 10CDanis) [19:00:00] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw kubernetes nodes - https://phabricator.wikimedia.org/T383764#10463999 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [19:00:05] brennen and dduvall: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250115T1900). 
[19:00:09] o/ [19:01:11] just noticing a blocker that i somehow missed earlier. [19:01:36] doing a backport first here. [19:02:48] (03Merged) 10jenkins-bot: dns.admin: clarify show being called and dump admin_state [cookbooks] - 10https://gerrit.wikimedia.org/r/1111685 (owner: 10Ssingh) [19:03:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy2002 using scap backport" [core] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111680 (https://phabricator.wikimedia.org/T380677) (owner: 10BPirkle) [19:03:50] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Frequent disk resets on ms-be2075 - https://phabricator.wikimedia.org/T382707#10464019 (10Jhancock.wm) thanks for the update and clarification. I've updated the ticket with the added info. maybe they'll quit stalling [19:04:48] (03CR) 10Dzahn: [C:03+2] "gerrit1003.. Ssh::Client/File[/etc/ssh/ssh_known_hosts]/content: content changed .." [puppet] - 10https://gerrit.wikimedia.org/r/1111163 (https://phabricator.wikimedia.org/T303828) (owner: 10Hashar) [19:05:50] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:07:47] !log sukhe@cumin1002 START - Cookbook sre.dns.admin DNS admin: depool site ulsfo [reason: this is a test, not actual depool, no task ID specified] [19:07:54] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.dns.admin (exit_code=99) DNS admin: depool site ulsfo [reason: this is a test, not actual depool, no task ID specified] [19:08:04] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bookworm [19:08:08] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:08:17] (03PS3) 10CDanis: urldownloader: squid_exporter monitoring [puppet] - 
10https://gerrit.wikimedia.org/r/1111681 [19:08:17] (03PS2) 10CDanis: urldownloader: scrub outbound privacy-sensitive hdrs [puppet] - 10https://gerrit.wikimedia.org/r/1111265 (https://phabricator.wikimedia.org/T340552) [19:08:31] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bookworm [19:10:29] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:14:39] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10464034 (10phaultfinder) [19:14:44] (03CR) 10Dzahn: [C:03+1] "ship it?:)" [puppet] - 10https://gerrit.wikimedia.org/r/1077466 (https://phabricator.wikimedia.org/T332220) (owner: 10BCornwall) [19:15:22] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111688 [19:15:22] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1012.eqiad.wmnet with OS bookworm [19:15:41] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bookworm [19:20:17] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@cd03eb7]: Cascading backfill under projectview hourly [19:21:24] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@cd03eb7]: Cascading backfill under projectview hourly (duration: 01m 06s) [19:22:04] (03Merged) 10jenkins-bot: RevisionStore: No first revision of non-existing page [core] (wmf/1.44.0-wmf.12) - 10https://gerrit.wikimedia.org/r/1111680 (https://phabricator.wikimedia.org/T380677) (owner: 10BPirkle) [19:22:29] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: Netbox report network (instance netbox1003) - 
https://phabricator.wikimedia.org/T383303#10464049 (10cmooney) 05Open→03Resolved I've added dummy interfaces in Netbox on all the fr-tech hosts and connected them to the switch ports... [19:22:35] !log brennen@deploy2002 Started scap sync-world: Backport for [[gerrit:1111680|RevisionStore: No first revision of non-existing page (T380677)]] [19:22:40] T380677: Wikimedia\Assert\ParameterAssertionException: Bad value for parameter $page: must represent an existing page - https://phabricator.wikimedia.org/T380677 [19:26:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T371742)', diff saved to https://phabricator.wikimedia.org/P72081 and previous config saved to /var/cache/conftool/dbconfig/20250115-192631-ladsgroup.json [19:26:35] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [19:28:50] (03PS1) 10Andrew Bogott: partman: change recipe for cloudcephosd1012 [puppet] - 10https://gerrit.wikimedia.org/r/1111691 [19:28:59] !log brennen@deploy2002 bpirkle, brennen: Backport for [[gerrit:1111680|RevisionStore: No first revision of non-existing page (T380677)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:29:03] T380677: Wikimedia\Assert\ParameterAssertionException: Bad value for parameter $page: must represent an existing page - https://phabricator.wikimedia.org/T380677 [19:34:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10464085 (10phaultfinder) [19:34:46] !log brennen@deploy2002 bpirkle, brennen: Continuing with sync [19:35:39] (03PS18) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [19:37:19] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [19:39:36] !log brennen@deploy2002 Finished scap sync-world: 
Backport for [[gerrit:1111680|RevisionStore: No first revision of non-existing page (T380677)]] (duration: 17m 00s) [19:39:40] T380677: Wikimedia\Assert\ParameterAssertionException: Bad value for parameter $page: must represent an existing page - https://phabricator.wikimedia.org/T380677 [19:41:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P72082 and previous config saved to /var/cache/conftool/dbconfig/20250115-194138-ladsgroup.json [19:43:07] (03CR) 10Ssingh: alerts: add alert for ferm_mss_cfg Prometheus metric (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [19:43:08] (03PS19) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [19:43:34] (03CR) 10Andrew Bogott: [C:03+2] partman: change recipe for cloudcephosd1012 [puppet] - 10https://gerrit.wikimedia.org/r/1111691 (owner: 10Andrew Bogott) [19:44:20] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [19:44:26] (03PS20) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [19:45:37] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [19:46:40] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bookworm [19:46:57] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bookworm [19:48:18] (03PS21) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [19:49:29] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 
10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [19:49:39] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111693 (https://phabricator.wikimedia.org/T382363) [19:49:41] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111693 (https://phabricator.wikimedia.org/T382363) (owner: 10TrainBranchBot) [19:50:26] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111693 (https://phabricator.wikimedia.org/T382363) (owner: 10TrainBranchBot) [19:53:21] (03PS1) 10CDanis: varnish: x-analytics: Authorization header summary [puppet] - 10https://gerrit.wikimedia.org/r/1111695 [19:54:51] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bookworm [19:55:04] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1012.eqiad.wmnet with OS bookworm [19:56:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P72083 and previous config saved to /var/cache/conftool/dbconfig/20250115-195645-ladsgroup.json [19:57:28] (03PS22) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [19:58:40] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [19:59:43] !log Removing 1 file for legal compliance [19:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:16] (03PS23) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:01:30] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.12 refs T382363 [20:01:34] T382363: 1.44.0-wmf.12 
deployment blockers - https://phabricator.wikimedia.org/T382363 [20:09:03] (03PS5) 10Andrea Denisse: wmcs: Migrate network saturation alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) [20:09:23] (03PS24) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:09:27] (03CR) 10Thcipriani: [C:03+1] add kemayo to deployers [puppet] - 10https://gerrit.wikimedia.org/r/1110867 (owner: 10CDanis) [20:10:18] (03CR) 10CI reject: [V:04-1] wmcs: Migrate network saturation alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [20:10:29] (03CR) 10Thcipriani: [C:03+1] "Happy to pair during one of the backport windows to help get you started! Thanks for volunteering for deploys 😊" [puppet] - 10https://gerrit.wikimedia.org/r/1110867 (owner: 10CDanis) [20:11:15] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:11:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T371742)', diff saved to https://phabricator.wikimedia.org/P72085 and previous config saved to /var/cache/conftool/dbconfig/20250115-201152-ladsgroup.json [20:11:56] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [20:12:53] (03PS6) 10Andrea Denisse: wmcs: Migrate network saturation alerts to the alerts.git repository [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) [20:14:06] (03PS25) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:14:45] 06SRE-OnFire, 10Sustainability (Incident Followup): create a place (whiteboard) where SRE advertises 
current site status / things for awareness - https://phabricator.wikimedia.org/T378038#10464234 (10Dzahn) Since T378039 has been resolved we now have this: ` sukhe@cumin1002:~$ sudo cookbook sre.dns.admin sho... [20:14:55] (03CR) 10Andrea Denisse: "Thank you very much for your review, I sent a new patch." [alerts] - 10https://gerrit.wikimedia.org/r/1111328 (https://phabricator.wikimedia.org/T328502) (owner: 10Andrea Denisse) [20:15:17] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:20:38] !log Removing 2 files for legal compliance [20:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:49] (03PS26) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:23:02] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:25:02] (03PS27) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:26:13] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:26:20] (03PS2) 10CDanis: varnish: x-analytics: Authorization header summary [puppet] - 10https://gerrit.wikimedia.org/r/1111695 [20:27:37] (03PS28) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:28:13] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudcephosd1012.eqiad.wmnet with OS bookworm [20:28:48] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:31:04] !log Removing 1 file for legal compliance [20:31:06] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:33] (03PS29) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:33:45] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:35:36] (03PS30) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:36:50] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:37:59] (03PS31) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:39:09] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:42:34] (03PS1) 10CDanis: conftool: stub out extension configuration [puppet] - 10https://gerrit.wikimedia.org/r/1111703 [20:45:38] (03PS32) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:46:50] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:47:46] (03PS33) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:48:57] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:50:40] (03PS34) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:51:51] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) 
[20:53:16] (03PS35) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:54:28] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [20:54:38] !log Removing 1 file for legal compliance [20:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:22] (03PS36) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [20:58:33] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250115T2100). [21:00:05] No Gerrit patches in the queue for this window AFAICS. [21:00:54] (03PS37) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:02:05] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [21:03:00] Is there a deployer available to deploy a config patch for a security task? since it is for a security task, the patch has CR+1 in phab but is not yet uploaded to gerrit [21:03:43] (03PS38) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:04:54] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [21:05:04] TheresNoTime: maybe you? 
[21:06:52] (03PS39) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:07:08] JJMC89: can do in about 5 minutes — what task? [21:07:50] T383747 - I can upload the patch to gerrit now [21:08:04] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [21:08:08] Please do :) [21:09:16] * TheresNoTime is ready [21:09:26] (03PS40) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:10:39] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [21:10:44] (03PS2) 10JJMC89: do not allow temp users to edit on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111706 (https://phabricator.wikimedia.org/T383747) [21:10:58] TheresNoTime: ^ [21:11:10] ack [21:11:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111706 (https://phabricator.wikimedia.org/T383747) (owner: 10JJMC89) [21:12:36] (03Merged) 10jenkins-bot: do not allow temp users to edit on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111706 (https://phabricator.wikimedia.org/T383747) (owner: 10JJMC89) [21:12:53] (03PS41) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:13:07] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1111706|do not allow temp users to edit on loginwiki (T383747)]] [21:14:04] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [21:14:23] (03PS42) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 
10https://gerrit.wikimedia.org/r/1110843 [21:14:50] JJMC89: will you want to test this? [21:15:12] 10ops-codfw, 06DC-Ops: restbase2037 periodically rebooting(?) - https://phabricator.wikimedia.org/T383820#10464462 (10Eevans) p:05Triage→03Medium [21:15:33] I don't have the debug extension to test [21:15:34] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [21:16:00] (03PS43) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:16:22] https://login.wikimedia.org/wiki/Special:ListGroupRights should reflect - I can check after a full sync [21:16:23] !log roll restarting eventgate-analytics to pick up new stream configuration for haproxy_requestctl [21:16:23] - T383392 [21:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:25] T383392: Define a schema for analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392 [21:16:41] 10ops-codfw, 10Cassandra, 06DC-Ops: restbase2037 periodically rebooting(?)
- https://phabricator.wikimedia.org/T383820#10464469 (10Eevans) [21:16:42] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: sync [21:16:56] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: sync [21:17:03] 10ops-codfw, 10Cassandra, 06DC-Ops: restbase2037 is crashy - https://phabricator.wikimedia.org/T383820#10464471 (10Eevans) [21:17:20] ack, will just sync it then [21:17:39] !log Removing 11 files for legal compliance [21:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:48] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: sync [21:17:53] !log samtar@deploy2002 samtar, jjmc89: Backport for [[gerrit:1111706|do not allow temp users to edit on loginwiki (T383747)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:18:00] !log samtar@deploy2002 samtar, jjmc89: Continuing with sync [21:18:14] 06SRE, 06Traffic: Define a schema for analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392#10464483 (10Ottomata) [21:18:18] 06SRE, 06Traffic: Define a schema for analytics pipeline ingestion - https://phabricator.wikimedia.org/T383392#10464485 (10Ottomata) [21:18:32] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: sync [21:19:22] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: sync [21:19:32] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: sync [21:22:17] (03PS44) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:22:41] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1111706|do not allow temp users to edit on loginwiki (T383747)]] (duration: 09m 34s) [21:22:50] JJMC89: sync'd [21:23:26] TheresNoTime: looks good - thank you [21:23:29] (03CR) 10CI reject: [V:04-1] alerts: add alert for 
ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [21:23:32] np! [21:25:52] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:26:41] (03PS45) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:27:52] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [21:28:53] (03PS46) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:31:46] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T383638#10464550 (10phaultfinder) [21:34:22] 06SRE, 06Infrastructure-Foundations: Console domain and property access request - https://phabricator.wikimedia.org/T381904#10464574 (10Scott_French) Alright, after discussion yesterday with @NBaca-WMF and @LSobanski, I believe the next steps in order to facilitate this involve building a couple of lists of do... 
[21:34:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1093:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1093 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:35:03] (03PS47) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:36:16] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [21:36:57] (03PS48) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:38:10] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [21:38:40] (03PS49) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:39:40] RESOLVED: KubernetesRsyslogDown: rsyslog on wikikube-worker1093:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1093 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:39:52] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [21:40:54] !log Removing 10 files for legal compliance [21:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:11] (03PS50) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:47:23] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [21:53:42] 
(03PS51) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:54:53] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [21:55:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111260 (https://phabricator.wikimedia.org/T381964) (owner: 10Phuedx) [21:56:15] (03PS52) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [21:56:49] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111261 (https://phabricator.wikimedia.org/T381964) (owner: 10Phuedx) [21:57:29] (03CR) 10Clare Ming: [C:03+1] "scheduled for tomorrow's deployment window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111260 (https://phabricator.wikimedia.org/T381964) (owner: 10Phuedx) [21:57:56] (03CR) 10Clare Ming: [C:03+1] "scheduled for tomorrow's deployment window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111261 (https://phabricator.wikimedia.org/T381964) (owner: 10Phuedx) [21:58:25] (03CR) 10Clare Ming: [C:03+1] "will schedule for tomorrow's deployment window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111262 (https://phabricator.wikimedia.org/T381964) (owner: 10Phuedx) [21:58:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111262 (https://phabricator.wikimedia.org/T381964) (owner: 10Phuedx) [22:00:05] 
Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250115T2200) [22:00:09] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [22:00:45] (03PS53) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [22:01:59] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [22:02:30] (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator from 2025-01-08-142250 to 2025-01-15-052609 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111626 (https://phabricator.wikimedia.org/T378785) [22:02:34] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2025-01-08-142250 to 2025-01-15-052609 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111626 (https://phabricator.wikimedia.org/T378785) (owner: 10Jforrester) [22:02:46] (03PS54) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [22:03:51] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2025-01-08-142250 to 2025-01-15-052609 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1111626 (https://phabricator.wikimedia.org/T378785) (owner: 10Jforrester) [22:04:33] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [22:05:21] (03PS55) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 [22:06:35] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [22:06:51] (03PS56) 10CDobbins: alerts: add alert for ferm_mss_cfg Prometheus metric 
[alerts] - 10https://gerrit.wikimedia.org/r/1110843 [22:08:07] (03CR) 10CI reject: [V:04-1] alerts: add alert for ferm_mss_cfg Prometheus metric [alerts] - 10https://gerrit.wikimedia.org/r/1110843 (owner: 10CDobbins) [22:10:37] !log dmartin@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [22:12:17] !log dmartin@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [22:14:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T383383#10464689 (10phaultfinder) [22:15:58] !log dmartin@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [22:17:04] !log dmartin@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [22:17:45] !log dmartin@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [22:18:40] !log dmartin@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [22:24:53] (03PS1) 10Jdlrobson: WebUIClick: Increase sampling rate to 100% for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111715 (https://phabricator.wikimedia.org/T382080) [22:25:27] (03CR) 10C. Scott Ananian: [C:03+1] Turn on Parsoid Read Views on test2wiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111325 (https://phabricator.wikimedia.org/T378645) (owner: 10Subramanya Sastry) [22:25:36] (03PS2) 10Jdlrobson: Web UI actions: Increase sampling rate to 100% for beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111715 (https://phabricator.wikimedia.org/T382080) [22:26:14] PROBLEM - Check if active EventStreams endpoint is delivering messages. on alert1002 is CRITICAL: CRITICAL: No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. 
https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [22:26:16] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers wikikube-worker1280.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1322.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1306.eqiad.wmnet, wikikube-worker1101.eqiad.wmnet, mw1442.eqiad.wmnet, wikikube-worker1281.eqiad.wmnet, wikikube [22:26:16] 036.eqiad.wmnet, wikikube-worker1029.eqiad.wmnet, wikikube-worker1049.eqiad.wmnet, mw1462.eqiad.wmnet, mw1480.eqiad.wmnet, parse1009.eqiad.wmnet, mw1484.eqiad.wmnet, wikikube-worker1016.eqiad.wmnet, wikikube-worker1273.eqiad.wmnet, wikikube-worker1071.eqiad.wmnet, wikikube-worker1279.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, wikikube-worker1307.eqiad.wmnet, mw1488.eqiad.wmnet, wikikube-worker1244.eqiad [22:26:16] wikikube-worker1037.eqiad.wmnet, wikikube-worker1058.eqiad.wmnet, wikikube-worker1261.eqiad.wmnet, mw1466.eqiad.wmnet, mw1483.eqiad.wmnet, wikikube-worker1309.eqiad.wmnet, wikikube-work https://wikitech.wikimedia.org/wiki/PyBal [22:26:50] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventstreams_4892: Servers wikikube-worker1051.eqiad.wmnet, wikikube-worker1291.eqiad.wmnet, wikikube-worker1042.eqiad.wmnet, parse1013.eqiad.wmnet, wikikube-worker1268.eqiad.wmnet, wikikube-worker1304.eqiad.wmnet, wikikube-worker1259.eqiad.wmnet, wikikube-worker1101.eqiad.wmnet, mw1442.eqiad.wmnet, wikikube-worker1050.eqiad.wmnet, wikikube-worker103 [22:26:50] wmnet, wikikube-worker1029.eqiad.wmnet, mw1470.eqiad.wmnet, wikikube-worker1079.eqiad.wmnet, mw1484.eqiad.wmnet, wikikube-worker1260.eqiad.wmnet, wikikube-worker1282.eqiad.wmnet, wikikube-worker1263.eqiad.wmnet, mw1467.eqiad.wmnet, mw1488.eqiad.wmnet, parse1010.eqiad.wmnet, wikikube-worker1287.eqiad.wmnet, 
wikikube-worker1270.eqiad.wmnet, wikikube-worker1244.eqiad.wmnet, wikikube-worker1278.eqiad.wmnet, wikikube-worker1106.eqiad.wmnet, wi [22:26:50] orker1289.eqiad.wmnet, mw1465.eqiad.wmnet, mw1483.eqiad.wmnet, mw1486.eqiad.wmnet, wikikube-worker1309.eqiad.wmnet, wikikube-worker1062.eqiad.wmnet, wikikube-worker1272.eqiad.wmnet, wik https://wikitech.wikimedia.org/wiki/PyBal [22:27:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111715 (https://phabricator.wikimedia.org/T382080) (owner: 10Jdlrobson) [22:27:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1108135 (https://phabricator.wikimedia.org/T332728) (owner: 10Kimberly Sarabia) [22:28:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, January 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1107964 (https://phabricator.wikimedia.org/T376446) (owner: 10Jdlrobson) [22:29:37] (03CR) 10Subramanya Sastry: "I'll schedule this for backport tomorrow." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1111325 (https://phabricator.wikimedia.org/T378645) (owner: 10Subramanya Sastry) [22:30:55] (03PS1) 10Eevans: restbase: new hosts (refresh) restbase104[3-5] [puppet] - 10https://gerrit.wikimedia.org/r/1111717 (https://phabricator.wikimedia.org/T383673) [22:31:02] (03CR) 10Ryan Kemper: [C:03+2] whitelist kg.kunsten.be on wikidata query service [puppet] - 10https://gerrit.wikimedia.org/r/1111607 (https://phabricator.wikimedia.org/T380984) (owner: 10Stevemunene) [22:31:12] (03CR) 10Ryan Kemper: [C:03+2] Add linkeddata.cultureelerfgoed.nl to SPARQL allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1105882 (https://phabricator.wikimedia.org/T381717) (owner: 10Stevemunene) [22:31:15] (03CR) 10Ryan Kemper: [C:03+2] Add FactGrid to WDQS allowlist [puppet] - 10https://gerrit.wikimedia.org/r/1111605 (https://phabricator.wikimedia.org/T381649) (owner: 10Stevemunene) [22:31:17] (03CR) 10Ryan Kemper: [C:03+2] Add api.finto.fi/sparql to Wikidata query service and WCQS whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1111606 (https://phabricator.wikimedia.org/T378561) (owner: 10Stevemunene) [22:34:29] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, and 2 others: Q3:rack/setup/install restbase104[345] - https://phabricator.wikimedia.org/T383673#10464754 (10Eevans) a:05Eevans→03None [22:35:51] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [22:35:52] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.restart (exit_code=99) [22:36:04] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [22:37:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:40:42] FIRING: JobUnavailable: Reduced availability for job thanos-query in 
ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:40:42] (03PS1) 10Ryan Kemper: wdqs: fix kg.kunsten.be URL [puppet] - 10https://gerrit.wikimedia.org/r/1111718 (https://phabricator.wikimedia.org/T380984) [22:41:17] !log ryankemper@cumin2002 END (ERROR) - Cookbook sre.wdqs.restart (exit_code=97) [22:42:10] (03CR) 10Bking: [C:03+1] wdqs: fix kg.kunsten.be URL [puppet] - 10https://gerrit.wikimedia.org/r/1111718 (https://phabricator.wikimedia.org/T380984) (owner: 10Ryan Kemper) [22:42:13] (03CR) 10Ryan Kemper: [C:03+2] wdqs: fix kg.kunsten.be URL [puppet] - 10https://gerrit.wikimedia.org/r/1111718 (https://phabricator.wikimedia.org/T380984) (owner: 10Ryan Kemper) [22:43:23] FIRING: [3x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [22:43:26] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:43:29] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [22:43:36] FIRING: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [22:43:41] FIRING: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [22:43:44] FIRING: [4x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [22:43:50] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [22:44:16] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:45:15] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart [22:45:42] FIRING: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:47:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:48:23] RESOLVED: [7x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [22:48:32] RESOLVED: [3x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [22:48:36] RESOLVED: NELNotReported: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [22:48:41] RESOLVED: NELByCountryNotReported: NEL metrics by country not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryNotReported [22:49:40] !log restarted thanos-query-frontend on titan100[12] [22:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:53:26] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:56:14] RECOVERY - Check if active EventStreams endpoint is delivering messages. on alert1002 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250115T2300) [23:10:29] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:27:30] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [23:27:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T370903)', diff saved to https://phabricator.wikimedia.org/P72087 and previous config saved to /var/cache/conftool/dbconfig/20250115-232737-ladsgroup.json [23:27:41] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [23:36:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T370903)', diff saved to https://phabricator.wikimedia.org/P72088 and previous config saved to /var/cache/conftool/dbconfig/20250115-233617-ladsgroup.json [23:36:21] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 
[23:38:29] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.restart (exit_code=0)
[23:51:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P72089 and previous config saved to /var/cache/conftool/dbconfig/20250115-235123-ladsgroup.json
[23:52:15] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.restart
[23:55:54] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Idle https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status