[00:38:31] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962226
[00:38:34] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962226 (owner: TrainBranchBot)
[00:42:14] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:42] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:08] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:13] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/962226 (owner: TrainBranchBot)
[01:10:24] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:11:44] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:16:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:10] SRE-Sprint-Week-Sustainability-March2023, conftool, serviceops-radar, Sustainability (Incident Followup): depool / confctl commands should print warnings or errors if too many nodes from that service are already depooled - https://phabricator.wikimedia.org/T245059 (Vgutierrez) the idea of creatin...
[01:42:48] SRE, Traffic: Varnish should allow PURGE requests only from a unix domain socket - https://phabricator.wikimedia.org/T347192 (Vgutierrez) p:Triage→Medium
[02:29:04] PROBLEM - Check systemd state on dumpsdata1006 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_tmpdumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:38:47] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:01:10] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:03:56] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:08:46] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:10:12] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[04:42:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:42:50] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:44:32] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:45:58] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:47:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:47:10] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:47:30] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:55:45] (PS1) Ilias Sarantopoulos: ml-services: update damaging staging to kserve 0.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/962247 (https://phabricator.wikimedia.org/T346446)
[05:00:35] (CR) Ilias Sarantopoulos: [C: +2] ml-services: update damaging staging to kserve 0.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/962247 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[05:01:36] (Merged) jenkins-bot: ml-services: update damaging staging to kserve 0.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/962247 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[05:10:53] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[05:11:27] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[05:40:34] (PS4) Stevemunene: druid: Bring druid1009.eqiad.wmnet into service [puppet] - https://gerrit.wikimedia.org/r/959147 (https://phabricator.wikimedia.org/T336042)
[05:40:36] (PS1) Stevemunene: druid: Bring druid1010.eqiad.wmnet into service [puppet] - https://gerrit.wikimedia.org/r/962248 (https://phabricator.wikimedia.org/T336042)
[05:40:38] (PS1) Stevemunene: druid: Bring druid1011.eqiad.wmnet into service [puppet] - https://gerrit.wikimedia.org/r/962249 (https://phabricator.wikimedia.org/T336042)
[05:40:40] (PS1) Stevemunene: druid: Add druid druid10[09-11] to druid_public_broker VIP [puppet] - https://gerrit.wikimedia.org/r/962250 (https://phabricator.wikimedia.org/T336042)
[05:42:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:52:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:53:55] (CR) Ayounsi: dnsbox: add ntp.anycast.wmnet as the anycasted NTP address (1 comment) [puppet] - https://gerrit.wikimedia.org/r/961818 (https://phabricator.wikimedia.org/T347054) (owner: Ssingh)
[07:00:05] Amir1, Urbanecm, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231002T0700).
[07:00:06] No Gerrit patches in the queue for this window AFAICS.
[07:03:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:04:10] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:04:33] o/ indeed nothing to do
[07:08:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:18:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 42.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:23:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 42.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:26:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 43.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:28:21] (CR) Filippo Giunchedi: [C: +2] maps: remove per-host healthchck [puppet] - https://gerrit.wikimedia.org/r/961062 (https://phabricator.wikimedia.org/T314118) (owner: Filippo Giunchedi)
[07:28:36] !log taavi@cumin1001 START - Cookbook sre.dns.netbox
[07:30:02] SRE, serviceops-radar, Patch-For-Review, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (fgiunchedi)
[07:30:17] (CR) Elukey: [C: +1] ml-services: update revertrisk-language-agnostic model binary [deployment-charts] - https://gerrit.wikimedia.org/r/962049 (https://phabricator.wikimedia.org/T347330) (owner: AikoChou)
[07:30:33] (CR) Filippo Giunchedi: [C: +1] profile: enable wal on grafana sqlite db [puppet] - https://gerrit.wikimedia.org/r/961510 (https://phabricator.wikimedia.org/T345362) (owner: Cwhite)
[07:30:50] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudcontrol1006 - taavi@cumin1001"
[07:31:39] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudcontrol1006 - taavi@cumin1001"
[07:31:39] !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:31:42] (CR) Elukey: "Quick question to double check - does the new model need https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/96" [deployment-charts] - https://gerrit.wikimedia.org/r/962049 (https://phabricator.wikimedia.org/T347330) (owner: AikoChou)
[07:32:01] !log taavi@cumin1001 START - Cookbook sre.dns.netbox
[07:33:07] (CR) Filippo Giunchedi: "LGTM overall" [puppet] - https://gerrit.wikimedia.org/r/961129 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[07:33:43] (CR) Filippo Giunchedi: [C: +1] pyrra add service dns entries [dns] - https://gerrit.wikimedia.org/r/961132 (https://phabricator.wikimedia.org/T302995) (owner: Herron)
[07:35:35] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudcontrol1006 - taavi@cumin1001"
[07:36:02] (CR) Filippo Giunchedi: [C: +2] "Very nice! Thank you" [software] - https://gerrit.wikimedia.org/r/959366 (https://phabricator.wikimedia.org/T345190) (owner: Krinkle)
[07:36:24] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign new IPs to cloudcontrol1006 - taavi@cumin1001"
[07:36:24] !log taavi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:37:08] !log joal@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[07:37:15] !log joal@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[07:38:27] SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (taavi)
[07:39:10] (PS1) Majavah: site: Put cloudcontrol1006 back into service [puppet] - https://gerrit.wikimedia.org/r/962355 (https://phabricator.wikimedia.org/T346891)
[07:43:05] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/962355 (https://phabricator.wikimedia.org/T346891) (owner: Majavah)
[07:45:23] !log taavi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudcontrol1006 - taavi@cumin1001"
[07:46:25] !log taavi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudcontrol1006 - taavi@cumin1001"
[07:47:16] !log taavi@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1006.eqiad.wmnet with OS bullseye
[07:47:28] SRE, ops-eqiad, Patch-For-Review, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by taavi@cumin1001 for host cloudcontrol1006.eqi...
[07:47:35] (CR) Majavah: [C: +2] site: Put cloudcontrol1006 back into service [puppet] - https://gerrit.wikimedia.org/r/962355 (https://phabricator.wikimedia.org/T346891) (owner: Majavah)
[07:48:59] (PS4) Volans: locking: add new module for distributed locking [software/spicerack] - https://gerrit.wikimedia.org/r/938822 (https://phabricator.wikimedia.org/T341973)
[07:49:01] (PS4) Volans: cookbook: add --no-locks CLI argument [software/spicerack] - https://gerrit.wikimedia.org/r/938823 (https://phabricator.wikimedia.org/T341973)
[07:49:03] (PS1) Volans: tests: simplify _cookbook tests [software/spicerack] - https://gerrit.wikimedia.org/r/962357
[07:49:27] !log +150G to prometheus@k8s in codfw
[07:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 49.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:55:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[07:56:09] (CR) Volans: "Ready for review. It will still be a NOOP until we change the config file via puppet to set the etcd_config path to the etcd config. I was" [software/spicerack] - https://gerrit.wikimedia.org/r/938824 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[08:00:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 43.98% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:00:27] (CR) AikoChou: [C: +1] SLOs: Add SLO for Liftwing Readability isvc [grafana-grizzly] - https://gerrit.wikimedia.org/r/961701 (https://phabricator.wikimedia.org/T334182) (owner: Klausman)
[08:01:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 43.52% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:04:36] (PS2) AikoChou: ml-services: update revertrisk-language-agnostic model binary [deployment-charts] - https://gerrit.wikimedia.org/r/962049 (https://phabricator.wikimedia.org/T347330)
[08:06:59] (CR) Elukey: [C: +1] "Looks good! This deployment is the first one with clients already requesting data from Revert Risk, so we'll need to be extra careful when" [deployment-charts] - https://gerrit.wikimedia.org/r/962049 (https://phabricator.wikimedia.org/T347330) (owner: AikoChou)
[08:07:09] (CR) AikoChou: ml-services: update revertrisk-language-agnostic model binary (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/962049 (https://phabricator.wikimedia.org/T347330) (owner: AikoChou)
[08:10:30] (CR) AikoChou: [C: +2] "Thanks for the review :)" [deployment-charts] - https://gerrit.wikimedia.org/r/962049 (https://phabricator.wikimedia.org/T347330) (owner: AikoChou)
[08:11:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 46.76% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:11:29] (Merged) jenkins-bot: ml-services: update revertrisk-language-agnostic model binary [deployment-charts] - https://gerrit.wikimedia.org/r/962049 (https://phabricator.wikimedia.org/T347330) (owner: AikoChou)
[08:17:59] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[08:18:12] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[08:20:54] SRE, LDAP-Access-Requests: Grant Access to to ldap/wmf for AKhatun - https://phabricator.wikimedia.org/T347546 (Jelto) Open→Resolved p:Triage→Medium a:Jelto `AKhatun` was added to wmf ldap group. I'm closing this task. Feel free to re-open if you have any problems.
[08:21:23] !log taavi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudcontrol1006
[08:21:46] !log taavi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcontrol1006
[08:24:58] (PS1) Ayounsi: Add "Auto-Submitted: auto-generated" headers to I/F scripts [puppet] - https://gerrit.wikimedia.org/r/962358
[08:24:59] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[08:28:59] (PS2) Ayounsi: Add "Auto-Submitted: auto-generated" headers to I/F scripts [puppet] - https://gerrit.wikimedia.org/r/962358
[08:31:21] !log taavi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1006.eqiad.wmnet with reason: host reimage
[08:34:15] (PS3) Ayounsi: Add "Auto-Submitted: auto-generated" headers to I/F scripts [puppet] - https://gerrit.wikimedia.org/r/962358 (https://phabricator.wikimedia.org/T347835)
[08:34:31] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1006.eqiad.wmnet with reason: host reimage
[08:38:45] (CR) Brouberol: wdqs.data_transfer: refactor spicerack class api (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: Ryan Kemper)
[08:43:26] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (dcaro) I'm going to start draining nodes from `D5`: cloudcephosd1011 cloudcephosd1012 cloudcephosd1013 cloudcep...
[08:46:52] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (cmooney) >>! In T316544#9214514, @dcaro wrote: > I'm going to start draining nodes from `D5`: @dcaro that's gre...
[08:49:57] (PS1) Ayounsi: Add Auto-Submitted: auto-generated to I/F scripts using mail [puppet] - https://gerrit.wikimedia.org/r/962361 (https://phabricator.wikimedia.org/T347835)
[08:50:35] (CR) CI reject: [V: -1] Add Auto-Submitted: auto-generated to I/F scripts using mail [puppet] - https://gerrit.wikimedia.org/r/962361 (https://phabricator.wikimedia.org/T347835) (owner: Ayounsi)
[08:53:04] (CR) Klausman: [C: +2] SLOs: Add SLO for Liftwing Readability isvc [grafana-grizzly] - https://gerrit.wikimedia.org/r/961701 (https://phabricator.wikimedia.org/T334182) (owner: Klausman)
[08:53:41] (CR) Klausman: [V: +2 C: +2] SLOs: Add SLO for Liftwing Readability isvc [grafana-grizzly] - https://gerrit.wikimedia.org/r/961701 (https://phabricator.wikimedia.org/T334182) (owner: Klausman)
[08:53:59] (CR) MVernon: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/962358 (https://phabricator.wikimedia.org/T347835) (owner: Ayounsi)
[08:57:00] (CR) MVernon: Add Auto-Submitted: auto-generated to I/F scripts using mail (1 comment) [puppet] - https://gerrit.wikimedia.org/r/962361 (https://phabricator.wikimedia.org/T347835) (owner: Ayounsi)
[08:57:35] (CR) Volans: "In general LGTM, minor nits inline" [software/spicerack] - https://gerrit.wikimedia.org/r/954739 (owner: Jbond)
[09:01:07] (PS2) Ayounsi: Add Auto-Submitted: auto-generated to I/F scripts using mail [puppet] - https://gerrit.wikimedia.org/r/962361 (https://phabricator.wikimedia.org/T347835)
[09:01:37] SRE-swift-storage, CX-deployments, MinT, Language-Team ( Language-2023-October-December): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491 (Pginer-WMF)
[09:01:48] (CR) Ayounsi: Add Auto-Submitted: auto-generated to I/F scripts using mail (1 comment) [puppet] - https://gerrit.wikimedia.org/r/962361 (https://phabricator.wikimedia.org/T347835) (owner: Ayounsi)
[09:06:08] (CR) MVernon: [C: +1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/962361 (https://phabricator.wikimedia.org/T347835) (owner: Ayounsi)
[09:06:24] !log taavi@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1006.eqiad.wmnet with OS bullseye
[09:06:27] (CR) Ayounsi: [C: +1] Adjust EVPN BGP type-5 route creation / export to include host routes [homer/public] - https://gerrit.wikimedia.org/r/888219 (https://phabricator.wikimedia.org/T329369) (owner: Cathal Mooney)
[09:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[09:11:13] SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by taavi@cumin1001 for host cloudcontrol1006.eqiad.wmnet with OS bullseye...
[09:11:42] SRE, Cloud-VPS, User-aborrero: Certain systems failing to resolve DNS entries under toolforge.org, wmcloud.org, wmflabs.org, toolserver.org - https://phabricator.wikimedia.org/T346177 (cmooney) Posting the below recently published RFC as it provides a little more clarity, https://www.rfc-editor.org...
[09:13:11] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (dcaro)
[09:16:01] (CR) Ayounsi: "Cool, it makes sens to me!" [homer/public] - https://gerrit.wikimedia.org/r/961927 (https://phabricator.wikimedia.org/T347191) (owner: Cathal Mooney)
[09:16:08] (CR) Ayounsi: [C: +1] Add automation to define ESI-LAGs on EVPN switches [homer/public] - https://gerrit.wikimedia.org/r/961927 (https://phabricator.wikimedia.org/T347191) (owner: Cathal Mooney)
[09:18:49] SRE, Cloud-VPS, Infrastructure-Foundations, cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (cmooney) In terms of the other nodes in that rack we have the following cloudvirts, and should consider possibly...
[09:20:59] SRE, Traffic: Varnish should allow PURGE requests only from a unix domain socket - https://phabricator.wikimedia.org/T347192 (Fabfur) Open→Resolved Mark this as Resolved, following work on the `purged` side will have dedicated tasks.
[09:25:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:28:18] SRE, All-and-every-Wikisource, Product-Analytics, Bengali-Sites, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (SCherukuwada) We met with Google to discuss this further. Google will provide more details on this soon, but the crux of the mat...
[09:28:30] SRE, Traffic: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (Fabfur)
[09:28:38] (PS1) Hnowlan: svg: default to "en" when a language is not specified [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/962563 (https://phabricator.wikimedia.org/T337139)
[09:30:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:33:50] (CR) Stevemunene: [C: +2] druid: Bring druid1009.eqiad.wmnet into service [puppet] - https://gerrit.wikimedia.org/r/959147 (https://phabricator.wikimedia.org/T336042) (owner: Stevemunene)
[09:35:41] SRE, Thumbor, serviceops, Patch-For-Review, User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 (hnowlan) Open→Resolved
[09:36:47] SRE, serviceops-radar, Patch-For-Review, Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (hnowlan)
[09:36:52] SRE, Release Pipeline, serviceops, Epic, Release-Engineering-Team (Seen): Migrate production services to kubernetes using the pipeline - https://phabricator.wikimedia.org/T198901 (hnowlan)
[09:38:34] (CR) Cathal Mooney: [C: +2] Adjust EVPN BGP type-5 route creation / export to include host routes [homer/public] - https://gerrit.wikimedia.org/r/888219 (https://phabricator.wikimedia.org/T329369) (owner: Cathal Mooney)
[09:39:42] (Merged) jenkins-bot: Adjust EVPN BGP type-5 route creation / export to include host routes [homer/public] - https://gerrit.wikimedia.org/r/888219 (https://phabricator.wikimedia.org/T329369) (owner: Cathal Mooney)
[09:41:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[09:41:07] (CR) Cathal Mooney: [C: +2] Add automation to define ESI-LAGs on EVPN switches [homer/public] - https://gerrit.wikimedia.org/r/961927 (https://phabricator.wikimedia.org/T347191) (owner: Cathal Mooney)
[09:41:39] (Merged) jenkins-bot: Add automation to define ESI-LAGs on EVPN switches [homer/public] - https://gerrit.wikimedia.org/r/961927 (https://phabricator.wikimedia.org/T347191) (owner: Cathal Mooney)
[09:43:38] (CR) Hnowlan: modules: add base.statsd (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/959803 (https://phabricator.wikimedia.org/T343025) (owner: Giuseppe Lavagetto)
[09:44:36] (CR) Hnowlan: "Adding CCs to clarify if this is a sensible default to make across all wikis" [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/962563 (https://phabricator.wikimedia.org/T337139) (owner: Hnowlan)
[09:47:31] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add codfw new switches - cmooney@cumin1001"
[09:48:48] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add codfw new switches - cmooney@cumin1001"
[09:56:16] SRE, Traffic: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (Vgutierrez) p:Triage→Medium
[10:00:06] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231002T1000)
[10:00:26] SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloudcontrol1006: move to new network setup - https://phabricator.wikimedia.org/T346891 (taavi) Open→Resolved
[10:00:37] SRE, ops-eqiad, Goal, User-aborrero, cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (taavi)
[10:01:14] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/962358 (https://phabricator.wikimedia.org/T347835) (owner: Ayounsi)
[10:03:47] SRE, Infrastructure-Foundations, netops: Plan codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[10:04:48] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Export routes generated from ARP/ND in EVPN - https://phabricator.wikimedia.org/T329369 (cmooney) Open→Resolved Change merged and pushed out to live devices. No change to announced routes on existing devices, e.g. type 5 routes a...
[10:05:03] SRE, Traffic: Add version flag to purged - https://phabricator.wikimedia.org/T347839 (Fabfur)
[10:05:22] SRE, Traffic: Add version flag to purged - https://phabricator.wikimedia.org/T347839 (Fabfur) p:Triage→Low
[10:06:07] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/962361 (https://phabricator.wikimedia.org/T347835) (owner: Ayounsi)
[10:10:50] (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/962357 (owner: Volans)
[10:12:24] (CR) Volans: [C: +2] tests: simplify _cookbook tests [software/spicerack] - https://gerrit.wikimedia.org/r/962357 (owner: Volans)
[10:13:28] (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/938823 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[10:14:43] (CR) Jbond: [C: +1] "lgtm" [software/spicerack] - https://gerrit.wikimedia.org/r/938824 (https://phabricator.wikimedia.org/T341973) (owner: Volans)
[10:16:51] (Merged) jenkins-bot: tests: simplify _cookbook tests [software/spicerack] - https://gerrit.wikimedia.org/r/962357 (owner: Volans)
[10:20:47] (Abandoned) Hnowlan: conftool: clean up references to obsolete restbase service [puppet] - https://gerrit.wikimedia.org/r/747098 (https://phabricator.wikimedia.org/T244843) (owner: Hnowlan)
[10:23:45] (PS1) Ilias Sarantopoulos: ml-services: upgrade kserve in prod to 0.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/962565 (https://phabricator.wikimedia.org/T346446)
[10:27:07] (CR) Elukey: [C: +1] ml-services: upgrade kserve in prod to 0.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/962565 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[10:29:36] (PS10) Arturo Borrero Gonzalez: cloudgw: refactor to set up routes independently from keepalived [puppet] - https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T347687)
[10:29:58] (CR) Arturo Borrero Gonzalez: cloudgw: refactor to set up routes independently from keepalived (4 comments) [puppet] - https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T347687) (owner: Arturo Borrero Gonzalez)
[10:30:42] (PS8) Jbond: puppet: Add new PuppetServer class [software/spicerack] - https://gerrit.wikimedia.org/r/954739
[10:30:50] (CR) Jbond: "updated thanks" [software/spicerack] - https://gerrit.wikimedia.org/r/954739 (owner: Jbond)
[10:32:44] SRE-tools, Cloud-VPS, Infrastructure-Foundations, Goal, and 2 others: cloudcumin: decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (fnegri)
[10:33:09] (CR) Jbond: [V: +1 C: +2] rsyslog: update code to support cfssl and puppet [puppet] - https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) (owner: Jbond)
[10:34:18] !log depool cp4040 to test new purged version (T347837)
[10:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:21] T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837
[10:34:27] (CR) CI reject: [V: -1] puppet: Add new PuppetServer class [software/spicerack] - https://gerrit.wikimedia.org/r/954739 (owner: Jbond)
[10:35:35] PROBLEM - Check systemd state on druid1009 is CRITICAL: CRITICAL - degraded: The following units failed: druid-historical.service,druid-middlemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:36:31] (CR) Jbond: [V: +1 C: +2] rsyslog: update code to support cfssl and puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956481 (https://phabricator.wikimedia.org/T340741) (owner: Jbond)
[10:37:20] (CR) Cathal Mooney: cloudgw: refactor to set up routes independently from keepalived (3 comments) [puppet] - https://gerrit.wikimedia.org/r/922104 (https://phabricator.wikimedia.org/T347687) (owner: Arturo Borrero Gonzalez)
[10:37:34] (CR) Ilias Sarantopoulos: [C: +2] ml-services: upgrade kserve in prod to 0.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/962565 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[10:38:53] RECOVERY - Check systemd state on druid1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:38:53] (Merged) jenkins-bot: ml-services: upgrade kserve in prod to 0.11.1 [deployment-charts] - https://gerrit.wikimedia.org/r/962565 (https://phabricator.wikimedia.org/T346446) (owner: Ilias Sarantopoulos)
[10:40:41] SRE, SRE-tools, Infrastructure-Foundations: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (aborrero) Open→Resolved
[10:40:44] SRE, SRE-tools, Infrastructure-Foundations, netops: Netbox - PuppetDB audit 2021-11 - https://phabricator.wikimedia.org/T295762 (aborrero)
[10:42:13] (CR) Effie Mouzeli: [C: +2] push-notifications: Make queueing non-verbose by default
[deployment-charts] - 10https://gerrit.wikimedia.org/r/962053 (https://phabricator.wikimedia.org/T347717) (owner: 10Jgiannelos) [10:43:00] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [10:43:07] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [10:43:14] (03Merged) 10jenkins-bot: push-notifications: Make queueing non-verbose by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/962053 (https://phabricator.wikimedia.org/T347717) (owner: 10Jgiannelos) [10:43:24] (03CR) 10Clément Goubert: [C: 03+1] k8s, cassandra: add entries for {edit,editor,page}-analytics [puppet] - 10https://gerrit.wikimedia.org/r/961774 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [10:47:14] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (10cmooney) > I think they should be converted all to be /32 both on Netbox and on the instances. This will also let the automation know that they are proper VIPs and will p... 
[10:49:18] !log swap purged on cp4040 to use UDS instead of TCP for Varnish (T347837) [10:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:22] T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 [10:50:23] (03CR) 10Hnowlan: [C: 03+2] k8s, cassandra: add entries for {edit,editor,page}-analytics [puppet] - 10https://gerrit.wikimedia.org/r/961774 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [10:50:42] (03PS1) 10Hnowlan: service: add {edit,editor,page}-analytics services [puppet] - 10https://gerrit.wikimedia.org/r/962570 (https://phabricator.wikimedia.org/T336391) [10:54:52] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/push-notifications: apply [10:55:08] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [10:58:22] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [10:58:35] (03CR) 10FNegri: "@Volans are you ok with merging this?" [puppet] - 10https://gerrit.wikimedia.org/r/952448 (https://phabricator.wikimedia.org/T325067) (owner: 10FNegri) [10:58:37] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [10:59:01] (03CR) 10Brouberol: [C: 03+2] druid: update to use puppetdb_query instead of query_classes [puppet] - 10https://gerrit.wikimedia.org/r/961841 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [10:59:36] (03CR) 10Brouberol: [C: 03+2] "Approved, given that the PCC jobs returns with a NOOP. 
I'm not familiar enough with puppetDB queries or its syntax to go further than that" [puppet] - 10https://gerrit.wikimedia.org/r/961841 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:00:29] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:00:47] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:07] PROBLEM - Druid middlemanager on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:01:21] PROBLEM - Check systemd state on druid1009 is CRITICAL: CRITICAL - degraded: The following units failed: druid-historical.service,druid-middlemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:33] PROBLEM - Druid historical on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:02:30] 10SRE, 10Traffic: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur) Tests on cp4040.ulsfo.wmnet shows that the new `purged` version connecting to the proper Varnish socket correctly processes the PURGE requests. An example snippet from `varnishlog`: ` * << Re... 
[11:04:10] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:08:19] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:10:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:15:12] btullis: WRT to the kafka broker decommission: we've already evacuated more topics than I anticipated! 275/1052 (26%) [11:15:30] sry. wrong chan [11:17:27] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops: Netbox - PuppetDB audit 2021-11 - https://phabricator.wikimedia.org/T295762 (10cmooney) [11:17:56] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (10cmooney) 05Resolved→03Open I'm gonna re-open this for now, as it looks like the issue isn't fully solved. On the cloudnet side of this particular link the VIP is sti... 
[11:19:03] RECOVERY - Druid middlemanager on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:19:17] RECOVERY - Check systemd state on druid1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:20] 10SRE, 10ops-eqiad, 10Goal, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloud @ eqiad: hardware re-racking plan - https://phabricator.wikimedia.org/T341494 (10taavi) [11:19:29] RECOVERY - Druid historical on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:21:21] (03CR) 10Jbond: [V: 03+1 C: 03+2] druid: update to use puppetdb_query instead of query_classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961841 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:22:28] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952448 (https://phabricator.wikimedia.org/T325067) (owner: 10FNegri) [11:23:18] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:26:10] (03CR) 10Jbond: [C: 03+2] scap::dsh::group: switch from query_nodes to puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/961850 (https://phabricator.wikimedia.org/T341373) (owner: 10Jbond) [11:26:58] (03PS2) 10Hnowlan: admin: add namespaces for remaining aqs2 services, add config for page-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/961782 (https://phabricator.wikimedia.org/T336391) [11:27:33] (03CR) 10Clément Goubert: [C: 03+1] admin: add namespaces for remaining 
aqs2 services, add config for page-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/961782 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [11:28:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:32:20] (03CR) 10Hnowlan: [C: 03+2] admin: add namespaces for remaining aqs2 services, add config for page-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/961782 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [11:33:25] (03CR) 10Clément Goubert: [C: 03+1] Add druid-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961786 (https://phabricator.wikimedia.org/T336385) (owner: 10Hnowlan) [11:33:52] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (10Volans) @cmooney this change would affect a lot of VIPs assigned by puppet all over production so we must check carefully the consequences of any changes. That said I'm h... [11:34:42] (03Merged) 10jenkins-bot: admin: add namespaces for remaining aqs2 services, add config for page-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/961782 (https://phabricator.wikimedia.org/T336391) (owner: 10Hnowlan) [11:35:23] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/952448 (https://phabricator.wikimedia.org/T325067) (owner: 10FNegri) [11:35:30] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [11:38:27] !log hnowlan@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[11:40:03] PROBLEM - Druid middlemanager on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:40:06] !log hnowlan@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [11:40:19] PROBLEM - Check systemd state on druid1009 is CRITICAL: CRITICAL - degraded: The following units failed: druid-historical.service,druid-middlemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:31] PROBLEM - Druid historical on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:42:16] !log hnowlan@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:42:53] !log hnowlan@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:44:38] (03PS1) 10Hnowlan: media-analytics: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/962588 (https://phabricator.wikimedia.org/T336380) [11:45:08] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:46:45] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[11:47:01] (03PS1) 10Jgiannelos: Prefix docker image tag with branch [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/962590 [11:47:05] (03CR) 10Hnowlan: [C: 03+2] media-analytics: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/962588 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [11:47:20] (03CR) 10CI reject: [V: 04-1] Prefix docker image tag with branch [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/962590 (owner: 10Jgiannelos) [11:47:22] (03PS1) 10Dreamy Jazz: clienthints: Enable display on testwikis and four production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962591 (https://phabricator.wikimedia.org/T341110) [11:47:35] !log aborrero@cumin1001 START - Cookbook sre.dns.wipe-cache bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org on all recursors [11:47:39] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org on all recursors [11:47:47] RECOVERY - Check systemd state on druid1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:50] (03Merged) 10jenkins-bot: media-analytics: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/962588 (https://phabricator.wikimedia.org/T336380) (owner: 10Hnowlan) [11:47:59] RECOVERY - Druid historical on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:49:01] RECOVERY - Druid middlemanager on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [11:49:11] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[11:49:15] (03CR) 10Kosta Harlan: [C: 03+1] clienthints: Enable display on testwikis and four production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962591 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [11:49:41] (03CR) 10Ayounsi: [C: 03+2] Add "Auto-Submitted: auto-generated" headers to I/F scripts [puppet] - 10https://gerrit.wikimedia.org/r/962358 (https://phabricator.wikimedia.org/T347835) (owner: 10Ayounsi) [11:49:50] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:51:03] (03CR) 10Ayounsi: [C: 03+2] Add Auto-Submitted: auto-generated to I/F scripts using mail [puppet] - 10https://gerrit.wikimedia.org/r/962361 (https://phabricator.wikimedia.org/T347835) (owner: 10Ayounsi) [11:51:34] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/media-analytics: apply [11:51:52] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [11:52:02] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply [11:52:40] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply [11:53:13] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/media-analytics: apply [11:53:40] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply [11:56:28] (03PS9) 10Jbond: puppet: Add new PuppetServer class [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 [11:56:54] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [11:57:10] (03CR) 10Jbond: "ready for another pass" [software/spicerack] - 10https://gerrit.wikimedia.org/r/954739 (owner: 10Jbond) [11:57:20] (03CR) 10FNegri: cluster::cloud_management allow access to wmcs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952448 (https://phabricator.wikimedia.org/T325067) (owner: 10FNegri) [11:57:58] (03PS5) 10Jbond: docker::registry::web: remove unused parameters 
[puppet] - 10https://gerrit.wikimedia.org/r/961814 (https://phabricator.wikimedia.org/T340743) [11:58:09] (03CR) 10FNegri: cluster::cloud_management allow access to wmcs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952448 (https://phabricator.wikimedia.org/T325067) (owner: 10FNegri) [11:58:41] (03PS2) 10Jgiannelos: Prefix docker image tag with branch [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/962590 [11:58:59] (03CR) 10CI reject: [V: 04-1] Prefix docker image tag with branch [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/962590 (owner: 10Jgiannelos) [12:00:24] (03CR) 10Ssingh: [V: 03+1] dnsbox: add ntp.anycast.wmnet as the anycasted NTP address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961818 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [12:00:31] (03PS2) 10Ssingh: dnsbox: add ntp.anycast.wmnet as the anycasted NTP address [puppet] - 10https://gerrit.wikimedia.org/r/961818 (https://phabricator.wikimedia.org/T347054) [12:01:29] (03CR) 10Ssingh: dnsbox: add ntp.anycast.wmnet as the anycasted NTP address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/961818 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [12:03:16] (03CR) 10Jgiannelos: "recheck" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/962590 (owner: 10Jgiannelos) [12:03:35] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:04:13] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [12:04:31] (03CR) 10Jbond: [C: 03+2] postgress: update to use /etc/ssl/certs/wmf-ca-certificates.crt CA [puppet] - 10https://gerrit.wikimedia.org/r/961839 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond) [12:09:34] (HelmReleaseBadStatus) firing: Helm release page-analytics/main on k8s-staging@eqiad in state 
pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=page-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:09:43] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [12:10:11] PROBLEM - Check systemd state on druid1009 is CRITICAL: CRITICAL - degraded: The following units failed: druid-historical.service,druid-middlemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:23] PROBLEM - Druid historical on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:10:56] (03PS3) 10Jgiannelos: Prefix docker image tag with branch [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/962590 [12:11:27] PROBLEM - Druid middlemanager on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:12:05] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [12:12:22] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [12:12:33] (03CR) 10Ayounsi: [C: 03+1] "Thanks, sounds good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/961818 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [12:14:34] (HelmReleaseBadStatus) resolved: Helm release page-analytics/main on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=page-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:16:35] (03CR) 10Jgiannelos: "I fetched the upstream tegola v0.19 branch and added some config so we can tag the docker image with the tegola version." [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/962590 (owner: 10Jgiannelos) [12:17:43] RECOVERY - Check systemd state on druid1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:17:53] RECOVERY - Druid historical on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:18:13] !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[12:18:57] RECOVERY - Druid middlemanager on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:22:30] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [12:25:00] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack codfw1dev - aborrero@cumin1001" [12:29:08] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: openstack codfw1dev - aborrero@cumin1001" [12:29:08] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:31:13] (03CR) 10FNegri: "The latest patchset is not touching wikireplicas, so I think this patch can be merged?" [puppet] - 10https://gerrit.wikimedia.org/r/923681 (https://phabricator.wikimedia.org/T337848) (owner: 10Jbond) [12:31:15] !log aborrero@cumin1001 START - Cookbook sre.dns.wipe-cache bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org on all recursors [12:31:20] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org on all recursors [12:34:57] !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[12:35:04] (03CR) 10Jbond: [C: 03+2] wmcs: add wmcs-roots to roles where it is missing [puppet] - 10https://gerrit.wikimedia.org/r/923681 (https://phabricator.wikimedia.org/T337848) (owner: 10Jbond) [12:36:31] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:38:45] PROBLEM - Druid historical on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:39:12] !log aborrero@cumin1001 START - Cookbook sre.dns.wipe-cache bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org on all recursors [12:39:16] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org on all recursors [12:39:53] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10cmooney) [12:40:03] PROBLEM - Check systemd state on druid1009 is CRITICAL: CRITICAL - degraded: The following units failed: druid-historical.service,druid-middlemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:40:03] (03PS1) 10Cathal Mooney: Add include for WMCS codfw new public VIPs [dns] - 10https://gerrit.wikimedia.org/r/962597 (https://phabricator.wikimedia.org/T347858) [12:41:10] (03CR) 10Arturo Borrero Gonzalez: Add include for WMCS codfw new public VIPs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/962597 (https://phabricator.wikimedia.org/T347858) (owner: 10Cathal Mooney) [12:41:19] PROBLEM - Druid middlemanager on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid 
[12:42:34] (03PS2) 10Cathal Mooney: Add include for WMCS codfw new public VIPs [dns] - 10https://gerrit.wikimedia.org/r/962597 (https://phabricator.wikimedia.org/T347858) [12:42:54] (03CR) 10Cathal Mooney: Add include for WMCS codfw new public VIPs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/962597 (https://phabricator.wikimedia.org/T347858) (owner: 10Cathal Mooney) [12:43:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add include for WMCS codfw new public VIPs [dns] - 10https://gerrit.wikimedia.org/r/962597 (https://phabricator.wikimedia.org/T347858) (owner: 10Cathal Mooney) [12:44:36] (03CR) 10FNegri: [C: 03+2] cluster::cloud_management allow access to wmcs [puppet] - 10https://gerrit.wikimedia.org/r/952448 (https://phabricator.wikimedia.org/T325067) (owner: 10FNegri) [12:47:10] (03CR) 10Cathal Mooney: [C: 03+2] Add include for WMCS codfw new public VIPs [dns] - 10https://gerrit.wikimedia.org/r/962597 (https://phabricator.wikimedia.org/T347858) (owner: 10Cathal Mooney) [12:48:47] RECOVERY - Druid middlemanager on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:49:03] RECOVERY - Check systemd state on druid1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:13] RECOVERY - Druid historical on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [12:51:20] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10cmooney) [12:51:38] (03CR) 10Jelto: [C: 03+2] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/962104 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) 
[12:52:28] (03Merged) 10jenkins-bot: miscweb: update research-landing-page image tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/962104 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [12:57:44] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Goal, and 2 others: cloudcumin: decide sudoers rules for users without global root - https://phabricator.wikimedia.org/T325067 (10fnegri) 05Open→03Resolved a:03fnegri The patch above has been merged and now all members of the `wmcs-roots` grou... [12:58:36] (03PS1) 10Jbond: O:puppetdb: Add permissions for replication user [puppet] - 10https://gerrit.wikimedia.org/r/962601 (https://phabricator.wikimedia.org/T346016) [12:59:26] (03CR) 10Jbond: [C: 03+2] O:puppetdb: Add permissions for replication user [puppet] - 10https://gerrit.wikimedia.org/r/962601 (https://phabricator.wikimedia.org/T346016) (owner: 10Jbond) [13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231002T1300). [13:00:06] Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:12] \o [13:00:31] I'm finishing up something, can deploy afterwards unless no-one else is faster [13:01:03] !log jelto@cumin1001 START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1003.wikimedia.org to gitlab2002.wikimedia.org [13:02:07] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10Trizek-WMF) p:05High→03Medium The post-event retro will be published later this week. Sorry for the delay.... 
[13:03:20] (03PS1) 10Jbond: puppetdb: correct replication database [puppet] - 10https://gerrit.wikimedia.org/r/962602 (https://phabricator.wikimedia.org/T346016) [13:03:34] (03CR) 10CI reject: [V: 04-1] puppetdb: correct replication database [puppet] - 10https://gerrit.wikimedia.org/r/962602 (https://phabricator.wikimedia.org/T346016) (owner: 10Jbond) [13:03:38] (03PS2) 10Jbond: puppetdb: correct replication database [puppet] - 10https://gerrit.wikimedia.org/r/962602 (https://phabricator.wikimedia.org/T346016) [13:04:12] (03CR) 10Jbond: [C: 03+2] puppetdb: correct replication database [puppet] - 10https://gerrit.wikimedia.org/r/962602 (https://phabricator.wikimedia.org/T346016) (owner: 10Jbond) [13:06:32] Dreamy_Jazz: hi, looking at your patch now [13:06:43] Thanks [13:07:35] 10Puppet, 10Patch-For-Review: pg replication lag UNKNOWN for puppetdb2003 - https://phabricator.wikimedia.org/T346016 (10jbond) 05In progress→03Resolved a:03jbond This has now been corrected [13:07:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962591 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [13:08:35] (03Merged) 10jenkins-bot: clienthints: Enable display on testwikis and four production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962591 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [13:08:47] (JobUnavailable) firing: (6) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:09:59] PROBLEM - Check systemd state on druid1009 is CRITICAL: CRITICAL - degraded: The following units failed: druid-historical.service,druid-middlemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:09] PROBLEM - Druid historical on 
druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [13:10:26] I can test on testwikis but would need CU granted to be able to test. [13:10:29] kostajh: you have an undeployed patch (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/962028) in mw-config. it's touching beta files only so I'm going to pull that myself, but please remember to do that in the future [13:11:00] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10Platform Team Initiatives (Containerise Services): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Jdforrester-WMF) [13:11:05] !log taavi@deploy2002 Started scap: Backport for [[gerrit:962591|clienthints: Enable display on testwikis and four production wikis (T341110)]] [13:11:13] PROBLEM - Druid middlemanager on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [13:11:16] T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110 [13:11:27] I can also vouch for that patch btw. [13:11:33] As I'm working with kosta on that. [13:12:14] If you also want me to test that I can do so. [13:13:02] !log disable puppet on A:dns-rec to merge CR 961818 [13:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:09] the ReportIncident patch was already deployed to beta when it was merged, so I don't really care if it works or not at this moment :P [13:13:16] (03CR) 10Ssingh: [C: 03+2] dnsbox: add ntp.anycast.wmnet as the anycasted NTP address [puppet] - 10https://gerrit.wikimedia.org/r/961818 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [13:13:19] Okay. Thanks. 
[13:13:49] I had assumed your comment meant it wasn't applied to the beta wikis
[13:14:04] but please do relay the message about pulling them down to the deployment server to prevent confusion
[13:14:20] I have over slack. Thanks.
[13:14:28] nah, the beta mediawiki update scripts just pull the latest commits from git
[13:15:58] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[13:18:45] RECOVERY - Druid middlemanager on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[13:19:03] SRE, Infrastructure-Foundations, Puppet-Core, Patch-For-Review, Puppet (Puppet 7.0): convert uses of query_resources - https://phabricator.wikimedia.org/T341373 (jbond)
[13:19:03] RECOVERY - Check systemd state on druid1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:13] RECOVERY - Druid historical on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid
[13:19:23] SRE, Data-Persistence, Performance-Team, serviceops, Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (kamila) Open→Resolved All disruptive switchover-related work is finished and things are stable. The switchover went smoothly an...
[13:19:42] Dreamy_Jazz: just realized `testwikis` doesn't seem to be a valid database list, so your patch is noop for testwiki
[13:19:50] Hmm.
[13:19:54] !log taavi@deploy2002 taavi and dreamyjazz: Backport for [[gerrit:962591|clienthints: Enable display on testwikis and four production wikis (T341110)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:19:56] I thought that was the way to specify all testwikis
[13:19:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:20:05] T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110
[13:20:06] SRE, ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (Jhancock.wm) Open→Resolved
[13:20:10] SRE, serviceops, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (kamila) Thanks a lot @Trizek-WMF ! I updated the parent task with a summary of how it went, feel free to use i...
[13:20:15] In that https://noc.wikimedia.org/conf/dblists/testwikis.dblist exists
[13:20:17] there is a testwikis dblist..
[13:20:33] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (jbond)
[13:21:03] So does this now apply to testwiki?
[13:21:19] https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/4520/console shows no changes on testwiki
[13:21:38] Odd.
[13:21:39] SRE, Infrastructure-Foundations, Observability-Alerting, Puppet-Infrastructure, Puppet (Puppet 7.0): puppet JMX mappings - https://phabricator.wikimedia.org/T342253 (jbond) Open→In progress p:Triage→Medium
[13:21:43] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): puppetserver monitoring - https://phabricator.wikimedia.org/T342125 (jbond)
[13:21:49] maybe it needs to be in the `DB_LISTS` constant in multiversion/MWMultiVersion.php?
[13:22:27] Probably
[13:22:50] can you send a patch?
[13:23:04] Sure. Do you want a new patch for that or modify the current one?
[13:23:19] new patch, since the current one was already merged
[13:23:40] Okay.
[13:25:40] (PS1) Fabfur: purged: parametrize purged frontend and backend address [puppet] - https://gerrit.wikimedia.org/r/962611 (https://phabricator.wikimedia.org/T347837)
[13:26:05] (CR) CI reject: [V: -1] purged: parametrize purged frontend and backend address [puppet] - https://gerrit.wikimedia.org/r/962611 (https://phabricator.wikimedia.org/T347837) (owner: Fabfur)
[13:27:15] (CR) Effie Mouzeli: [C: +1] Prefix docker image tag with branch [software/tegola] (wmf/v0.19.x) - https://gerrit.wikimedia.org/r/962590 (owner: Jgiannelos)
[13:27:17] (PS1) Dreamy Jazz: Add 'testwikis' DB list to MWMultiVersion::DB_LISTS [mediawiki-config] - https://gerrit.wikimedia.org/r/962612 (https://phabricator.wikimedia.org/T341110)
[13:27:30] Created the above which adds testwikis to the DB list
[13:27:55] !log taavi@deploy2002 Sync cancelled.
[13:28:03] yeah, https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/4521/console looks much better [13:28:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962612 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [13:29:03] (03CR) 10Jforrester: "Why not test2wiki too?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962612 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [13:29:05] (03Merged) 10jenkins-bot: Add 'testwikis' DB list to MWMultiVersion::DB_LISTS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962612 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [13:29:17] !log taavi@deploy2002 Started scap: Backport for [[gerrit:962612|Add 'testwikis' DB list to MWMultiVersion::DB_LISTS (T341110)]] [13:29:20] T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110 [13:29:37] (03PS2) 10Fabfur: purged: parametrize purged frontend and backend address [puppet] - 10https://gerrit.wikimedia.org/r/962611 (https://phabricator.wikimedia.org/T347837) [13:29:46] (03CR) 10Dreamy Jazz: Add 'testwikis' DB list to MWMultiVersion::DB_LISTS (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962612 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [13:30:02] (03CR) 10CI reject: [V: 04-1] purged: parametrize purged frontend and backend address [puppet] - 10https://gerrit.wikimedia.org/r/962611 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [13:30:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:36] !log taavi@deploy2002 taavi and dreamyjazz: Backport for [[gerrit:962612|Add 'testwikis' DB list to MWMultiVersion::DB_LISTS (T341110)]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug) [13:31:10] PROBLEM - Host 10.3.0.2 is DOWN: PING CRITICAL - Packet loss = 100% [13:31:16] (03PS3) 10Fabfur: purged: parametrize purged frontend and backend address [puppet] - 10https://gerrit.wikimedia.org/r/962611 (https://phabricator.wikimedia.org/T347837) [13:31:30] ok, both should be available for testing now. and granted your staff account the rights to do so on testwiki [13:31:38] Thanks. [13:33:41] Test successful. [13:34:27] thx, syncing [13:34:29] !log taavi@deploy2002 taavi and dreamyjazz: Continuing with sync [13:35:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1227.eqiad.wmnet with OS bullseye [13:35:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1228.eqiad.wmnet with OS bullseye [13:35:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1227.eqiad.wmnet with OS bullseye [13:35:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1228.eqiad.wmnet with OS bullseye [13:35:29] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye [13:35:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1229.eqiad.wmnet with OS bullseye [13:36:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1227.eqiad.wmnet with reason: host reimage [13:36:24] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet 
with OS bullseye [13:36:25] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1228.eqiad.wmnet with reason: host reimage [13:36:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**)... [13:36:42] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43801/console" [puppet] - 10https://gerrit.wikimedia.org/r/962611 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [13:37:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED [13:38:27] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [13:38:46] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [13:38:49] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1020.eqiad.wmnet with OS bullseye [13:38:52] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [13:39:00] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1020.eqiad.wmnet with OS bullseye [13:39:11] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [13:39:27] (03CR) 10Hnowlan: [C: 03+2] Add druid-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961786 (https://phabricator.wikimedia.org/T336385) (owner: 10Hnowlan) [13:39:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on db1227.eqiad.wmnet with reason: host reimage [13:39:46] !log stevemunene@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. [13:39:58] 10SRE, 10Traffic, 10Patch-For-Review: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur) [13:40:25] (03Merged) 10jenkins-bot: Add druid-http-gateway chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961786 (https://phabricator.wikimedia.org/T336385) (owner: 10Hnowlan) [13:40:29] (03PS1) 10Dreamy Jazz: Add test2wiki to the testwikis dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962613 [13:40:32] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:962612|Add 'testwikis' DB list to MWMultiVersion::DB_LISTS (T341110)]] (duration: 11m 15s) [13:40:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:40:37] T341110: Deploy client hints functionality - https://phabricator.wikimedia.org/T341110 [13:40:57] (03CR) 10Dreamy Jazz: Add 'testwikis' DB list to MWMultiVersion::DB_LISTS (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962612 (https://phabricator.wikimedia.org/T341110) (owner: 10Dreamy Jazz) [13:41:04] Thanks for the deploy! 
[13:41:23] yw
[13:41:36] (PS2) Dreamy Jazz: Add test2wiki to testwikis.dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/962613
[13:41:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1228.eqiad.wmnet with reason: host reimage
[13:42:20] (PS1) Cathal Mooney: Interface automation: skip import of existing int IPs and VIPs [software/netbox-extras] - https://gerrit.wikimedia.org/r/962614 (https://phabricator.wikimedia.org/T295774)
[13:42:24] (CR) Jforrester: "Follows up I1c94f1af5f0788ccea1fac0e7053ba0758837cec where it was missing even then, indeed." [mediawiki-config] - https://gerrit.wikimedia.org/r/962613 (owner: Dreamy Jazz)
[13:42:43] (CR) Jgiannelos: [C: +2] Prefix docker image tag with branch [software/tegola] (wmf/v0.19.x) - https://gerrit.wikimedia.org/r/962590 (owner: Jgiannelos)
[13:42:54] (CR) CI reject: [V: -1] Interface automation: skip import of existing int IPs and VIPs [software/netbox-extras] - https://gerrit.wikimedia.org/r/962614 (https://phabricator.wikimedia.org/T295774) (owner: Cathal Mooney)
[13:43:32] (Merged) jenkins-bot: Prefix docker image tag with branch [software/tegola] (wmf/v0.19.x) - https://gerrit.wikimedia.org/r/962590 (owner: Jgiannelos)
[13:43:34] (PS1) Ssingh: dnsbox: update healthcheck for ntp.anycast.wmnet [puppet] - https://gerrit.wikimedia.org/r/962615
[13:43:59] (CR) Dreamy Jazz: Add test2wiki to testwikis.dblist (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/962613 (owner: Dreamy Jazz)
[13:44:46] (PS4) Fabfur: purged: parametrize purged frontend and backend address [puppet] - https://gerrit.wikimedia.org/r/962611 (https://phabricator.wikimedia.org/T347837)
[13:45:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST hostendpoints) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:47:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED [13:48:12] (03CR) 10Majavah: [C: 04-1] "`testwikis` is used by scap for the automatic test wiki deployment on Tuesday mornings. I believe it's intentional that test2wiki is not i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962613 (owner: 10Dreamy Jazz) [13:48:37] (03PS1) 10Jbond: prometheus_reporter: Add new reporter for providing prometheus metricts [puppet] - 10https://gerrit.wikimedia.org/r/962617 (https://phabricator.wikimedia.org/T342125) [13:48:39] (03PS1) 10Jbond: augeas_core: update augeas_core [puppet] - 10https://gerrit.wikimedia.org/r/962618 [13:49:02] (03Abandoned) 10Dreamy Jazz: Add test2wiki to testwikis.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962613 (owner: 10Dreamy Jazz) [13:49:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:49:51] (03PS5) 10Fabfur: purged: parametrize purged frontend and backend address [puppet] - 10https://gerrit.wikimedia.org/r/962611 (https://phabricator.wikimedia.org/T347837) [13:51:38] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review: WMCS VIPs: Netbox netmask inconsistencies - https://phabricator.wikimedia.org/T295774 (10cmooney) @Volans yep thanks. I created a provisional patch but I agree we need to consider all the cases. I believe from looking through the code... 
[13:52:10] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1020.eqiad.wmnet with reason: host reimage
[13:53:30] (CR) Ssingh: [C: +2] dnsbox: update healthcheck for ntp.anycast.wmnet [puppet] - https://gerrit.wikimedia.org/r/962615 (owner: Ssingh)
[13:53:57] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[13:54:46] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1020.eqiad.wmnet with reason: host reimage
[13:55:39] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[13:56:17] (PS6) Fabfur: purged: parametrize purged frontend and backend address [puppet] - https://gerrit.wikimedia.org/r/962611 (https://phabricator.wikimedia.org/T347837)
[13:56:46] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[13:57:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[13:57:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1227.eqiad.wmnet with OS bullseye
[13:57:28] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1227.eqiad.wmnet with OS bullseye completed: - db1227 (**PASS**) - Removed f...
[13:58:01] (03PS1) 10Joal: Bump mw-page-content-change-enrich parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/962620 [13:58:06] ottomata: --^ [13:58:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [13:58:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1228.eqiad.wmnet with OS bullseye [13:58:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1228.eqiad.wmnet with OS bullseye completed: - db1228 (**PASS**) - Removed f... [13:58:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db1229.eqiad.wmnet with OS bullseye [13:58:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db1229.eqiad.wmnet with OS bullseye [13:59:17] (03CR) 10Vgutierrez: [C: 03+1] "looking good :)" [puppet] - 10https://gerrit.wikimedia.org/r/962611 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [13:59:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jhancock.wm) [14:00:57] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:14] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[14:01:24] (03PS1) 10FNegri: Revert "bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring" [puppet] - 10https://gerrit.wikimedia.org/r/962207 [14:03:52] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [14:04:06] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Spicerack: add distributed locking support - https://phabricator.wikimedia.org/T341973 (10Volans) To ensure that the generated read/write traffic on the etcd cluster will be ok and not cause any issue I've made some tests using the... [14:04:42] (03PS1) 10Jclark-ctr: add an-master100(3,4) to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/962621 (https://phabricator.wikimedia.org/T342291) [14:05:18] (03CR) 10Jclark-ctr: [C: 03+2] add an-master100(3,4) to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/962621 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr) [14:05:37] 10SRE, 10ops-codfw, 10decommission-hardware: decommission frauth2001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T340153 (10Papaul) 05Resolved→03Open We need to clean interfaces on the switch [14:06:24] (03PS2) 10Jclark-ctr: add an-master100(3,4) to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/962621 (https://phabricator.wikimedia.org/T342291) [14:06:26] (03CR) 10Fabfur: purged: parametrize purged frontend and backend address (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/962611 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [14:07:02] (03CR) 10Jclark-ctr: [C: 03+2] add an-master100(3,4) to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/962621 (https://phabricator.wikimedia.org/T342291) (owner: 10Jclark-ctr) [14:07:07] (03PS1) 10Dreamy Jazz: Define wgReportIncidentEmailFromAddress on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962623 (https://phabricator.wikimedia.org/T339275) [14:07:10] (03CR) 10Ottomata: Bump 
mw-page-content-change-enrich parallelism (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/962620 (owner: 10Joal) [14:08:04] (03PS1) 10Jgiannelos: ci: Fix branch variable name [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/962624 [14:08:12] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43805/console" [puppet] - 10https://gerrit.wikimedia.org/r/962611 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [14:08:57] (03CR) 10Jgiannelos: "For some reason `branch` was `null` so the previous tag was not correct." [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/962624 (owner: 10Jgiannelos) [14:09:05] (03PS2) 10Joal: Bump mw-page-content-change-enrich parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/962620 [14:09:11] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [14:09:13] (03CR) 10Joal: Bump mw-page-content-change-enrich parallelism (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/962620 (owner: 10Joal) [14:09:14] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [14:09:40] (03PS1) 10Daimona Eaytoy: beta: Enable $wgCampaignEventsEnableParticipantQuestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962625 (https://phabricator.wikimedia.org/T339246) [14:14:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454 (10Jclark-ctr) a:03Jclark-ctr [14:14:38] (03CR) 10Jgiannelos: [C: 04-1] "It wont work because of the slashes on the branch name." [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/962624 (owner: 10Jgiannelos) [14:15:52] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[14:17:25] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[14:17:50] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[14:18:49] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1020.eqiad.wmnet with OS bullseye
[14:18:51] (CR) Fabfur: [V: +1 C: +2] purged: parametrize purged frontend and backend address [puppet] - https://gerrit.wikimedia.org/r/962611 (https://phabricator.wikimedia.org/T347837) (owner: Fabfur)
[14:19:00] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1020.eqiad.wmnet with OS bullseye c...
[14:19:08] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[14:19:44] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (Eevans)
[14:20:54] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1021.eqiad.wmnet with OS bullseye
[14:21:05] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1021.eqiad.wmnet with OS bullseye
[14:22:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:23:44] SRE, serviceops, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (Trizek-WMF) >>! In T345263#9215507, @kamila wrote: > The switchover went smoothly and had minimal user impact....
[14:23:49] !log importing into bullseye-wikimedia package purged_0.21+deb11u1_amd64 (T347837) [14:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:52] T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 [14:26:20] (03Abandoned) 10Jgiannelos: ci: Fix branch variable name [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/962624 (owner: 10Jgiannelos) [14:27:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:28:08] 10SRE, 10Traffic, 10Patch-For-Review: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur) [14:28:12] (03PS3) 10Jelto: Revert "gitlab: swap replica records" [dns] - 10https://gerrit.wikimedia.org/r/961709 [14:28:22] 10SRE, 10Traffic, 10Patch-For-Review: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur) [14:28:52] (03PS1) 10Jgiannelos: Revert "Prefix docker image tag with branch" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/962208 [14:34:21] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1021.eqiad.wmnet with reason: host reimage [14:37:25] (03CR) 10Jelto: [C: 03+2] Revert "gitlab: change service_name on replica hosts" [puppet] - 10https://gerrit.wikimedia.org/r/961710 (owner: 10Jelto) [14:37:30] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1021.eqiad.wmnet with reason: host reimage [14:40:57] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid public cluster: Roll restart of Druid jvm daemons. 
[14:43:05] (03CR) 10Effie Mouzeli: [C: 03+1] Revert "Prefix docker image tag with branch" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/962208 (owner: 10Jgiannelos) [14:44:03] (03CR) 10Jgiannelos: [C: 03+2] Revert "Prefix docker image tag with branch" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/962208 (owner: 10Jgiannelos) [14:44:08] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:44:47] (03Merged) 10jenkins-bot: Revert "Prefix docker image tag with branch" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/962208 (owner: 10Jgiannelos) [14:46:39] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding new host ganeti-test2004 - jhancock@cumin2002" [14:47:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding new host ganeti-test2004 - jhancock@cumin2002" [14:47:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:47:47] (03PS1) 10Esanders: DiscussionTools: Disable timestamp links in production initially [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962629 [14:48:28] (03CR) 10CI reject: [V: 04-1] DiscussionTools: Disable timestamp links in production initially [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962629 (owner: 10Esanders) [14:48:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [14:48:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti-test2004.mgmt.codfw.wmnet with reboot policy FORCED [14:48:47] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:49:26] (03PS1) 10Ssingh: dnsbox: revert ntp.anycast.wmnet to add custom check [puppet] - 10https://gerrit.wikimedia.org/r/962630 [14:50:11] (03PS1) 10Jgiannelos: tegola: Bump staging image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/962631 [14:51:35] !log upgrade purged package to version 0.21+deb11u1 on all cp hosts (T347837) [14:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:38] T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 [14:53:56] PROBLEM - Druid historical on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:54:02] (03PS1) 10Jbond: P:puppetserver: Add profile to create puppet-prometheus_reporter config [puppet] - 10https://gerrit.wikimedia.org/r/962632 [14:54:06] PROBLEM - Check systemd state on druid1009 is CRITICAL: CRITICAL - degraded: The following units failed: druid-historical.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:55:44] !log restart kubelet on ml-serve1001 (high latencies registered) [14:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:21] (03CR) 10CI reject: [V: 04-1] P:puppetserver: Add profile to create puppet-prometheus_reporter config [puppet] - 10https://gerrit.wikimedia.org/r/962632 (owner: 10Jbond) [14:56:34] (03CR) 10Kosta Harlan: [C: 03+1] Define wgReportIncidentEmailFromAddress on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962623 (https://phabricator.wikimedia.org/T339275) (owner: 10Dreamy Jazz) [14:57:24] (03CR) 10Ammarpad: add throttle rule for UIUC Wikipedia edit-a-thon October 13, 2023 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043) (owner: 10Anzx) 
[14:58:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:58:32] PROBLEM - Druid middlemanager on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:00:04] 10SRE, 10Traffic: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur) [15:00:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1229.eqiad.wmnet with OS bullseye [15:00:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db1229.eqiad.wmnet with OS bullseye executed with errors: - db1229 (**FAIL**)... [15:01:53] 10SRE, 10Traffic: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur) [15:02:20] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.timer,wmf_auto_restart_ssh-gitlab.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:42] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1021.eqiad.wmnet with OS bullseye [15:02:53] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1021.eqiad.wmnet with OS bullseye c... 
[15:03:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:04:30] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:04:30] (03PS1) 10Fabfur: Repackage for bookworm [software/purged] - 10https://gerrit.wikimedia.org/r/962635 [15:05:05] (03CR) 10Anzx: "thanks, fixed in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/962209" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043) (owner: 10Anzx) [15:09:46] (03Abandoned) 10Ssingh: dnsbox: revert ntp.anycast.wmnet to add custom check [puppet] - 10https://gerrit.wikimedia.org/r/962630 (owner: 10Ssingh) [15:09:48] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST bgppeers) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:10:35] (03PS1) 10EoghanGaffney: [gitlab/failover] Increase alert downtime duration [cookbooks] - 10https://gerrit.wikimedia.org/r/962636 [15:14:04] (03PS1) 10Ssingh: dnsbox: fix healthcheck for ntp.anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/962637 [15:14:10] (03CR) 10CI reject: [V: 04-1] [gitlab/failover] Increase alert downtime duration [cookbooks] - 10https://gerrit.wikimedia.org/r/962636 (owner: 10EoghanGaffney) [15:14:48] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (LIST bgppeers) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:16:06] (03CR) 10Ssingh: [C: 03+2] dnsbox: fix healthcheck for ntp.anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/962637 (owner: 10Ssingh) [15:16:42] (03PS1) 10Jbond: puppetserver: add logstash reporter [puppet] - 10https://gerrit.wikimedia.org/r/962638 (https://phabricator.wikimedia.org/T330490) [15:17:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43806/console" [puppet] - 10https://gerrit.wikimedia.org/r/962638 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:18:16] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: add logstash reporter [puppet] - 10https://gerrit.wikimedia.org/r/962638 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [15:18:27] (03CR) 10Jelto: [C: 03+2] Revert "gitlab: swap replica records" [dns] - 10https://gerrit.wikimedia.org/r/961709 (owner: 10Jelto) [15:18:34] RECOVERY - Druid historical on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:18:46] RECOVERY - Check systemd state on druid1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:47] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:18:54] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [15:19:10] RECOVERY - Druid middlemanager on druid1009 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [15:20:05] !log jelto@cumin1001 START - Cookbook sre.dns.wipe-cache https://gitlab-replica.wikimedia.org/ https://gitlab-replica-old.wikimedia.org/ on all recursors [15:20:09] !log jelto@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) https://gitlab-replica.wikimedia.org/ https://gitlab-replica-old.wikimedia.org/ on all recursors [15:21:08] (03CR) 10Brouberol: [C: 03+1] "Looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/962620 (owner: 10Joal) [15:23:23] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.failover (exit_code=0) Failover of gitlab from gitlab1003.wikimedia.org to gitlab2002.wikimedia.org [15:24:10] (JobUnavailable) firing: (7) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:36] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase1021.eqiad.wmnet [15:24:36] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1021.eqiad.wmnet [15:25:23] 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search, 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10Gehel) [15:25:32] (03CR) 10Ottomata: [C: 03+2] Bump mw-page-content-change-enrich parallelism [deployment-charts] - 10https://gerrit.wikimedia.org/r/962620 (owner: 10Joal) [15:26:49] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 
10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [15:27:05] 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10Gehel) [15:27:19] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase1028.eqiad.wmnet [15:27:19] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1028.eqiad.wmnet [15:27:37] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1028.eqiad.wmnet with OS bullseye [15:27:49] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1028.eqiad.wmnet with OS bullseye [15:28:38] !log joal@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [15:28:42] !log joal@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [15:28:53] (03PS2) 10Fabfur: Repackage for bookworm [software/purged] - 10https://gerrit.wikimedia.org/r/962635 (https://phabricator.wikimedia.org/T347837) [15:29:13] !log enable puppet on A:dns-rec and force agent run [15:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:05] jan_drewniak: Your horoscope predicts another unfortunate Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231002T1530). 
[15:32:08] RECOVERY - Host 10.3.0.2 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [15:34:21] (03PS1) 10Jbond: puppetserver: logstash reports [puppet] - 10https://gerrit.wikimedia.org/r/962640 (https://phabricator.wikimedia.org/T342125) [15:34:41] (03CR) 10Jbond: [C: 03+2] puppetserver: logstash reports [puppet] - 10https://gerrit.wikimedia.org/r/962640 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [15:36:56] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:38:09] * kostajh taavi: I missed your ping earlier, sorry. What do you mean by "I'm going to pull that myself, but please remember to do that in the future". Do you mean that I should run `scap backport` or do something else? [15:39:06] kostajh: that, or for beta-only patches pull /srv/mediawiki-staging by hand [15:39:19] (03PS2) 10Jbond: augeas_core: update augeas_core [puppet] - 10https://gerrit.wikimedia.org/r/962618 [15:40:47] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1028.eqiad.wmnet with reason: host reimage [15:41:12] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [15:41:12] taavi: ok, so `git pull` in `mediawiki-staging` after +2'ing a patch that affects only labs? 
[15:41:24] (03PS2) 10Jbond: prometheus_reporter: Add new reporter for providing prometheus metricts [puppet] - 10https://gerrit.wikimedia.org/r/962617 (https://phabricator.wikimedia.org/T342125) [15:41:26] (03PS2) 10Jbond: P:puppetserver: Add profile to create puppet-prometheus_reporter config [puppet] - 10https://gerrit.wikimedia.org/r/962632 [15:41:30] (03PS1) 10Jbond: puppetserver: Add prometheus to reporters [puppet] - 10https://gerrit.wikimedia.org/r/962641 (https://phabricator.wikimedia.org/T342125) [15:41:32] (03CR) 10Effie Mouzeli: [C: 03+1] tegola: Bump staging image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/962631 (owner: 10Jgiannelos) [15:42:03] (03CR) 10Jgiannelos: [C: 03+2] tegola: Bump staging image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/962631 (owner: 10Jgiannelos) [15:42:12] (03CR) 10Kosta Harlan: [C: 03+2] Define wgReportIncidentEmailFromAddress on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962623 (https://phabricator.wikimedia.org/T339275) (owner: 10Dreamy Jazz) [15:42:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43807/console" [puppet] - 10https://gerrit.wikimedia.org/r/962641 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [15:42:53] (03Merged) 10jenkins-bot: tegola: Bump staging image to latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/962631 (owner: 10Jgiannelos) [15:42:55] (03Merged) 10jenkins-bot: Define wgReportIncidentEmailFromAddress on beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962623 (https://phabricator.wikimedia.org/T339275) (owner: 10Dreamy Jazz) [15:42:59] (03PS1) 10Elukey: Remove ores.svc.{eqiad,codfw}.wmnet records [dns] - 10https://gerrit.wikimedia.org/r/962642 (https://phabricator.wikimedia.org/T347278) [15:43:18] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
restbase1028.eqiad.wmnet with reason: host reimage [15:43:44] (03CR) 10CI reject: [V: 04-1] P:puppetserver: Add profile to create puppet-prometheus_reporter config [puppet] - 10https://gerrit.wikimedia.org/r/962632 (owner: 10Jbond) [15:43:59] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: apply [15:44:33] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: apply [15:45:51] (03CR) 10Ssingh: [C: 03+1] Remove ores.svc.{eqiad,codfw}.wmnet records [dns] - 10https://gerrit.wikimedia.org/r/962642 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [15:46:08] (03PS3) 10Jbond: P:puppetserver: Add profile to create puppet-prometheus_reporter config [puppet] - 10https://gerrit.wikimedia.org/r/962632 (https://phabricator.wikimedia.org/T342125) [15:46:10] (03PS2) 10Jbond: puppetserver: Add prometheus to reporters [puppet] - 10https://gerrit.wikimedia.org/r/962641 (https://phabricator.wikimedia.org/T342125) [15:47:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43808/console" [puppet] - 10https://gerrit.wikimedia.org/r/962641 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [15:48:50] (03CR) 10Jbond: [C: 03+2] prometheus_reporter: Add new reporter for providing prometheus metricts [puppet] - 10https://gerrit.wikimedia.org/r/962617 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [15:48:52] (03CR) 10Jbond: [C: 03+2] P:puppetserver: Add profile to create puppet-prometheus_reporter config [puppet] - 10https://gerrit.wikimedia.org/r/962632 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [15:48:55] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: Add prometheus to reporters [puppet] - 10https://gerrit.wikimedia.org/r/962641 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [15:50:13] (03CR) 10Ssingh: [C: 03+1] Repackage for bookworm (031 comment) 
[software/purged] - 10https://gerrit.wikimedia.org/r/962635 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [15:53:13] (03PS3) 10Fabfur: Version 0.21+deb12u1 for bookworm [software/purged] - 10https://gerrit.wikimedia.org/r/962635 (https://phabricator.wikimedia.org/T347837) [15:53:48] (03PS4) 10Fabfur: Release 0.21+deb12u1 for bookworm [software/purged] - 10https://gerrit.wikimedia.org/r/962635 (https://phabricator.wikimedia.org/T347837) [15:54:34] (03PS5) 10Fabfur: Release 0.21+deb12u1 for bookworm [software/purged] - 10https://gerrit.wikimedia.org/r/962635 (https://phabricator.wikimedia.org/T347837) [15:55:07] (03CR) 10Fabfur: Release 0.21+deb12u1 for bookworm (031 comment) [software/purged] - 10https://gerrit.wikimedia.org/r/962635 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [15:55:11] (03CR) 10Ssingh: [C: 03+1] Release 0.21+deb12u1 for bookworm [software/purged] - 10https://gerrit.wikimedia.org/r/962635 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [15:55:53] (03PS1) 10Jgiannelos: tegola: Bump codfw to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/962645 [15:59:06] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10aborrero) [15:59:15] (03CR) 10Fabfur: [C: 03+2] Release 0.21+deb12u1 for bookworm [software/purged] - 10https://gerrit.wikimedia.org/r/962635 (https://phabricator.wikimedia.org/T347837) (owner: 10Fabfur) [15:59:46] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:50] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:59:50] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team, and 
3 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (10aborrero) [16:06:57] !log importing into bookworm-wikimedia package purged_0.21+deb12u1_amd64 (T347837) [16:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:06] T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 [16:08:29] 10SRE, 10Traffic, 10Patch-For-Review: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur) [16:08:38] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1028.eqiad.wmnet with OS bullseye [16:08:49] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1028.eqiad.wmnet with OS bullseye c... 
[16:10:43] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [16:10:56] 10ops-codfw, 10Content-Transform-Team, 10serviceops-radar, 10Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (10Jhancock.wm) 05Open→03Resolved [16:13:53] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1031.eqiad.wmnet with OS bullseye [16:14:52] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1031.eqiad.wmnet with OS bullseye [16:18:11] (03CR) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [16:20:57] (03PS1) 10Jbond: puppetserver: if using prometheus add puppet to the prometheus-node-exporter user [puppet] - 10https://gerrit.wikimedia.org/r/962647 (https://phabricator.wikimedia.org/T342125) [16:21:32] (03CR) 10Klausman: [C: 03+1] Remove ores.svc.{eqiad,codfw}.wmnet records [dns] - 10https://gerrit.wikimedia.org/r/962642 (https://phabricator.wikimedia.org/T347278) (owner: 10Elukey) [16:23:20] (03CR) 10CI reject: [V: 04-1] puppetserver: if using prometheus add puppet to the prometheus-node-exporter user [puppet] - 10https://gerrit.wikimedia.org/r/962647 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [16:24:04] (03PS2) 10Jbond: puppetserver: puppet to the prometheus-node-exporter user if needed [puppet] - 10https://gerrit.wikimedia.org/r/962647 (https://phabricator.wikimedia.org/T342125) [16:24:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 
(10VRiley-WMF) cp1106 - B 7. U 27. CableID 5021 port 31 cp1107 - B 7. U 28. CableID 5061 port 37 cp1108 - C 7. U 30. CableID 230304500235 port 36 cp1109 - C 7. U 31. CableID 4780 port 28 c... [16:25:40] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [16:26:36] (03CR) 10Jbond: [C: 03+2] puppetserver: puppet to the prometheus-node-exporter user if needed [puppet] - 10https://gerrit.wikimedia.org/r/962647 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [16:26:56] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1031.eqiad.wmnet with reason: host reimage [16:27:43] (03PS13) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) [16:29:24] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1031.eqiad.wmnet with reason: host reimage [16:30:29] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer (T347624, testing new cookbook changes) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet, repooling both afterwards w/ encryption [16:30:37] T347624: Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624 [16:31:00] (03CR) 10CI reject: [V: 04-1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [16:32:42] (03PS14) 10Ryan Kemper: wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 
(https://phabricator.wikimedia.org/T347624) [16:34:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:35:40] (03CR) 10CI reject: [V: 04-1] wdqs.data_transfer: refactor spicerack class api [cookbooks] - 10https://gerrit.wikimedia.org/r/961878 (https://phabricator.wikimedia.org/T347624) (owner: 10Ryan Kemper) [16:37:34] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [16:38:22] (03PS9) 10Stevemunene: airflow-wmde: configure wmde airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) [16:39:15] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T347624, testing new cookbook changes) xfer categories from wdqs2024.codfw.wmnet -> wdqs2025.codfw.wmnet, repooling both afterwards w/ encryption [16:39:19] T347624: Refactor sre.wdqs.data-transfer to use new spicerack class api - https://phabricator.wikimedia.org/T347624 [16:40:21] (03CR) 10Stevemunene: airflow-wmde: configure wmde airflow instance (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) (owner: 10Stevemunene) [16:42:18] (03PS10) 10Stevemunene: airflow-wmde: configure wmde airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) [16:45:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:01] (03PS1) 
10Ssingh: ntp: lower the warning and critical thresholds for check_ntp_peer [puppet] - 10https://gerrit.wikimedia.org/r/962648 [16:54:12] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43810/console" [puppet] - 10https://gerrit.wikimedia.org/r/962648 (owner: 10Ssingh) [16:55:05] (03PS1) 10Jforrester: wikifunctions: Use function-orchestrator image with better logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/962650 (https://phabricator.wikimedia.org/T346264) [16:55:23] (03CR) 10BBlack: [C: 03+1] ntp: lower the warning and critical thresholds for check_ntp_peer [puppet] - 10https://gerrit.wikimedia.org/r/962648 (owner: 10Ssingh) [16:55:37] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1031.eqiad.wmnet with OS bullseye [16:55:49] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1031.eqiad.wmnet with OS bullseye c... 
[16:56:08] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1022.eqiad.wmnet with OS bullseye [16:56:19] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1022.eqiad.wmnet with OS bullseye [16:58:38] (03PS2) 10Cathal Mooney: Interface automation: skip import of existing int IPs and VIPs [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/962614 (https://phabricator.wikimedia.org/T295774) [16:59:35] (03PS1) 10Elukey: admin_ng: bump cpu/memory limits for ml-serve's articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/962651 [17:00:06] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231002T1700) [17:00:06] ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231002T1700). 
[17:00:14] !log upgrade purged package to version 0.21+deb12u1 cp4052 (bookworm) (T347837) [17:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:27] T347837: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 [17:01:16] 10SRE, 10Traffic: Repackage purged for bullseye and bookworm - https://phabricator.wikimedia.org/T347837 (10Fabfur) [17:05:39] (03PS2) 10Elukey: admin_ng: bump cpu/memory limits for ml-serve's articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/962651 [17:09:05] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1022.eqiad.wmnet with reason: host reimage [17:12:12] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [17:12:19] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1022.eqiad.wmnet with reason: host reimage [17:12:55] (03CR) 10Elukey: [C: 03+2] admin_ng: bump cpu/memory limits for ml-serve's articlequality [deployment-charts] - 10https://gerrit.wikimedia.org/r/962651 (owner: 10Elukey) [17:14:46] (03PS1) 10Jbond: prometheus_reporter: switch to custom version [puppet] - 10https://gerrit.wikimedia.org/r/962652 [17:15:22] (03CR) 10Jbond: [C: 03+2] prometheus_reporter: switch to custom version [puppet] - 10https://gerrit.wikimedia.org/r/962652 (owner: 10Jbond) [17:17:22] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [17:17:31] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [17:17:39] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[17:17:48] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [17:18:06] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [17:18:27] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [17:23:20] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [17:23:51] (03CR) 10Ssingh: [V: 03+1 C: 03+2] ntp: lower the warning and critical thresholds for check_ntp_peer [puppet] - 10https://gerrit.wikimedia.org/r/962648 (owner: 10Ssingh) [17:24:13] !log sudo cumin "A:dns-rec" "disable-puppet 'merging CR 962648'" [17:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:01] (03PS1) 10Jbond: prometheus_reporter: update branch [puppet] - 10https://gerrit.wikimedia.org/r/962654 [17:30:01] !log A:dns-rec enable puppet and run agent [17:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:38] (03CR) 10Jbond: [C: 03+2] prometheus_reporter: update branch [puppet] - 10https://gerrit.wikimedia.org/r/962654 (owner: 10Jbond) [17:35:20] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:38:04] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1022.eqiad.wmnet with OS bullseye [17:38:18] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host 
restbase1022.eqiad.wmnet with OS bullseye c... [17:38:51] (03PS1) 10Jbond: prometheus_reporter: bump version [puppet] - 10https://gerrit.wikimedia.org/r/962655 [17:39:06] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:39:49] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye [17:39:55] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye [17:39:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [17:40:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [17:41:27] (03CR) 10Jbond: [C: 03+2] prometheus_reporter: bump version [puppet] - 10https://gerrit.wikimedia.org/r/962655 (owner: 10Jbond) [17:41:56] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:43:48] ^ bird.service refreshes, nothing to worry [17:50:33] (03PS1) 10Jbond: prometheus_reporter: refresh branch [puppet] - 10https://gerrit.wikimedia.org/r/962656 [17:51:45] (03CR) 10Jbond: [C: 03+2] prometheus_reporter: refresh branch [puppet] - 10https://gerrit.wikimedia.org/r/962656 (owner: 10Jbond) [17:52:03] 10SRE, 10Traffic, 10Patch-For-Review: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - 
https://phabricator.wikimedia.org/T347054 (10ssingh) `ntp.anycast.wmnet` exists and the VIP `10.3.0.2/32` is being announced from all DNS hosts. The next step is to merge http... [17:56:57] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [17:57:10] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [17:58:52] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [17:59:02] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase1022.eqiad.wmnet [17:59:03] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1022.eqiad.wmnet [17:59:51] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1023.eqiad.wmnet with OS bullseye [18:00:02] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1023.eqiad.wmnet with OS bullseye [18:00:22] (03PS1) 10Bking: flink: Add correct contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/962660 (https://phabricator.wikimedia.org/T341792) [18:01:02] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/962660 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [18:02:21] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [18:10:25] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:10:57] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [18:13:52] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1023.eqiad.wmnet with reason: host reimage [18:16:21] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1023.eqiad.wmnet with reason: host reimage [18:21:51] 10SRE, 10Traffic, 10Patch-For-Review: Alert on Varnish high thread count - https://phabricator.wikimedia.org/T323723 (10BCornwall) @Vgutierrez thoughts on this? Care to rebut? 
[18:35:24] (PS1) Jbond: prometheus_reporter: update with branch [puppet] - https://gerrit.wikimedia.org/r/962669 [18:38:43] (PS1) Fabfur: Add version print option [software/purged] - https://gerrit.wikimedia.org/r/962670 (https://phabricator.wikimedia.org/T347839) [18:39:14] (CR) Jbond: [C: +2] prometheus_reporter: update with branch [puppet] - https://gerrit.wikimedia.org/r/962669 (owner: Jbond) [18:39:16] (CR) Ebernhardson: [C: +1] flink: Add correct contactgroups [puppet] - https://gerrit.wikimedia.org/r/962660 (https://phabricator.wikimedia.org/T341792) (owner: Bking) [18:40:38] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1023.eqiad.wmnet with OS bullseye [18:40:49] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1023.eqiad.wmnet with OS bullseye c... 
[18:42:17] !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for restbase1023.eqiad.wmnet [18:42:18] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for restbase1023.eqiad.wmnet [18:43:08] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [18:43:22] (03PS1) 10Bartosz Dziewoński: Ignore only site notices [extensions/DismissableSiteNotice] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962212 (https://phabricator.wikimedia.org/T347645) [18:43:33] (03PS1) 10Bartosz Dziewoński: HookUtils: Fix checking page props [extensions/DiscussionTools] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962213 (https://phabricator.wikimedia.org/T347878) [18:43:37] (03CR) 10Bking: [C: 03+2] flink: Add correct contactgroups [puppet] - 10https://gerrit.wikimedia.org/r/962660 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [18:43:50] (03PS1) 10Bartosz Dziewoński: Fix diff title escaping [extensions/Wikibase] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962214 (https://phabricator.wikimedia.org/T347578) [18:44:00] (03PS1) 10Bartosz Dziewoński: Diff: Add missing .mw-diff-inline-moved selector [core] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962215 [18:44:01] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1024.eqiad.wmnet [18:44:36] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1024.eqiad.wmnet [18:54:17] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [18:56:03] (03CR) 10CI reject: [V: 04-1] Fix diff title escaping [extensions/Wikibase] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962214 (https://phabricator.wikimedia.org/T347578) (owner: 
10Bartosz Dziewoński) [18:56:08] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1024.eqiad.wmnet [18:56:09] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1024.eqiad.wmnet [19:00:04] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye [19:00:09] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye [19:00:12] (03PS1) 10Jbond: prometheus_reporter: sync to branch [puppet] - 10https://gerrit.wikimedia.org/r/962675 [19:00:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed with error... [19:00:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed with error... 
[19:00:27] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003.eqiad.wmnet'] [19:00:38] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-master1003.eqiad.wmnet'] [19:01:30] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [19:01:36] (03CR) 10Jbond: [C: 03+2] prometheus_reporter: sync to branch [puppet] - 10https://gerrit.wikimedia.org/r/962675 (owner: 10Jbond) [19:01:59] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1003.eqiad.wmnet with OS bullseye [19:02:01] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1029.eqiad.wmnet [19:02:06] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1029.eqiad.wmnet [19:02:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye [19:02:40] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1024.eqiad.wmnet with OS bullseye [19:02:53] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1024.eqiad.wmnet with OS bullseye [19:08:09] jouncebot: next [19:08:09] In 0 hour(s) and 51 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231002T2000) [19:09:08] i may have overloaded the window… if any deployer 
is around and feeling bored, we could start my maintenance scripts? (but it's not urgent, and if we run out of time, i'll just schedule them tomorrow instead) [19:11:28] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-master1004.eqiad.wmnet with OS bullseye [19:11:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye [19:13:04] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-master1003'] [19:13:16] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['an-master1003'] [19:13:34] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [19:15:37] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:16:39] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1024.eqiad.wmnet with reason: host reimage [19:17:05] 10SRE, 10Traffic, 10Patch-For-Review: Add version flag to purged - https://phabricator.wikimedia.org/T347839 (10Fabfur) 05Open→03In progress [19:19:22] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1024.eqiad.wmnet with reason: host reimage [19:22:07] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:22:23] (03PS1) 10Andrew Bogott: radosgw: remove a duplicate firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/962679 
(https://phabricator.wikimedia.org/T338937) [19:24:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.330 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:28:47] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:40:26] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1024.eqiad.wmnet with OS bullseye [19:40:37] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1024.eqiad.wmnet with OS bullseye c... [19:40:45] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1029.eqiad.wmnet [19:41:09] (03PS1) 10Bartosz Dziewoński: REST: Fix phpstan by creating LocalSettings.php [extensions/Wikibase] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962216 [19:41:15] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1029.eqiad.wmnet [19:41:24] (03PS2) 10Bartosz Dziewoński: Fix diff title escaping [extensions/Wikibase] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962214 (https://phabricator.wikimedia.org/T347578) [19:47:07] PROBLEM - Blazegraph Port for wdqs-categories on wdqs1016 is CRITICAL: connect to address 127.0.0.1 and port 9990: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [19:48:09] (PuppetConstantChange) firing: Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [19:48:54] (03PS2) 10DLynch: DiscussionTools: Disable timestamp links in production initially [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962629 (owner: 10Esanders) [19:49:30] (SystemdUnitFailed) firing: (3) wdqs-updater.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:49:39] (03CR) 10CI reject: [V: 04-1] DiscussionTools: Disable timestamp links in production initially [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962629 (owner: 10Esanders) [19:50:34] (03PS1) 10Jdlrobson: Promose several Wikipedias to Vector 2022 as default skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962684 (https://phabricator.wikimedia.org/T347321) [19:51:41] (03PS3) 10DLynch: DiscussionTools: Disable timestamp links in production initially [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962629 (owner: 10Esanders) [19:52:37] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [19:53:27] !log installing libvpx security updates [19:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:51] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase1029.eqiad.wmnet [19:53:52] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1029.eqiad.wmnet [19:54:20] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1029.eqiad.wmnet with OS bullseye [19:54:32] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1029.eqiad.wmnet with OS bullseye [20:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231002T2000). [20:00:06] danisztls, MatmaRex, and kemayo: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:21] 👋 My config patch is a no-op to prepare the way for something on the train, so merge it whenever in the process you'd like and I'll have nothing to test. [20:00:38] (03CR) 10Andrew Bogott: [C: 03+2] radosgw: remove a duplicate firewall rule [puppet] - 10https://gerrit.wikimedia.org/r/962679 (https://phabricator.wikimedia.org/T338937) (owner: 10Andrew Bogott) [20:00:48] hi [20:01:00] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1029.eqiad.wmnet with OS bullseye [20:01:10] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1029.eqiad.wmnet with OS bullseye e... 
[20:01:49] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1029.eqiad.wmnet with OS bullseye [20:01:59] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1029.eqiad.wmnet with OS bullseye [20:04:54] I can deploy [20:04:55] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [20:08:04] danisztls: are you around? [20:10:06] sry was late [20:10:10] o/ [20:10:46] np, I'm going to deploy yours and Kemayo's first, then MatmaRex [20:11:13] cool, thanks [20:11:21] !log kindrobot@deploy2002 Backport cancelled. [20:11:33] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1029.eqiad.wmnet with OS bullseye [20:11:43] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1029.eqiad.wmnet with OS bullseye e... 
[20:12:22] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [20:12:25] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:12:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962105 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) [20:12:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962629 (owner: 10Esanders) [20:13:10] (03PS2) 10Stef Dunlap: Undeploy Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962105 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) [20:13:17] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [20:13:20] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:13:48] (03PS4) 10Stef Dunlap: DiscussionTools: Disable timestamp links in production initially [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962629 (owner: 10Esanders) [20:13:56] (03CR) 10TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962105 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) [20:13:58] (03CR) 10TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962629 (owner: 10Esanders) [20:15:54] (03Merged) 10jenkins-bot: Undeploy Reader Demographics 2 pilot survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962105 (https://phabricator.wikimedia.org/T345951) (owner: 10DDesouza) [20:17:01] (03PS5) 10Stef Dunlap: DiscussionTools: Disable timestamp links in production initially [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962629 (owner: 
10Esanders) [20:17:16] (03CR) 10TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962629 (owner: 10Esanders) [20:18:10] (03Merged) 10jenkins-bot: DiscussionTools: Disable timestamp links in production initially [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962629 (owner: 10Esanders) [20:18:25] !log kindrobot@deploy2002 Started scap: Backport for [[gerrit:962105|Undeploy Reader Demographics 2 pilot survey (T345951)]], [[gerrit:962629|DiscussionTools: Disable timestamp links in production initially]] [20:18:29] T345951: Deploy pilot on enwiki for Global Readers Demographic Survey - https://phabricator.wikimedia.org/T345951 [20:19:46] !log kindrobot@deploy2002 esanders and dani and kindrobot: Backport for [[gerrit:962105|Undeploy Reader Demographics 2 pilot survey (T345951)]], [[gerrit:962629|DiscussionTools: Disable timestamp links in production initially]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:20:05] (03PS1) 10Eevans: install_server: add restbase1029 (as 3-ssd reuse) [puppet] - 10https://gerrit.wikimedia.org/r/962693 (https://phabricator.wikimedia.org/T331713) [20:20:15] danisztls: please confirm [20:20:52] kindrobot: looks good [20:21:11] Ok, syncing. 
[20:21:15] !log kindrobot@deploy2002 esanders and dani and kindrobot: Continuing with sync [20:21:43] thanks kindrobot [20:22:09] (CR) Eevans: [C: +2] install_server: add restbase1029 (as 3-ssd reuse) [puppet] - https://gerrit.wikimedia.org/r/962693 (https://phabricator.wikimedia.org/T331713) (owner: Eevans) [20:22:13] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1003.eqiad.wmnet with OS bullseye [20:22:22] SRE, ops-eqiad, DC-Ops, Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1003.eqiad.wmnet with OS bullseye executed with error... [20:23:37] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:23:58] MatmaRex: do you have the commands to run fixInconsistentRedirects.php on all wikis? 
[20:25:33] kindrobot: just `foreachwiki maintenance/fixInconsistentRedirects.php` with no arguments [20:25:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:25:44] (03PS3) 10Brion VIBBER: Video transcode update for experimental HLS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961864 (https://phabricator.wikimedia.org/T312152) [20:27:00] (03PS1) 10DCausse: rdf-streaming-updater: bump image version to flink-1.16.1-rdf-0.3.133 [deployment-charts] - 10https://gerrit.wikimedia.org/r/962694 (https://phabricator.wikimedia.org/T347515) [20:27:03] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1029.eqiad.wmnet with OS bullseye [20:27:15] !log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:962105|Undeploy Reader Demographics 2 pilot survey (T345951)]], [[gerrit:962629|DiscussionTools: Disable timestamp links in production initially]] (duration: 08m 49s) [20:27:19] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1029.eqiad.wmnet with OS bullseye [20:27:25] T345951: Deploy pilot on enwiki for Global Readers Demographic Survey - https://phabricator.wikimedia.org/T345951 [20:27:32] !log mw-page-content-change-enrich - increase replicas to 12 to process backlog - T347676 [20:27:32] (03PS4) 10Brion VIBBER: Drop old VP8 video transcodes, enable HLS on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/961864 (https://phabricator.wikimedia.org/T312152) [20:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:36] T347676: Partition 
reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 [20:28:00] MatmaRex: OK, and I should do that after syncing the 4 prior changes? Is it OK for me to sync those four changes together? [20:29:03] (03PS2) 10Eevans: install_server: utilize reuse recipe for restbase2027 [puppet] - 10https://gerrit.wikimedia.org/r/962048 (https://phabricator.wikimedia.org/T331713) [20:29:49] kindrobot: the maintenance scripts do not depend on the backports, so you can run them whenever [20:30:02] kindrobot: yes, okay to sync all together, each of these backport is an unrelated bug [20:30:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:30:46] OK. I'll start syncing them now. To be safe I'll run the scripts after. [20:31:41] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-master1004.eqiad.wmnet with OS bullseye [20:31:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, and 2 others: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-master1004.eqiad.wmnet with OS bullseye executed with error... [20:31:58] MatmaRex: Change '962214' has dependencies '[962216]', which are not merged or scheduled for backport [20:32:04] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1029.eqiad.wmnet with OS bullseye [20:32:17] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1029.eqiad.wmnet with OS bullseye e... 
[20:32:52] kindrobot: oh oops, yes, that should be backported too [20:33:01] it's a no-op in production, but it fixes a CI failure [20:33:28] sorry, i forgot to add it to the schedule after i noticed the problem [20:34:30] (SystemdUnitFailed) firing: (5) wdqs-blazegraph.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:34:52] OK. I'll +2 it [20:35:20] (CR) Stef Dunlap: [C: +2] REST: Fix phpstan by creating LocalSettings.php [extensions/Wikibase] (wmf/1.41.0-wmf.28) - https://gerrit.wikimedia.org/r/962216 (owner: Bartosz Dziewoński) [20:35:53] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1029.eqiad.wmnet with OS bullseye [20:36:02] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1029.eqiad.wmnet with OS bullseye [20:36:06] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1029.eqiad.wmnet with OS bullseye [20:36:15] SRE, Cassandra, Data-Persistence, Platform Engineering, Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1029.eqiad.wmnet with OS bullseye e... 
[20:37:23] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1029.eqiad.wmnet with OS bullseye [20:37:31] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1029.eqiad.wmnet with OS bullseye [20:37:34] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1029.eqiad.wmnet with OS bullseye [20:37:42] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1029.eqiad.wmnet with OS bullseye e... [20:40:46] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1029.eqiad.wmnet with OS bullseye [20:40:54] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host restbase1029.eqiad.wmnet with OS bullseye [20:40:58] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1029.eqiad.wmnet with OS bullseye [20:41:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:41:03] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1029.eqiad.wmnet with OS bullseye e... [20:42:06] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1029.eqiad.wmnet with OS bullseye [20:42:18] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1029.eqiad.wmnet with OS bullseye [20:46:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:49:17] (03Merged) 10jenkins-bot: REST: Fix phpstan by creating LocalSettings.php [extensions/Wikibase] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962216 (owner: 10Bartosz Dziewoński) [20:50:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [extensions/DismissableSiteNotice] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962212 (https://phabricator.wikimedia.org/T347645) (owner: 10Bartosz Dziewoński) [20:50:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [extensions/DiscussionTools] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962213 (https://phabricator.wikimedia.org/T347878) (owner: 10Bartosz Dziewoński) [20:50:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [extensions/Wikibase] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962214 (https://phabricator.wikimedia.org/T347578) (owner: 10Bartosz Dziewoński) [20:50:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 
using scap backport" [core] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962215 (owner: 10Bartosz Dziewoński) [20:52:13] (03Merged) 10jenkins-bot: Ignore only site notices [extensions/DismissableSiteNotice] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962212 (https://phabricator.wikimedia.org/T347645) (owner: 10Bartosz Dziewoński) [20:53:07] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:54:20] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10Bengali-Sites, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10Soda) >>! In T325607#9214743, @SCherukuwada wrote: > We met with Google to discuss this further. Google will provide more detail... [20:54:27] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.289 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:54:48] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1029.eqiad.wmnet with reason: host reimage [20:56:24] !log mw-page-content-change-enrich - increase replicas to 24 to process backlog - T347676 [20:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:28] T347676: Partition reassignment on kafka-jumbo negatively impacting mw-page-content-change-enrich - https://phabricator.wikimedia.org/T347676 [20:56:32] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [20:56:34] (03Merged) 10jenkins-bot: HookUtils: Fix checking page props [extensions/DiscussionTools] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962213 (https://phabricator.wikimedia.org/T347878) (owner: 10Bartosz Dziewoński) [20:56:36] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:56:53] (03PS6) 10Sbailey: Re-enable 
Extension:ParserMigration on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) [20:57:39] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [20:57:41] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:57:57] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1029.eqiad.wmnet with reason: host reimage [20:58:46] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [20:58:50] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [20:59:52] !log mw-page-content-change-enrich - CORRECTION - increase replicas to 20 to process backlog - T347676 [20:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231002T2100). 
[21:05:28] (03Merged) 10jenkins-bot: Fix diff title escaping [extensions/Wikibase] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962214 (https://phabricator.wikimedia.org/T347578) (owner: 10Bartosz Dziewoński) [21:05:32] (03Merged) 10jenkins-bot: Diff: Add missing .mw-diff-inline-moved selector [core] (wmf/1.41.0-wmf.28) - 10https://gerrit.wikimedia.org/r/962215 (owner: 10Bartosz Dziewoński) [21:07:45] !log kindrobot@deploy2002 Started scap: Backport for [[gerrit:962212|Ignore only site notices (T347645)]], [[gerrit:962213|HookUtils: Fix checking page props (T347878)]], [[gerrit:962214|Fix diff title escaping (T347578)]], [[gerrit:962215|Diff: Add missing .mw-diff-inline-moved selector]] [21:07:52] T347578: HTML tags in Entity diff title - https://phabricator.wikimedia.org/T347578 [21:07:52] T347645: Sitenotice hide button is displayed even if there is no sitenotice - https://phabricator.wikimedia.org/T347645 [21:07:52] T347878: Talk page empty state appears on pages with __NONEWSECTIONLINK__ - https://phabricator.wikimedia.org/T347878 [21:08:42] (03PS7) 10Sbailey: Re-enable Extension:ParserMigration on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) [21:08:59] !log kindrobot@deploy2002 kindrobot and matmarex: Backport for [[gerrit:962212|Ignore only site notices (T347645)]], [[gerrit:962213|HookUtils: Fix checking page props (T347878)]], [[gerrit:962214|Fix diff title escaping (T347578)]], [[gerrit:962215|Diff: Add missing .mw-diff-inline-moved selector]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:09:09] testing [21:09:15] MatmaRex: please confirm [21:09:24] (03CR) 10CI reject: [V: 04-1] Re-enable Extension:ParserMigration on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [21:10:10] kindrobot: all changes look good [21:10:22] MatmaRex: I'm seeing the following warnings Expectation 
(readQueryRows <= 10000) by MediaWiki::main not met (actual: 11254) in trx #2d10b8242e: [21:10:25] SELECT pi_property_id,pi_info FROM `wb_property_info` [21:10:37] (03CR) 10Sbailey: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [21:11:13] kindrobot: probably unrelated, that's a different kind of property than in my change 962213 [21:11:30] OK. Syncing [21:11:34] !log kindrobot@deploy2002 kindrobot and matmarex: Continuing with sync [21:12:48] that looks like a fairly common warning: https://logstash.wikimedia.org/goto/8aa3707780cb4ad622eb766ca3852e81 [21:14:42] kindrobot: do you still want to do the maintenance runs today, or should i reschedule for tomorrow? [21:15:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:17:51] !log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:962212|Ignore only site notices (T347645)]], [[gerrit:962213|HookUtils: Fix checking page props (T347878)]], [[gerrit:962214|Fix diff title escaping (T347578)]], [[gerrit:962215|Diff: Add missing .mw-diff-inline-moved selector]] (duration: 10m 06s) [21:17:58] T347578: HTML tags in Entity diff title - https://phabricator.wikimedia.org/T347578 [21:17:58] T347645: Sitenotice hide button is displayed even if there is no sitenotice - https://phabricator.wikimedia.org/T347645 [21:17:58] T347878: Talk page empty state appears on pages with __NONEWSECTIONLINK__ - https://phabricator.wikimedia.org/T347878 [21:18:56] Running maintenance scripts [21:20:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (DELETE pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:21:03] !log 
eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1029.eqiad.wmnet with OS bullseye [21:21:29] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [21:21:35] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1032.eqiad.wmnet [21:21:46] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1032.eqiad.wmnet [21:21:46] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1029.eqiad.wmnet with OS bullseye c... [21:22:08] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1032.eqiad.wmnet [21:22:12] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1032.eqiad.wmnet [21:23:52] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1032.eqiad.wmnet [21:23:56] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1032.eqiad.wmnet [21:24:39] MatmaRex: do you need the output of the first one? 
[21:25:28] kindrobot: not really, unless it printed some errors [21:28:11] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1029.eqiad.wmnet [21:28:53] !log eevans@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts restbase1029.eqiad.wmnet [21:29:00] (03PS1) 10Andrew Bogott: Add radosgw apis to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/962707 (https://phabricator.wikimedia.org/T276961) [21:30:59] MatmaRex: started the second maintenance script. I'll send it to you when it's done via phab. What's your phab username? [21:31:52] kindrobot: @matmarex, or you can just paste it on https://phabricator.wikimedia.org/T347218 , it doesn't print anything secret [21:32:00] thank you! [21:32:48] np [21:32:59] !log end UTC late backport window [21:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:34:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:34:25] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:35:03] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.796 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:35:33] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:35:43] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:37:45] (03PS1) 10Andrew Bogott: Add fake radosgw eqiad1 key data [labs/private] - 10https://gerrit.wikimedia.org/r/962709 (https://phabricator.wikimedia.org/T276961) [21:37:58] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add fake radosgw eqiad1 key data [labs/private] - 10https://gerrit.wikimedia.org/r/962709 (https://phabricator.wikimedia.org/T276961) (owner: 10Andrew Bogott) [21:44:19] PROBLEM - puppet last run on wdqs1016 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:46:45] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs1016 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:46:57] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1016 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:47:10] ^^ sorry for the alert spam [21:47:11] RECOVERY - Blazegraph process -wdqs-categories- on wdqs1016 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:47:31] RECOVERY - Blazegraph Port for wdqs-categories on wdqs1016 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:47:47] RECOVERY - WDQS SPARQL on wdqs1016 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [21:47:51] RECOVERY - Query Service HTTP Port on wdqs1016 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:49:20] (03PS1) 10Andrew Bogott: eqiad1: add caps for 
radosgw user [puppet] - 10https://gerrit.wikimedia.org/r/962713 (https://phabricator.wikimedia.org/T276961) [21:49:30] (SystemdUnitFailed) firing: (5) wdqs-blazegraph.service Failed on wdqs1016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:49:49] RECOVERY - puppet last run on wdqs1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:51:00] (03CR) 10Andrew Bogott: [C: 03+2] eqiad1: add caps for radosgw user [puppet] - 10https://gerrit.wikimedia.org/r/962713 (https://phabricator.wikimedia.org/T276961) (owner: 10Andrew Bogott) [21:51:05] (03CR) 10Cwhite: [C: 03+2] profile: enable wal on grafana sqlite db [puppet] - 10https://gerrit.wikimedia.org/r/961510 (https://phabricator.wikimedia.org/T345362) (owner: 10Cwhite) [21:51:48] (03PS2) 10Andrew Bogott: Add radosgw apis to eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/962707 (https://phabricator.wikimedia.org/T276961) [21:52:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:53:00] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [21:53:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s 
- https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:53:49] !log Deployed patch for T347704 [21:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:34] (KubernetesAPILatency) resolved: (7) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:00:35] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:00:56] !log eevans@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts restbase1032.eqiad.wmnet [22:01:26] !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase1032.eqiad.wmnet [22:01:52] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host restbase1032.eqiad.wmnet [22:04:37] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [22:09:30] !log eevans@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts restbase1032.eqiad.wmnet [22:16:36] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1032.eqiad.wmnet with OS bullseye [22:16:48] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1032.eqiad.wmnet with OS bullseye [22:26:32] (03PS2) 10Jforrester: wikifunctions: Use function-orchestrator image with better logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/962650 (https://phabricator.wikimedia.org/T346264) [22:26:34] (03PS1) 10Jforrester: wikifunctions: Begin split of function-evaluator into js and python services [deployment-charts] - 10https://gerrit.wikimedia.org/r/962716 (https://phabricator.wikimedia.org/T343388) [22:26:36] (03PS1) 10Jforrester: wikifunctions: Switch execution from main to language-specific evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/962717 (https://phabricator.wikimedia.org/T343388) [22:26:38] (03PS1) 10Jforrester: wikifunctions: Drop references to legacy main evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/962718 (https://phabricator.wikimedia.org/T343388) [22:26:40] (03PS1) 10Jforrester: wikifunctions: Drop lgeacy main evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/962719 (https://phabricator.wikimedia.org/T343388) [22:30:27] !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host restbase1032.eqiad.wmnet with OS bullseye [22:30:37] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1032.eqiad.wmnet with OS bullseye e... [22:30:39] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host restbase1032.eqiad.wmnet with OS bullseye [22:30:50] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host restbase1032.eqiad.wmnet with OS bullseye [22:36:46] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:38:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:38:34] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [22:41:42] jouncebot: refresh [22:41:43] I refreshed my knowledge about deployments. 
[22:41:50] jouncebot: nowandlater [22:41:54] jouncebot: nowandnext [22:41:54] For the next 0 hour(s) and 18 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231002T2100) [22:41:54] In 3 hour(s) and 18 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231003T0200) [22:43:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [22:43:19] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1032.eqiad.wmnet with reason: host reimage [22:44:13] jouncebot is running a new version of my ib3 library and from a python 2.11 container now. If you see it having issues, file a bug at https://phabricator.wikimedia.org/tag/jouncebot/ please and thank you.
[22:44:28] *python 3.11 [22:46:28] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1032.eqiad.wmnet with reason: host reimage [22:52:10] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1 [22:59:25] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Thumbor-k8s performance improvements - https://phabricator.wikimedia.org/T333445 (10tstarling) Some numbers to help us choose limits. JPEG width statistics. Number of JPEG images in a sample with width exceeding the bucket size. ` MariaDB... [23:01:39] puppet-merge locked? cc andrewbogott [23:02:00] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:09:10] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1032.eqiad.wmnet with OS bullseye [23:09:21] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host restbase1032.eqiad.wmnet with OS bullseye c... 
[23:13:12] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:13:20] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:14:33] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering, 10Patch-For-Review: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [23:24:55] (03PS1) 10Krinkle: Profiler: Enable logging of caught Redis exceptions to Logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962725 (https://phabricator.wikimedia.org/T347916) [23:27:43] (03CR) 10Tim Starling: [C: 04-1] thumbor: add imagemagick policy file (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/962061 (https://phabricator.wikimedia.org/T333445) (owner: 10Hnowlan) [23:28:47] (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:31:03] (03PS1) 10Superpes15: [enwiki] Throttle exemption for Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962729 (https://phabricator.wikimedia.org/T347874) [23:31:43] (03CR) 10CI reject: [V: 04-1] [enwiki] Throttle exemption for Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962729 (https://phabricator.wikimedia.org/T347874) (owner: 10Superpes15) [23:33:53] (03PS2) 10Superpes15: [enwiki] Throttle exemption for Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/962729 (https://phabricator.wikimedia.org/T347874) [23:47:30] (03PS4) 10Superpes15: [fiwiki] Add an editautoreviewprotected level 
protection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960201 (https://phabricator.wikimedia.org/T347069) [23:47:42] (03PS5) 10Superpes15: [fiwiki] Add an editautoreviewprotected level protection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/960201 (https://phabricator.wikimedia.org/T347069) [23:58:54] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [23:59:34] PROBLEM - Druid overlord on druid1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid