[00:00:46] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 3.1756730353684612s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:18:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 45.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:22:07] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:22:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:23:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 45.67% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:23:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.319 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:23:26] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:24:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50861 bytes in 0.221 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:38:25] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:38:48] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/977233
[00:38:50] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/977233 (owner: TrainBranchBot)
[00:56:36] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/977233 (owner: TrainBranchBot)
[01:58:05] PROBLEM - MD RAID on aqs1013 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[02:00:55] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:04:22] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[02:38:26] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:26] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:24:28] (KubernetesAPINotScrapable) firing: (2) k8s-staging@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[03:54:34] SRE, Traffic-Icebox, WMF-General-or-Unknown: Varnish: Mobile site redirect interferes with OAuth authorization process - https://phabricator.wikimedia.org/T74186 (Tgr)
[04:11:41] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:26] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:38:25] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:21:58] (PS3) KartikMistry: cxserver: Force 127.0.0.1 instead of localhost [deployment-charts] - https://gerrit.wikimedia.org/r/975740 (https://phabricator.wikimedia.org/T349118)
[05:23:46] (CR) KartikMistry: [C: +2] cxserver: Force 127.0.0.1 instead of localhost [deployment-charts] - https://gerrit.wikimedia.org/r/975740 (https://phabricator.wikimedia.org/T349118) (owner: KartikMistry)
[05:24:49] (Merged) jenkins-bot: cxserver: Force 127.0.0.1 instead of localhost [deployment-charts] - https://gerrit.wikimedia.org/r/975740 (https://phabricator.wikimedia.org/T349118) (owner: KartikMistry)
[05:29:56] (PS3) KartikMistry: Update cxserver to 2023-11-20-052250-production [deployment-charts] - https://gerrit.wikimedia.org/r/976369 (https://phabricator.wikimedia.org/T341458)
[05:33:26] (CR) KartikMistry: [C: +2] Update cxserver to 2023-11-20-052250-production [deployment-charts] - https://gerrit.wikimedia.org/r/976369 (https://phabricator.wikimedia.org/T341458) (owner: KartikMistry)
[05:34:28] (Merged) jenkins-bot: Update cxserver to 2023-11-20-052250-production [deployment-charts] - https://gerrit.wikimedia.org/r/976369 (https://phabricator.wikimedia.org/T341458) (owner: KartikMistry)
[05:43:01] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:43:24] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:58:28] (CR) Stevemunene: [C: +1] Setup kubeconfigs for spark-history/spark-history-test on dse-k8s [puppet] - https://gerrit.wikimedia.org/r/976733 (https://phabricator.wikimedia.org/T351711) (owner: Brouberol)
[06:04:37] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[06:05:21] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:06:04] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:12:16] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:12:45] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:16:57] !log Update cxserver to 2023-11-20-052250-production (T341458, T349118)
[06:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:04] T341458: Set MinT as the default for languages where it is optional but frequently used - https://phabricator.wikimedia.org/T341458
[06:17:04] T349118: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118
[06:27:02] (PS2) KartikMistry: Update Apertium to 2023-11-23-055425-production [deployment-charts] - https://gerrit.wikimedia.org/r/977183 (https://phabricator.wikimedia.org/T346997)
[06:33:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: Switch
[06:33:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Switch
[06:34:02] (PS1) Marostegui: Revert "mariadb: Promote db1119 to m2 master" [puppet] - https://gerrit.wikimedia.org/r/977199
[06:34:26] (CR) CI reject: [V: -1] Revert "mariadb: Promote db1119 to m2 master" [puppet] - https://gerrit.wikimedia.org/r/977199 (owner: Marostegui)
[06:37:43] (PS1) Marostegui: mariadb: Promote db1195 to m2 master [puppet] - https://gerrit.wikimedia.org/r/977319 (https://phabricator.wikimedia.org/T351863)
[06:37:51] (Abandoned) Marostegui: Revert "mariadb: Promote db1119 to m2 master" [puppet] - https://gerrit.wikimedia.org/r/977199 (owner: Marostegui)
[06:38:52] (CR) Marostegui: [C: +2] mariadb: Promote db1195 to m2 master [puppet] - https://gerrit.wikimedia.org/r/977319 (https://phabricator.wikimedia.org/T351863) (owner: Marostegui)
[06:40:16] !log Failover m2 from db1119 to db1195 - T351863
[06:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:26] T351863: Switchover m2 master db1119 -> db1195 - https://phabricator.wikimedia.org/T351863
[06:43:55] (PS1) KartikMistry: Update cxserver to 2023-11-24-152117-production [deployment-charts] - https://gerrit.wikimedia.org/r/977320 (https://phabricator.wikimedia.org/T351932)
[06:46:45] (PS1) Marostegui: db1119: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/977321 (https://phabricator.wikimedia.org/T351990)
[06:47:49] (CR) Marostegui: [C: +2] db1119: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/977321 (https://phabricator.wikimedia.org/T351990) (owner: Marostegui)
[06:52:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db2134.codfw.wmnet with OS bookworm
[07:08:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2134.codfw.wmnet with reason: host reimage
[07:11:20] (PS1) Giuseppe Lavagetto: Release 4.0.0 [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/977437
[07:11:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2134.codfw.wmnet with reason: host reimage
[07:18:17] (PS2) Marostegui: oathauth_users: Prepare for removal [puppet] - https://gerrit.wikimedia.org/r/977123 (https://phabricator.wikimedia.org/T348693)
[07:24:28] (KubernetesAPINotScrapable) firing: (2) k8s-staging@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[07:24:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2134.codfw.wmnet with OS bookworm
[07:28:46] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:20] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:39:40] (PS1) Kevin Bazira: ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/977234 (https://phabricator.wikimedia.org/T343123)
[07:41:36] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:41:53] (CR) Elukey: [C: +1] ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/977234 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[07:42:08] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:57:02] (CR) Kevin Bazira: [C: +2] "Thanks for the review :)" [deployment-charts] - https://gerrit.wikimedia.org/r/977234 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[07:58:05] (Merged) jenkins-bot: ml-services: update article-descriptions isvc image in the experimental namespace [deployment-charts] - https://gerrit.wikimedia.org/r/977234 (https://phabricator.wikimedia.org/T343123) (owner: Kevin Bazira)
[08:00:04] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231127T0800)
[08:00:04] sergi0 and Kizule: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:41] hello
[08:06:43] Any deployer around?
[08:07:07] i can deploy, since I have a patch of my own that I need to deploy too
[08:07:33] sergi0: ah, this is the patch we talked about last week :D
[08:07:55] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/976804 (https://phabricator.wikimedia.org/T308142) (owner: Sergio Gimeno)
[08:08:06] taavi: right, first in the week :)
[08:08:47] (Merged) jenkins-bot: GrowthExperiments: enable AddLink frontend for 16,17th rounds of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/976804 (https://phabricator.wikimedia.org/T308142) (owner: Sergio Gimeno)
[08:09:10] (CR) Brouberol: [C: +2] Setup kubeconfigs for spark-history/spark-history-test on dse-k8s [puppet] - https://gerrit.wikimedia.org/r/976733 (https://phabricator.wikimedia.org/T351711) (owner: Brouberol)
[08:09:42] !log taavi@deploy2002 Started scap: Backport for [[gerrit:976804|GrowthExperiments: enable AddLink frontend for 16,17th rounds of wikis (T308142 T308143)]]
[08:09:48] T308142: Deploy "add a link" to 16th round of wikis - https://phabricator.wikimedia.org/T308142
[08:09:48] T308143: Deploy "add a link" to 17th round of wikis - https://phabricator.wikimedia.org/T308143
[08:14:32] !log installing dpkg bugfix updates on bullseye
[08:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:49] !log taavi@deploy2002 taavi and sgimeno: Backport for [[gerrit:976804|GrowthExperiments: enable AddLink frontend for 16,17th rounds of wikis (T308142 T308143)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:18:52] finally.. sergi0: please test
[08:19:00] T308142: Deploy "add a link" to 16th round of wikis - https://phabricator.wikimedia.org/T308142
[08:19:00] T308143: Deploy "add a link" to 17th round of wikis - https://phabricator.wikimedia.org/T308143
[08:19:00] sure
[08:21:48] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:22:00] taavi: looks good on my end
[08:22:10] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:23:20] logstash seems ok too, syncing
[08:23:21] !log taavi@deploy2002 taavi and sgimeno: Continuing with sync
[08:23:26] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:23:30] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50862 bytes in 2.260 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:24:25] (CR) Majavah: [C: +1] oathauth_users: Prepare for removal [puppet] - https://gerrit.wikimedia.org/r/977123 (https://phabricator.wikimedia.org/T348693) (owner: Marostegui)
[08:24:34] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.467 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:26:38] SRE, Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (MoritzMuehlenhoff)
[08:27:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 47.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:29:36] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:976804|GrowthExperiments: enable AddLink frontend for 16,17th rounds of wikis (T308142 T308143)]] (duration: 19m 54s)
[08:29:44] (CR) Muehlenhoff: [C: +2] profile::mediawiki::php: Remove support for PHP 7.2 [puppet] - https://gerrit.wikimedia.org/r/976934 (owner: Muehlenhoff)
[08:29:47] T308142: Deploy "add a link" to 16th round of wikis - https://phabricator.wikimedia.org/T308142
[08:29:47] T308143: Deploy "add a link" to 17th round of wikis - https://phabricator.wikimedia.org/T308143
[08:32:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 47.56% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[08:32:32] sergi0: your patch is live
[08:32:59] cool, thanks for the assistance!
[08:33:57] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/966598 (https://phabricator.wikimedia.org/T348484) (owner: Majavah)
[08:35:16] (Merged) jenkins-bot: Add virtual domain mapping for OATHAuth [mediawiki-config] - https://gerrit.wikimedia.org/r/966598 (https://phabricator.wikimedia.org/T348484) (owner: Majavah)
[08:35:31] !log taavi@deploy2002 Started scap: Backport for [[gerrit:966598|Add virtual domain mapping for OATHAuth (T348484)]]
[08:35:39] T348484: Migrate OATHAuth to use a virtual database domain - https://phabricator.wikimedia.org/T348484
[08:36:57] !log taavi@deploy2002 taavi: Backport for [[gerrit:966598|Add virtual domain mapping for OATHAuth (T348484)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:37:30] !log taavi@deploy2002 taavi: Continuing with sync
[08:38:25] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:41:33] !log restart prometheus/k8s-staging in eqiad - T343529
[08:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:38] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529
[08:43:13] (CR) Majavah: [C: +2] interface: attempt to resolve ordering issues with tagged interfaces [puppet] - https://gerrit.wikimedia.org/r/971406 (owner: Majavah)
[08:43:24] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:966598|Add virtual domain mapping for OATHAuth (T348484)]] (duration: 07m 53s)
[08:43:29] T348484: Migrate OATHAuth to use a virtual database domain - https://phabricator.wikimedia.org/T348484
[08:44:13] (KubernetesAPINotScrapable) firing: (2) k8s-staging@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[08:44:22] (CR) Muehlenhoff: [C: +2] Add new Hiera option to be used to selectively test defs_from_etcd on nftables [puppet] - https://gerrit.wikimedia.org/r/977132 (https://phabricator.wikimedia.org/T348734) (owner: Muehlenhoff)
[08:44:45] (PS4) Muehlenhoff: Add support to write out blocked networks from requestctl [puppet] - https://gerrit.wikimedia.org/r/977145 (https://phabricator.wikimedia.org/T348734)
[08:45:43] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/686/con" [puppet] - https://gerrit.wikimedia.org/r/977095 (https://phabricator.wikimedia.org/T351799) (owner: Filippo Giunchedi)
[08:46:22] (CR) Filippo Giunchedi: [V: +1 C: +2] rsyslog: get ::conf to notify the correct instance [puppet] - https://gerrit.wikimedia.org/r/977095 (https://phabricator.wikimedia.org/T351799) (owner: Filippo Giunchedi)
[08:48:40] (PS6) Majavah: interface: new define for managing routing table names [puppet] - https://gerrit.wikimedia.org/r/976734
[08:48:42] (PS5) Majavah: interface::route: add support for cloud-private bgp routes [puppet] - https://gerrit.wikimedia.org/r/976800
[08:48:44] (PS3) Majavah: interface: new define for managing routing rules [puppet] - https://gerrit.wikimedia.org/r/976944
[08:49:14] (KubernetesAPINotScrapable) resolved: k8s-staging@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[08:49:56] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/687/con" [puppet] - https://gerrit.wikimedia.org/r/976944 (owner: Majavah)
[09:04:04] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/977145 (https://phabricator.wikimedia.org/T348734) (owner: Muehlenhoff)
[09:05:50] (PS1) Filippo Giunchedi: prometheus: remove unused parameter [puppet] - https://gerrit.wikimedia.org/r/977592 (https://phabricator.wikimedia.org/T351179)
[09:05:52] (PS1) Filippo Giunchedi: prometheus: use per-instance retention hiera variables [puppet] - https://gerrit.wikimedia.org/r/977593 (https://phabricator.wikimedia.org/T351179)
[09:12:01] (PS1) Filippo Giunchedi: hieradata: cap prometheus size for k8s and ops instances [puppet] - https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179)
[09:12:18] (PS2) Muehlenhoff: Enable requestctl-driven network blocks for sretest [puppet] - https://gerrit.wikimedia.org/r/977166 (https://phabricator.wikimedia.org/T348734)
[09:12:43] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/977175 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[09:15:48] (PS2) Filippo Giunchedi: hieradata: cap prometheus size for k8s and ops instances [puppet] - https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179)
[09:16:27] SRE, Traffic, Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (ayounsi) That's a great question. I don't think we have the resources to do an extensive investigation. I see 2 options: # either we only subtract the tunnel header from the default MSS...
[09:18:43] (CR) Muehlenhoff: [C: +2] ganeti: Add option to enable PKI-based RAPI cert [puppet] - https://gerrit.wikimedia.org/r/977175 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[09:19:12] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/689/con" [puppet] - https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[09:31:43] (PS1) Muehlenhoff: Switch ganeti-test to PKI [puppet] - https://gerrit.wikimedia.org/r/977597 (https://phabricator.wikimedia.org/T350686)
[09:35:23] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/977597 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[09:38:35] (PS1) Majavah: P:openstack::designate: use cloud-private for memcached [puppet] - https://gerrit.wikimedia.org/r/977598
[09:38:38] (PS1) Majavah: P:openstack::magnum: use cloud-private for memcached [puppet] - https://gerrit.wikimedia.org/r/977599
[09:42:56] (PS2) Majavah: P:openstack::designate: use cloud-private for memcached [puppet] - https://gerrit.wikimedia.org/r/977598
[09:42:58] (PS2) Majavah: P:openstack::magnum: use cloud-private for memcached [puppet] - https://gerrit.wikimedia.org/r/977599
[09:43:00] (PS1) Majavah: hieradata: eqiad1: permit memcached access via cloud-private [puppet] - https://gerrit.wikimedia.org/r/977600
[09:45:12] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/977599 (owner: Majavah)
[09:46:19] (CR) Muehlenhoff: [C: +2] Switch ganeti-test to PKI [puppet] - https://gerrit.wikimedia.org/r/977597 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[09:47:41] (CR) Filippo Giunchedi: [C: +2] prometheus: remove unused parameter [puppet] - https://gerrit.wikimedia.org/r/977592 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[09:51:27] PROBLEM - HTTPS Ganeti RAPI codfw on ganeti-test2003 is CRITICAL: connect to address ganeti-test01.svc.codfw.wmnet and port 5080: Connection refused https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon
[09:52:15] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/691/console" [puppet] - https://gerrit.wikimedia.org/r/977593 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[09:52:26] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45706
[09:52:57] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45706
[09:53:10] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 58485
[09:53:31] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 58485
[09:53:37] PROBLEM - HTTPS Ganeti RAPI eqiad on ganeti-test1001 is CRITICAL: connect to address ganeti-test01.svc.eqiad.wmnet and port 5080: Connection refused https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon
[09:54:52] (JobUnavailable) firing: Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:00:31] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:00:35] (CR) Majavah: [C: +1] "one question inline, otherwise LGTM" [puppet] - https://gerrit.wikimedia.org/r/977593 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[10:02:10] (CR) Majavah: [C: +1] hieradata: cap prometheus size for k8s and ops instances [puppet] - https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[10:02:13] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:02:14] (CR) Filippo Giunchedi: [V: +1] prometheus: use per-instance retention hiera variables (1 comment) [puppet] - https://gerrit.wikimedia.org/r/977593 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[10:04:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[10:06:05] (PS2) Filippo Giunchedi: prometheus: use per-instance retention hiera variables [puppet] - https://gerrit.wikimedia.org/r/977593 (https://phabricator.wikimedia.org/T351179)
[10:06:07] (PS3) Filippo Giunchedi: hieradata: cap prometheus size for k8s and ops instances [puppet] - https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179)
[10:06:21] SRE, Infrastructure-Foundations, serviceops-radar, Patch-For-Review, Puppet (Puppet 7.0): expose_puppet_certs: Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (CodeReviewBot) jynus merged https://gitlab.wikimedia.org/repos/sre/wmfbackups/-/merge_requests/5...
[10:06:24] (CR) Volans: "some questions inline" [puppet] - https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[10:06:30] (CR) Filippo Giunchedi: prometheus: use per-instance retention hiera variables (1 comment) [puppet] - https://gerrit.wikimedia.org/r/977593 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[10:08:43] (CR) Giuseppe Lavagetto: [C: +1] mw-jobrunner: increase replicas [deployment-charts] - https://gerrit.wikimedia.org/r/977209 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[10:09:48] (CR) Giuseppe Lavagetto: [C: +1] "we might have to fix something re: ingress but it looks like a good candidate for migration" [deployment-charts] - https://gerrit.wikimedia.org/r/977210 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[10:09:55] (PS3) Jcrespo: dbbackups: Unify configuration of checks and prepare for 0.8.4 [puppet] - https://gerrit.wikimedia.org/r/977208 (https://phabricator.wikimedia.org/T340741)
[10:10:40] (CR) Hnowlan: [C: +2] mw-jobrunner: increase replicas [deployment-charts] - https://gerrit.wikimedia.org/r/977209 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[10:10:42] (CR) Ladsgroup: [C: +1] dbbackups: Unify configuration of checks and prepare for 0.8.4 [puppet] - https://gerrit.wikimedia.org/r/977208 (https://phabricator.wikimedia.org/T340741) (owner: Jcrespo)
[10:10:56] (PS3) Filippo Giunchedi: prometheus: use per-instance retention hiera variables [puppet] - https://gerrit.wikimedia.org/r/977593 (https://phabricator.wikimedia.org/T351179)
[10:10:58] (PS4) Filippo Giunchedi: hieradata: cap prometheus size for k8s and ops instances [puppet] - https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179)
[10:11:16] (CR) Filippo Giunchedi: hieradata: cap prometheus size for k8s and ops instances (2 comments) [puppet] - https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[10:11:31] (Merged) jenkins-bot: mw-jobrunner: increase replicas [deployment-charts] - https://gerrit.wikimedia.org/r/977209 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[10:12:53] (CR) Jcrespo: [C: +2] dbbackups: Unify configuration of checks and prepare for 0.8.4 [puppet] - https://gerrit.wikimedia.org/r/977208 (https://phabricator.wikimedia.org/T340741) (owner: Jcrespo)
[10:16:31] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/693/con" [puppet] - https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[10:19:25] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:19:43] PROBLEM - Host ml-serve2007 is DOWN: PING CRITICAL - Packet loss = 100%
[10:20:21] (PS1) Ladsgroup: beta: Stop writing to the old pagelinks columns in simplewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/977602 (https://phabricator.wikimedia.org/T352010)
[10:20:32] (CR) Filippo Giunchedi: hieradata: cap prometheus size for k8s and ops instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[10:22:05] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:22:19] (CR) Ladsgroup: [C: +2] beta: Stop writing to the old pagelinks columns in simplewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/977602 (https://phabricator.wikimedia.org/T352010) (owner: Ladsgroup)
[10:23:38] (KubernetesCalicoDown) firing: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:23:38] (Merged) jenkins-bot: beta: Stop writing to the old pagelinks columns in simplewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/977602 (https://phabricator.wikimedia.org/T352010) (owner: Ladsgroup)
[10:36:42] (CR) Volans: [C: +1] "LGTM, one non-blocking question inline" [puppet] - https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[10:41:20] !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:43:24] (CR) Hnowlan: [C: +2] jobqueue: migrate thumbnailrender to k8s [deployment-charts] - https://gerrit.wikimedia.org/r/977210 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[10:44:33] (Merged) jenkins-bot: jobqueue: migrate thumbnailrender to k8s [deployment-charts] - https://gerrit.wikimedia.org/r/977210 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[10:47:24] !log hnowlan@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync
[10:47:37] !log hnowlan@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync
[10:47:54] (CR) Filippo Giunchedi: hieradata: cap prometheus size for k8s and ops instances (1 comment) [puppet] - https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[10:48:36] (PS1) Jcrespo: dbbackups: Execute the backup checks under the new Unix user backupscheck [puppet] - https://gerrit.wikimedia.org/r/977603
[10:49:05] (CR) CI reject: [V: -1] dbbackups: Execute the backup checks under the new Unix user backupscheck [puppet] - https://gerrit.wikimedia.org/r/977603 (owner: Jcrespo)
[10:50:19] !log hnowlan@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[10:50:24] (PS2) Jcrespo: dbbackups: Execute the backup checks under the new Unix user backupscheck [puppet] - https://gerrit.wikimedia.org/r/977603
[10:50:30] !log hnowlan@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[10:51:44] (PS3) Jcrespo: dbbackups: Execute the backup checks under the new Unix user backupscheck [puppet] - https://gerrit.wikimedia.org/r/977603
[10:52:06] (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/977603 (owner: Jcrespo)
[10:52:11] (CR) CI reject: [V: -1] dbbackups: Execute the backup checks under the new Unix user backupscheck [puppet] - https://gerrit.wikimedia.org/r/977603 (owner: Jcrespo)
[10:53:05] (PS4) Jcrespo: dbbackups: Execute the backup checks under the new Unix user backupscheck [puppet] - https://gerrit.wikimedia.org/r/977603
[10:53:28] (PS2) Ayounsi: Expose Netbox's BGP servers to Homer [software/homer/deploy] - https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649)
[10:53:58] (CR) Jcrespo: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/977603 (owner: Jcrespo)
[10:55:29] (CR) Ayounsi: "Thanks, no silly questions :)" [software/homer/deploy] - https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[10:56:24] SRE, Cloud-VPS, Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (fnegri) Open→Resolved > Hello, > > https://templatetransclusioncheck.toolforge.org/ > > https://templatetransclusioncheck.toolforge.org/?lang=de&name=Vo...
[10:57:53] (03PS5) 10Jcrespo: dbbackups: Execute the checks under the new Unix user backupcheck [puppet] - 10https://gerrit.wikimedia.org/r/977603 [10:58:31] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977603 (owner: 10Jcrespo) [10:58:52] !log powercycle ml-serve2007 (OEM/DIMM error registered in getsel) [10:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:45] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [11:00:00] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [11:00:07] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231127T1100) [11:00:50] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:01:16] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:01:59] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 185, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:02:03] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:02:05] RECOVERY - Host ml-serve2007 is UP: PING OK - Packet loss = 0%, RTA = 33.25 ms [11:02:32] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:03:38] (KubernetesCalicoDown) resolved: ml-serve2007.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2007.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:04:24] (03PS6) 10Jcrespo: dbbackups: Execute the checks under the new Unix user backupcheck [puppet] - 10https://gerrit.wikimedia.org/r/977603 [11:05:42] (03CR) 10Jcrespo: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/977603 (owner: 10Jcrespo) [11:07:05] (03CR) 10Volans: [C: 03+1] "I didn't test it but the logic makes sense, minor comments inline" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/976749 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [11:08:49] (03CR) 10Jcrespo: [C: 03+1] "Looks good: https://puppet-compiler.wmflabs.org/output/977603/2703/backupmon1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/977603 (owner: 10Jcrespo) [11:13:33] (03PS1) 10Ilias Sarantopoulos: ml-services: update articlequality and articletopic to kserve 0.11.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/977605 (https://phabricator.wikimedia.org/T351633) [11:15:41] (03PS1) 10Hnowlan: changeprop-jobqueue: increase ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/977626 [11:16:38] (03CR) 10Kevin Bazira: [C: 03+1] ml-services: update articlequality and articletopic to kserve 0.11.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/977605 (https://phabricator.wikimedia.org/T351633) (owner: 10Ilias Sarantopoulos) [11:23:51] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update articlequality and articletopic to kserve 0.11.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/977605 (https://phabricator.wikimedia.org/T351633) (owner: 10Ilias Sarantopoulos) [11:24:40] (03Merged) 10jenkins-bot: ml-services: update articlequality and articletopic to kserve 0.11.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/977605 (https://phabricator.wikimedia.org/T351633) (owner: 10Ilias Sarantopoulos) [11:25:09] (03CR) 10Clément Goubert: [C: 03+1] changeprop-jobqueue: increase ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/977626 (owner: 10Hnowlan) [11:26:11] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: increase ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/977626 (owner: 
10Hnowlan) [11:27:38] (03Merged) 10jenkins-bot: changeprop-jobqueue: increase ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/977626 (owner: 10Hnowlan) [11:29:07] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:29:32] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:30:17] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:30:38] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:31:09] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [11:34:36] (03CR) 10Jbond: "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/976944 (owner: 10Majavah) [11:35:20] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [11:35:35] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/976734 (owner: 10Majavah) [11:35:43] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [11:36:03] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[11:36:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/976800 (owner: 10Majavah) [11:39:34] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/977603 (owner: 10Jcrespo) [11:40:55] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Execute the checks under the new Unix user backupcheck [puppet] - 10https://gerrit.wikimedia.org/r/977603 (owner: 10Jcrespo) [11:41:38] (03PS1) 10Kamila Součková: mobileapps: increase replicas to 114 [deployment-charts] - 10https://gerrit.wikimedia.org/r/977628 (https://phabricator.wikimedia.org/T350846) [11:43:49] (03PS1) 10Urbanecm: Compress geui_data json blobs [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977613 (https://phabricator.wikimedia.org/T351898) [11:45:14] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [11:45:24] (03PS1) 10Urbanecm: UserImpact: Make smaller SQL queries [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977629 (https://phabricator.wikimedia.org/T351898) [11:45:34] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [11:45:52] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [11:47:52] hi Amir1, i'll be backporting patches for T351898 sometime today (thanks for filing the task). While doing that, do you want me to check for anything else than "the Impact module still works"? [11:47:53] T351898: Reduce size of growthexperiments_user_impact.geui_data json blobs - https://phabricator.wikimedia.org/T351898 [11:50:55] urbanecm: no. Don't worry about our part ^_^ [11:51:22] Ack, just double checking :). [11:54:53] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[11:55:03] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [11:55:12] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [11:57:54] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [11:58:30] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [11:58:41] (03PS1) 10Btullis: Upgrade airflow on analytics_test [puppet] - 10https://gerrit.wikimedia.org/r/977631 (https://phabricator.wikimedia.org/T351621) [11:58:43] (03PS1) 10Btullis: Upgrade airflow on the analytics instance [puppet] - 10https://gerrit.wikimedia.org/r/977632 (https://phabricator.wikimedia.org/T351621) [11:58:47] (03PS1) 10Btullis: Upgrade airflow on the search instance [puppet] - 10https://gerrit.wikimedia.org/r/977633 (https://phabricator.wikimedia.org/T351621) [11:58:50] (03PS1) 10Btullis: Upgrade airflow on the research instance [puppet] - 10https://gerrit.wikimedia.org/r/977634 (https://phabricator.wikimedia.org/T351621) [11:58:52] (03PS1) 10Btullis: Upgrade airflow on the platform_eng instance [puppet] - 10https://gerrit.wikimedia.org/r/977635 (https://phabricator.wikimedia.org/T351621) [11:58:55] (03PS1) 10Btullis: Upgrade airflow on the analytics_product [puppet] - 10https://gerrit.wikimedia.org/r/977636 (https://phabricator.wikimedia.org/T351621) [11:58:57] (03PS1) 10Btullis: Upgrade airflow on wmde [puppet] - 10https://gerrit.wikimedia.org/r/977637 (https://phabricator.wikimedia.org/T351621) [11:59:00] (03PS1) 10Btullis: Update default airflow_version and remove overrides [puppet] - 10https://gerrit.wikimedia.org/r/977638 (https://phabricator.wikimedia.org/T351621) [12:00:54] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/694/con" [puppet] - 
10https://gerrit.wikimedia.org/r/977631 (https://phabricator.wikimedia.org/T351621) (owner: 10Btullis) [12:02:31] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/695/con" [puppet] - 10https://gerrit.wikimedia.org/r/977632 (https://phabricator.wikimedia.org/T351621) (owner: 10Btullis) [12:03:59] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:04:06] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:04:10] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [12:04:59] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/696/con" [puppet] - 10https://gerrit.wikimedia.org/r/977633 (https://phabricator.wikimedia.org/T351621) (owner: 10Btullis) [12:05:24] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply [12:05:26] * kart_ updating cxserver.. 
[12:05:48] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [12:05:54] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2023-11-24-152117-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/977320 (https://phabricator.wikimedia.org/T351932) (owner: 10KartikMistry) [12:06:12] (03CR) 10CI reject: [V: 04-1] UserImpact: Make smaller SQL queries [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977629 (https://phabricator.wikimedia.org/T351898) (owner: 10Urbanecm) [12:06:27] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/697/con" [puppet] - 10https://gerrit.wikimedia.org/r/977634 (https://phabricator.wikimedia.org/T351621) (owner: 10Btullis) [12:06:42] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [12:06:45] (03Merged) 10jenkins-bot: Update cxserver to 2023-11-24-152117-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/977320 (https://phabricator.wikimedia.org/T351932) (owner: 10KartikMistry) [12:07:38] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [12:08:18] !log jayme@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [12:08:22] (03CR) 10Jcrespo: [C: 03+2] "It didn't work:" [puppet] - 10https://gerrit.wikimedia.org/r/977603 (owner: 10Jcrespo) [12:08:29] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[12:08:30] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:08:42] !log jayme@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [12:08:52] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:09:55] (03CR) 10Urbanecm: [C: 03+2] Compress geui_data json blobs [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977613 (https://phabricator.wikimedia.org/T351898) (owner: 10Urbanecm) [12:12:00] (03CR) 10Majavah: [C: 03+2] interface: new define for managing routing table names [puppet] - 10https://gerrit.wikimedia.org/r/976734 (owner: 10Majavah) [12:12:50] (03PS1) 10Sergio Gimeno: GrowthExperiments: enable frontend for 15th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977644 (https://phabricator.wikimedia.org/T308141) [12:13:01] (03PS2) 10Urbanecm: UserImpact: Make smaller SQL queries [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977629 (https://phabricator.wikimedia.org/T351898) [12:13:03] (03PS1) 10Urbanecm: User impact: timezone cleanup [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977645 (https://phabricator.wikimedia.org/T329700) [12:13:41] (03PS6) 10Majavah: interface::route: add support for cloud-private bgp routes [puppet] - 10https://gerrit.wikimedia.org/r/976800 [12:13:43] (03PS4) 10Majavah: interface: new define for managing routing rules [puppet] - 10https://gerrit.wikimedia.org/r/976944 [12:13:49] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977644 (https://phabricator.wikimedia.org/T308141) (owner: 10Sergio Gimeno) [12:13:56] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:14:02] (03CR) 10Majavah: interface::route: add support for cloud-private bgp routes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/976800 (owner: 10Majavah) [12:14:31] 
!log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:14:40] (03CR) 10Urbanecm: [C: 03+2] User impact: timezone cleanup [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977645 (https://phabricator.wikimedia.org/T329700) (owner: 10Urbanecm) [12:14:44] (03CR) 10Urbanecm: [C: 03+2] UserImpact: Make smaller SQL queries [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977629 (https://phabricator.wikimedia.org/T351898) (owner: 10Urbanecm) [12:15:28] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:15:56] (03CR) 10Majavah: interface: new define for managing routing rules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/976944 (owner: 10Majavah) [12:15:57] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:17:05] (03CR) 10Majavah: [C: 03+2] interface::route: add support for cloud-private bgp routes [puppet] - 10https://gerrit.wikimedia.org/r/976800 (owner: 10Majavah) [12:18:27] !log Updated cxserver to 2023-11-24-152117-production (T351932) [12:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:32] T351932: Gallery adaptation fails with updated MW Dom Spec - https://phabricator.wikimedia.org/T351932 [12:22:13] (03CR) 10Btullis: [V: 03+1 C: 03+2] Upgrade airflow on analytics_test [puppet] - 10https://gerrit.wikimedia.org/r/977631 (https://phabricator.wikimedia.org/T351621) (owner: 10Btullis) [12:22:34] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/698/con" [puppet] - 10https://gerrit.wikimedia.org/r/977635 (https://phabricator.wikimedia.org/T351621) (owner: 10Btullis) [12:23:26] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:23:47] (03CR) 10Jcrespo: [C: 03+2] "It works if I visudo manually:" [puppet] - 10https://gerrit.wikimedia.org/r/977603 (owner: 10Jcrespo) [12:23:58] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/699/con" [puppet] - 10https://gerrit.wikimedia.org/r/977636 (https://phabricator.wikimedia.org/T351621) (owner: 10Btullis) [12:24:44] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977235 [12:29:41] RECOVERY - HTTPS Ganeti RAPI codfw on ganeti-test2003 is OK: HTTP OK: Status line output matched 401 - 308 bytes in 0.013 second response time https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [12:29:57] 10SRE-swift-storage, 10Commons: Incomplete files uploaded (10 MB interruption) - https://phabricator.wikimedia.org/T350917 (10MatthewVernon) Picking a recent failure: ` mvernon@cumin1001:~$ sudo cumin -x --force --no-progress --no-color -o txt O:swift::proxy "zgrep -F '0/0e/Wikidata_43.jpg' /var/log/swift/prox... 
[12:30:16] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 5 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/977638 (https://phabricator.wikimedia.org/T351621) (owner: 10Btullis) [12:30:26] 10SRE-swift-storage, 10Commons, 10UploadWizard: Incomplete files uploaded (10 MB interruption) - https://phabricator.wikimedia.org/T350917 (10MatthewVernon) [12:30:37] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:39] (03PS5) 10Jbond: interface: new define for managing routing rules [puppet] - 10https://gerrit.wikimedia.org/r/976944 (owner: 10Majavah) [12:31:37] (03CR) 10Jbond: [C: 03+1] "lgtm (i added a minor fix)" [puppet] - 10https://gerrit.wikimedia.org/r/976944 (owner: 10Majavah) [12:32:53] (03Merged) 10jenkins-bot: Compress geui_data json blobs [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977613 (https://phabricator.wikimedia.org/T351898) (owner: 10Urbanecm) [12:33:01] (03CR) 10CI reject: [V: 04-1] User impact: timezone cleanup [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977645 (https://phabricator.wikimedia.org/T329700) (owner: 10Urbanecm) [12:33:07] (03CR) 10CI reject: [V: 04-1] UserImpact: Make smaller SQL queries [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977629 (https://phabricator.wikimedia.org/T351898) (owner: 10Urbanecm) [12:33:11] (03CR) 10Urbanecm: User impact: timezone cleanup [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977645 (https://phabricator.wikimedia.org/T329700) (owner: 10Urbanecm) [12:33:27] (03CR) 10Urbanecm: [C: 03+2] "flaky browser test?" 
[extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977645 (https://phabricator.wikimedia.org/T329700) (owner: 10Urbanecm) [12:33:32] (03CR) 10Urbanecm: [C: 03+2] "flaky browser test?" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977629 (https://phabricator.wikimedia.org/T351898) (owner: 10Urbanecm) [12:34:02] !log jayme@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [12:34:17] !log jayme@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [12:35:03] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:39:36] (03CR) 10Jcrespo: [C: 03+2] "Will send patch with fix." [puppet] - 10https://gerrit.wikimedia.org/r/977603 (owner: 10Jcrespo) [12:42:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977645 (https://phabricator.wikimedia.org/T329700) (owner: 10Urbanecm) [12:42:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977629 (https://phabricator.wikimedia.org/T351898) (owner: 10Urbanecm) [12:42:44] (03PS1) 10Kamila Součková: Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/977659 (https://phabricator.wikimedia.org/T351074) [12:45:24] (03PS1) 10Kamila Součková: Move mw api servers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/977660 (https://phabricator.wikimedia.org/T351074) [12:45:40] (03PS1) 10Muehlenhoff: Pass RAPI key/cert depending on whether configured for PKI or not [puppet] - 10https://gerrit.wikimedia.org/r/977661 (https://phabricator.wikimedia.org/T350686) [12:46:49] 
(03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/703/con" [puppet] - 10https://gerrit.wikimedia.org/r/977600 (owner: 10Majavah) [12:47:49] (03PS2) 10Majavah: hieradata: eqiad1: permit memcached access via cloud-private [puppet] - 10https://gerrit.wikimedia.org/r/977600 [12:47:51] (03PS3) 10Majavah: P:openstack::designate: use cloud-private for memcached [puppet] - 10https://gerrit.wikimedia.org/r/977598 [12:47:53] (03PS3) 10Majavah: P:openstack::magnum: use cloud-private for memcached [puppet] - 10https://gerrit.wikimedia.org/r/977599 [12:48:45] (03PS1) 10Jcrespo: dbbackups: Remove quotes from sudo command due to sudoers flattening [puppet] - 10https://gerrit.wikimedia.org/r/977662 [12:49:32] (03CR) 10Jcrespo: [C: 03+2] "Followup: https://gerrit.wikimedia.org/r/c/operations/puppet/+/977662" [puppet] - 10https://gerrit.wikimedia.org/r/977603 (owner: 10Jcrespo) [12:49:56] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/704/con" [puppet] - 10https://gerrit.wikimedia.org/r/977600 (owner: 10Majavah) [12:50:27] (03CR) 10Majavah: [C: 03+2] interface: new define for managing routing rules [puppet] - 10https://gerrit.wikimedia.org/r/976944 (owner: 10Majavah) [12:51:07] (03PS2) 10Jcrespo: dbbackups: Remove quotes from sudo command due to sudoers flattening [puppet] - 10https://gerrit.wikimedia.org/r/977662 [12:51:10] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 47.6% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:51:15] (03PS3) 10Jcrespo: dbbackups: Remove quotes from sudo command due to 
sudoers flattening [puppet] - 10https://gerrit.wikimedia.org/r/977662 [12:51:22] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977662 (owner: 10Jcrespo) [12:52:42] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: use per-instance retention hiera variables [puppet] - 10https://gerrit.wikimedia.org/r/977593 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [12:56:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 47.6% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [12:56:20] (03Merged) 10jenkins-bot: User impact: timezone cleanup [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977645 (https://phabricator.wikimedia.org/T329700) (owner: 10Urbanecm) [12:56:23] (03Merged) 10jenkins-bot: UserImpact: Make smaller SQL queries [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977629 (https://phabricator.wikimedia.org/T351898) (owner: 10Urbanecm) [12:56:41] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:977613|Compress geui_data json blobs (T351898)]], [[gerrit:977645|User impact: timezone cleanup (T329700)]], [[gerrit:977629|UserImpact: Make smaller SQL queries (T351898)]] [12:56:47] T351898: Reduce size of growthexperiments_user_impact.geui_data json blobs - https://phabricator.wikimedia.org/T351898 [12:56:48] T329700: Clean up GrowthExperiments UserImpact timezone handling - https://phabricator.wikimedia.org/T329700 [12:58:29] (03PS1) 10Filippo Giunchedi: prometheus: fix storage_retention_size for k8s [puppet] - 10https://gerrit.wikimedia.org/r/977667 (https://phabricator.wikimedia.org/T351179) [12:58:53] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] prometheus: fix 
storage_retention_size for k8s [puppet] - 10https://gerrit.wikimedia.org/r/977667 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [13:03:14] (03CR) 10Muehlenhoff: admin: reserve uid/gid for authdns user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977252 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [13:03:31] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977661 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [13:04:18] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:977613|Compress geui_data json blobs (T351898)]], [[gerrit:977645|User impact: timezone cleanup (T329700)]], [[gerrit:977629|UserImpact: Make smaller SQL queries (T351898)]] (duration: 07m 37s) [13:04:26] T351898: Reduce size of growthexperiments_user_impact.geui_data json blobs - https://phabricator.wikimedia.org/T351898 [13:04:26] T329700: Clean up GrowthExperiments UserImpact timezone handling - https://phabricator.wikimedia.org/T329700 [13:05:28] (03PS5) 10Filippo Giunchedi: hieradata: cap prometheus size for ops instance [puppet] - 10https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179) [13:06:14] (03CR) 10Filippo Giunchedi: hieradata: cap prometheus size for ops instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [13:06:27] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: cap prometheus size for ops instance [puppet] - 10https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [13:06:41] (03PS1) 10Clément Goubert: mw-web, mw-api-int: Add replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/977669 (https://phabricator.wikimedia.org/T350430) [13:07:49] (03CR) 10Clément Goubert: [C: 03+2] mw-web, mw-api-int: Add replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/977669 
(https://phabricator.wikimedia.org/T350430) (owner: 10Clément Goubert) [13:08:37] (03Merged) 10jenkins-bot: mw-web, mw-api-int: Add replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/977669 (https://phabricator.wikimedia.org/T350430) (owner: 10Clément Goubert) [13:09:31] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [13:09:43] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [13:09:50] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [13:12:10] (03PS1) 10Filippo Giunchedi: hieradata: restore default retention for k8s prometheus [puppet] - 10https://gerrit.wikimedia.org/r/977670 (https://phabricator.wikimedia.org/T351179) [13:12:30] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] hieradata: restore default retention for k8s prometheus [puppet] - 10https://gerrit.wikimedia.org/r/977670 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [13:12:46] (03CR) 10Clément Goubert: [C: 03+1] Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/977659 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [13:15:31] (03CR) 10Volans: hieradata: cap prometheus size for ops instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [13:18:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/977661 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [13:19:27] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [13:19:34] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [13:19:45] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [13:19:51] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [13:19:55] (03PS1) 10Slyngshede: Ensure that build 
directories are cleaned up [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/977672 (https://phabricator.wikimedia.org/T348974) [13:20:00] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [13:20:52] (03Abandoned) 10Kamila Součková: mobileapps: increase replicas to 114 [deployment-charts] - 10https://gerrit.wikimedia.org/r/977628 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [13:20:56] (03CR) 10Muehlenhoff: [C: 03+2] Pass RAPI key/cert depending on whether configured for PKI or not [puppet] - 10https://gerrit.wikimedia.org/r/977661 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [13:21:13] (03CR) 10Klausman: [C: 03+1] "One nit, otherwise LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977214 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [13:23:22] (03PS1) 10Urbanecm: UserImpact: Bump VERSION to 10 [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977614 (https://phabricator.wikimedia.org/T329700) [13:24:52] (03CR) 10Urbanecm: [C: 03+2] UserImpact: Bump VERSION to 10 [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977614 (https://phabricator.wikimedia.org/T329700) (owner: 10Urbanecm) [13:25:18] (03PS5) 10Jbond: yamllint: First stab at adding yamllint CI tests [puppet] - 10https://gerrit.wikimedia.org/r/557060 (https://phabricator.wikimedia.org/T236954) [13:25:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977614 (https://phabricator.wikimedia.org/T329700) (owner: 10Urbanecm) [13:26:58] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host druid1007.eqiad.wmnet with OS bullseye [13:28:26] (03PS1) 10Jbond: Merge branch 'master' into 2.x [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/977673 
[13:29:00] (03CR) 10CI reject: [V: 04-1] yamllint: First stab at adding yamllint CI tests [puppet] - 10https://gerrit.wikimedia.org/r/557060 (https://phabricator.wikimedia.org/T236954) (owner: 10Jbond) [13:29:27] (03PS1) 10Jbond: Merge branch 'master' into 2.x [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/977674 [13:30:45] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:31:14] (03CR) 10Jbond: [C: 03+1] "LGTM, will also need a changlog entry and a fix to the setup.py file before doing a release" [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/977672 (https://phabricator.wikimedia.org/T348974) (owner: 10Slyngshede) [13:31:36] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [13:31:55] (03Abandoned) 10Jbond: Merge branch 'master' into 2.x [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/977674 (owner: 10Jbond) [13:32:24] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[13:32:55] PROBLEM - HTTPS Ganeti RAPI codfw on ganeti-test2003 is CRITICAL: connect to address ganeti-test01.svc.codfw.wmnet and port 5080: Connection refused https://www.mediawiki.org/wiki/Ganeti%23RAPI_daemon [13:36:29] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: cap prometheus size for ops instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [13:36:47] (03PS2) 10Urbanecm: UserImpact: Bump VERSION to 10 [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977614 (https://phabricator.wikimedia.org/T329700) [13:36:52] (03CR) 10Urbanecm: [C: 03+2] UserImpact: Bump VERSION to 10 [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977614 (https://phabricator.wikimedia.org/T329700) (owner: 10Urbanecm) [13:36:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977614 (https://phabricator.wikimedia.org/T329700) (owner: 10Urbanecm) [13:37:46] !log roll-restart prometheus/ops in eqiad/codfw to apply space-based retention - T351179 [13:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:54] T351179: LVM vg0 close to getting full on prometheus eqiad - https://phabricator.wikimedia.org/T351179 [13:38:09] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:38:26] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [13:38:42] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[13:40:30] (03CR) 10Volans: hieradata: cap prometheus size for ops instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [13:43:05] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on druid1007.eqiad.wmnet with reason: host reimage [13:44:21] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:45:07] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [13:45:16] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [13:45:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:45:24] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[13:45:58] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on druid1007.eqiad.wmnet with reason: host reimage [13:46:45] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:46:48] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: cap prometheus size for ops instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977594 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [13:47:19] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:50:33] (03PS1) 10Muehlenhoff: Use separate /etc/ganeti/ssl directory if using PKI [puppet] - 10https://gerrit.wikimedia.org/r/977677 (https://phabricator.wikimedia.org/T350686) [13:51:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977677 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [13:52:56] (03PS6) 10Jbond: yamllint: First stab at adding yamllint CI tests [puppet] - 10https://gerrit.wikimedia.org/r/557060 (https://phabricator.wikimedia.org/T236954) [13:54:53] (JobUnavailable) firing: Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:55:04] (03CR) 10Jbond: "@jesse going through old changes and wonder if you are interested in keeping this around / perusing this route. 
note even with the config" [puppet] - 10https://gerrit.wikimedia.org/r/557060 (https://phabricator.wikimedia.org/T236954) (owner: 10Jbond) [13:56:16] (03Abandoned) 10Jbond: ca nrpe checking: exclude managing public certificates [puppet] - 10https://gerrit.wikimedia.org/r/559443 (https://phabricator.wikimedia.org/T238833) (owner: 10Jbond) [13:56:36] (03Abandoned) 10Jbond: etcd: add cert parameter to enable client auth [puppet] - 10https://gerrit.wikimedia.org/r/561818 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [13:56:41] (03CR) 10CI reject: [V: 04-1] yamllint: First stab at adding yamllint CI tests [puppet] - 10https://gerrit.wikimedia.org/r/557060 (https://phabricator.wikimedia.org/T236954) (owner: 10Jbond) [13:56:48] (03Abandoned) 10Jbond: etcd: remove username/password [puppet] - 10https://gerrit.wikimedia.org/r/561819 (https://phabricator.wikimedia.org/T240941) (owner: 10Jbond) [13:58:28] (03PS2) 10Slyngshede: Ensure that build directories are cleaned up [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/977672 (https://phabricator.wikimedia.org/T348974) [13:59:06] (03Abandoned) 10Jbond: wmflib::require_domains: use require_domains instead of require_realm [puppet] - 10https://gerrit.wikimedia.org/r/570348 (https://phabricator.wikimedia.org/T244222) (owner: 10Jbond) [13:59:35] (03Abandoned) 10Jbond: beaker: application testing [puppet] - 10https://gerrit.wikimedia.org/r/567080 (owner: 10Jbond) [13:59:45] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231127T1400). [14:00:06] physikerwelt, sergi0, Kizule, and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:24] please wait with the window a bit, currently deploying an UBN fix. [14:00:37] and i can deploy window patches too [14:00:40] thank you, I am here, and can wait [14:00:49] (03Abandoned) 10Jbond: storeconfigs: add debug option to test $settings variable [puppet] - 10https://gerrit.wikimedia.org/r/617156 (owner: 10Jbond) [14:00:58] hi physikerwelt [14:01:00] * Lucas_WMDE also here [14:01:08] (but nothing in particular to do) [14:01:09] o/ [14:01:13] hi [14:01:46] hi everyone [14:01:51] (03Merged) 10jenkins-bot: UserImpact: Bump VERSION to 10 [extensions/GrowthExperiments] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/977614 (https://phabricator.wikimedia.org/T329700) (owner: 10Urbanecm) [14:01:55] finally [14:02:06] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:977614|UserImpact: Bump VERSION to 10 (T329700)]] [14:02:31] T329700: Clean up GrowthExperiments UserImpact timezone handling - https://phabricator.wikimedia.org/T329700 [14:02:52] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:03:05] Hi, I'm finally here, for my patch for backport. :) [14:03:08] Has deploying started? [14:03:10] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [14:03:21] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:977614|UserImpact: Bump VERSION to 10 (T329700)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:03:32] Kizule: hi, i'm currently deploying an UBN fix for Growth. Once done, will start with window. [14:03:47] !log urbanecm@deploy2002 urbanecm: Continuing with sync [14:03:49] urbanecm: Great, I'm happy that I'm not late and for this window. 
;) [14:03:50] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T352027 (10phaultfinder) [14:04:01] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host druid1007.eqiad.wmnet with OS bullseye [14:04:21] urbanecm: do you need support for testing the UBN issue? [14:04:24] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: enable frontend for 15th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977644 (https://phabricator.wikimedia.org/T308141) (owner: 10Sergio Gimeno) [14:04:33] oh, it's done :) [14:04:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [14:04:43] sergi0: nope, thanks for offering. i just verified the module appears for the users it broke previously. [14:05:13] (03PS3) 10Urbanecm: zghwiki: add timezone, wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975378 (https://phabricator.wikimedia.org/T350241) (owner: 10Anzx) [14:05:16] (03CR) 10Urbanecm: [C: 03+2] zghwiki: add timezone, wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975378 (https://phabricator.wikimedia.org/T350241) (owner: 10Anzx) [14:05:18] (03Merged) 10jenkins-bot: GrowthExperiments: enable frontend for 15th round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977644 (https://phabricator.wikimedia.org/T308141) (owner: 10Sergio Gimeno) [14:05:30] (03PS5) 10Urbanecm: bbcwiki: add timezone, wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975376 (https://phabricator.wikimedia.org/T350373) (owner: 10Anzx) [14:05:33] (03CR) 10Urbanecm: [C: 03+2] bbcwiki: add timezone, wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975376 (https://phabricator.wikimedia.org/T350373) (owner: 10Anzx) [14:06:04] (03Merged) 10jenkins-bot: zghwiki: 
add timezone, wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975378 (https://phabricator.wikimedia.org/T350241) (owner: 10Anzx) [14:07:15] (03Merged) 10jenkins-bot: bbcwiki: add timezone, wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975376 (https://phabricator.wikimedia.org/T350373) (owner: 10Anzx) [14:10:03] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:977614|UserImpact: Bump VERSION to 10 (T329700)]] (duration: 07m 56s) [14:10:21] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:977644|GrowthExperiments: enable frontend for 15th round of wikis (T308141)]], [[gerrit:975378|zghwiki: add timezone, wgSitename (T350241)]], [[gerrit:975376|bbcwiki: add timezone, wgSitename (T350373)]] [14:10:22] T329700: Clean up GrowthExperiments UserImpact timezone handling - https://phabricator.wikimedia.org/T329700 [14:10:39] T308141: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141 [14:10:40] T350241: Post-creation work for zghwiki - https://phabricator.wikimedia.org/T350241 [14:10:40] T350373: Post-creation work for bbcwiki - https://phabricator.wikimedia.org/T350373 [14:10:52] aanzx: sergi0: your patches (first two for aanzx) are now being deployed. will ping once testable. 
[14:11:03] ok [14:11:19] alright [14:11:36] !log urbanecm@deploy2002 sgimeno and anzx and urbanecm: Backport for [[gerrit:977644|GrowthExperiments: enable frontend for 15th round of wikis (T308141)]], [[gerrit:975378|zghwiki: add timezone, wgSitename (T350241)]], [[gerrit:975376|bbcwiki: add timezone, wgSitename (T350373)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:12:05] urbanecm: checking [14:12:06] aanzx: sergi0: please test your patches at mwdebug2001 [14:12:21] testing [14:12:54] (03CR) 10Bking: [C: 03+2] query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [14:14:39] urbanecm: looks good on my end [14:14:43] ty [14:15:15] 10ops-eqiad, 10Cloud-VPS, 10decommission-hardware, 10cloud-services-team (Hardware): reclaim cloudmetrics1003.eqiad.wmnet, cloudmetrics1004.eqiad.wmnet - https://phabricator.wikimedia.org/T351077 (10taavi) [14:15:30] urbanecm: looks good [14:15:36] 10ops-eqiad, 10Cloud-VPS, 10decommission-hardware, 10cloud-services-team (Hardware): reclaim cloudmetrics1003.eqiad.wmnet, cloudmetrics1004.eqiad.wmnet - https://phabricator.wikimedia.org/T351077 (10taavi) [14:15:47] !log urbanecm@deploy2002 sgimeno and anzx and urbanecm: Continuing with sync [14:15:49] ty, proceeding [14:20:09] (03PS1) 10Kamila Součková: mw-api-int: increase replicas by 50% [deployment-charts] - 10https://gerrit.wikimedia.org/r/977683 (https://phabricator.wikimedia.org/T350846) [14:21:34] (03PS3) 10Urbanecm: bjnwikiquote: add timezone, wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975377 (https://phabricator.wikimedia.org/T350235) (owner: 10Anzx) [14:21:41] (03PS5) 10Urbanecm: dgawiki: add logos, timezone and sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975594 (https://phabricator.wikimedia.org/T350229) (owner: 10Anzx) [14:21:43] (03CR) 10Urbanecm: [C: 03+2] bjnwikiquote: add timezone, wgSitename 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/975377 (https://phabricator.wikimedia.org/T350235) (owner: 10Anzx) [14:21:45] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:977644|GrowthExperiments: enable frontend for 15th round of wikis (T308141)]], [[gerrit:975378|zghwiki: add timezone, wgSitename (T350241)]], [[gerrit:975376|bbcwiki: add timezone, wgSitename (T350373)]] (duration: 11m 23s) [14:21:46] (03CR) 10Urbanecm: [C: 03+2] dgawiki: add logos, timezone and sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975594 (https://phabricator.wikimedia.org/T350229) (owner: 10Anzx) [14:21:53] T308141: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141 [14:21:53] T350241: Post-creation work for zghwiki - https://phabricator.wikimedia.org/T350241 [14:21:54] T350373: Post-creation work for bbcwiki - https://phabricator.wikimedia.org/T350373 [14:22:36] (03Merged) 10jenkins-bot: bjnwikiquote: add timezone, wgSitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975377 (https://phabricator.wikimedia.org/T350235) (owner: 10Anzx) [14:22:40] (03Merged) 10jenkins-bot: dgawiki: add logos, timezone and sitename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/975594 (https://phabricator.wikimedia.org/T350229) (owner: 10Anzx) [14:22:42] Kizule: hi, i'm unsure whether partial action blocks can be deployed to any wiki that asks for it. was this discussed with Niharika on a task please? 
[14:23:16] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:975377|bjnwikiquote: add timezone, wgSitename (T350235)]], [[gerrit:975594|dgawiki: add logos, timezone and sitename (T350229)]] [14:23:22] T350235: Post-creation work for bjnwikiquote - https://phabricator.wikimedia.org/T350235 [14:23:23] T350229: Post-creation work for dgawiki - https://phabricator.wikimedia.org/T350229 [14:24:15] (03PS7) 10ArielGlenn: use virtual db domain for CentralAuth and GlobalBlocking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) [14:24:32] !log urbanecm@deploy2002 urbanecm and anzx: Backport for [[gerrit:975377|bjnwikiquote: add timezone, wgSitename (T350235)]], [[gerrit:975594|dgawiki: add logos, timezone and sitename (T350229)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:25:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2087.mgmt.codfw.wmnet with reboot policy FORCED [14:25:08] aanzx: can you test please? [14:25:12] checking [14:25:22] urbanecm: Hi again. I don't see that there was any discussion for previous request as well https://phabricator.wikimedia.org/T351048 [14:25:34] But it was deployed anyways, and noone didn't complain. [14:26:05] I have community consensus as well, so, it's fine. [14:26:07] (03PS4) 10Elukey: istio: upgrade to Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977214 (https://phabricator.wikimedia.org/T351933) [14:26:09] (03PS2) 10Elukey: cert-manager: upgrade to Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977220 (https://phabricator.wikimedia.org/T351933) [14:26:30] Kizule: well, Niharika did comment on the request at the end, so... 
:-) [14:26:45] community consensus is one thing, but often, features are first piloted on a couple of wikis to ensure they work, before a full rollout [14:27:01] urbanecm: Which basically says "great that this is done, let me know if there is something wrong" [14:27:25] Kizule: to avoid unfortunate surprises, please ping Niharika and let's reschedule for later. i'm not sure what the status of the feature is ATM. [14:27:33] urbanecm: looks good [14:27:39] thanks, proceeding [14:27:40] !log urbanecm@deploy2002 urbanecm and anzx: Continuing with sync [14:27:50] (03CR) 10Elukey: istio: upgrade to Bullseye (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977214 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [14:28:08] (03CR) 10Klausman: [C: 03+1] istio: upgrade to Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977214 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [14:28:39] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [14:29:10] (03PS3) 10Jbond: bacula: only export resources if we have puppetdb support [puppet] - 10https://gerrit.wikimedia.org/r/617157 [14:30:20] !log installing protobuf security updates [14:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:34] urbanecm: Okay, I pinged Niharika and sent her message in IRC as well [14:30:43] ty. 
[14:32:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2088.mgmt.codfw.wmnet with reboot policy FORCED [14:32:50] (03PS2) 10Urbanecm: Enable native MathML rendering on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977121 (https://phabricator.wikimedia.org/T350787) (owner: 10Physikerwelt) [14:33:02] physikerwelt: your patch is up next :) [14:33:07] (03CR) 10Urbanecm: [C: 03+2] Enable native MathML rendering on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977121 (https://phabricator.wikimedia.org/T350787) (owner: 10Physikerwelt) [14:33:13] cool [14:33:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2089.mgmt.codfw.wmnet with reboot policy FORCED [14:33:41] (03PS5) 10Elukey: istio: upgrade to Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977214 (https://phabricator.wikimedia.org/T351933) [14:33:43] (03PS3) 10Elukey: cert-manager: upgrade to Bullseye [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977220 (https://phabricator.wikimedia.org/T351933) [14:33:45] (03PS7) 10Jbond: ci - flake8: update flake8 rules to be compatible with black [puppet] - 10https://gerrit.wikimedia.org/r/554827 (https://phabricator.wikimedia.org/T221083) [14:33:47] (03PS17) 10Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [14:33:50] physikerwelt: do you have the WikimediaDebug extension ready for testing please? 
:) [14:33:53] (03Merged) 10jenkins-bot: Enable native MathML rendering on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977121 (https://phabricator.wikimedia.org/T350787) (owner: 10Physikerwelt) [14:33:55] (03CR) 10Elukey: "rebased on top of weekly builds" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/977214 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [14:34:00] yes [14:34:07] okay, great. [14:34:14] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:975377|bjnwikiquote: add timezone, wgSitename (T350235)]], [[gerrit:975594|dgawiki: add logos, timezone and sitename (T350229)]] (duration: 10m 57s) [14:34:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2090.mgmt.codfw.wmnet with reboot policy FORCED [14:34:18] (03CR) 10CI reject: [V: 04-1] ci - flake8: update flake8 rules to be compatible with black [puppet] - 10https://gerrit.wikimedia.org/r/554827 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [14:34:20] T350235: Post-creation work for bjnwikiquote - https://phabricator.wikimedia.org/T350235 [14:34:20] T350229: Post-creation work for dgawiki - https://phabricator.wikimedia.org/T350229 [14:34:30] (03CR) 10CI reject: [V: 04-1] CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [14:34:34] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:977121|Enable native MathML rendering on dewiki (T350787)]] [14:34:39] T350787: Deploy native on a few pilot wikis - https://phabricator.wikimedia.org/T350787 [14:34:55] (03CR) 10Jbond: "i have refreshed this patch let me know if you are intrested in it. 
If we go this route we should also to some one of patches to black al" [puppet] - 10https://gerrit.wikimedia.org/r/554827 (https://phabricator.wikimedia.org/T221083) (owner: 10Jbond) [14:35:02] urbanecm: thanks [14:35:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2091.mgmt.codfw.wmnet with reboot policy FORCED [14:35:52] !log urbanecm@deploy2002 urbanecm and physikerwelt: Backport for [[gerrit:977121|Enable native MathML rendering on dewiki (T350787)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:36:03] physikerwelt: can you test at mwdebug2001 please? [14:36:06] aanzx: np [14:36:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2092.mgmt.codfw.wmnet with reboot policy FORCED [14:36:21] ok [14:36:42] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Switch JavaScript evaluator to 2023-11-22-195017 [deployment-charts] - 10https://gerrit.wikimedia.org/r/976846 (https://phabricator.wikimedia.org/T349385) (owner: 10Jforrester) [14:37:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2093.mgmt.codfw.wmnet with reboot policy FORCED [14:37:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2087.mgmt.codfw.wmnet with reboot policy FORCED [14:37:31] cool. 
I confirm it works [14:37:34] (03Merged) 10jenkins-bot: wikifunctions: Switch JavaScript evaluator to 2023-11-22-195017 [deployment-charts] - 10https://gerrit.wikimedia.org/r/976846 (https://phabricator.wikimedia.org/T349385) (owner: 10Jforrester) [14:38:04] great, proceeding [14:38:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2094.mgmt.codfw.wmnet with reboot policy FORCED [14:38:06] !log urbanecm@deploy2002 urbanecm and physikerwelt: Continuing with sync [14:38:11] physikerwelt: just because I’m curious – is this eventually going to make https://addons.mozilla.org/en-US/firefox/addon/native-mathml/ basically obsolete for MediaWiki sites? [14:38:26] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:38:26] (JobUnavailable) firing: (2) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2095.mgmt.codfw.wmnet with reboot policy FORCED [14:38:57] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:39:02] (03CR) 10Jcrespo: "Looks ok, but what's the context? Puppet compiler? Cloud? Something else?" 
[puppet] - 10https://gerrit.wikimedia.org/r/617157 (owner: 10Jbond) [14:39:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2096.mgmt.codfw.wmnet with reboot policy FORCED [14:40:06] (03CR) 10Jcrespo: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/617157 (owner: 10Jbond) [14:40:08] (03PS1) 10Andrew Bogott: wmcs-cold-migrate: backport fixes to 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/977684 [14:40:15] Lucas_WMDE yes [14:40:19] nice [14:40:21] best of luck with it then :) [14:40:22] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2087'] [14:40:27] exciting [14:40:48] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2087'] [14:41:02] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-cold-migrate: backport fixes to 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/977684 (owner: 10Andrew Bogott) [14:41:04] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2087'] [14:41:52] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Remove quotes from sudo command due to sudoers flattening [puppet] - 10https://gerrit.wikimedia.org/r/977662 (owner: 10Jcrespo) [14:43:00] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977239 [14:43:03] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:43:42] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) [14:43:54] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:977121|Enable native MathML rendering on dewiki (T350787)]] (duration: 09m 19s) [14:44:00] physikerwelt: and deployed [14:44:02] anything else? 
[14:44:02] T350787: Deploy native on a few pilot wikis - https://phabricator.wikimedia.org/T350787 [14:44:06] (03Abandoned) 10Jbond: Gemfile: update puppetlabs_spec_helper version and switch to rspec-mock [puppet] - 10https://gerrit.wikimedia.org/r/636923 (owner: 10Jbond) [14:44:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2088.mgmt.codfw.wmnet with reboot policy FORCED [14:45:21] (03Abandoned) 10Jbond: taskgen: add new CI check to ensure hiera keys are valid [puppet] - 10https://gerrit.wikimedia.org/r/580921 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [14:45:30] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2088'] [14:45:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2089.mgmt.codfw.wmnet with reboot policy FORCED [14:46:09] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2088'] [14:46:25] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2088'] [14:46:35] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host dns4003.wikimedia.org [14:46:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2090.mgmt.codfw.wmnet with reboot policy FORCED [14:46:47] (03PS7) 10Jbond: puppetmaster: update webconfig to use correct file path [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) [14:46:55] (03PS8) 10Jbond: puppetmaster: update webconfig to use correct file path [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) [14:47:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2087'] [14:47:38] (03CR) 10Jcrespo: [C: 03+1] 
"https://puppet-compiler.wmflabs.org/output/617157/2707/ but see my question." [puppet] - 10https://gerrit.wikimedia.org/r/617157 (owner: 10Jbond) [14:47:42] (03PS1) 10Muehlenhoff: Switch dns4003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/977685 (https://phabricator.wikimedia.org/T349619) [14:47:46] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2089'] [14:47:47] looks as expected thank you [14:48:00] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2089'] [14:48:04] great! [14:48:09] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2089'] [14:48:12] (03CR) 10Jbond: "not sure if this is still worth persuing as we are moving to puppet7 but let me know happy to shepherd it in before i go" [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) (owner: 10Jbond) [14:48:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2092.mgmt.codfw.wmnet with reboot policy FORCED [14:48:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2093.mgmt.codfw.wmnet with reboot policy FORCED [14:49:06] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2090'] [14:49:17] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977240 [14:49:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2090'] [14:49:43] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2090'] [14:49:48] (03CR) 10Muehlenhoff: [C: 03+2] Switch dns4003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/977685 
(https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [14:50:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2095.mgmt.codfw.wmnet with reboot policy FORCED [14:50:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 23 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) (owner: 10Jbond) [14:51:00] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1001.eqiad.wmnet with OS bookworm [14:51:14] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2092'] [14:51:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2092'] [14:51:42] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2092'] [14:51:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2091.mgmt.codfw.wmnet with reboot policy FORCED [14:52:01] (03CR) 10Jbond: "thanks response inline" [puppet] - 10https://gerrit.wikimedia.org/r/617157 (owner: 10Jbond) [14:52:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2088'] [14:52:13] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1160 [14:52:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1160 [14:52:30] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1157 [14:52:32] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1157 [14:52:36] !log jclark@cumin1001 START - Cookbook 
sre.network.configure-switch-interfaces for host an-worker1158 [14:52:36] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2093'] [14:52:38] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1158 [14:52:42] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1159 [14:52:43] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1159 [14:52:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2094.mgmt.codfw.wmnet with reboot policy FORCED [14:52:46] (03PS1) 10Filippo Giunchedi: pontoon: set new prometheus defaults [puppet] - 10https://gerrit.wikimedia.org/r/977686 (https://phabricator.wikimedia.org/T351179) [14:52:48] (03PS1) 10Filippo Giunchedi: k8s: allow setting prometheus retention in cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/977687 (https://phabricator.wikimedia.org/T351179) [14:52:51] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1161 [14:52:52] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2093'] [14:52:52] (03PS1) 10Filippo Giunchedi: hieradata: set 850GB retention for prometheus@k8s [puppet] - 10https://gerrit.wikimedia.org/r/977688 (https://phabricator.wikimedia.org/T351179) [14:52:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1161 [14:52:58] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1175 [14:52:58] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2093'] [14:53:00] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host an-worker1175 [14:53:26] (JobUnavailable) firing: (2) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:53:33] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1160.eqiad.wmnet with OS bullseye [14:53:34] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2091'] [14:53:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1160.eqiad.wmnet with OS bullseye [14:53:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2089'] [14:53:53] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2091'] [14:54:01] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2091'] [14:54:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2091'] [14:54:25] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2091'] [14:54:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2091'] [14:54:59] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:55:20] !log jhancock@cumin2002 END (PASS) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2090'] [14:55:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host dns4003.wikimedia.org [14:55:30] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2094'] [14:55:44] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2094'] [14:55:52] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2094'] [14:56:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2094'] [14:56:35] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2095'] [14:56:43] !log mwmaint2002: /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=6hour --verbose --use-job-queue [14:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:48] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2095'] [14:56:48] (03PS2) 10Ssingh: admin: reserve uid/gid for authdns user [puppet] - 10https://gerrit.wikimedia.org/r/977252 (https://phabricator.wikimedia.org/T347054) [14:56:54] (03Abandoned) 10Ssingh: P:dns::auth::update::account: switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/977259 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [14:56:57] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2095'] [14:56:57] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Allow to dry_run RemoteHosts.wait_reboot_since() and 
PuppetHosts.wait_since() - https://phabricator.wikimedia.org/T311050 (10Volans) @JMeybohm could you confirm the above or give me more context? [14:57:01] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/708/con" [puppet] - 10https://gerrit.wikimedia.org/r/977688 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [14:57:13] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/707/console" [puppet] - 10https://gerrit.wikimedia.org/r/977687 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [14:57:14] !log mwmaint2002: `/usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/growthexperiments.dblist extensions/GrowthExperiments/maintenance/refreshUserImpactData.php --registeredWithin=1year --editedWithin=2week --hasEditsAtLeast=3 --ignoreIfUpdatedWithin=6hour --verbose --use-job-queue` [14:57:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2092'] [14:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:00] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: set new prometheus defaults [puppet] - 10https://gerrit.wikimedia.org/r/977686 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [14:58:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2093'] [15:01:52] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2096.mgmt.codfw.wmnet with reboot policy FORCED [15:01:54] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1157.eqiad.wmnet with OS bullseye [15:02:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: 
Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye [15:02:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/977252 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:02:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2095'] [15:04:10] (03PS1) 10Ssingh: dns4003: remove dns4003 from authdns_servers for reboot [puppet] - 10https://gerrit.wikimedia.org/r/977693 [15:06:44] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [15:06:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2097.mgmt.codfw.wmnet with reboot policy FORCED [15:06:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2098.mgmt.codfw.wmnet with reboot policy FORCED [15:07:01] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2099.mgmt.codfw.wmnet with reboot policy FORCED [15:07:02] !log `nfctl select name='cp10.*',service=ats-be set/pooled=inactive` (cdn and ats-be not used anymore on these hosts) T349244 [15:07:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2100.mgmt.codfw.wmnet with reboot policy FORCED [15:07:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2101.mgmt.codfw.wmnet with reboot policy FORCED [15:07:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2102.mgmt.codfw.wmnet with reboot policy FORCED [15:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:19] T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [15:08:56] 10SRE, 
10Traffic, 10Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez) thx @ayounsi we will go with option 1: * IPv4: 1500 - 20 (IP) - 20 (IP) - 20 (TCP) = 1440 bytes * IPv6: 1500 - 40 (IPv6) - 40 (IPv6) - 20 (TCP) = 1400 bytes [15:09:26] !log restarting CI Jenkins [15:09:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10Jclark-ctr) @cmooney thanks for looking at it previously we were not even getting to the Debian installer the sre.network.configure-switch-interface was run without e... [15:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:06] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1001.eqiad.wmnet with reason: host reimage [15:11:11] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host an-worker1160.eqiad.wmnet with OS bullseye [15:13:17] (03PS5) 10Vgutierrez: ncredir: Enable IPIP encapsulation on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/975772 (https://phabricator.wikimedia.org/T351069) [15:14:03] !log set `pooled=yes` on cp11.* hosts in eqiad T349244 [15:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:08] T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 [15:14:50] 10SRE, 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T351279 (10phaultfinder) [15:17:10] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/709/con" [puppet] - 10https://gerrit.wikimedia.org/r/975772 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:18:57] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/710/con" [puppet] - 10https://gerrit.wikimedia.org/r/975772 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:19:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2098.mgmt.codfw.wmnet with reboot policy FORCED [15:19:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2102.mgmt.codfw.wmnet with reboot policy FORCED [15:19:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2099.mgmt.codfw.wmnet with reboot policy FORCED [15:19:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2100.mgmt.codfw.wmnet with reboot policy FORCED [15:19:47] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2098'] [15:19:51] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2099'] [15:19:55] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2100'] [15:20:00] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2102'] [15:20:08] (03CR) 10Pmiazga: [C: 03+1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [15:20:13] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2098'] [15:20:18] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2099'] [15:20:22] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts 
['elastic2100'] [15:20:25] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2102'] [15:20:45] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2098'] [15:20:48] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2099'] [15:20:52] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2100'] [15:20:54] (03PS1) 10Majavah: P:openstack: nova: include puppet ca chain for libvirtd [puppet] - 10https://gerrit.wikimedia.org/r/977697 [15:20:56] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2102'] [15:21:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2100'] [15:22:14] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/711/con" [puppet] - 10https://gerrit.wikimedia.org/r/977697 (owner: 10Majavah) [15:22:39] (03CR) 10Ssingh: [C: 03+2] admin: reserve uid/gid for authdns user [puppet] - 10https://gerrit.wikimedia.org/r/977252 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:25:39] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/977697 (owner: 10Majavah) [15:25:57] (03PS1) 10Herron: arclamp: set errors_mail to sre-observability@ [puppet] - 10https://gerrit.wikimedia.org/r/977698 (https://phabricator.wikimedia.org/T349159) [15:26:19] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openstack: nova: include puppet ca chain for libvirtd [puppet] - 10https://gerrit.wikimedia.org/r/977697 (owner: 10Majavah) [15:27:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2102'] [15:27:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2099'] [15:27:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2098'] [15:28:47] (03CR) 10Herron: [C: 03+1] rsyslog: move netdev_kafka_relay to rsyslog-receiver [puppet] - 10https://gerrit.wikimedia.org/r/977096 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [15:29:24] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2097.mgmt.codfw.wmnet with reboot policy FORCED [15:29:31] (03CR) 10Filippo Giunchedi: [C: 03+1] arclamp: set errors_mail to sre-observability@ [puppet] - 10https://gerrit.wikimedia.org/r/977698 (https://phabricator.wikimedia.org/T349159) (owner: 10Herron) [15:29:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host elastic2101.mgmt.codfw.wmnet with reboot policy FORCED [15:29:48] (03CR) 10Herron: [C: 03+2] arclamp: set errors_mail to sre-observability@ [puppet] - 10https://gerrit.wikimedia.org/r/977698 (https://phabricator.wikimedia.org/T349159) (owner: 10Herron) [15:30:14] (03CR) 10Krinkle: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 
10D3r1ck01) [15:31:27] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2103.mgmt.codfw.wmnet with reboot policy FORCED [15:31:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2104.mgmt.codfw.wmnet with reboot policy FORCED [15:31:34] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2105.mgmt.codfw.wmnet with reboot policy FORCED [15:31:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2106.mgmt.codfw.wmnet with reboot policy FORCED [15:31:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2107.mgmt.codfw.wmnet with reboot policy FORCED [15:31:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2108.mgmt.codfw.wmnet with reboot policy FORCED [15:32:49] 10SRE-tools, 10Observability-Logging: Create a cookbook for managing Logstash cluster restarts - https://phabricator.wikimedia.org/T293929 (10joanna_borun) [15:33:41] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Migrate existing cookbooks related to rolling restarts/reboots to SREBatchBase - https://phabricator.wikimedia.org/T317855 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff I'm taking this one, for coordination and partly implementing myself. [15:34:17] (03CR) 10Jcrespo: [C: 03+1] "only miscweb1003.eqiad.wmnet fails, which must be unrelated to the patch. My +1 stands." 
[puppet] - 10https://gerrit.wikimedia.org/r/617157 (owner: 10Jbond) [15:35:44] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host an-worker1157.eqiad.wmnet with OS bullseye [15:36:36] (03PS1) 10Fabfur: decom cp1075-1090 [puppet] - 10https://gerrit.wikimedia.org/r/977702 (https://phabricator.wikimedia.org/T349244) [15:40:43] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1136.eqiad.wmnet - https://phabricator.wikimedia.org/T351065 (10Jclark-ctr) [15:41:10] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1136.eqiad.wmnet - https://phabricator.wikimedia.org/T351065 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [15:41:55] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977239 (owner: 10PipelineBot) [15:42:59] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977239 (owner: 10PipelineBot) [15:43:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2108.mgmt.codfw.wmnet with reboot policy FORCED [15:43:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2105.mgmt.codfw.wmnet with reboot policy FORCED [15:43:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2107.mgmt.codfw.wmnet with reboot policy FORCED [15:44:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2106.mgmt.codfw.wmnet with reboot policy FORCED [15:44:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2103.mgmt.codfw.wmnet with reboot policy FORCED [15:44:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2104.mgmt.codfw.wmnet with reboot policy FORCED [15:44:41] !log jhancock@cumin2002 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2103'] [15:44:45] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2104'] [15:44:49] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2105'] [15:44:52] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2106'] [15:44:55] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2107'] [15:45:00] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2108'] [15:45:24] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2103'] [15:45:25] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2104'] [15:45:27] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2105'] [15:45:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2106'] [15:45:30] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2107'] [15:45:31] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2108'] [15:45:45] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2103'] [15:45:52] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2104'] [15:46:00] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2105'] [15:46:05] !log jhancock@cumin2002 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2106'] [15:46:08] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2107'] [15:46:13] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2108'] [15:46:55] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10decommission-hardware, 10cloud-services-team (Hardware): reclaim cloudmetrics1003.eqiad.wmnet, cloudmetrics1004.eqiad.wmnet - https://phabricator.wikimedia.org/T351077 (10Jclark-ctr) [15:47:32] (03PS1) 10Bking: miscweb: Add wdqs ldf blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) [15:48:07] (03CR) 10CI reject: [V: 04-1] miscweb: Add wdqs ldf blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [15:49:20] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Spicerack: Add a cookbook to safely deploy puppet changes - https://phabricator.wikimedia.org/T341442 (10jbond) [15:49:31] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: zookeeper::flink [15:50:03] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10decommission-hardware, 10cloud-services-team (Hardware): reclaim cloudmetrics1003.eqiad.wmnet, cloudmetrics1004.eqiad.wmnet - https://phabricator.wikimedia.org/T351077 (10Jclark-ctr) 05Open→03Resolved a:05taavi→03Jclark-ctr [15:50:38] 10SRE-OnFire, 10Incident Tooling: Provide mechanism to join/leave oncall - https://phabricator.wikimedia.org/T322636 (10joanna_borun) p:05Triage→03Low [15:50:51] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [15:50:59] (03PS1) 10Muehlenhoff: Switch zookeeper::flink to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/977705 (https://phabricator.wikimedia.org/T349619) [15:52:02] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [15:52:30] 
10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1130.eqiad.wmnet - https://phabricator.wikimedia.org/T351067 (10Jclark-ctr) [15:52:32] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1130.eqiad.wmnet - https://phabricator.wikimedia.org/T351067 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [15:52:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2103'] [15:52:47] (03CR) 10Bartosz Dziewoński: [C: 03+1] CentralAuth: Fix wikisource.org cookie handling (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976864 (https://phabricator.wikimedia.org/T351685) (owner: 10Gergő Tisza) [15:52:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2104'] [15:52:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2105'] [15:52:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2106'] [15:52:53] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2107'] [15:52:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2108'] [15:53:38] (03PS2) 10Bking: miscweb: Add wdqs ldf blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) [15:53:40] (03CR) 10Muehlenhoff: [C: 03+2] Switch zookeeper::flink to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/977705 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:53:46] (03CR) 10Filippo Giunchedi: [C: 03+2] rsyslog: move netdev_kafka_relay to rsyslog-receiver [puppet] - 10https://gerrit.wikimedia.org/r/977096 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo 
Giunchedi) [15:54:09] (03PS1) 10Dreamy Jazz: Pin wgCheckUserPurgeOrphanedMapRows to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977726 (https://phabricator.wikimedia.org/T350681) [15:54:11] moritzm: I have merged your change btw [15:54:32] I think I got into a race and merged yours but not mine )o) [15:55:14] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time - https://phabricator.wikimedia.org/T342130 (10joanna_borun) p:05Triage→03Medium [15:55:28] godog: ah, I'm still getting your lock, though? [15:55:44] moritzm: should be gone now, I was merging mine [15:55:54] indeed, thanks [15:55:56] so two puppet-merge in a row [15:56:00] sure np [15:56:13] (03CR) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [15:56:59] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1127.eqiad.wmnet - https://phabricator.wikimedia.org/T351063 (10Jclark-ctr) [15:57:11] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission db1127.eqiad.wmnet - https://phabricator.wikimedia.org/T351063 (10Jclark-ctr) 05In progress→03Resolved a:03Jclark-ctr [15:57:37] (03PS1) 10Majavah: P:openstack: nova: try including the chain in cacert.pem instead [puppet] - 10https://gerrit.wikimedia.org/r/977728 [15:58:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10Jclark-ctr) a:03Jclark-ctr [15:58:18] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Spicerack: don't write logs to disk - https://phabricator.wikimedia.org/T342079 (10joanna_borun) p:05Triage→03Low [15:58:52] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/712/con" [puppet] - 10https://gerrit.wikimedia.org/r/977728 (owner: 10Majavah) [15:59:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host elastic2109.mgmt.codfw.wmnet with reboot policy FORCED [15:59:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host logging-hd2001.mgmt.codfw.wmnet with reboot policy FORCED [15:59:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host logging-hd2002.mgmt.codfw.wmnet with reboot policy FORCED [15:59:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host logging-hd2003.mgmt.codfw.wmnet with reboot policy FORCED [15:59:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2059.mgmt.codfw.wmnet with reboot policy FORCED [15:59:12] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2060.mgmt.codfw.wmnet with reboot policy FORCED [15:59:14] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325 (10joanna_borun) p:05Triage→03High [15:59:43] (03CR) 10Andrew Bogott: [C: 03+1] "worth a try" [puppet] - 10https://gerrit.wikimedia.org/r/977728 (owner: 10Majavah) [15:59:57] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [16:00:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: zookeeper::flink [16:00:32] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10decommission-hardware: decommission dbproxy1017.eqiad.wmnet - https://phabricator.wikimedia.org/T348956 (10Jclark-ctr) [16:00:52] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openstack: nova: try including the chain in cacert.pem instead [puppet] - 10https://gerrit.wikimedia.org/r/977728 (owner: 10Majavah) [16:01:02] (03CR) 10Ssingh: [C: 03+2] dns4003: remove 
dns4003 from authdns_servers for reboot [puppet] - 10https://gerrit.wikimedia.org/r/977693 (owner: 10Ssingh) [16:01:04] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [16:01:06] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops, 10decommission-hardware: decommission dbproxy1017.eqiad.wmnet - https://phabricator.wikimedia.org/T348956 (10Jclark-ctr) 05In progress→03Resolved a:03Jclark-ctr [16:01:34] (03CR) 10Vgutierrez: [C: 03+2] profile: Provide a lvs::realserver::ipip profile [puppet] - 10https://gerrit.wikimedia.org/r/975342 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:02:16] (03CR) 10Clément Goubert: [C: 03+1] mw-api-int: increase replicas by 50% [deployment-charts] - 10https://gerrit.wikimedia.org/r/977683 (https://phabricator.wikimedia.org/T350846) (owner: 10Kamila Součková) [16:02:23] (03PS1) 10Filippo Giunchedi: rsyslog: fix duplicate conf declaration in netdev_kafka_relay [puppet] - 10https://gerrit.wikimedia.org/r/977729 (https://phabricator.wikimedia.org/T351799) [16:03:30] (03CR) 10Clément Goubert: [C: 03+1] Move mw api servers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/977660 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [16:03:32] (03PS3) 10Bking: miscweb: Add wdqs ldf blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) [16:04:20] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Support cookbooks resume after user interruption - https://phabricator.wikimedia.org/T345402 (10joanna_borun) 05Open→03Declined [16:05:04] (03CR) 10Filippo Giunchedi: [C: 03+2] rsyslog: fix duplicate conf declaration in netdev_kafka_relay [puppet] - 10https://gerrit.wikimedia.org/r/977729 (https://phabricator.wikimedia.org/T351799) (owner: 10Filippo Giunchedi) [16:05:08] (03CR) 10Ssingh: "Looks good overall, we should also rm the 
hieradata override for cp1075 (cp1075.yaml)" [puppet] - 10https://gerrit.wikimedia.org/r/977702 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [16:05:23] (03CR) 10Vgutierrez: [C: 03+2] lvs,pybal: Deploy ipip-multiqueue-optimizer for IPIP enabled balancers [puppet] - 10https://gerrit.wikimedia.org/r/976737 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:06:51] (03CR) 10Jelto: "see comment in-line, puppet fails on miscweb hosts" [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [16:07:03] !log disable puppet and stop bird on dns4003: rebooting [16:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:20] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host dns4003.wikimedia.org [16:07:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [16:07:32] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:07:36] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (10joanna_borun) 05Open→03Declined [16:07:39] ^ expected [16:07:42] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:09:19] (03PS6) 10Vgutierrez: ncredir: Enable IPIP encapsulation on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/975772 (https://phabricator.wikimedia.org/T351069) [16:10:14] PROBLEM - BFD status on cr3-ulsfo is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:10:16] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:10:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2060.mgmt.codfw.wmnet with reboot policy FORCED [16:10:59] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns4003.wikimedia.org [16:11:04] PROBLEM - Host dns4003 is DOWN: PING CRITICAL - Packet loss = 100% [16:11:04] RECOVERY - Host dns4003 is UP: PING OK - Packet loss = 0%, RTA = 134.87 ms [16:11:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-hd2001.mgmt.codfw.wmnet with reboot policy FORCED [16:11:27] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/977185 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [16:11:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2059.mgmt.codfw.wmnet with reboot policy FORCED [16:11:31] !log enable puppet and start bird on dns4003 [16:11:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host elastic2109.mgmt.codfw.wmnet with reboot policy FORCED [16:11:42] (03CR) 10Vgutierrez: [C: 03+2] ncredir: Enable IPIP encapsulation on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/975772 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:11:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-hd2003.mgmt.codfw.wmnet with reboot policy FORCED [16:11:44] (03PS1) 10Ssingh: Revert "dns4003: remove dns4003 from authdns_servers for reboot" [puppet] - 10https://gerrit.wikimedia.org/r/977618 [16:11:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-hd2002.mgmt.codfw.wmnet with reboot policy FORCED [16:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:31] !log jhancock@cumin2002 START - 
Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2109'] [16:12:46] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logging-hd2001'] [16:12:53] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logging-hd2002'] [16:12:56] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logging-hd2003'] [16:12:59] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2059'] [16:13:02] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2060'] [16:13:10] RECOVERY - BFD status on cr3-ulsfo is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:13:14] RECOVERY - BFD status on cr4-ulsfo is OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:13:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2109'] [16:13:33] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['logging-hd2001'] [16:13:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['logging-hd2002'] [16:13:35] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['logging-hd2003'] [16:13:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['kubernetes2059'] [16:13:38] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['kubernetes2060'] [16:14:08] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware 
for hosts ['elastic2109'] [16:14:12] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logging-hd2001'] [16:14:14] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logging-hd2002'] [16:14:17] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logging-hd2003'] [16:14:37] (03Abandoned) 10Jbond: global: drop owner/group => root from file resources [puppet] - 10https://gerrit.wikimedia.org/r/809139 (owner: 10Jbond) [16:14:55] 10SRE, 10ops-eqiad, 10cloud-services-team, 10decommission-hardware: decommission cloudvirt1025-cloudvirt1030.eqiad.wmnet - https://phabricator.wikimedia.org/T351010 (10Jclark-ctr) [16:14:58] (03CR) 10Bking: [C: 03+2] query_service: add monitoring for ldf endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [16:15:05] 10SRE, 10ops-eqiad, 10cloud-services-team, 10decommission-hardware: decommission cloudvirt1025-cloudvirt1030.eqiad.wmnet - https://phabricator.wikimedia.org/T351010 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [16:15:22] 10SRE-tools, 10Data-Persistence, 10Infrastructure-Foundations, 10Spicerack, and 3 others: Switch conftool to use the version 3 etcd datastore - https://phabricator.wikimedia.org/T350565 (10jbond) [16:15:53] (03CR) 10Ssingh: [C: 03+2] Revert "dns4003: remove dns4003 from authdns_servers for reboot" [puppet] - 10https://gerrit.wikimedia.org/r/977618 (owner: 10Ssingh) [16:17:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2057.mgmt.codfw.wmnet with reboot policy FORCED [16:17:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2058.mgmt.codfw.wmnet with reboot policy FORCED [16:17:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2033.mgmt.codfw.wmnet with reboot policy FORCED [16:17:08] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ganeti2034.mgmt.codfw.wmnet with reboot policy FORCED [16:18:25] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me. You can just go ahead and merge, I'll pick it up for the deb the next time it gets rebuilt for something else." [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/971924 (owner: 10Jbond) [16:18:54] PROBLEM - Check systemd state on ncredir4002 is CRITICAL: CRITICAL - degraded: The following units failed: tcp-mss-clamper.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:20:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2109'] [16:20:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logging-hd2001'] [16:20:11] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logging-hd2002'] [16:20:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['logging-hd2003'] [16:21:34] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti2034.mgmt.codfw.wmnet with reboot policy FORCED [16:23:26] (JobUnavailable) firing: (3) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:23:26] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:23:28] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: ServiceLVS without monitor breaks spicerack - 
https://phabricator.wikimedia.org/T339243 (10joanna_borun) p:05High→03Low [16:23:34] (03PS4) 10Bking: miscweb: Add wdqs ldf blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) [16:24:47] ncredir4002 alert is me ^^ [16:25:30] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (10Jclark-ctr) [16:25:35] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [16:27:07] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (10Jclark-ctr) [16:27:34] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (10Jclark-ctr) 05Stalled→03Resolved [16:27:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2057.mgmt.codfw.wmnet with reboot policy FORCED [16:28:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2058.mgmt.codfw.wmnet with reboot policy FORCED [16:28:22] (03CR) 10Muehlenhoff: [C: 03+2] Add support to write out blocked networks from requestctl [puppet] - 10https://gerrit.wikimedia.org/r/977145 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [16:28:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti2033.mgmt.codfw.wmnet with reboot policy FORCED [16:28:47] (03PS1) 10Filippo Giunchedi: prometheus: re-introduce distro-specific node-exporter arguments [puppet] - 10https://gerrit.wikimedia.org/r/977733 (https://phabricator.wikimedia.org/T351936) [16:28:51] (03PS1) 10Filippo Giunchedi: prometheus: exclude timer units from systemd collector [puppet] - 10https://gerrit.wikimedia.org/r/977734 
(https://phabricator.wikimedia.org/T351936) [16:29:34] (03CR) 10CI reject: [V: 04-1] prometheus: exclude timer units from systemd collector [puppet] - 10https://gerrit.wikimedia.org/r/977734 (https://phabricator.wikimedia.org/T351936) (owner: 10Filippo Giunchedi) [16:29:39] (03PS2) 10Filippo Giunchedi: prometheus: exclude timer units from systemd collector [puppet] - 10https://gerrit.wikimedia.org/r/977734 (https://phabricator.wikimedia.org/T351936) [16:29:41] (03CR) 10Krinkle: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [16:30:04] jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231127T1630). [16:30:16] (03PS1) 10Bking: Revert "query_service: add monitoring for ldf endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/977619 [16:30:54] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: dbbackups::monitoring [16:31:55] !log upload tcp-mss-clamper 0.2+deb12u1 to apt.wm.o (bookworm) [16:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:48] (03CR) 10CI reject: [V: 04-1] Revert "query_service: add monitoring for ldf endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/977619 (owner: 10Bking) [16:32:52] (03Abandoned) 10Jbond: idp: add datacenter-ops to puppetboard [puppet] - 10https://gerrit.wikimedia.org/r/951903 (https://phabricator.wikimedia.org/T341581) (owner: 10Jbond) [16:33:08] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977735 (https://phabricator.wikimedia.org/T128546) [16:33:18] (03Abandoned) 10Jbond: Ganeti: Add small script to display free resources in gnt groups [puppet] - 10https://gerrit.wikimedia.org/r/923608 (owner: 10Jbond) [16:33:55] 
(03Abandoned) 10Jbond: Firewall: Change the default firewall rule cloud environment [puppet] - 10https://gerrit.wikimedia.org/r/633157 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [16:35:00] (03PS2) 10Bking: Revert "query_service: add monitoring for ldf endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/977619 [16:35:39] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977735 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:35:44] (03PS3) 10Muehlenhoff: Enable requestctl-driven network blocks for sretest [puppet] - 10https://gerrit.wikimedia.org/r/977166 (https://phabricator.wikimedia.org/T348734) [16:36:34] RECOVERY - Check systemd state on ncredir4002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977166 (https://phabricator.wikimedia.org/T348734) (owner: 10Muehlenhoff) [16:37:01] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977735 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:39:14] !log pt1979@cumin1001 START - Cookbook sre.hosts.dhcp for host an-worker1157.eqiad.wmnet [16:39:52] (03PS1) 10Vgutierrez: interface::clsact: Fix unless cmd [puppet] - 10https://gerrit.wikimedia.org/r/977736 (https://phabricator.wikimedia.org/T351069) [16:40:10] (03CR) 10Bking: [C: 03+2] Revert "query_service: add monitoring for ldf endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/977619 (owner: 10Bking) [16:42:02] (03CR) 10Vgutierrez: [C: 03+2] interface::clsact: Fix unless cmd [puppet] - 10https://gerrit.wikimedia.org/r/977736 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:43:01] (03PS5) 10Bking: miscweb: Add wdqs ldf blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/977704 
(https://phabricator.wikimedia.org/T347355) [16:43:13] (03CR) 10Jbond: "fyi it would be good to migrate this to pki.discovery.wmnet at some point" [puppet] - 10https://gerrit.wikimedia.org/r/977697 (owner: 10Majavah) [16:43:13] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host an-worker1157.eqiad.wmnet [16:43:17] (03CR) 10CI reject: [V: 04-1] miscweb: Add wdqs ldf blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [16:44:49] (03PS1) 10Jbond: dbbackups::monitoring: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/977737 (https://phabricator.wikimedia.org/T349619) [16:45:09] (03PS6) 10Bking: miscweb: Add wdqs ldf blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) [16:45:25] (03CR) 10CI reject: [V: 04-1] miscweb: Add wdqs ldf blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [16:45:47] (03CR) 10Jbond: [C: 03+2] dbbackups::monitoring: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/977737 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [16:49:03] (03Abandoned) 10Jbond: DO NOT MERGE: test pcc bug [puppet] - 10https://gerrit.wikimedia.org/r/645448 (owner: 10Jbond) [16:49:20] (03Abandoned) 10Jbond: (DO NOT MERGE) testing CI [puppet] - 10https://gerrit.wikimedia.org/r/651790 (owner: 10Jbond) [16:49:30] (03Abandoned) 10Jbond: (DO NOT MERGE) testing CI [puppet] - 10https://gerrit.wikimedia.org/r/651922 (owner: 10Jbond) [16:49:40] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1157.eqiad.wmnet with OS bullseye [16:49:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host 
an-worker1157.eqiad.wmnet with OS bullseye [16:50:01] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:977735| Bumping portals to master (T128546)]] (duration: 07m 06s) [16:50:01] (03Abandoned) 10Jbond: pki: move the pki service so its avalible via dns discovery [puppet] - 10https://gerrit.wikimedia.org/r/656179 (owner: 10Jbond) [16:50:09] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: dbbackups::monitoring [16:50:13] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:50:26] (03CR) 10JHathaway: puppet-merge: add prometheus metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977184 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [16:50:32] (03Abandoned) 10Jbond: (WIP): add script to copy ldap entries to a local db [puppet] - 10https://gerrit.wikimedia.org/r/660869 (owner: 10Jbond) [16:50:54] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:22] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:37] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1001.eqiad.wmnet with OS bookworm [16:53:01] (03PS7) 10Bking: miscweb: Add wdqs ldf blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) [16:53:26] (03Abandoned) 10Jbond: service definitions: add custom type/provider to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/559536 (https://phabricator.wikimedia.org/T241160) (owner: 10Jbond) [16:53:36] (03Abandoned) 10Jbond: service_definitions: add defined ports [puppet] - 
10https://gerrit.wikimedia.org/r/670788 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [16:54:43] (03Abandoned) 10Jbond: P:base: manage /etc/services file [puppet] - 10https://gerrit.wikimedia.org/r/670810 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [16:54:58] (03Abandoned) 10Jbond: foo: add test module to check rakefile cehck [puppet] - 10https://gerrit.wikimedia.org/r/672699 (owner: 10Jbond) [16:55:42] (03Abandoned) 10Jbond: P:grafana: Update CAS config to authenticate users on the correct vhost [puppet] - 10https://gerrit.wikimedia.org/r/654813 (https://phabricator.wikimedia.org/T269272) (owner: 10Jbond) [16:55:49] (03PS1) 10Elukey: profile::pyrra::filesystem: reduce scope of the Lift Wing's pilot [puppet] - 10https://gerrit.wikimedia.org/r/977738 (https://phabricator.wikimedia.org/T351390) [16:56:33] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:977735| Bumping portals to master (T128546)]] (duration: 06m 31s) [16:56:45] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:58:19] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [17:00:45] (03PS1) 10Vgutierrez: interface::ipip: Fix /etc/network/interfaces up order [puppet] - 10https://gerrit.wikimedia.org/r/977739 [17:00:52] (03PS2) 10Jbond: O:grafana: move httpd to the P:grafana [puppet] - 10https://gerrit.wikimedia.org/r/673037 [17:00:54] (03PS2) 10Jbond: hiera - cloud: P:grafana now installes httpd so no need to do it seperatly [puppet] - 10https://gerrit.wikimedia.org/r/673038 [17:01:45] (03CR) 10CI reject: [V: 04-1] hiera - cloud: P:grafana now installes httpd so no need to do it seperatly [puppet] - 10https://gerrit.wikimedia.org/r/673038 (owner: 10Jbond) [17:04:00] (03CR) 10Jbond: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/673037 (owner: 10Jbond) 
[17:04:02] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:04:10] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:04:17] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/713/con" [puppet] - 10https://gerrit.wikimedia.org/r/977739 (owner: 10Vgutierrez) [17:05:19] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/714/console" [puppet] - 10https://gerrit.wikimedia.org/r/673037 (owner: 10Jbond) [17:05:30] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:05:32] (03CR) 10Ssingh: [C: 03+1] "Without looking much into the details but just your commit message, looks fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/977739 (owner: 10Vgutierrez) [17:05:36] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:05:52] (03CR) 10Klausman: [C: 03+1] profile::pyrra::filesystem: reduce scope of the Lift Wing's pilot [puppet] - 10https://gerrit.wikimedia.org/r/977738 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey) [17:06:17] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] interface::ipip: Fix /etc/network/interfaces up order [puppet] - 10https://gerrit.wikimedia.org/r/977739 (owner: 10Vgutierrez) [17:06:34] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1002.eqiad.wmnet with OS bookworm [17:06:51] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2057'] [17:06:57] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2058'] [17:07:04] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti2033'] [17:07:21] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['kubernetes2057'] [17:07:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['kubernetes2058'] [17:07:24] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['ganeti2033'] [17:07:43] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti2033'] [17:09:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 48.3% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - 
https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:11:23] (03PS2) 10Jbond: P:tcpircbot: fix minor style violations [puppet] - 10https://gerrit.wikimedia.org/r/673230 [17:11:40] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1-Q2): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10taavi) [17:11:42] 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10Jhancock.wm) [17:11:49] (03PS1) 10Majavah: P:alertmanager: wmcs: do not group by instance [puppet] - 10https://gerrit.wikimedia.org/r/977741 (https://phabricator.wikimedia.org/T352059) [17:11:56] (03PS2) 10Majavah: P:alertmanager: wmcs: do not group by instance [puppet] - 10https://gerrit.wikimedia.org/r/977741 (https://phabricator.wikimedia.org/T352059) [17:12:46] (03Abandoned) 10Jbond: Firewall: Change the default firewall rule fleet wide [puppet] - 10https://gerrit.wikimedia.org/r/633156 (https://phabricator.wikimedia.org/T264888) (owner: 10Jbond) [17:13:00] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (10Jhancock.wm) [17:13:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti2033'] [17:14:14] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 44.13% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:14:44] (03Abandoned) 10Jbond: P:debmonitor::server: switch to mod_proxy_uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/679458 (owner: 10Jbond) [17:15:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for 
Mediawiki mw-web at eqiad: 49.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:20:08] (03PS17) 10Jbond: P:netbase: parse the service catalogue and inject the service ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [17:20:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web at eqiad: 49.81% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [17:20:21] (03PS1) 10Majavah: team-wmcs: Merge systemd ForLong alert to the main one [alerts] - 10https://gerrit.wikimedia.org/r/977742 (https://phabricator.wikimedia.org/T352059) [17:20:23] (03PS1) 10Majavah: team-wmcs: improve host down alerts [alerts] - 10https://gerrit.wikimedia.org/r/977743 [17:22:15] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1002.eqiad.wmnet with reason: host reimage [17:22:32] (03PS18) 10Jbond: P:netbase: parse the service catalogue and inject the service ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 [17:22:35] (03CR) 10CI reject: [V: 04-1] P:netbase: parse the service catalogue and inject the service ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [17:23:08] (03PS2) 10Majavah: team-wmcs: improve host down alerts [alerts] - 10https://gerrit.wikimedia.org/r/977743 (https://phabricator.wikimedia.org/T352059) [17:23:22] (03PS3) 10Jbond: O:grafana: move httpd to the P:grafana [puppet] - 10https://gerrit.wikimedia.org/r/673037 [17:23:24] (03PS3) 10Jbond: hiera - cloud: P:grafana now installes httpd so no need to do it seperatly [puppet] - 
10https://gerrit.wikimedia.org/r/673038 [17:23:58] (03CR) 10CI reject: [V: 04-1] hiera - cloud: P:grafana now installes httpd so no need to do it seperatly [puppet] - 10https://gerrit.wikimedia.org/r/673038 (owner: 10Jbond) [17:24:21] (03CR) 10Jbond: "Let me know if this is still of intrest" [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [17:25:03] (03Abandoned) 10Jbond: P:trafficserver::backend: use ca-certificates.crt to talk to backends [puppet] - 10https://gerrit.wikimedia.org/r/683604 (owner: 10Jbond) [17:25:05] (03CR) 10CI reject: [V: 04-1] P:netbase: parse the service catalogue and inject the service ports [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [17:25:12] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1002.eqiad.wmnet with reason: host reimage [17:25:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/673105 (owner: 10Jbond) [17:25:59] (03Abandoned) 10Jbond: (WIP): add function to test if we are doing the initial puppet run [puppet] - 10https://gerrit.wikimedia.org/r/684321 (owner: 10Jbond) [17:28:23] (03PS1) 10Vgutierrez: interface::ipip: /etc/network/interfaces order issues (take #2) [puppet] - 10https://gerrit.wikimedia.org/r/977745 [17:28:53] (03CR) 10CI reject: [V: 04-1] interface::ipip: /etc/network/interfaces order issues (take #2) [puppet] - 10https://gerrit.wikimedia.org/r/977745 (owner: 10Vgutierrez) [17:29:15] (03PS3) 10Jbond: P:gitlab: add ability to manage gitlab sshd instance [puppet] - 10https://gerrit.wikimedia.org/r/684438 (https://phabricator.wikimedia.org/T276148) [17:29:17] (03PS3) 10Jbond: O:gitlab: manage sshd config [puppet] - 10https://gerrit.wikimedia.org/r/684439 (https://phabricator.wikimedia.org/T276148) [17:30:03] (03PS2) 10Vgutierrez: interface::ipip: 
/etc/network/interfaces order issues (take #2) [puppet] - 10https://gerrit.wikimedia.org/r/977745 [17:31:27] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/717/con" [puppet] - 10https://gerrit.wikimedia.org/r/977745 (owner: 10Vgutierrez) [17:31:33] (03CR) 10CI reject: [V: 04-1] P:gitlab: add ability to manage gitlab sshd instance [puppet] - 10https://gerrit.wikimedia.org/r/684438 (https://phabricator.wikimedia.org/T276148) (owner: 10Jbond) [17:32:15] (03Abandoned) 10Jbond: O:gitlab: manage sshd config [puppet] - 10https://gerrit.wikimedia.org/r/684439 (https://phabricator.wikimedia.org/T276148) (owner: 10Jbond) [17:32:36] (03Abandoned) 10Jbond: (DO NOT MERGE) enforce u2f logins for turnilo [puppet] - 10https://gerrit.wikimedia.org/r/685425 (https://phabricator.wikimedia.org/T280691) (owner: 10Jbond) [17:32:41] (03CR) 10Ssingh: [C: 03+1] "[same understanding as before!]" [puppet] - 10https://gerrit.wikimedia.org/r/977745 (owner: 10Vgutierrez) [17:32:54] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] interface::ipip: /etc/network/interfaces order issues (take #2) [puppet] - 10https://gerrit.wikimedia.org/r/977745 (owner: 10Vgutierrez) [17:33:03] (03Abandoned) 10Jbond: tlsproxy: make discovery the default cfssl_label in production [puppet] - 10https://gerrit.wikimedia.org/r/685511 (owner: 10Jbond) [17:36:44] (03CR) 10Jbond: puppet-merge: add prometheus metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977184 (https://phabricator.wikimedia.org/T350809) (owner: 10Jbond) [17:37:15] (03CR) 10Jbond: [C: 03+2] P:tcpircbot: fix minor style violations [puppet] - 10https://gerrit.wikimedia.org/r/673230 (owner: 10Jbond) [17:40:52] (03PS1) 10Vgutierrez: hiera: Enable IPIP on ulsfo LVS [puppet] - 10https://gerrit.wikimedia.org/r/977746 (https://phabricator.wikimedia.org/T351069) [17:41:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 
10https://gerrit.wikimedia.org/r/977677 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [17:42:26] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/719/con" [puppet] - 10https://gerrit.wikimedia.org/r/977746 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [17:45:29] (03PS2) 10Vgutierrez: hiera: Enable IPIP on ulsfo text|secondary LVS [puppet] - 10https://gerrit.wikimedia.org/r/977746 (https://phabricator.wikimedia.org/T351069) [17:46:57] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/977746 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [17:48:05] (03CR) 10Ssingh: [C: 03+1] hiera: Enable IPIP on ulsfo text|secondary LVS [puppet] - 10https://gerrit.wikimedia.org/r/977746 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [17:48:23] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1002.eqiad.wmnet with OS bookworm [17:51:53] (03PS3) 10Kosta Harlan: ORES: Set default value of OresLiftWingAddHostHeader to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976161 (https://phabricator.wikimedia.org/T351703) (owner: 10Ilias Sarantopoulos) [17:52:21] !log upload ipip-multiqueue-optimizer 0.2 to apt.wm.o (bullseye) - T351069 [17:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:26] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [17:53:37] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP on ulsfo text|secondary LVS [puppet] - 10https://gerrit.wikimedia.org/r/977746 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [17:58:51] 10SRE, 10ops-codfw: Port with no 
description on access switch - https://phabricator.wikimedia.org/T352027 (10phaultfinder) [17:59:28] (03PS1) 10Vgutierrez: profile::lvs: Fix ipip-multiqueue-optimizer systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/977761 (https://phabricator.wikimedia.org/T351069) [18:00:04] PROBLEM - Check systemd state on lvs4010 is CRITICAL: CRITICAL - degraded: The following units failed: ipip-multiqueue-optimizer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:07] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231127T1800) [18:00:07] ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231127T1800). [18:00:41] ^^ lvs4010 is me [18:00:52] (03CR) 10Dzahn: [C: 03+1] "lgtm, should fix the issue with duplicate declaration and otherwise is like the previously merged change" [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [18:00:59] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/721/con" [puppet] - 10https://gerrit.wikimedia.org/r/977761 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [18:02:00] PROBLEM - ensure kvm processes are running on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:03:50] 10SRE, 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T352027 (10phaultfinder) [18:04:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [18:05:10] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] profile::lvs: Fix ipip-multiqueue-optimizer systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/977761 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [18:07:15] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10Dzahn) 05Open→03In progress [18:07:25] !log restarting pybal on lvs4010 - T351069 [18:07:26] RECOVERY - Check systemd state on lvs4010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:33] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [18:09:59] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirtlocal1003.eqiad.wmnet with OS bookworm [18:10:21] (03PS1) 10Vgutierrez: service: Enable IPIP encapsulation for ncredir-https too [puppet] - 10https://gerrit.wikimedia.org/r/977764 (https://phabricator.wikimedia.org/T351069) [18:11:19] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977711 [18:11:51] !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host planet1003.eqiad.wmnet with OS bookworm [18:11:54] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/977764 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [18:11:56] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS 
bookworm [18:12:11] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977711 (owner: 10PipelineBot) [18:13:06] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:13:23] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/977711 (owner: 10PipelineBot) [18:17:30] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:22:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on planet1003.eqiad.wmnet with reason: host reimage [18:25:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on planet1003.eqiad.wmnet with reason: host reimage [18:25:30] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirtlocal1003.eqiad.wmnet with reason: host reimage [18:26:09] (03PS2) 10Fabfur: decom cp1075-1090 [puppet] - 10https://gerrit.wikimedia.org/r/977702 (https://phabricator.wikimedia.org/T349244) [18:27:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10Papaul) - first issue on 1157 the serial port address was set to COM1 and not com2 - second issue on 1157 boot order was set to network then disk making the server to ke... 
[18:28:08] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) Looping in @CDanis as the original author for the [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/789219 | cp1075 hiera overrides ]... [18:28:38] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirtlocal1003.eqiad.wmnet with reason: host reimage [18:31:15] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:33] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:37:19] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:37:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host planet1003.eqiad.wmnet with OS bookworm [18:37:59] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host planet1003.eqiad.wmnet with OS bookworm completed: - planet1003 (**PASS**) - Downti... 
[18:41:50] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 2 VMs %request for planet - https://phabricator.wikimedia.org/T351849 (10Dzahn) 05In progress→03Resolved a:03Dzahn [18:43:05] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:45:36] (03PS4) 10Dzahn: site: add planet[12]003 to planet role [puppet] - 10https://gerrit.wikimedia.org/r/976854 (https://phabricator.wikimedia.org/T348392) [18:48:18] (03CR) 10Dzahn: [C: 03+2] site: add planet[12]003 to planet role [puppet] - 10https://gerrit.wikimedia.org/r/976854 (https://phabricator.wikimedia.org/T348392) (owner: 10Dzahn) [18:52:40] (03PS1) 10Ryan Kemper: Search: add new SLOs [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/977770 (https://phabricator.wikimedia.org/T338009) [18:53:34] (03CR) 10Ryan Kemper: [C: 03+1] miscweb: Add wdqs ldf blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [18:54:51] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:56:57] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirtlocal1003.eqiad.wmnet with OS bookworm [18:57:34] (03PS2) 10Vgutierrez: service: Enable IPIP encapsulation for ncredir-https too [puppet] - 10https://gerrit.wikimedia.org/r/977764 (https://phabricator.wikimedia.org/T351069) [18:57:36] (03PS1) 10Vgutierrez: wmflib: Test get_ipport_for_ipip_services [puppet] - 10https://gerrit.wikimedia.org/r/977771 [19:00:12] PROBLEM - Check systemd state on planet2003 is CRITICAL: CRITICAL - degraded: The following units failed: 
planet-update-ar.service,planet-update-bg.service,planet-update-cs.service,planet-update-de.service,planet-update-el.service,planet-update-en.service,planet-update-es.service,planet-update-fr.service,planet-update-gmq.service,planet-update-id.service,planet-update-it.service,planet-update-pl.service,planet-update-pt.service,planet-upd [19:00:12] ervice,planet-update-ru.service,planet-update-sq.service,planet-update-uk.service,planet-update-zh.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:26] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/977764 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [19:07:36] (03PS2) 10Muehlenhoff: Use separate /etc/ganeti/ssl directory if using PKI [puppet] - 10https://gerrit.wikimedia.org/r/977677 (https://phabricator.wikimedia.org/T350686) [19:07:44] (03CR) 10Muehlenhoff: Use separate /etc/ganeti/ssl directory if using PKI (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977677 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [19:10:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977677 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [19:11:33] (03CR) 10Bking: [C: 03+2] miscweb: Add wdqs ldf blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/977704 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:12:34] (03PS3) 10Vgutierrez: service: Enable IPIP encapsulation for ncredir-https too [puppet] - 10https://gerrit.wikimedia.org/r/977764 (https://phabricator.wikimedia.org/T351069) [19:12:47] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977764 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [19:14:23] 10SRE, 10SRE-Access-Requests: Requesting access to 
analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10kzimmerman) Contract update: Hamid Ghani's contract has been extended, with a new end date of June 30, 2024. @OSefu-WMF is now Hamid's manager. [19:14:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10kzimmerman) [19:17:31] (ProbeDown) firing: Service planet2003:443 has failed probes (http_en_planet_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#planet2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:19:50] ^ new host being set up. it's kind of a bug that it reports here though [19:20:35] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on planet2003.codfw.wmnet with reason: maintenance [19:20:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on planet2003.codfw.wmnet with reason: maintenance [19:21:07] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on planet1003.eqiad.wmnet with reason: maintenance [19:21:32] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on planet1003.eqiad.wmnet with reason: maintenance [19:28:30] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:28:44] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:32:31] (ProbeDown) firing: Service miscweb1003:443 has failed probes (http_query_wikidata_org_ldf_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#miscweb1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:35:04] inflatador: puppet is fixed but this means the actual check is applied now [19:35:29] there are a couple reasons it might fail, one being whether it checks via v4 or v6 [19:36:04] mutante thanks for the ping, this is a complex check for my first try. Body and/or regex might need tweaking ;) [19:36:32] let me see what it looks like when I run the check manually [19:36:40] inflatador: don't worry, for the record I needed like 3 attempts to make these work each time I added one [19:36:57] I remember changing the protocol version, v4, v6 or both [19:37:05] and other stuff [19:37:25] yea, regex :) [19:38:40] (03PS1) 10Vgutierrez: service: Enable IPIP encapsulation for ncredir-https too [puppet] - 10https://gerrit.wikimedia.org/r/977778 (https://phabricator.wikimedia.org/T351069) [19:39:16] (03CR) 10CI reject: [V: 04-1] service: Enable IPIP encapsulation for ncredir-https too [puppet] - 10https://gerrit.wikimedia.org/r/977778 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [19:39:49] (03PS4) 10Vgutierrez: service: Enable IPIP encapsulation for ncredir-https too [puppet] - 10https://gerrit.wikimedia.org/r/977764 (https://phabricator.wikimedia.org/T351069) [19:40:09] (03Abandoned) 10Vgutierrez: service: Enable IPIP encapsulation for ncredir-https too [puppet] - 10https://gerrit.wikimedia.org/r/977778 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [19:43:18] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/724/con" [puppet] - 10https://gerrit.wikimedia.org/r/977764 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [19:44:29] (03CR) 10Vgutierrez: [C: 03+2] wmflib: Test 
get_ipport_for_ipip_services [puppet] - 10https://gerrit.wikimedia.org/r/977771 (owner: 10Vgutierrez) [19:46:32] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] service: Enable IPIP encapsulation for ncredir-https too [puppet] - 10https://gerrit.wikimedia.org/r/977764 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [19:50:17] !log restarting pybal on lvs4010 - T351069 [19:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:34] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [19:52:31] (ProbeDown) firing: (2) Service miscweb1003:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:53:19] !log restarting pybal on lvs4008 (effectively enabling IPIP encapsulation on ncredir@ulsfo) - T351069 [19:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:07] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:57:21] aaand that's me [19:57:50] !incidents [19:57:50] 4282 (UNACKED) [2x] ProbeDown sre (ncredir-https:443 probes/service ulsfo) [19:58:00] !ack 4282 [19:58:00] 4282 (ACKED) [2x] ProbeDown sre (ncredir-https:443 probes/service ulsfo) [19:58:08] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:58:25] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:01:47] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:47] (03PS1) 10Vgutierrez: service: Disable IPIP encapsulation for ncredir@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/977782 (https://phabricator.wikimedia.org/T351069) [20:13:03] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/725/con" [puppet] - 10https://gerrit.wikimedia.org/r/977782 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [20:13:52] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] service: Disable IPIP encapsulation for ncredir@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/977782 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [20:16:14] !log rolling restart of pybal on lvs4010 and lvs4008 - T351069 [20:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:19] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [20:18:25] (ProbeDown) firing: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:21:06] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - https://phabricator.wikimedia.org/T322145 (10Dzahn) 05Resolved→03Open [20:21:16] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Hghani - 
https://phabricator.wikimedia.org/T322145 (10Dzahn) a:05jbond→03None [20:22:07] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:23:25] (ProbeDown) resolved: (2) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ncredir-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:23:26] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:23:27] !incidents [20:23:28] 4282 (RESOLVED) [2x] ProbeDown sre (ncredir-https:443 probes/service ulsfo) [20:23:31] (JobUnavailable) firing: (3) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:24:15] (03PS1) 10Kimberly Sarabia: Define the corresponding stream for scroll [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) [20:29:28] 10SRE, 10Traffic, 10Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez) [20:30:59] Hello, 503s for en.wiki in the UK, not investigated further [20:31:31] and resolved, ignore that then :) [20:32:46] (03PS2) 10Andrew Bogott: codfw1dev: open radosgw API to the internet [puppet] - 10https://gerrit.wikimedia.org/r/936657 (https://phabricator.wikimedia.org/T341380) 
(owner: 10Arturo Borrero Gonzalez) [20:33:29] (03Abandoned) 10Andrew Bogott: codfw1dev: open radosgw API to the internet [puppet] - 10https://gerrit.wikimedia.org/r/936657 (https://phabricator.wikimedia.org/T341380) (owner: 10Arturo Borrero Gonzalez) [20:33:32] TheresNoTime: had that too, but it went away after a minute [20:34:44] (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [20:34:55] uh? [20:36:25] didn't page here hmm [20:36:47] looking [20:37:41] T352094 got logged a moment ago, but evidently wider than that and not a root cause [20:37:41] T352094: 503 on Wikidata - https://phabricator.wikimedia.org/T352094 [20:38:49] already disappeared? [20:39:14] it looks like that [20:39:27] esams definitely did get hit [20:39:44] (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [20:39:52] I've seen a bump in eqiad too [20:40:01] just for text [20:40:58] see security [20:49:59] (03CR) 10Jon Harald Søby: [C: 04-1] "I am not sure this is the correct logo to use, so please hold a bit before merging." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/975379 (https://phabricator.wikimedia.org/T350241) (owner: 10Anzx) [20:52:11] 10SRE, 10ops-esams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10RobH) [20:52:13] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1157.eqiad.wmnet with OS bullseye [20:52:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye [20:56:16] 10SRE, 10SRE-Access-Requests: Requesting access to "researchers" and "analytics-privatedata-users" for Xiao Xiao - https://phabricator.wikimedia.org/T352098 (10XiaoXiao-WMF) [21:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231127T2100). [21:00:06] tgr and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:15] (03PS1) 10Bking: miscweb: resolve wdqs ldf endpoint to wdqs1015 [puppet] - 10https://gerrit.wikimedia.org/r/977787 (https://phabricator.wikimedia.org/T347355) [21:00:55] o/ i can deploy [21:01:32] (03PS4) 10Clare Ming: CentralAuth: Fix wikisource.org cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976864 (https://phabricator.wikimedia.org/T351685) (owner: 10Gergő Tisza) [21:01:52] o/ [21:02:13] hi tgr - will start with yours [21:02:34] kosta couldn't come but his patch is safe and doesn't require any testing [21:02:42] !log btullis@deploy2002 Started deploy [airflow-dags/analytics_test@0283c11]: (no justification provided) [21:02:53] !log btullis@deploy2002 Finished deploy [airflow-dags/analytics_test@0283c11]: (no justification provided) (duration: 00m 11s) [21:02:58] cool - thanks for letting me know - i'll follow up with his then afterwards [21:03:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976864 (https://phabricator.wikimedia.org/T351685) (owner: 10Gergő Tisza) [21:04:12] (03Merged) 10jenkins-bot: CentralAuth: Fix wikisource.org cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976864 (https://phabricator.wikimedia.org/T351685) (owner: 10Gergő Tisza) [21:04:27] !log cjming@deploy2002 Started scap: Backport for [[gerrit:976864|CentralAuth: Fix wikisource.org cookie handling (T351685)]] [21:04:33] T351685: I keep getting logged out on Wikisource - https://phabricator.wikimedia.org/T351685 [21:05:03] (03CR) 10Ejegg: "Aha! Thanks so much @hashar." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971281 (https://phabricator.wikimedia.org/T254808) (owner: 10Ejegg) [21:05:43] !log cjming@deploy2002 cjming and tgr: Backport for [[gerrit:976864|CentralAuth: Fix wikisource.org cookie handling (T351685)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:05:47] tgr: are you able to test? 
[21:06:07] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/977787 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [21:06:47] cjming: yes, just a sec [21:07:47] !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1157.eqiad.wmnet with OS bullseye [21:07:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye executed with errors: - an-worke... [21:08:03] (03Abandoned) 10Ejegg: Allow crawling FundraiserLandingPage in robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971281 (https://phabricator.wikimedia.org/T254808) (owner: 10Ejegg) [21:09:47] cjming: it's working, thanks [21:09:53] great - syncing [21:09:58] !log cjming@deploy2002 cjming and tgr: Continuing with sync [21:10:08] (03PS2) 10Bking: miscweb: resolve wdqs ldf endpoint to wdqs1015 [puppet] - 10https://gerrit.wikimedia.org/r/977787 (https://phabricator.wikimedia.org/T347355) [21:10:37] (03CR) 10CI reject: [V: 04-1] miscweb: resolve wdqs ldf endpoint to wdqs1015 [puppet] - 10https://gerrit.wikimedia.org/r/977787 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [21:11:18] (03PS3) 10Bking: miscweb: resolve wdqs ldf endpoint to wdqs1015 [puppet] - 10https://gerrit.wikimedia.org/r/977787 (https://phabricator.wikimedia.org/T347355) [21:15:54] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:976864|CentralAuth: Fix wikisource.org cookie handling (T351685)]] (duration: 11m 26s) [21:15:59] T351685: I keep getting logged out on Wikisource - https://phabricator.wikimedia.org/T351685 [21:16:11] (03PS4) 10Clare Ming: ORES: Set default value of OresLiftWingAddHostHeader to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976161 
(https://phabricator.wikimedia.org/T351703) (owner: 10Ilias Sarantopoulos) [21:16:19] tgr: should be live! [21:17:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976161 (https://phabricator.wikimedia.org/T351703) (owner: 10Ilias Sarantopoulos) [21:17:58] (03Merged) 10jenkins-bot: ORES: Set default value of OresLiftWingAddHostHeader to true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976161 (https://phabricator.wikimedia.org/T351703) (owner: 10Ilias Sarantopoulos) [21:18:11] !log cjming@deploy2002 Started scap: Backport for [[gerrit:976161|ORES: Set default value of OresLiftWingAddHostHeader to true (T351703)]] [21:18:17] T351703: Update ORES extension configuration - https://phabricator.wikimedia.org/T351703 [21:18:21] thanks! [21:19:25] !log cjming@deploy2002 isaranto and cjming: Backport for [[gerrit:976161|ORES: Set default value of OresLiftWingAddHostHeader to true (T351703)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:19:32] !log cjming@deploy2002 isaranto and cjming: Continuing with sync [21:26:08] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:976161|ORES: Set default value of OresLiftWingAddHostHeader to true (T351703)]] (duration: 07m 57s) [21:26:13] T351703: Update ORES extension configuration - https://phabricator.wikimedia.org/T351703 [21:27:06] i think i'll call it early and close the window since there's nothing else in the queue [21:28:01] !log end of UTC late backport window [21:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:02] !log jhathaway@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1157.eqiad.wmnet with OS bullseye [21:42:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by 
jhathaway@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye [21:48:32] (CR) JHathaway: [C: +1] "seems worth it for WMCS" [puppet] - https://gerrit.wikimedia.org/r/550459 (https://phabricator.wikimedia.org/T237994) (owner: Jbond) [21:49:23] (CR) JHathaway: [C: +1] "looks good to me" [puppet] - https://gerrit.wikimedia.org/r/617157 (owner: Jbond) [21:51:26] (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/977184 (https://phabricator.wikimedia.org/T350809) (owner: Jbond) [21:56:08] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1157.eqiad.wmnet with reason: host reimage [21:57:49] (CR) Dzahn: [C: +1] "I'm all for merging this. Nitpick is that we should avoid hardcoded single host names like that in puppet code. Would be nicer to create w" [puppet] - https://gerrit.wikimedia.org/r/977787 (https://phabricator.wikimedia.org/T347355) (owner: Bking) [21:59:16] SRE, SRE-Access-Requests: Requesting access to "researchers" and "analytics-privatedata-users" for Xiao Xiao - https://phabricator.wikimedia.org/T352098 (XiaoXiao-WMF) @leila please approve my request for shell access [21:59:26] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1157.eqiad.wmnet with reason: host reimage [22:00:05] Reedy, sbassett, Maryum, and manfredi: Time to snap out of that daydream and deploy Weekly Security deployment window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231127T2200). [22:00:06] SRE, SRE-Access-Requests: Requesting access to "researchers" and "analytics-privatedata-users" for Xiao Xiao - https://phabricator.wikimedia.org/T352098 (leila) approved.
[22:04:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [22:06:55] (CR) Bking: [C: +2] miscweb: resolve wdqs ldf endpoint to wdqs1015 [puppet] - https://gerrit.wikimedia.org/r/977787 (https://phabricator.wikimedia.org/T347355) (owner: Bking) [22:13:47] !log jhathaway@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhathaway@cumin1001" [22:17:39] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhathaway@cumin1001" [22:17:45] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1157.eqiad.wmnet with OS bullseye [22:17:52] SRE, ops-eqiad, DC-Ops, Data-Engineering: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhathaway@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye completed: - an-worker1157 (*...
[22:22:31] (ProbeDown) firing: (3) Service miscweb1003:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:28:22] (CR) Andrew Bogott: [C: +2] Remove obsolete Partman recipe [puppet] - https://gerrit.wikimedia.org/r/965484 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff) [22:32:31] (ProbeDown) firing: (4) Service miscweb1003:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:42:31] (ProbeDown) resolved: Service miscweb2003:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#miscweb2003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:45:51] (PS1) Gergő Tisza: mobile: Remove $wgMobileUrlTemplate [mediawiki-config] - https://gerrit.wikimedia.org/r/977791 [22:48:57] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-es.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:52:03] (PS6) Andrew Bogott: cloudnfs: refactor configuration [puppet] - https://gerrit.wikimedia.org/r/931584 (owner: Majavah) [22:54:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on planet1002.eqiad.wmnet with reason: maintenance [22:55:04] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on planet1002.eqiad.wmnet with reason: maintenance [23:02:51] (CR) Andrew Bogott: [C: +1] "https://puppet-compiler.wmflabs.org/output/931584/727/" [puppet] - https://gerrit.wikimedia.org/r/931584 (owner:
Majavah) [23:11:35] (PS1) Papaul: Add new restbase node to site.pp and apt_repo.yaml [puppet] - https://gerrit.wikimedia.org/r/977795 (https://phabricator.wikimedia.org/T349758) [23:11:45] (CR) Andrew Bogott: [C: +2] dbproxy: change ownership to wmcs [puppet] - https://gerrit.wikimedia.org/r/952455 (https://phabricator.wikimedia.org/T337848) (owner: Jbond) [23:12:05] (CR) CI reject: [V: -1] Add new restbase node to site.pp and apt_repo.yaml [puppet] - https://gerrit.wikimedia.org/r/977795 (https://phabricator.wikimedia.org/T349758) (owner: Papaul) [23:13:38] (CR) Andrew Bogott: [C: +2] puppetserver: create a necessary parent dirs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/975073 (owner: Andrew Bogott) [23:13:41] (PS1) Dzahn: planet: only use plugindirs config setting on buster [puppet] - https://gerrit.wikimedia.org/r/977796 (https://phabricator.wikimedia.org/T348392) [23:14:23] (CR) Dzahn: [C: +2] planet: only use plugindirs config setting on buster [puppet] - https://gerrit.wikimedia.org/r/977796 (https://phabricator.wikimedia.org/T348392) (owner: Dzahn) [23:16:24] (PS2) Papaul: Add new restbase node to site.pp and apt_repo.yaml [puppet] - https://gerrit.wikimedia.org/r/977795 (https://phabricator.wikimedia.org/T349758) [23:16:52] (CR) CI reject: [V: -1] Add new restbase node to site.pp and apt_repo.yaml [puppet] - https://gerrit.wikimedia.org/r/977795 (https://phabricator.wikimedia.org/T349758) (owner: Papaul) [23:19:44] (PS3) Papaul: Add new restbase node to site.pp and apt_repo.yaml [puppet] - https://gerrit.wikimedia.org/r/977795 (https://phabricator.wikimedia.org/T349758) [23:19:47] (CR) Dzahn: [C: -2] "syntax error in line 1 of site.pp" [puppet] - https://gerrit.wikimedia.org/r/977795 (https://phabricator.wikimedia.org/T349758) (owner: Papaul) [23:20:44] (PS1) Andrew Bogott: cloud-vps vendordata: install all new packages from wmf repos [puppet] -
https://gerrit.wikimedia.org/r/977797 [23:24:06] (CR) Andrew Bogott: [C: +2] cloud-vps vendordata: install all new packages from wmf repos [puppet] - https://gerrit.wikimedia.org/r/977797 (owner: Andrew Bogott) [23:25:05] (CR) Dzahn: [C: +1] "looks ok now but you might have to set the Hiera values for puppet 7" [puppet] - https://gerrit.wikimedia.org/r/977795 (https://phabricator.wikimedia.org/T349758) (owner: Papaul) [23:25:23] (CR) Papaul: [C: +2] Add new restbase node to site.pp and apt_repo.yaml [puppet] - https://gerrit.wikimedia.org/r/977795 (https://phabricator.wikimedia.org/T349758) (owner: Papaul) [23:28:52] (PS1) Dzahn: hieradata: delete puppet7 hiera keys for planet hosts [puppet] - https://gerrit.wikimedia.org/r/977798 [23:32:11] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.193 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:38:32] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (Papaul) [23:47:18] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host restbase2028.codfw.wmnet with OS bullseye [23:47:24] SRE, ops-codfw, DC-Ops, Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host restbase2028.codfw.wmnet with OS bullseye [23:47:48] (CR) Jdlrobson: [C: +1] mobile: Remove $wgMobileUrlTemplate [mediawiki-config] - https://gerrit.wikimedia.org/r/977791 (owner: Gergő Tisza) [23:56:25] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state