[00:10:31] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:30:31] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:38:51] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/965570
[00:38:57] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/965570 (owner: TrainBranchBot)
[00:56:10] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/965570 (owner: TrainBranchBot)
[01:01:48] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:06:10] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:15:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:25] (PS1) C. Scott Ananian: Enable Parsoid interal REST API only on Parsoid cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980)
[02:20:32] (PS1) Sharvaniharan: New stream for Android Patroller tasks feature [mediawiki-config] - https://gerrit.wikimedia.org/r/965610
[02:22:49] (PS2) Sharvaniharan: New stream for Android Patroller tasks feature [mediawiki-config] - https://gerrit.wikimedia.org/r/965610 (https://phabricator.wikimedia.org/T348816)
[02:23:52] (PS3) Sharvaniharan: New stream for Android Patroller tasks feature [mediawiki-config] - https://gerrit.wikimedia.org/r/965610 (https://phabricator.wikimedia.org/T348816)
[02:25:32] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[02:38:33] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:55:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[02:58:24] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:59:32] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:03:38] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:44] !log on x1 wikishared, created loginnotify_seen_net table T346989
[03:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:08:48] T346989: Deploy LoginNotify seen subnets table - https://phabricator.wikimedia.org/T346989
[03:09:57] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T348706 (phaultfinder)
[03:16:12] SRE-swift-storage, Commons: files uploaded to Commons without images - https://phabricator.wikimedia.org/T348827 (Pppery)
[03:16:40] SRE-swift-storage, Commons: files uploaded to Commons without images - https://phabricator.wikimedia.org/T348827 (Pppery)
[03:17:06] SRE-swift-storage, Commons, MediaWiki-File-management: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (Pppery)
[03:20:31] !log on non-CentralAuth wikis, created the loginnotify_seen_net table T346989
[03:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:20:35] T346989: Deploy LoginNotify seen subnets table - https://phabricator.wikimedia.org/T346989
[03:41:10] (PS2) MPGuy2824: InitialiseSettings-labs: Set values for renamed PageTriage variable [mediawiki-config] - https://gerrit.wikimedia.org/r/965395 (https://phabricator.wikimedia.org/T331595)
[03:55:54] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:00:54] PROBLEM - Query Service HTTP Port on wdqs2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[04:01:04] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:13:39] SRE-swift-storage, Commons, MediaWiki-Uploading, MW-1.41-notes (1.41.0-wmf.25; 2023-09-05), and 2 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (PseudoSkull) Hi, just wanted to let you guys know...
[04:15:34] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[04:35:31] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:14:18] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:14:48] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:34:02] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:58:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T343198)', diff saved to https://phabricator.wikimedia.org/P52921 and previous config saved to /var/cache/conftool/dbconfig/20231013-055809-arnaudb.json
[05:58:14] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231013T0600)
[06:09:40] (PS1) Tim Starling: Add LoginNotify cron job [puppet] - https://gerrit.wikimedia.org/r/965620 (https://phabricator.wikimedia.org/T346989)
[06:10:07] (CR) CI reject: [V: -1] Add LoginNotify cron job [puppet] - https://gerrit.wikimedia.org/r/965620 (https://phabricator.wikimedia.org/T346989) (owner: Tim Starling)
[06:13:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P52922 and previous config saved to /var/cache/conftool/dbconfig/20231013-061315-arnaudb.json
[06:28:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P52923 and previous config saved to /var/cache/conftool/dbconfig/20231013-062821-arnaudb.json
[06:33:08] RECOVERY - Check systemd state on apt1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:38:09] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on debmonitor2002.codfw.wmnet with reason: setup in progress
[06:38:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on debmonitor2002.codfw.wmnet with reason: setup in progress
[06:38:25] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: sync
[06:38:31] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: setup in progress
[06:38:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: setup in progress
[06:39:10] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: sync
[06:39:14] (PS9) Muehlenhoff: Add a prometheus check for whether nftables is running [puppet] - https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499)
[06:43:13] !log installing Linux 5.10.197 updates from Bullseye point release (no reboots, just installing the new kernels)
[06:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T343198)', diff saved to https://phabricator.wikimedia.org/P52924 and previous config saved to /var/cache/conftool/dbconfig/20231013-064328-arnaudb.json
[06:43:30] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[06:43:32] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198
[06:43:55] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[06:44:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T343198)', diff saved to https://phabricator.wikimedia.org/P52925 and previous config saved to /var/cache/conftool/dbconfig/20231013-064400-arnaudb.json
[06:45:48] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 150552
[06:46:28] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 150552
[06:48:29] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 15133
[06:54:50] (PS5) Ilias Sarantopoulos: ml-services: update revscoring and enable articlequality mp [deployment-charts] - https://gerrit.wikimedia.org/r/965479 (https://phabricator.wikimedia.org/T348265)
[06:54:59] (CR) Ilias Sarantopoulos: ml-services: update revscoring and enable articlequality mp (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/965479 (https://phabricator.wikimedia.org/T348265) (owner: Ilias Sarantopoulos)
[06:55:13] (CR) Ilias Sarantopoulos: ml-services: update revscoring and enable articlequality mp (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/965479 (https://phabricator.wikimedia.org/T348265) (owner: Ilias Sarantopoulos)
[07:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231013T0700)
[07:01:17] (CR) Hashar: [C: +1] "Looks good thanks! And that is cleaner than my series of 3 changes (which end up not even working :D )" [puppet] - https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) (owner: Jbond)
[07:01:43] (Abandoned) Hashar: zuul: move Gerrit key from merger to server [puppet] - https://gerrit.wikimedia.org/r/965103 (owner: Hashar)
[07:01:45] (PS1) Muehlenhoff: Remove obsolete profile::nftables::basefirewall [puppet] - https://gerrit.wikimedia.org/r/965651
[07:01:53] (Abandoned) Hashar: zuul: get ssh key from Puppet collected resource [puppet] - https://gerrit.wikimedia.org/r/965106 (owner: Hashar)
[07:02:40] (Abandoned) Hashar: ci: add Gerrit ssh key to ssh_known_hosts [puppet] - https://gerrit.wikimedia.org/r/961025 (https://phabricator.wikimedia.org/T328543) (owner: Hashar)
[07:03:33] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:03:41] (CR) Ilias Sarantopoulos: [C: +2] ml-services: update revscoring and enable articlequality mp [deployment-charts] - https://gerrit.wikimedia.org/r/965479 (https://phabricator.wikimedia.org/T348265) (owner: Ilias Sarantopoulos)
[07:04:58] (Merged) jenkins-bot: ml-services: update revscoring and enable articlequality mp [deployment-charts] - https://gerrit.wikimedia.org/r/965479 (https://phabricator.wikimedia.org/T348265) (owner: Ilias Sarantopoulos)
[07:36:06] SRE, Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (MoritzMuehlenhoff)
[07:36:39] SRE, Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (MoritzMuehlenhoff)
[07:36:58] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:37:42] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:40:23] (CR) Filippo Giunchedi: "I like the idea! See inline" [puppet] - https://gerrit.wikimedia.org/r/965533 (https://phabricator.wikimedia.org/T348499) (owner: Muehlenhoff)
[07:42:22] (PS1) Muehlenhoff: backup::host: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/965656
[07:44:24] (PS1) Ilias Sarantopoulos: ml-services: fix articlequality staging resources [deployment-charts] - https://gerrit.wikimedia.org/r/965657 (https://phabricator.wikimedia.org/T348265)
[07:45:11] (CR) Elukey: [C: +1] ml-services: fix articlequality staging resources [deployment-charts] - https://gerrit.wikimedia.org/r/965657 (https://phabricator.wikimedia.org/T348265) (owner: Ilias Sarantopoulos)
[07:50:27] (CR) Ilias Sarantopoulos: [C: +2] ml-services: fix articlequality staging resources [deployment-charts] - https://gerrit.wikimedia.org/r/965657 (https://phabricator.wikimedia.org/T348265) (owner: Ilias Sarantopoulos)
[07:51:18] (Merged) jenkins-bot: ml-services: fix articlequality staging resources [deployment-charts] - https://gerrit.wikimedia.org/r/965657 (https://phabricator.wikimedia.org/T348265) (owner: Ilias Sarantopoulos)
[07:51:21] SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Uploading, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (MatthewVernon)
[07:54:37] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[07:54:49] (PS5) Muehlenhoff: Add a define to run periodic metric checks [puppet] - https://gerrit.wikimedia.org/r/965533 (https://phabricator.wikimedia.org/T348499)
[07:54:54] (CR) Muehlenhoff: Add a define to run periodic metric checks (5 comments) [puppet] - https://gerrit.wikimedia.org/r/965533 (https://phabricator.wikimedia.org/T348499) (owner: Muehlenhoff)
[07:55:15] (CR) CI reject: [V: -1] Add a define to run periodic metric checks [puppet] - https://gerrit.wikimedia.org/r/965533 (https://phabricator.wikimedia.org/T348499) (owner: Muehlenhoff)
[07:58:18] (PS6) Muehlenhoff: Add a define to run periodic metric checks [puppet] - https://gerrit.wikimedia.org/r/965533 (https://phabricator.wikimedia.org/T348499)
[08:03:05] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/965656 (owner: Muehlenhoff)
[08:20:31] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[08:22:27] (PS1) Muehlenhoff: Setup a prerouting chain in the base table to exempt traffic from conntrack [puppet] - https://gerrit.wikimedia.org/r/965659 (https://phabricator.wikimedia.org/T348735)
[08:30:46] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/965533 (https://phabricator.wikimedia.org/T348499) (owner: Muehlenhoff)
[08:30:54] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:31:46] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:33:12] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:33:48] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:34:52] (CR) Filippo Giunchedi: [C: -1] "Idea is good, won't work as-is" [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[08:36:26] (PS1) Tim Starling: Enable LoginNotify seen subnets table [mediawiki-config] - https://gerrit.wikimedia.org/r/965663 (https://phabricator.wikimedia.org/T346989)
[08:37:54] PROBLEM - puppet last run on sretest1001 is CRITICAL: CRITICAL: Puppet last ran 21 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:40:31] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[08:41:02] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:41:15] (PS1) Muehlenhoff: nftables::service Write out notrack rules for services skipping conntrack [puppet] - https://gerrit.wikimedia.org/r/965664 (https://phabricator.wikimedia.org/T348735)
[08:41:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15133
[08:41:50] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:43:58] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:44:44] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:44:45] (PS7) Muehlenhoff: Add a define to run periodic metric checks [puppet] - https://gerrit.wikimedia.org/r/965533 (https://phabricator.wikimedia.org/T348499)
[08:53:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[08:54:40] (PS1) AikoChou: ml-services: set OMP_NUM_THREADS for revertrisk-la [deployment-charts] - https://gerrit.wikimedia.org/r/965666 (https://phabricator.wikimedia.org/T347550)
[08:57:52] (CR) David Caro: [C: +2] metricsinfra.alertmanager: add victorops and paging route [puppet] - https://gerrit.wikimedia.org/r/965475 (https://phabricator.wikimedia.org/T323713) (owner: David Caro)
[08:59:38] RECOVERY - puppet last run on sretest1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:59:52] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/965092 (owner: Muehlenhoff)
[09:01:56] (PS2) AikoChou: ml-services: set OMP_NUM_THREADS for revertrisk-la [deployment-charts] - https://gerrit.wikimedia.org/r/965666 (https://phabricator.wikimedia.org/T347550)
[09:02:13] (CR) Hashar: [C: +1] "I have cherry picked the change on integration-puppetmaster-02.integration.eqiad.wmflabs" [puppet] - https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) (owner: Jbond)
[09:05:41] (CR) Jbond: "lgtm but see comment inline" [puppet] - https://gerrit.wikimedia.org/r/965461 (owner: Majavah)
[09:07:26] (CR) Elukey: [C: +1] ml-services: set OMP_NUM_THREADS for revertrisk-la [deployment-charts] - https://gerrit.wikimedia.org/r/965666 (https://phabricator.wikimedia.org/T347550) (owner: AikoChou)
[09:07:54] (CR) AikoChou: [C: +2] ml-services: set OMP_NUM_THREADS for revertrisk-la [deployment-charts] - https://gerrit.wikimedia.org/r/965666 (https://phabricator.wikimedia.org/T347550) (owner: AikoChou)
[09:08:08] (PS1) Majavah: radosgw: Enforce header name and CSP [puppet] - https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961)
[09:08:41] (CR) CI reject: [V: -1] radosgw: Enforce header name and CSP [puppet] - https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961) (owner: Majavah)
[09:08:45] (Merged) jenkins-bot: ml-services: set OMP_NUM_THREADS for revertrisk-la [deployment-charts] - https://gerrit.wikimedia.org/r/965666 (https://phabricator.wikimedia.org/T347550) (owner: AikoChou)
[09:10:20] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[09:15:39] (CR) Jbond: "lgtm" [puppet] - https://gerrit.wikimedia.org/r/965536 (https://phabricator.wikimedia.org/T337970) (owner: JHathaway)
[09:22:41] (PS2) Majavah: radosgw: Enforce header name and CSP [puppet] - https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961)
[09:23:13] (CR) CI reject: [V: -1] radosgw: Enforce header name and CSP [puppet] - https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961) (owner: Majavah)
[09:23:41] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) (owner: Muehlenhoff)
[09:24:16] (PS3) Majavah: radosgw: Enforce header name and CSP [puppet] - https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961)
[09:24:49] (CR) CI reject: [V: -1] radosgw: Enforce header name and CSP [puppet] - https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961) (owner: Majavah)
[09:25:31] (PS4) Majavah: radosgw: Enforce header name and CSP [puppet] - https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961)
[09:25:33] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/965651 (owner: Muehlenhoff)
[09:25:59] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:26:16] (CR) Jbond: [C: +2] gerrit: make gerrit ssh key more DRY [puppet] - https://gerrit.wikimedia.org/r/965122 (https://phabricator.wikimedia.org/T328543) (owner: Jbond)
[09:26:46] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44032/console" [puppet] - https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961) (owner: Majavah)
[09:30:27] (PS1) JMeybohm: Revert "wikifunctions: Drop legacy main (all languages) evaluator" [deployment-charts] - https://gerrit.wikimedia.org/r/965225 (https://phabricator.wikimedia.org/T343388)
[09:30:59] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[09:34:45] (PS5) Majavah: radosgw: Enforce header name and CSP [puppet] - https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961)
[09:34:56] SRE, Traffic: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (Southparkfan) Alternative to consider: injecting REDIRECTs for traffic meant for a VIP. See the second section at http://www.linuxvirtualserver.org/docs/arp.html. I haven't tested it and it requires som...
[09:36:02] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44033/console" [puppet] - https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961) (owner: Majavah)
[09:36:08] (CR) Hnowlan: [C: +2] api-gateway: add Content-type in the CORS' allowed headers settings [deployment-charts] - https://gerrit.wikimedia.org/r/965153 (https://phabricator.wikimedia.org/T348511) (owner: Elukey)
[09:37:02] (Merged) jenkins-bot: api-gateway: add Content-type in the CORS' allowed headers settings [deployment-charts] - https://gerrit.wikimedia.org/r/965153 (https://phabricator.wikimedia.org/T348511) (owner: Elukey)
[09:37:24] (PS6) Majavah: radosgw: Enforce header name and CSP [puppet] - https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961)
[09:38:31] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44034/console" [puppet] - https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961) (owner: Majavah)
[09:46:08] (PS7) Majavah: radosgw: Enforce header name and CSP [puppet] - https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961)
[09:47:26] (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44035/console" [puppet] - https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961) (owner: Majavah)
[09:48:19] (PS1) Muehlenhoff: testreduce: Set innodb_buffer_pool_size to 4.6G [puppet] - https://gerrit.wikimedia.org/r/965679 (https://phabricator.wikimedia.org/T345220)
[09:50:25] (PS2) Muehlenhoff: testreduce: Set innodb_buffer_pool_size to 4.6G [puppet] - https://gerrit.wikimedia.org/r/965679 (https://phabricator.wikimedia.org/T345220)
[09:52:45] (CR) Muehlenhoff: [C: +2] Add a prometheus check for whether nftables is running [puppet] - https://gerrit.wikimedia.org/r/964851 (https://phabricator.wikimedia.org/T348499) (owner: Muehlenhoff)
[09:53:00] (CR) CI reject: [V: -1] testreduce: Set innodb_buffer_pool_size to 4.6G [puppet] - https://gerrit.wikimedia.org/r/965679 (https://phabricator.wikimedia.org/T345220) (owner: Muehlenhoff)
[09:54:45] (PS3) Muehlenhoff: testreduce: Set innodb_buffer_pool_size to 4.6G [puppet] - https://gerrit.wikimedia.org/r/965679 (https://phabricator.wikimedia.org/T345220)
[10:05:23] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/965679 (https://phabricator.wikimedia.org/T345220) (owner: Muehlenhoff)
[10:06:19] SRE, Infrastructure-Foundations, Patch-For-Review: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (MoritzMuehlenhoff)
[10:13:23] (CR) Muehlenhoff: [C: +2] testreduce: Set innodb_buffer_pool_size to 4.6G [puppet] - https://gerrit.wikimedia.org/r/965679 (https://phabricator.wikimedia.org/T345220) (owner: Muehlenhoff)
[10:16:29] SRE, Cloud-VPS, Infrastructure-Foundations, netops: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (cmooney) Codfw equivalent subnet that needs changing also: ` cmooney@cloudcontrol2005-dev:~$ sudo wmcs-openstack subnet show 2596edb4-5a40-...
[10:20:20] (CR) Muehlenhoff: Failover testreduce to testreduce1002 (1 comment) [dns] - https://gerrit.wikimedia.org/r/965163 (https://phabricator.wikimedia.org/T345220) (owner: Muehlenhoff)
[10:25:18] (CR) Muehlenhoff: Add a define to run periodic metric checks (1 comment) [puppet] - https://gerrit.wikimedia.org/r/965533 (https://phabricator.wikimedia.org/T348499) (owner: Muehlenhoff)
[10:29:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[10:29:43] !log ladsgroup@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[10:29:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[10:30:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[10:33:06] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/965659 (https://phabricator.wikimedia.org/T348735) (owner: Muehlenhoff)
[10:35:32] SRE, Traffic-Icebox, HTTPS, Wikimedia-Performance-recommendation: Enable HTTP/3 (QUIC) support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Diskdance)
[10:37:03] (PS3) Cathal Mooney: Add puppet elements for newly added switches. [puppet] - https://gerrit.wikimedia.org/r/965148 (https://phabricator.wikimedia.org/T334230)
[10:38:07] (CR) Cathal Mooney: [C: +2] Add puppet elements for newly added switches. [puppet] - https://gerrit.wikimedia.org/r/965148 (https://phabricator.wikimedia.org/T334230) (owner: Cathal Mooney)
[10:42:20] (CR) Jbond: [C: -1] "see inline, perhaps im missing something but this doesn't seem right" [puppet] - https://gerrit.wikimedia.org/r/965664 (https://phabricator.wikimedia.org/T348735) (owner: Muehlenhoff)
[10:47:33] (PS1) Ladsgroup: Disable DoubleWiki extension everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/965707 (https://phabricator.wikimedia.org/T344544)
[10:47:36] (PS1) Cathal Mooney: Change cloudgw VIPs to /29 so system can ARP from them [puppet] - https://gerrit.wikimedia.org/r/965708 (https://phabricator.wikimedia.org/T348140)
[10:51:59] (PS3) Majavah: security: use concat to construct access.conf [puppet] - https://gerrit.wikimedia.org/r/965461
[10:53:38] (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44036/console" [puppet] - https://gerrit.wikimedia.org/r/965461 (owner: Majavah)
[10:54:19] (PS1) Ilias Sarantopoulos: ml-services: enable multiprocessing in enwiki articlequality in staging [deployment-charts] - https://gerrit.wikimedia.org/r/965709 (https://phabricator.wikimedia.org/T348265)
[10:57:02] (CR) Muehlenhoff: nftables::service Write out notrack rules for services skipping conntrack (2 comments) [puppet] - https://gerrit.wikimedia.org/r/965664 (https://phabricator.wikimedia.org/T348735) (owner: Muehlenhoff)
[10:57:55] (CR) David Caro: "LGTM, let's wait to run on codfw, and for monday :)" [puppet] - https://gerrit.wikimedia.org/r/965708 (https://phabricator.wikimedia.org/T348140) (owner: Cathal Mooney)
[10:58:41] (PS1) Jbond: sretest1002: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/965711 (https://phabricator.wikimedia.org/T348319)
[10:59:09] (CR) Jbond: [C: +2] sretest1002: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/965711 (https://phabricator.wikimedia.org/T348319) (owner: Jbond)
[10:59:11] (PS1) Cathal Mooney: Change codfw cloudgw VIPs to /29 so system can ARP from them [puppet] - https://gerrit.wikimedia.org/r/965712 (https://phabricator.wikimedia.org/T348140)
[11:01:01] (CR) Ilias Sarantopoulos: [C: +2] ml-services: enable multiprocessing in enwiki articlequality in staging [deployment-charts] - https://gerrit.wikimedia.org/r/965709 (https://phabricator.wikimedia.org/T348265) (owner: Ilias Sarantopoulos)
[11:01:10] PROBLEM - Host lsw1-e5-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:10] PROBLEM - Host lsw1-e5-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:10] PROBLEM - Host lsw1-e6-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:10] PROBLEM - Host lsw1-e6-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:10] PROBLEM - Host lsw1-e7-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:11] PROBLEM - Host lsw1-e7-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:11] PROBLEM - Host lsw1-f5-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:12] PROBLEM - Host lsw1-f5-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:12] PROBLEM - Host lsw1-f6-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:13] PROBLEM - Host lsw1-f6-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:13] PROBLEM - Host lsw1-f7-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:14] PROBLEM - Host lsw1-f7-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[11:01:37] (PS2) Cathal Mooney: Change eqiad cloudgw VIPs to /29 so system can ARP from them [puppet] - https://gerrit.wikimedia.org/r/965708 (https://phabricator.wikimedia.org/T348140)
[11:02:27] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/965712 (https://phabricator.wikimedia.org/T348140) (owner: Cathal Mooney)
[11:02:41] (Merged) jenkins-bot: ml-services: enable multiprocessing in enwiki articlequality in staging [deployment-charts] - https://gerrit.wikimedia.org/r/965709 (https://phabricator.wikimedia.org/T348265) (owner: Ilias Sarantopoulos)
[11:03:34] (JobUnavailable) firing: (5) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:03:36] (CR) Cathal Mooney: [C: +2] Change codfw cloudgw VIPs to /29 so system can ARP from them [puppet] - https://gerrit.wikimedia.org/r/965712 (https://phabricator.wikimedia.org/T348140) (owner: Cathal Mooney)
[11:04:34] (CR) JMeybohm: [C: +2] Revert "wikifunctions: Drop legacy main (all languages) evaluator" [deployment-charts] - https://gerrit.wikimedia.org/r/965225 (https://phabricator.wikimedia.org/T343388) (owner: JMeybohm)
[11:05:39] (Merged) jenkins-bot: Revert "wikifunctions: Drop legacy main (all languages) evaluator" [deployment-charts] - https://gerrit.wikimedia.org/r/965225 (https://phabricator.wikimedia.org/T343388) (owner: JMeybohm)
[11:06:23] (PS2) Muehlenhoff: nftables::service Write out notrack rules for services skipping conntrack [puppet] - https://gerrit.wikimedia.org/r/965664 (https://phabricator.wikimedia.org/T348735)
[11:07:16] (PS1) Jbond: sre.puppet.migrate-hosts: just destroy the node don't delete [cookbooks] - https://gerrit.wikimedia.org/r/965715
[11:07:23] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[11:11:58] PROBLEM - Juniper alarms on lsw1-e7-eqiad.mgmt is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 1 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [11:12:21] (03PS1) 10JMeybohm: Revert "Revert "wikifunctions: Drop legacy main (all languages) evaluator"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/965689 (https://phabricator.wikimedia.org/T343388) [11:12:42] (03CR) 10Jbond: [C: 03+2] sre.puppet.migrate-hosts: just destroy the node don't delete [cookbooks] - 10https://gerrit.wikimedia.org/r/965715 (owner: 10Jbond) [11:14:28] (03CR) 10JMeybohm: [C: 03+2] Revert "Revert "wikifunctions: Drop legacy main (all languages) evaluator"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/965689 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [11:15:19] (03Merged) 10jenkins-bot: Revert "Revert "wikifunctions: Drop legacy main (all languages) evaluator"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/965689 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [11:15:25] (03PS1) 10JMeybohm: Add new version for base.helper (1.1.1) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965716 (https://phabricator.wikimedia.org/T343388) [11:15:27] (03PS1) 10JMeybohm: base.helper: Allow to use ClusterIP services [deployment-charts] - 10https://gerrit.wikimedia.org/r/965717 (https://phabricator.wikimedia.org/T343388) [11:15:29] (03PS1) 10JMeybohm: wikifunctions: Use ClusterIP services for evaluators [deployment-charts] - 10https://gerrit.wikimedia.org/r/965718 (https://phabricator.wikimedia.org/T343388) [11:21:30] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Integrate Bookworm 12.1 point update - https://phabricator.wikimedia.org/T343121 (10MoritzMuehlenhoff) [11:21:57] (03PS1) 10Cathal Mooney: Add 'src' to ip route statement on cloudgw to ensure VIP used for ARP [puppet] - 10https://gerrit.wikimedia.org/r/965720 (https://phabricator.wikimedia.org/T348140) [11:22:18] 
10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) On https://commons.wikimedia.org/wiki/File:Swiss_national_ma... [11:26:24] (03CR) 10Jbond: "thanks ee inline" [puppet] - 10https://gerrit.wikimedia.org/r/965461 (owner: 10Majavah) [11:26:40] PROBLEM - Juniper alarms on lsw1-e5-eqiad.mgmt is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 1 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [11:33:02] (03CR) 10Filippo Giunchedi: Add a define to run periodic metric checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965533 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [11:35:09] (03PS2) 10Jbond: sre.hosts.reimage: remove the call to destroy [cookbooks] - 10https://gerrit.wikimedia.org/r/964917 (https://phabricator.wikimedia.org/T348319) [11:35:14] (03PS18) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [11:35:41] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [11:38:34] (03Abandoned) 10Cathal Mooney: Add 'src' to ip route statement on cloudgw to ensure VIP used for ARP [puppet] - 10https://gerrit.wikimedia.org/r/965720 (https://phabricator.wikimedia.org/T348140) (owner: 10Cathal Mooney) [11:46:17] (03CR) 10Muehlenhoff: [C: 03+2] Add a define to run periodic metric checks [puppet] - 10https://gerrit.wikimedia.org/r/965533 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [11:53:51] !log starting decommission of restbase2012-c — T328490 [11:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:56] T328490: restbase cluster: decommission end-of-life hosts - 
https://phabricator.wikimedia.org/T328490 [11:55:02] (03PS1) 10Muehlenhoff: Add textfile exporter for nftables check [puppet] - 10https://gerrit.wikimedia.org/r/965723 (https://phabricator.wikimedia.org/T348499) [11:55:27] (03CR) 10CI reject: [V: 04-1] Add textfile exporter for nftables check [puppet] - 10https://gerrit.wikimedia.org/r/965723 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [11:55:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/965664 (https://phabricator.wikimedia.org/T348735) (owner: 10Muehlenhoff) [11:57:28] (03PS2) 10Muehlenhoff: Add textfile exporter for nftables check [puppet] - 10https://gerrit.wikimedia.org/r/965723 (https://phabricator.wikimedia.org/T348499) [11:57:54] (03CR) 10CI reject: [V: 04-1] Add textfile exporter for nftables check [puppet] - 10https://gerrit.wikimedia.org/r/965723 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [11:58:02] 10SRE, 10Traffic: Add custom HAProxy backend only for healthchecks - https://phabricator.wikimedia.org/T348851 (10Fabfur) [11:58:52] (03PS3) 10Muehlenhoff: Add textfile exporter for nftables check [puppet] - 10https://gerrit.wikimedia.org/r/965723 (https://phabricator.wikimedia.org/T348499) [12:16:53] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965723 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [12:21:31] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10TheDJ) >>! In T328872#9249487, @Yann wrote: > On https://commons.w... 
[12:22:26] (03PS4) 10Muehlenhoff: Add textfile exporter for nftables check [puppet] - 10https://gerrit.wikimedia.org/r/965723 (https://phabricator.wikimedia.org/T348499) [12:23:02] (03CR) 10CI reject: [V: 04-1] Add textfile exporter for nftables check [puppet] - 10https://gerrit.wikimedia.org/r/965723 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [12:25:04] (03PS5) 10Muehlenhoff: Add textfile exporter for nftables check [puppet] - 10https://gerrit.wikimedia.org/r/965723 (https://phabricator.wikimedia.org/T348499) [12:25:31] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [12:25:35] (03CR) 10CI reject: [V: 04-1] Add textfile exporter for nftables check [puppet] - 10https://gerrit.wikimedia.org/r/965723 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [12:26:36] (03PS6) 10Muehlenhoff: Add textfile exporter for nftables check [puppet] - 10https://gerrit.wikimedia.org/r/965723 (https://phabricator.wikimedia.org/T348499) [12:26:40] (03CR) 10Jforrester: "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/965225 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [12:28:09] (03CR) 10Jforrester: [C: 03+1] "Brill, thank you!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/965718 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [12:31:18] (03PS3) 10Jbond: sre.hosts.reimage: remove the call to destroy [cookbooks] - 10https://gerrit.wikimedia.org/r/964917 (https://phabricator.wikimedia.org/T348319) [12:31:20] (03PS19) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [12:31:27] (03CR) 10Jbond: sre.hosts.reimage: remove the call to destroy (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/964917 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:34:04] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:35:45] 10SRE, 10Traffic: Add custom HAProxy backend only for healthchecks - https://phabricator.wikimedia.org/T348851 (10Vgutierrez) [12:36:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965723 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [12:36:34] (03CR) 10Sohom Datta: [C: 03+1] "Based on the task, there hasn't much feedback despite posting this on Tech News, and most of the community members are indifferent, so I'd" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965707 (https://phabricator.wikimedia.org/T344544) (owner: 10Ladsgroup) [12:37:17] (03PS1) 10Btullis: Ensure that alluxio cache directories are present for presto [puppet] - 10https://gerrit.wikimedia.org/r/965730 (https://phabricator.wikimedia.org/T266641) [12:38:02] (03PS1) 10Jbond: sretest: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/965731 [12:38:39] (03CR) 10Jbond: [C: 03+2] sretest: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/965731 (owner: 10Jbond) [12:40:05] 10SRE, 10Traffic: Investigate IPVS IPIP encapsulation 
support - https://phabricator.wikimedia.org/T348837 (10Vgutierrez) [12:41:33] (03PS2) 10Btullis: Ensure that alluxio cache directories are present for presto [puppet] - 10https://gerrit.wikimedia.org/r/965730 (https://phabricator.wikimedia.org/T266641) [12:43:52] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44041/console" [puppet] - 10https://gerrit.wikimedia.org/r/965730 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [12:44:06] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10dcaro) p:05Low→03High This is causing some issues, should be fixed sooner than later, bumping priority [12:45:31] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [12:46:44] (03CR) 10Jbond: "I have now checked the following" [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:47:19] (03PS1) 10Jforrester: jquery.tablesorter: Fix data-sort-type with numeric values [core] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965690 (https://phabricator.wikimedia.org/T348812) [12:47:56] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bookworm [12:48:04] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be2003.codfw.wmnet with OS bookworm [12:48:24] (03PS1) 10Muehlenhoff: prometheus::node_debian_version: Move to prometheus::node_textfile [puppet] - 10https://gerrit.wikimedia.org/r/965732 [12:48:34] (03PS2) 
10Muehlenhoff: prometheus::node_debian_version: Move to prometheus::node_textfile [puppet] - 10https://gerrit.wikimedia.org/r/965732 [12:48:36] (03PS20) 10Jbond: sre.hosts.reimage: update to support puppetserver [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) [12:48:42] (03PS3) 10Btullis: Ensure that alluxio cache directories are present for presto [puppet] - 10https://gerrit.wikimedia.org/r/965730 (https://phabricator.wikimedia.org/T266641) [12:49:09] (03CR) 10CI reject: [V: 04-1] Ensure that alluxio cache directories are present for presto [puppet] - 10https://gerrit.wikimedia.org/r/965730 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [12:50:24] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140 (10dcaro) [12:50:42] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@9a8cfd2]: (no justification provided) [12:50:49] (03PS1) 10Elukey: ml-services: upgrade Docker image for readability [deployment-charts] - 10https://gerrit.wikimedia.org/r/965733 (https://phabricator.wikimedia.org/T348664) [12:52:08] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@9a8cfd2]: (no justification provided) (duration: 01m 26s) [12:52:48] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@9a8cfd2]: (no justification provided) [12:53:04] (03CR) 10Hashar: [C: 03+1] Don't try to lock to serialize m3u8 file writes [extensions/TimedMediaHandler] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965220 (https://phabricator.wikimedia.org/T348689) (owner: 10Jforrester) [12:53:10] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/965732 (owner: 10Muehlenhoff) [12:53:28] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@9a8cfd2]: (no justification provided) (duration: 00m 39s) [12:53:32] 
(KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [12:53:57] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/965723 (https://phabricator.wikimedia.org/T348499) (owner: 10Muehlenhoff) [12:54:34] 10SRE, 10Traffic: Add custom HAProxy backend only for healthchecks - https://phabricator.wikimedia.org/T348851 (10Fabfur) [12:54:39] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/965732 (owner: 10Muehlenhoff) [12:56:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [cookbooks] - 10https://gerrit.wikimedia.org/r/964007 (https://phabricator.wikimedia.org/T348319) (owner: 10Jbond) [12:57:34] (03CR) 10AikoChou: [C: 03+1] ml-services: upgrade Docker image for readability [deployment-charts] - 10https://gerrit.wikimedia.org/r/965733 (https://phabricator.wikimedia.org/T348664) (owner: 10Elukey) [12:59:31] (03CR) 10Elukey: [C: 03+2] ml-services: upgrade Docker image for readability [deployment-charts] - 10https://gerrit.wikimedia.org/r/965733 (https://phabricator.wikimedia.org/T348664) (owner: 10Elukey) [13:00:09] 10SRE-OnFire, 10Data-Platform-SRE, 10Discovery-Search (Current work), 10Wikimedia-Incident: 2023-09-20 Elasticsearch unavailable incident - https://phabricator.wikimedia.org/T346945 (10Gehel) 05Open→03Resolved a:03Gehel Incident report is written, follow up tasks are created, let's close this. 
[13:03:58] (03CR) 10CDanis: prometheus::node_debian_version: Move to prometheus::node_textfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965732 (owner: 10Muehlenhoff) [13:04:29] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [13:04:30] !log mvernon@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host moss-be2003.codfw.wmnet with OS bookworm [13:04:36] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be2003.codfw.wmnet with OS bookworm executed with... [13:04:54] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bookworm [13:05:02] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be2003.codfw.wmnet with OS bookworm [13:07:55] !log mvernon@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host moss-be2003.codfw.wmnet with OS bookworm [13:08:01] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be2003.codfw.wmnet with OS bookworm executed with... 
[13:09:14] (03PS1) 10MVernon: install_server: teach early_command about moss-be [puppet] - 10https://gerrit.wikimedia.org/r/965734 (https://phabricator.wikimedia.org/T342674) [13:10:30] (03PS1) 10David Caro: cinder::volume: Add enable attribute and ensure service running [puppet] - 10https://gerrit.wikimedia.org/r/965735 [13:11:39] (03CR) 10Muehlenhoff: [C: 03+1] install_server: teach early_command about moss-be [puppet] - 10https://gerrit.wikimedia.org/r/965734 (https://phabricator.wikimedia.org/T342674) (owner: 10MVernon) [13:12:56] (03CR) 10MVernon: [C: 03+2] install_server: teach early_command about moss-be [puppet] - 10https://gerrit.wikimedia.org/r/965734 (https://phabricator.wikimedia.org/T342674) (owner: 10MVernon) [13:13:07] (03CR) 10CI reject: [V: 04-1] cinder::volume: Add enable attribute and ensure service running [puppet] - 10https://gerrit.wikimedia.org/r/965735 (owner: 10David Caro) [13:14:05] (03CR) 10Hnowlan: [C: 03+1] Add new version for base.helper (1.1.1) [deployment-charts] - 10https://gerrit.wikimedia.org/r/965716 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [13:16:57] (03PS2) 10David Caro: cinder::volume: Add enable attribute and ensure service running [puppet] - 10https://gerrit.wikimedia.org/r/965735 [13:19:29] (03CR) 10CI reject: [V: 04-1] cinder::volume: Add enable attribute and ensure service running [puppet] - 10https://gerrit.wikimedia.org/r/965735 (owner: 10David Caro) [13:20:41] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be2003.codfw.wmnet with OS bookworm [13:20:52] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, and 2 others: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be2003.codfw.wmnet with OS bookworm [13:21:12] (03CR) 10Hnowlan: [C: 03+1] base.helper: Allow to use ClusterIP services [deployment-charts] - 
10https://gerrit.wikimedia.org/r/965717 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [13:24:15] (03PS4) 10Btullis: Ensure that alluxio cache directories are present for presto [puppet] - 10https://gerrit.wikimedia.org/r/965730 (https://phabricator.wikimedia.org/T266641) [13:26:09] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44043/console" [puppet] - 10https://gerrit.wikimedia.org/r/965730 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [13:29:28] (03CR) 10Hnowlan: [C: 03+1] "lgtm mostly, one nit/query" [deployment-charts] - 10https://gerrit.wikimedia.org/r/965718 (https://phabricator.wikimedia.org/T343388) (owner: 10JMeybohm) [13:33:06] (03PS5) 10Btullis: Ensure that alluxio cache directories are present for presto [puppet] - 10https://gerrit.wikimedia.org/r/965730 (https://phabricator.wikimedia.org/T266641) [13:34:00] (03PS6) 10Btullis: Ensure that alluxio cache directories are present for presto [puppet] - 10https://gerrit.wikimedia.org/r/965730 (https://phabricator.wikimedia.org/T266641) [13:34:10] (03PS1) 10Muehlenhoff: profile::parsoid::rt_server: Fix quotes [puppet] - 10https://gerrit.wikimedia.org/r/965740 [13:34:21] (03PS2) 10Muehlenhoff: profile::parsoid::rt_server: Fix quotes [puppet] - 10https://gerrit.wikimedia.org/r/965740 [13:34:48] (03CR) 10Subramanya Sastry: [C: 03+1] profile::parsoid::rt_server: Fix quotes [puppet] - 10https://gerrit.wikimedia.org/r/965740 (owner: 10Muehlenhoff) [13:36:46] (03CR) 10CI reject: [V: 04-1] profile::parsoid::rt_server: Fix quotes [puppet] - 10https://gerrit.wikimedia.org/r/965740 (owner: 10Muehlenhoff) [13:37:51] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44044/console" [puppet] - 10https://gerrit.wikimedia.org/r/965730 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [13:39:01] 10SRE, 10Traffic: 
Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 (10Vgutierrez) >>! In T348837#9249192, @Southparkfan wrote: > Alternative to consider: injecting REDIRECTs for traffic meant for a VIP. See the second section at http://www.linuxvirtualserver.org/docs/arp.... [13:49:11] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/965744 [13:49:41] (03CR) 10CI reject: [V: 04-1] wip [puppet] - 10https://gerrit.wikimedia.org/r/965744 (owner: 10Herron) [13:50:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance [13:50:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance [14:00:47] (03PS3) 10Jbond: profile::parsoid::rt_server: Fix quotes [puppet] - 10https://gerrit.wikimedia.org/r/965740 (owner: 10Muehlenhoff) [14:00:49] (03PS1) 10Jbond: P:base: drop spec test for buster [puppet] - 10https://gerrit.wikimedia.org/r/965745 [14:01:56] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/965745 (owner: 10Jbond) [14:02:35] (03PS1) 10Elukey: ml-services: update readability docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/965746 (https://phabricator.wikimedia.org/T348664) [14:02:37] (03CR) 10Jbond: [C: 03+2] P:base: drop spec test for buster [puppet] - 10https://gerrit.wikimedia.org/r/965745 (owner: 10Jbond) [14:03:12] 10SRE, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): hw troubleshooting: disk failure for cloudvirt2004-dev.codfw.wmnet - https://phabricator.wikimedia.org/T348531 (10Jhancock.wm) 05Open→03Resolved it's been 3 days. good enough for me. 
[14:03:44] (03CR) 10Elukey: [C: 03+2] ml-services: update readability docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/965746 (https://phabricator.wikimedia.org/T348664) (owner: 10Elukey) [14:03:53] !log remove redundant 208.80.154.238/32 dev from /e/n/i on A:dns-rec and A:eqiad (superseded by label lo:anycast): T348041 [14:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:58] T348041: Remove static routes for ns[01] and replace their announcements with bird - https://phabricator.wikimedia.org/T348041 [14:04:51] (03CR) 10Muehlenhoff: [C: 03+2] profile::parsoid::rt_server: Fix quotes [puppet] - 10https://gerrit.wikimedia.org/r/965740 (owner: 10Muehlenhoff) [14:05:00] (03PS4) 10Muehlenhoff: profile::parsoid::rt_server: Fix quotes [puppet] - 10https://gerrit.wikimedia.org/r/965740 [14:06:59] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [14:07:13] (03PS1) 10Bking: search-loader: Move bullseye hosts back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/965748 (https://phabricator.wikimedia.org/T346039) [14:07:33] (03CR) 10CI reject: [V: 04-1] profile::parsoid::rt_server: Fix quotes [puppet] - 10https://gerrit.wikimedia.org/r/965740 (owner: 10Muehlenhoff) [14:12:16] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cloudelastic1008.wikimedia.org with OS bullseye [14:12:37] (03CR) 10CI reject: [V: 04-1] arclamp: add redis exporter and prom scrape config [puppet] - 10https://gerrit.wikimedia.org/r/965744 (https://phabricator.wikimedia.org/T348756) (owner: 10Herron) [14:12:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cloudelastic1008.... 
[14:13:53] (03CR) 10Muehlenhoff: [V: 03+2] profile::parsoid::rt_server: Fix quotes [puppet] - 10https://gerrit.wikimedia.org/r/965740 (owner: 10Muehlenhoff) [14:14:29] (03CR) 10Gehel: [C: 03+1] search-loader: Move bullseye hosts back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/965748 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [14:15:13] (03CR) 10Btullis: [C: 03+1] search-loader: Move bullseye hosts back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/965748 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [14:17:04] (03CR) 10Bking: [C: 03+2] search-loader: Move bullseye hosts back to insetup [puppet] - 10https://gerrit.wikimedia.org/r/965748 (https://phabricator.wikimedia.org/T346039) (owner: 10Bking) [14:17:04] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1105 [14:17:07] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1105 [14:17:22] (03PS5) 10Herron: arclamp: add redis exporter and prom scrape config [puppet] - 10https://gerrit.wikimedia.org/r/965744 (https://phabricator.wikimedia.org/T348756) [14:17:47] (03CR) 10Herron: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44050/console" [puppet] - 10https://gerrit.wikimedia.org/r/965744 (https://phabricator.wikimedia.org/T348756) (owner: 10Herron) [14:17:59] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1105.mgmt.eqiad.wmnet with reboot policy FORCED [14:18:34] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be2003.codfw.wmnet with reason: host reimage [14:18:55] (03PS5) 10Muehlenhoff: profile::parsoid::rt_server: Fix quotes [puppet] - 10https://gerrit.wikimedia.org/r/965740 [14:19:10] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cp1105.mgmt.eqiad.wmnet with reboot policy FORCED [14:20:21] (03CR) 10Andrew 
Bogott: [C: 03+1] radosgw: Enforce header name and CSP [puppet] - 10https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961) (owner: 10Majavah) [14:21:57] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be2003.codfw.wmnet with reason: host reimage [14:26:15] (03CR) 10Brouberol: [C: 03+2] Ensure that alluxio cache directories are present for presto [puppet] - 10https://gerrit.wikimedia.org/r/965730 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [14:26:33] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1008.wikimedia.org with reason: host reimage [14:27:51] (03PS3) 10JHathaway: dev env: PS1 function for to show the puppet env [puppet] - 10https://gerrit.wikimedia.org/r/965536 (https://phabricator.wikimedia.org/T337970) [14:28:04] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1106.mgmt.eqiad.wmnet with reboot policy FORCED [14:29:30] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10Papaul) @MatthewVernon you welcome [14:29:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1008.wikimedia.org with reason: host reimage [14:30:06] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1106 [14:30:08] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1106 [14:33:03] (03CR) 10Brouberol: [C: 03+1] Ensure that alluxio cache directories are present for presto [puppet] - 10https://gerrit.wikimedia.org/r/965730 (https://phabricator.wikimedia.org/T266641) (owner: 10Btullis) [14:34:17] (03CR) 10Bartosz Dziewoński: "Feel free to backport if you want, but this is not a regression." 
[core] (wmf/1.41.0-wmf.30) - 10https://gerrit.wikimedia.org/r/965690 (https://phabricator.wikimedia.org/T348812) (owner: 10Jforrester) [14:36:44] (03PS1) 10DDesouza: miscweb: update research-landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/965755 (https://phabricator.wikimedia.org/T219903) [14:38:34] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:38] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be2003.codfw.wmnet with OS bookworm [14:39:45] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be2003.codfw.wmnet with OS bookworm completed: - m... [14:42:51] (03CR) 10Majavah: [V: 03+1 C: 03+2] radosgw: Enforce header name and CSP [puppet] - 10https://gerrit.wikimedia.org/r/965670 (https://phabricator.wikimedia.org/T276961) (owner: 10Majavah) [14:43:57] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [14:44:48] 10SRE, 10Wikimedia-Mailing-lists: Requesting mailing list for Leadership Development Network - https://phabricator.wikimedia.org/T348868 (10Ladsgroup) a:03Ladsgroup I'll do this. The only thing that can't be changed easily later is the address. Are you sure you want it to be "leadershipdevelopmentnetwork"? M... [14:45:18] James_F: were you planning to backport some changes today? 
[14:46:15] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install moss-be200[34] - https://phabricator.wikimedia.org/T342674 (10MatthewVernon) 05Open→03Resolved Successful install (I had to change the boot disk in BIOS setup). [14:48:08] (03PS1) 10Btullis: Create a new role for analytics_cluster::mariadb and assign it [puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) [14:48:11] (03PS1) 10Majavah: radosgw: Tweak CSP [puppet] - 10https://gerrit.wikimedia.org/r/965757 [14:48:34] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:49:22] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:49:56] MatmaRex: To prod? No. [14:50:02] (03PS2) 10Btullis: Create a new role for analytics_cluster::mariadb and assign it [puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) [14:50:11] MatmaRex: If you think an emergency prod backport is needed, I'm happy to advise, but there's a process. [14:50:15] (03PS2) 10Majavah: radosgw: Tweak CSP [puppet] - 10https://gerrit.wikimedia.org/r/965757 [14:50:38] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44052/console" [puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [14:50:42] MatmaRex: For the table sorter I was giving Jdlrobson the opportunity given he was concerned about the error rate. 
[14:50:49] (03CR) 10Majavah: [C: 03+2] radosgw: Tweak CSP [puppet] - 10https://gerrit.wikimedia.org/r/965757 (owner: 10Majavah) [14:51:02] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1106 [14:51:04] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1106 [14:51:07] James_F: yeah, that was the change i was wondering about. alright [14:51:23] i'm considering proposing a revert of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/958993 [14:51:28] (03PS3) 10Btullis: Create a new role for analytics_cluster::mariadb and assign it [puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) [14:51:37] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1106.mgmt.eqiad.wmnet with reboot policy FORCED [14:52:54] (03PS4) 10Btullis: Create a new role for analytics_cluster::mariadb and assign it [puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) [14:53:13] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1107 [14:53:18] (03PS2) 10Jbond: puppet: add support for puppetserver returning none 0 rc [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 [14:53:29] MatmaRex: Given edsanders just C+2ed a bunch of things I was hoping he'd weigh in. 
[14:54:04] (03PS1) 10Majavah: radosgw: Fix inline syntax [puppet] - 10https://gerrit.wikimedia.org/r/965758 [14:54:52] (03PS2) 10Majavah: radosgw: Fix inline syntax [puppet] - 10https://gerrit.wikimedia.org/r/965758 [14:54:56] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1108 [14:55:02] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1107 [14:55:43] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1107.mgmt.eqiad.wmnet with reboot policy FORCED [14:55:59] (03CR) 10Majavah: [C: 03+2] radosgw: Fix inline syntax [puppet] - 10https://gerrit.wikimedia.org/r/965758 (owner: 10Majavah) [14:56:55] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1108 [14:58:44] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1108.mgmt.eqiad.wmnet with reboot policy FORCED [14:59:25] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1109 [15:01:01] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1109 [15:02:02] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1109.mgmt.eqiad.wmnet with reboot policy FORCED [15:02:12] (03PS3) 10AikoChou: ml-services: test kserve batcher for revertrisk-multilingual in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/964915 (https://phabricator.wikimedia.org/T348536) [15:04:22] (03CR) 10Jbond: "ready for review" [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 (owner: 10Jbond) [15:04:36] (03CR) 10Hnowlan: [C: 03+1] Update cxserver to 2023-10-12-080927-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/965022 (https://phabricator.wikimedia.org/T344982) (owner: 10KartikMistry) [15:06:03] 10SRE, 10Traffic: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 
(10jhathaway) @Vgutierrez thanks for opening this ticket and investigating ipip support in ipvs. Another alternative would be [[ https://datatracker.ietf.org/doc/html/draft-ietf-intarea-gue-06 | GUE ]] enc... [15:06:11] (03CR) 10CI reject: [V: 04-1] team-ml: add alert for Kafka consumer lag for ores extension [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [15:06:13] (03CR) 10AikoChou: [C: 03+2] ml-services: test kserve batcher for revertrisk-multilingual in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/964915 (https://phabricator.wikimedia.org/T348536) (owner: 10AikoChou) [15:06:23] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1110 [15:06:54] (03CR) 10Ilias Sarantopoulos: "recheck" [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [15:07:27] (03Merged) 10jenkins-bot: ml-services: test kserve batcher for revertrisk-multilingual in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/964915 (https://phabricator.wikimedia.org/T348536) (owner: 10AikoChou) [15:07:33] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1110 [15:07:47] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1111 [15:08:54] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1110.mgmt.eqiad.wmnet with reboot policy FORCED [15:09:31] (03CR) 10Ilias Sarantopoulos: team-ml: add alert for Kafka consumer lag for ores extension (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/962056 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [15:09:40] (03PS5) 10Btullis: Create a new role for analytics_cluster::mariadb and assign it [puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) [15:09:42] 
(03PS1) 10Btullis: Remove the need for the analytics-meta database to require java [puppet] - 10https://gerrit.wikimedia.org/r/965761 (https://phabricator.wikimedia.org/T284150) [15:10:00] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1111 [15:10:20] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [15:11:10] (03PS1) 10Vgutierrez: Add support for IPIP encapsulation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/965763 [15:11:44] (03CR) 10CI reject: [V: 04-1] Add support for IPIP encapsulation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/965763 (owner: 10Vgutierrez) [15:11:49] (03PS2) 10Vgutierrez: Add support for IPIP encapsulation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/965763 (https://phabricator.wikimedia.org/T348837) [15:12:03] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1106.mgmt.eqiad.wmnet with reboot policy FORCED [15:12:44] (03CR) 10CI reject: [V: 04-1] Add support for IPIP encapsulation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/965763 (https://phabricator.wikimedia.org/T348837) (owner: 10Vgutierrez) [15:12:58] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1106'] [15:15:04] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1111.mgmt.eqiad.wmnet with reboot policy FORCED [15:15:15] (03PS3) 10Vgutierrez: Add support for IPIP encapsulation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/965763 (https://phabricator.wikimedia.org/T348837) [15:15:42] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1112 [15:16:04] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1107.mgmt.eqiad.wmnet with reboot policy FORCED [15:16:07] (03CR) 10CI reject: [V: 
04-1] Add support for IPIP encapsulation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/965763 (https://phabricator.wikimedia.org/T348837) (owner: 10Vgutierrez) [15:16:42] (03PS1) 10Eevans: install_server: create aqs reuse partition reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) [15:16:53] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1112 [15:16:54] (03CR) 10JHathaway: puppet: add support for puppetserver returning none 0 rc (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 (owner: 10Jbond) [15:17:16] (03CR) 10CI reject: [V: 04-1] install_server: create aqs reuse partition reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans) [15:18:40] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1107.mgmt.eqiad.wmnet with reboot policy FORCED [15:19:01] (03PS2) 10Eevans: install_server: create aqs reuse partition reuse recipe [puppet] - 10https://gerrit.wikimedia.org/r/965767 (https://phabricator.wikimedia.org/T347738) [15:19:46] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1108.mgmt.eqiad.wmnet with reboot policy FORCED [15:19:50] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1107.mgmt.eqiad.wmnet with reboot policy FORCED [15:20:36] (03CR) 10Jbond: "thanks for the review, response inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 (owner: 10Jbond) [15:20:52] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1109.mgmt.eqiad.wmnet with reboot policy FORCED [15:20:58] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44055/console" [puppet] - 10https://gerrit.wikimedia.org/r/965756 
(https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [15:21:27] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1107'] [15:22:36] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1108'] [15:22:59] (03PS4) 10Vgutierrez: Add support for IPIP encapsulation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/965763 (https://phabricator.wikimedia.org/T348837) [15:23:17] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1106'] [15:23:31] (03CR) 10CI reject: [V: 04-1] Add support for IPIP encapsulation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/965763 (https://phabricator.wikimedia.org/T348837) (owner: 10Vgutierrez) [15:23:49] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1109'] [15:25:03] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1112.mgmt.eqiad.wmnet with reboot policy FORCED [15:26:12] (03PS5) 10Vgutierrez: Add support for IPIP encapsulation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/965763 (https://phabricator.wikimedia.org/T348837) [15:26:26] (03PS6) 10Btullis: Create a new role for analytics_cluster::mariadb and assign it [puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) [15:26:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10VRiley-WMF) [15:27:55] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44056/console" [puppet] - 10https://gerrit.wikimedia.org/r/965756 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis) [15:30:09] 10SRE, 10Wikimedia-Mailing-lists: Requesting mailing list for Leadership Development Network - 
https://phabricator.wikimedia.org/T348868 (10BJiang) Hi @Ladsgroup ! Thanks for taking on this task and the feedback. Let's go with leadership-development-network. This can be a public group. [15:31:44] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1113 [15:32:09] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1107'] [15:32:18] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1108'] [15:32:50] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1110.mgmt.eqiad.wmnet with reboot policy FORCED [15:32:51] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1113 [15:33:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10VRiley-WMF) [15:33:09] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1109'] [15:35:13] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1110'] [15:35:34] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1111.mgmt.eqiad.wmnet with reboot policy FORCED [15:35:59] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111'] [15:36:28] (03CR) 10JHathaway: [C: 03+1] puppet: add support for puppetserver returning none 0 rc (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/965112 (owner: 10Jbond) [15:37:05] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1113.mgmt.eqiad.wmnet with reboot policy FORCED [15:39:23] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1114 [15:39:36] !log vriley@cumin1001 START - Cookbook 
sre.network.configure-switch-interfaces for host cp1115 [15:40:33] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1114 [15:40:53] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1115 [15:41:01] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1114.mgmt.eqiad.wmnet with reboot policy FORCED [15:41:36] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1115.mgmt.eqiad.wmnet with reboot policy FORCED [15:43:21] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1112.mgmt.eqiad.wmnet with reboot policy FORCED [15:43:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T343198)', diff saved to https://phabricator.wikimedia.org/P52932 and previous config saved to /var/cache/conftool/dbconfig/20231013-154321-arnaudb.json [15:43:27] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [15:44:25] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1112'] [15:44:38] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1112'] [15:44:59] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1112'] [15:45:21] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add support for IPIP encapsulation [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/965763 (https://phabricator.wikimedia.org/T348837) (owner: 10Vgutierrez) [15:45:24] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1110'] [15:45:29] 10SRE, 10observability, 10SRE Observability (FY2023/2024-Q2): Icinga contact for dr0ptp4kt - https://phabricator.wikimedia.org/T346688 (10dr0ptp4kt) Thanks @herron ! 
[15:45:47] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1111'] [15:49:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:50] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1112'] [15:54:57] (03PS1) 10Jbond: puppet: simplify debug code [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/965772 [15:55:23] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111'] [15:55:25] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1113.mgmt.eqiad.wmnet with reboot policy FORCED [15:55:43] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1111'] [15:58:08] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 3 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Umherirrender) >>! In T328872#9249624, @TheDJ wrote: >>>! In T3288... 
[15:58:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P52933 and previous config saved to /var/cache/conftool/dbconfig/20231013-155827-arnaudb.json [15:59:21] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1115.mgmt.eqiad.wmnet with reboot policy FORCED [15:59:31] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1113'] [16:00:25] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1115'] [16:01:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:06:52] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1114.mgmt.eqiad.wmnet with reboot policy FORCED [16:08:08] PROBLEM - Disk space on build2001 is CRITICAL: DISK CRITICAL - free space: / 5529 MB (2% inode=64%): /tmp 5529 MB (2% inode=64%): /var/tmp 5529 MB (2% inode=64%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops [16:10:34] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1115'] [16:11:25] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1113'] [16:12:03] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1114'] [16:13:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P52934 and previous config saved to /var/cache/conftool/dbconfig/20231013-161333-arnaudb.json [16:15:53] (03PS1) 10Jbond: mcrouter_pools: drop the use of alert [puppet] 
- 10https://gerrit.wikimedia.org/r/965775 [16:18:34] (03PS2) 10Jbond: puppet: simplify debug code [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/965772 [16:22:25] (03PS3) 10Jbond: puppet: simplify debug code [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/965772 [16:24:42] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1114'] [16:25:16] (03CR) 10CI reject: [V: 04-1] puppet: simplify debug code [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/965772 (owner: 10Jbond) [16:26:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10VRiley-WMF) [16:28:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T343198)', diff saved to https://phabricator.wikimedia.org/P52935 and previous config saved to /var/cache/conftool/dbconfig/20231013-162840-arnaudb.json [16:28:42] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [16:28:45] T343198: Add pl_target_id column to pagelinks in production - https://phabricator.wikimedia.org/T343198 [16:28:56] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [16:29:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T343198)', diff saved to https://phabricator.wikimedia.org/P52936 and previous config saved to /var/cache/conftool/dbconfig/20231013-162902-arnaudb.json [16:29:43] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [16:29:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1008.wikimedia.org with OS bullseye [16:29:50] !log jclark@cumin1001 END 
(FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1105.mgmt.eqiad.wmnet with reboot policy FORCED [16:29:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudelastic1008.wiki... [16:30:24] (03PS1) 10Andrew Bogott: Neutron/antelope: add policy rule for create_port:device_id [puppet] - 10https://gerrit.wikimedia.org/r/965779 (https://phabricator.wikimedia.org/T341285) [16:30:31] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [16:31:31] (03CR) 10FNegri: [C: 03+1] "I couldn't find any documentation about it, but it seems safe to apply." [puppet] - 10https://gerrit.wikimedia.org/r/965779 (https://phabricator.wikimedia.org/T341285) (owner: 10Andrew Bogott) [16:31:44] (03CR) 10Andrew Bogott: [C: 03+2] Neutron/antelope: add policy rule for create_port:device_id [puppet] - 10https://gerrit.wikimedia.org/r/965779 (https://phabricator.wikimedia.org/T341285) (owner: 10Andrew Bogott) [16:35:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr) a:05VRiley-WMF→03Jclark-ctr [16:35:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr) 05In progress→03Resolved [16:41:40] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1105'] [16:41:53] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1105'] 
[16:42:03] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1105'] [16:42:21] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1105'] [16:43:29] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1105.mgmt.eqiad.wmnet with reboot policy FORCED [16:43:38] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp1105.mgmt.eqiad.wmnet with reboot policy FORCED [16:48:29] (03PS1) 10Jclark-ctr: add cp11[00-15] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/965781 (https://phabricator.wikimedia.org/T342159) [16:49:04] RECOVERY - Disk space on build2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=build2001&var-datasource=codfw+prometheus/ops [16:49:16] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cp1105.mgmt.eqiad.wmnet with reboot policy FORCED [16:49:41] (03CR) 10Jclark-ctr: [C: 03+2] add cp11[00-15] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/965781 (https://phabricator.wikimedia.org/T342159) (owner: 10Jclark-ctr) [16:50:31] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [16:53:32] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [16:57:46] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp1105.mgmt.eqiad.wmnet with reboot policy FORCED [16:58:55] !log vriley@cumin1001 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1105'] [16:59:54] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): re-write compile_redirects function - https://phabricator.wikimedia.org/T348883 (10jbond) [17:00:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10Jclark-ctr) a:05ssingh→03Jclark-ctr [17:01:37] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1115'] [17:03:36] (03PS2) 10Hashar: mcrouter: remove skip_install from tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/960063 (https://phabricator.wikimedia.org/T345152) [17:03:52] (03PS2) 10Hashar: envoyproxy: remove skip_install from tox.ini [puppet] - 10https://gerrit.wikimedia.org/r/960062 (https://phabricator.wikimedia.org/T345152) [17:04:08] (03PS2) 10Hashar: Remove minversion=1.6 from tox.ini files [puppet] - 10https://gerrit.wikimedia.org/r/960064 (https://phabricator.wikimedia.org/T345152) [17:08:48] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1115'] [17:10:16] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye [17:10:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye [17:12:13] (03PS1) 10Herron: alertmanager::api: enable POST logging [puppet] - 10https://gerrit.wikimedia.org/r/965785 (https://phabricator.wikimedia.org/T321579) [17:14:25] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1105'] [17:14:50] (03PS1) 10Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - 
10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [17:15:30] (03CR) 10CI reject: [V: 04-1] compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [17:15:53] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1105'] [17:16:10] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1105'] [17:16:17] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1105'] [17:24:00] (03PS2) 10Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [17:24:38] (03CR) 10CI reject: [V: 04-1] compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [17:25:06] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@520fa55]: (no justification provided) [17:25:26] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44057/console" [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [17:25:43] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage [17:26:07] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@520fa55]: (no justification provided) (duration: 01m 01s) [17:26:15] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1105'] [17:29:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage 
[17:29:14] 10SRE, 10Wikimedia-Mailing-lists: Requesting mailing list for Leadership Development Network - https://phabricator.wikimedia.org/T348868 (10Ladsgroup) 05Open→03Resolved {{done}} https://lists.wikimedia.org/postorius/lists/leadership-development-network.lists.wikimedia.org [17:29:17] (03PS3) 10Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [17:29:45] (03CR) 10Herron: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/965785 (https://phabricator.wikimedia.org/T321579) (owner: 10Herron) [17:29:57] (03CR) 10CI reject: [V: 04-1] compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [17:34:53] (03PS4) 10Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [17:35:46] (03CR) 10CI reject: [V: 04-1] compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [17:40:03] (03PS5) 10Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [17:41:01] (03CR) 10CI reject: [V: 04-1] compile_redirects: Try porting compile_redirects to new api [puppet] - 10https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: 10Jbond) [17:41:19] (03PS1) 10Eevans: streams: update regex for 4.x `nodetool netstats` output [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/965788 [17:42:17] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1100'] [17:43:02] (03PS6) 10Jbond: compile_redirects: Try porting 
compile_redirects to new api [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [17:44:04] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1101'] [17:44:06] (CR) CI reject: [V: -1] compile_redirects: Try porting compile_redirects to new api [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: Jbond) [17:45:02] SRE, AQS2.0, Cassandra, serviceops, Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (VirginiaPoundstone) Had a quick chat with @jAllemandou to think through some open questions we need to answer across teams. ====O... [17:45:44] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102'] [17:46:03] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1102'] [17:46:16] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [17:46:39] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1103'] [17:48:31] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1100'] [17:50:57] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1101'] [17:52:59] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1103'] [17:53:00] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1104'] [17:53:51] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1105'] [17:54:11] SRE,
SRE-Access-Requests: Requesting access to deployment for DDeSouza - https://phabricator.wikimedia.org/T348209 (DDeSouza) Thank y'all! @MoritzMuehlenhoff I can connect to the bastion but I'm having trouble connecting to [miscweb](https://wikitech.wikimedia.org/wiki/Miscweb). ### Config ` Identitie... [17:54:22] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1105'] [17:54:45] SRE, SRE-Access-Requests: Requesting access to deployment for DDeSouza - https://phabricator.wikimedia.org/T348209 (DDeSouza) Resolved→Open [17:55:45] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1106'] [17:55:52] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1106'] [17:56:07] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1107'] [17:59:22] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@520fa55]: (no justification provided) [18:00:21] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@520fa55]: (no justification provided) (duration: 00m 59s) [18:00:37] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1108'] [18:01:28] (PS7) Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [18:02:08] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1107'] [18:02:13] (CR) CI reject: [V: -1] compile_redirects: Try porting compile_redirects to new api [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: Jbond) [18:02:56] (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44064/console" [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: Jbond) [18:03:11] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [18:03:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1115.eqiad.wmnet with OS bullseye [18:03:18] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye completed: - cp1115 (**PASS**) - Removed from Puppet... [18:04:44] (PS8) Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [18:05:24] (CR) CI reject: [V: -1] compile_redirects: Try porting compile_redirects to new api [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: Jbond) [18:06:39] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1104'] [18:06:39] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1108'] [18:07:03] (PS9) Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [18:07:41] (CR) CI reject: [V: -1] compile_redirects: Try porting compile_redirects to new api [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: Jbond) [18:08:29] (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44065/console" [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: Jbond) [18:10:50] (PS10) Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [18:11:29] (CR) CI reject: [V: -1] compile_redirects: Try porting compile_redirects to new api [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: Jbond) [18:12:17] (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44066/console" [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: Jbond) [18:16:07] (PS11) Jbond: compile_redirects: Try porting compile_redirects to new api [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) [18:16:46] (CR) CI reject: [V: -1] compile_redirects: Try porting compile_redirects to new api [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: Jbond) [18:17:33] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/44067/console" [puppet] - https://gerrit.wikimedia.org/r/965786 (https://phabricator.wikimedia.org/T348883) (owner: Jbond) [18:52:15] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1109'] [18:53:34] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:58:32] !log vriley@cumin1001 END (PASS) - Cookbook
sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1109'] [19:00:31] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1110'] [19:00:38] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1110'] [19:00:51] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111'] [19:07:46] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1111'] [19:07:57] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1112'] [19:08:02] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1112'] [19:08:06] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1112'] [19:14:15] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1112'] [19:14:17] !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@520fa55]: (no justification provided) [19:14:29] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1113'] [19:14:41] !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@520fa55]: (no justification provided) (duration: 00m 23s) [19:17:19] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1114'] [19:17:47] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1114'] [19:17:54] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1113'] [19:18:15] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1112'] [19:18:36] !log jclark@cumin1001 END (PASS) - 
Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1112'] [19:19:57] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1111'] [19:20:00] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1111'] [19:20:10] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1110'] [19:20:12] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1110'] [19:21:05] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS bullseye [19:21:11] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye [19:22:08] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1113'] [19:22:58] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1112'] [19:23:00] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1112'] [19:23:06] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1112'] [19:23:29] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1112'] [19:24:03] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['cp1114'] [19:24:03] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye [19:24:09] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] -
https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1112.eqiad.wmnet with OS bullseye [19:24:37] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102'] [19:24:42] !log vriley@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1102'] [19:24:48] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102'] [19:25:00] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1102'] [19:25:14] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1112'] [19:27:52] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1112'] [19:28:48] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1109.eqiad.wmnet with OS bullseye [19:28:59] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye [19:30:21] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1108.eqiad.wmnet with OS bullseye [19:30:28] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye [19:35:13] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1102 [19:35:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1102 [19:37:33] !log jclark@cumin1001 START - Cookbook
sre.network.configure-switch-interfaces for host cp1102 [19:38:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1102 [19:39:54] !log jclark@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp1113'] [19:42:26] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102'] [19:42:43] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1102'] [19:43:15] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102'] [19:43:47] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp1102'] [19:44:10] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1109.eqiad.wmnet with reason: host reimage [19:44:37] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1112.eqiad.wmnet with reason: host reimage [19:47:37] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1109.eqiad.wmnet with reason: host reimage [19:48:38] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1113.eqiad.wmnet with OS bullseye [19:48:46] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1113.eqiad.wmnet with OS bullseye [19:49:57] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1112.eqiad.wmnet with reason: host reimage [19:52:05] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye [19:52:11] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] -
https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [19:53:08] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102'] [19:53:23] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1102'] [19:54:59] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp1102 [19:55:23] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (Jclark-ctr) [19:55:56] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp1102 [19:56:15] !log vriley@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1102'] [19:56:29] !log vriley@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1102'] [19:57:04] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1111.eqiad.wmnet with OS bullseye [19:57:10] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye [20:03:37] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:04:44] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:04:45] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1109.eqiad.wmnet with OS bullseye [20:04:52] SRE, ops-eqiad,
DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye completed: - cp1109 (**PASS**) - Removed from Puppet... [20:05:33] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (Jclark-ctr) [20:06:36] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1113.eqiad.wmnet with reason: host reimage [20:07:18] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:07:30] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1107.eqiad.wmnet with reason: host reimage [20:10:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1113.eqiad.wmnet with reason: host reimage [20:11:26] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:11:27] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1112.eqiad.wmnet with OS bullseye [20:11:36] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1112.eqiad.wmnet with OS bullseye completed: - cp1112 (**WARN**) - Removed from Puppet...
[20:11:40] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (Jclark-ctr) [20:12:21] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1106.eqiad.wmnet with OS bullseye [20:12:25] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host cp1107.eqiad.wmnet with OS bullseye [20:12:28] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye [20:12:34] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye [20:12:37] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1107.eqiad.wmnet with reason: host reimage [20:15:44] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1107.eqiad.wmnet with OS bullseye [20:15:49] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL**) - Removed f... [20:22:53] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1108.eqiad.wmnet with OS bullseye [20:23:12] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1108.eqiad.wmnet with OS bullseye executed with errors: - cp1108 (**FAIL**) - Removed f...
[20:23:44] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp1108'] [20:24:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp1108'] [20:26:59] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:28:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [20:28:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1113.eqiad.wmnet with OS bullseye [20:28:30] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1113.eqiad.wmnet with OS bullseye completed: - cp1113 (**PASS**) - Removed from Puppet... [20:28:49] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (Jclark-ctr) [20:35:31] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [20:41:19] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1114.eqiad.wmnet with OS bullseye [20:41:26] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye executed with errors: - cp1114 (**FAIL**) - Removed f...
[20:48:29] RECOVERY - Check systemd state on mw1489 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:49:36] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1111.eqiad.wmnet with OS bullseye [20:49:41] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1111.eqiad.wmnet with OS bullseye executed with errors: - cp1111 (**FAIL**) - Removed f... [20:53:32] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [20:55:31] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [21:28:53] !log hashar@deploy2002 Started deploy [integration/docroot@504d455]: Fix php-session-serializer tagline [21:29:00] !log hashar@deploy2002 Finished deploy [integration/docroot@504d455]: Fix php-session-serializer tagline (duration: 00m 06s) [21:29:33] !log hashar@deploy2002 Started deploy [integration/docroot@096f637]: Expand Purtle doc card [21:29:38] !log hashar@deploy2002 Finished deploy [integration/docroot@096f637]: Expand Purtle doc card (duration: 00m 05s) [21:32:35] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1106.eqiad.wmnet with OS bullseye [21:32:42] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started
by jclark@cumin1001 for host cp1106.eqiad.wmnet with OS bullseye executed with errors: - cp1106 (**FAIL**) - Removed f... [21:32:43] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp1107.eqiad.wmnet with OS bullseye [21:32:49] SRE, ops-eqiad, DC-Ops, Traffic: Q1:rack/setup/install cp11[00-15] - https://phabricator.wikimedia.org/T342159 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cp1107.eqiad.wmnet with OS bullseye executed with errors: - cp1107 (**FAIL**) - Downtimed... [21:38:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [22:06:55] PROBLEM - Check systemd state on gitlab-runner1003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:23] RECOVERY - Check systemd state on gitlab-runner1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:39:39] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:39:45] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:40:29] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:41:03] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.378 second response time
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:41:09] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50715 bytes in 4.486 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:41:55] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:45:33] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:45:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:46:21] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:53:34] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:58:01] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:58:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.556 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:58:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50715 bytes in 6.103 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:04:45] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:06:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:06:59] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:23:13] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:23:42] (CR) Andrea Denisse: [C: +1] arclamp: add redis exporter and prom scrape config [puppet] - https://gerrit.wikimedia.org/r/965744 (https://phabricator.wikimedia.org/T348756) (owner: Herron) [23:23:47] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:23:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50714 bytes in 0.148 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:24:00] (CR) Andrea Denisse: [C: +1] "LGTM, thank you!"
[puppet] - https://gerrit.wikimedia.org/r/965744 (https://phabricator.wikimedia.org/T348756) (owner: Herron) [23:26:44] (CR) Cwhite: [C: +1] arclamp: add redis exporter and prom scrape config [puppet] - https://gerrit.wikimedia.org/r/965744 (https://phabricator.wikimedia.org/T348756) (owner: Herron) [23:29:47] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:29:47] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:30:33] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:37:45] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:38:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:38:27] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50714 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:48:51] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:51:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.335 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:56:08] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)