[00:04:28] (03PS1) 10Andrew Bogott: codfw1dev horizon: update docker version [puppet] - 10https://gerrit.wikimedia.org/r/1025488 [00:08:25] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev horizon: update docker version [puppet] - 10https://gerrit.wikimedia.org/r/1025488 (owner: 10Andrew Bogott) [00:17:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:20:25] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:36:08] (03CR) 10Andrew Bogott: [C:03+2] Revert "wmcs VM backups: move all backups to one host" [puppet] - 10https://gerrit.wikimedia.org/r/1023468 (https://phabricator.wikimedia.org/T332400) (owner: 10Andrew Bogott) [00:37:15] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [00:38:35] (03PS1) 10Andrew Bogott: codfw1dev horizon: bump docker version [puppet] - 10https://gerrit.wikimedia.org/r/1025493 [00:42:15] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [00:42:27] (03CR) 10Andrew Bogott: [C:03+2] codfw1dev horizon: bump docker version [puppet] - 
10https://gerrit.wikimedia.org/r/1025493 (owner: 10Andrew Bogott) [00:45:46] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7004.magru.wmnet with OS bullseye [00:45:56] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9756042 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7004.magru.wmnet with OS bullseye executed with errors: - cp7004 (**FAIL**) - Removed fro... [00:53:56] 06SRE, 06collaboration-services, 10WMF-General-or-Unknown, 07Documentation: https://static-codereview.wikimedia.org/ documentation improvements - https://phabricator.wikimedia.org/T363771#9756045 (10Dzahn) [01:02:22] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9756051 (10ssingh) ` Function lookup() did not find a value for the name 'prometheus_nodes' …in /srv/puppet_code/environments/production/modules/profile/manifests/firewall.pp, line: 21 ` `lookup(... 
[01:03:38] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T363783 (10phaultfinder) 03NEW
[01:07:54] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.3 [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1024761 (https://phabricator.wikimedia.org/T361397)
[01:07:55] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.3 [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1024761 (https://phabricator.wikimedia.org/T361397) (owner: 10TrainBranchBot)
[01:07:58] (03PS1) 10Ssingh: magru: add hieradata for P:cache::varnish [puppet] - 10https://gerrit.wikimedia.org/r/1025495
[01:09:03] (03CR) 10Ssingh: [C:03+2] magru: add hieradata for P:cache::varnish [puppet] - 10https://gerrit.wikimedia.org/r/1025495 (owner: 10Ssingh)
[01:13:13] (03PS1) 10Ssingh: Revert "magru: add hieradata for P:cache::varnish" [puppet] - 10https://gerrit.wikimedia.org/r/1025321
[01:13:51] (03CR) 10Ssingh: [C:03+2] Revert "magru: add hieradata for P:cache::varnish" [puppet] - 10https://gerrit.wikimedia.org/r/1025321 (owner: 10Ssingh)
[01:16:47] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9756089 (10ssingh) OK, so I finally found why this is failing. For a reason that I don't fully understand, `hieradata/magru/` directory actually needs to exist for the `lookup()` against `hieradata...
[01:21:42] (03PS1) 10Ssingh: magru: set hiera for trafficserver::backend::storage_elements [puppet] - 10https://gerrit.wikimedia.org/r/1025496 (https://phabricator.wikimedia.org/T362729) [01:22:32] (03CR) 10Ssingh: [C:03+2] magru: set hiera for trafficserver::backend::storage_elements [puppet] - 10https://gerrit.wikimedia.org/r/1025496 (https://phabricator.wikimedia.org/T362729) (owner: 10Ssingh) [01:24:29] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7004.magru.wmnet with OS bullseye [01:24:38] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9756098 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7004.magru.wmnet with OS bullseye [01:27:49] (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.3 [core] (wmf/1.43.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1024761 (https://phabricator.wikimedia.org/T361397) (owner: 10TrainBranchBot) [01:53:00] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7004.magru.wmnet with reason: host reimage [01:54:27] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1013 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [01:55:07] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:55:43] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:55:49] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7004.magru.wmnet with reason: host reimage [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see 
Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T0200) [02:04:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:05:59] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51784 bytes in 1.878 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:06:03] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Fri 14 Jun 2024 01:28:50 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:06:27] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:06:45] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.232 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [02:15:44] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [02:16:41] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [02:16:43] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7004.magru.wmnet with OS bullseye [02:16:53] 10ops-magru, 06DC-Ops, 06Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9756151 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7004.magru.wmnet with OS bullseye completed: - cp7004 (**WARN**) - Downtimed on Icinga/Al... 
[02:24:27] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1013 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [02:38:53] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:44:15] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 18.75% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:49:15] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 25% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [02:52:07] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:53:07] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:54:55] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [02:58:53] (JobUnavailable) firing: (2) Reduced availability for 
job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T0300)
[03:02:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:03:13] !log mwpresync@deploy1002 Pruned MediaWiki: 1.42.0-wmf.26 (duration: 03m 03s)
[03:04:42] (03PS1) 10TrainBranchBot: testwikis wikis to 1.43.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025500 (https://phabricator.wikimedia.org/T361397)
[03:04:43] (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.43.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025500 (https://phabricator.wikimedia.org/T361397) (owner: 10TrainBranchBot)
[03:05:29] (03Merged) 10jenkins-bot: testwikis wikis to 1.43.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025500 (https://phabricator.wikimedia.org/T361397) (owner: 10TrainBranchBot)
[03:05:55] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.3 refs T361397
[03:06:05] T361397: 1.43.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T361397
[03:16:28] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[03:39:55] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:45:27] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:45:44] (03CR) 10Abijeet Patro: ContentTranslation: Update publishing setting for cswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025300 (https://phabricator.wikimedia.org/T353049) (owner: 10KartikMistry) [03:45:53] (03PS2) 10KartikMistry: ContentTranslation: Update publishing setting for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025300 (https://phabricator.wikimedia.org/T353049) [03:59:56] (03PS3) 10KartikMistry: ContentTranslation: Update publishing setting for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025300 (https://phabricator.wikimedia.org/T353049) [04:01:42] (03CR) 10KartikMistry: ContentTranslation: Update publishing setting for cswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025300 (https://phabricator.wikimedia.org/T353049) (owner: 10KartikMistry) [04:01:51] (03PS4) 10KartikMistry: ContentTranslation: Update publishing setting for cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025300 (https://phabricator.wikimedia.org/T353049) [04:05:22] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.3 refs T361397 (duration: 59m 27s) [04:05:28] T361397: 1.43.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T361397 [04:20:25] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:51:21] (03CR) 10Stevemunene: datahub: create dse-k8s namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024365 
(https://phabricator.wikimedia.org/T363298) (owner: 10Stevemunene) [04:55:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s3 T363672 [04:55:30] T363672: Switchover s3 master (db1157 -> db1223) - https://phabricator.wikimedia.org/T363672 [04:55:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1223 with weight 0 T363672', diff saved to https://phabricator.wikimedia.org/P61449 and previous config saved to /var/cache/conftool/dbconfig/20240430-045541-marostegui.json [04:55:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s3 T363672 [04:56:40] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1223 to s3 master [puppet] - 10https://gerrit.wikimedia.org/r/1024752 (https://phabricator.wikimedia.org/T363672) (owner: 10Gerrit maintenance bot) [05:02:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1230.eqiad.wmnet with reason: Maintenance [05:02:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1230.eqiad.wmnet with reason: Maintenance [05:12:54] !log Starting s3 eqiad failover from db1157 to db1223 - T363672 [05:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:59] T363672: Switchover s3 master (db1157 -> db1223) - https://phabricator.wikimedia.org/T363672 [05:13:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T363672', diff saved to https://phabricator.wikimedia.org/P61450 and previous config saved to /var/cache/conftool/dbconfig/20240430-051312-root.json [05:13:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1223 to s3 primary and set section read-write T363672', diff saved to https://phabricator.wikimedia.org/P61451 and previous config saved to /var/cache/conftool/dbconfig/20240430-051332-root.json [05:14:19] 
!log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1157 T363672', diff saved to https://phabricator.wikimedia.org/P61452 and previous config saved to /var/cache/conftool/dbconfig/20240430-051419-root.json [05:15:03] (03CR) 10Marostegui: [C:03+2] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1024753 (https://phabricator.wikimedia.org/T363672) (owner: 10Gerrit maintenance bot) [05:15:08] (03PS2) 10Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1024753 (https://phabricator.wikimedia.org/T363672) [05:15:50] (03CR) 10Marostegui: [V:03+2 C:03+2] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/1024753 (https://phabricator.wikimedia.org/T363672) (owner: 10Gerrit maintenance bot) [05:18:18] (03PS1) 10Marostegui: db1157: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1025505 [05:18:45] (03CR) 10Marostegui: [C:03+2] db1157: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1025505 (owner: 10Marostegui) [05:19:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1157.eqiad.wmnet with OS bookworm [05:29:58] (03PS1) 10Marostegui: Revert "db1157: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1025322 [05:32:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1157.eqiad.wmnet with reason: host reimage [05:32:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:35:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1157.eqiad.wmnet with reason: host reimage [05:48:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance 
[05:48:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[05:49:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2129 from API serving', diff saved to https://phabricator.wikimedia.org/P61453 and previous config saved to /var/cache/conftool/dbconfig/20240430-054943-arnaudb.json
[05:51:15] 10ops-eqiad, 06SRE, 06DBA: db1234 has hardware issues - https://phabricator.wikimedia.org/T363102#9756319 (10ABran-WMF) sure, go ahead, the host is depooled
[05:55:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1157.eqiad.wmnet with OS bookworm
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T0600)
[06:00:05] kormat, marostegui, Amir1, and arnaudb: Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T0600). Please do the needful.
[06:04:50] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:10:20] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:10:30] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 213, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:34:58] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[06:47:17] (03PS1) 10Marostegui: db1174: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1025600
[06:47:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1174', diff saved to https://phabricator.wikimedia.org/P61454 and previous config saved to /var/cache/conftool/dbconfig/20240430-064720-root.json
[06:48:03] (03CR) 10Marostegui: [C:03+2] db1174: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1025600 (owner: 10Marostegui)
[06:48:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1174.eqiad.wmnet with OS bookworm
[06:58:53] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:05] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T0700)
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:01:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1174.eqiad.wmnet with reason: host reimage
[07:02:17] (03CR) 10Muehlenhoff: "Ack, I'm merging now, then you can make all further changes. I actually missed three more, so ideally you can simply fix them along as wel" [puppet] - 10https://gerrit.wikimedia.org/r/1024615 (owner: 10Muehlenhoff)
[07:02:19] (03CR) 10Muehlenhoff: [C:03+2] Deprecate system::role for Collaboration services (batch two) [puppet] - 10https://gerrit.wikimedia.org/r/1024615 (owner: 10Muehlenhoff)
[07:02:53] 06SRE, 10Bitu, 06DBA, 06Infrastructure-Foundations: Database request for Bitu Cloud DEV installation - https://phabricator.wikimedia.org/T362619#9756386 (10ABran-WMF) 05Open→03Resolved @SLyngshede-WMF done, password available in `/home/slyngshede/.pw` on `mwmaint1002`!
[07:04:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1174.eqiad.wmnet with reason: host reimage
[07:06:49] (03PS1) 10Marostegui: sections.yaml: Add es6 as valid dbctl section [puppet] - 10https://gerrit.wikimedia.org/r/1025603 (https://phabricator.wikimedia.org/T355285)
[07:07:04] (03CR) 10Brouberol: [C:03+1] "LGTM, thanks for the clear explanations. You might want to hear from someone with more Ceph experience before merging though." 
[puppet] - 10https://gerrit.wikimedia.org/r/1025428 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [07:07:43] (03CR) 10Marostegui: [C:03+2] sections.yaml: Add es6 as valid dbctl section [puppet] - 10https://gerrit.wikimedia.org/r/1025603 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [07:10:55] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 497, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:11:18] (03CR) 10Muehlenhoff: purged: add PKI cert handling (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1019866 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [07:11:52] (03CR) 10Muehlenhoff: [C:03+2] aptrepo: Add new repository component and repo sync config for Node 20 [puppet] - 10https://gerrit.wikimedia.org/r/1024663 (https://phabricator.wikimedia.org/T362681) (owner: 10Muehlenhoff) [07:14:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance [07:14:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1157.eqiad.wmnet with reason: Maintenance [07:16:06] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1021406 (https://phabricator.wikimedia.org/T362128) (owner: 10Slyngshede) [07:16:28] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [07:18:04] (03CR) 10Muehlenhoff: [C:03+2] netbox::standalone: Enable profile::auto_restarts::service for postgres [puppet] - 10https://gerrit.wikimedia.org/r/1024359 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:18:16] (03PS1) 10Marostegui: Revert "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1025611 [07:18:53] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61455 and previous config saved to /var/cache/conftool/dbconfig/20240430-071852-root.json [07:19:18] (03CR) 10Marostegui: [C:03+2] Revert "db1174: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1025611 (owner: 10Marostegui) [07:20:40] (KubernetesAPINotScrapable) firing: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [07:21:15] (03CR) 10Muehlenhoff: [C:03+2] Extend cloudnet-codfw1dev Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1024620 (owner: 10Muehlenhoff) [07:24:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1174.eqiad.wmnet with OS bookworm [07:24:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61456 and previous config saved to /var/cache/conftool/dbconfig/20240430-072406-root.json [07:26:56] (03PS1) 10Marostegui: instances.yaml: Add es6 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1025668 (https://phabricator.wikimedia.org/T355424) [07:27:34] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add es6 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1025668 (https://phabricator.wikimedia.org/T355424) (owner: 10Marostegui) [07:28:41] (03CR) 10DCausse: "I think this new alert could go to `cirrussearch.yaml` instead of creating a new file?" 
[alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [07:32:33] (03PS6) 10Santiago Faci: Creating staging and production helmfiles for MPIC (Metrics Platform Instrument Configurator) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344) [07:33:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61457 and previous config saved to /var/cache/conftool/dbconfig/20240430-073358-root.json [07:34:27] (03CR) 10Santiago Faci: "Extra whitespace removed! Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344) (owner: 10Santiago Faci) [07:35:13] (03CR) 10Marostegui: [C:03+2] Revert "db1157: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1025322 (owner: 10Marostegui) [07:37:27] (03PS1) 10Marostegui: etcd.php: Add es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025670 (https://phabricator.wikimedia.org/T355285) [07:38:09] (03CR) 10Muehlenhoff: [C:03+2] Enable profile::auto_restarts::service for Benthos instances [puppet] - 10https://gerrit.wikimedia.org/r/1023883 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:38:12] (03CR) 10CI reject: [V:04-1] etcd.php: Add es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025670 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [07:39:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61458 and previous config saved to /var/cache/conftool/dbconfig/20240430-073912-root.json [07:40:00] (03PS2) 10Marostegui: etcd.php: Add es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025670 (https://phabricator.wikimedia.org/T355285) [07:40:01] (03CR) 10Muehlenhoff: [C:03+2] sre.ganeti.makevm: Default to Puppet 7 for new VMs 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1024656 (owner: 10Muehlenhoff) [07:40:12] (03CR) 10Brouberol: [C:03+2] Creating staging and production helmfiles for MPIC (Metrics Platform Instrument Configurator) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344) (owner: 10Santiago Faci) [07:40:20] (03CR) 10Brouberol: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344) (owner: 10Santiago Faci) [07:42:38] (03CR) 10Santiago Faci: [C:03+2] Creating staging and production helmfiles for MPIC (Metrics Platform Instrument Configurator) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344) (owner: 10Santiago Faci) [07:43:37] (03Merged) 10jenkins-bot: Creating staging and production helmfiles for MPIC (Metrics Platform Instrument Configurator) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025283 (https://phabricator.wikimedia.org/T361344) (owner: 10Santiago Faci) [07:43:41] (03CR) 10Muehlenhoff: [C:03+2] arclamp: Enable profile::auto_restarts::service for Redis [puppet] - 10https://gerrit.wikimedia.org/r/1024630 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:45:32] (03CR) 10Muehlenhoff: [C:03+2] netbox-standalone: Enable profile::auto_restarts::service for Redis [puppet] - 10https://gerrit.wikimedia.org/r/1024690 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:49:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61459 and previous config saved to /var/cache/conftool/dbconfig/20240430-074903-root.json [07:51:10] (03CR) 10Muehlenhoff: [C:03+2] kerberos: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1024694 (owner: 10Muehlenhoff) [07:54:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 10%: Repooling', 
diff saved to https://phabricator.wikimedia.org/P61460 and previous config saved to /var/cache/conftool/dbconfig/20240430-075418-root.json [07:55:24] (03PS1) 10Marostegui: mariadb: Remove comments from es2035, es2036 [puppet] - 10https://gerrit.wikimedia.org/r/1025672 (https://phabricator.wikimedia.org/T355424) [07:55:58] (03CR) 10Marostegui: [C:03+2] mariadb: Remove comments from es2035, es2036 [puppet] - 10https://gerrit.wikimedia.org/r/1025672 (https://phabricator.wikimedia.org/T355424) (owner: 10Marostegui) [07:57:44] (03PS1) 10Marostegui: check_depooled: Add es6 [software] - 10https://gerrit.wikimedia.org/r/1025674 [07:57:51] (03CR) 10CI reject: [V:04-1] check_depooled: Add es6 [software] - 10https://gerrit.wikimedia.org/r/1025674 (owner: 10Marostegui) [08:00:04] jnuche and brennen: Your horoscope predicts another MediaWiki train - Utc-0+Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T0800). [08:00:21] morning, train will be deployed in the next few minutes [08:01:38] (03CR) 10Muehlenhoff: [C:03+2] Enable profile::auto_restarts::service for alertmanager-webhook-logger [puppet] - 10https://gerrit.wikimedia.org/r/1024288 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:03:30] (03PS1) 10Marostegui: check_depooled: Add es6 [software] - 10https://gerrit.wikimedia.org/r/1025675 [08:04:08] (03PS1) 10Santiago Faci: MPIC chart: bumping chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025676 (https://phabricator.wikimedia.org/T361060) [08:04:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61461 and previous config saved to /var/cache/conftool/dbconfig/20240430-080409-root.json [08:04:36] (03CR) 10Marostegui: [C:03+2] check_depooled: Add es6 [software] - 10https://gerrit.wikimedia.org/r/1025675 (owner: 10Marostegui) [08:04:44] (03CR) 10CI reject: [V:04-1] 
check_depooled: Add es6 [software] - 10https://gerrit.wikimedia.org/r/1025675 (owner: 10Marostegui) [08:04:46] (03Abandoned) 10Marostegui: check_depooled: Add es6 [software] - 10https://gerrit.wikimedia.org/r/1025674 (owner: 10Marostegui) [08:05:39] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025678 (https://phabricator.wikimedia.org/T361397) [08:05:41] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025678 (https://phabricator.wikimedia.org/T361397) (owner: 10TrainBranchBot) [08:06:28] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025678 (https://phabricator.wikimedia.org/T361397) (owner: 10TrainBranchBot) [08:08:28] !log bounce prometheus@k8s in eqiad - T343529 [08:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:33] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529 [08:09:16] (03CR) 10Brouberol: [C:03+1] MPIC chart: bumping chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025676 (https://phabricator.wikimedia.org/T361060) (owner: 10Santiago Faci) [08:09:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61462 and previous config saved to /var/cache/conftool/dbconfig/20240430-080924-root.json [08:10:16] (03CR) 10Santiago Faci: [C:03+2] MPIC chart: bumping chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025676 (https://phabricator.wikimedia.org/T361060) (owner: 10Santiago Faci) [08:11:27] (03Merged) 10jenkins-bot: MPIC chart: bumping chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025676 (https://phabricator.wikimedia.org/T361060) (owner: 10Santiago Faci) [08:13:48] (03PS1) 10Muehlenhoff: Adapt cookbooks to 
new Cumin aliases for analytics hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1025679 [08:14:18] (03CR) 10Muehlenhoff: "Respective patch for the cookbooks is at" [puppet] - 10https://gerrit.wikimedia.org/r/1024625 (owner: 10Muehlenhoff) [08:14:46] PROBLEM - Check whether ferm is active by checking the default input chain on mw1416 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:14:58] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [08:15:04] (03CR) 10Muehlenhoff: [C:03+2] cloudweb: Enable profile::auto_restarts::service for apache/envoy [puppet] - 10https://gerrit.wikimedia.org/r/1024347 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:15:50] (03CR) 10Muehlenhoff: "Doh, indeed. I'll abandon." [puppet] - 10https://gerrit.wikimedia.org/r/1024647 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:16:03] (03Abandoned) 10Muehlenhoff: cloudweb: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1024647 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:19:06] (03PS1) 10Arnaudb: mariadb: removes db2114 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1024762 (https://phabricator.wikimedia.org/T356053) [08:19:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61463 and previous config saved to /var/cache/conftool/dbconfig/20240430-081915-root.json [08:19:42] (03CR) 10Marostegui: [C:03+1] mariadb: removes db2114 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1024762 (https://phabricator.wikimedia.org/T356053) (owner: 10Arnaudb) [08:19:46] (03CR) 10Arnaudb: [C:03+2] mariadb: removes db2114 from dbctl [puppet] - 
10https://gerrit.wikimedia.org/r/1024762 (https://phabricator.wikimedia.org/T356053) (owner: 10Arnaudb) [08:20:25] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:20:40] (KubernetesAPINotScrapable) firing: (4) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [08:21:42] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.3 refs T361397 [08:21:47] T361397: 1.43.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T361397 [08:22:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Removes db2114, repools db2151', diff saved to https://phabricator.wikimedia.org/P61464 and previous config saved to /var/cache/conftool/dbconfig/20240430-082200-arnaudb.json [08:22:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 5%: Post replag', diff saved to https://phabricator.wikimedia.org/P61465 and previous config saved to /var/cache/conftool/dbconfig/20240430-082208-arnaudb.json [08:24:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61466 and previous config saved to /var/cache/conftool/dbconfig/20240430-082430-root.json [08:27:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [08:27:37] (03CR) 10Muehlenhoff: [C:03+1] "Looks good. 
But let's also include" [puppet] - 10https://gerrit.wikimedia.org/r/1024739 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [08:27:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [08:27:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2121 (T360332)', diff saved to https://phabricator.wikimedia.org/P61467 and previous config saved to /var/cache/conftool/dbconfig/20240430-082753-arnaudb.json [08:27:58] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [08:29:10] (03PS1) 10Filippo Giunchedi: prometheus: use longer-expiration pki client certs for k8s [puppet] - 10https://gerrit.wikimedia.org/r/1025682 (https://phabricator.wikimedia.org/T343529) [08:30:08] (03CR) 10Muehlenhoff: [C:03+1] "Nvm, I just noticed you did that in a later patch." [puppet] - 10https://gerrit.wikimedia.org/r/1024739 (https://phabricator.wikimedia.org/T325398) (owner: 10JHathaway) [08:30:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T360332)', diff saved to https://phabricator.wikimedia.org/P61468 and previous config saved to /var/cache/conftool/dbconfig/20240430-083033-arnaudb.json [08:30:43] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1024741 (https://phabricator.wikimedia.org/T325407) (owner: 10JHathaway) [08:31:41] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1024740 (https://phabricator.wikimedia.org/T325407) (owner: 10JHathaway) [08:34:19] (03PS1) 10Marostegui: es2037: Remove comment [puppet] - 10https://gerrit.wikimedia.org/r/1025683 [08:34:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61469 and previous config saved to /var/cache/conftool/dbconfig/20240430-083420-root.json [08:34:55] (03CR) 
10Marostegui: [C:03+2] es2037: Remove comment [puppet] - 10https://gerrit.wikimedia.org/r/1025683 (owner: 10Marostegui) [08:37:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 25%: Post replag', diff saved to https://phabricator.wikimedia.org/P61470 and previous config saved to /var/cache/conftool/dbconfig/20240430-083713-arnaudb.json [08:37:23] (03CR) 10JMeybohm: "I'm not sure this will work as the staging intermediates are configured with a 24h expiry (hieradata/role/common/pki/multirootca.yaml)" [puppet] - 10https://gerrit.wikimedia.org/r/1025682 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [08:38:07] (03CR) 10JMeybohm: ""unresolve"" [puppet] - 10https://gerrit.wikimedia.org/r/1025682 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [08:39:26] (03PS1) 10Marostegui: mariadb: Add es7 eqiad servers [puppet] - 10https://gerrit.wikimedia.org/r/1025684 (https://phabricator.wikimedia.org/T355285) [08:39:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61471 and previous config saved to /var/cache/conftool/dbconfig/20240430-083935-root.json [08:40:11] (03CR) 10Volans: [V:03+2 C:03+2] "Merging and testing it on netbox-next first" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1024838 (owner: 10Volans) [08:40:48] (03CR) 10Marostegui: [C:03+2] mariadb: Add es7 eqiad servers [puppet] - 10https://gerrit.wikimedia.org/r/1025684 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [08:43:29] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1025428 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [08:44:34] (03PS1) 10Marostegui: mariadb: Set up eqiad es7 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1025685 (https://phabricator.wikimedia.org/T355285) [08:44:47] RECOVERY - Check whether ferm is active by checking the default 
input chain on mw1416 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:45:17] (03CR) 10Marostegui: [C:03+2] mariadb: Set up eqiad es7 hosts [puppet] - 10https://gerrit.wikimedia.org/r/1025685 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [08:45:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P61472 and previous config saved to /var/cache/conftool/dbconfig/20240430-084541-arnaudb.json [08:46:03] (03CR) 10Btullis: [V:03+1] Allow the ceph-common package to create the ceph user/group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025428 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [08:47:07] (03CR) 10David Caro: [C:03+1] "About the origin of the requirement:" [puppet] - 10https://gerrit.wikimedia.org/r/1025428 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [08:47:45] (03CR) 10David Caro: [C:03+1] "Iec9c8acd1e5f85da04a2c3c2024715376a623f9d" [puppet] - 10https://gerrit.wikimedia.org/r/1025428 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [08:48:44] (03PS2) 10Btullis: Allow the ceph-common package to create the ceph user/group [puppet] - 10https://gerrit.wikimedia.org/r/1025428 (https://phabricator.wikimedia.org/T362993) [08:48:59] !log volans@cumin1002 START - Cookbook sre.deploy.python-code netbox to netbox-dev2002.codfw.wmnet with reason: Update Netbox dependencies for netbox-next - volans@cumin1002 [08:49:24] (03CR) 10JMeybohm: "AIUI etcd does reload the Cert from disk with every client connection, so there is no need to restart (https://github.com/etcd-io/etcd/com" [puppet] - 10https://gerrit.wikimedia.org/r/1025422 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [08:49:29] (03Abandoned) 10JMeybohm: etcd: Notify etcd on PKI cert generation and reneval [puppet] - 10https://gerrit.wikimedia.org/r/1025422 
(https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [08:49:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1174 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61473 and previous config saved to /var/cache/conftool/dbconfig/20240430-084926-root.json [08:49:41] (03PS1) 10Marostegui: db1158: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1025686 [08:51:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1158', diff saved to https://phabricator.wikimedia.org/P61474 and previous config saved to /var/cache/conftool/dbconfig/20240430-085129-root.json [08:51:33] (03CR) 10Marostegui: [C:03+2] db1158: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1025686 (owner: 10Marostegui) [08:52:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 50%: Post replag', diff saved to https://phabricator.wikimedia.org/P61475 and previous config saved to /var/cache/conftool/dbconfig/20240430-085219-arnaudb.json [08:54:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61476 and previous config saved to /var/cache/conftool/dbconfig/20240430-085441-root.json [08:54:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1158.eqiad.wmnet with OS bookworm [08:55:57] (03PS1) 10Marostegui: es1035: Make it es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1025689 (https://phabricator.wikimedia.org/T355285) [08:56:11] (03CR) 10Btullis: Allow the ceph-common package to create the ceph user/group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025428 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [08:56:21] (03CR) 10Marostegui: [C:03+2] es1035: Make it es7 master [puppet] - 10https://gerrit.wikimedia.org/r/1025689 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [08:56:47] (03PS1) 10Marostegui: Revert "db1158: Disable 
notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1025618 [08:59:54] (03CR) 10Btullis: [C:03+2] Allow the ceph-common package to create the ceph user/group [puppet] - 10https://gerrit.wikimedia.org/r/1025428 (https://phabricator.wikimedia.org/T362993) (owner: 10Btullis) [08:59:55] (03PS1) 10JMeybohm: cfssl::cert: Add before_services parameter [puppet] - 10https://gerrit.wikimedia.org/r/1025690 (https://phabricator.wikimedia.org/T363307) [09:00:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P61477 and previous config saved to /var/cache/conftool/dbconfig/20240430-090048-arnaudb.json [09:04:10] (03PS2) 10JMeybohm: cfssl::cert: Add before_services parameter [puppet] - 10https://gerrit.wikimedia.org/r/1025690 (https://phabricator.wikimedia.org/T363307) [09:04:35] (03CR) 10Ladsgroup: [C:03+1] "It works and won't break anything from what I'm seeing in the code but honestly the code is quite fragile. 
The data needs to be provided f" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025670 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [09:07:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Post replag', diff saved to https://phabricator.wikimedia.org/P61478 and previous config saved to /var/cache/conftool/dbconfig/20240430-090724-arnaudb.json [09:08:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1158.eqiad.wmnet with reason: host reimage [09:09:36] (03CR) 10Marostegui: "The data to etcd cannot be provided via dbctl until this is merged, as dbctl fails to recognize es6 as a valid section :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025670 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [09:09:38] Amir1: ^ [09:09:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance [09:09:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2179.codfw.wmnet with reason: Maintenance [09:10:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2179 (T360332)', diff saved to https://phabricator.wikimedia.org/P61479 and previous config saved to /var/cache/conftool/dbconfig/20240430-091002-arnaudb.json [09:10:07] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [09:10:15] (03CR) 10Muehlenhoff: [C:03+2] druid::broker: Switch to firewall::service for test_analytics [puppet] - 10https://gerrit.wikimedia.org/r/1024403 (owner: 10Muehlenhoff) [09:10:36] (03CR) 10Effie Mouzeli: [C:03+2] admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/895696 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm) [09:10:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with 
reason: Maintenance [09:10:43] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [09:10:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T352010)', diff saved to https://phabricator.wikimedia.org/P61480 and previous config saved to /var/cache/conftool/dbconfig/20240430-091049-ladsgroup.json [09:10:54] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:11:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1158.eqiad.wmnet with reason: host reimage [09:12:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T360332)', diff saved to https://phabricator.wikimedia.org/P61481 and previous config saved to /var/cache/conftool/dbconfig/20240430-091221-arnaudb.json [09:13:09] jouncebot: next [09:13:09] In 0 hour(s) and 46 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T1000) [09:13:21] (03CR) 10Marostegui: [C:03+2] etcd.php: Add es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025670 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [09:13:33] (03Merged) 10jenkins-bot: admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs [deployment-charts] - 10https://gerrit.wikimedia.org/r/895696 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm) [09:13:52] (03PS1) 10Muehlenhoff: preseed: Extend Ganeti PoP config to also cover magru [puppet] - 10https://gerrit.wikimedia.org/r/1025691 (https://phabricator.wikimedia.org/T362730) [09:14:07] (03Merged) 10jenkins-bot: etcd.php: Add es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025670 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [09:14:56] (03PS1) 10Muehlenhoff: preseed: Remove obsolete config [puppet] - 10https://gerrit.wikimedia.org/r/1025693 
[09:14:59] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:1025670|etcd.php: Add es6 (T355285 T355424)]] [09:15:05] T355285: Productionize es10[35-40] - https://phabricator.wikimedia.org/T355285 [09:15:06] T355424: Productionize es[2035-2040] - https://phabricator.wikimedia.org/T355424 [09:15:10] (03PS1) 10Hnowlan: Revert "mw-parsoid: bump workers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025619 [09:15:41] (03CR) 10JMeybohm: [C:03+1] Revert "mw-parsoid: bump workers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025619 (owner: 10Hnowlan) [09:15:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T360332)', diff saved to https://phabricator.wikimedia.org/P61482 and previous config saved to /var/cache/conftool/dbconfig/20240430-091556-arnaudb.json [09:16:01] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [09:16:17] (03CR) 10Hnowlan: [C:03+2] Revert "mw-parsoid: bump workers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025619 (owner: 10Hnowlan) [09:17:12] (03Merged) 10jenkins-bot: Revert "mw-parsoid: bump workers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025619 (owner: 10Hnowlan) [09:17:39] (03PS3) 10JMeybohm: cfssl::cert: Add before_services parameter [puppet] - 10https://gerrit.wikimedia.org/r/1025690 (https://phabricator.wikimedia.org/T363307) [09:17:48] !log marostegui@deploy1002 marostegui: Backport for [[gerrit:1025670|etcd.php: Add es6 (T355285 T355424)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:17:54] !log marostegui@deploy1002 marostegui: Continuing with sync [09:18:48] !log volans@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) netbox to netbox-dev2002.codfw.wmnet with reason: Update Netbox dependencies for netbox-next - volans@cumin1002 [09:21:05] (03CR) 10Muehlenhoff: [C:03+2] preseed: Extend Ganeti PoP config to also cover magru [puppet] - 
10https://gerrit.wikimedia.org/r/1025691 (https://phabricator.wikimedia.org/T362730) (owner: 10Muehlenhoff) [09:21:22] (03PS1) 10Brouberol: mpic: specify extra TLS SAN for each release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025694 (https://phabricator.wikimedia.org/T361343) [09:22:16] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 7 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1025690 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [09:22:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Post replag', diff saved to https://phabricator.wikimedia.org/P61483 and previous config saved to /var/cache/conftool/dbconfig/20240430-092230-arnaudb.json [09:23:26] (03PS2) 10Brouberol: mpic: specify extra TLS SAN for each release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025694 (https://phabricator.wikimedia.org/T361343) [09:24:56] (03PS1) 10Jcrespo: dbbackups: Setup dbprov1006 & dbprov2006 and do s4 & s7 dumps there [puppet] - 10https://gerrit.wikimedia.org/r/1025695 (https://phabricator.wikimedia.org/T362509) [09:26:18] (03CR) 10Santiago Faci: [C:03+1] "Looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025694 (https://phabricator.wikimedia.org/T361343) (owner: 10Brouberol) [09:26:21] (03CR) 10Jcrespo: "Aiming for s4 and s7 next on both dcs for backups upgrade to 10.6. I will merge this, but it won't take effect until Tuesday next week. 
Wi" [puppet] - 10https://gerrit.wikimedia.org/r/1025695 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [09:26:39] !log draining mw2382.codfw.wmnet - T362938 [09:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:44] T362938: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938 [09:27:23] (03CR) 10Brouberol: [C:03+2] mpic: specify extra TLS SAN for each release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025694 (https://phabricator.wikimedia.org/T361343) (owner: 10Brouberol) [09:27:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P61484 and previous config saved to /var/cache/conftool/dbconfig/20240430-092729-arnaudb.json [09:27:41] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:27:50] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:27:51] jouncebot: nowandnext [09:27:51] For the next 0 hour(s) and 32 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T0800) [09:27:51] In 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T1000) [09:27:58] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:28:00] (03PS1) 10Santiago Faci: MPIC chart and helmfiles: Some fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025696 (https://phabricator.wikimedia.org/T361343) [09:28:10] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[09:28:25] !log jayme@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on mw2382.codfw.wmnet with reason: Degraded RAID/storage controller issues [09:28:39] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on mw2382.codfw.wmnet with reason: Degraded RAID/storage controller issues [09:28:44] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9756823 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b2b315a7-d925-49a5-80d5-19849b998b72) set by jayme@cumin1002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: Degra... [09:28:50] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [09:28:59] marostegui: I want to do a security deploy. Can you ping me once you are done? [09:29:00] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:29:06] Dreamy_Jazz: will do [09:29:09] Thanks! [09:29:57] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'.
[09:30:01] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:1025670|etcd.php: Add es6 (T355285 T355424)]] (duration: 15m 01s) [09:30:07] Dreamy_Jazz: done [09:30:09] T355285: Productionize es10[35-40] - https://phabricator.wikimedia.org/T355285 [09:30:09] T355424: Productionize es[2035-2040] - https://phabricator.wikimedia.org/T355424 [09:30:26] (03CR) 10Jcrespo: [C:03+1] "https://puppet-compiler.wmflabs.org/output/1025695/2180/" [puppet] - 10https://gerrit.wikimedia.org/r/1025695 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [09:30:27] Thanks [09:30:42] (03CR) 10Arnaudb: [C:03+1] dbbackups: Setup dbprov1006 & dbprov2006 and do s4 & s7 dumps there [puppet] - 10https://gerrit.wikimedia.org/r/1025695 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [09:32:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:32:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1158.eqiad.wmnet with OS bookworm [09:33:57] (03PS2) 10EoghanGaffney: apt-staging: Add access token for gitlab package puller [puppet] - 10https://gerrit.wikimedia.org/r/1025358 (https://phabricator.wikimedia.org/T347004) [09:33:58] (03CR) 10EoghanGaffney: apt-staging: Add access token for gitlab package puller (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025358 (https://phabricator.wikimedia.org/T347004) (owner: 10EoghanGaffney) [09:34:15] (03CR) 10Brouberol: [C:03+1] MPIC chart and helmfiles: Some fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025696 (https://phabricator.wikimedia.org/T361343) (owner: 10Santiago Faci) [09:34:36] (03CR) 10Filippo Giunchedi: [C:03+1] preseed: Remove obsolete config [puppet] - 10https://gerrit.wikimedia.org/r/1025693 (owner: 
10Muehlenhoff) [09:34:53] (03PS1) 10Muehlenhoff: Add ganeti700[1-4] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1025697 (https://phabricator.wikimedia.org/T362730) [09:34:57] (03CR) 10Muehlenhoff: [C:03+2] preseed: Remove obsolete config [puppet] - 10https://gerrit.wikimedia.org/r/1025693 (owner: 10Muehlenhoff) [09:35:05] (03CR) 10Santiago Faci: [C:03+2] "Let's merge!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025696 (https://phabricator.wikimedia.org/T361343) (owner: 10Santiago Faci) [09:36:12] (03Merged) 10jenkins-bot: MPIC chart and helmfiles: Some fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025696 (https://phabricator.wikimedia.org/T361343) (owner: 10Santiago Faci) [09:36:30] (03PS1) 10Marostegui: profile: Add es6 to the regex of valid sections [puppet] - 10https://gerrit.wikimedia.org/r/1025699 (https://phabricator.wikimedia.org/T355285) [09:37:35] (03CR) 10Marostegui: [C:03+2] Revert "db1158: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1025618 (owner: 10Marostegui) [09:39:01] !log Starting security deploy on tmux session [09:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:26] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti700[1-4] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1025697 (https://phabricator.wikimedia.org/T362730) (owner: 10Muehlenhoff) [09:39:38] (03CR) 10Marostegui: [C:03+2] profile: Add es6 to the regex of valid sections [puppet] - 10https://gerrit.wikimedia.org/r/1025699 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [09:41:00] !log jayme@cumin1002 conftool action : set/pooled=inactive; selector: name=mw2382.codfw.wmnet [09:42:14] Dreamy_Jazz: have you started deploying ? 
[09:42:19] Yes [09:42:35] ok [09:42:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P61485 and previous config saved to /var/cache/conftool/dbconfig/20240430-094237-arnaudb.json [09:42:53] (03CR) 10EoghanGaffney: [C:03+2] apt-staging: Add access token for gitlab package puller [puppet] - 10https://gerrit.wikimedia.org/r/1025358 (https://phabricator.wikimedia.org/T347004) (owner: 10EoghanGaffney) [09:42:56] This time running on a tmux session to avoid issues :) [09:43:52] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:44:08] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:45:42] (03CR) 10Filippo Giunchedi: "good point!
I think I'd be ok with 24h for staging though" [puppet] - 10https://gerrit.wikimedia.org/r/1025682 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [09:46:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Push es6 eqiad section T355285', diff saved to https://phabricator.wikimedia.org/P61486 and previous config saved to /var/cache/conftool/dbconfig/20240430-094635-marostegui.json [09:46:42] T355285: Productionize es10[35-40] - https://phabricator.wikimedia.org/T355285 [09:47:09] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:47:40] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2125.codfw.wmnet [09:48:53] (03PS1) 10Muehlenhoff: Switch db2125 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025700 (https://phabricator.wikimedia.org/T349619) [09:49:30] PROBLEM - Check whether ferm is active by checking the default input chain on parse1014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:49:33] BGP alerts are probably me [09:49:46] https://phabricator.wikimedia.org/T362938 [09:51:08] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1022 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:51:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Push es6 codfw config T355424', diff saved to https://phabricator.wikimedia.org/P61487 and previous config saved to /var/cache/conftool/dbconfig/20240430-095119-marostegui.json [09:51:26] T355424:
Productionize es[2035-2040] - https://phabricator.wikimedia.org/T355424 [09:52:09] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH certificates) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:52:47] (HelmReleaseBadStatus) firing: (2) Helm release mw-misc/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:53:30] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. [09:53:31] (03CR) 10Muehlenhoff: [C:03+2] Switch db2125 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025700 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:54:00] (03CR) 10Jcrespo: [C:03+2] dbbackups: Setup dbprov1006 & dbprov2006 and do s4 & s7 dumps there [puppet] - 10https://gerrit.wikimedia.org/r/1025695 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [09:54:11] (03PS1) 10Marostegui: es6 eqiad: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1025701 (https://phabricator.wikimedia.org/T355285) [09:55:46] PROBLEM - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is CRITICAL: CRITICAL - Unknown error executing dbctl config diff https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [09:55:58] checking [09:56:32] ah, puppet didn't run there yet [09:56:33] doing it manually [09:56:58] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete certs for wdqs/wcqs [puppet] - 10https://gerrit.wikimedia.org/r/1024420 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [09:57:30] PROBLEM - Check whether ferm is active by checking the default input chain on mw1415 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:57:46] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T360332)', diff saved to https://phabricator.wikimedia.org/P61488 and previous config saved to /var/cache/conftool/dbconfig/20240430-095745-arnaudb.json [09:57:47] (HelmReleaseBadStatus) firing: (17) Helm release mw-api-ext/canary on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:57:52] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [09:58:40] (03PS2) 10Stevemunene: datahub: create dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024365 (https://phabricator.wikimedia.org/T363298) [09:58:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2125.codfw.wmnet [09:58:51] (03PS1) 10Effie Mouzeli: Revert "admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025621 [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T1000) [10:00:40] (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [10:00:46] RECOVERY - Uncommitted dbctl configuration changes- check dbctl config diff on cumin2002 is OK: OK - no diffs https://wikitech.wikimedia.org/wiki/Dbctl%23Uncommitted_dbctl_diffs [10:00:57] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2138.codfw.wmnet [10:02:09] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:02:16] (03PS1) 10Muehlenhoff: Switch 
db2138 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025706 (https://phabricator.wikimedia.org/T349619) [10:02:47] (HelmReleaseBadStatus) firing: (17) Helm release mw-api-ext/canary on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:03:15] (03CR) 10Muehlenhoff: [C:03+2] Switch db2138 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025706 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:03:39] Security deploy failed [10:04:22] It seems the k8s deployment failed [10:04:42] Dreamy_Jazz: yes, we're looking into it (cc effie) [10:05:08] Dreamy_Jazz: sorry for that, I should have asked you to not deploy [10:05:09] It looks like it successfully rolled back though. [10:06:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P61489 and previous config saved to /var/cache/conftool/dbconfig/20240430-100607-root.json [10:06:15] (03CR) 10Marostegui: [C:03+2] es6 eqiad: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1025701 (https://phabricator.wikimedia.org/T355285) (owner: 10Marostegui) [10:06:41] If you could let me know when things are back to normal and I can re-try the security deploy, that would be great. [10:07:09] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:07:38] The security issue isn't particularly high in urgency, but it would be good if I could deploy it today. 
[10:07:43] (03PS1) 10Fabfur: site:magru: set definitive roles for cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1025707 (https://phabricator.wikimedia.org/T362729) [10:07:44] The ticket is https://phabricator.wikimedia.org/T338419 [10:07:47] (HelmReleaseBadStatus) firing: (18) Helm release mw-api-ext/canary on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:08:10] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 421, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:08:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2138.codfw.wmnet [10:08:54] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 497, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:09:58] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:10:40] (KubernetesAPINotScrapable) resolved: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [10:11:13] Dreamy_Jazz: we are working on it [10:11:30] Thanks [10:12:57] ACKNOWLEDGEMENT - MD RAID on mw2382 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T363811 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [10:13:04] 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T363811 (10ops-monitoring-bot) 03NEW [10:13:37] Dreamy_Jazz: have another go [10:13:56] and let us know how things are progressing [10:14:24] Hmm. 
The deploy security script errors out now. Looking into fixing that. [10:15:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2148.codfw.wmnet [10:16:04] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1001.eqiad.wmnet with OS bullseye [10:16:51] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9757021 (10JMeybohm) @Jhancock.wm I've tried powercycling the system and to restart iDRAC to see if the storage controller "comes back" but no luck. During boot I did see 2 SATA drives listed, though. Ofc. /... [10:17:06] (03PS1) 10Muehlenhoff: Switch db2148 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025709 (https://phabricator.wikimedia.org/T349619) [10:17:49] Looks like the rollback didn't reset the git HEAD of the repo in 1.43.0-wmf.2 [10:18:05] (03PS1) 10Jcrespo: mariadb: Upgrade db1150, db1171 and move s4, s7, s8 backups to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1025710 (https://phabricator.wikimedia.org/T362509) [10:18:15] I ran `git checkout origin/wmf/1.43.0-wmf.2` in the relevant repo [10:18:20] Will try the script again [10:18:44] (03CR) 10Muehlenhoff: [C:03+2] Switch db2148 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025709 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:19:30] The deploy script has started running. 
[10:19:30] RECOVERY - Check whether ferm is active by checking the default input chain on parse1014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:20:13] (03PS2) 10Jcrespo: mariadb: Upgrade db1150, db1171 and move s4, s7, s8 backups to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1025710 (https://phabricator.wikimedia.org/T362509) [10:21:08] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1022 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:21:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P61490 and previous config saved to /var/cache/conftool/dbconfig/20240430-102113-root.json [10:21:47] (03Abandoned) 10Jcrespo: mariadb: Migrate db2098 backups to db2198 and upgrade dbprov2002 to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1018276 (https://phabricator.wikimedia.org/T360751) (owner: 10Jcrespo) [10:22:25] (03Abandoned) 10Jcrespo: mariadb: Move services db2101->db2201,db2099->db2199, upgrade dbprov2003 [puppet] - 10https://gerrit.wikimedia.org/r/1019263 (https://phabricator.wikimedia.org/T358741) (owner: 10Jcrespo) [10:22:47] (HelmReleaseBadStatus) firing: (14) Helm release mw-api-ext/canary on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:24:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2148.codfw.wmnet [10:25:03] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2175.codfw.wmnet [10:25:58] (03PS1) 10Muehlenhoff: Switch db2175 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025711 (https://phabricator.wikimedia.org/T349619) [10:26:29] (03CR) 10Jcrespo: [C:04-1] "This is ready- although I need to 
check it more in detail, but I will *wait until next week* to be merged (it assumes manual upgrades and " [puppet] - 10https://gerrit.wikimedia.org/r/1025710 (https://phabricator.wikimedia.org/T362509) (owner: 10Jcrespo) [10:27:09] !log aokoth@cumin1002 START - Cookbook sre.hosts.decommission for hosts lists1004.eqiad.wmnet [10:27:47] (HelmReleaseBadStatus) firing: (14) Helm release mw-api-ext/canary on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:28:12] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1008 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:30:02] PROBLEM - Check whether ferm is active by checking the default input chain on mw1460 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:32:57] !log aokoth@cumin1002 START - Cookbook sre.dns.netbox [10:33:35] !log dreamyjazz Deployed security patch for T338419 [10:34:12] (03CR) 10Muehlenhoff: [C:03+2] Switch db2175 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025711 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:34:22] effie: The security deploy worked for wmf.2 [10:34:29] ok cool [10:34:33] The script is not proceeding to wmf.3 [10:34:38] ? [10:34:45] *now not not [10:34:57] (03PS1) 10Marostegui: es203[57]: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1025713 (https://phabricator.wikimedia.org/T355424) [10:35:15] Not sure how I wrote "not" when I meant to write "now". 
[10:35:24] Dreamy_Jazz: I am not sure I understand [10:35:37] (03PS1) 10Jcrespo: [WIP]dbbackups: Add backups for es6 and es7 [puppet] - 10https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) [10:35:39] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage [10:35:44] what is the current status? [10:35:48] (03CR) 10CI reject: [V:04-1] [WIP]dbbackups: Add backups for es6 and es7 [puppet] - 10https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [10:36:05] Deployed to wmf.2 and currently deploying to wmf.3 [10:36:06] (03PS2) 10Jcrespo: [WIP]dbbackups: Add backups for es6 and es7 [puppet] - 10https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) [10:36:18] (03CR) 10Marostegui: [C:03+2] es203[57]: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1025713 (https://phabricator.wikimedia.org/T355424) (owner: 10Marostegui) [10:36:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P61491 and previous config saved to /var/cache/conftool/dbconfig/20240430-103618-root.json [10:36:31] (03CR) 10Jcrespo: "This requires grant config changes and deployment before it can work." [puppet] - 10https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [10:36:33] The `deploy_security.py` script deploys to the wiki versions one at a time [10:36:42] Dreamy_Jazz: so things are progressing ok ? [10:36:45] Yes [10:37:02] (03CR) 10Jcrespo: [C:04-1] "Waiting for es7 setup + other blockers, at least until next week." 
[puppet] - 10https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [10:37:40] I have never run deploy_security.py, which is why I need a more detailed version of where things are [10:37:58] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage [10:38:17] (03CR) 10Marostegui: [WIP]dbbackups: Add backups for es6 and es7 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [10:38:24] !log aokoth@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lists1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - aokoth@cumin1002" [10:38:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2175.codfw.wmnet [10:38:49] It runs `scap sync-file` which does the actual deployment [10:39:15] (03CR) 10Kamila Součková: [C:03+1] mw-videoscaler: helmfile scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020860 (https://phabricator.wikimedia.org/T355292) (owner: 10Hnowlan) [10:39:33] !log aokoth@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: lists1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - aokoth@cumin1002" [10:39:33] !log aokoth@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:39:34] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lists1004.eqiad.wmnet [10:39:42] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9757105 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by aokoth@cumin1002 for hosts: `lists1004.eqiad.wmnet` -
lists1004.eqiad.wmnet (**PA... [10:39:58] (03PS3) 10Jcrespo: dbbackups: Add backups for es6 and es7 [puppet] - 10https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) [10:40:00] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [10:40:13] (03CR) 10Jcrespo: [C:04-1] dbbackups: Add backups for es6 and es7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [10:40:30] (03CR) 10Jcrespo: [C:04-1] dbbackups: Add backups for es6 and es7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025714 (https://phabricator.wikimedia.org/T363812) (owner: 10Jcrespo) [10:41:46] PROBLEM - Check whether ferm is active by checking the default input chain on mw1374 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:42:34] PROBLEM - Check whether ferm is active by checking the default input chain on parse2008 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:42:38] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2043 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:43:00] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:43:16] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw, AS64602/IPv6: Connect - kubernetes-codfw 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:43:20] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2034 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:43:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm [10:43:52] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757114 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti7001.magru.wmnet with OS bookworm [10:44:54] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9757115 (10JMeybohm) @Jhancock.wm I did shutdown the server for now. Could you please try do drain flea power and see if the controller comes back after? 
If not please open a case with Dell [10:45:01] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2189.codfw.wmnet [10:45:50] 10ops-eqiad, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9757116 (10jcrespo) 05In progress→03Resolved [10:46:32] (03PS1) 10Muehlenhoff: Switch db2189 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025716 (https://phabricator.wikimedia.org/T349619) [10:47:47] (HelmReleaseBadStatus) resolved: Helm release istio-system/namespace-certificates on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=istio-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:47:51] !log dreamyjazz Deployed security patch for T338419 [10:48:02] !log Security deploy finished [10:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:15] ( cc effie ^ ) [10:49:06] ok cool [10:49:09] thank you [10:49:45] (03CR) 10Muehlenhoff: [C:03+2] Switch db2189 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025716 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:50:59] (03PS1) 10Ssingh: magru: add cp nodes text: cp700[1-8] and upload: cp70(09|1[0-6]) [puppet] - 10https://gerrit.wikimedia.org/r/1025718 (https://phabricator.wikimedia.org/T362729) [10:51:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P61492 and previous config saved to /var/cache/conftool/dbconfig/20240430-105124-root.json [10:53:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2189.codfw.wmnet [10:54:08] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1001.eqiad.wmnet with OS bullseye 
[10:54:14] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2204.codfw.wmnet [10:55:11] (03Abandoned) 10Fabfur: site:magru: set definitive roles for cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/1025707 (https://phabricator.wikimedia.org/T362729) (owner: 10Fabfur) [10:55:36] (03PS1) 10Muehlenhoff: Switch db2204 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025720 (https://phabricator.wikimedia.org/T349619) [10:57:28] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 419, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:57:30] RECOVERY - Check whether ferm is active by checking the default input chain on mw1415 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:58:12] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:58:16] (03CR) 10Muehlenhoff: [C:03+2] Switch db2204 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025720 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:58:53] (JobUnavailable) firing: Reduced availability for job ncredir in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:00:02] RECOVERY - Check whether ferm is active by checking the default input chain on mw1460 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:00:23] (03CR) 10JMeybohm: [C:03+1] Revert "admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025621 (owner: 10Effie Mouzeli) [11:00:28] (03CR) 10Effie Mouzeli: [C:03+2] Revert "admin_ng/cert-manager: Remove dependency on 
kubernetesMasters.cidrs" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025621 (owner: 10Effie Mouzeli) [11:01:14] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 497, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:01:36] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9757195 (10JMeybohm) Host is set pooled=inactive, cordoned in k8s, removed from BGP and shut down, so all yours [11:03:13] (03Merged) 10jenkins-bot: Revert "admin_ng/cert-manager: Remove dependency on kubernetesMasters.cidrs" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025621 (owner: 10Effie Mouzeli) [11:03:37] (03PS1) 10Ssingh: hiera: acme_chief: add magru to authorized_regexes [puppet] - 10https://gerrit.wikimedia.org/r/1025722 (https://phabricator.wikimedia.org/T346722) [11:03:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2204.codfw.wmnet [11:04:22] 10ops-codfw, 06SRE, 06Data-Platform-SRE, 13Patch-For-Review: Fatal error detected on elastic2088 - https://phabricator.wikimedia.org/T361286#9757207 (10MoritzMuehlenhoff) I've set the server back to "Active" in Netbox. [11:04:54] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2207.codfw.wmnet [11:05:57] (03PS1) 10Btullis: Test a fix for the bootstrapping of mon daemons on cephosd* [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) [11:06:22] !log jiji@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [11:06:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P61493 and previous config saved to /var/cache/conftool/dbconfig/20240430-110629-root.json [11:06:41] !log jiji@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[11:06:57] !log jiji@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:07:22] !log jiji@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:07:29] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) (owner: 10Btullis) [11:07:54] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [11:08:01] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [11:08:04] (03CR) 10Fabfur: [C:03+1] hiera: acme_chief: add magru to authorized_regexes [puppet] - 10https://gerrit.wikimedia.org/r/1025722 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [11:08:44] (03CR) 10Ssingh: [C:03+2] hiera: acme_chief: add magru to authorized_regexes [puppet] - 10https://gerrit.wikimedia.org/r/1025722 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [11:08:44] (03PS3) 10Stevemunene: datahub: create dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024365 (https://phabricator.wikimedia.org/T363298) [11:09:25] (03PS1) 10Muehlenhoff: Switch db2207 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025726 (https://phabricator.wikimedia.org/T349619) [11:10:40] (03PS2) 10Ssingh: magru: add cp nodes text: cp700[1-8] and upload: cp70(09|1[0-6]) [puppet] - 10https://gerrit.wikimedia.org/r/1025718 (https://phabricator.wikimedia.org/T362729) [11:11:46] RECOVERY - Check whether ferm is active by checking the default input chain on mw1374 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:11:53] (03CR) 10Stevemunene: datahub: create dse-k8s namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024365 (https://phabricator.wikimedia.org/T363298) (owner: 10Stevemunene) [11:12:09] 
(KubernetesAPILatency) resolved: High Kubernetes API latency (PUT certificates) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=eqiad&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=PUT - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:34] RECOVERY - Check whether ferm is active by checking the default input chain on parse2008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:12:38] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2043 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:13:00] (03CR) 10Muehlenhoff: [C:03+2] Switch db2207 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025726 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:13:20] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2034 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:15:17] (03PS2) 10Btullis: Test a fix for the bootstrapping of mon daemons on cephosd* [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) [11:16:07] (03PS1) 10Superpes15: [itwiki] Create a new 'arbcom' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025727 (https://phabricator.wikimedia.org/T363805) [11:16:10] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage [11:16:13] (03PS3) 10Ssingh: magru: add DNS boxes dns700[12] [puppet] - 10https://gerrit.wikimedia.org/r/1025354 (https://phabricator.wikimedia.org/T346722) [11:16:28] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - 
https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:16:30] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) (owner: 10Btullis) [11:16:59] (03PS1) 10Btullis: Add dummy keydata for a wildcard ceph monitor [labs/private] - 10https://gerrit.wikimedia.org/r/1025728 (https://phabricator.wikimedia.org/T332987) [11:17:08] (03PS2) 10Superpes15: [itwiki] Create a new 'arbcom' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025727 (https://phabricator.wikimedia.org/T363805) [11:19:19] (03PS3) 10Superpes15: [itwiki] Create a new 'arbcom' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025727 (https://phabricator.wikimedia.org/T363805) [11:19:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage [11:19:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2207.codfw.wmnet [11:20:09] (03PS9) 10Urbanecm: Add account_conversion event streams. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/989216 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [11:20:10] (03PS2) 10Ssingh: magru: add lvs700[1-3] and related configuration [puppet] - 10https://gerrit.wikimedia.org/r/1023850 (https://phabricator.wikimedia.org/T346722) [11:21:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P61494 and previous config saved to /var/cache/conftool/dbconfig/20240430-112135-root.json [11:26:50] (03PS3) 10Btullis: Test a fix for the bootstrapping of mon daemons on cephosd* [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) [11:27:10] (03PS2) 10KartikMistry: Update MinT to 2024-03-28-061726-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1015258 (https://phabricator.wikimedia.org/T333969) [11:27:25] (03CR) 10Btullis: [V:03+2 C:03+2] Add dummy keydata for a wildcard ceph monitor [labs/private] - 10https://gerrit.wikimedia.org/r/1025728 (https://phabricator.wikimedia.org/T332987) (owner: 10Btullis) [11:28:06] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) (owner: 10Btullis) [11:29:38] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2185/console" [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) (owner: 10Btullis) [11:36:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P61495 and previous config saved to /var/cache/conftool/dbconfig/20240430-113640-root.json [11:37:30] !log btullis@cumin1002 START - 
Cookbook sre.hosts.reboot-single for host cephosd1001.eqiad.wmnet [11:38:49] (03PS1) 10Ladsgroup: mariadb: Add SLAVE MONITOR grant to the replication user [puppet] - 10https://gerrit.wikimedia.org/r/1025731 [11:39:59] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [11:43:00] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1001.eqiad.wmnet [11:44:52] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM. Please also collect +1 from @dcaro." [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) (owner: 10Btullis) [11:47:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [11:47:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm [11:47:36] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757334 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti7001.magru.wmnet with OS bookworm com... 
[11:50:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti7002.magru.wmnet with OS bookworm [11:51:07] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti7002.magru.wmnet with OS bookworm [11:57:16] (03PS7) 10TChin: Add datasets-config helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) [11:57:47] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 12), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9757373 (10SGupta-WMF) @Scott_French Thank you ! We are in process of creating... [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T1200) [12:01:26] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1162.eqiad.wmnet [12:02:45] (03PS1) 10Muehlenhoff: Switch db1162 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025736 (https://phabricator.wikimedia.org/T349619) [12:06:16] (03CR) 10Muehlenhoff: [C:03+2] Switch db1162 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025736 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:09:49] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:09:56] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[12:10:05] (03CR) 10Alexandros Kosiaris: [C:03+1] Add datasets-config helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [12:11:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1162.eqiad.wmnet [12:19:52] (03CR) 10Marostegui: [C:03+1] mariadb: Add SLAVE MONITOR grant to the replication user [puppet] - 10https://gerrit.wikimedia.org/r/1025731 (owner: 10Ladsgroup) [12:20:25] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:20:25] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1182.eqiad.wmnet [12:23:29] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7002.magru.wmnet with reason: host reimage [12:23:50] (03PS1) 10Muehlenhoff: wmp-laptop-sre: Add support for magru [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1025742 [12:24:10] !log jforrester@deploy1002 Started deploy [integration/docroot@b88f9e1]: Update VisualEditor links, post-JSDoc (b88f9e1674) [12:24:17] !log jforrester@deploy1002 Finished deploy [integration/docroot@b88f9e1]: Update VisualEditor links, post-JSDoc (b88f9e1674) (duration: 00m 06s) [12:24:47] (03PS1) 10Muehlenhoff: Switch db1182 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025743 (https://phabricator.wikimedia.org/T349619) [12:25:34] 6 second deploys are rather nicer than 15 minute ones. 
[12:25:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7002.magru.wmnet with reason: host reimage [12:28:20] (03CR) 10Muehlenhoff: [C:03+2] Switch db1182 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025743 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:30:09] (03PS8) 10TChin: Add datasets-config and datasets-config-next helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019085 (https://phabricator.wikimedia.org/T357434) [12:30:31] (03CR) 10David Caro: Test a fix for the bootstrapping of mon daemons on cephosd* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) (owner: 10Btullis) [12:30:43] (03PS14) 10TChin: Add datasets-config helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) [12:30:51] (03CR) 10Brouberol: [C:03+1] datahub: create dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024365 (https://phabricator.wikimedia.org/T363298) (owner: 10Stevemunene) [12:32:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1182.eqiad.wmnet [12:33:06] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1188.eqiad.wmnet [12:34:06] (03PS1) 10Muehlenhoff: Switch db1188 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025745 (https://phabricator.wikimedia.org/T349619) [12:35:56] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:37:40] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/mpic-next: apply [12:37:52] (03CR) 10TChin: Add datasets-config helm chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016471 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [12:38:48] (03PS2) 10Ladsgroup: mariadb: Add SLAVE MONITOR grant to the replication user [puppet] - 10https://gerrit.wikimedia.org/r/1025731 [12:38:56] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Add SLAVE MONITOR grant to the replication user [puppet] - 10https://gerrit.wikimedia.org/r/1025731 (owner: 10Ladsgroup) [12:40:45] (03CR) 10David Caro: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1023837 (https://phabricator.wikimedia.org/T363176) (owner: 10FNegri) [12:40:56] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=commons - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:42:27] (03CR) 10Fabfur: [C:03+1] "ok for me" [puppet] - 10https://gerrit.wikimedia.org/r/1025718 (https://phabricator.wikimedia.org/T362729) (owner: 10Ssingh) [12:45:17] (03CR) 10Btullis: [V:03+1] Test a fix for the bootstrapping of mon daemons on cephosd* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) (owner: 10Btullis) [12:45:49] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [12:46:47] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [12:47:43] (03CR) 10Elukey: "I am a bit on the fence on this one, since it may happen that if we don't get the same result that v2 shows after backfilling we may end u" 
[puppet] - 10https://gerrit.wikimedia.org/r/1024790 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [12:49:37] (03CR) 10Muehlenhoff: [C:03+2] Switch db1188 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025745 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:49:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [12:50:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7002.magru.wmnet with OS bookworm [12:50:09] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757585 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti7002.magru.wmnet with OS bookworm com... [12:51:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti7003.magru.wmnet with OS bookworm [12:51:21] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757586 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti7003.magru.wmnet with OS bookworm [12:51:32] (KubernetesCalicoDown) firing: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:51:47] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757592 (10MoritzMuehlenhoff) [12:52:00] PROBLEM - PyBal backends health check on 
lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:52:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:53:03] (03PS1) 10Santiago Faci: MPIC Chart: Fixing single quotes and bumping the version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025768 (https://phabricator.wikimedia.org/T361343) [12:53:49] (03CR) 10Elukey: [C:03+2] role::restbase::production: cleanup after PKI migration [puppet] - 10https://gerrit.wikimedia.org/r/1024738 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [12:54:26] (03PS5) 10Elukey: role::sessionstore: upgrade the Cassandra's Truststore [puppet] - 10https://gerrit.wikimedia.org/r/1024691 (https://phabricator.wikimedia.org/T352647) [12:55:24] !log uploaded openjdk-8 8u412-ga-1~deb11u1 to bullseye-wikimedia (forward port of latest Java 8 security updates) [12:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:32] (KubernetesCalicoDown) firing: (5) kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:56:50] (03CR) 10Ssingh: [C:03+1] wmp-laptop-sre: Add support for magru [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1025742 (owner: 10Muehlenhoff) [12:57:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1188.eqiad.wmnet [12:57:27] (03CR) 10Brouberol: [C:03+1] MPIC Chart: Fixing single quotes and bumping the version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025768 (https://phabricator.wikimedia.org/T361343) (owner: 10Santiago Faci) [12:57:58] (03CR) 10Ssingh: [C:03+2] magru: add cp 
nodes text: cp700[1-8] and upload: cp70(09|1[0-6]) [puppet] - 10https://gerrit.wikimedia.org/r/1025718 (https://phabricator.wikimedia.org/T362729) (owner: 10Ssingh) [12:58:17] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db1125.eqiad.wmnet onto db2114.codfw.wmnet [12:58:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.copy (exit_code=0) Will create a clone of db1125.eqiad.wmnet onto db2114.codfw.wmnet [12:58:30] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db1125.eqiad.wmnet onto db2114.codfw.wmnet [12:58:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.copy (exit_code=0) Will create a clone of db1125.eqiad.wmnet onto db2114.codfw.wmnet [12:58:50] all the kubernetes staging rumble is me [12:58:55] !log installing util-linux security updates [12:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:00] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T1300) [13:00:05] cscott and Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:01:00] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:01:32] (KubernetesCalicoDown) resolved: (5) kubestage2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:01:54] (03PS2) 10Santiago Faci: MPIC Chart: Fixing single quotes and bumping the version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025768 (https://phabricator.wikimedia.org/T361343) [13:02:06] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7009.magru.wmnet with OS bullseye [13:02:15] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9757620 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7009.magru.wmnet with OS bullseye [13:03:12] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db1125.eqiad.wmnet onto db2114.codfw.wmnet [13:03:14] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db1125.eqiad.wmnet onto db2114.codfw.wmnet [13:04:07] Hi :D [13:06:37] (03CR) 10Brouberol: [C:03+1] MPIC Chart: Fixing single quotes and bumping the version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025768 (https://phabricator.wikimedia.org/T361343) (owner: 10Santiago Faci) [13:07:48] cscott: Superpes let me know when done with deployment. Adding my patch in the window. [13:09:19] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1025682 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [13:11:45] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db1125.eqiad.wmnet onto db2114.codfw.wmnet [13:11:47] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db1125.eqiad.wmnet onto db2114.codfw.wmnet [13:12:54] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db1125.eqiad.wmnet onto db2114.codfw.wmnet [13:12:56] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db1125.eqiad.wmnet onto db2114.codfw.wmnet [13:13:18] i can deploy today :) [13:13:26] Uhhhhh :3 [13:13:34] Yep @kart_ :) [13:13:39] (03CR) 10Btullis: [V:03+1] Test a fix for the bootstrapping of mon daemons on cephosd* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) (owner: 10Btullis) [13:15:06] (03CR) 10Urbanecm: [C:04-1] [itwiki] Create a new 'arbcom' usergroup (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025727 (https://phabricator.wikimedia.org/T363805) (owner: 10Superpes15) [13:15:20] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db1125.eqiad.wmnet onto db2114.codfw.wmnet [13:15:22] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db1125.eqiad.wmnet onto db2114.codfw.wmnet [13:15:38] cscott: hi, are you around? 
[13:17:19] (03CR) 10Superpes15: [itwiki] Create a new 'arbcom' usergroup (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025727 (https://phabricator.wikimedia.org/T363805) (owner: 10Superpes15) [13:17:22] (03PS4) 10Superpes15: [itwiki] Create a new 'arbcom' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025727 (https://phabricator.wikimedia.org/T363805) [13:17:28] (03PS4) 10Btullis: Test a fix for the bootstrapping of mon daemons on cephosd* [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) [13:17:48] kart_: would it be possible to reschedule that patch for tomorrow please? i'd like to check with WMCZ they're prepared for this change. would that work for you? [13:18:30] (03CR) 10Urbanecm: [C:03+2] [itwiki] Create a new 'arbcom' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025727 (https://phabricator.wikimedia.org/T363805) (owner: 10Superpes15) [13:18:36] (03PS1) 10Vgutierrez: hiera: Enable benthos on ncredir@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1025773 (https://phabricator.wikimedia.org/T362776) [13:19:01] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2187/console" [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) (owner: 10Btullis) [13:19:04] urbanecm: see: https://phabricator.wikimedia.org/T353049 [13:19:24] (03CR) 10Santiago Faci: [C:03+2] MPIC Chart: Fixing single quotes and bumping the version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025768 (https://phabricator.wikimedia.org/T361343) (owner: 10Santiago Faci) [13:19:26] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1197.eqiad.wmnet [13:19:27] (03Merged) 10jenkins-bot: [itwiki] Create a new 'arbcom' usergroup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025727 
(https://phabricator.wikimedia.org/T363805) (owner: 10Superpes15) [13:19:38] urbanecm: No issue though. I can postpone to Thursday (Off day tomorrow for me) [13:20:10] thanks! [13:20:20] (03PS1) 10Muehlenhoff: Switch db1197 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025774 (https://phabricator.wikimedia.org/T349619) [13:20:23] (03Merged) 10jenkins-bot: MPIC Chart: Fixing single quotes and bumping the version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025768 (https://phabricator.wikimedia.org/T361343) (owner: 10Santiago Faci) [13:20:25] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:20:38] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db1125.eqiad.wmnet onto db2114.codfw.wmnet [13:20:44] (03CR) 10FNegri: [C:03+2] wmcs::metricsinfra: set Grafana scrape interval [puppet] - 10https://gerrit.wikimedia.org/r/1023837 (https://phabricator.wikimedia.org/T363176) (owner: 10FNegri) [13:20:45] Let me know when config deployment is done, urbanecm. Need to deploy MinT+Cxserver. 
[13:20:53] kart_: will do [13:21:14] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db1125.eqiad.wmnet onto db2114.codfw.wmnet [13:21:50] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1025727|[itwiki] Create a new 'arbcom' usergroup (T363805)]] [13:21:56] (03CR) 10Muehlenhoff: [C:03+2] Switch db1197 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025774 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:21:58] T363805: Creating a new 'arbcom' usergroup on itwiki - https://phabricator.wikimedia.org/T363805 [13:22:01] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [13:22:30] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2188/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025773 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [13:22:50] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [13:23:46] PROBLEM - Check whether ferm is active by checking the default input chain on mw1485 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:23:50] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7003.magru.wmnet with reason: host reimage [13:24:03] (03PS1) 10Muehlenhoff: Remove obsolete certificate [puppet] - 10https://gerrit.wikimedia.org/r/1025775 (https://phabricator.wikimedia.org/T360439) [13:24:12] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable benthos on ncredir@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1025773 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [13:25:08] !log arnaudb@cumin1002 START - Cookbook 
sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [13:25:25] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [13:26:06] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [13:26:06] i'm getting 13:25:17 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-04-30-132158-publish (ran as mwdeploy@mw2382.codfw.wmnet) returned [255]: ssh: connect to host mw2382.codfw.wmnet port 22: Connection timed out [13:26:09] is that expected? [13:26:17] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [13:26:42] urbanecm: yep i'm around [13:26:49] cscott: great, thanks! [13:26:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1197.eqiad.wmnet [13:26:57] !log urbanecm@deploy1002 superpes and urbanecm: Backport for [[gerrit:1025727|[itwiki] Create a new 'arbcom' usergroup (T363805)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:26:58] do you want to self-deploy, or should i deploy for you cscott ? 
[13:27:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7003.magru.wmnet with reason: host reimage [13:27:02] T363805: Creating a new 'arbcom' usergroup on itwiki - https://phabricator.wikimedia.org/T363805 [13:27:03] Superpes: please test your patch [13:27:04] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9757729 (10taavi) [13:27:06] 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T363811#9757727 (10taavi) →14Duplicate dup:03T362938 [13:27:09] urbanecm: looks like T362938 [13:27:10] T362938: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938 [13:27:16] thanks taavi, as always [13:27:25] (SystemdUnitFailed) firing: (3) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:29] seems it is shutdown, so not an issue [13:27:34] urbanecm: if you could deploy for me it would be nice, although i've got the bits i haven't done a deploy myself for a long time & am rusty [13:27:37] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1222.eqiad.wmnet [13:27:39] cscott: will do [13:27:55] Yep looking [13:28:07] (03PS3) 10C. Scott Ananian: Turn on ParserMigration extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024407 [13:28:12] (03PS2) 10C. Scott Ananian: Quiet ParserMigration notice for 30 days after acknowledgement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024408 [13:28:25] Ok fine thanks urbanecm [13:28:29] !log urbanecm@deploy1002 superpes and urbanecm: Continuing with sync [13:28:31] thanks [13:28:36] (03CR) 10Urbanecm: [C:03+2] Turn on ParserMigration extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024407 (owner: 10C. 
Scott Ananian) [13:28:39] (03CR) 10Urbanecm: [C:03+2] Quiet ParserMigration notice for 30 days after acknowledgement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024408 (owner: 10C. Scott Ananian) [13:28:47] (03PS1) 10Muehlenhoff: Switch db1222 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025776 (https://phabricator.wikimedia.org/T349619) [13:29:24] (03Merged) 10jenkins-bot: Turn on ParserMigration extension everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024407 (owner: 10C. Scott Ananian) [13:29:26] (03Merged) 10jenkins-bot: Quiet ParserMigration notice for 30 days after acknowledgement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024408 (owner: 10C. Scott Ananian) [13:30:16] (03PS1) 10Muehlenhoff: Remove obsolete stub cert [labs/private] - 10https://gerrit.wikimedia.org/r/1025777 (https://phabricator.wikimedia.org/T360439) [13:31:38] (03CR) 10Muehlenhoff: [C:03+2] Switch db1222 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025776 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:32:25] (SystemdUnitFailed) firing: (5) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:33:10] (03PS4) 10Clément Goubert: revscoring-articlequality: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018989 (https://phabricator.wikimedia.org/T362316) [13:33:28] (03PS1) 10Ssingh: update service.yaml for text and upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1025780 (https://phabricator.wikimedia.org/T362729) [13:33:40] (03Abandoned) 10Muehlenhoff: Enable profile::auto_restarts::service for redis/arclamp [puppet] - 10https://gerrit.wikimedia.org/r/1024263 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:34:46] PROBLEM - Check whether ferm is 
active by checking the default input chain on mw2422 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:34:54] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9757781 (10akosiaris) @Jclark-ctr syslog doesn't have anything, these are the last few lines ` 2024-04-25T19:17:00.091655+00:00 parse1002 systemd[1]: Starting Export confd Prometheus metrics... 2024-04-25... [13:35:26] (JobUnavailable) firing: (2) Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:35:27] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7009.magru.wmnet with OS bullseye [13:35:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1222.eqiad.wmnet [13:35:40] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9757783 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7009.magru.wmnet with OS bullseye executed with errors: - cp7009 (**F... 
[13:35:47] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1233.eqiad.wmnet [13:36:16] (03CR) 10Btullis: datahub: create dse-k8s namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024365 (https://phabricator.wikimedia.org/T363298) (owner: 10Stevemunene) [13:36:38] (03CR) 10Vgutierrez: [C:03+1] update service.yaml for text and upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1025780 (https://phabricator.wikimedia.org/T362729) (owner: 10Ssingh) [13:36:40] (03PS5) 10Elukey: revscoring-articlequality: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018989 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [13:36:57] (03PS4) 10Clément Goubert: revertrisk: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018987 (https://phabricator.wikimedia.org/T362316) [13:37:25] (SystemdUnitFailed) firing: (6) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:37:32] (03CR) 10Ssingh: [C:03+2] update service.yaml for text and upload clusters [puppet] - 10https://gerrit.wikimedia.org/r/1025780 (https://phabricator.wikimedia.org/T362729) (owner: 10Ssingh) [13:37:33] (03PS4) 10Clément Goubert: revscoring-articletopic: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018991 (https://phabricator.wikimedia.org/T362316) [13:37:42] (03PS1) 10Muehlenhoff: Switch db1233 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025781 (https://phabricator.wikimedia.org/T349619) [13:37:50] (03PS5) 10Elukey: revscoring-articletopic: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018991 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [13:37:59] (03PS4) 
10Clément Goubert: revscoring-draftquality: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018993 (https://phabricator.wikimedia.org/T362316) [13:38:26] (03CR) 10BBlack: [C:03+1] "LGTM, no obvious typos or anything, IP addrs match." [puppet] - 10https://gerrit.wikimedia.org/r/1025354 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [13:38:40] urbanecm: are you doing both patches at once, or one at a time?  let me know when i should test. [13:38:55] cscott: i'm waiting on scap to finish with Superpes's patch now [13:38:57] (03PS5) 10Elukey: revscoring-draftquality: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018993 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [13:38:57] (CalicoKubeControllersDown) firing: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [13:39:03] cscott: then i plan to do both of them at the same time [13:39:07] unless that is a problem for you [13:39:15] no problem that should be fine. 
[13:39:20] (03PS4) 10Elukey: revscoring-drafttopic: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018995 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [13:39:22] ok [13:39:24] (03PS1) 10JMeybohm: Use a blocksize of 30 for staging ipv4 pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025783 (https://phabricator.wikimedia.org/T345823) [13:39:27] (03PS5) 10Elukey: revscoring-drafttopic: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018995 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [13:39:28] (03CR) 10CI reject: [V:04-1] revscoring-drafttopic: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018995 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [13:39:48] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018995 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [13:39:52] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7009.magru.wmnet with OS bullseye [13:39:55] (03PS4) 10Clément Goubert: revscoring-editquality-damaging: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018997 (https://phabricator.wikimedia.org/T362316) [13:40:02] (03CR) 10Eevans: [C:03+1] role::sessionstore: upgrade the Cassandra's Truststore [puppet] - 10https://gerrit.wikimedia.org/r/1024691 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [13:40:06] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9757796 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7009.magru.wmnet with OS bullseye [13:40:12] (03CR) 10Muehlenhoff: [C:03+2] Switch db1233 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1025781 
(https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:40:22] (03PS1) 10Fabfur: hiera:magru: adding magru dc to authorized ncredir regex [puppet] - 10https://gerrit.wikimedia.org/r/1025784 (https://phabricator.wikimedia.org/T362729) [13:40:55] Uhm too slow :D [13:41:00] (03PS5) 10Elukey: revscoring-editquality-damaging: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018997 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [13:41:32] Today it's taking a quarter of an hour [13:41:44] (03PS4) 10Elukey: revscoring-editquality-goodfaith: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018999 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [13:41:51] (03PS4) 10Clément Goubert: revscoring-editquality-reverted: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019001 (https://phabricator.wikimedia.org/T362316) [13:41:52] (03CR) 10CI reject: [V:04-1] revscoring-editquality-goodfaith: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018999 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [13:41:59] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1025727|[itwiki] Create a new 'arbcom' usergroup (T363805)]] (duration: 20m 09s) [13:42:05] finally [13:42:05] T363805: Creating a new 'arbcom' usergroup on itwiki - https://phabricator.wikimedia.org/T363805 [13:42:28] (03PS5) 10Elukey: revscoring-editquality-reverted: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1019001 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [13:42:42] (03PS5) 10Elukey: revscoring-editquality-goodfaith: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018999 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [13:42:47] (03CR) 10Fabfur: [C:04-2] 
hiera:magru: adding magru dc to authorized ncredir regex [puppet] - 10https://gerrit.wikimedia.org/r/1025784 (https://phabricator.wikimedia.org/T362729) (owner: 10Fabfur) [13:42:48] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1024407|Turn on ParserMigration extension everywhere]], [[gerrit:1024408|Quiet ParserMigration notice for 30 days after acknowledgement]] [13:43:00] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:43:02] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-staging_30443: Servers kubestage2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:43:13] (03PS4) 10Ssingh: magru: add DNS boxes dns700[12] [puppet] - 10https://gerrit.wikimedia.org/r/1025354 (https://phabricator.wikimedia.org/T346722) [13:43:57] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7001.magru.wmnet with OS bullseye [13:44:12] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9757827 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7001.magru.wmnet with OS bullseye [13:44:30] (03PS1) 10Vgutierrez: hiera: Enable benthos on ncredir@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1025785 (https://phabricator.wikimedia.org/T362776) [13:44:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1233.eqiad.wmnet [13:45:23] (03CR) 10Ssingh: "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/1025354 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [13:45:25] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:30] (03CR) 10Ssingh: [C:03+2] magru: add DNS boxes dns700[12] [puppet] - 10https://gerrit.wikimedia.org/r/1025354 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [13:45:41] (ConfdResourceFailed) firing: (8) confd resource _srv_config-master_pybal_magru_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:47:09] (03PS3) 10Elukey: article-description: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018960 (https://phabricator.wikimedia.org/T362316) (owner: 10Clément Goubert) [13:47:35] (03PS4) 10Clément Goubert: readability: Switch prod to mw-api-int-ro [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018965 (https://phabricator.wikimedia.org/T362316) [13:47:37] !log urbanecm@deploy1002 urbanecm and cscott: Backport for [[gerrit:1024407|Turn on ParserMigration extension everywhere]], [[gerrit:1024408|Quiet ParserMigration notice for 30 days after acknowledgement]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:47:58] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [13:48:16] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2024 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:48:24] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2189/co" [puppet] - 
10https://gerrit.wikimedia.org/r/1025785 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [13:48:47] (03PS2) 10Fabfur: hiera:magru: adding magru dc to authorized ncredir regex [puppet] - 10https://gerrit.wikimedia.org/r/1025784 (https://phabricator.wikimedia.org/T362729) [13:48:55] urbanecm test servers today is mwdebug1001.eqiad or mwdebug2001.codfw ? [13:49:00] (03CR) 10Fabfur: [C:03+1] magru: add lvs700[1-3] and related configuration [puppet] - 10https://gerrit.wikimedia.org/r/1023850 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [13:49:05] jouncebot: nowandnext [13:49:05] For the next 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T1300) [13:49:05] In 1 hour(s) and 10 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T1500) [13:49:15] cscott: either, doesn't really matter. [13:49:16] cscott: all of the above [13:49:22] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable benthos on ncredir@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1025785 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [13:49:26] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2050 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:49:36] the code's at all test servers now [13:49:40] cscott: waiting for your lgtm [13:49:56] (03PS4) 10Stevemunene: datahub: create dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024365 (https://phabricator.wikimedia.org/T363298) [13:50:41] (ConfdResourceFailed) firing: (84) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - 
https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:50:58] sukhe: ^^ [13:51:02] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host dns7001.wikimedia.org with OS bullseye [13:51:09] yeah, I will silence it for now [13:51:11] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bullseye [13:51:14] urbanecm: working on it.  wikidata looks good, checking commons [13:55:11] ack, waiting [13:55:11] oh wait, so it's non-magru as well [13:55:11] ok looking [13:55:12] 06SRE, 06Machine-Learning-Team, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9757868 (10elukey) All changes rebased and ready to go (for prod). The main idea is the following: * Remove WIKI_URL for revscoring isvcs, so we'll... [13:55:12] (03PS1) 10Hnowlan: kubernetes: improve log message for calico controllers [alerts] - 10https://gerrit.wikimedia.org/r/1025787 [13:55:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [13:55:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7003.magru.wmnet with OS bookworm [13:55:12] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757869 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti7003.magru.wmnet with OS bookworm com... 
[13:55:12] RECOVERY - Check whether ferm is active by checking the default input chain on mw1485 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:55:13] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757870 (10MoritzMuehlenhoff) [13:55:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti7004.magru.wmnet with OS bookworm [13:55:13] urbanecm: LGTM [13:55:13] !log urbanecm@deploy1002 urbanecm and cscott: Continuing with sync [13:55:13] proceeding [13:55:13] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti7004.magru.wmnet with OS bookworm [13:56:02] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9757891 (10MoritzMuehlenhoff) [13:56:42] PROBLEM - Check whether ferm is active by checking the default input chain on mw1375 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:57:25] (SystemdUnitFailed) firing: (6) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:58:06] Thanks urbanecm for the assistance :3 [13:58:53] (JobUnavailable) firing: (3) Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:59:36] expected in drmrs? benthos change was eqiad ^ [13:59:36] PROBLEM - Check whether ferm is active by checking the default input chain on mw2332 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:00:40] PROBLEM - Check whether ferm is active by checking the default input chain on mw2311 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:02:10] kart_: fyi, just waiting for scap to finish [14:02:25] (SystemdUnitFailed) firing: (6) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:03:58] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7010.magru.wmnet with OS bullseye [14:04:08] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9757928 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7010.magru.wmnet with OS bullseye [14:04:46] RECOVERY - Check whether ferm is active by checking the default input chain on mw2422 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:04:54] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp7002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [14:04:54] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [14:06:03] silencing all cp hosts [14:06:07] in 
magru [14:06:25] in fact, just these two, since the others should be OK and we want to know if they fail [14:06:28] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1024407|Turn on ParserMigration extension everywhere]], [[gerrit:1024408|Quiet ParserMigration notice for 30 days after acknowledgement]] (duration: 23m 40s) [14:06:52] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp[7002-7003].magru.wmnet with reason: will be reimaged soon [14:07:05] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7009.magru.wmnet with reason: host reimage [14:07:06] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp[7002-7003].magru.wmnet with reason: will be reimaged soon [14:07:27] (03PS1) 10Hnowlan: Enable async upload-by-URL via jobqueue on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025790 (https://phabricator.wikimedia.org/T295007) [14:07:48] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9757948 (10Jhancock.wm) draining didn't fix it. I'm gonna update the firmware and bios and then see where it is. [14:07:49] (03CR) 10Herron: [C:03+1] "SGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1025682 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [14:09:25] 10ops-eqiad, 06SRE, 06DBA: db1234 has hardware issues - https://phabricator.wikimedia.org/T363102#9757952 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [14:09:28] (03CR) 10Hnowlan: "Not sure whether enabling on just commons or as default is the right move here?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025790 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [14:10:04] Duration: 23m 40s :/ [14:10:06] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7009.magru.wmnet with reason: host reimage [14:10:08] !log arnaudb@cumin1002 START - Cookbook sre.mysql.copy Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [14:10:16] (03CR) 10Btullis: [V:03+1 C:03+2] Test a fix for the bootstrapping of mon daemons on cephosd* [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) (owner: 10Btullis) [14:10:30] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.mysql.copy (exit_code=99) Will create a clone of db2114.codfw.wmnet onto db1125.eqiad.wmnet [14:10:41] (ConfdResourceFailed) firing: (28) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:12:03] (03PS1) 10Eevans: {session,echo}store: update defaults for PKI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025791 (https://phabricator.wikimedia.org/T352647) [14:12:52] ^ I will clean the confd errors later [14:12:57] (03CR) 10BBlack: [C:03+1] magru: add lvs700[1-3] and related configuration [puppet] - 10https://gerrit.wikimedia.org/r/1023850 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:13:15] (03PS1) 10Stevemunene: Setup kubeconfigs for datahub-next on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1025792 (https://phabricator.wikimedia.org/T363832) [14:13:45] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7001.magru.wmnet with reason: host reimage [14:13:55] 06SRE, 06Machine-Learning-Team, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Migrate ml-services to mw-api-int - 
https://phabricator.wikimedia.org/T362316#9757995 (10elukey) a:03elukey [14:14:45] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [14:14:52] (03PS1) 10Vgutierrez: hiera,ncredir: Enable benthos on ncredir@esams [puppet] - 10https://gerrit.wikimedia.org/r/1025793 (https://phabricator.wikimedia.org/T362776) [14:15:22] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7011.magru.wmnet with OS bullseye [14:15:30] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758002 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7011.magru.wmnet with OS bullseye [14:16:40] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7001.magru.wmnet with reason: host reimage [14:17:18] (03CR) 10Brouberol: datahub: create dse-k8s namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024365 (https://phabricator.wikimedia.org/T363298) (owner: 10Stevemunene) [14:17:46] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [14:18:16] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2024 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:19:08] !log dcausse@deploy1002 Started deploy [airflow-dags/search@ab19bcd]: wdqs: deduplicate side-output events (T362508) [14:19:08] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 4 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1025793 (https://phabricator.wikimedia.org/T362776) 
(owner: 10Vgutierrez) [14:19:13] T362508: WDQS updater misbehaving in codfw - https://phabricator.wikimedia.org/T362508 [14:19:26] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2050 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:19:37] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@ab19bcd]: wdqs: deduplicate side-output events (T362508) (duration: 00m 29s) [14:19:53] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1002.eqiad.wmnet with OS bullseye [14:19:55] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9758022 (10Jclark-ctr) @Volans @Eevans same results between two different servers. total of 7 ssd have been swapped. it completes rebuild and then fail 2-3 days later. IDRAC shows no Errors. only mdstat s... [14:19:57] 10ops-eqiad, 06SRE, 06DBA: db1234 has hardware issues - https://phabricator.wikimedia.org/T363102#9758021 (10VRiley-WMF) This DIMM (A6) has been replaced and the server has been powered back on. 
[14:20:46] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp7004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [14:21:24] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7004.magru.wmnet with reason: will be reimaged soon [14:21:37] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7004.magru.wmnet with reason: will be reimaged soon [14:21:45] !log installing Java 8 security updates [14:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:28] (03PS2) 10Vgutierrez: hiera: Enable benthos on ncredir@esams [puppet] - 10https://gerrit.wikimedia.org/r/1025793 (https://phabricator.wikimedia.org/T362776) [14:22:46] RECOVERY - MariaDB disk space #page on db1234 is OK: DISK OK https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:23:12] 10ops-eqiad, 06SRE: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9758032 (10Jclark-ctr) Idrac is still up after almost 24 hours. 
i did move IDRAC port on switch to a different group of ports will monitor it [14:24:20] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 5 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1025793 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [14:26:15] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable benthos on ncredir@esams [puppet] - 10https://gerrit.wikimedia.org/r/1025793 (https://phabricator.wikimedia.org/T362776) (owner: 10Vgutierrez) [14:26:42] RECOVERY - Check whether ferm is active by checking the default input chain on mw1375 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:27:14] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7004.magru.wmnet with reason: host reimage [14:29:37] RECOVERY - Check whether ferm is active by checking the default input chain on mw2332 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:29:46] !log installing gnutls28 security updates on buster [14:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:41] RECOVERY - Check whether ferm is active by checking the default input chain on mw2311 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:30:49] (03CR) 10Fabfur: hiera:magru: adding magru dc to authorized ncredir regex [puppet] - 10https://gerrit.wikimedia.org/r/1025784 (https://phabricator.wikimedia.org/T362729) (owner: 10Fabfur) [14:31:03] !log bblack@cumin1002 conftool action : set/pooled=no; selector: name=ncredir3003.esams.wmnet [14:31:09] !log bblack@cumin1002 conftool action : set/pooled=yes; selector: name=ncredir3003.esams.wmnet [14:31:37] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7010.magru.wmnet with reason: host reimage 
[14:31:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7004.magru.wmnet with reason: host reimage [14:32:46] 10ops-eqiad, 06SRE, 06DBA: db1246 crashed - https://phabricator.wikimedia.org/T363119#9758059 (10Jclark-ctr) @Marostegui "At the creation of ticket i requested to not repeat any troubleshooting steps the where not effective" followed up with dell again they should be sending out parts shortly [14:33:07] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [14:34:22] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [14:34:22] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7009.magru.wmnet with OS bullseye [14:34:36] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758063 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7009.magru.wmnet with OS bullseye completed: - cp7009 (**WARN**) -... 
[14:34:40] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: disk failure for an-worker1087 - https://phabricator.wikimedia.org/T362871#9758064 (10Jclark-ctr) 05Open→03Resolved [14:34:48] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7010.magru.wmnet with reason: host reimage [14:35:08] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=ncredir3003.esams.wmnet [14:38:32] (03CR) 10Ladsgroup: [C:04-1] Enable async upload-by-URL via jobqueue on commons (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025790 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [14:38:53] (JobUnavailable) firing: (4) Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:06] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1002.eqiad.wmnet with reason: host reimage [14:41:23] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7011.magru.wmnet with reason: host reimage [14:41:45] (03CR) 10Valerio Bozzolan: [itwiki] Create a new 'arbcom' usergroup (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025727 (https://phabricator.wikimedia.org/T363805) (owner: 10Superpes15) [14:42:10] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1002.eqiad.wmnet with reason: host reimage [14:42:29] !log fabfur@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [14:43:33] !log fabfur@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [14:43:34] !log 
fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7001.magru.wmnet with OS bullseye [14:43:45] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758086 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7001.magru.wmnet with OS bullseye completed: - cp7001 (**WARN**) -... [14:43:53] (JobUnavailable) firing: (5) Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:45:20] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir3003.esams.wmnet [14:45:26] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ncredir3004.esams.wmnet [14:45:34] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7011.magru.wmnet with reason: host reimage [14:46:52] (03CR) 10Herron: "Makes sense thanks. 
Re: dropping the metric, its not something we have a trusted process for yet so it'll involve testing and careful exe" [puppet] - 10https://gerrit.wikimedia.org/r/1024790 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [14:48:30] (03Abandoned) 10Sbisson: Enable Wikistories on test and test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/989606 (https://phabricator.wikimedia.org/T352454) (owner: 10Sbisson) [14:49:30] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host mw2382.mgmt.codfw.wmnet with reboot policy GRACEFUL [14:50:12] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns7001.wikimedia.org with OS bullseye [14:50:16] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9758116 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bullseye exe... [14:51:08] (03CR) 10David Caro: Test a fix for the bootstrapping of mon daemons on cephosd* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) (owner: 10Btullis) [14:51:51] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [14:54:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [14:54:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7004.magru.wmnet with OS bookworm [14:54:19] (03CR) 10Andrea Denisse: "Hi team, this change is ready for review. 
It goes along with the following patch: https://gerrit.wikimedia.org/r/c/operations/dns/+/102480" [puppet] - 10https://gerrit.wikimedia.org/r/1024808 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [14:54:26] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9758121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti7004.magru.wmnet with OS bookworm comp... [14:55:28] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9758122 (10MoritzMuehlenhoff) [14:57:08] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758128 (10ssingh) [14:58:17] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [14:58:53] (JobUnavailable) firing: (5) Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:18] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [14:59:19] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7010.magru.wmnet with OS bullseye [14:59:32] (03PS2) 10Hnowlan: Enable async upload-by-URL via jobqueue on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025790 (https://phabricator.wikimedia.org/T295007) [14:59:32] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - 
https://phabricator.wikimedia.org/T362729#9758130 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7010.magru.wmnet with OS bullseye completed: - cp7010 (**PASS**) -... [14:59:44] (03CR) 10Hnowlan: Enable async upload-by-URL via jobqueue on commons (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025790 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [15:00:04] eoghan, jelto, arnoldokoth, and mutante: gettimeofday() says it's time for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T1500) [15:02:45] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758152 (10ssingh) [15:03:24] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1002.eqiad.wmnet with OS bullseye [15:05:22] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host mw2382.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:06:14] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['mw2382'] [15:06:32] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['mw2382'] [15:09:58] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [15:10:27] (03CR) 10Stevemunene: [C:03+1] Setup kubeconfigs for datahub-next on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1025792 (https://phabricator.wikimedia.org/T363832) (owner: 10Stevemunene) [15:10:44] 10ops-codfw, 06SRE: Inbound interface errors - https://phabricator.wikimedia.org/T363783#9758186 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [15:10:58] !log sukhe@cumin1002 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [15:10:59] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7011.magru.wmnet with OS bullseye [15:11:05] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758192 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7011.magru.wmnet with OS bullseye completed: - cp7011 (**PASS**) -... [15:11:25] (03CR) 10Stevemunene: Setup kubeconfigs for datahub-next on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1025792 (https://phabricator.wikimedia.org/T363832) (owner: 10Stevemunene) [15:12:18] (03CR) 10Ladsgroup: [C:03+1] Enable async upload-by-URL via jobqueue on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025790 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [15:13:01] ACKNOWLEDGEMENT - MD RAID on mw2382 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T363838 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:13:07] 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T363838 (10ops-monitoring-bot) 03NEW [15:14:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [15:14:09] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [15:16:28] (RoutinatorRsyncErrors) firing: (2) Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - 
https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:17:54] (03PS3) 10Hnowlan: Enable async upload-by-URL via jobqueue on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025790 (https://phabricator.wikimedia.org/T295007) [15:17:54] (03PS1) 10Ssingh: wmnet: add esams services under the right origin [dns] - 10https://gerrit.wikimedia.org/r/1025800 [15:19:23] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:20:19] (03CR) 10CDanis: [C:03+1] "lgtm one typo" [puppet] - 10https://gerrit.wikimedia.org/r/1025690 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [15:20:23] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:21:22] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758219 (10ssingh) [15:23:57] (CalicoKubeControllersDown) resolved: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [15:25:41] swfrench-wmf: has the etcd maintenance started? 
[15:25:59] (03PS6) 10Elukey: role::sessionstore: upgrade the Cassandra's Truststore [puppet] - 10https://gerrit.wikimedia.org/r/1024691 (https://phabricator.wikimedia.org/T352647) [15:26:28] sukhe: no, work won't start until the infrastructure window starting at 17:00 UTC [15:26:49] (03CR) 10Vgutierrez: [C:04-1] "regex needs some work as the current version won't match our FQDNs: `^ncredir[0-9]{4,}\..*\.wmnet` would do it though" [puppet] - 10https://gerrit.wikimedia.org/r/1025784 (https://phabricator.wikimedia.org/T362729) (owner: 10Fabfur) [15:27:00] !log elukey@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=inference,name=codfw [15:27:15] oh right, sorry 17:00 UTC [15:27:24] !log depool liftwing codfw for a couple of hours to test eqiad capabilities to handle the traffic [15:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:55] (03CR) 10Elukey: [C:03+2] role::sessionstore: upgrade the Cassandra's Truststore [puppet] - 10https://gerrit.wikimedia.org/r/1024691 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [15:28:10] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), and 2 others: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9758252 (10RKemper) 05Open→03Resolved [15:28:25] jouncebot: next [15:28:25] In 0 hour(s) and 31 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T1600) [15:28:56] !log move Cassandra instances on session store nodes to a new Java Truststore to support PKI - T352647 [15:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:02] T352647: Move Cassandra clusters to PKI - https://phabricator.wikimedia.org/T352647 [15:31:05] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host dns7001.wikimedia.org with OS bullseye [15:31:10] (03PS3) 10Fabfur: 
hiera:magru: adding magru dc to authorized ncredir regex [puppet] - 10https://gerrit.wikimedia.org/r/1025784 (https://phabricator.wikimedia.org/T362729) [15:31:11] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9758278 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bullseye [15:31:17] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:31:20] (03CR) 10Fabfur: "really dumb typo, fixed it (and even better with your regex proposal)" [puppet] - 10https://gerrit.wikimedia.org/r/1025784 (https://phabricator.wikimedia.org/T362729) (owner: 10Fabfur) [15:31:31] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:32:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.274 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:32:21] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51782 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:32:26] (03PS4) 10JMeybohm: cfssl::cert: Add before_services parameter [puppet] - 10https://gerrit.wikimedia.org/r/1025690 (https://phabricator.wikimedia.org/T363307) [15:32:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [15:32:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [15:32:56] ACKNOWLEDGEMENT - MD RAID on mw2382 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler 
auto-ack: https://phabricator.wikimedia.org/T363840 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:33:01] 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T363840 (10ops-monitoring-bot) 03NEW [15:33:42] (03CR) 10CDanis: [C:03+1] cfssl::cert: Add before_services parameter [puppet] - 10https://gerrit.wikimedia.org/r/1025690 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [15:33:58] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore2004.codfw.wmnet: Move to PKI Truststore - elukey@cumin1002 [15:34:30] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Jsn.sherman - https://phabricator.wikimedia.org/T363377#9758296 (10Eevans) @jsn.sherman Could you please do one of the following? Either: - Edit your [[ https://www.mediawiki.org/wiki/User:JSherman_(WMF) | user page ]] to include y... [15:34:32] (03CR) 10JMeybohm: cfssl::cert: Add before_services parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025690 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [15:35:42] (03CR) 10Elukey: [C:03+1] "Let's use v2, can't think of a better name :)" [puppet] - 10https://gerrit.wikimedia.org/r/1024790 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [15:35:45] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Jsn.sherman - https://phabricator.wikimedia.org/T363377#9758311 (10Eevans) @thcipriani It looks like you're the approver for group `deployment`... do you? [15:36:01] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Jsn.sherman - https://phabricator.wikimedia.org/T363377#9758312 (10Eevans) a:05BCornwall→03Eevans [15:36:04] (03CR) 10JMeybohm: [C:03+1] "Oh, sweet - thanks! Did not mean to push you to it!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1025787 (owner: 10Hnowlan) [15:36:31] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7002.magru.wmnet with OS bullseye [15:36:32] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7003.magru.wmnet with OS bullseye [15:36:50] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758314 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7002.magru.wmnet with OS bullseye [15:36:50] (03PS2) 10JMeybohm: Use a blocksize of /28 for staging ipv4 pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025783 (https://phabricator.wikimedia.org/T345823) [15:36:51] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758315 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7003.magru.wmnet with OS bullseye [15:36:54] (03CR) 10Herron: [C:03+2] "fair enough!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1024790 (https://phabricator.wikimedia.org/T359879) (owner: 10Herron) [15:36:58] (03CR) 10Elukey: "Sure sure everybody says that :D" [alerts] - 10https://gerrit.wikimedia.org/r/1025787 (owner: 10Hnowlan) [15:37:31] (03CR) 10Vgutierrez: hiera:magru: adding magru dc to authorized ncredir regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025784 (https://phabricator.wikimedia.org/T362729) (owner: 10Fabfur) [15:38:24] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1003.eqiad.wmnet with OS bullseye [15:39:20] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7012.magru.wmnet with OS bullseye [15:39:30] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758320 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7012.magru.wmnet with OS bullseye [15:40:06] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore2004.codfw.wmnet: Move to PKI Truststore - elukey@cumin1002 [15:43:53] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9758329 (10fgiunchedi) lshw as requested ` centrallog1002:~$ sudo lshw -class disk *-disk:0 description: ATA Disk product: SSDSC2KG960G8R physical id: 0 bus info: scs... 
[15:44:40] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore200[5,6].codfw.wmnet: Move to PKI Truststore - elukey@cumin1002 [15:45:05] (03CR) 10Hnowlan: [C:03+2] "Just saw the quick fix, no pressure at all I swear >_>" [alerts] - 10https://gerrit.wikimedia.org/r/1025787 (owner: 10Hnowlan) [15:46:11] (03Merged) 10jenkins-bot: kubernetes: improve log message for calico controllers [alerts] - 10https://gerrit.wikimedia.org/r/1025787 (owner: 10Hnowlan) [15:53:32] (03PS3) 10JMeybohm: Use a blocksize of /28 for staging-codfw ipv4 pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025783 (https://phabricator.wikimedia.org/T345823) [15:53:32] (03PS1) 10JMeybohm: Use a blocksize of /28 for staging-eqiad ipv4 pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025804 (https://phabricator.wikimedia.org/T345823) [15:56:14] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7005.magru.wmnet with OS bullseye [15:56:22] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758361 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7005.magru.wmnet with OS bullseye [15:56:25] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7004.magru.wmnet with OS bullseye [15:56:33] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758363 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7004.magru.wmnet with OS bullseye [15:56:36] (03CR) 10JMeybohm: [C:03+2] Use a blocksize of /28 for staging-codfw ipv4 pools [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025783 (https://phabricator.wikimedia.org/T345823) (owner: 10JMeybohm) [15:57:06] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) 
for nodes matching sessionstore200[5,6].codfw.wmnet: Move to PKI Truststore - elukey@cumin1002 [15:58:00] !log elukey@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching sessionstore1*.eqiad.wmnet: Move to PKI Truststore - elukey@cumin1002 [15:58:08] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1003.eqiad.wmnet with reason: host reimage [15:59:01] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:59:17] (03CR) 10Brouberol: [C:03+1] "Actually, https://datahub-next.wikimedia.org/ is not a thing. It could be, but is not. So the change is good as is. We can work on setting" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024365 (https://phabricator.wikimedia.org/T363298) (owner: 10Stevemunene) [15:59:32] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7013.magru.wmnet with OS bullseye [15:59:39] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7013.magru.wmnet with OS bullseye [16:00:05] jhathaway and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:01:17] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:01:33] (03PS6) 10Andrea Denisse: wmnet: Add discovery entries for the Prometheus hosts [dns] - 10https://gerrit.wikimedia.org/r/1025447 (https://phabricator.wikimedia.org/T356386) [16:01:34] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:02:01] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1003.eqiad.wmnet with reason: host reimage [16:02:20] (03CR) 10CI reject: [V:04-1] wmnet: Add discovery entries for the Prometheus hosts [dns] - 10https://gerrit.wikimedia.org/r/1025447 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [16:03:39] (03PS7) 10Andrea Denisse: wmnet: Add discovery entries for the Prometheus hosts [dns] - 10https://gerrit.wikimedia.org/r/1025447 (https://phabricator.wikimedia.org/T356386) [16:04:08] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.308 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:04:26] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51783 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:04:53] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7012.magru.wmnet with reason: host reimage [16:05:17] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7002.magru.wmnet with reason: host reimage [16:05:46] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7003.magru.wmnet with reason: host reimage [16:05:48] (03CR) 10Stevemunene: datahub: create dse-k8s namespaces (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024365 
(https://phabricator.wikimedia.org/T363298) (owner: 10Stevemunene) [16:06:04] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: sync on main [16:06:06] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9758428 (10Volans) Maybe a little drastic option, but could we try to reimage one of those 2 server and wait few days? That will surely wipe clean any manual procedure that was carried on the host since the... [16:07:06] 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T363838#9758436 (10Pppery) [16:07:12] 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T363840#9758434 (10Pppery) →14Duplicate dup:03T363838 [16:07:36] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7012.magru.wmnet with reason: host reimage [16:08:27] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns7001.wikimedia.org with OS bullseye [16:08:33] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9758441 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bullseye exe... [16:09:14] eoghan: the PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL above seems related to your changes for lists1004 [16:09:43] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [16:09:53] forgot to run the sre.dns.netbox cookbook? 
[16:09:54] (03CR) 10Stevemunene: [C:03+2] datahub: create dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024365 (https://phabricator.wikimedia.org/T363298) (owner: 10Stevemunene) [16:10:28] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7002.magru.wmnet with reason: host reimage [16:12:27] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [16:12:50] (03Merged) 10jenkins-bot: datahub: create dse-k8s namespaces [deployment-charts] - 10https://gerrit.wikimedia.org/r/1024365 (https://phabricator.wikimedia.org/T363298) (owner: 10Stevemunene) [16:12:54] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7003.magru.wmnet with reason: host reimage [16:15:46] (03PS1) 10Filippo Giunchedi: titan: trim 5m retention to 2y + 2w [puppet] - 10https://gerrit.wikimedia.org/r/1025806 (https://phabricator.wikimedia.org/T351927) [16:16:14] !log elukey@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching sessionstore1*.eqiad.wmnet: Move to PKI Truststore - elukey@cumin1002 [16:16:47] !log elukey@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=inference,name=codfw [16:17:31] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1025806 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [16:18:11] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [16:18:51] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[16:19:47] (03PS1) 10Elukey: role::sessionstore: move Cassandra instances to PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/1025810 (https://phabricator.wikimedia.org/T352647) [16:19:48] (03PS1) 10Elukey: role::sessionstore: cleanup unused TLS settings after PKI migration [puppet] - 10https://gerrit.wikimedia.org/r/1025811 (https://phabricator.wikimedia.org/T352647) [16:20:37] (03CR) 10Filippo Giunchedi: [C:03+2] titan: trim 5m retention to 2y + 2w [puppet] - 10https://gerrit.wikimedia.org/r/1025806 (https://phabricator.wikimedia.org/T351927) (owner: 10Filippo Giunchedi) [16:22:25] (03CR) 10Dzahn: "These are going to be like "prometheus-eqiad.eqiad", "prometheus-codfw.codfw" etc. Isn't that DC name kind of duplicate then?" [dns] - 10https://gerrit.wikimedia.org/r/1025447 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [16:22:25] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7005.magru.wmnet with reason: host reimage [16:22:30] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1003.eqiad.wmnet with OS bullseye [16:22:32] 10ops-eqiad, 06SRE: Degraded RAID on centrallog1002 - https://phabricator.wikimedia.org/T363660#9758527 (10andrea.denisse) Hello DC-Ops team, I can be your o11y point of contact for this task as it's easier for us to coordinate timezone wise. Cheers. 
[16:22:56] ACKNOWLEDGEMENT - MD RAID on mw2382 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T363847 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [16:23:02] 10ops-codfw, 06SRE: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T363847 (10ops-monitoring-bot) 03NEW [16:23:42] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2194/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025810 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:24:59] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7005.magru.wmnet with reason: host reimage [16:25:21] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7004.magru.wmnet with reason: host reimage [16:25:36] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7013.magru.wmnet with reason: host reimage [16:26:59] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2195/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025811 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:27:44] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7004.magru.wmnet with reason: host reimage [16:27:53] !log fabfur@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp7003.magru.wmnet with OS bullseye [16:27:59] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758598 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7003.magru.wmnet with 
OS bullseye executed with errors: - cp7003 (**... [16:30:16] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7013.magru.wmnet with reason: host reimage [16:30:55] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [16:33:28] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [16:33:30] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7012.magru.wmnet with OS bullseye [16:33:36] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758612 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7012.magru.wmnet with OS bullseye completed: - cp7012 (**PASS**) -... [16:36:02] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7002.magru.wmnet with OS bullseye [16:36:10] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758614 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7002.magru.wmnet with OS bullseye completed: - cp7002 (**PASS**) -... [16:36:15] 10ops-codfw, 06SRE, 06serviceops: Degraded RAID on mw2382 - https://phabricator.wikimedia.org/T362938#9758615 (10Jhancock.wm) idrac upgraded to 7.0.0. won't go any higher. Bios is already at 2.9.3. Reset the factory defaults and tried rebooting the idrac. reseated the backplane. None of these have fixed the... 
[16:38:56] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758653 (10ssingh) [16:39:36] (03PS2) 10Elukey: {session,echo}store: update defaults for PKI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025791 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [16:39:37] (03PS1) 10Elukey: kask: allow to skip the generation of the Cassandra CA bundle file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025816 (https://phabricator.wikimedia.org/T352647) [16:40:14] (03CR) 10Elukey: "I left a comment since we can remove the last bits of self-signed CAs, so that the file will not be rendered on the host anymore. I added " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025791 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [16:42:25] (03PS3) 10Elukey: {session,echo}store: update defaults for PKI [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025791 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [16:43:00] (03CR) 10Elukey: "Nevermind the chart change is not needed, we can drop the cassandra bit too entirely." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1025791 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [16:43:14] (03Abandoned) 10Elukey: kask: allow to skip the generation of the Cassandra CA bundle file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025816 (https://phabricator.wikimedia.org/T352647) (owner: 10Elukey) [16:43:39] (03CR) 10Dzahn: "if 1002 is the "production" one and 2001 is the newer one this looks good to me" [dns] - 10https://gerrit.wikimedia.org/r/1024806 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [16:43:55] (03PS5) 10Andrea Denisse: wmnet: Add discovery entries for grafana and grafana-next [dns] - 10https://gerrit.wikimedia.org/r/1024806 (https://phabricator.wikimedia.org/T356386) [16:46:43] (03CR) 10Andrea Denisse: [C:03+2] wmnet: Add discovery entries for grafana and grafana-next [dns] - 10https://gerrit.wikimedia.org/r/1024806 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [16:48:18] (03CR) 10Dzahn: "You will need the new discovery name on the cert before you merge this." 
[puppet] - 10https://gerrit.wikimedia.org/r/1024808 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [16:51:46] !log fabfur@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [16:52:22] (03CR) 10RLazarus: [C:03+1] hieradata: make etcd in eqiad read-only [puppet] - 10https://gerrit.wikimedia.org/r/1023966 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [16:52:36] (03CR) 10RLazarus: [C:03+1] hieradata: disable etcd replication on conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/1023554 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [16:52:48] !log fabfur@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [16:52:49] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7005.magru.wmnet with OS bullseye [16:52:57] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758694 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7005.magru.wmnet with OS bullseye completed: - cp7005 (**PASS**) -... 
[16:53:02] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [16:53:54] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [16:53:57] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7013.magru.wmnet with OS bullseye [16:54:09] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758695 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7013.magru.wmnet with OS bullseye completed: - cp7013 (**WARN**) -... [16:54:42] volans: We had to pause what we were doing, will take care of it. Thanks [16:56:14] (03CR) 10RLazarus: [C:03+1] etcdmirror::instance: absent all resources [puppet] - 10https://gerrit.wikimedia.org/r/1023555 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [16:56:20] (03CR) 10RLazarus: [C:03+1] etcdmirror: reconfigure with full-keyspace replication [puppet] - 10https://gerrit.wikimedia.org/r/1023556 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [16:56:22] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7004.magru.wmnet with OS bullseye [16:56:23] 10ops-eqiad, 06SRE, 06DBA: db1234 has hardware issues - https://phabricator.wikimedia.org/T363102#9758703 (10VRiley-WMF) 05In progress→03Resolved [16:56:25] (03CR) 10RLazarus: [C:03+1] hieradata: reenable etcd replication on conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/1023557 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [16:56:31] (03CR) 10RLazarus: [C:03+1] hieradata: return etcd in eqiad to read-write [puppet] - 10https://gerrit.wikimedia.org/r/1023967 
(https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [16:56:33] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9758704 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7004.magru.wmnet with OS bullseye completed: - cp7004 (**PASS**) -... [16:59:31] (03PS4) 10Fabfur: hiera:magru: adding magru dc to authorized ncredir regex [puppet] - 10https://gerrit.wikimedia.org/r/1025784 (https://phabricator.wikimedia.org/T362729) [17:00:04] swfrench-wmf: Time to do the MediaWiki infrastructure (UTC late) deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T1700). [17:00:41] (03PS1) 10Andrew Bogott: puppetserver-deploy-code: add -force to g10k call to invoke purging [puppet] - 10https://gerrit.wikimedia.org/r/1025818 [17:00:46] (03CR) 10Fabfur: [C:03+1] wmnet: add esams services under the right origin [dns] - 10https://gerrit.wikimedia.org/r/1025800 (owner: 10Ssingh) [17:01:27] the etcd work will start shortly, beginning with taking the scap lock [17:01:31] (03CR) 10Fabfur: hiera:magru: adding magru dc to authorized ncredir regex (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025784 (https://phabricator.wikimedia.org/T362729) (owner: 10Fabfur) [17:01:37] !log swfrench@deploy1002 Locking from deployment [ALL REPOSITORIES]: etcd replication maintenance - T358636 [17:01:44] T358636: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636 [17:02:07] (03CR) 10Scott French: [C:03+2] hieradata: make etcd in eqiad read-only [puppet] - 10https://gerrit.wikimedia.org/r/1023966 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [17:03:03] !log putting etcd in read-only mode for T358636 [17:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:51] (03PS1) 
10Kimberly Sarabia: Deploy a11y settings to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025819 (https://phabricator.wikimedia.org/T362147) [17:07:30] (03CR) 10Scott French: [C:03+2] hieradata: disable etcd replication on conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/1023554 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [17:09:52] !log disabling etcd replication into conf2005 for T358636 [17:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:57] T358636: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636 [17:13:41] (EtcdReplicationDown) firing: etcd replication down on conf2005:8000 #page - https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication - TODO - https://alerts.wikimedia.org/?q=alertname%3DEtcdReplicationDown [17:14:02] ^ this is expected [17:14:19] got it thanks swfrench-wmf [17:14:20] I acked the page on vops [17:14:32] (03PS1) 10Andrea Denisse: grafana: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1025820 (https://phabricator.wikimedia.org/T360414) [17:14:42] should maybe put in a silence for that alert [17:14:51] apologies for the noise: I completely forgot that was p.age [17:16:06] (03CR) 10Dzahn: [C:03+2] deployment_server: stop installing python-gitdb, python-git [puppet] - 10https://gerrit.wikimedia.org/r/1023955 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [17:16:18] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Jsn.sherman - https://phabricator.wikimedia.org/T363377#9758815 (10thcipriani) reason for access looks good to me. Approved. 
[17:16:42] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Jsn.sherman - https://phabricator.wikimedia.org/T363377#9758817 (10thcipriani) [17:18:25] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2197/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025820 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:18:26] swfrench-wmf: Could you please LMK the ETA to release scap lock "Locking from deployment [ALL REPOSITORIES]: etcd replication maintenance" ? [17:18:51] I silenced everything for job=etcdmirror for 4h, we can unsilence early when we complete the work on time [17:19:27] xcollazo: we have the deployment calendar blocked out until 18 UTC and will likely use most or all of it [17:20:06] Got it, thanks. [17:20:10] we'll announce in here when finished though :) [17:21:18] (03CR) 10BBlack: [C:03+1] wmnet: add esams services under the right origin [dns] - 10https://gerrit.wikimedia.org/r/1025800 (owner: 10Ssingh) [17:23:17] (03PS12) 10Bking: search-platform: monitoring/alert on upstream MW API errors [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) [17:23:28] (03PS2) 10Andrea Denisse: grafana: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1025820 (https://phabricator.wikimedia.org/T360414) [17:24:01] (03CR) 10Bking: "Done" [alerts] - 10https://gerrit.wikimedia.org/r/1025453 (https://phabricator.wikimedia.org/T363609) (owner: 10Bking) [17:24:47] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2198/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025820 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [17:32:57] (03PS1) 10Andrew Bogott: horizon local_settings.py: Set 
CSRF_TRUSTED_ORIGINS [puppet] - 10https://gerrit.wikimedia.org/r/1025829 [17:33:18] (03CR) 10CI reject: [V:04-1] horizon local_settings.py: Set CSRF_TRUSTED_ORIGINS [puppet] - 10https://gerrit.wikimedia.org/r/1025829 (owner: 10Andrew Bogott) [17:33:22] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1025829 (owner: 10Andrew Bogott) [17:34:15] (03PS2) 10Andrew Bogott: horizon local_settings.py: Set CSRF_TRUSTED_ORIGINS [puppet] - 10https://gerrit.wikimedia.org/r/1025829 [17:34:25] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1025829 (owner: 10Andrew Bogott) [17:34:49] (03CR) 10Scott French: [C:03+2] etcdmirror::instance: absent all resources [puppet] - 10https://gerrit.wikimedia.org/r/1023555 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [17:37:02] (03CR) 10Andrew Bogott: [C:03+2] horizon local_settings.py: Set CSRF_TRUSTED_ORIGINS [puppet] - 10https://gerrit.wikimedia.org/r/1025829 (owner: 10Andrew Bogott) [17:38:20] (03PS3) 10Scott French: etcdmirror: reconfigure with full-keyspace replication [puppet] - 10https://gerrit.wikimedia.org/r/1023556 (https://phabricator.wikimedia.org/T358636) [17:38:20] (03PS3) 10Scott French: hieradata: reenable etcd replication on conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/1023557 (https://phabricator.wikimedia.org/T358636) [17:42:10] (03CR) 10Scott French: [C:03+2] etcdmirror: reconfigure with full-keyspace replication [puppet] - 10https://gerrit.wikimedia.org/r/1023556 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [17:43:40] 06SRE, 10Observability-Logging, 10Wikimedia-Logstash: Investigate methods to rate-limit/discard excessive log messages closer to the producer - https://phabricator.wikimedia.org/T331879#9758921 (10colewhite) {T363856} [17:44:25] (03PS1) 10Andrew Bogott: horizon local_settings.py: CSRF_TRUSTED_ORIGINS requires an https prefix [puppet] - 
10https://gerrit.wikimedia.org/r/1025830 [17:44:35] (03CR) 10CI reject: [V:04-1] horizon local_settings.py: CSRF_TRUSTED_ORIGINS requires an https prefix [puppet] - 10https://gerrit.wikimedia.org/r/1025830 (owner: 10Andrew Bogott) [17:45:08] (03PS2) 10Andrew Bogott: horizon local_settings.py: CSRF_TRUSTED_ORIGINS requires an https prefix [puppet] - 10https://gerrit.wikimedia.org/r/1025830 [17:45:25] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:45:42] (03CR) 10Andrew Bogott: [V:03+2 C:03+2] horizon local_settings.py: CSRF_TRUSTED_ORIGINS requires an https prefix [puppet] - 10https://gerrit.wikimedia.org/r/1025830 (owner: 10Andrew Bogott) [17:48:49] (03CR) 10Scott French: [C:03+2] hieradata: reenable etcd replication on conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/1023557 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [17:51:12] (03PS3) 10Scott French: hieradata: return etcd in eqiad to read-write [puppet] - 10https://gerrit.wikimedia.org/r/1023967 (https://phabricator.wikimedia.org/T358636) [17:54:00] (03CR) 10Scott French: [C:03+2] hieradata: return etcd in eqiad to read-write [puppet] - 10https://gerrit.wikimedia.org/r/1023967 (https://phabricator.wikimedia.org/T358636) (owner: 10Scott French) [17:54:50] !log putting etcd back in read-write mode for T358636 [17:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:58] T358636: etcdmirror does not recover from a cleared waitIndex - https://phabricator.wikimedia.org/T358636 [17:56:48] !log swfrench@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: etcd replication maintenance - T358636 (duration: 55m 11s) [17:57:28] !log xcollazo@deploy1002 Started deploy [analytics/refinery@4836095]: Regular analytics weekly train 
[analytics/refinery@4836095f] [17:59:19] etcd maintenance is done, FYI oncallers jhathaway, herron [18:00:05] jnuche and brennen: #bothumor I ♥ Unicode. All rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T1800). [18:00:52] nicely done swfrench-wmf! [18:02:10] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "running manually for cp7013 - sukhe@cumin1002" [18:02:25] (SystemdUnitFailed) firing: (2) docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:02:28] (03PS3) 10Andrea Denisse: grafana: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1025820 (https://phabricator.wikimedia.org/T360414) [18:03:35] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host dns7001.wikimedia.org with OS bookworm [18:03:41] thanks, sukhe - just down to the wire :) [18:03:45] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "running manually for cp7013 - sukhe@cumin1002" [18:03:46] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9759001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm [18:04:39] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7014.magru.wmnet with OS bullseye [18:04:49] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759003 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for
host cp7014.magru.wmnet with OS bullseye [18:06:12] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7015.magru.wmnet with OS bullseye [18:06:19] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759004 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7015.magru.wmnet with OS bullseye [18:06:47] (03CR) 10Dzahn: [C:03+1] "I suggested this as a way to test changes only on the "grafana-next" service without touching the "grafana" services. What the exact defin" [puppet] - 10https://gerrit.wikimedia.org/r/1025820 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:08:28] o/ [18:08:47] nothing for this window, so far's i'm aware. [18:09:02] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:09:32] !log running cookbook -d sre.dns.netbox "test" [18:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:41] (ConfdResourceFailed) firing: (28) confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:11:29] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7015.magru.wmnet with OS bullseye [18:11:39] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dns7001.wikimedia.org with OS bookworm [18:11:41] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7015.magru.wmnet with OS bullseye executed 
with errors: - cp7015 (**F... [18:11:43] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9759006 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm exe... [18:11:45] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7014.magru.wmnet with OS bullseye [18:11:51] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759007 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7014.magru.wmnet with OS bullseye executed with errors: - cp7014 (**F... [18:12:37] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7014.magru.wmnet with OS bullseye [18:12:48] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7014.magru.wmnet with OS bullseye [18:13:32] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [18:13:32] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [18:13:32] PROBLEM - HAProxy HTTPS wikiworkshop.org RSA on cp7003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [18:13:36] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [18:13:44] !log xcollazo@deploy1002 Finished deploy [analytics/refinery@4836095]: Regular analytics weekly train [analytics/refinery@4836095f] (duration: 16m 16s) 
[18:13:49] (03PS2) 10Dzahn: deployment_server: stop including redis::client::python [puppet] - 10https://gerrit.wikimedia.org/r/1024447 (https://phabricator.wikimedia.org/T363415) [18:14:10] !log xcollazo@deploy1002 Started deploy [analytics/refinery@4836095] (thin): Regular analytics weekly train THIN [analytics/refinery@4836095f] [18:14:56] (03CR) 10Bking: [C:03+1] sre.wdqs.restart-nginx: Also restart Envoy alongside [cookbooks] - 10https://gerrit.wikimedia.org/r/1023863 (owner: 10Muehlenhoff) [18:15:07] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2203/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025820 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:15:40] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge pending changes - sukhe@cumin1002" [18:16:28] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merge pending changes - sukhe@cumin1002" [18:16:28] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:17:20] !log aokoth@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host lists1004 [18:17:52] (03CR) 10Bking: global_config: add elasticearch instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [18:18:07] !log xcollazo@deploy1002 Finished deploy [analytics/refinery@4836095] (thin): Regular analytics weekly train THIN [analytics/refinery@4836095f] (duration: 03m 57s) [18:18:53] !log xcollazo@deploy1002 Started deploy [analytics/refinery@4836095] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@4836095f] [18:18:58] !log aokoth@cumin1002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host lists1004 [18:19:50] jouncebot: nowandnext [18:19:50] For the next 1 hour(s) and 40 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T1800) [18:19:51] In 1 hour(s) and 40 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T2000) [18:20:25] (03CR) 10Dzahn: [C:03+2] deployment_server: stop including redis::client::python [puppet] - 10https://gerrit.wikimedia.org/r/1024447 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [18:20:37] (03CR) 10Bking: [C:03+1] updateQueryServiceLag: tune the min query rate of a pooled server [puppet] - 10https://gerrit.wikimedia.org/r/1014584 (https://phabricator.wikimedia.org/T360993) (owner: 10DCausse) [18:20:42] jouncebot: now [18:20:43] For the next 1 hour(s) and 39 minute(s): MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T1800) [18:20:54] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1025820 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:21:45] !log xcollazo@deploy1002 Finished deploy [analytics/refinery@4836095] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@4836095f] (duration: 02m 52s) [18:22:27] (03CR) 10Dzahn: [C:03+2] "re: this and the previous patch: I am not manually removing packages from existing deployment servers, but it will unblock new deployment " [puppet] - 10https://gerrit.wikimedia.org/r/1024447 (https://phabricator.wikimedia.org/T363415) (owner: 10Dzahn) [18:24:02] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes 
https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:24:08] (03CR) 10Bking: [C:03+1] rdf-streaming-updater: increase s3 socket-timeout to 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020263 (https://phabricator.wikimedia.org/T362508) (owner: 10DCausse) [18:26:35] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp7014.magru.wmnet with OS bullseye [18:26:46] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759081 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7014.magru.wmnet with OS bullseye executed with errors: - cp7014 (**F... [18:30:31] !log aokoth@cumin1002 START - Cookbook sre.hosts.provision for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED [18:31:00] !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns7001.magru.wmnet'] [18:31:23] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns7001.magru.wmnet'] [18:31:29] !log sukhe@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns7001.magru.wmnet'] [18:31:37] !log sukhe@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dns7001.magru.wmnet'] [18:32:50] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7003.magru.wmnet with OS bullseye [18:33:01] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759104 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7003.magru.wmnet with OS bullseye [18:36:01] (03PS1) 10BCornwall: Set ncredir100X to use nginx variant "custom" [puppet] - 10https://gerrit.wikimedia.org/r/1025838 
(https://phabricator.wikimedia.org/T357976) [18:37:33] (03CR) 10Ssingh: [C:03+2] wmnet: add esams services under the right origin [dns] - 10https://gerrit.wikimedia.org/r/1025800 (owner: 10Ssingh) [18:37:51] (03PS2) 10Ssingh: wmnet: add esams services under the right origin [dns] - 10https://gerrit.wikimedia.org/r/1025800 [18:38:12] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp7014'] [18:38:30] (03CR) 10Ssingh: [C:03+1] Set ncredir100X to use nginx variant "custom" [puppet] - 10https://gerrit.wikimedia.org/r/1025838 (https://phabricator.wikimedia.org/T357976) (owner: 10BCornwall) [18:38:34] (03CR) 10Dzahn: [C:04-1] "after talking about this some more I think you only need 2 discovery names. You have 4 public DNS names but 2 of them should always move b" [puppet] - 10https://gerrit.wikimedia.org/r/1025820 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [18:38:48] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp7014'] [18:40:01] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759134 (10ssingh) [18:40:04] (03CR) 10Ssingh: [V:03+2 C:03+2] wmnet: add esams services under the right origin [dns] - 10https://gerrit.wikimedia.org/r/1025800 (owner: 10Ssingh) [18:40:19] !log running authdns-update for CR 1025800 [18:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:48] (03CR) 10BCornwall: [C:03+2] Set ncredir100X to use nginx variant "custom" [puppet] - 10https://gerrit.wikimedia.org/r/1025838 (https://phabricator.wikimedia.org/T357976) (owner: 10BCornwall) [18:41:48] !log sukhe@cumin1002 START - Cookbook sre.dns.wipe-cache _etcd._tcp.esams.wmnet on all recursors [18:41:51] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) _etcd._tcp.esams.wmnet on all recursors [18:42:01] 06SRE, 
10SRE-Access-Requests: Requesting access to deployment shell access for Jsn.sherman - https://phabricator.wikimedia.org/T363377#9759136 (10jsn.sherman) >>! In T363377#9758296, @Eevans wrote: > @jsn.sherman Could you please do one of the following? Either: > > - Edit your [[ https://www.mediawiki.org/wi... [18:45:06] (03CR) 10Bking: "Sorry, should've posted this directly in my last comment." [puppet] - 10https://gerrit.wikimedia.org/r/1024613 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [18:45:53] (03CR) 10Herron: [C:03+2] alertmanager: irc: clarify count and move firing to beginning [puppet] - 10https://gerrit.wikimedia.org/r/1019840 (https://phabricator.wikimedia.org/T362239) (owner: 10Herron) [18:46:16] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7007.magru.wmnet with OS bullseye [18:46:17] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7006.magru.wmnet with OS bullseye [18:46:28] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7007.magru.wmnet with OS bullseye [18:46:28] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759143 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7006.magru.wmnet with OS bullseye [18:47:53] !log sudo cumin -b1 -s10 'C:confd and *.esams.wmnet' 'systemctl restart confd' [18:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:10] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7008.magru.wmnet with OS bullseye [18:48:20] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759145 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7008.magru.wmnet with OS bullseye [18:49:29] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host dns7002.wikimedia.org with OS bookworm [18:49:41] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9759148 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7002.wikimedia.org with OS bookworm [18:50:25] (03Abandoned) 10Bking: flink-kubernetes-operator: restart failed jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1017115 (https://phabricator.wikimedia.org/T361870) (owner: 10Bking) [18:52:41] 7002 still looks so odd :) [18:52:52] mutante: what about 10002 [18:52:53] : [18:52:53] ) [18:52:59] omg [18:53:02] (03CR) 10Ryan Kemper: [C:03+1] sre.wdqs.restart-nginx: Also restart Envoy alongside [cookbooks] - 10https://gerrit.wikimedia.org/r/1023863 (owner: 10Muehlenhoff) [18:53:34] (03CR) 10Bking: [C:03+2] rdf-streaming-updater: increase s3 socket-timeout to 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020263 (https://phabricator.wikimedia.org/T362508) (owner: 10DCausse) [18:53:36] !log updated alertmanager IRC alert text format. for details please see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1019840 [18:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:13] sukhe: I wanted to say it needs tbe added to "typos" but it already is, wow [18:54:25] ha! [18:54:27] (? 
:) [18:54:32] (03Merged) 10jenkins-bot: rdf-streaming-updater: increase s3 socket-timeout to 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020263 (https://phabricator.wikimedia.org/T362508) (owner: 10DCausse) [18:55:19] but yea, 2 more POPs and then "starts with 1 means eqiad" will be false :p [18:55:53] !log bking@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [18:56:41] (03CR) 10BCornwall: [C:03+2] hiera:magru: adding magru dc to authorized ncredir regex [puppet] - 10https://gerrit.wikimedia.org/r/1025784 (https://phabricator.wikimedia.org/T362729) (owner: 10Fabfur) [18:56:58] !log sukhe@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host dns7001 [18:57:01] !log sukhe@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns7001 [18:58:39] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374#9759172 (10Arnoldokoth) 05Resolved→03Open [18:58:53] FIRING: [4x] JobUnavailable: Reduced availability for job ncredir in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:59:10] !log bking@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [19:01:28] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374#9759181 (10Arnoldokoth) Hi, I was going through the steps listed here https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging to re-image lists1004 t... 
[19:01:41] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7003.magru.wmnet with reason: host reimage [19:04:02] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7003.magru.wmnet with reason: host reimage [19:05:45] 06SRE, 10SRE-Access-Requests: Requesting access to deployment shell access for Jsn.sherman - https://phabricator.wikimedia.org/T363377#9759200 (10Eevans) [19:08:25] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host lists1004.mgmt.eqiad.wmnet with reboot policy FORCED [19:10:49] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374#9759264 (10Arnoldokoth) Never mind. Got some insight from @eoghan [19:10:52] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install lists1004.eqiad.wmnet - https://phabricator.wikimedia.org/T342374#9759265 (10Arnoldokoth) 05Open→03Resolved [19:12:26] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7007.magru.wmnet with reason: host reimage [19:14:34] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7006.magru.wmnet with reason: host reimage [19:14:38] herron: is there intentionally two spaces instead of a single one after the 'FIRING:' text [19:14:41] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7007.magru.wmnet with reason: host reimage [19:15:25] !log aokoth@cumin1002 START - Cookbook sre.hosts.reimage for host lists1004.wikimedia.org with OS bookworm [19:15:34] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9759271 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aokoth@cumin1002 for host lists1004.wikimedia.org with OS bookworm [19:15:53] !log fabfur@cumin1002 START - Cookbook 
sre.hosts.downtime for 2:00:00 on cp7008.magru.wmnet with reason: host reimage [19:16:28] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [19:16:50] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7006.magru.wmnet with reason: host reimage [19:18:08] (03PS1) 10BCornwall: Revert "Set ncredir100X to use nginx variant "custom"" [puppet] - 10https://gerrit.wikimedia.org/r/1025762 [19:18:30] (03CR) 10CI reject: [V:04-1] Revert "Set ncredir100X to use nginx variant "custom"" [puppet] - 10https://gerrit.wikimedia.org/r/1025762 (owner: 10BCornwall) [19:19:40] (03PS2) 10BCornwall: Revert "Set ncredir100X to use nginx variant "custom"" [puppet] - 10https://gerrit.wikimedia.org/r/1025762 [19:19:44] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7008.magru.wmnet with reason: host reimage [19:20:00] (03CR) 10CI reject: [V:04-1] Revert "Set ncredir100X to use nginx variant "custom"" [puppet] - 10https://gerrit.wikimedia.org/r/1025762 (owner: 10BCornwall) [19:21:12] (03PS3) 10BCornwall: Revert "Set ncredir100X to use nginx variant "custom"" [puppet] - 10https://gerrit.wikimedia.org/r/1025762 [19:21:35] (03CR) 10BCornwall: [C:03+2] Revert "Set ncredir100X to use nginx variant "custom"" [puppet] - 10https://gerrit.wikimedia.org/r/1025762 (owner: 10BCornwall) [19:22:32] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir1001.eqiad.wmnet with OS bookworm [19:22:33] (03CR) 10Jdlrobson: Deploy a11y settings to testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025819 (https://phabricator.wikimedia.org/T362147) (owner: 10Kimberly Sarabia) [19:22:38] taavi: 🙃 [19:22:50] I think I see where its coming from, although I don't hate it [19:24:09] (03PS1) 
10BCornwall: Revert "Revert "Set ncredir100X to use nginx variant "custom""" [puppet] - 10https://gerrit.wikimedia.org/r/1025763 [19:26:05] (03CR) 10Ssingh: [C:03+1] Revert "Revert "Set ncredir100X to use nginx variant "custom""" [puppet] - 10https://gerrit.wikimedia.org/r/1025763 (owner: 10BCornwall) [19:27:14] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns7002.wikimedia.org with OS bookworm [19:27:22] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9759320 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7002.wikimedia.org with OS bookworm exe... [19:28:19] (03PS1) 10Bking: Revert "rdf-streaming-updater: increase s3 socket-timeout to 30s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025764 [19:28:37] (03CR) 10Bking: [V:03+2 C:03+2] Revert "rdf-streaming-updater: increase s3 socket-timeout to 30s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025764 (owner: 10Bking) [19:29:08] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7003.magru.wmnet with OS bullseye [19:29:14] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759324 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7003.magru.wmnet with OS bullseye completed: - cp7003 (**PASS**) -... 
[19:30:05] (03PS1) 10Andrew Bogott: cloudweb2002-dev: override ldap: hiera to point to the codfw1dev fork [puppet] - 10https://gerrit.wikimedia.org/r/1025844 [19:31:32] herron: I think I do :( [19:31:49] (03CR) 10Andrew Bogott: [C:03+2] cloudweb2002-dev: override ldap: hiera to point to the codfw1dev fork [puppet] - 10https://gerrit.wikimedia.org/r/1025844 (owner: 10Andrew Bogott) [19:32:07] !log sudo ipmitool -I lanplus -H "dns7002.mgmt.magru.wmnet" -U root -E chassis power cycle [19:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:38] taavi haha ok I'll upload a patch. I'm torn too because at the same time : is used as a delimiter twice and the extra space kinda helps with that [19:34:16] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759354 (10ssingh) [19:38:04] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host dns7002.wikimedia.org with OS bookworm [19:38:09] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9759364 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7002.wikimedia.org with OS bookworm [19:40:00] !log fabfur@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [19:40:24] (03PS1) 10Herron: alertmanager: irc: remove second space [puppet] - 10https://gerrit.wikimedia.org/r/1025845 (https://phabricator.wikimedia.org/T362239) [19:41:17] (03PS1) 10Andrew Bogott: openstack::horizon::config: notify docker service instead of Apache [puppet] - 10https://gerrit.wikimedia.org/r/1025846 [19:41:39] !log fabfur@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host 
reimage - fabfur@cumin1002" [19:41:41] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7007.magru.wmnet with OS bullseye [19:41:49] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759378 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7007.magru.wmnet with OS bullseye completed: - cp7007 (**PASS**) -... [19:43:03] (03CR) 10Herron: [C:03+2] "stray space or happy accident? I'm torn!" [puppet] - 10https://gerrit.wikimedia.org/r/1025845 (https://phabricator.wikimedia.org/T362239) (owner: 10Herron) [19:43:21] (03CR) 10Andrew Bogott: [C:03+2] openstack::horizon::config: notify docker service instead of Apache [puppet] - 10https://gerrit.wikimedia.org/r/1025846 (owner: 10Andrew Bogott) [19:44:03] !log fabfur@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [19:44:56] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7016.magru.wmnet with OS bullseye [19:45:00] !log fabfur@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [19:45:01] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7006.magru.wmnet with OS bullseye [19:45:02] !log fabfur@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [19:45:03] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759383 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7016.magru.wmnet with OS bullseye [19:45:07] 10ops-magru, 06DC-Ops, 
06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759384 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7006.magru.wmnet with OS bullseye completed: - cp7006 (**PASS**) -... [19:46:37] !log fabfur@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [19:46:38] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7008.magru.wmnet with OS bullseye [19:46:46] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759385 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7008.magru.wmnet with OS bullseye completed: - cp7008 (**PASS**) -... [19:52:31] hmm seems like something deeper causing it, I'll have to follow up on that later [19:53:38] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir1001.eqiad.wmnet with OS bookworm [19:54:23] (03CR) 10Brouberol: [C:03+1] Setup kubeconfigs for datahub-next on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1025792 (https://phabricator.wikimedia.org/T363832) (owner: 10Stevemunene) [19:55:14] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir1001.eqiad.wmnet with OS bookworm [19:58:11] (03PS1) 10Eevans: Add new user jsherman (deployment group) [puppet] - 10https://gerrit.wikimedia.org/r/1025847 (https://phabricator.wikimedia.org/T363377) [19:59:59] !log sukhe@cumin1002 START - Cookbook sre.hosts.reimage for host cp7015.magru.wmnet with OS bullseye [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240430T2000). [20:00:05] kimberly_sarabia: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:13] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759401 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host cp7015.magru.wmnet with OS bullseye [20:00:33] hi - i can deploy in about 30 mins if no one else is around [20:00:45] (03PS2) 10Kimberly Sarabia: Deploy a11y settings to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025819 (https://phabricator.wikimedia.org/T362147) [20:01:18] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lists1004.wikimedia.org with OS bookworm [20:01:26] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9759402 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aokoth@cumin1002 for host lists1004.wikimedia.org with OS bookworm executed wi... 
[20:02:09] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759403 (10Fabfur) [20:02:36] (03CR) 10CI reject: [V:04-1] Deploy a11y settings to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025819 (https://phabricator.wikimedia.org/T362147) (owner: 10Kimberly Sarabia) [20:03:43] !log aokoth@cumin1002 START - Cookbook sre.hosts.reimage for host lists1004.wikimedia.org with OS bookworm [20:03:52] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9759410 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aokoth@cumin1002 for host lists1004.wikimedia.org with OS bookworm [20:04:00] (03PS3) 10Kimberly Sarabia: Deploy a11y settings to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025819 (https://phabricator.wikimedia.org/T362147) [20:05:26] (03CR) 10Kimberly Sarabia: Deploy a11y settings to testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025819 (https://phabricator.wikimedia.org/T362147) (owner: 10Kimberly Sarabia) [20:07:57] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir1001.eqiad.wmnet with reason: host reimage [20:08:26] Hello [20:08:34] (03CR) 10Jdlrobson: [C:03+1] Deploy a11y settings to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025819 (https://phabricator.wikimedia.org/T362147) (owner: 10Kimberly Sarabia) [20:10:20] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir1001.eqiad.wmnet with reason: host reimage [20:10:50] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7016.magru.wmnet with reason: host reimage [20:11:08] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns7002.wikimedia.org with reason: host reimage [20:13:15] !log sukhe@cumin1002 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7016.magru.wmnet with reason: host reimage [20:15:37] hi kimberly_sarabia - my mtg finished early - i can deploy! [20:15:53] cjming: Thank you [20:16:10] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns7002.wikimedia.org with reason: host reimage [20:16:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025819 (https://phabricator.wikimedia.org/T362147) (owner: 10Kimberly Sarabia) [20:17:05] (03Merged) 10jenkins-bot: Deploy a11y settings to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1025819 (https://phabricator.wikimedia.org/T362147) (owner: 10Kimberly Sarabia) [20:17:34] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1025819|Deploy a11y settings to testwiki (T362147)]] [20:17:38] T362147: Deploy reading accessibility settings menu and new typography defaults to first set of wikis - https://phabricator.wikimedia.org/T362147 [20:18:04] PROBLEM - Recursive DNS on 195.200.68.37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:18:12] ^ expected [20:18:40] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lists1004.wikimedia.org with reason: host reimage [20:19:56] (03PS1) 10Bking: rdf-streaming-updater: increase s3 socket-timeout to 30s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025850 (https://phabricator.wikimedia.org/T362508) [20:20:21] !log cjming@deploy1002 ksarabia and cjming: Backport for [[gerrit:1025819|Deploy a11y settings to testwiki (T362147)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:20:47] kimberly_sarabia: can you test? 
[20:21:05] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lists1004.wikimedia.org with reason: host reimage [20:21:18] cjming: one moment [20:23:25] np - take ur time [20:26:40] cjming: LGTM [20:26:49] cool - syncing [20:26:53] !log cjming@deploy1002 ksarabia and cjming: Continuing with sync [20:27:00] PROBLEM - Recursive DNS on 2a02:ec80:700:2:195:200:68:37 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:27:48] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7015.magru.wmnet with reason: host reimage [20:28:51] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir1001.eqiad.wmnet with OS bookworm [20:30:41] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncredir1002.eqiad.wmnet with OS bookworm [20:31:31] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7015.magru.wmnet with reason: host reimage [20:31:43] PROBLEM - Check whether ferm is active by checking the default input chain on mw1370 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:32:39] PROBLEM - Check whether ferm is active by checking the default input chain on mw1467 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:32:41] PROBLEM - Check whether ferm is active by checking the default input chain on mw1463 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:36:27] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [20:36:48] (03CR) 10Bking: 
[C:03+2] updateQueryServiceLag: tune the min query rate of a pooled server [puppet] - 10https://gerrit.wikimedia.org/r/1014584 (https://phabricator.wikimedia.org/T360993) (owner: 10DCausse) [20:37:07] (03CR) 10Bking: [C:03+2] "Done" [puppet] - 10https://gerrit.wikimedia.org/r/1014584 (https://phabricator.wikimedia.org/T360993) (owner: 10DCausse) [20:37:28] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [20:37:29] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7016.magru.wmnet with OS bullseye [20:37:39] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759523 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7016.magru.wmnet with OS bullseye completed: - cp7016 (**PASS**) -... [20:38:18] !log aokoth@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aokoth@cumin1002" [20:38:19] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759535 (10ssingh) [20:38:35] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1025819|Deploy a11y settings to testwiki (T362147)]] (duration: 21m 00s) [20:38:39] T362147: Deploy reading accessibility settings menu and new typography defaults to first set of wikis - https://phabricator.wikimedia.org/T362147 [20:38:51] kimberly_sarabia: should be live! 
[20:38:53] FIRING: [5x] JobUnavailable: Reduced availability for job lvs_realserver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:39:02] hmmmm [20:39:12] cjming: Thank you! [20:39:19] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1004.eqiad.wmnet with OS bullseye [20:39:24] yw! [20:39:56] !log end of UTC late backport window [20:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:08] !log aokoth@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - aokoth@cumin1002" [20:40:09] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lists1004.wikimedia.org with OS bookworm [20:40:14] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Reimage physical lists hosts to have public IPs - https://phabricator.wikimedia.org/T363572#9759540 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aokoth@cumin1002 for host lists1004.wikimedia.org with OS bookworm completed:... 
[20:43:24] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir1002.eqiad.wmnet with reason: host reimage [20:46:10] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir1002.eqiad.wmnet with reason: host reimage [20:47:16] (03CR) 10Btullis: cephadm: new modules, profile, roles for cephadm-based Ceph clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [20:47:50] (03Abandoned) 10Btullis: Add docker engine to the ceph::cephadm role [puppet] - 10https://gerrit.wikimedia.org/r/1024703 (https://phabricator.wikimedia.org/T363558) (owner: 10Btullis) [20:48:06] (03Abandoned) 10Btullis: Start switching cephosd servers to cephadm management [puppet] - 10https://gerrit.wikimedia.org/r/1024702 (https://phabricator.wikimedia.org/T363558) (owner: 10Btullis) [20:48:55] (03PS4) 10Andrea Denisse: grafana: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1025820 (https://phabricator.wikimedia.org/T360414) [20:49:54] (03CR) 10Andrea Denisse: "Thanks, I've sent a new patch." [puppet] - 10https://gerrit.wikimedia.org/r/1025820 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [20:49:59] !log sukhe@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host dns7002.wikimedia.org with OS bookworm [20:50:08] 10ops-magru, 06DC-Ops, 06Infrastructure-Foundations, 10netops, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9759563 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7002.wikimedia.org with OS bookworm exe... 
[20:50:45] !log sukhe@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dns7002.wikimedia.org with reason: reimaged again [20:50:47] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dns7002.wikimedia.org with reason: reimaged again [20:51:00] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts cephadm1001.eqiad.wmnet [20:51:27] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1025820 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [20:52:09] (03CR) 10Andrea Denisse: [V:03+1] "PCC results: https://puppet-compiler.wmflabs.org/output/1025820/2206/" [puppet] - 10https://gerrit.wikimedia.org/r/1025820 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [20:54:44] (03CR) 10Btullis: [C:03+1] Setup kubeconfigs for datahub-next on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1025792 (https://phabricator.wikimedia.org/T363832) (owner: 10Stevemunene) [20:54:50] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [20:55:18] (03Abandoned) 10Btullis: Add a second copy of the bootstrap-osd keyring to cephosd [puppet] - 10https://gerrit.wikimedia.org/r/941011 (https://phabricator.wikimedia.org/T330151) (owner: 10Btullis) [20:55:40] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1025775 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [20:55:53] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - sukhe@cumin1002" [20:55:53] !log sukhe@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7015.magru.wmnet with OS bullseye [20:56:01] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759594 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host cp7015.magru.wmnet with OS bullseye completed: - cp7015 (**PASS**) -... [20:56:08] (03CR) 10Btullis: [C:03+1] Remove obsolete stub cert [labs/private] - 10https://gerrit.wikimedia.org/r/1025777 (https://phabricator.wikimedia.org/T360439) (owner: 10Muehlenhoff) [20:56:11] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [20:56:34] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759596 (10ssingh) [20:58:48] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1004.eqiad.wmnet with reason: host reimage [20:58:50] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1024541 (owner: 10Muehlenhoff) [20:58:53] FIRING: [5x] JobUnavailable: Reduced availability for job lvs_realserver in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:59:41] (03CR) 10Dzahn: [C:04-1] "The uid is just "jsn" in LDAP:" [puppet] - 10https://gerrit.wikimedia.org/r/1025847 (https://phabricator.wikimedia.org/T363377) (owner: 10Eevans) [20:59:45] (03CR) 10Btullis: [C:03+1] "Thanks." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1025679 (owner: 10Muehlenhoff) [21:00:08] (03CR) 10Bking: [C:03+2] Switch elasticsearch::cirrus tlsproxy to pki [puppet] - 10https://gerrit.wikimedia.org/r/1023813 (https://phabricator.wikimedia.org/T360439) (owner: 10Btullis) [21:00:11] 10ops-magru, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759599 (10ssingh) [21:00:22] (03PS3) 10Andrea Denisse: trafficserver: Add discovery entries for grafana and grafana-next [puppet] - 10https://gerrit.wikimedia.org/r/1024808 (https://phabricator.wikimedia.org/T356386) [21:00:32] (03CR) 10Btullis: [C:03+1] druid::broker: Switch analytics workers to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1024410 (owner: 10Muehlenhoff) [21:00:46] (03CR) 10Btullis: [C:03+1] druid::broker: Switch public workers to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1024409 (owner: 10Muehlenhoff) [21:01:43] RECOVERY - Check whether ferm is active by checking the default input chain on mw1370 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:02:39] RECOVERY - Check whether ferm is active by checking the default input chain on mw1467 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:02:45] RECOVERY - Check whether ferm is active by checking the default input chain on mw1463 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:03:01] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1004.eqiad.wmnet with reason: host reimage [21:03:38] (03CR) 10Dzahn: [C:03+1] "lgtm. I can confirm grafana1002 is behind grafana.discovery and grafana2001 is behind grafana-next.discovery. 
And the "regular" and "rw" s" [puppet] - 10https://gerrit.wikimedia.org/r/1024808 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [21:04:39] (03PS3) 10Btullis: Remove the piwik role and profile [puppet] - 10https://gerrit.wikimedia.org/r/1021893 (https://phabricator.wikimedia.org/T349397) [21:05:09] PROBLEM - Elasticsearch HTTPS for production-search-omega-codfw on elastic2100 is CRITICAL: SSL CRITICAL - Certificate elastic2100.codfw.wmnet valid until 2024-05-28 20:58:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [21:05:13] PROBLEM - Elasticsearch HTTPS for production-search-codfw on elastic2100 is CRITICAL: SSL CRITICAL - Certificate elastic2100.codfw.wmnet valid until 2024-05-28 20:58:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [21:05:55] (03PS2) 10Btullis: Update prometheus config to reflect matomo profile change [puppet] - 10https://gerrit.wikimedia.org/r/1021892 (https://phabricator.wikimedia.org/T349397) [21:06:15] (03CR) 10CI reject: [V:04-1] Update prometheus config to reflect matomo profile change [puppet] - 10https://gerrit.wikimedia.org/r/1021892 (https://phabricator.wikimedia.org/T349397) (owner: 10Btullis) [21:06:23] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir1002.eqiad.wmnet with OS bookworm [21:08:33] (03CR) 10BCornwall: [C:03+2] Revert "Revert "Set ncredir100X to use nginx variant "custom""" [puppet] - 10https://gerrit.wikimedia.org/r/1025763 (owner: 10BCornwall) [21:09:54] (03CR) 10Dzahn: "oh wait, you can only merge this after the discovery names have been added to the TLS certs though" [puppet] - 10https://gerrit.wikimedia.org/r/1024808 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [21:11:48] (03CR) 10Dzahn: [C:03+1] grafana: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1025820 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) 
[21:12:23] (03PS6) 10Bking: elasticsearch: Configure alerts for short-lived certs [puppet] - 10https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439) [21:12:40] (03CR) 10Andrea Denisse: [V:03+1 C:03+2] grafana: Ensure TLS certificates are provided by CFSSL [puppet] - 10https://gerrit.wikimedia.org/r/1025820 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [21:12:53] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439) (owner: 10Bking) [21:12:58] 10ops-codfw, 10ops-eqiad, 06SRE: Decommission Arelion's eqiad-codfw 10G link - https://phabricator.wikimedia.org/T353424#9759627 (10RobH) 05Stalled→03Resolved All disconnects at both codfw/eqiad are complete. Disconnected the cables in netbox as they are no longer connected (xconnects gone.) Delete... [21:15:21] (03CR) 10Dzahn: [C:03+2] "https://puppet-compiler.wmflabs.org/output/1024812/2207/phab1004.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1024812 (owner: 10Dzahn) [21:16:51] (03CR) 10Eevans: "I'm glad you brought this up. Changing it seemed...wrong, but the docs say: [Some users have aliases for their nickname e.g., don't use t" [puppet] - 10https://gerrit.wikimedia.org/r/1025847 (https://phabricator.wikimedia.org/T363377) (owner: 10Eevans) [21:17:22] (03PS7) 10Bking: elasticsearch: Configure alerts for short-lived certs [puppet] - 10https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439) [21:18:03] (03CR) 10Btullis: [C:03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439) (owner: 10Bking) [21:18:05] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439) (owner: 10Bking) [21:18:57] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1004.eqiad.wmnet with OS bullseye [21:23:34] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9759648 (10Eevans) >>! In T362033#9758428, @Volans wrote: > Maybe a little drastic option, but could we try to reimage one of those 2 server and wait few days? > That will surely wipe clean any manual proced... [21:25:35] 10ops-eqiad, 06SRE, 10Cassandra: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9759661 (10Eevans) The rebuild is complete: `lang=sh-session eevans@aqs1014:~$ sudo mdadm --detail /dev/md2 /dev/md2: Version : 1.2 Creation Time : Tue Mar 9 14:18:06 2021 Raid Leve... 
[21:25:42] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439) (owner: 10Bking) [21:29:31] (03CR) 10Bking: [C:03+2] "pcc workers are broken ATM...in the interest of time, I'm going to go ahead and +2/merge" [puppet] - 10https://gerrit.wikimedia.org/r/1024481 (https://phabricator.wikimedia.org/T360439) (owner: 10Bking) [21:30:05] (03PS1) 10Andrea Denisse: grafana: Add the .wikimedia.org domain is in CFSSL options [puppet] - 10https://gerrit.wikimedia.org/r/1025856 (https://phabricator.wikimedia.org/T360414) [21:30:57] (03PS2) 10Andrea Denisse: grafana: Add the .wikimedia.org domain to the CFSSL options [puppet] - 10https://gerrit.wikimedia.org/r/1025856 (https://phabricator.wikimedia.org/T360414) [21:31:06] (03CR) 10Btullis: [V:03+1 C:03+2] Test a fix for the bootstrapping of mon daemons on cephosd* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1025724 (https://phabricator.wikimedia.org/T332987) (owner: 10Btullis) [21:32:05] (03CR) 10Dzahn: [C:03+1] "yea, worth trying to be sure." 
[puppet] - 10https://gerrit.wikimedia.org/r/1025856 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [21:33:25] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1025856 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [21:33:55] !log grafana2001 - sudo -u loki /usr/bin/loki -config.file=/etc/loki/loki-local-config.yaml in an attempt to debug issue on grafana-next.wikimedia.org [21:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:18] (03CR) 10Andrea Denisse: [V:03+1 C:03+2] grafana: Add the .wikimedia.org domain to the CFSSL options [puppet] - 10https://gerrit.wikimedia.org/r/1025856 (https://phabricator.wikimedia.org/T360414) (owner: 10Andrea Denisse) [21:35:01] RECOVERY - Elasticsearch HTTPS for relforge-eqiad-small-alpha on relforge1004 is OK: SSL OK - Certificate relforge1004.eqiad.wmnet valid until 2024-05-23 13:34:00 +0000 (expires in 22 days) https://wikitech.wikimedia.org/wiki/Search [21:35:15] RECOVERY - Elasticsearch HTTPS for relforge-eqiad on relforge1003 is OK: SSL OK - Certificate relforge1003.eqiad.wmnet valid until 2024-05-23 13:32:00 +0000 (expires in 22 days) https://wikitech.wikimedia.org/wiki/Search [21:36:28] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:36:33] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:36:33] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2006 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:37:21] RECOVERY - Elasticsearch HTTPS for production-search-codfw on elastic2100 is OK: SSL OK - Certificate elastic2100.codfw.wmnet valid until 2024-05-28 20:58:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [21:37:26] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host cephosd1005.eqiad.wmnet with OS bullseye [21:40:23] RECOVERY - Elasticsearch HTTPS for relforge-eqiad-small-alpha on relforge1003 is OK: SSL OK - Certificate relforge1003.eqiad.wmnet valid until 2024-05-23 13:32:00 +0000 (expires in 22 days) https://wikitech.wikimedia.org/wiki/Search [21:42:13] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1005 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:45:25] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:47:13] RECOVERY - Elasticsearch HTTPS for production-search-omega-codfw on elastic2100 is OK: SSL OK - Certificate elastic2100.codfw.wmnet valid until 2024-05-28 20:58:00 +0000 (expires in 27 days) https://wikitech.wikimedia.org/wiki/Search [21:49:23] RECOVERY - Elasticsearch HTTPS for relforge-eqiad on relforge1004 is OK: SSL OK - Certificate relforge1004.eqiad.wmnet valid until 2024-05-23 13:34:00 +0000 (expires in 22 days) https://wikitech.wikimedia.org/wiki/Search [21:49:49] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cephadm1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [21:50:40] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cephadm1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [21:50:40] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:50:41] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cephadm1001.eqiad.wmnet [21:51:33] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1004 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). 
https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:51:33] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns1006 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:51:33] PROBLEM - Check if ntp.service has been restarted after /etc/ntp.conf was changed on dns2005 is CRITICAL: CRITICAL: Service ntp.service has not been restarted after /etc/ntp.conf was changed (gt 4h). https://wikitech.wikimedia.org/wiki/NTP%23Monitoring [21:51:47] (PS5) JHathaway: cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: MVernon) [21:53:26] (PS1) Andrea Denisse: grafana: Ensure TLS certificates are provided by CFSSL [puppet] - https://gerrit.wikimedia.org/r/1025860 (https://phabricator.wikimedia.org/T360414) [21:54:54] (CR) CI reject: [V:-1] cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: MVernon) [21:55:33] (CR) Dzahn: "ldap_users_only is for people who get LDAP group membership like "wmf" but don't have shell access. 
in this case it's correct that you mov" [puppet] - https://gerrit.wikimedia.org/r/1025847 (https://phabricator.wikimedia.org/T363377) (owner: Eevans) [21:55:36] (CR) Andrea Denisse: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2210/co" [puppet] - https://gerrit.wikimedia.org/r/1025860 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse) [21:56:33] (CR) Andrea Denisse: [V:+1] "PCC results: https://puppet-compiler.wmflabs.org/output/1025860/2210/" [puppet] - https://gerrit.wikimedia.org/r/1025860 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse) [21:56:45] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1005.eqiad.wmnet with reason: host reimage [21:57:06] (CR) Btullis: cephadm: new modules, profile, roles for cephadm-based Ceph clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: MVernon) [21:58:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [21:58:57] (CR) Dzahn: [C:+2] "no change:" [puppet] - https://gerrit.wikimedia.org/r/1024812 (owner: Dzahn) [21:59:01] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:00:52] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 7.627 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:00:56] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51784 bytes in 3.399 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:01:34] (CR) JHathaway: "looks good overall, I thought it might be more helpful to push some code, rather that just 
provide suggestions. So I pushed a patch, the m" [puppet] - https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: MVernon) [22:02:25] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:02:26] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1005.eqiad.wmnet with reason: host reimage [22:03:48] (PS6) JHathaway: cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: MVernon) [22:03:58] (CR) JHathaway: cephadm: new modules, profile, roles for cephadm-based Ceph clusters (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: MVernon) [22:04:12] (PS1) Dzahn: grafana: add grafana-next-rw.wikimedia.org to cfssl cert [puppet] - https://gerrit.wikimedia.org/r/1025864 (https://phabricator.wikimedia.org/T360414) [22:04:56] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7014.magru.wmnet with OS bullseye [22:05:02] !log fabfur@cumin1002 START - Cookbook sre.hosts.reimage for host cp7013.magru.wmnet with OS bullseye [22:05:10] ops-magru, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759744 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host cp7014.magru.wmnet with OS bullseye [22:05:11] ops-magru, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759745 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1002 for host 
cp7013.magru.wmnet with OS bullseye [22:05:30] (CR) Andrea Denisse: [C:+2] grafana: add grafana-next-rw.wikimedia.org to cfssl cert [puppet] - https://gerrit.wikimedia.org/r/1025864 (https://phabricator.wikimedia.org/T360414) (owner: Dzahn) [22:06:51] (CR) CI reject: [V:-1] cephadm: new modules, profile, roles for cephadm-based Ceph clusters [puppet] - https://gerrit.wikimedia.org/r/1025297 (https://phabricator.wikimedia.org/T279621) (owner: MVernon) [22:10:41] FIRING: [28x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_text-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:18:16] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1005.eqiad.wmnet with OS bullseye [22:19:58] (PS1) Andrea Denisse: grafana: add grafana-next-rw.discovery.wmnet to cfssl cert [puppet] - https://gerrit.wikimedia.org/r/1025866 (https://phabricator.wikimedia.org/T360414) [22:30:49] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7014.magru.wmnet with reason: host reimage [22:31:28] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [22:32:05] !log fabfur@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp7013.magru.wmnet with reason: host reimage [22:33:18] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7014.magru.wmnet with reason: host reimage [22:35:35] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp7013.magru.wmnet with reason: host reimage [22:38:21] PROBLEM - grafana-next.wikimedia.org on grafana2001 is CRITICAL: HTTP 
CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [22:40:21] RECOVERY - grafana-next.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 130679 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [22:43:21] PROBLEM - grafana-next.wikimedia.org on grafana2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [22:44:21] RECOVERY - grafana-next.wikimedia.org on grafana2001 is OK: HTTP OK: HTTP/1.1 200 OK - 130671 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [22:46:36] (Abandoned) Andrea Denisse: grafana: add grafana-next-rw.discovery.wmnet to cfssl cert [puppet] - https://gerrit.wikimedia.org/r/1025866 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse) [22:56:29] !log fabfur@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [22:58:19] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7013.magru.wmnet with OS bullseye [22:58:22] ops-magru, DC-Ops, Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759844 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7013.magru.wmnet with OS bullseye completed: - cp7013 (**PASS**) - Downtimed on Icinga/A... 
[23:04:08] (PS2) Andrea Denisse: grafana: Ensure TLS certificates are provided by CFSSL [puppet] - https://gerrit.wikimedia.org/r/1025860 (https://phabricator.wikimedia.org/T360414) [23:04:15] !log fabfur@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - fabfur@cumin1002" [23:04:16] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7014.magru.wmnet with OS bullseye [23:04:22] ops-magru, DC-Ops, Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1002 for host cp7014.magru.wmnet with OS bullseye completed: - cp7014 (**PASS**) - Removed from Puppet a... [23:04:59] ops-magru, DC-Ops, Traffic: Q4:rack/setup/install cp70[01-16] - https://phabricator.wikimedia.org/T362729#9759858 (Fabfur) [23:07:28] (CR) Andrea Denisse: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2211/co" [puppet] - https://gerrit.wikimedia.org/r/1025860 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse) [23:07:52] (PS1) Dzahn: create wikipedia-it-arbcom.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1025871 (https://phabricator.wikimedia.org/T363825) [23:08:27] (CR) Andrea Denisse: [V:+1] "We've tested this on grafana2001 first. We're implementing the changes at the role level. 
PCC results https://puppet-compiler.wmflabs.org/" [puppet] - https://gerrit.wikimedia.org/r/1025860 (https://phabricator.wikimedia.org/T360414) (owner: Andrea Denisse) [23:09:02] (CR) Dzahn: [C:+2] deployment server: Run scap clean auto on a weekly basis [puppet] - https://gerrit.wikimedia.org/r/1024479 (https://phabricator.wikimedia.org/T363519) (owner: Ahmon Dancy) [23:38:21] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1024765 [23:38:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1024765 (owner: TrainBranchBot) [23:55:26] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1024765 (owner: TrainBranchBot) [23:56:13] (PS1) BCornwall: ncredir: Reformat/sort the redirects file [puppet] - https://gerrit.wikimedia.org/r/1025875 (https://phabricator.wikimedia.org/T355189) [23:58:25] (PS2) BCornwall: ncredir: Reformat/sort the redirects file [puppet] - https://gerrit.wikimedia.org/r/1025875 (https://phabricator.wikimedia.org/T355189)