[00:03:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P53272 and previous config saved to /var/cache/conftool/dbconfig/20231110-000322-root.json
[00:10:28] SRE-swift-storage: Swift container for archived mariadb tables - https://phabricator.wikimedia.org/T350924 (Ladsgroup)
[00:11:36] SRE-swift-storage: Swift container for archived mariadb tables - https://phabricator.wikimedia.org/T350924 (Ladsgroup)
[00:12:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P53273 and previous config saved to /var/cache/conftool/dbconfig/20231110-001219-arnaudb.json
[00:23:34] (PS1) BryanDavis: Fix BlockDisablesLogin recursion [extensions/OAuth] (wmf/1.42.0-wmf.4) - https://gerrit.wikimedia.org/r/973247 (https://phabricator.wikimedia.org/T350836)
[00:27:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T348183)', diff saved to https://phabricator.wikimedia.org/P53274 and previous config saved to /var/cache/conftool/dbconfig/20231110-002725-arnaudb.json
[00:27:28] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[00:27:30] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[00:27:41] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[00:27:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T348183)', diff saved to https://phabricator.wikimedia.org/P53275 and previous config saved to /var/cache/conftool/dbconfig/20231110-002747-arnaudb.json
[00:31:09] (PS2) BryanDavis: Fix BlockDisablesLogin recursion [extensions/OAuth] (wmf/1.42.0-wmf.4) -
https://gerrit.wikimedia.org/r/973247 (https://phabricator.wikimedia.org/T350836)
[00:31:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T348183)', diff saved to https://phabricator.wikimedia.org/P53276 and previous config saved to /var/cache/conftool/dbconfig/20231110-003141-arnaudb.json
[00:31:57] !log removing 1 file for legal compliance
[00:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:19] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ssingh)
[00:34:45] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ssingh)
[00:37:26] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye
[00:37:31] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye
[00:39:01] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/972523
[00:39:03] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/972523 (owner: TrainBranchBot)
[00:44:42] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1112.eqiad.wmnet with OS bullseye
[00:44:47] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye executed with errors: - cp1112 (**FAIL**...
[00:45:35] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye
[00:45:42] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye
[00:46:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P53277 and previous config saved to /var/cache/conftool/dbconfig/20231110-004647-arnaudb.json
[00:50:10] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1112.eqiad.wmnet with OS bullseye
[00:50:16] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye executed with errors: - cp1112 (**FAIL**...
[00:50:22] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye
[00:50:28] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye
[00:55:36] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/972523 (owner: TrainBranchBot)
[00:58:54] SRE, Infrastructure-Foundations, fundraising-tech-ops, vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (Dwisehaupt) @MoritzMuehlenhoff It doesn't need the full 16G. I was just basing that off of the initial requests/approval when Quim was correspon...
[01:01:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P53278 and previous config saved to /var/cache/conftool/dbconfig/20231110-010154-arnaudb.json
[01:02:56] !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS bullseye
[01:03:02] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye
[01:05:38] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1112.eqiad.wmnet with reason: host reimage
[01:08:44] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1112.eqiad.wmnet with reason: host reimage
[01:10:09] !log sukhe@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1114.eqiad.wmnet with OS bullseye
[01:10:15] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye executed with errors: - cp1114 (**FAIL**...
[01:10:24] !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS bullseye
[01:10:32] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye
[01:13:12] !log SAL test (T343157)
[01:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:16] T343157: Remove Twitter support - https://phabricator.wikimedia.org/T343157
[01:15:48] !log sukhe@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1114.eqiad.wmnet with OS bullseye
[01:15:54] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye executed with errors: - cp1114 (**FAIL**...
[01:15:59] !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS bullseye
[01:16:05] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye
[01:17:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T348183)', diff saved to https://phabricator.wikimedia.org/P53279 and previous config saved to /var/cache/conftool/dbconfig/20231110-011701-arnaudb.json
[01:17:04] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[01:17:06] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[01:17:07] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[01:17:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T348183)', diff saved to https://phabricator.wikimedia.org/P53280 and previous config saved to /var/cache/conftool/dbconfig/20231110-011712-arnaudb.json
[01:20:43] !log sukhe@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1114.eqiad.wmnet with OS bullseye
[01:20:50] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye executed with errors: - cp1114 (**FAIL**...
[01:21:40] !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS bullseye
[01:21:46] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye
[01:27:16] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1112.eqiad.wmnet with OS bullseye
[01:27:22] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye completed: - cp1112 (**PASS**) - Remov...
[01:27:45] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ssingh)
[01:36:44] !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1114.eqiad.wmnet with reason: host reimage
[01:38:11] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T348183)', diff saved to https://phabricator.wikimedia.org/P53281 and previous config saved to /var/cache/conftool/dbconfig/20231110-013810-arnaudb.json
[01:38:15] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[01:39:42] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1114.eqiad.wmnet with reason: host reimage
[01:42:36] !log removing 16 files for legal compliance
[01:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:53:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P53282 and previous config saved
to /var/cache/conftool/dbconfig/20231110-015317-arnaudb.json
[01:58:38] !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1114.eqiad.wmnet with OS bullseye
[01:58:45] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye completed: - cp1114 (**PASS**) - Remov...
[02:01:48] SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ssingh)
[02:08:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P53283 and previous config saved to /var/cache/conftool/dbconfig/20231110-020823-arnaudb.json
[02:15:53] !log removing 3 files for legal compliance
[02:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T348183)', diff saved to https://phabricator.wikimedia.org/P53284 and previous config saved to /var/cache/conftool/dbconfig/20231110-022330-arnaudb.json
[02:23:32] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[02:23:34] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[02:23:45] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[02:23:52] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T348183)', diff saved to https://phabricator.wikimedia.org/P53285 and previous config saved to /var/cache/conftool/dbconfig/20231110-022351-arnaudb.json
[02:35:35] !log
arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T348183)', diff saved to https://phabricator.wikimedia.org/P53286 and previous config saved to /var/cache/conftool/dbconfig/20231110-023534-arnaudb.json
[02:35:39] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[02:38:53] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P53287 and previous config saved to /var/cache/conftool/dbconfig/20231110-025041-arnaudb.json
[03:05:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P53288 and previous config saved to /var/cache/conftool/dbconfig/20231110-030547-arnaudb.json
[03:08:53] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:20:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T348183)', diff saved to https://phabricator.wikimedia.org/P53289 and previous config saved to /var/cache/conftool/dbconfig/20231110-032053-arnaudb.json
[03:20:58] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[03:39:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:53:53] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:09:03] (PS1) RLazarus: Add golang instructions to README [docker-images/production-images] - https://gerrit.wikimedia.org/r/973280
[04:09:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:20:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:33:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:22:57] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 126, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:23:41] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1,
dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:26:34] SRE, Infrastructure-Foundations, fundraising-tech-ops, vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (MoritzMuehlenhoff) >>! In T349402#9321478, @Dwisehaupt wrote: > @MoritzMuehlenhoff It doesn't need the full 16G. I was just basing that off of t...
[06:27:13] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 127, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:27:55] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:46] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: insetup::search_platform
[06:31:17] (PS1) Muehlenhoff: Switch insetup::search_platform to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973284 (https://phabricator.wikimedia.org/T349619)
[06:32:38] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[06:33:27] (CR) Muehlenhoff: [C: +2] Switch insetup::search_platform to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973284 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[06:44:33] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:45:01] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:52:57] RECOVERY - Check systemd state on sretest1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231110T0700)
[07:01:15] !log cleaning up digicert-2022 update-ocsp config bits from cp servers
[07:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:17] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:08:53] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:09:17] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:28:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:30:59] (PS1) Slyngshede: P:idp:services add Cicalese OIDC service [puppet] -
https://gerrit.wikimedia.org/r/973287 (https://phabricator.wikimedia.org/T350725)
[07:34:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: insetup::search_platform
[07:37:09] (PS1) Muehlenhoff: Switch currently unused insetup roles to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973288 (https://phabricator.wikimedia.org/T349619)
[07:41:39] (PS2) Slyngshede: P:idp:services add Catalyst OIDC service [puppet] - https://gerrit.wikimedia.org/r/973287 (https://phabricator.wikimedia.org/T350725)
[07:42:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:51:07] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:51:18] SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[07:53:53] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231110T0800)
[08:01:23] !log imported php7.4 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf11u1 to component/php74 for bullseye-wikimedia
[08:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:18] (CR) Muehlenhoff: [apt-staging] Add rsync endpoint for ci->apt pipeline (1 comment) [puppet] - https://gerrit.wikimedia.org/r/971996 (owner: EoghanGaffney)
[08:08:03] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/973211 (owner: Dzahn)
[08:12:45] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:16:19] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:18:45] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:22:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:29:22] Puppet, iPoid-Service: Rename FEED_API_KEY - https://phabricator.wikimedia.org/T350903 (jijiki)
[08:32:04] (CR) Ayounsi: [C: +1] Adjust BGP_Customer_out policy to send default and local POP routes [homer/public] - https://gerrit.wikimedia.org/r/973198 (https://phabricator.wikimedia.org/T350740) (owner: Cathal Mooney)
[08:34:56] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[08:35:15] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0)
https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:35:16] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[08:35:48] !log imported php-defaults 2:7.4+76+wmf1~bullseye1 to component/php74 for bullseye-wikimedia
[08:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:38:59] RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:39:35] (CR) Ayounsi: [C: +1] Homer YAML changes to support move of sandbox-a-codfw vlan to ssw's [homer/public] - https://gerrit.wikimedia.org/r/973239 (https://phabricator.wikimedia.org/T347191) (owner: Cathal Mooney)
[08:41:52] SRE, serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (MoritzMuehlenhoff)
[08:46:14] (PS1) Jelto: gitlab_runner: unregister gitlab-runner1002 [puppet] - https://gerrit.wikimedia.org/r/973290 (https://phabricator.wikimedia.org/T344951)
[08:47:09] SRE, Infrastructure-Foundations, fundraising-tech-ops, vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (ayounsi) I'd rather do it the other way around, start as a private IP behind the CDN and move it to a public one if there are blockers. But...
[08:47:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:48:40] (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/973290 (https://phabricator.wikimedia.org/T344951) (owner: Jelto)
[08:49:45] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:49:53] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:18] (PS1) Effie Mouzeli: ipoid: tweak resources [deployment-charts] - https://gerrit.wikimedia.org/r/973291
[08:52:51] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:54:01] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:55:23] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:57:13] (PS2) Effie Mouzeli: ipoid: tweak resources [deployment-charts] - https://gerrit.wikimedia.org/r/973291
[08:57:51] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:00:20] (CR) EoghanGaffney: [C: +1] gitlab_runner: unregister gitlab-runner1002 [puppet] - https://gerrit.wikimedia.org/r/973290 (https://phabricator.wikimedia.org/T344951) (owner: Jelto)
[09:00:54] (PS3) Effie Mouzeli: ipoid: tweak resources [deployment-charts] - https://gerrit.wikimedia.org/r/973291
[09:00:59] (PuppetZeroResources) firing: Puppet has failed generate resources on idp2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:02:16] (CR) Effie Mouzeli: [C: +2] ipoid: tweak resources [deployment-charts] - https://gerrit.wikimedia.org/r/973291 (owner: Effie Mouzeli)
[09:03:00] (Merged) jenkins-bot: ipoid: tweak resources [deployment-charts] - https://gerrit.wikimedia.org/r/973291 (owner: Effie Mouzeli)
[09:03:28] ops-eqiad, DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (ayounsi) Can you ping me when you're around so we can have a look? afaik nothing changed on the switch side.
[09:07:57] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host vrts1001.eqiad.wmnet
[09:09:08] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[09:09:25] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[09:12:35] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1001.eqiad.wmnet
[09:15:59] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on idp1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:19:11] (PS1) Effie Mouzeli: ipoid: disable emptyDir [deployment-charts] - https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861)
[09:23:36] (CR) Kosta Harlan: [C: -1] ipoid: disable emptyDir (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: Effie Mouzeli)
[09:24:30] (CR) Kosta Harlan: [C: -1] ipoid: disable emptyDir (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: Effie Mouzeli)
[09:29:04] (Abandoned) Kosta Harlan: Enable WelcomeSurvey on ukwiki, huwiki, hywiki [mediawiki-config] - https://gerrit.wikimedia.org/r/565438 (https://phabricator.wikimedia.org/T238295) (owner: Catrope)
[09:29:15] !log imported wikidiff2 1.14.1-0+wmf1+bullseye1 to component/php74 for bullseye-wikimedia
[09:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:21] (PS2) Effie Mouzeli: ipoid: disable emptyDir [deployment-charts] - https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861)
[09:30:33] (CR) Effie Mouzeli: ipoid: disable emptyDir (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/973292
(https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [09:30:52] (03CR) 10Effie Mouzeli: ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [09:32:47] (03CR) 10Kosta Harlan: ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [09:33:41] (03CR) 10Kosta Harlan: ipoid: disable emptyDir (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [09:34:30] (03CR) 10Kosta Harlan: [C: 03+1] "this still needs a merge" [puppet] - 10https://gerrit.wikimedia.org/r/894000 (https://phabricator.wikimedia.org/T329718) (owner: 10Phuedx) [09:35:47] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10jcrespo) @AlexisJazz While we are happy that you are excited about this, this is by far not ready for discussion. Developers just handed out the code, but this r... 
[09:36:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:42:57] (03CR) 10Effie Mouzeli: ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [09:48:40] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: unregister gitlab-runner1002 [puppet] - 10https://gerrit.wikimedia.org/r/973290 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [09:50:28] (03CR) 10Effie Mouzeli: ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [09:51:31] (03PS1) 10Jelto: Revert "gitlab_runner: unregister gitlab-runner1002" [puppet] - 10https://gerrit.wikimedia.org/r/973255 (https://phabricator.wikimedia.org/T344951) [09:52:11] (03CR) 10Kosta Harlan: ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [09:52:26] (03PS3) 10Kosta Harlan: ipoid: disable emptyDir [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [09:52:33] (03CR) 10Kosta Harlan: ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [09:53:16] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [09:54:08] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/973255 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [09:54:18] (03Merged) 10jenkins-bot: ipoid: disable emptyDir [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli) [09:54:19] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1109.eqiad.wmnet with OS bullseye [09:57:24] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [09:57:42] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [09:59:38] (03PS1) 10Kosta Harlan: ipoid: Set TMPDIR to /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/973304 [10:00:33] (03CR) 10Btullis: Send metrics from Airflow analytics test (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [10:01:12] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Set TMPDIR to /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/973304 (owner: 10Kosta Harlan) [10:01:59] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1109.eqiad.wmnet with OS bullseye [10:02:03] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye executed with errors: - cp1109 (**FAIL**) - Downtime... 
[10:02:05] (03Merged) 10jenkins-bot: ipoid: Set TMPDIR to /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/973304 (owner: 10Kosta Harlan) [10:02:20] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1109.eqiad.wmnet with OS bullseye [10:02:25] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye [10:02:36] !log imported php-excimer 1.0.2-1+wmf3+bullseye1 to component/php74 for bullseye-wikimedia [10:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:13] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) [10:05:10] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:05:15] (03CR) 10Jelto: [V: 03+1 C: 03+2] Revert "gitlab_runner: unregister gitlab-runner1002" [puppet] - 10https://gerrit.wikimedia.org/r/973255 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [10:05:22] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:06:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:07:30] (03CR) 10Cathal Mooney: [C: 03+2] Adjust BGP_Customer_out policy to send default and local POP routes [homer/public] - 10https://gerrit.wikimedia.org/r/973198 (https://phabricator.wikimedia.org/T350740) (owner: 10Cathal Mooney) [10:08:06] (03Merged) 10jenkins-bot: Adjust BGP_Customer_out policy to send default and local POP routes [homer/public] - 10https://gerrit.wikimedia.org/r/973198 
(https://phabricator.wikimedia.org/T350740) (owner: 10Cathal Mooney) [10:09:32] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1109.eqiad.wmnet with OS bullseye [10:09:44] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye executed with errors: - cp1109 (**FAIL**) - Removed... [10:10:32] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1109.eqiad.wmnet with OS bullseye [10:10:36] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye [10:16:41] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1109.eqiad.wmnet with OS bullseye [10:16:45] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye executed with errors: - cp1109 (**FAIL**) - Removed... 
[10:16:55] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1109.eqiad.wmnet with OS bullseye [10:16:59] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye [10:25:54] !log imported dh-php 0.35+wmf1+bullseye1 to component/php74 for bullseye-wikimedia [10:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:29] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) [10:31:25] (03PS1) 10Slyngshede: NTP: alert on ntp/time errors [alerts] - 10https://gerrit.wikimedia.org/r/973306 [10:32:05] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1109.eqiad.wmnet with reason: host reimage [10:33:17] (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973307 [10:33:29] (03PS2) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973307 (https://phabricator.wikimedia.org/T336163) [10:33:36] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973307 (https://phabricator.wikimedia.org/T336163) (owner: 10Kosta Harlan) [10:35:05] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1109.eqiad.wmnet with reason: host reimage [10:38:08] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:38:44] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:38:48] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:39:00] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 
bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:39:08] (03PS3) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973307 (https://phabricator.wikimedia.org/T336163) [10:39:11] (03CR) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973307 (https://phabricator.wikimedia.org/T336163) (owner: 10Kosta Harlan) [10:40:01] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973307 (https://phabricator.wikimedia.org/T336163) (owner: 10Kosta Harlan) [10:41:13] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [10:41:29] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [10:42:07] (03PS1) 10Volans: spicerack: log cookbook execution stats [software/spicerack] - 10https://gerrit.wikimedia.org/r/973309 [10:43:28] (03CR) 10Muehlenhoff: sre.ganeti.*: customize lock arguments (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [10:46:38] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:46:54] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:49:58] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.417 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:50:12] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:53:21] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1109.eqiad.wmnet with OS bullseye [10:53:25] 10ops-eqiad, 10DC-Ops: 
Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye completed: - cp1109 (**PASS**) - Removed from Puppet... [11:04:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:05:44] !log imported php-imagick 3.4.4+php8.0+3.4.4-2+deb11u2+wmf1+bullseye1 to component/php74 for bullseye-wikimedia [11:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:25] (03CR) 10Volans: "thanks fo the feedback, replies/questions inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [11:08:53] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:12:54] (03CR) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney) [11:13:04] (03PS4) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline [puppet] - 10https://gerrit.wikimedia.org/r/971996 [11:15:41] (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973312 (https://phabricator.wikimedia.org/T345238) [11:15:49] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973312 (https://phabricator.wikimedia.org/T345238) (owner: 10Kosta Harlan) [11:16:06] 
!log imported tideways 5.0.4-2+wmf1+bullseye1 to component/php74 for bullseye-wikimedia [11:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:33] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973312 (https://phabricator.wikimedia.org/T345238) (owner: 10Kosta Harlan) [11:16:35] (03PS4) 10Jbond: puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 [11:16:45] (03CR) 10Jbond: puppet: add hiera_lookup function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [11:17:57] !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [11:18:15] !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [11:19:53] (03CR) 10Volans: [C: 03+1] "LGTM (needs rebase)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [11:21:28] (03PS1) 10Kosta Harlan: ipoid: Grant INDEX to ipoid_rw user for ipoid DB [puppet] - 10https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) [11:21:43] (03CR) 10Muehlenhoff: [apt-staging] Add rsync endpoint for ci->apt pipeline (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney) [11:22:17] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) [11:31:39] (03PS1) 10Cathal Mooney: Modify BGP_Customer_out to announce Wikimedia prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/973314 (https://phabricator.wikimedia.org/T350740) [11:33:58] (03PS5) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline [puppet] - 10https://gerrit.wikimedia.org/r/971996 [11:34:00] (03CR) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney) [11:34:14] (03CR) 10Ayounsi: 
[C: 03+1] Modify BGP_Customer_out to announce Wikimedia prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/973314 (https://phabricator.wikimedia.org/T350740) (owner: 10Cathal Mooney) [11:36:09] (03CR) 10Cathal Mooney: [C: 03+2] Modify BGP_Customer_out to announce Wikimedia prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/973314 (https://phabricator.wikimedia.org/T350740) (owner: 10Cathal Mooney) [11:36:45] (03Merged) 10jenkins-bot: Modify BGP_Customer_out to announce Wikimedia prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/973314 (https://phabricator.wikimedia.org/T350740) (owner: 10Cathal Mooney) [11:37:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one final nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney) [11:39:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:40:04] (03PS6) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline [puppet] - 10https://gerrit.wikimedia.org/r/971996 [11:40:11] (03CR) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney) [11:40:46] (03CR) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney) [11:41:47] (03PS1) 10Jbond: sre.hosts.reimage: reimage with current puppet version unless new [cookbooks] - 10https://gerrit.wikimedia.org/r/973315 [11:42:36] (03CR) 10EoghanGaffney: [C: 03+2] [apt-staging] Add rsync endpoint for ci->apt pipeline [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney) [11:42:45] (Primary inbound port utilisation over 80% 
#page) firing: Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [11:42:46] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [11:43:05] * Emperor here [11:43:10] same [11:43:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:43:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (10BTullis) Confirmed, both servers can see the full 256 GB of RAM. Thanks again @VRiley-WMF. [11:43:32] ok [11:43:47] already going down https://librenms.wikimedia.org/graphs/to=1699616400/id=19111/type=port_bits/from=1699530000/ [11:43:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (10BTullis) [11:43:59] I'm going to lunch, but maybe hotlinking? [11:44:03] I see no increase in traffic, so it doesn't seem related to regular http traffic [11:44:41] if it is going down, let's monitor [11:44:50] try to see if there is something on superset [11:45:39] oh, did we p.age everyone because it's a US holiday? 
[11:45:53] both text and upload for esams look normal re: http requests [11:46:06] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye [11:46:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye [11:46:35] I've ACKd it anyhow [11:46:42] (03CR) 10Jbond: "in relation to this function i was originally going to use it to lookup the puppet version of a host and to see if a host was classified i" [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond) [11:47:22] seems back to normal levels now [11:47:30] Emperor: I resolved it [11:47:45] (Primary inbound port utilisation over 80% #page) resolved: Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [11:47:46] (Primary inbound port utilisation over 80% #page) resolved: Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [11:47:47] apparently I was not fast enough as I was looking at the graphs [11:50:06] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 75, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:50:40] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS bullseye [11:50:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*... [11:50:48] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye [11:50:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye [11:53:53] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:56:17] !log imported php-wmerrors 2.0.0~git20190628.183ef7d-3+wmf1+bullseye1 to component/php74 for bullseye-wikimedia [11:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:26] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS bullseye [11:56:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*... 
[11:56:39] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye [11:56:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye [11:58:38] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS bullseye [11:58:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*... [11:59:07] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10AlexisJazz) @jcrespo thanks for letting me know. I misunderstood Bawolff's comment. Well, I can partially answer one of your open questions. You won't really ne... 
[11:59:13] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye [11:59:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye [12:02:33] (03PS1) 10Muehlenhoff: Set an-master1003/1004 to use to Puppet 7 via Hiera host entries [puppet] - 10https://gerrit.wikimedia.org/r/973317 (https://phabricator.wikimedia.org/T349619) [12:03:58] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS bullseye [12:04:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*... 
[12:04:14] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye [12:04:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye [12:04:41] (03PS2) 10Cathal Mooney: Fail when setting int relations if PuppetDB parent not found in Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/973185 (https://phabricator.wikimedia.org/T350479) [12:08:41] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS bullseye [12:08:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*... [12:10:39] (03CR) 10Muehlenhoff: sre.ganeti.*: customize lock arguments (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans) [12:12:24] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye [12:12:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye [12:20:04] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10Bawolff) >>! In T191804#9319565, @jcrespo wrote: > Please loop me in in the progress, while this doesn't affect production, I may have assumed in some cases that... 
[12:21:41] 10SRE, 10SRE-Access-Requests: Requesting access to WMF for Grace - https://phabricator.wikimedia.org/T350918 (10hnowlan) [12:22:36] (03PS1) 10Btullis: Add a prometheus_instance parameter to prometheus::statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/973320 (https://phabricator.wikimedia.org/T343232) [12:22:38] (03PS1) 10Btullis: Configure statsds_exporter scraping for the analytics prometheus instance [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) [12:23:19] (03PS2) 10Btullis: Configure statsd_exporter scraping for the analytics prometheus instance [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) [12:23:31] 10SRE, 10SRE-Access-Requests: Requesting access to WMF for Grace - https://phabricator.wikimedia.org/T350918 (10hnowlan) The `WMF` group is an LDAP group rather than a shell group - is there another group that should be requested here? Tagging @Jdforrester-WMF for approval. Clarity on what the shell access is... [12:25:38] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10jcrespo) Thank you, @AlexisJazz that's useful feedback that without doubt will make our media storage happy- still there are additional technical operations and... 
[12:25:57] !log imported php-pcov 1.0.6-4+wmf1~bullseye1 to component/php74 for bullseye-wikimedia [12:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:28:46] (03PS4) 10Jbond: admin: add urbanecm to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn) [12:28:54] (03PS15) 10Aqu: Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [12:29:08] I try to take a look [12:29:31] 10SRE, 10SRE-Access-Requests: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (10hnowlan) [12:31:36] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) [12:32:32] 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10jcrespo) Indeed, the same schema change for production has to be applied to backup metadata, as we mirrored the size from mediawiki as an unsigned int: https://... 
[12:32:38] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [12:33:21] 10SRE, 10SRE-Access-Requests: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (10hnowlan) OOB key verification will be done next week [12:33:42] XioNoX: It's not from outside: https://w.wiki/877Q at least I'm not seeing any. I think it might be analytics again [12:33:57] (03PS16) 10Aqu: Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [12:34:02] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage [12:35:01] (03PS5) 10Jbond: admin: add urbanecm to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn) [12:35:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn) [12:36:07] I'll wait for wmf_flow_internal to catch up and then look at it [12:36:08] (03CR) 10Aqu: "Thx @Btullis for the pointer. I've switched the strategy from a variable to an `any`." 
[puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [12:37:01] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage [12:38:51] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:41:31] 10SRE, 10Traffic-Icebox, 10User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (10jbond) @Pppery AFAIK other then blocking empty agent headers on upload (T224891#7182766) no further progress has been made to addresses the comments i... [12:43:58] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:45:33] (03CR) 10Jbond: [C: 04-1] "I don't think that the production idp i the best place for this https://idp.wmcloud.org/ or idp-test would be better options. 
I think we " [puppet] - 10https://gerrit.wikimedia.org/r/973287 (https://phabricator.wikimedia.org/T350725) (owner: 10Slyngshede) [12:47:08] (03CR) 10Ayounsi: Fail when setting int relations if PuppetDB parent not found in Netbox (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/973185 (https://phabricator.wikimedia.org/T350479) (owner: 10Cathal Mooney) [12:48:50] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10hnowlan) 05Open→03Stalled [12:49:36] scratch that, if it's esams it can't be analytics. I need more coffee [12:54:51] (03PS15) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 [12:59:31] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1115.eqiad.wmnet with OS bullseye [12:59:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye completed: - cp1115 (**PASS**) - Remo... [13:01:15] (03CR) 10Jbond: "adding moritz and volans who I think could provide good feedback" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (owner: 10Brouberol) [13:05:10] !log imported php-yaml 2.2.1+2.1.0+2.0.4+1.3.2-2+wmf1~bullseye1 to component/php74 for bullseye-wikimedia [13:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:29] (03CR) 10Btullis: "I like it." 
[puppet] - 10https://gerrit.wikimedia.org/r/973308 (owner: 10Brouberol) [13:07:32] (03PS1) 10EoghanGaffney: [apt-staging] Fix apt_staging.yaml, add envoy [puppet] - 10https://gerrit.wikimedia.org/r/973323 [13:09:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:12:13] (03CR) 10Btullis: Generate the netboot.cfg file to avoid typos impacting everyone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973308 (owner: 10Brouberol) [13:12:44] (03PS1) 10Hnowlan: admin: add ecarg to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/973324 (https://phabricator.wikimedia.org/T350818) [13:16:14] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on idp1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:16:40] (03CR) 10Cathal Mooney: "I don't think we need to add this policy on the switches actually. 
The existing policy/group that the spines have facing the CRs can be " [homer/public] - 10https://gerrit.wikimedia.org/r/972702 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [13:17:08] (03PS16) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 [13:18:17] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/389/con" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (owner: 10Brouberol) [13:18:19] (03CR) 10Tchanders: ipoid: Grant INDEX to ipoid_rw user for ipoid DB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: 10Kosta Harlan) [13:19:02] (03CR) 10Cathal Mooney: "Looking more closely I was going to say the lack of "from protocol evpn" would be an issue, but as you send a default that doesn't _really" [homer/public] - 10https://gerrit.wikimedia.org/r/972702 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [13:19:04] (03PS2) 10Kosta Harlan: ipoid: Grant INDEX to ipoid_rw user for ipoid DB [puppet] - 10https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) [13:19:21] (03CR) 10Kosta Harlan: ipoid: Grant INDEX to ipoid_rw user for ipoid DB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: 10Kosta Harlan) [13:19:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:21:08] (03PS3) 10Cathal Mooney: Fail when setting int relations if PuppetDB parent not found in Netbox [software/netbox-extras] - 
10https://gerrit.wikimedia.org/r/973185 (https://phabricator.wikimedia.org/T350479) [13:21:38] (03CR) 10Brouberol: "Here is a diff between the current and generated netboot.cfg files https://phabricator.wikimedia.org/P53293" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (owner: 10Brouberol) [13:23:00] !log imported php-geoip 1.1.1-7+wmf2+bullseye1 to component/php74 for bullseye-wikimedia [13:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:06] (03Abandoned) 10Ayounsi: Add support for non EVPN switches on spines [homer/public] - 10https://gerrit.wikimedia.org/r/972702 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [13:24:59] (PuppetZeroResources) firing: Puppet has failed generate resources on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [13:25:04] (03CR) 10Cathal Mooney: Fail when setting int relations if PuppetDB parent not found in Netbox (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/973185 (https://phabricator.wikimedia.org/T350479) (owner: 10Cathal Mooney) [13:25:18] (03PS1) 10Ayounsi: Add BGP between spines and SONiC L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/973325 (https://phabricator.wikimedia.org/T335028) [13:26:39] (03CR) 10Ayounsi: "Example diff on two spines: https://phabricator.wikimedia.org/P53294" [homer/public] - 10https://gerrit.wikimedia.org/r/973325 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [13:26:48] (03PS4) 10Cathal Mooney: Adjust homer templates to support anycast gw with single IP [homer/public] - 10https://gerrit.wikimedia.org/r/973267 (https://phabricator.wikimedia.org/T350579) [13:27:51] 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff) [13:29:51] (ProbeDown) firing: (2) Service 
mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:29:52] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Longer term we can think about whether to split these, or otherwise change the template/group name to say "sw_external" or something" [homer/public] - 10https://gerrit.wikimedia.org/r/973325 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [13:30:30] (03CR) 10Ayounsi: [C: 03+2] Add BGP between spines and SONiC L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/973325 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [13:30:45] (03CR) 10Ayounsi: "I14bb16c8f9d8661953f5cde5a6e18df802b4d957" [homer/public] - 10https://gerrit.wikimedia.org/r/972702 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [13:31:08] (03Merged) 10jenkins-bot: Add BGP between spines and SONiC L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/973325 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [13:39:51] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:41:46] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff) [13:44:44] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [13:45:00] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [13:45:07] !log otto@deploy2002 helmfile [eqiad] START 
helmfile.d/services/eventgate-analytics-external: apply [13:45:36] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [13:45:42] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [13:46:04] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [13:47:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [13:49:16] 10SRE, 10serviceops, 10API Platform (RESTbase Deprecation Roadmap), 10Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10lbowmaker) [13:49:29] 10SRE, 10ChangeProp, 10EventStreams, 10Image-Suggestion-API, and 5 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10lbowmaker) [13:52:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [14:01:31] (03PS1) 10Ayounsi: Add sretest1004 [puppet] - 10https://gerrit.wikimedia.org/r/973349 (https://phabricator.wikimedia.org/T335028) [14:02:59] (03CR) 10Brouberol: [C: 03+1] Add sretest1004 [puppet] - 10https://gerrit.wikimedia.org/r/973349 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [14:03:01] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/973349 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [14:03:16] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/973349 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [14:03:20] (03CR) 10Ayounsi: [C: 03+2] Add sretest1004 [puppet] - 10https://gerrit.wikimedia.org/r/973349 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi) [14:03:26] 10SRE, 10Cloud-VPS, 10cloud-services-team: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10taavi) I think this is fixed now, right? [14:07:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:11:42] !log upgrading LibreNMS to 23.10 [14:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:38] !log denisse@deploy2002 Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to v23.10.0 - T349492 [14:15:48] !log denisse@deploy2002 Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to v23.10.0 - T349492 (duration: 00m 10s) [14:17:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:19:14] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: librenms-alerts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:21] (03PS17) 10Brouberol: Generate the netboot.cfg file to avoid typos 
impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 [14:21:07] (03PS10) 10Hashar: Plugin to process Puppet Catalog Compiler results [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/969981 [14:21:34] (03CR) 10Brouberol: "as preseed.cfg is a symlink to netboot.cfg, I removed the committed symlink and made the link explicit, via a `file` resource." [puppet] - 10https://gerrit.wikimedia.org/r/973308 (owner: 10Brouberol) [14:21:48] RECOVERY - Check systemd state on an-airflow1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:57] 10SRE, 10Data-Engineering, 10Traffic: Add a rolled-up cache_status field to druid webrequest_sampled_128 - https://phabricator.wikimedia.org/T319344 (10lbowmaker) [14:26:16] (03PS3) 10Btullis: Configure statsd_exporter scraping for the analytics prometheus instance [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) [14:26:22] RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:59] (03CR) 10Vgutierrez: [C: 03+1] haproxy: re-set varnish maxconn for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/972354 (https://phabricator.wikimedia.org/T310609) (owner: 10Fabfur) [14:27:21] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973320 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [14:29:57] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10serviceops, 10Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10lbowmaker) [14:30:12] (03PS3) 10Fabfur: haproxy: re-set varnish maxconn for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/972354 (https://phabricator.wikimedia.org/T310609) [14:30:59] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10serviceops-radar, 
10Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088 (10lbowmaker) [14:31:29] 10SRE, 10Data-Engineering, 10Observability-Logging, 10Wikimedia-Logstash, 10Event-Platform: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (10lbowmaker) [14:31:52] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10observability, and 3 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (10lbowmaker) [14:32:14] 10SRE-OnFire, 10Data-Engineering, 10serviceops, 10Event-Platform: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10lbowmaker) [14:32:22] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Data-Engineering, 10Data-Platform-SRE, and 3 others: Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (10lbowmaker) [14:36:32] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [14:38:53] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:47] (03PS4) 10Btullis: Configure statsd_exporter scraping for the analytics prometheus instance [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) [14:40:09] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [14:40:55] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/973309 (owner: 10Volans) [14:46:24] (03PS1) 10Ayounsi: Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 
(https://phabricator.wikimedia.org/T334230) [14:46:56] (03CR) 10Marostegui: "Why is this needed? The ALTER grant is already enough to create indexes:" [puppet] - 10https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: 10Kosta Harlan) [14:47:41] (03PS1) 10Marostegui: Revert "dbproxy102[2,4]: Promote db1119 to standby" [puppet] - 10https://gerrit.wikimedia.org/r/973331 [14:48:31] (03CR) 10CI reject: [V: 04-1] Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) (owner: 10Ayounsi) [14:48:59] (03PS2) 10Ayounsi: Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) [14:49:11] (03CR) 10Arnaudb: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/973331 (owner: 10Marostegui) [14:49:40] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy102[2,4]: Promote db1119 to standby" [puppet] - 10https://gerrit.wikimedia.org/r/973331 (owner: 10Marostegui) [14:53:02] (03CR) 10CI reject: [V: 04-1] Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) (owner: 10Ayounsi) [14:53:53] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:23] (03PS3) 10Ayounsi: Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) [14:57:23] 10SRE, 10Data Pipelines, 10Data-Engineering, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10lbowmaker) [14:58:11] (03CR) 10CI reject: [V: 04-1] Add 
eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) (owner: 10Ayounsi) [14:58:29] (03Abandoned) 10DCausse: Replace Swift native API with S3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/766123 (https://phabricator.wikimedia.org/T302494) (owner: 10ZPapierski) [14:59:00] (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) (owner: 10Ayounsi) [15:01:05] (03CR) 10Krinkle: Add "db-mainstash" entry to $wgObjectCaches (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: 10Aaron Schulz) [15:01:30] (03CR) 10CI reject: [V: 04-1] Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) (owner: 10Ayounsi) [15:02:58] (03PS1) 10Marostegui: mariadb: Promote db1119 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/973351 (https://phabricator.wikimedia.org/T350022) [15:03:10] (03CR) 10Marostegui: [C: 04-2] "Not until tuesday" [puppet] - 10https://gerrit.wikimedia.org/r/973351 (https://phabricator.wikimedia.org/T350022) (owner: 10Marostegui) [15:05:33] (03PS4) 10Ayounsi: Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) [15:07:41] (03CR) 10CI reject: [V: 04-1] Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) (owner: 10Ayounsi) [15:09:50] (03PS1) 10EoghanGaffney: [apt-staging] Add dummy key for apt-staging host [labs/private] - 10https://gerrit.wikimedia.org/r/973352 [15:11:23] (03PS5) 10Ayounsi: Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 
(https://phabricator.wikimedia.org/T334230) [15:13:53] (03CR) 10Jelto: [C: 03+1] "lgtm" [labs/private] - 10https://gerrit.wikimedia.org/r/973352 (owner: 10EoghanGaffney) [15:18:31] (03CR) 10EoghanGaffney: [V: 03+2 C: 03+2] [apt-staging] Add dummy key for apt-staging host [labs/private] - 10https://gerrit.wikimedia.org/r/973352 (owner: 10EoghanGaffney) [15:22:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [15:23:11] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1062.eqiad.wmnet with OS bookworm [15:23:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bookworm [15:30:22] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:30:24] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:31:00] (03PS1) 10EoghanGaffney: [apt-staging] Rename apt-staging.d.w cert to fix error [labs/private] - 10https://gerrit.wikimedia.org/r/973353 [15:31:12] (03CR) 10EoghanGaffney: [V: 03+2 C: 03+2] [apt-staging] Rename apt-staging.d.w cert to fix error [labs/private] - 10https://gerrit.wikimedia.org/r/973353 (owner: 10EoghanGaffney) [15:31:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.068 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:31:16] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:32:01] 10SRE-swift-storage: Swift container for archived mariadb tables - https://phabricator.wikimedia.org/T350924 (10MatthewVernon) Technically, this is easy - we can make a swift account and away you go. I don't want to tie anyone up in red tape, but I think it'd be good to have a lightweight process to ensure this... [15:34:46] 10SRE-swift-storage: Swift container for archived mariadb tables - https://phabricator.wikimedia.org/T350924 (10jcrespo) +1 [15:36:01] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1062.eqiad.wmnet with reason: host reimage [15:38:56] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1062.eqiad.wmnet with reason: host reimage [15:39:02] (03CR) 10Kosta Harlan: ipoid: Grant INDEX to ipoid_rw user for ipoid DB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: 10Kosta Harlan) [15:39:07] (03Abandoned) 10Kosta Harlan: ipoid: Grant INDEX to ipoid_rw user for ipoid DB [puppet] - 10https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: 10Kosta Harlan) [15:40:54] (03CR) 10Marostegui: ipoid: Grant INDEX to ipoid_rw user for ipoid DB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: 10Kosta Harlan) [15:41:07] (03Restored) 10Marostegui: ipoid: Grant INDEX to ipoid_rw user for ipoid DB [puppet] - 10https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: 10Kosta Harlan) [15:41:41] (03CR) 10Kosta Harlan: ipoid: Grant INDEX to ipoid_rw user for ipoid DB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973313 
(https://phabricator.wikimedia.org/T305114) (owner: 10Kosta Harlan) [15:42:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [15:42:11] (03CR) 10Marostegui: [C: 03+2] ipoid: Grant INDEX to ipoid_rw user for ipoid DB [puppet] - 10https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: 10Kosta Harlan) [15:45:49] (03CR) 10FNegri: [C: 03+1] "nice, much cleaner!" [puppet] - 10https://gerrit.wikimedia.org/r/969693 (owner: 10Majavah) [15:47:41] (03PS2) 10EoghanGaffney: [apt-staging] Fix apt_staging.yaml, add envoy config and crt [puppet] - 10https://gerrit.wikimedia.org/r/973323 [15:48:15] (03CR) 10Majavah: [V: 03+1 C: 03+2] dynamicproxy: simplify redis replication code [puppet] - 10https://gerrit.wikimedia.org/r/969693 (owner: 10Majavah) [15:49:51] 10SRE, 10Cloud-VPS, 10cloud-services-team: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10fnegri) I think this has been edited to indicate a different problem, that still exists: > the systems are using different source addresses... 
[15:51:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:53:53] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:54:33] 10SRE, 10Cloud-VPS, 10cloud-services-team: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10fnegri) Or maybe it's easier to create a new task and resolve this one :) [15:54:34] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1062.eqiad.wmnet with OS bookworm [15:54:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bookworm completed: - cloud... 
[15:56:00] (03CR) 10Jbond: [C: 04-1] "LGTM -1 is just for the symlink" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (owner: 10Brouberol) [15:56:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:57:24] (03CR) 10Jbond: [C: 04-1] Generate the netboot.cfg file to avoid typos impacting everyone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973308 (owner: 10Brouberol) [16:03:31] (03CR) 10Btullis: "I fluffed up my gerrit patch splitting, so there is a new two-part configuration change here:" [puppet] - 10https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [16:04:07] (03PS1) 10Andrew Bogott: Put new cloudvirts (cloudvirt1062-1067) online [puppet] - 10https://gerrit.wikimedia.org/r/973354 (https://phabricator.wikimedia.org/T342537) [16:04:09] (03CR) 10Btullis: "Updated patch here:" [puppet] - 10https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [16:04:37] (03CR) 10Jbond: "lgtm a few nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/973323 (owner: 10EoghanGaffney) [16:05:00] (03Abandoned) 10Btullis: Enable support for statsd_exporters on non-ops instances [puppet] - 10https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis) [16:05:51] (03CR) 10Andrew Bogott: [C: 03+2] Put new cloudvirts (cloudvirt1062-1067) online [puppet] - 10https://gerrit.wikimedia.org/r/973354 (https://phabricator.wikimedia.org/T342537) (owner: 10Andrew Bogott) [16:06:15] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bookworm [16:06:19] (03CR) 10Jbond: Ensure that build directories are cleaned up (031 
comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) (owner: 10Slyngshede) [16:06:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS b... [16:10:12] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bookworm [16:10:14] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS bookworm [16:10:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS b... [16:10:30] (03CR) 10Cathal Mooney: Change 'anycast_gw' var in int config to represent type of IRB needed (031 comment) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/971937 (https://phabricator.wikimedia.org/T350579) (owner: 10Cathal Mooney) [16:10:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS b... 
[16:11:50] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1066.eqiad.wmnet with OS bookworm [16:11:51] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bookworm [16:12:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS b... [16:12:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS b... [16:15:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [16:16:12] (03CR) 10Cathal Mooney: [C: 03+2] Change 'anycast_gw' var in int config to represent type of IRB needed [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/971937 (https://phabricator.wikimedia.org/T350579) (owner: 10Cathal Mooney) [16:18:17] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage [16:20:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:20:54] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage [16:22:09] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage [16:22:13] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage [16:23:40] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1067.eqiad.wmnet with reason: host reimage [16:24:03] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage [16:24:55] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage [16:25:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:25:17] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage [16:27:44] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage [16:29:44] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:29:54] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1067.eqiad.wmnet with reason: host reimage [16:30:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [16:31:39] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:31:57] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:33:52] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add cloud-private subnet entries for new cloudvirt hosts - cmooney@cumin1001" [16:34:43] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add cloud-private subnet entries for new cloudvirt hosts - cmooney@cumin1001" [16:34:43] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:36:12] !log cmooney@cumin1001 START - Cookbook sre.dns.wipe-cache cloudvirt1062.private.eqiad.wikimedia.cloud on all recursors [16:36:16] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudvirt1062.private.eqiad.wikimedia.cloud on all recursors [16:36:36] (03CR) 10Jcrespo: [C: 03+1] mariadb: Promote db1119 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/973351 (https://phabricator.wikimedia.org/T350022) (owner: 10Marostegui) [16:37:05] (03CR) 10Jcrespo: [C: 03+1] Revert "bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/962207 (https://phabricator.wikimedia.org/T339894) (owner: 10FNegri) [16:38:16] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1067.eqiad.wmnet with OS bookworm [16:38:19] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1063.eqiad.wmnet with OS bookworm [16:38:23] !log andrew@cumin1001 
END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1065.eqiad.wmnet with OS bookworm [16:38:27] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1066.eqiad.wmnet with OS bookworm [16:38:50] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bookworm [16:38:55] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bookworm [16:38:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bookworm [16:39:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm [16:39:04] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS bookworm [16:39:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bookworm [16:39:17] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1066.eqiad.wmnet with OS bookworm [16:39:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bookworm [16:39:30] !log andrew@cumin1001 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=93) for host cloudvirt1064.eqiad.wmnet with OS bookworm [16:39:33] (03CR) 10Jcrespo: "FYI: arnaud. This is something that should be deployed, but needs nursing, as the problem is not puppet, but the changes that would be nee" [puppet] - 10https://gerrit.wikimedia.org/r/868392 (owner: 10Jcrespo) [16:39:39] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bookworm [16:39:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm [16:43:23] (03PS1) 10Hnowlan: rest-gateway: add params to config, rework citoid path matching [deployment-charts] - 10https://gerrit.wikimedia.org/r/973362 (https://phabricator.wikimedia.org/T329049) [16:47:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:49:48] !log cmooney@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.4 - cmooney@cumin1001 [16:51:26] !log cmooney@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.4 - cmooney@cumin1001 [16:52:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:53:25] !log 
andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage [16:53:56] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage [16:54:05] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage [16:54:16] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1067.eqiad.wmnet with reason: host reimage [16:54:31] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage [16:55:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:56:11] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage [16:58:16] 10SRE, 10ops-eqiad, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10fnegri) [16:58:58] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage [16:58:58] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1067.eqiad.wmnet with reason: host reimage [16:59:02] 10SRE, 10Cloud-VPS, 10cloud-services-team: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10fnegri) 05Open→03Resolved a:03fnegri I have created {T350995} for the problem that still exists, and I'm marking this task as resolved. 
[16:59:16] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage [17:00:15] 10SRE, 10Cloud-VPS, 10cloud-services-team: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10fnegri) [17:01:30] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage [17:03:42] PROBLEM - ensure kvm processes are running on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:11:37] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Andrew Bogott reimage in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:12:04] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:14:32] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:16:14] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on idp1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [17:24:59] (PuppetZeroResources) firing: Puppet has failed generate resources on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [17:25:21] (ProbeDown) resolved: (2) Service 
mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:33:16] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1064.eqiad.wmnet with OS bookworm [17:33:18] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1063.eqiad.wmnet with OS bookworm [17:33:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm executed with erro... [17:33:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm executed with erro... 
[17:33:38] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bookworm [17:33:43] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bookworm [17:33:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm [17:33:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm [17:47:26] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage [17:48:28] !log withdrawing IPv6 prefixes announced to AS1299 in esams to troubleshoot connectivity problem report [17:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:45] (03PS1) 10Jcrespo: sql: Migrate mediabackups metadata size from int to bigint [software/mediabackups] - 10https://gerrit.wikimedia.org/r/973364 (https://phabricator.wikimedia.org/T191804) [17:49:47] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage [17:50:02] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage [17:50:55] (03CR) 10Jcrespo: "This should be the only code-ish related change needed- as in memory we use python integer values, which are unbounded AFAIK." 
[software/mediabackups] - 10https://gerrit.wikimedia.org/r/973364 (https://phabricator.wikimedia.org/T191804) (owner: 10Jcrespo) [17:51:27] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1065.eqiad.wmnet with OS bookworm [17:51:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bookworm completed: - cloud... [17:52:42] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1067.eqiad.wmnet with OS bookworm [17:52:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bookworm completed: - cloud... [17:53:04] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage [17:54:52] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1066.eqiad.wmnet with OS bookworm [17:54:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bookworm completed: - cloud... 
[18:03:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:04:05] !log brion adding more vp9 backfill to the transcode runs on mwmaint2002 (requeueTranscodes -> job queue runners). Should increase load on transcode scaler job runners but not elsewhere [18:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [18:13:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [18:24:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:29:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:38:16] (03CR) 10Dzahn: admin: add urbanecm to stewards-users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn) [18:38:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:41:04] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:47:28] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1063.eqiad.wmnet with OS bookworm [18:47:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm completed: - cloud... 
[18:53:53] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:57:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [19:12:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [19:20:54] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:22:38] (03PS1) 10Zoranzoki21: throttle.php: Cleanup old rules, add new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973369 (https://phabricator.wikimedia.org/T351002) [19:44:52] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:53:16] PROBLEM - ensure kvm processes are running on cloudvirt1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:53:53] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:04:07] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bookworm [20:04:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm [20:22:33] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage [20:25:18] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage [20:28:20] RECOVERY - ensure kvm processes are running on cloudvirt1062 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:36:11] (03PS17) 10Aqu: Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [20:36:46] (03CR) 10CI reject: [V: 04-1] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [20:39:00] (03CR) 10Brian Wolff: [C: 03+1] "next week is fine. Honestly, it will probably be a little while before stuff actually happens in production, definitely more than a week." 
[software/mediabackups] - 10https://gerrit.wikimedia.org/r/973364 (https://phabricator.wikimedia.org/T191804) (owner: 10Jcrespo) [20:40:03] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [20:45:18] (03PS18) 10Aqu: Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [20:51:52] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1064.eqiad.wmnet with OS bookworm [20:51:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm completed: - cloud... [20:56:58] (03CR) 10Aqu: "I've added in this patch the mappings to customize the metrics for Prometheus." [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [21:00:32] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1064.eqiad.wmnet with OS bookworm [21:00:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm completed: - cloud... 
[21:16:14] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on idp1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:19:08] PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 90%, RTA = 6686.60 ms [21:19:20] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 82.08 ms [21:25:14] (PuppetZeroResources) firing: Puppet has failed generate resources on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:42:06] RECOVERY - ensure kvm processes are running on cloudvirt1064 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:14:30] (03PS3) 10EoghanGaffney: [apt-staging] Fix apt_staging.yaml, add envoy config and crt [puppet] - 10https://gerrit.wikimedia.org/r/973323 [22:15:00] (03CR) 10EoghanGaffney: [apt-staging] Fix apt_staging.yaml, add envoy config and crt (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/973323 (owner: 10EoghanGaffney) [22:31:20] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:53:22] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:53:52] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:53:53] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:54:38] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:55:08] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.306 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:01:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [23:31:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [23:34:56] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1061.eqiad.wmnet with OS bookworm [23:46:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1023:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [23:48:54] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1061.eqiad.wmnet with reason: host reimage [23:51:26] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudvirt1061.eqiad.wmnet with reason: host reimage [23:53:53] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure