[00:03:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P53272 and previous config saved to /var/cache/conftool/dbconfig/20231110-000322-root.json
[00:10:28] <wikibugs>	 SRE-swift-storage: Swift container for archived mariadb tables - https://phabricator.wikimedia.org/T350924 (Ladsgroup)
[00:11:36] <wikibugs>	 SRE-swift-storage: Swift container for archived mariadb tables - https://phabricator.wikimedia.org/T350924 (Ladsgroup)
[00:12:19] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P53273 and previous config saved to /var/cache/conftool/dbconfig/20231110-001219-arnaudb.json
[00:23:34] <wikibugs>	 (PS1) BryanDavis: Fix BlockDisablesLogin recursion [extensions/OAuth] (wmf/1.42.0-wmf.4) - https://gerrit.wikimedia.org/r/973247 (https://phabricator.wikimedia.org/T350836)
[00:27:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T348183)', diff saved to https://phabricator.wikimedia.org/P53274 and previous config saved to /var/cache/conftool/dbconfig/20231110-002725-arnaudb.json
[00:27:28] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[00:27:30] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[00:27:41] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[00:27:47] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T348183)', diff saved to https://phabricator.wikimedia.org/P53275 and previous config saved to /var/cache/conftool/dbconfig/20231110-002747-arnaudb.json
[00:31:09] <wikibugs>	 (PS2) BryanDavis: Fix BlockDisablesLogin recursion [extensions/OAuth] (wmf/1.42.0-wmf.4) - https://gerrit.wikimedia.org/r/973247 (https://phabricator.wikimedia.org/T350836)
[00:31:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T348183)', diff saved to https://phabricator.wikimedia.org/P53276 and previous config saved to /var/cache/conftool/dbconfig/20231110-003141-arnaudb.json
[00:31:57] <tzatziki>	 !log removing 1 file for legal compliance
[00:31:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:19] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ssingh)
[00:34:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ssingh)
[00:37:26] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye
[00:37:31] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye
[00:39:01] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/972523
[00:39:03] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/972523 (owner: TrainBranchBot)
[00:44:42] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1112.eqiad.wmnet with OS bullseye
[00:44:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye executed with errors: - cp1112 (**FAIL**...
[00:45:35] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye
[00:45:42] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye
[00:46:48] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P53277 and previous config saved to /var/cache/conftool/dbconfig/20231110-004647-arnaudb.json
[00:50:10] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1112.eqiad.wmnet with OS bullseye
[00:50:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye executed with errors: - cp1112 (**FAIL**...
[00:50:22] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp1112.eqiad.wmnet with OS bullseye
[00:50:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye
[00:55:36] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/972523 (owner: TrainBranchBot)
[00:58:54] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (Dwisehaupt) @MoritzMuehlenhoff It doesn't need the full 16G. I was just basing that off of the initial requests/approval when Quim was correspon...
[01:01:56] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P53278 and previous config saved to /var/cache/conftool/dbconfig/20231110-010154-arnaudb.json
[01:02:56] <logmsgbot>	 !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS bullseye
[01:03:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye
[01:05:38] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1112.eqiad.wmnet with reason: host reimage
[01:08:44] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1112.eqiad.wmnet with reason: host reimage
[01:10:09] <logmsgbot>	 !log sukhe@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1114.eqiad.wmnet with OS bullseye
[01:10:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye executed with errors: - cp1114 (**FAIL**...
[01:10:24] <logmsgbot>	 !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS bullseye
[01:10:32] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye
[01:13:12] <bd808>	 !log SAL test (T343157)
[01:13:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:13:16] <stashbot>	 T343157: Remove Twitter support - https://phabricator.wikimedia.org/T343157
[01:15:48] <logmsgbot>	 !log sukhe@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1114.eqiad.wmnet with OS bullseye
[01:15:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye executed with errors: - cp1114 (**FAIL**...
[01:15:59] <logmsgbot>	 !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS bullseye
[01:16:05] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye
[01:17:02] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T348183)', diff saved to https://phabricator.wikimedia.org/P53279 and previous config saved to /var/cache/conftool/dbconfig/20231110-011701-arnaudb.json
[01:17:04] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[01:17:06] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[01:17:07] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[01:17:13] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T348183)', diff saved to https://phabricator.wikimedia.org/P53280 and previous config saved to /var/cache/conftool/dbconfig/20231110-011712-arnaudb.json
[01:20:43] <logmsgbot>	 !log sukhe@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1114.eqiad.wmnet with OS bullseye
[01:20:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye executed with errors: - cp1114 (**FAIL**...
[01:21:40] <logmsgbot>	 !log sukhe@cumin1001 START - Cookbook sre.hosts.reimage for host cp1114.eqiad.wmnet with OS bullseye
[01:21:46] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye
[01:27:16] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1112.eqiad.wmnet with OS bullseye
[01:27:22] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp1112.eqiad.wmnet with OS bullseye completed: - cp1112 (**PASS**)   - Remov...
[01:27:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ssingh)
[01:36:44] <logmsgbot>	 !log sukhe@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1114.eqiad.wmnet with reason: host reimage
[01:38:11] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T348183)', diff saved to https://phabricator.wikimedia.org/P53281 and previous config saved to /var/cache/conftool/dbconfig/20231110-013810-arnaudb.json
[01:38:15] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[01:39:42] <logmsgbot>	 !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1114.eqiad.wmnet with reason: host reimage
[01:42:36] <tzatziki>	 !log removing 16 files for legal compliance
[01:42:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:53:17] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P53282 and previous config saved to /var/cache/conftool/dbconfig/20231110-015317-arnaudb.json
[01:58:38] <logmsgbot>	 !log sukhe@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1114.eqiad.wmnet with OS bullseye
[01:58:45] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1001 for host cp1114.eqiad.wmnet with OS bullseye completed: - cp1114 (**PASS**)   - Remov...
[02:01:48] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ssingh)
[02:08:24] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P53283 and previous config saved to /var/cache/conftool/dbconfig/20231110-020823-arnaudb.json
[02:15:53] <tzatziki>	 !log removing 3 files for legal compliance
[02:15:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T348183)', diff saved to https://phabricator.wikimedia.org/P53284 and previous config saved to /var/cache/conftool/dbconfig/20231110-022330-arnaudb.json
[02:23:32] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[02:23:34] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[02:23:45] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[02:23:52] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T348183)', diff saved to https://phabricator.wikimedia.org/P53285 and previous config saved to /var/cache/conftool/dbconfig/20231110-022351-arnaudb.json
[02:35:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T348183)', diff saved to https://phabricator.wikimedia.org/P53286 and previous config saved to /var/cache/conftool/dbconfig/20231110-023534-arnaudb.json
[02:35:39] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[02:38:53] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:50:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P53287 and previous config saved to /var/cache/conftool/dbconfig/20231110-025041-arnaudb.json
[03:05:48] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P53288 and previous config saved to /var/cache/conftool/dbconfig/20231110-030547-arnaudb.json
[03:08:53] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:20:54] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T348183)', diff saved to https://phabricator.wikimedia.org/P53289 and previous config saved to /var/cache/conftool/dbconfig/20231110-032053-arnaudb.json
[03:20:58] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[03:39:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:53:53] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:09:03] <wikibugs>	 (PS1) RLazarus: Add golang instructions to README [docker-images/production-images] - https://gerrit.wikimedia.org/r/973280
[04:09:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:20:35] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:15] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:33:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:22:57] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 126, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:23:41] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:26:34] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (MoritzMuehlenhoff) >>! In T349402#9321478, @Dwisehaupt wrote: > @MoritzMuehlenhoff It doesn't need the full 16G. I was just basing that off of t...
[06:27:13] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 127, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:27:55] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: insetup::search_platform
[06:31:17] <wikibugs>	 (PS1) Muehlenhoff: Switch insetup::search_platform to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973284 (https://phabricator.wikimedia.org/T349619)
[06:32:38] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[06:33:27] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch insetup::search_platform to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973284 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[06:44:33] <icinga-wm>	 PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Idle - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:45:01] <icinga-wm>	 PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv6: Idle - wmf_public_asn, AS14907/IPv4: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:52:57] <icinga-wm>	 RECOVERY - Check systemd state on sretest1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231110T0700)
[07:01:15] <vgutierrez>	 !log cleaning up digicert-2022 update-ocsp config bits from cp servers
[07:01:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:17] <jinxer-wm>	 (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:08:53] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:09:17] <jinxer-wm>	 (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:28:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:30:59] <wikibugs>	 (PS1) Slyngshede: P:idp:services add Cicalese OIDC service [puppet] - https://gerrit.wikimedia.org/r/973287 (https://phabricator.wikimedia.org/T350725)
[07:34:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: insetup::search_platform
[07:37:09] <wikibugs>	 (PS1) Muehlenhoff: Switch currently unused insetup roles to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/973288 (https://phabricator.wikimedia.org/T349619)
[07:41:39] <wikibugs>	 (PS2) Slyngshede: P:idp:services add Catalyst OIDC service [puppet] - https://gerrit.wikimedia.org/r/973287 (https://phabricator.wikimedia.org/T350725)
[07:42:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:51:07] <icinga-wm>	 PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:51:18] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[07:53:53] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231110T0800)
[08:01:23] <moritzm>	 !log imported php7.4 1:7.4.33-1+0~20221108.73+debian10~1.gbpa00350a+wmf11u1 to component/php74 for bullseye-wikimedia
[08:01:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:18] <wikibugs>	 (CR) Muehlenhoff: [apt-staging] Add rsync endpoint for ci->apt pipeline (1 comment) [puppet] - https://gerrit.wikimedia.org/r/971996 (owner: EoghanGaffney)
[08:08:03] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/973211 (owner: Dzahn)
[08:12:45] <icinga-wm>	 PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:16:19] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:18:45] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:22:42] <jinxer-wm>	 (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:29:22] <wikibugs>	 Puppet, iPoid-Service: Rename FEED_API_KEY - https://phabricator.wikimedia.org/T350903 (jijiki)
[08:32:04] <wikibugs>	 (CR) Ayounsi: [C: +1] Adjust BGP_Customer_out policy to send default and local POP routes [homer/public] - https://gerrit.wikimedia.org/r/973198 (https://phabricator.wikimedia.org/T350740) (owner: Cathal Mooney)
[08:34:56] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[08:35:15] <icinga-wm>	 RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:35:16] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[08:35:48] <moritzm>	 !log imported php-defaults 2:7.4+76+wmf1~bullseye1 to component/php74 for bullseye-wikimedia
[08:35:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:38:59] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:39:35] <wikibugs>	 (CR) Ayounsi: [C: +1] Homer YAML changes to support move of sandbox-a-codfw vlan to ssw's [homer/public] - https://gerrit.wikimedia.org/r/973239 (https://phabricator.wikimedia.org/T347191) (owner: Cathal Mooney)
[08:41:52] <wikibugs>	 SRE, serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (MoritzMuehlenhoff)
[08:46:14] <wikibugs>	 (PS1) Jelto: gitlab_runner: unregister gitlab-runner1002 [puppet] - https://gerrit.wikimedia.org/r/973290 (https://phabricator.wikimedia.org/T344951)
[08:47:09] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (ayounsi) I'd rather do it the other way around, start as a private IP behind the CDN and move it to a public one if there are blockers. But...
[08:47:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:48:40] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/973290 (https://phabricator.wikimedia.org/T344951) (owner: Jelto)
[08:49:45] <icinga-wm>	 RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:49:53] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:18] <wikibugs>	 (PS1) Effie Mouzeli: ipoid: tweak resources [deployment-charts] - https://gerrit.wikimedia.org/r/973291
[08:52:51] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:54:01] <icinga-wm>	 PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:55:23] <icinga-wm>	 RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:57:13] <wikibugs>	 (PS2) Effie Mouzeli: ipoid: tweak resources [deployment-charts] - https://gerrit.wikimedia.org/r/973291
[08:57:51] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:00:20] <wikibugs>	 (CR) EoghanGaffney: [C: +1] gitlab_runner: unregister gitlab-runner1002 [puppet] - https://gerrit.wikimedia.org/r/973290 (https://phabricator.wikimedia.org/T344951) (owner: Jelto)
[09:00:54] <wikibugs>	 (PS3) Effie Mouzeli: ipoid: tweak resources [deployment-charts] - https://gerrit.wikimedia.org/r/973291
[09:00:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed to generate resources on idp2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:02:16] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] ipoid: tweak resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/973291 (owner: 10Effie Mouzeli)
[09:03:00] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: tweak resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/973291 (owner: 10Effie Mouzeli)
[09:03:28] <wikibugs>	 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ayounsi) Can you ping me when you're around so we can have a look? afaik nothing changed on the switch side.
[09:07:57] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host vrts1001.eqiad.wmnet
[09:09:08] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[09:09:25] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[09:12:35] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1001.eqiad.wmnet
[09:15:59] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on idp1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:19:11] <wikibugs>	 (03PS1) 10Effie Mouzeli: ipoid: disable emptyDir [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861)
[09:23:36] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 04-1] ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli)
[09:24:30] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 04-1] ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli)
[09:29:04] <wikibugs>	 (03Abandoned) 10Kosta Harlan: Enable WelcomeSurvey on ukwiki, huwiki, hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/565438 (https://phabricator.wikimedia.org/T238295) (owner: 10Catrope)
[09:29:15] <moritzm>	 !log imported wikidiff2  1.14.1-0+wmf1+bullseye1  to component/php74 for bullseye-wikimedia
[09:29:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:21] <wikibugs>	 (03PS2) 10Effie Mouzeli: ipoid: disable emptyDir [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861)
[09:30:33] <wikibugs>	 (03CR) 10Effie Mouzeli: ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli)
[09:30:52] <wikibugs>	 (03CR) 10Effie Mouzeli: ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli)
[09:32:47] <wikibugs>	 (03CR) 10Kosta Harlan: ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli)
[09:33:41] <wikibugs>	 (03CR) 10Kosta Harlan: ipoid: disable emptyDir (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli)
[09:34:30] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] "this still needs a merge" [puppet] - 10https://gerrit.wikimedia.org/r/894000 (https://phabricator.wikimedia.org/T329718) (owner: 10Phuedx)
[09:35:47] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10jcrespo) @AlexisJazz While we are happy that you are excited about this, this is by far not ready for discussion. Developers just handed out the code, but this r...
[09:36:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:42:57] <wikibugs>	 (03CR) 10Effie Mouzeli: ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli)
[09:48:40] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: unregister gitlab-runner1002 [puppet] - 10https://gerrit.wikimedia.org/r/973290 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[09:50:28] <wikibugs>	 (03CR) 10Effie Mouzeli: ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli)
[09:51:31] <wikibugs>	 (03PS1) 10Jelto: Revert "gitlab_runner: unregister gitlab-runner1002" [puppet] - 10https://gerrit.wikimedia.org/r/973255 (https://phabricator.wikimedia.org/T344951)
[09:52:11] <wikibugs>	 (03CR) 10Kosta Harlan: ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli)
[09:52:26] <wikibugs>	 (03PS3) 10Kosta Harlan: ipoid: disable emptyDir [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli)
[09:52:33] <wikibugs>	 (03CR) 10Kosta Harlan: ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli)
[09:53:16] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] ipoid: disable emptyDir (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli)
[09:54:08] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/973255 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[09:54:18] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: disable emptyDir [deployment-charts] - 10https://gerrit.wikimedia.org/r/973292 (https://phabricator.wikimedia.org/T346861) (owner: 10Effie Mouzeli)
[09:54:19] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1109.eqiad.wmnet with OS bullseye
[09:57:24] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[09:57:42] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[09:59:38] <wikibugs>	 (03PS1) 10Kosta Harlan: ipoid: Set TMPDIR to /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/973304
[10:00:33] <wikibugs>	 (03CR) 10Btullis: Send metrics from Airflow analytics test (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu)
[10:01:12] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Set TMPDIR to /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/973304 (owner: 10Kosta Harlan)
[10:01:59] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1109.eqiad.wmnet with OS bullseye
[10:02:03] <wikibugs>	 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye executed with errors: - cp1109 (**FAIL**)   - Downtime...
[10:02:05] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: Set TMPDIR to /tmp [deployment-charts] - 10https://gerrit.wikimedia.org/r/973304 (owner: 10Kosta Harlan)
[10:02:20] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1109.eqiad.wmnet with OS bullseye
[10:02:25] <wikibugs>	 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye
[10:02:36] <moritzm>	 !log imported php-excimer 1.0.2-1+wmf3+bullseye1 to component/php74 for bullseye-wikimedia
[10:02:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:13] <wikibugs>	 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff)
[10:05:10] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[10:05:15] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] Revert "gitlab_runner: unregister gitlab-runner1002" [puppet] - 10https://gerrit.wikimedia.org/r/973255 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[10:05:22] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[10:06:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:07:30] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Adjust BGP_Customer_out policy to send default and local POP routes [homer/public] - 10https://gerrit.wikimedia.org/r/973198 (https://phabricator.wikimedia.org/T350740) (owner: 10Cathal Mooney)
[10:08:06] <wikibugs>	 (03Merged) 10jenkins-bot: Adjust BGP_Customer_out policy to send default and local POP routes [homer/public] - 10https://gerrit.wikimedia.org/r/973198 (https://phabricator.wikimedia.org/T350740) (owner: 10Cathal Mooney)
[10:09:32] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1109.eqiad.wmnet with OS bullseye
[10:09:44] <wikibugs>	 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye executed with errors: - cp1109 (**FAIL**)   - Removed...
[10:10:32] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1109.eqiad.wmnet with OS bullseye
[10:10:36] <wikibugs>	 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye
[10:16:41] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1109.eqiad.wmnet with OS bullseye
[10:16:45] <wikibugs>	 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye executed with errors: - cp1109 (**FAIL**)   - Removed...
[10:16:55] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1109.eqiad.wmnet with OS bullseye
[10:16:59] <wikibugs>	 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye
[10:25:54] <moritzm>	 !log imported  dh-php 0.35+wmf1+bullseye1 to component/php74 for bullseye-wikimedia
[10:25:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:29] <wikibugs>	 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff)
[10:31:25] <wikibugs>	 (03PS1) 10Slyngshede: NTP: alert on ntp/time errors [alerts] - 10https://gerrit.wikimedia.org/r/973306
[10:32:05] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1109.eqiad.wmnet with reason: host reimage
[10:33:17] <wikibugs>	 (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973307
[10:33:29] <wikibugs>	 (03PS2) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973307 (https://phabricator.wikimedia.org/T336163)
[10:33:36] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973307 (https://phabricator.wikimedia.org/T336163) (owner: 10Kosta Harlan)
[10:35:05] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1109.eqiad.wmnet with reason: host reimage
[10:38:08] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:38:44] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[10:38:48] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[10:39:00] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:39:08] <wikibugs>	 (03PS3) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973307 (https://phabricator.wikimedia.org/T336163)
[10:39:11] <wikibugs>	 (03CR) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973307 (https://phabricator.wikimedia.org/T336163) (owner: 10Kosta Harlan)
[10:40:01] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973307 (https://phabricator.wikimedia.org/T336163) (owner: 10Kosta Harlan)
[10:41:13] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[10:41:29] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[10:42:07] <wikibugs>	 (03PS1) 10Volans: spicerack: log cookbook execution stats [software/spicerack] - 10https://gerrit.wikimedia.org/r/973309
[10:43:28] <wikibugs>	 (03CR) 10Muehlenhoff: sre.ganeti.*: customize lock arguments (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[10:46:38] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:46:54] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:49:58] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.417 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:50:12] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:53:21] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1109.eqiad.wmnet with OS bullseye
[10:53:25] <wikibugs>	 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1109.eqiad.wmnet with OS bullseye completed: - cp1109 (**PASS**)   - Removed from Puppet...
[11:04:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:05:44] <moritzm>	 !log imported php-imagick 3.4.4+php8.0+3.4.4-2+deb11u2+wmf1+bullseye1 to component/php74 for bullseye-wikimedia
[11:05:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:25] <wikibugs>	 (03CR) 10Volans: "thanks fo the feedback, replies/questions inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[11:08:53] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:12:54] <wikibugs>	 (03CR) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney)
[11:13:04] <wikibugs>	 (03PS4) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline [puppet] - 10https://gerrit.wikimedia.org/r/971996
[11:15:41] <wikibugs>	 (03PS1) 10Kosta Harlan: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973312 (https://phabricator.wikimedia.org/T345238)
[11:15:49] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973312 (https://phabricator.wikimedia.org/T345238) (owner: 10Kosta Harlan)
[11:16:06] <moritzm>	 !log imported tideways 5.0.4-2+wmf1+bullseye1 to component/php74 for bullseye-wikimedia
[11:16:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:33] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973312 (https://phabricator.wikimedia.org/T345238) (owner: 10Kosta Harlan)
[11:16:35] <wikibugs>	 (03PS4) 10Jbond: puppet: add hiera_lookup function [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459
[11:16:45] <wikibugs>	 (03CR) 10Jbond: puppet: add hiera_lookup function (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond)
[11:17:57] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[11:18:15] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[11:19:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM (needs rebase)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond)
[11:21:28] <wikibugs>	 (03PS1) 10Kosta Harlan: ipoid: Grant INDEX to ipoid_rw user for ipoid DB [puppet] - 10https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114)
[11:21:43] <wikibugs>	 (03CR) 10Muehlenhoff: [apt-staging] Add rsync endpoint for ci->apt pipeline (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney)
[11:22:17] <wikibugs>	 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff)
[11:31:39] <wikibugs>	 (03PS1) 10Cathal Mooney: Modify BGP_Customer_out to announce Wikimedia prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/973314 (https://phabricator.wikimedia.org/T350740)
[11:33:58] <wikibugs>	 (03PS5) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline [puppet] - 10https://gerrit.wikimedia.org/r/971996
[11:34:00] <wikibugs>	 (03CR) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney)
[11:34:14] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Modify BGP_Customer_out to announce Wikimedia prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/973314 (https://phabricator.wikimedia.org/T350740) (owner: 10Cathal Mooney)
[11:36:09] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Modify BGP_Customer_out to announce Wikimedia prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/973314 (https://phabricator.wikimedia.org/T350740) (owner: 10Cathal Mooney)
[11:36:45] <wikibugs>	 (03Merged) 10jenkins-bot: Modify BGP_Customer_out to announce Wikimedia prefixes [homer/public] - 10https://gerrit.wikimedia.org/r/973314 (https://phabricator.wikimedia.org/T350740) (owner: 10Cathal Mooney)
[11:37:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one final nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney)
[11:39:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:40:04] <wikibugs>	 (03PS6) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline [puppet] - 10https://gerrit.wikimedia.org/r/971996
[11:40:11] <wikibugs>	 (03CR) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney)
[11:40:46] <wikibugs>	 (03CR) 10EoghanGaffney: [apt-staging] Add rsync endpoint for ci->apt pipeline (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney)
[11:41:47] <wikibugs>	 (03PS1) 10Jbond: sre.hosts.reimage: reimage with current puppet version unless new [cookbooks] - 10https://gerrit.wikimedia.org/r/973315
[11:42:36] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] [apt-staging] Add rsync endpoint for ci->apt pipeline [puppet] - 10https://gerrit.wikimedia.org/r/971996 (owner: 10EoghanGaffney)
[11:42:45] <jinxer-wm>	 (Primary inbound port utilisation over 80%  #page) firing: Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[11:42:46] <jinxer-wm>	 (Primary inbound port utilisation over 80%  #page) firing: Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[11:43:05] * Emperor here
[11:43:10] <jynus>	 same
[11:43:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:43:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (10BTullis) Confirmed, both servers can see the full 256 GB of RAM. Thanks again @VRiley-WMF.
[11:43:32] <effie>	 ok 
[11:43:47] <XioNoX>	 already going down https://librenms.wikimedia.org/graphs/to=1699616400/id=19111/type=port_bits/from=1699530000/
[11:43:48] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2: install ram upgrades in an-master100[34] - https://phabricator.wikimedia.org/T349879 (10BTullis)
[11:43:59] <XioNoX>	 I'm going to lunch, but maybe hotlinking?
[11:44:03] <jynus>	 I see no increase in traffic, so it doesn't seem related to regular http traffic
[11:44:41] <effie>	 if it is going down, let's monitor
[11:44:50] <jynus>	 try to see if there is something on superset
[11:45:39] <Emperor>	 oh, did we p.age everyone because it's a US holiday?
[11:45:53] <jynus>	 both text and upload for esams look normal re: http requests
[11:46:06] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye
[11:46:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye
[11:46:35] <Emperor>	 I've ACKd it anyhow
[11:46:42] <wikibugs>	 (03CR) 10Jbond: "in relation to this function i was originally going to use it to lookup the puppet version of a host and to see if a host was classified i" [software/spicerack] - 10https://gerrit.wikimedia.org/r/972459 (owner: 10Jbond)
[11:47:22] <jynus>	 seems back to normal levels now
[11:47:30] <effie>	 Emperor: I resolved it 
[11:47:45] <jinxer-wm>	 (Primary inbound port utilisation over 80%  #page) resolved: Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[11:47:46] <jinxer-wm>	 (Primary inbound port utilisation over 80%  #page) resolved: Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[11:47:47] <effie>	 apparently I was not fast enough as I was looking at the graphs
[11:50:06] <icinga-wm>	 PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 75, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:50:40] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS bullseye
[11:50:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*...
[11:50:48] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye
[11:50:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye
[11:53:53] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:56:17] <moritzm>	 !log imported php-wmerrors  2.0.0~git20190628.183ef7d-3+wmf1+bullseye1  to component/php74 for bullseye-wikimedia
[11:56:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:26] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS bullseye
[11:56:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*...
[11:56:39] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye
[11:56:45] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye
[11:58:38] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS bullseye
[11:58:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*...
[11:59:07] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10AlexisJazz) @jcrespo thanks for letting me know. I misunderstood Bawolff's comment.  Well, I can partially answer one of your open questions. You won't really ne...
[11:59:13] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye
[11:59:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye
[12:02:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Set an-master1003/1004 to use to Puppet 7 via Hiera host entries [puppet] - 10https://gerrit.wikimedia.org/r/973317 (https://phabricator.wikimedia.org/T349619)
[12:03:58] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS bullseye
[12:04:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*...
[12:04:14] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye
[12:04:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye
[12:04:41] <wikibugs>	 (03PS2) 10Cathal Mooney: Fail when setting int relations if PuppetDB parent not found in Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/973185 (https://phabricator.wikimedia.org/T350479)
[12:08:41] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1115.eqiad.wmnet with OS bullseye
[12:08:47] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye executed with errors: - cp1115 (**FAIL*...
[12:10:39] <wikibugs>	 (03CR) 10Muehlenhoff: sre.ganeti.*: customize lock arguments (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/967628 (https://phabricator.wikimedia.org/T341973) (owner: 10Volans)
[12:12:24] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1115.eqiad.wmnet with OS bullseye
[12:12:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye
[12:20:04] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10Bawolff) >>! In T191804#9319565, @jcrespo wrote: > Please loop me in in the progress, while this doesn't affect production, I may have assumed in some cases that...
[12:21:41] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to WMF for Grace - https://phabricator.wikimedia.org/T350918 (10hnowlan)
[12:22:36] <wikibugs>	 (03PS1) 10Btullis: Add a prometheus_instance parameter to prometheus::statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/973320 (https://phabricator.wikimedia.org/T343232)
[12:22:38] <wikibugs>	 (03PS1) 10Btullis: Configure statsds_exporter scraping for the analytics prometheus instance [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232)
[12:23:19] <wikibugs>	 (03PS2) 10Btullis: Configure statsd_exporter scraping for the analytics prometheus instance [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232)
[12:23:31] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to WMF for Grace - https://phabricator.wikimedia.org/T350918 (10hnowlan) The `WMF` group is an LDAP group rather than a shell group - is there another group that should be requested here? Tagging @Jdforrester-WMF for approval. Clarity on what the shell access is...
[12:25:38] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10jcrespo) Thank you, @AlexisJazz that's useful feedback that without doubt will make our media storage happy- still there are additional technical operations and...
[12:25:57] <moritzm>	 !log imported php-pcov   1.0.6-4+wmf1~bullseye1 to component/php74 for bullseye-wikimedia
[12:25:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:28:46] <wikibugs>	 (03PS4) 10Jbond: admin: add urbanecm to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn)
[12:28:54] <wikibugs>	 (03PS15) 10Aqu: Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532)
[12:29:08] <Amir1>	 I'll try to take a look
[12:29:31] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (10hnowlan)
[12:31:36] <wikibugs>	 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff)
[12:32:32] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management: Allow to store files between 4 and 5 GB - https://phabricator.wikimedia.org/T191804 (10jcrespo) Indeed, the same schema change for production has to be applied to backup metadata, as we mirrored the size from mediawiki as an unsigned int:  https://...
[12:32:38] <wikibugs>	 (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu)
[12:33:21] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (10hnowlan) OOB key verification will be done next week
[12:33:42] <Amir1>	 XioNoX: It's not from outside: https://w.wiki/877Q at least I'm not seeing any. I think it might be analytics again
[12:33:57] <wikibugs>	 (03PS16) 10Aqu: Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532)
[12:34:02] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage
[12:35:01] <wikibugs>	 (03PS5) 10Jbond: admin: add urbanecm to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn)
[12:35:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn)
[12:36:07] <Amir1>	 I'll wait for wmf_flow_internal to catch up and then look at it
[12:36:08] <wikibugs>	 (03CR) 10Aqu: "Thx @Btullis for the pointer. I've switched the strategy from a variable to an `any`." [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu)
[12:37:01] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1115.eqiad.wmnet with reason: host reimage
[12:38:51] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:41:31] <wikibugs>	 10SRE, 10Traffic-Icebox, 10User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (10jbond) @Pppery AFAIK other than blocking empty agent headers on upload (T224891#7182766) no further progress has been made to address the comments i...
[12:43:58] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:45:33] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "I don't think that the production idp is the best place for this; https://idp.wmcloud.org/ or idp-test would be better options.  I think we " [puppet] - 10https://gerrit.wikimedia.org/r/973287 (https://phabricator.wikimedia.org/T350725) (owner: 10Slyngshede)
[12:47:08] <wikibugs>	 (03CR) 10Ayounsi: Fail when setting int relations if PuppetDB parent not found in Netbox (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/973185 (https://phabricator.wikimedia.org/T350479) (owner: 10Cathal Mooney)
[12:48:50] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10hnowlan) 05Open→03Stalled
[12:49:36] <Amir1>	 scratch that, if it's esams it can't be analytics. I need more coffee
[12:54:51] <wikibugs>	 (03PS15) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308
[12:59:31] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1115.eqiad.wmnet with OS bullseye
[12:59:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by fabfur@cumin1001 for host cp1115.eqiad.wmnet with OS bullseye completed: - cp1115 (**PASS**)   - Remo...
[13:01:15] <wikibugs>	 (03CR) 10Jbond: "adding moritz and volans who i think could provide good feedback" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (owner: 10Brouberol)
[13:05:10] <moritzm>	 !log imported php-yaml  2.2.1+2.1.0+2.0.4+1.3.2-2+wmf1~bullseye1 to component/php74 for bullseye-wikimedia
[13:05:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:29] <wikibugs>	 (03CR) 10Btullis: "I like it." [puppet] - 10https://gerrit.wikimedia.org/r/973308 (owner: 10Brouberol)
[13:07:32] <wikibugs>	 (03PS1) 10EoghanGaffney: [apt-staging] Fix apt_staging.yaml, add envoy [puppet] - 10https://gerrit.wikimedia.org/r/973323
[13:09:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:12:13] <wikibugs>	 (03CR) 10Btullis: Generate the netboot.cfg file to avoid typos impacting everyone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973308 (owner: 10Brouberol)
[13:12:44] <wikibugs>	 (03PS1) 10Hnowlan: admin: add ecarg to ldap only users [puppet] - 10https://gerrit.wikimedia.org/r/973324 (https://phabricator.wikimedia.org/T350818)
[13:16:14] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on idp1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[13:16:40] <wikibugs>	 (03CR) 10Cathal Mooney: "I don't think we need to add this policy on the switches actually.  The existing policy/group that the spine's have facing the CRs can be " [homer/public] - 10https://gerrit.wikimedia.org/r/972702 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi)
[13:17:08] <wikibugs>	 (03PS16) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308
[13:18:17] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/389/con" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (owner: 10Brouberol)
[13:18:19] <wikibugs>	 (03CR) 10Tchanders: ipoid: Grant INDEX to ipoid_rw user for ipoid DB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: 10Kosta Harlan)
[13:19:02] <wikibugs>	 (03CR) 10Cathal Mooney: "Looking more closely I was going to say the lack of "from protocol evpn" would be an issue, but as you send a default that doesn't _really" [homer/public] - 10https://gerrit.wikimedia.org/r/972702 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi)
[13:19:04] <wikibugs>	 (03PS2) 10Kosta Harlan: ipoid: Grant INDEX to ipoid_rw user for ipoid DB [puppet] - 10https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114)
[13:19:21] <wikibugs>	 (03CR) 10Kosta Harlan: ipoid: Grant INDEX to ipoid_rw user for ipoid DB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: 10Kosta Harlan)
[13:19:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:21:08] <wikibugs>	 (03PS3) 10Cathal Mooney: Fail when setting int relations if PuppetDB parent not found in Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/973185 (https://phabricator.wikimedia.org/T350479)
[13:21:38] <wikibugs>	 (03CR) 10Brouberol: "Here is a diff between the current and generated netboot.cfg files https://phabricator.wikimedia.org/P53293" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (owner: 10Brouberol)
[13:23:00] <moritzm>	 !log imported php-geoip 1.1.1-7+wmf2+bullseye1 to component/php74 for bullseye-wikimedia
[13:23:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:06] <wikibugs>	 (03Abandoned) 10Ayounsi: Add support for non EVPN switches on spines [homer/public] - 10https://gerrit.wikimedia.org/r/972702 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi)
[13:24:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[13:25:04] <wikibugs>	 (03CR) 10Cathal Mooney: Fail when setting int relations if PuppetDB parent not found in Netbox (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/973185 (https://phabricator.wikimedia.org/T350479) (owner: 10Cathal Mooney)
[13:25:18] <wikibugs>	 (03PS1) 10Ayounsi: Add BGP between spines and SONiC L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/973325 (https://phabricator.wikimedia.org/T335028)
[13:26:39] <wikibugs>	 (03CR) 10Ayounsi: "Example diff on two spines: https://phabricator.wikimedia.org/P53294" [homer/public] - 10https://gerrit.wikimedia.org/r/973325 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi)
[13:26:48] <wikibugs>	 (03PS4) 10Cathal Mooney: Adjust homer templates to support anycast gw with single IP [homer/public] - 10https://gerrit.wikimedia.org/r/973267 (https://phabricator.wikimedia.org/T350579)
[13:27:51] <wikibugs>	 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff)
[13:29:51] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:29:52] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! Longer term we can think about whether to split these, or otherwise change the template/group name to say "sw_external" or something" [homer/public] - 10https://gerrit.wikimedia.org/r/973325 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi)
[13:30:30] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add BGP between spines and SONiC L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/973325 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi)
[13:30:45] <wikibugs>	 (03CR) 10Ayounsi: "I14bb16c8f9d8661953f5cde5a6e18df802b4d957" [homer/public] - 10https://gerrit.wikimedia.org/r/972702 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi)
[13:31:08] <wikibugs>	 (03Merged) 10jenkins-bot: Add BGP between spines and SONiC L3 switches [homer/public] - 10https://gerrit.wikimedia.org/r/973325 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi)
[13:39:51] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:41:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff)
[13:44:44] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[13:45:00] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[13:45:07] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[13:45:36] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[13:45:42] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[13:46:04] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[13:47:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[13:49:16] <wikibugs>	 10SRE, 10serviceops, 10API Platform (RESTbase Deprecation Roadmap), 10Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10lbowmaker)
[13:49:29] <wikibugs>	 10SRE, 10ChangeProp, 10EventStreams, 10Image-Suggestion-API, and 5 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10lbowmaker)
[13:52:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[14:01:31] <wikibugs>	 (03PS1) 10Ayounsi: Add sretest1004 [puppet] - 10https://gerrit.wikimedia.org/r/973349 (https://phabricator.wikimedia.org/T335028)
[14:02:59] <wikibugs>	 (03CR) 10Brouberol: [C: 03+1] Add sretest1004 [puppet] - 10https://gerrit.wikimedia.org/r/973349 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi)
[14:03:01] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/973349 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi)
[14:03:16] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/973349 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi)
[14:03:20] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add sretest1004 [puppet] - 10https://gerrit.wikimedia.org/r/973349 (https://phabricator.wikimedia.org/T335028) (owner: 10Ayounsi)
[14:03:26] <wikibugs>	 10SRE, 10Cloud-VPS, 10cloud-services-team: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10taavi) I think this is fixed now, right?
[14:07:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:11:42] <denisse>	 !log upgrading LibreNMS to 23.10
[14:11:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:38] <logmsgbot>	 !log denisse@deploy2002 Started deploy [librenms/librenms@f049593]: Upgrade LibreNMS to v23.10.0 - T349492
[14:15:48] <logmsgbot>	 !log denisse@deploy2002 Finished deploy [librenms/librenms@f049593]: Upgrade LibreNMS to v23.10.0 - T349492 (duration: 00m 10s)
[14:17:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:19:14] <icinga-wm>	 PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: librenms-alerts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:19:21] <wikibugs>	 (03PS17) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308
[14:21:07] <wikibugs>	 (03PS10) 10Hashar: Plugin to process Puppet Catalog Compiler results [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/969981
[14:21:34] <wikibugs>	 (03CR) 10Brouberol: "as preseed.cfg is a symlink to netboot.cfg, I removed the committed symlink and made the link explicit, via a `file` resource." [puppet] - 10https://gerrit.wikimedia.org/r/973308 (owner: 10Brouberol)
[14:21:48] <icinga-wm>	 RECOVERY - Check systemd state on an-airflow1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:57] <wikibugs>	 10SRE, 10Data-Engineering, 10Traffic: Add a rolled-up cache_status field to druid webrequest_sampled_128 - https://phabricator.wikimedia.org/T319344 (10lbowmaker)
[14:26:16] <wikibugs>	 (03PS3) 10Btullis: Configure statsd_exporter scraping for the analytics prometheus instance [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232)
[14:26:22] <icinga-wm>	 RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:26:59] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] haproxy: re-set varnish maxconn for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/972354 (https://phabricator.wikimedia.org/T310609) (owner: 10Fabfur)
[14:27:21] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973320 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis)
[14:29:57] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10serviceops, 10Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10lbowmaker)
[14:30:12] <wikibugs>	 (03PS3) 10Fabfur: haproxy: re-set varnish maxconn for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/972354 (https://phabricator.wikimedia.org/T310609)
[14:30:59] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10serviceops-radar, 10Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088 (10lbowmaker)
[14:31:29] <wikibugs>	 10SRE, 10Data-Engineering, 10Observability-Logging, 10Wikimedia-Logstash, 10Event-Platform: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (10lbowmaker)
[14:31:52] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10observability, and 3 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (10lbowmaker)
[14:32:14] <wikibugs>	 10SRE-OnFire, 10Data-Engineering, 10serviceops, 10Event-Platform: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10lbowmaker)
[14:32:22] <wikibugs>	 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Data-Engineering, 10Data-Platform-SRE, and 3 others: Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (10lbowmaker)
[14:36:32] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis)
[14:38:53] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:47] <wikibugs>	 (03PS4) 10Btullis: Configure statsd_exporter scraping for the analytics prometheus instance [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232)
[14:40:09] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis)
[14:40:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/973309 (owner: 10Volans)
[14:46:24] <wikibugs>	 (03PS1) 10Ayounsi: Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230)
[14:46:56] <wikibugs>	 (03CR) 10Marostegui: "Why is this needed? The ALTER grant is already enough to create indexes:" [puppet] - 10https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: 10Kosta Harlan)
[14:47:41] <wikibugs>	 (03PS1) 10Marostegui: Revert "dbproxy102[2,4]: Promote db1119 to standby" [puppet] - 10https://gerrit.wikimedia.org/r/973331
[14:48:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) (owner: 10Ayounsi)
[14:48:59] <wikibugs>	 (03PS2) 10Ayounsi: Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230)
[14:49:11] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/973331 (owner: 10Marostegui)
[14:49:40] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy102[2,4]: Promote db1119 to standby" [puppet] - 10https://gerrit.wikimedia.org/r/973331 (owner: 10Marostegui)
[14:53:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) (owner: 10Ayounsi)
[14:53:53] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:55:23] <wikibugs>	 (03PS3) 10Ayounsi: Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230)
[14:57:23] <wikibugs>	 10SRE, 10Data Pipelines, 10Data-Engineering, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10lbowmaker)
[14:58:11] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) (owner: 10Ayounsi)
[14:58:29] <wikibugs>	 (03Abandoned) 10DCausse: Replace Swift native API with S3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/766123 (https://phabricator.wikimedia.org/T302494) (owner: 10ZPapierski)
[14:59:00] <wikibugs>	 (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) (owner: 10Ayounsi)
[15:01:05] <wikibugs>	 (03CR) 10Krinkle: Add "db-mainstash" entry to $wgObjectCaches (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) (owner: 10Aaron Schulz)
[15:01:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) (owner: 10Ayounsi)
[15:02:58] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Promote db1119 to m1 master [puppet] - 10https://gerrit.wikimedia.org/r/973351 (https://phabricator.wikimedia.org/T350022)
[15:03:10] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Not until tuesday" [puppet] - 10https://gerrit.wikimedia.org/r/973351 (https://phabricator.wikimedia.org/T350022) (owner: 10Marostegui)
[15:05:33] <wikibugs>	 (03PS4) 10Ayounsi: Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230)
[15:07:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230) (owner: 10Ayounsi)
[15:09:50] <wikibugs>	 (03PS1) 10EoghanGaffney: [apt-staging] Add dummy key for apt-staging host [labs/private] - 10https://gerrit.wikimedia.org/r/973352
[15:11:23] <wikibugs>	 (03PS5) 10Ayounsi: Add eqiad E/F 5-8 subnets to netboot and network/data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/973350 (https://phabricator.wikimedia.org/T334230)
[15:13:53] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [labs/private] - 10https://gerrit.wikimedia.org/r/973352 (owner: 10EoghanGaffney)
[15:18:31] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+2 C: 03+2] [apt-staging] Add dummy key for apt-staging host [labs/private] - 10https://gerrit.wikimedia.org/r/973352 (owner: 10EoghanGaffney)
[15:22:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[15:23:11] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1062.eqiad.wmnet with OS bookworm
[15:23:19] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bookworm
[15:30:22] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:30:24] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:31:00] <wikibugs>	 (03PS1) 10EoghanGaffney: [apt-staging] Rename apt-staging.d.w cert to fix error [labs/private] - 10https://gerrit.wikimedia.org/r/973353
[15:31:12] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+2 C: 03+2] [apt-staging] Rename apt-staging.d.w cert to fix error [labs/private] - 10https://gerrit.wikimedia.org/r/973353 (owner: 10EoghanGaffney)
[15:31:14] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:31:16] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:32:01] <wikibugs>	 10SRE-swift-storage: Swift container for archived mariadb tables - https://phabricator.wikimedia.org/T350924 (10MatthewVernon) Technically, this is easy - we can make a swift account and away you go.  I don't want to tie anyone up in red tape, but I think it'd be good to have a lightweight process to ensure this...
[15:34:46] <wikibugs>	 10SRE-swift-storage: Swift container for archived mariadb tables - https://phabricator.wikimedia.org/T350924 (10jcrespo) +1
[15:36:01] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1062.eqiad.wmnet with reason: host reimage
[15:38:56] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1062.eqiad.wmnet with reason: host reimage
[15:39:02] <wikibugs>	 (CR) Kosta Harlan: ipoid: Grant INDEX to ipoid_rw user for ipoid DB (1 comment) [puppet] - https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: Kosta Harlan)
[15:39:07] <wikibugs>	 (Abandoned) Kosta Harlan: ipoid: Grant INDEX to ipoid_rw user for ipoid DB [puppet] - https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: Kosta Harlan)
[15:40:54] <wikibugs>	 (CR) Marostegui: ipoid: Grant INDEX to ipoid_rw user for ipoid DB (1 comment) [puppet] - https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: Kosta Harlan)
[15:41:07] <wikibugs>	 (Restored) Marostegui: ipoid: Grant INDEX to ipoid_rw user for ipoid DB [puppet] - https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: Kosta Harlan)
[15:41:41] <wikibugs>	 (CR) Kosta Harlan: ipoid: Grant INDEX to ipoid_rw user for ipoid DB (1 comment) [puppet] - https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: Kosta Harlan)
[15:42:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[15:42:11] <wikibugs>	 (CR) Marostegui: [C: +2] ipoid: Grant INDEX to ipoid_rw user for ipoid DB [puppet] - https://gerrit.wikimedia.org/r/973313 (https://phabricator.wikimedia.org/T305114) (owner: Kosta Harlan)
[15:45:49] <wikibugs>	 (CR) FNegri: [C: +1] "nice, much cleaner!" [puppet] - https://gerrit.wikimedia.org/r/969693 (owner: Majavah)
[15:47:41] <wikibugs>	 (PS2) EoghanGaffney: [apt-staging] Fix apt_staging.yaml, add envoy config and crt [puppet] - https://gerrit.wikimedia.org/r/973323
[15:48:15] <wikibugs>	 (CR) Majavah: [V: +1 C: +2] dynamicproxy: simplify redis replication code [puppet] - https://gerrit.wikimedia.org/r/969693 (owner: Majavah)
[15:49:51] <wikibugs>	 SRE, Cloud-VPS, cloud-services-team: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (fnegri) I think this has been edited to indicate a different problem, that still exists:  >  the systems are using different source addresses...
[15:51:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:53:53] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:54:33] <wikibugs>	 SRE, Cloud-VPS, cloud-services-team: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (fnegri) Or maybe it's easier to create a new task and resolve this one :)
[15:54:34] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1062.eqiad.wmnet with OS bookworm
[15:54:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1062.eqiad.wmnet with OS bookworm completed: - cloud...
[15:56:00] <wikibugs>	 (CR) Jbond: [C: -1] "LGTM -1 is just for the symlink" [puppet] - https://gerrit.wikimedia.org/r/973308 (owner: Brouberol)
[15:56:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:57:24] <wikibugs>	 (CR) Jbond: [C: -1] Generate the netboot.cfg file to avoid typos impacting everyone (1 comment) [puppet] - https://gerrit.wikimedia.org/r/973308 (owner: Brouberol)
[16:03:31] <wikibugs>	 (CR) Btullis: "I fluffed up my gerrit patch splitting, so there is a new two-part configuration change here:" [puppet] - https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) (owner: Btullis)
[16:04:07] <wikibugs>	 (PS1) Andrew Bogott: Put new cloudvirts (cloudvirt1062-1067) online [puppet] - https://gerrit.wikimedia.org/r/973354 (https://phabricator.wikimedia.org/T342537)
[16:04:09] <wikibugs>	 (CR) Btullis: "Updated patch here:" [puppet] - https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) (owner: Btullis)
[16:04:37] <wikibugs>	 (CR) Jbond: "lgtm a few nits inline" [puppet] - https://gerrit.wikimedia.org/r/973323 (owner: EoghanGaffney)
[16:05:00] <wikibugs>	 (Abandoned) Btullis: Enable support for statsd_exporters on non-ops instances [puppet] - https://gerrit.wikimedia.org/r/969143 (https://phabricator.wikimedia.org/T343232) (owner: Btullis)
[16:05:51] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Put new cloudvirts (cloudvirt1062-1067) online [puppet] - https://gerrit.wikimedia.org/r/973354 (https://phabricator.wikimedia.org/T342537) (owner: Andrew Bogott)
[16:06:15] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bookworm
[16:06:19] <wikibugs>	 (CR) Jbond: Ensure that build directories are cleaned up (1 comment) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/973135 (https://phabricator.wikimedia.org/T348974) (owner: Slyngshede)
[16:06:30] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS b...
[16:10:12] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bookworm
[16:10:14] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS bookworm
[16:10:25] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS b...
[16:10:30] <wikibugs>	 (CR) Cathal Mooney: Change 'anycast_gw' var in int config to represent type of IRB needed (1 comment) [software/homer/deploy] - https://gerrit.wikimedia.org/r/971937 (https://phabricator.wikimedia.org/T350579) (owner: Cathal Mooney)
[16:10:35] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS b...
[16:11:50] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1066.eqiad.wmnet with OS bookworm
[16:11:51] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bookworm
[16:12:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS b...
[16:12:05] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS b...
[16:15:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[16:16:12] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Change 'anycast_gw' var in int config to represent type of IRB needed [software/homer/deploy] - https://gerrit.wikimedia.org/r/971937 (https://phabricator.wikimedia.org/T350579) (owner: Cathal Mooney)
[16:18:17] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage
[16:20:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:20:54] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage
[16:22:09] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage
[16:22:13] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage
[16:23:40] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1067.eqiad.wmnet with reason: host reimage
[16:24:03] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage
[16:24:55] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage
[16:25:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:25:17] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage
[16:27:44] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage
[16:29:44] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[16:29:54] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1067.eqiad.wmnet with reason: host reimage
[16:30:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[16:31:39] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[16:31:57] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[16:33:52] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add cloud-private subnet entries for new cloudvirt hosts - cmooney@cumin1001"
[16:34:43] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add cloud-private subnet entries for new cloudvirt hosts - cmooney@cumin1001"
[16:34:43] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:36:12] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.wipe-cache cloudvirt1062.private.eqiad.wikimedia.cloud on all recursors
[16:36:16] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudvirt1062.private.eqiad.wikimedia.cloud on all recursors
[16:36:36] <wikibugs>	 (CR) Jcrespo: [C: +1] mariadb: Promote db1119 to m1 master [puppet] - https://gerrit.wikimedia.org/r/973351 (https://phabricator.wikimedia.org/T350022) (owner: Marostegui)
[16:37:05] <wikibugs>	 (CR) Jcrespo: [C: +1] Revert "bacula: Add cloudservices2004-dev (openldap) to the monitoring ignoring" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/962207 (https://phabricator.wikimedia.org/T339894) (owner: FNegri)
[16:38:16] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1067.eqiad.wmnet with OS bookworm
[16:38:19] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1063.eqiad.wmnet with OS bookworm
[16:38:23] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1065.eqiad.wmnet with OS bookworm
[16:38:27] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1066.eqiad.wmnet with OS bookworm
[16:38:50] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1067.eqiad.wmnet with OS bookworm
[16:38:55] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bookworm
[16:38:58] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bookworm
[16:39:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm
[16:39:04] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1065.eqiad.wmnet with OS bookworm
[16:39:12] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bookworm
[16:39:17] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1066.eqiad.wmnet with OS bookworm
[16:39:26] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bookworm
[16:39:30] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1064.eqiad.wmnet with OS bookworm
[16:39:33] <wikibugs>	 (CR) Jcrespo: "FYI: arnaud. This is something that should be deployed, but needs nursing, as the problem is not puppet, but the changes that would be nee" [puppet] - https://gerrit.wikimedia.org/r/868392 (owner: Jcrespo)
[16:39:39] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bookworm
[16:39:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm
[16:43:23] <wikibugs>	 (PS1) Hnowlan: rest-gateway: add params to config, rework citoid path matching [deployment-charts] - https://gerrit.wikimedia.org/r/973362 (https://phabricator.wikimedia.org/T329049)
[16:47:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:49:48] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.4 - cmooney@cumin1001
[16:51:26] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Release v0.6.4 - cmooney@cumin1001
[16:52:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:53:25] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage
[16:53:56] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage
[16:54:05] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage
[16:54:16] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1067.eqiad.wmnet with reason: host reimage
[16:54:31] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage
[16:55:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:56:11] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1065.eqiad.wmnet with reason: host reimage
[16:58:16] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (fnegri)
[16:58:58] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage
[16:58:58] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1067.eqiad.wmnet with reason: host reimage
[16:59:02] <wikibugs>	 SRE, Cloud-VPS, cloud-services-team: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (fnegri) Open→Resolved a:fnegri I have created {T350995} for the problem that still exists, and I'm marking this task as resolved.
[16:59:16] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage
[17:00:15] <wikibugs>	 SRE, Cloud-VPS, cloud-services-team: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (fnegri)
[17:01:30] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1066.eqiad.wmnet with reason: host reimage
[17:03:42] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:11:37] <icinga-wm>	 ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 Andrew Bogott reimage in progress https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:12:04] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:14:32] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:16:14] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on idp1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[17:24:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[17:25:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:33:16] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1064.eqiad.wmnet with OS bookworm
[17:33:18] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1063.eqiad.wmnet with OS bookworm
[17:33:25] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm executed with erro...
[17:33:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm executed with erro...
[17:33:38] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bookworm
[17:33:43] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1063.eqiad.wmnet with OS bookworm
[17:33:46] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm
[17:33:50] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm
[17:47:26] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage
[17:48:28] <topranks>	 !log withdrawing IPv6 prefixes announced to AS1299 in esams to troubleshoot connectivity problem report
[17:48:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:45] <wikibugs>	 (PS1) Jcrespo: sql: Migrate mediabackups metadata size from int to bigint [software/mediabackups] - https://gerrit.wikimedia.org/r/973364 (https://phabricator.wikimedia.org/T191804)
[17:49:47] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage
[17:50:02] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1063.eqiad.wmnet with reason: host reimage
[17:50:55] <wikibugs>	 (CR) Jcrespo: "This should be the only code-ish related change needed- as in memory we use python integer values, which are unbounded AFAIK." [software/mediabackups] - https://gerrit.wikimedia.org/r/973364 (https://phabricator.wikimedia.org/T191804) (owner: Jcrespo)
[17:51:27] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1065.eqiad.wmnet with OS bookworm
[17:51:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1065.eqiad.wmnet with OS bookworm completed: - cloud...
[17:52:42] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1067.eqiad.wmnet with OS bookworm
[17:52:49] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1067.eqiad.wmnet with OS bookworm completed: - cloud...
[17:53:04] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage
[17:54:52] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1066.eqiad.wmnet with OS bookworm
[17:54:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1066.eqiad.wmnet with OS bookworm completed: - cloud...
[18:03:21] <jinxer-wm>	 (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:04:05] <bvibber>	 !log brion adding more vp9 backfill to the transcode runs on mwmaint2002 (requeueTranscodes -> job queue runners). Should increase load on transcode scaler job runners but not elsewhere
[18:04:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[18:13:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[18:24:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:29:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:38:16] <wikibugs>	 (CR) Dzahn: admin: add urbanecm to stewards-users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: Dzahn)
[18:38:21] <jinxer-wm>	 (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:41:04] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:47:28] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1063.eqiad.wmnet with OS bookworm
[18:47:36] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1063.eqiad.wmnet with OS bookworm completed: - cloud...
[18:53:53] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:57:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[19:12:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[19:20:54] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:22:38] <wikibugs>	 (PS1) Zoranzoki21: throttle.php: Cleanup old rules, add new one [mediawiki-config] - https://gerrit.wikimedia.org/r/973369 (https://phabricator.wikimedia.org/T351002)
[19:44:52] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:53:16] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[19:53:53] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:04:07] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1064.eqiad.wmnet with OS bookworm
[20:04:14] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm
[20:22:33] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage
[20:25:18] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1064.eqiad.wmnet with reason: host reimage
[20:28:20] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1062 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[20:36:11] <wikibugs>	 (PS17) Aqu: Send metrics from Airflow analytics test [puppet] - https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532)
[20:36:46] <wikibugs>	 (CR) CI reject: [V: -1] Send metrics from Airflow analytics test [puppet] - https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: Aqu)
[20:39:00] <wikibugs>	 (CR) Brian Wolff: [C: +1] "next week is fine. Honestly, it will probably be a little while before stuff actually happens in production, definitely more than a week." [software/mediabackups] - https://gerrit.wikimedia.org/r/973364 (https://phabricator.wikimedia.org/T191804) (owner: Jcrespo)
[20:40:03] <wikibugs>	 (CR) Aqu: [V: +1] "PCC SUCCESS (DIFF 2 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: Aqu)
[20:45:18] <wikibugs>	 (PS18) Aqu: Send metrics from Airflow analytics test [puppet] - https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532)
[20:51:52] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1064.eqiad.wmnet with OS bookworm
[20:51:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm completed: - cloud...
[20:56:58] <wikibugs>	 (CR) Aqu: "I've added in this patch the mappings to customize the metrics for Prometheus." [puppet] - https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: Aqu)
[21:00:32] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1064.eqiad.wmnet with OS bookworm
[21:00:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[62-67] - https://phabricator.wikimedia.org/T342537 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1064.eqiad.wmnet with OS bookworm completed: - cloud...
[21:16:14] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on idp1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:19:08] <icinga-wm>	 PROBLEM - Host mr1-esams is DOWN: PING CRITICAL - Packet loss = 90%, RTA = 6686.60 ms
[21:19:20] <icinga-wm>	 RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 82.08 ms
[21:25:14] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on apt-staging2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:42:06] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1064 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[22:14:30] <wikibugs>	 (PS3) EoghanGaffney: [apt-staging] Fix apt_staging.yaml, add envoy config and crt [puppet] - https://gerrit.wikimedia.org/r/973323
[22:15:00] <wikibugs>	 (CR) EoghanGaffney: [apt-staging] Fix apt_staging.yaml, add envoy config and crt (3 comments) [puppet] - https://gerrit.wikimedia.org/r/973323 (owner: EoghanGaffney)
[22:31:20] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:53:22] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:53:52] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:53:53] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:54:38] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:55:08] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.306 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:01:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[23:31:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[23:34:56] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1061.eqiad.wmnet with OS bookworm
[23:46:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1023:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[23:48:54] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1061.eqiad.wmnet with reason: host reimage
[23:51:26] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1061.eqiad.wmnet with reason: host reimage
[23:53:53] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure