[02:00:02] (NodeTextfileStale) firing: Stale textfile for puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:00:43] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[06:00:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:14:04] (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:14:09] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on debmonitor2003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[07:00:43] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[08:54:35] CFSSL-PKI, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[09:33:35] CFSSL-PKI, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (MoritzMuehlenhoff)
[10:00:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:14:05] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on debmonitor2003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:15:06] (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:35:22] CFSSL-PKI, Ganeti, Infrastructure-Foundations, Patch-For-Review: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (MoritzMuehlenhoff)
[10:43:25] netops, Infrastructure-Foundations, SRE: Add 4x10G breakout cable to cr2-esams - https://phabricator.wikimedia.org/T347323 (ayounsi) Open→Resolved Ports freed up in T347403
[11:00:43] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[11:10:19] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:11:32] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[12:52:16] SRE-tools, Dumps-Generation, Infrastructure-Foundations, serviceops, and 2 others: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (akosiaris) @Volans All of these (which can be grouped in just 2 categories, **mw** and **mc**, have be...
[12:56:22] SRE-tools, Dumps-Generation, Infrastructure-Foundations, serviceops, and 2 others: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (Volans) @akosiaris sure, and having a cluster deemed as *not* IPv6 ready is totally ok. The problem arise...
[13:57:04] FYI, I'm installing postgres security updates on netbox db hosts
[14:00:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:19:44] SRE-tools, Dumps-Generation, Infrastructure-Foundations, serviceops, and 2 others: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (akosiaris) >>! In T271142#9378778, @Volans wrote: > @akosiaris sure, and having a cluster deemed as *not*...
[14:21:20] moritzm: ack
[14:22:08] all done already :-)
[14:33:27] :D
[15:00:44] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[16:09:24] netbox, Infrastructure-Foundations: https://netbox-exports.wikimedia.org/dns.git is very slow to clone (using dumb HTTP) - https://phabricator.wikimedia.org/T276403 (joanna_borun) p:Triage→Low
[16:15:05] SRE-tools, Ganeti, Infrastructure-Foundations: Make Spicerack cookbook to resize ganeti VM - https://phabricator.wikimedia.org/T219454 (MoritzMuehlenhoff) Open→Declined This is a rare operation and basically only requires to run a straight-forward CLI command (followed by running sre.ganeti.r...
[16:15:07] SRE-tools, Ganeti, Infrastructure-Foundations: Cookbooks for Ganeti maintenance tasks - https://phabricator.wikimedia.org/T283319 (MoritzMuehlenhoff)
[16:17:27] SRE-tools, homer, Infrastructure-Foundations: Add Homer support to Cookbooks - https://phabricator.wikimedia.org/T265342 (ayounsi) Open→Invalid Hello past me, not needed anymore.
[16:17:52] slyngs: https://phabricator.wikimedia.org/T311052
[16:18:08] Mail, Infrastructure-Foundations: Investigate problems with Wikimedia emails sent to op.pl domains - https://phabricator.wikimedia.org/T256199 (MoritzMuehlenhoff) Sorry for the late followup, is this still an issue?
[16:19:45] SRE-tools, DC-Ops, Infrastructure-Foundations: Tracking task for DCOps privileged commands - https://phabricator.wikimedia.org/T233685 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This was handled in various other tasks.
[16:20:19] netbox, Infrastructure-Foundations: Graph reports status in Prometheus - https://phabricator.wikimedia.org/T262898 (ayounsi) See also T311052
[16:20:29] SRE-tools, Infrastructure-Foundations, Python3-Porting: Puppet: forbid new Python2 code - https://phabricator.wikimedia.org/T197804 (joanna_borun) Open→Invalid
[16:22:29] SRE-tools, Infrastructure-Foundations, Python3-Porting: Python2: track Py2 softwares - https://phabricator.wikimedia.org/T197803 (MoritzMuehlenhoff) Open→Declined Bookworm no longer includes Python 2 at all and in Bullseye Python gets uninstalled unless one sets an explicit Hiera flag to keep...
[16:22:53] SRE-tools, Infrastructure-Foundations, Python3-Porting: Puppet: forbid new Python2 code - https://phabricator.wikimedia.org/T197804 (MoritzMuehlenhoff) Bookworm no longer includes Python 2 at all and in Bullseye Python gets uninstalled unless one sets an explicit Hiera flag to keep it (pybal e.g.), w...
[16:23:25] Mail, Infrastructure-Foundations, MediaWiki-extensions-BounceHandler: Valid email address was unconfirmed for temporary spam blacklisting - https://phabricator.wikimedia.org/T99444 (joanna_borun) Open→Invalid
[16:24:55] Mail, Infrastructure-Foundations: Blacklist root@tools.wmflabs.org for out-of-office notifications from wikimedia.org - https://phabricator.wikimedia.org/T151153 (joanna_borun) Open→Declined
[16:30:35] netbox, Infrastructure-Foundations: Netbox: Add rack/U and asset tag fields to AssignIP script - https://phabricator.wikimedia.org/T267219 (ayounsi) Open→Declined Not needed anymore.
[16:34:07] SRE-tools, Infrastructure-Foundations, Orchestrator: Add database host removal from Orchestrator to sre.hosts.decommission cookbook - https://phabricator.wikimedia.org/T287954 (Volans) p:Triage→Low @Marostegui is this request still valid/needed? If we are going to add these steps I would need...
[16:35:20] SRE-tools, Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (Volans) It seems that the current defaults are generally working fine. @fgiunchedi have you encountered any specific issue in the last ~2y that still requ...
[16:35:30] SRE-tools, Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (Volans) p:Triage→Low
[16:38:32] netbox, Infrastructure-Foundations: https://netbox-exports.wikimedia.org/dns.git is very slow to clone (using dumb HTTP) - https://phabricator.wikimedia.org/T276403 (Volans) Another option could be to: * Migrate the private copy on the netbox hosts to the git server * Do a shallow clone on the dns hosts...
[18:00:19] (NodeTextfileStale) firing: Stale textfile for puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:00:50] jhathaway: expected? ^^^
[18:01:23] volans: no, let me take a look
[18:01:51] thx
[18:06:08] volans, this is leftover from a patch that was reverted, so I am going to remove the file
[18:06:36] ack, thanks
[18:10:02] (NodeTextfileStale) resolved: Stale textfile for puppetserver1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:00:44] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[20:05:24] (MDRAIDFailedDisk) resolved: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[20:19:22] (MDRAIDNotEnoughDisks) firing: (2) MD RAID - insufficient active disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDNotEnoughDisks
[20:39:22] (MDRAIDNotEnoughDisks) resolved: (2) MD RAID - insufficient active disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDNotEnoughDisks