[00:38:27] (SystemdUnitFailed) firing: netbox_ganeti_codfw_test_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:04:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[04:38:27] (SystemdUnitFailed) firing: netbox_ganeti_codfw_test_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:04:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[08:38:27] (SystemdUnitFailed) firing: netbox_ganeti_codfw_test_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:26:20] SRE-tools, Infrastructure-Foundations, Spicerack, Patch-For-Review: ServiceLVS without monitor breaks spicerack - https://phabricator.wikimedia.org/T339243 (Volans) We had only a couple of changes in the service.yaml schema in the last months and both were sent to Spicerack before hitting product...
[09:30:19] SRE-tools, Infrastructure-Foundations, Spicerack: Spicerack: adapt conftool module for etcd v3 - https://phabricator.wikimedia.org/T352153 (Volans)
[09:31:47] SRE-tools, Infrastructure-Foundations, Spicerack: Spicerack: migrate distributed locking to etcd v3 - https://phabricator.wikimedia.org/T352155 (Volans)
[09:41:18] SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team: [spicerack] Add remote command output to log file - https://phabricator.wikimedia.org/T347093 (Volans) Open→Declined As there is already a workaround to do that in the cookbooks on demand and it will be even simpler wi...
[09:53:40] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (Volans) p:Triage→Low As the main blocker was resolved giving more permissions to the bot in T314917, setting the priority lower fo...
[09:54:13] SRE-tools, Infrastructure-Foundations, Spicerack: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() - https://phabricator.wikimedia.org/T311050 (JMeybohm) Sorry, I must have missed the message. Yes, IIRC that is the correct interpretation.
[10:04:38] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[10:05:07] SRE-tools, Infrastructure-Foundations, Spicerack: Allow to dry_run RemoteHosts.wait_reboot_since() and PuppetHosts.wait_since() - https://phabricator.wikimedia.org/T311050 (Volans) p:Triage→Medium Perfect, thanks for the update.
[10:11:07] SRE-tools, Infrastructure-Foundations, SRE, Spicerack: Investigate converting LBRemoteCluster cookbooks to SRELBBatchRunnerBase - https://phabricator.wikimedia.org/T318787 (Volans) As all the above cookbooks are already listed in T317855 I'm resolving this as duplicate.
[10:11:29] SRE-tools, Infrastructure-Foundations, SRE, Spicerack: Investigate converting LBRemoteCluster cookbooks to SRELBBatchRunnerBase - https://phabricator.wikimedia.org/T318787 (Volans)
[10:11:33] SRE-tools, Infrastructure-Foundations, SRE, Spicerack: Migrate existing cookbooks related to rolling restarts/reboots to SREBatchBase - https://phabricator.wikimedia.org/T317855 (Volans)
[11:18:21] SRE-tools, Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (Volans) I had a quick thought about the ENC++ problem as you have named it and I think in the end given a netbox device object (hostname + location + eventually other da...
[11:33:04] netops, Infrastructure-Foundations: cr2-esams Transit Tele2 down - https://phabricator.wikimedia.org/T352163 (Volans)
[11:33:24] netops, Infrastructure-Foundations: cr2-esams Transit Tele2 down - https://phabricator.wikimedia.org/T352163 (ops-monitoring-bot) ===== Automated diagnostic for Netbox interface ID cr2-esams:xe-0/1/2 --- **Interface cr2-esams:xe-0/1/2** - admin-status: up - ⚠️ oper-status: down - interface-flapped: 20...
[11:33:38] XioNoX, topranks: FYI I've opened T352163
[11:33:38] T352163: cr2-esams Transit Tele2 down - https://phabricator.wikimedia.org/T352163
[11:33:51] netops, Infrastructure-Foundations: cr2-esams Transit Tele2 down - https://phabricator.wikimedia.org/T352163 (cmooney) Thanks @volans, I'll have a look and reach out to them.
[11:34:23] volans: thanks!
[11:34:26] looking at it now
[11:34:35] oh you've already replied on task, that was quick :D
[11:34:54] I'll triage it as high
[11:34:56] netops, Infrastructure-Foundations: cr2-esams Transit Tele2 down - https://phabricator.wikimedia.org/T352163 (Volans) p:Triage→High
[11:35:02] as AFAIUI it's part of the list of those we care
[11:35:30] yep, it's not super-important to us but all dedicated transits we care about definitely
[11:36:02] I was following the wikitech runbook :D
[11:38:22] thanks for looking
[11:47:09] topranks: what did you do? we got a recovery in icinga
[11:47:47] volans: eh nothing at all, I'm actually trying to fix an issue with my laptop that's stopping me getting on to the CR !
[11:47:55] ahahahah
[11:48:47] will take a look now in a moment
[11:56:42] netops, Infrastructure-Foundations: cr2-esams Transit Tele2 down - https://phabricator.wikimedia.org/T352163 (cmooney) Open→Resolved a:cmooney Port seems to have come back up while I was trying to sort out a v6 issue on my laptop: ` Nov 28 11:22:52 cr2-esams mib2d[18462]: SNMP_TRAP_LINK_DOWN...
[12:09:57] (SystemdUnitFailed) resolved: netbox_ganeti_codfw_test_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:13:59] (SystemdUnitFailed) firing: netbox_ganeti_codfw_test_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:28:59] (SystemdUnitFailed) resolved: netbox_ganeti_codfw_test_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:31:16] SRE-tools, Infrastructure-Foundations: Abstract a bit more the server provisioning process - https://phabricator.wikimedia.org/T351891 (ayounsi) In my mind, trying to be too automatic or too smart here will only cause edge cases issues and complexity to troubleshoot. And a too big project to implement. A...
[13:44:41] is anyone planning to do fleet-wide deploys with debdeploy today? I have a new minor version of wmflib that I would like to push everywhere whenever is a good time (no hurry at all) (cc moritzm, slyngs)
[13:46:43] No sorry, I can build a fake release of something :-)
[13:51:25] lol
[13:51:52] wmf_dummy_1.0.1101~10.deb :-)
[13:57:19] volans: I can update it now
[13:57:36] moritzm: ???
[13:57:58] I meant if I can deploy, checking if you were already deploying other things to avoid conflicts
[13:59:30] ah, ok, that wasn't clear to me, I thought you wanted me to deploy it alongside some other update, go ahead then
[13:59:45] the opposite :D
[13:59:50] sorry for the confusion
[14:04:57] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[15:23:19] CFSSL-PKI, Ganeti, Infrastructure-Foundations: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (MoritzMuehlenhoff) The Ganeti version we currently run lacks support for chained certs in rapid. This was implemented in https://github.com/ganeti/ganeti/pull/1625 which I have b...
[15:36:39] CFSSL-PKI, Ganeti, Infrastructure-Foundations: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (MoritzMuehlenhoff)
[15:53:54] CFSSL-PKI, Ganeti, Infrastructure-Foundations: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (MoritzMuehlenhoff)
[18:04:57] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[22:04:57] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk