[00:01:42] (SystemdUnitFailed) firing: dump_cloud_ip_ranges.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:42] (SystemdUnitFailed) firing: (2) dump_cloud_ip_ranges.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:31:42] (SystemdUnitFailed) firing: (3) httpbb_hourly_appserver.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:31:42] (SystemdUnitFailed) firing: (3) httpbb_hourly_appserver.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:11:42] (SystemdUnitFailed) firing: (3) ifup@ens13.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:51:42] (SystemdUnitFailed) firing: (3) ferm.service Failed on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:28] netbox, Infrastructure-Foundations: Upgrade Netbox to latest 3.2 - https://phabricator.wikimedia.org/T314933 (ops-monitoring-bot) Deployed netbox to netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: Release v3.2.9-wmf2 to production - volans@cumin1001 - T314933
[08:11:42] (SystemdUnitFailed) firing: (4) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:16:56] netbox, Infrastructure-Foundations: Upgrade Netbox to latest 3.2 - https://phabricator.wikimedia.org/T314933 (ops-monitoring-bot) Deployed netbox to netbox2002.codfw.wmnet,netbox1002.eqiad.wmnet with reason: Release v3.2.9-wmf2 to production - volans@cumin1001 - T314933
[08:28:26] netbox, Infrastructure-Foundations: Upgrade Netbox to latest 3.2 - https://phabricator.wikimedia.org/T314933 (Volans) Open→Resolved p:Triage→Medium a:Volans Production is now upgraded to 3.2.9, the latest of the 3.2 series. Resolving.
[09:16:43] (SystemdUnitFailed) firing: (3) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:17:13] (DiskSpace) firing: Disk space netbox1002:9100:/ 5.531% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=netbox1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:51:18] SRE-tools, netbox, DNS, Infrastructure-Foundations: Enforce Netbox domain names without period termination - https://phabricator.wikimedia.org/T306809 (ayounsi) Open→Resolved a:ayounsi Done using Netbox validators.
[09:51:47] netbox, Infrastructure-Foundations: Netbox: use Custom Model Validation - https://phabricator.wikimedia.org/T310590 (ayounsi) Validators are now live in prod!
[09:56:05] netbox, Infrastructure-Foundations, SRE, Traffic-Icebox: Make Netbox Active/Active - https://phabricator.wikimedia.org/T234997 (ayounsi)
[09:56:36] netbox, Infrastructure-Foundations, serviceops-radar, Patch-For-Review: Netbox and Redis - https://phabricator.wikimedia.org/T311385 (ayounsi) Open→Resolved a:ayounsi This has been completed some time ago.
[10:25:04] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=12105eb2-e5ac-4f19-9896-9ba53e1acd48) set by cmooney@cumin1001 f...
[10:37:09] jbond, XioNoX: the sre.netbox.update-extras cookbook might be a bit overkill in downtiming and validating icinga status (that is never clear due to failing reports), thoughts?
[10:40:39] volans: I'd agree, seems a bit overkill
[10:41:03] we don't do it when releasing netbox :D (but there I think we might want to)
[10:41:08] *to add it
[10:42:29] lol ack
[10:44:44] I see you did it with the batch classes
[10:44:54] now I get it why all those failsafes :D
[10:47:03] ahh yes I think that may have been me (mis)directing things to the batch base class
[10:49:00] it's an interesting use case :)
[10:50:50] and not wrong per se, I was actually thinking if it would make sense to rewrite the deploy code one that way
[10:54:32] yes I think with the dynamic actions supporting additional use cases is easier, but we may need to adapt the base class so implementors can opt out of certain things, e.g. icinga checks
[10:54:51] or we just accept that some cookbooks will perform more checks than needed \o/
[10:55:42] yeah, debating with myself about that :D
[11:41:34] Hello! I have a VM in ganeti that was shut down manually last week (but not decommissioned). I want to decommission it now -- is it best for me to turn it back on and then run the cookbook, or should I decom it manually instead?
[11:43:33] eoghan: interesting use case, I'd say let's try to run the decom cookbook on a shut-down VM and see what happens, it might also just work
[11:44:14] in case it doesn't we can re-iterate, the decom is idempotent, can be run multiple times
[11:44:17] safely
[11:44:27] Ah great. Wasn't sure of that.
[11:44:29] I'll kick it off now
[11:44:49] * volans curious of the outcome
[11:45:36] like I'm not sure if the shutdown fails or is happy it's already shut down
[11:46:24] it might spit out: Failed to shutdown VM, manually run gnt-instance remove on...
[11:47:00] but shouldn't prevent it from continuing I think
[11:47:46] eoghan: lol I just found a typo in the logged message, thanks
[11:49:38] I'll fix it after discovering the behaviour with your run
[12:09:39] Hm, one of the diffs that popped up is `diff --git a/hosts/ldap-rw1001.yaml b/hosts/ldap-rw1001.yaml`. Is it possible something else just got switched on while this was in progress?
[12:15:01] volans: That's all done.
[12:15:30] eoghan: for ldap moritzm was reimaging it AFAIK
[12:15:32] and from SAL
[12:16:00] eoghan: nice, any error?
[12:16:04] Nope!
[12:16:07] nice
[12:29:35] eoghan: yeah, I'm installing the new slapd hosts currently, all benign
[12:29:50] Cool cool, I just got surprised because it wasn't my machine :D
[13:15:49] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: use Custom Model Validation - https://phabricator.wikimedia.org/T310590 (ayounsi)
[13:16:43] (SystemdUnitFailed) firing: (3) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:17:13] (DiskSpace) firing: Disk space netbox1002:9100:/ 5.399% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=netbox1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:18:25] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: use Custom Model Validation - https://phabricator.wikimedia.org/T310590 (ayounsi)
[13:19:20] * volans looking at netbox1002 disk space alert
[13:20:12] netbox, Infrastructure-Foundations, Patch-For-Review: Netbox: use Custom Model Validation - https://phabricator.wikimedia.org/T310590 (ayounsi) Regarding the `port/interfaces names`, unlike interface names, the power and console ports name don't change often and haven't alerted regularly, so I sugge...
[13:22:13] (DiskSpace) resolved: Disk space netbox1002:9100:/ 5.398% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=netbox1002 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[13:22:33] cleared old scap deploy cache
[13:27:48] is there specific documentation on the switch upgrade maintenances in wikitech? NBD if not, I've just been asked to document our steps and was going to add it to the page if one exists (haven't found one so far)
[13:38:38] Puppet, Cloud-VPS, Infrastructure-Foundations, cloud-services-team: git-sync-upstream failing - https://phabricator.wikimedia.org/T336263 (Andrew)
[14:26:20] Puppet, Cloud-VPS, Infrastructure-Foundations, cloud-services-team: git-sync-upstream failing - https://phabricator.wikimedia.org/T336263 (Andrew) Open→Resolved
[14:43:58] Puppet, Cloud-VPS, Infrastructure-Foundations, cloud-services-team: git-sync-upstream failing - https://phabricator.wikimedia.org/T336263 (Andrew) Resolved→Open That patch gets us the much-more-helpful ` fatal: error: cannot combine '--rebase-merges' with '--strategy-option' `
[15:18:41] netbox, Infrastructure-Foundations: Upgrade Netbox to 3.5.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[15:19:05] netbox, Infrastructure-Foundations: Upgrade Netbox to 3.5.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[15:21:10] netbox, Infrastructure-Foundations: Upgrade Netbox to 3.5.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[15:21:50] netbox, Infrastructure-Foundations: Upgrade Netbox to 3.5.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[15:27:33] netbox, Infrastructure-Foundations: Upgrade Netbox to 3.5.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[16:02:15] inflatador: https://wikitech.wikimedia.org/wiki/Service_restarts is the best thing I can think of (it's more for reboots but talks about taking things out of service)
[16:06:55] jbond ACK, thanks
[17:16:43] (SystemdUnitFailed) firing: (3) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:19:32] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (BTullis)
[18:29:14] netops, Infrastructure-Foundations, SRE: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (cmooney) Open→Resolved Applied to all devices now.
[18:29:20] netops, Infrastructure-Foundations, SRE, cloud-services-team: Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney)
[19:29:22] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (bking)
[19:31:36] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (bking)
[19:33:01] netops, DBA, Data-Platform-SRE, Infrastructure-Foundations, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (bking)
[21:16:43] (SystemdUnitFailed) firing: (3) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed