[06:30:40] volans: should we start defining cookbooks owners in the header description or a dedicated variable?
[07:24:43] (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:38:48] netbox, Infrastructure-Foundations, Puppet-Core: Make netbox the source of truth for cloudceph networks - https://phabricator.wikimedia.org/T338329 (ayounsi) Longer timeframe but {T325531} could potentially simplify the setup even more. If I understand correctly this is not about driving the server'...
[07:57:39] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (ayounsi) @JAllemandou, I was wondering if you had a time-frame for this or need anything from SRE.
[08:02:03] netops, Infrastructure-Foundations, SRE: Peering: prefer primary IXP for direcly connected networks - https://phabricator.wikimedia.org/T338201 (cmooney) Thanks for this one. While not ideal I think probably option 2 / adding DIRECT_PEER_PRIMARY is gonna be best. Is getting a little complex, but at...
[08:45:34] XioNoX: it might be overkill, most cookbooks are inside an sre.$topic namespace that has a clear team ownership
[09:40:30] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin2002 for host puppetserver2001.codfw.wmnet with OS bookworm completed: - puppetserver2001...
[09:45:58] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (JAllemandou) Thank you for reminding me, I had forgotten about this task @ayounsi. We can prioritize the work, the details we'll need are: - precise names...
[10:04:43] (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:04:43] (SystemdUnitFailed) firing: (4) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:25:06] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero) I just had a conversation with @cmooney about this, with result being: * we will move the setup to 1x BGP VIP...
[11:28:32] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (cmooney) @aborrero that makes sense. For the auth dns service we need to patch the dns repo to update the IPs for 'ns0'...
[11:39:05] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero) Created https://netbox.wikimedia.org/ipam/ip-addresses/13309/
[12:09:43] (SystemdUnitFailed) firing: (4) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:20:18] if anyone has feedback for the test-cookbook script here's some improvement
[12:20:21] https://gerrit.wikimedia.org/r/c/operations/puppet/+/927803
[12:40:51] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (cmooney) Open→Resolved All links have now been migrated. Massive thanks to @Jclark-ctr for all the work on site!
[12:41:49] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (cmooney) All links have now been successfully migrated. All row E/F connectivity is now flowing via Spine switches ssw1-e1-eqiad...
[12:42:13] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (cmooney) Open→Resolved
[12:45:32] SRE-tools, netops, Infrastructure-Foundations, SRE: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (cmooney) @Papaul I've merged that change and pushed to the mr routers, so hopefully if you try again the ZTP cookbook will work.
[13:04:43] (SystemdUnitFailed) firing: (4) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:09:43] (SystemdUnitFailed) firing: (4) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:54:05] XioNoX, volans: I am getting an issue running the netbox dns cookbook
[13:54:09] https://phabricator.wikimedia.org/P49275
[13:54:31] I'm in the middle of deleting some IPs
[13:54:38] so maybe it's because of that
[13:54:45] but it shouldn't fail like that anyway
[13:54:56] yep it may be related
[13:55:38] it could be another error. this is for a new include file so it's one of those slightly messy to deploy ones
[13:56:41] XioNoX: let me know when you're done and I'll retry anyway
[13:56:54] btw on my previous run it removed a bunch of stuff related to your change - https://phabricator.wikimedia.org/P49275#199306
[13:57:24] ok
[13:58:12] er, the "- cloudsw2-c8-eqiad 1H IN A 10.65.1.197" is a bit too early, probably because I changed the switch's status
[13:58:29] sorry should have checked with you
[13:58:44] nah it's ok
[13:58:47] I blindly told it to continue as I knew you were removing the switch
[13:58:48] topranks: checking
[13:59:13] topranks: I'm done with the dns changes
[13:59:29] next run should re-add the mgmt entry
[13:59:43] topranks: let me know how it goes and I'll dig more in depth if it fails again
[13:59:43] (SystemdUnitFailed) firing: (4) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:59:51] volans: ok sure I'll re-run now
[14:04:43] (SystemdUnitFailed) firing: (4) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:08:24] I deleted 2 more IPs
[14:08:25] XioNoX: mgmt ip added back ok for that switch, and it added it back to hiera as it's now active. so no issue with your changes
[14:08:33] cool, thx
[14:08:38] topranks: all good?
[14:08:38] so what's up with that error
[14:08:39] >
[14:08:41] ?
[14:08:47] volans: to do the above I had to set 172.20.254.1/32 to 'reserved' again
[14:08:56] when I set it to active it hits that error
[14:09:02] https://netbox.wikimedia.org/ipam/ip-addresses/13309/
[14:09:21] must be something wrong I'm not seeing
[14:09:22] it might be because of the prefix settings
[14:09:27] not a pool and a container
[14:09:35] either of those might be related
[14:09:45] I would need to check the logic that filters the prefixes
[14:10:09] yeah this is under wikimediacloud.org
[14:11:00] although there are some there, for instance the public IP it's replacing:
[14:11:02] 32-27.153.80.208.in-addr.arpa:47 1H IN PTR ns-recursor0.openstack.codfw1dev.wikimediacloud.org.
[14:13:56] topranks: that prefix has no site
[14:14:01] so
[14:14:02] [prefix for prefix in matching_prefixes if self.netbox.prefixes[prefix].site]
[14:14:05] is empty
[14:14:18] that one is my fault
[14:15:33] topranks: you should get another diff for cloudsw2, this time it's fine to remove it
[14:15:42] I zeroized the switch
[14:15:45] XioNoX: ok thanks
[14:18:05] volans: worked fine that time, thanks for your help I'll keep an eye out for that in future
[14:18:21] XioNoX: removed the mgmt IP, two irb ints and the reverses looks ok for cloudsw2-c8
[14:18:23] I agree the error could be more clear :D
[14:18:32] topranks: yep
[14:25:08] oh ffs now CI is failing for my authdns patch cos there are no IPs assigned from 2620:0:861:fe10::/64 anymore and thus no netbox generated file
[14:25:36] leave it with me
[14:26:28] that's when the atomically... section of the docs is useful :D
[14:26:35] topranks: https://wikitech.wikimedia.org/wiki/DNS/Netbox#Atomically_deploy_auto-generated_records_and_a_manual_change
[14:28:15] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host cloudswift1002.eqiad.wmnet with OS bulls...
[14:31:22] XioNoX: https://gerrit.wikimedia.org/r/c/operations/dns/+/928563
[14:31:40] topranks: ah right...
[14:33:19] +1
[14:33:37] thanks
[14:35:22] Puppet, Infrastructure-Foundations, SRE, Patch-For-Review: Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (jbond) Just adding a note here that i needed to do the following to get puppet to work on the CA. this relates to the fact that we have separate ssl directories to supp...
[14:39:50] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (ayounsi) Sounds good, thanks! > precise names of the fields in the data (we can look for this in realtime in the data when it starts flowing) Sure, is it sa...
[16:14:43] (SystemdUnitFailed) firing: (4) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:19:43] (SystemdUnitFailed) firing: (4) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:20:35] netops, Data-Engineering, Infrastructure-Foundations, SRE, Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (JArguello-WMF)
[17:30:26] netops, Data-Engineering, Infrastructure-Foundations, SRE, Patch-For-Review: Netflow/pmacct: use forwardingStatus - https://phabricator.wikimedia.org/T331707 (JAllemandou) >> precise names of the fields in the data (we can look for this in realtime in the data when it starts flowing) > Sure,...
[20:19:43] (SystemdUnitFailed) firing: (3) debmonitor-maintenance-gc.service Failed on debmonitor2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
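Note on the prefix filter quoted at 14:14:02: the list comprehension keeps only matching prefixes that have a site, so a prefix with no site yields an empty list and the cookbook fails later with an unclear error. The following is only a rough sketch of that behaviour and of how a clearer error might look. It is not the cookbook's actual code: the `Prefix` class, `NoSitePrefixError`, `prefixes_with_site`, and the documentation-range prefixes are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Prefix:
    """Hypothetical stand-in for a Netbox prefix record."""
    cidr: str
    site: Optional[str]  # None models a prefix with no site assigned


class NoSitePrefixError(Exception):
    """Hypothetical error raised when no matching prefix has a site."""


def prefixes_with_site(matching_prefixes: List[str],
                       prefixes: Dict[str, Prefix]) -> List[str]:
    """Apply the same filter as the quoted comprehension, but fail loudly."""
    selected = [p for p in matching_prefixes if prefixes[p].site]
    if not selected:
        # Assumption: an explicit message here is one way to make the error clearer.
        raise NoSitePrefixError(
            f"none of the matching prefixes {sorted(matching_prefixes)} has a site set"
        )
    return selected


# Illustrative data only (RFC 5737 documentation ranges, made-up site name).
prefixes = {
    "192.0.2.0/24": Prefix("192.0.2.0/24", site=None),
    "198.51.100.0/24": Prefix("198.51.100.0/24", site="example-site"),
}

print(prefixes_with_site(["198.51.100.0/24"], prefixes))
try:
    prefixes_with_site(["192.0.2.0/24"], prefixes)
except NoSitePrefixError as exc:
    print(f"error: {exc}")
```

With a check like this, the failure would point at the missing site on the prefix rather than surfacing further down the run.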