[03:30:00] (SystemdUnitFailed) firing: httpbb_hourly_appserver.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:30:00] (SystemdUnitFailed) resolved: httpbb_hourly_appserver.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:19:25] netbox 3.6.0 is officially out
[06:21:38] and finally: Tags may now be restricted to use with designated object types. \o/
[06:22:08] we might even start to consider using them at this point
[06:22:26] full release notes at https://github.com/netbox-community/netbox/releases/tag/v3.6.0
[07:14:38] so many useful features compared to what we're running
[07:48:49] netops, Infrastructure-Foundations, SRE: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (ayounsi) FYI there is now a pending diff for: ` [edit forwarding-options dhcp-relay] + /* T337345 */ + forward-snooped-clients non-...
[07:50:06] CAS-SSO, Data-Platform-SRE, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene)
[07:55:29] CAS-SSO, Data-Platform-SRE, Infrastructure-Foundations: Switch DataHub authentication to OIDC - https://phabricator.wikimedia.org/T305874 (Stevemunene)
[08:24:44] netbox, netops, Infrastructure-Foundations, SRE: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (ayounsi) @jbond from Juniper, does it make sens? > “If the customer would like to use OIDC they enter in their token for us to use and authenticate. The vast majority of users sign...
[08:26:57] XioNoX: yep, we should plan to upgrade next Q
[08:27:12] cc jobo FYI :)
[09:18:57] (SystemdUnitFailed) firing: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:23:40] netbox, netops, Infrastructure-Foundations, SRE: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (jbond) >>! In T306238#9132987, @ayounsi wrote: > @jbond from Juniper, does it make sens? >> “If the customer would like to use OIDC they enter in their token for us to use and authe...
[09:30:00] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:33:57] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-api-int_hourly.service Failed on cumin1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:38:59] netops, Infrastructure-Foundations, SRE: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (cmooney) >>! In T345273#9132938, @ayounsi wrote: > FYI there is now a pending diff for: > ` > [edit forwarding-options dhcp-relay] > +...
[10:28:00] topranks: I don't see your commit to labs/private; the last commit is from 3 days ago https://gerrit.wikimedia.org/g/labs/private
[10:28:20] never mind, I see the change is just not merged
[10:28:32] jbond: yep that's it
[10:35:50] jbond: thanks for following up / reviews :)
[10:35:59] no probs
[10:36:17] topranks: FYI I created the pcc report using `./utils/pcc -p 953971 953674 auto`
[10:36:37] the -p switch tells pcc to use a specific change from the private repo
[10:36:48] it's not well tested or documented
[10:37:09] but can be useful for some edge cases
[10:40:36] ok thanks good tip
[10:46:00] topranks: finally getting somewhere! https://grafana.wikimedia.org/d/iUATvNzSz/network-queues?orgId=1&var-device=lsw1-e2-eqiad.mgmt.eqiad.wmnet (cc godog)
[10:48:26] XioNoX: oh wow nice!
[10:48:46] that will be a mega help, had a quick look, will need to dig into the data but seems to be exactly what we need :)
[10:50:44] plenty left to do but you can see what's currently exposed using `prometheus1005:~$ curl netflow1002:9804/metrics`
[11:00:09] ok nice will have a look
[11:27:01] netops, Infrastructure-Foundations, SRE: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (cmooney) Open→Resolved
[11:27:06] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (cmooney)
[11:30:33] SRE-tools, Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (jbond) p:Triage→Medium
[11:35:25] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (jbond)
[11:38:23] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337 (jbond) Also worth noting that version >= 6 are not currently working with spicerack (T328775)
[12:08:57] (SystemdUnitFailed) resolved: update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:15:00] (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:18:57] (SystemdUnitFailed) firing: (3) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:28:57] (SystemdUnitFailed) firing: (2) netbox_ganeti_drmrs01_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:30:00] (SystemdUnitFailed) resolved: (2) netbox_ganeti_drmrs01_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:08:51] netops, Infrastructure-Foundations: Adjust routing policy to increase SSH session speed from East Asia to toolforge - https://phabricator.wikimedia.org/T334530 (ayounsi) Open→Resolved Rolled everywhere, another example, cr1-codfw: `name=before Prefix Nexthop MED Lclpref AS path...
[13:56:25] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322 (ayounsi) We have data https://grafana.wikimedia.org/d/iUATvNzSz/network-queues ! And a doc: https://wikitech.wikimedia.org/wiki/Netwo...
[14:43:52] and now with the interface descriptions! https://grafana.wikimedia.org/d/iUATvNzSz/network-queues?orgId=1&from=now-1h&to=now&viewPanel=4 (thanks godog !)
[14:44:36] \o/ \o/ \o/
[15:27:40] XioNoX: nice
[15:28:17] SRE-tools, Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (Fabfur)
[15:47:17] SRE-tools, Infrastructure-Foundations: Cookbooks could be more verbose in listing the completed/missing steps - https://phabricator.wikimedia.org/T345375 (Fabfur)
[15:51:58] SRE-tools, Infrastructure-Foundations, Spicerack: Cookbook should ask for confirmation at beginning of execution - https://phabricator.wikimedia.org/T345370 (Volans) For context some cookbooks that deems what they are doing dangerous already do that, for example the aforementioned `sre.hosts.reimage`...
[16:01:47] SRE-tools, Infrastructure-Foundations: Cookbooks could be more verbose in listing the completed/missing steps - https://phabricator.wikimedia.org/T345375 (Volans) Improving the cookbook outputs and readability of it is surely always a great idea. I'm not sure though what are you proposing as actionable....
[16:45:30] netbox, netops, Infrastructure-Foundations, SRE: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (ayounsi) Thanks, I submitted the on-boarding form, let's see what happens now.