[05:40:41] netops, Infrastructure-Foundations, SRE: Configure bgp-error-tolerance on Juniper routers - https://phabricator.wikimedia.org/T340111 (akosiaris) FYI, same mitigation applies to https://supportportal.juniper.net/s/article/2023-08-29-Out-of-Cycle-Security-Bulletin-Junos-OS-and-Junos-OS-Evolved-A-craft...
[06:33:49] netops, Infrastructure-Foundations, SRE: xe-3/2/1: down -> Transport: cr1-esams:xe-0/0/7 (Lumen, BDFS2448 80ms 10Gbps wave) {#2013} - https://phabricator.wikimedia.org/T345138 (ops-monitoring-bot) ===== Automated diagnostic for Netbox circuit ID 33 --- **Interface cr1-esams:xe-0/0/7** - admin-status...
[06:35:48] netops, Infrastructure-Foundations, SRE: xe-3/2/1: down -> Transport: cr1-esams:xe-0/0/7 (Lumen, BDFS2448 80ms 10Gbps wave) {#2013} - https://phabricator.wikimedia.org/T345138 (ayounsi) Open→Resolved a:ayounsi RFO sent by email.
[07:22:31] Puppet, netbox, Infrastructure-Foundations, SRE, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (ayounsi) > In other words what is its usage today? One use case that I'm aware of is Icinga not alerting for hosts unreach...
[08:34:42] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Configure bgp-error-tolerance on Juniper routers - https://phabricator.wikimedia.org/T340111 (cmooney) Agreed this seems to make sense, and Juniper are advising it: https://supportportal.juniper.net/s/article/2023-08-29-Out-of-Cycle-Secu...
[08:41:51] (ProbeDown) firing: (2) Service idm1001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#idm1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:46:51] (ProbeDown) firing: (4) Service idm1001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:58:09] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (aborrero)
[10:08:26] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322 (ayounsi) @cmooney I came across https://www.juniper.net/documentation/us/en/software/junos/interfaces-telemetry/topics/ref/statement/...
[10:10:18] netops, Infrastructure-Foundations, SRE: Configure bgp-error-tolerance on Juniper routers - https://phabricator.wikimedia.org/T340111 (ayounsi) Open→Resolved a:ayounsi All done.
[10:10:23] netops, Infrastructure-Foundations, SRE: Configure bgp-error-tolerance on Juniper routers - https://phabricator.wikimedia.org/T340111 (ayounsi) Relevant: https://blog.benjojo.co.uk/post/bgp-path-attributes-grave-error-handling
[10:45:59] netops, Infrastructure-Foundations, SRE: Use mgmt_junos on all network devices - https://phabricator.wikimedia.org/T327862 (ayounsi) Resolved→Open Re-opening as the fasw got upgraded since, so we can enable `mgmt_junos`
[11:21:51] (ProbeDown) firing: (4) Service idm1001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:33:44] slyngs: are you aware of this ^^
[11:34:13] Yes, I'm currently trying to work out why it thinks that idm1001 is down, but not idm2001
[11:42:31] slyngs: I'm not sure why there is no alert for idm2001. However, I think the reason we see the alert on idm1001 is because the check is configured with body_regex_matches => ['Bitu'],
[11:43:08] Yes, but I fixed that earlier: https://gerrit.wikimedia.org/r/c/operations/puppet/+/953570/2/modules/profile/manifests/idm.pp
[11:43:14] and the following returns no results
[11:43:15] curl --connect-to idm1001.wikimedia.org "https://idm.wikimedia.org/wikimedia/login/?next=/" | grep Bitu
[11:43:41] ahh sorry, I had an old checkout
[11:43:52] I'd check if the blackbox checks follow redirects
[11:43:58] It would make sense that it's the 200 and not 302
[11:44:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/953585
[11:44:53] yes that lgtm
[11:44:53] slyngs: the idm-django-settings.erb hunk crept in from a different patch
[11:45:10] ohh yes sorry missed that bit
[11:45:10] The other way around... but yes
[11:47:45] Let's do this instead: https://gerrit.wikimedia.org/r/c/operations/puppet/+/953589
[11:48:13] looks good, +1d
[11:51:36] Now I do wonder why it worked with "Bitu" ... Anyway
[11:51:51] (ProbeDown) resolved: (2) Service idm1001:443 has failed probes (http_idm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#idm1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:03:07] netops, Infrastructure-Foundations, SRE: Use mgmt_junos on all network devices - https://phabricator.wikimedia.org/T327862 (ayounsi) Open→Resolved Nevermind, still doesn't work on the fasw.
[12:12:46] SRE-tools, netops, Infrastructure-Foundations, SRE: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (ayounsi) Before running homer, the cookbook needs to call the `sre.network.tls` cookbook with the device's name as parameter to add the TLS cert...
[12:33:56] (SystemdUnitFailed) firing: gnmic.service Failed on netflow4002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:53:29] SRE-tools, Infrastructure-Foundations, SRE: sre.ganeti.makevm: Create machine types - https://phabricator.wikimedia.org/T344972 (Volans) Which types are you referring to? Do you mean to define some standard setup for VMs? Where would those be defined?
[12:58:01] SRE-tools, Infrastructure-Foundations, SRE: sre.ganeti.makevm: Create machine types - https://phabricator.wikimedia.org/T344972 (MoritzMuehlenhoff) Basically when creating a new bastion one wouldn't need to look up the current config, but would be able to simply pass --type bastion which would be a s...
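The 11:42 to 11:44 exchange about `body_regex_matches` and redirects can be illustrated with a small model of how a probe might judge a response. This is only a sketch, not the blackbox exporter's code: the `Response` class, `probe_ok`, and the sample bodies are hypothetical names and data made up for illustration. It shows one possible reading of the discussion, namely that a check requiring a body match fails on a bare 302 unless either redirects are followed or the probe is told to expect the 302 itself.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Response:
    """Hypothetical stand-in for what a probe sees from one HTTP request."""
    status: int
    body: str = ""


def probe_ok(resp: Response,
             valid_status: Sequence[int] = (200,),
             body_regex: Optional[str] = None) -> bool:
    """Return True if the response passes the status and optional body checks."""
    if resp.status not in valid_status:
        return False
    if body_regex is not None and re.search(body_regex, resp.body) is None:
        return False
    return True


# Assumed sample data: a redirect with an empty body, and the page it leads to.
redirect = Response(status=302, body="")
login_page = Response(status=200, body="<title>Bitu</title>")

# Redirect not followed: a body-regex check cannot match an empty 302 body.
print(probe_ok(redirect, body_regex="Bitu"))
# Redirect followed: the final page is checked instead.
print(probe_ok(login_page, body_regex="Bitu"))
# Alternative: accept the 302 itself and drop the body check.
print(probe_ok(redirect, valid_status=(302,)))
```

Which of the two working variants fits better depends on whether the probe is meant to verify the login page content or only that the service answers.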
[12:58:56] (SystemdUnitFailed) resolved: gnmic.service Failed on netflow4002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:38:15] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322 (ayounsi) I rolled the certificate to all the cloudsw, cr, and asw devices. I enabled gnmic on all the cloudsw and asw devices. I conf...
[13:47:46] moritzm: for T344972 ("sre.ganeti.makevm: Create machine types") where were you planning to store the types? I'm asking because we do have server "default configs" in netbox but using the model but that can't be used in the virtualization side of netbox
[13:47:47] T344972: sre.ganeti.makevm: Create machine types - https://phabricator.wikimedia.org/T344972
[13:48:04] so either we hardcode that in the cookbook that is a bit meh or I'm not sure where
[13:48:38] * volans will no mention netbox tags :D
[13:48:40] *not
[13:48:45] it could just be a config file maintained via Puppet
[13:48:58] which gets read by the cookbook and if that file isn't present, no types are offered
[14:00:10] sure we could do that but it's all a bit manual and the operator can easily just forget about it and create a host of the same "type" with different specs
[14:04:01] obviously any option can be forgotten to be used, but it's simply a pragmatic fix for a common use case and I don't see any real downside to
[14:04:05] to it
[14:09:22] sure sure I just wanted to understand the context/idea
[14:16:32] ack, I'll add a more detailed proposal to the task before I start to work on it
[14:24:03] ack, thx
[15:51:43] netops, Infrastructure-Foundations, SRE: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (cmooney) p:Triage→Medium
[15:52:02] netops, Infrastructure-Foundations, SRE: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (cmooney)
[15:52:07] SRE-tools, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Setup zero touch provisioning (ZTP) for network devices - https://phabricator.wikimedia.org/T336485 (cmooney)
[15:53:18] netops, Infrastructure-Foundations, SRE: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (cmooney)
[15:59:00] netops, Infrastructure-Foundations, SRE: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (cmooney)
[16:02:46] netops, Infrastructure-Foundations, SRE: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (cmooney)
[16:12:13] netops, Infrastructure-Foundations, SRE: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (cmooney)
[16:57:04] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (ayounsi) Could we use `forward-only` everywhere once we move to DHCP option 97 with {T304677}?
[17:01:39] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Juniper ZTP fails on certain devices due to DHCP binding on management router - https://phabricator.wikimedia.org/T345273 (cmooney) >>! In T345273#9131609, @ayounsi wrote: > Could we use `forward-only` everywhere once we move to DHCP opti...
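The config-file idea from the 13:48 messages (a file maintained via Puppet, read by the cookbook, with no types offered when the file is absent) could look roughly like the sketch below. Everything concrete here is an assumption for illustration: the JSON format, the example path, the field names `vcpus`, `memory_gb` and `disk_gb`, and the helper names do not come from the task or from the actual `sre.ganeti.makevm` cookbook.

```python
import json
from pathlib import Path


def load_machine_types(path):
    """Load machine type presets from a JSON file.

    Returns an empty dict when the file does not exist, so that a caller
    simply offers no types in that case.
    """
    types_file = Path(path)
    if not types_file.is_file():
        return {}
    with types_file.open(encoding="utf-8") as handle:
        return json.load(handle)


def spec_for(types, type_name):
    """Return the preset for type_name, or raise ValueError if it is unknown."""
    try:
        return types[type_name]
    except KeyError:
        known = ", ".join(sorted(types)) or "none"
        raise ValueError(f"unknown type {type_name!r} (available: {known})")


# Hypothetical path; on a host without the file this yields no types.
machine_types = load_machine_types("/etc/example/machine_types.json")
print(sorted(machine_types))
```

A cookbook could feed `sorted(machine_types)` into the choices of a `--type` argument, which keeps the option hidden or empty on hosts where Puppet has not deployed the file.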
[18:30:00] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:28:56] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-api-ext_hourly.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed