[01:03:46] (SystemdUnitFailed) firing: (2) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:53:21] (ProbeDown) firing: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:58:21] (ProbeDown) resolved: (2) Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:03:47] (SystemdUnitFailed) firing: (2) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:01:13] netops, Infrastructure-Foundations, SRE, Traffic: Do we need ping offload servers at all POPs? - https://phabricator.wikimedia.org/T345809 (MoritzMuehlenhoff) >>! In T345809#9168091, @cmooney wrote: > Do we have any way to measure its impact? I had a quick look at available Prometheus metrics a...
[07:08:33] (SystemdUnitFailed) firing: (2) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:37:07] netops, Infrastructure-Foundations: Renumber esams-eqiad GTT link - https://phabricator.wikimedia.org/T346421 (ayounsi)
[08:05:07] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Configure ECMP hashing function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 (cmooney) @ayounsi I'm in two minds as to whether it makes sense to make this change for the EVPN switches. In terms of the traffic between s...
[08:34:33] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Configure ECMP hashing function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 (ayounsi) Good point! That was done before the VXLAN deployment to have more predictability on the anycast traffic to the end servers. If we...
[08:40:57] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Configure ECMP hashing function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 (cmooney) >>! In T339852#9169159, @ayounsi wrote: > If we can't have different behavior for vxlan vs. servers it seems more important to me th...
[08:43:36] netops, DC-Ops, Infrastructure-Foundations, SRE: Access port speed <= 100Mbps False positives - https://phabricator.wikimedia.org/T336511 (ayounsi) Open→Resolved a:ayounsi I removed the alert as it was being problematic in {T346317} as well.
[08:44:01] netops, Infrastructure-Foundations, Observability-Metrics, SRE, SRE Observability (FY2023/2024-Q1): Alert "access port speed less 100mbit" and librenms upgrade - https://phabricator.wikimedia.org/T346317 (ayounsi) Open→Resolved a:ayounsi Thanks, that's related to {T336511} and I j...
[08:47:34] netbox, netops, Infrastructure-Foundations, SRE: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (ayounsi) @jbond from Juniper:
[08:48:44] moritzm or slyngs, as John is away, would you have the answer to Juniper's questions: https://phabricator.wikimedia.org/T306238#9169186 ?
[08:49:43] I think ID Tokens are a bit old school, but I believe we can enable it in CAS
[08:51:38] It could also be that it just needs to be encrypted
[08:56:02] netbox, netops, Infrastructure-Foundations, SRE: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (SLyngshede-WMF) We might need to change the Juniper configuration in CAS from: ` "supportedResponseTypes": [ "java.util.HashSet", [ "code"...
[08:57:21] slyngs: could you take care of it or is it a John thing? I think he is away for one more week
[08:57:50] I feel pretty confident in mangling CAS :-)
[08:58:14] I'll just wrap up another patch and then give it a go
[09:15:23] netbox, Infrastructure-Foundations, SRE, cloud-services-team, User-aborrero: netbox: add support for cloud-private subnet in server network provisioning automation - https://phabricator.wikimedia.org/T346428 (aborrero)
[09:15:39] netbox, Infrastructure-Foundations, SRE, cloud-services-team, User-aborrero: netbox: add support for cloud-private subnet in server network provisioning automation - https://phabricator.wikimedia.org/T346428 (aborrero)
[09:28:22] netbox, Infrastructure-Foundations, SRE, cloud-services-team, User-aborrero: netbox: add support for cloud-private subnet in server network provisioning automation - https://phabricator.wikimedia.org/T346428 (aborrero)
[09:28:54] netbox, Infrastructure-Foundations, SRE, cloud-services-team, User-aborrero: netbox: add support for cloud-private subnet in server network provisioning automation - https://phabricator.wikimedia.org/T346428 (aborrero) a:aborrero→cmooney
[10:08:33] (SystemdUnitFailed) firing: (3) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:13:33] (SystemdUnitFailed) firing: (5) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:46:23] netops, Infrastructure-Foundations, SRE, User-aborrero: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (aborrero) Open→Resolved
[12:49:57] netops, Infrastructure-Foundations, SRE, User-aborrero: Configure eqiad cloudsw devices to support cloud-private - https://phabricator.wikimedia.org/T341223 (cmooney)
[13:25:43] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Configure ECMP hashing function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 (cmooney) Added another patch above as on the QFX5100 you need to explicitly set the "hash mode" to layer2-payload (i.e. IP header), otherwise...
[13:58:07] netops, Infrastructure-Foundations, Observability-Metrics, SRE, SRE Observability (FY2023/2024-Q1): Alert "access port speed less 100mbit" and librenms upgrade - https://phabricator.wikimedia.org/T346317 (fgiunchedi) Sweet, thank you @ayounsi!
[14:14:36] (SystemdUnitFailed) firing: (5) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:28:29] netbox, Infrastructure-Foundations, SRE, cloud-services-team, User-aborrero: netbox: add support for cloud-private subnet in server network provisioning automation - https://phabricator.wikimedia.org/T346428 (aborrero)
[14:37:07] netbox, Infrastructure-Foundations, SRE, cloud-services-team, User-aborrero: netbox: add support for cloud-private subnet in server network provisioning automation - https://phabricator.wikimedia.org/T346428 (cmooney) Thanks for the task @aborrero Yeah the goal here will be to extend the [[...
[15:54:25] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Renumber esams-eqiad GTT link - https://phabricator.wikimedia.org/T346421 (cmooney) Open→Resolved a:cmooney
[18:16:31] (SystemdUnitFailed) firing: (5) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:16:31] (SystemdUnitFailed) firing: (5) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed