[03:35:35] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:49:17] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:35:35] FIRING: [2x] SystemdUnitFailed: geoip_update_ipinfo.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:50:35] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:49:17] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:19:45] slyngs: fyi, netbox-more-metrics is live on netbox-next: https://netbox-next.wikimedia.org/plugins/more-metrics/metrics/
[07:38:57] Nice, I'll see if I can set up a few metrics this week
[09:10:00] Packaging, Infrastructure-Foundations: Package ipxe-qemu - https://phabricator.wikimedia.org/T369136#9968313 (elukey) Thanks @aborrero! If we go for the direct upload to reprepro please let's create a Wikitech page explaining the procedure, so we can have a way to redo it in the future (if needed).
[09:16:07] netbox, Infrastructure-Foundations: Netbox : sync src/ submodule - https://phabricator.wikimedia.org/T369690 (ayounsi) NEW
[09:28:07] netbox, Infrastructure-Foundations: Netbox : sync src/ submodule - https://phabricator.wikimedia.org/T369690#9968433 (SLyngshede-WMF) I just tested installing Netbox using pip from Git and the repo/project setup isn't really designed for it. Maybe we could get a patch in that fixes pyproject/setuptools....
[09:47:59] netbox, Infrastructure-Foundations: Netbox : sync src/ submodule - https://phabricator.wikimedia.org/T369690#9968498 (SLyngshede-WMF) This isn't all of it, but maybe we can design a patch for the pyproject.toml in Netbox, which would allow it to be installed using pip. The following is still missing stu...
[10:15:40] Packaging, Infrastructure-Foundations: Package ipxe-qemu - https://phabricator.wikimedia.org/T369136#9968563 (ayounsi) https://wikitech.wikimedia.org/w/index.php?title=Ganeti&diff=2204519&oldid=2203166
[10:45:22] hi folks, what are the current PKI guidelines for WMCS projects?
[10:46:02] I'm trying to fix the puppetization of the cache instances in the traffic WMCS project, and right now profile::cache::purge depends unconditionally on profile::pki::get_cert()
[10:46:35] do we need to use the PKI infra living on the pki WMCS project? do we need to spawn our own?
[10:49:17] FIRING: [2x] SystemdUnitFailed: geoip_update_ipinfo.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:15:07] answered Valentin in pvt
[11:15:25] (didn't see this query before writing directly)
[11:16:40] the TL;DR is that it is sufficient to define profile::pki::client::auth_key (with the value in the WMCS pki project) as a local commit in the target cloud environment's puppetserver
[11:16:53] and then get_client() will fetch certs from the pki project
[11:18:37] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635#9968801 (cmooney) I think the work on this can be done in tandem with the review of the setup in {T367203}. Off my head an OSPF/IBGP design simi...
[11:21:49] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9968809 (Marostegui) @cmooney got to be closed?
[11:37:17] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 22.4R3 - https://phabricator.wikimedia.org/T364092#9968852 (cmooney)
[11:42:02] elukey: does that mean that it's supported only if there is a local puppetserver in the project?
[11:49:38] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Include vlans with defined IRB int in device vlans even if no port present - https://phabricator.wikimedia.org/T366348#9968898 (cmooney) Open→Resolved
[11:51:55] netops, Infrastructure-Foundations, SRE: Adjust IBGP route-reflector spine/leaf automation to support separate client clusters - https://phabricator.wikimedia.org/T364103#9968908 (cmooney) Open→Resolved Closing task - is a duplicate, work was completed under T365169
[12:41:55] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9969045 (cmooney) Open→Resolved >>! In T365995#9968809, @Marostegui wrote: > @cmooney got to be closed...
[12:48:34] netops, Infrastructure-Foundations, SRE, Traffic: Support PyBal routes announced with lower priority than "backup" - https://phabricator.wikimedia.org/T354839#9969092 (ssingh) a:ssingh
[13:08:35] volans: in theory yes, at least this is the only way I managed to make it work.. with the big assumption that we want to keep the auth-id a secret
[13:09:26] if we had something like "batman" then the local puppet master wouldn't be needed, and we might think about it in cloud.. but it feels weird
[13:10:01] ack, I was thinking about hiera config on horizon as a second-best option
[13:53:54] volans: topranks: hello! you remember once we had that issue where someone forgot to run the Netbox cookbook and authdns-update was borked in the middle of an emergency depool?
[13:54:15] I am working on T369366 and I wanted to revisit that. I remember you guys had notes somewhere about that but I am not sure where
[13:54:16] T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366
[13:54:21] can you remind me again please? thanks <3
[13:54:42] basically, I want to "try to fix" that issue as part of this task
[13:56:16] sukhe: I'll try to have a dig around later, I'm trying to remember
[13:56:25] np not urgent for sure, I should have said that
[13:56:29] it was probably some ipv6 reverse include or something
[13:59:16] yeah and an empty PTR file but I think you both discussed something unless I am mistaken and I wanted to see what that was before I dig on my own
[14:00:11] basically, once we move to confctl for this, we will be running authdns-local-update without user review anytime the state changes and we don't want to be in a position where Netbox DNS is borked, so we should either try to fix that or alert for it somehow
[14:00:23] but that's on me -- you don't need to worry about that but I wanted to see the notes if there were any
[14:08:56] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9969455 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9ca0faf1-4b9d-4345-9bb8-9c7153e17163) se...
[14:11:12] Packaging, Infrastructure-Foundations, Patch-For-Review, SRE Observability (FY2024/2025-Q1): upgrade prometheus-ipmi-exporter to 1.8.0 - https://phabricator.wikimedia.org/T368088#9969474 (fgiunchedi)
[14:32:39] topranks: I think I found it https://docs.google.com/document/d/1lwgiSgrbFapRjFvQlqeU8Zk0LVljTI90SJGG-TFg5gE/edit#heading=h.terxoaoy0a9s [April 24 notes]
[14:33:01] and https://phabricator.wikimedia.org/T362985
[14:33:16] direct link: https://docs.google.com/document/d/1lwgiSgrbFapRjFvQlqeU8Zk0LVljTI90SJGG-TFg5gE/edit#heading=h.ntnypbvmz2p1
[14:33:26] ty!
[14:49:17] FIRING: [2x] SystemdUnitFailed: geoip_update_ipinfo.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:09:14] Packaging, Infrastructure-Foundations: Package ipxe-qemu - https://phabricator.wikimedia.org/T369136#9969795 (ayounsi) Thanks @aborrero indeed ! I added the packages to bookworm-wikimedia on the APT repo and linked to that task from the Ganeti doc in case we need to redo it. Keeping the task open in cas...
[15:23:39] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9969926 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=5386f05e-734c-49b0-a4c5-1acbef4c187a) se...
[15:24:15] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9969929 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9475b2b6-bc5f-41f8-97d1-970eb62b38bc) se...
[15:45:40] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9970099 (cmooney) Switch upgraded successfully and all hosts back online/pinging. Thanks everyone for the assista...
[15:46:55] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9970119 (ABran-WMF) db1190 repooling dbproxy reloaded everything looks OK
[15:52:19] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9970127 (Eevans) ms-fe1012 repooled, and everything looks good.
[18:49:17] FIRING: [2x] SystemdUnitFailed: geoip_update_ipinfo.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:08:05] CAS-SSO, Infrastructure-Foundations: Login attempts from bd808 get 500 on Debmonitor - https://phabricator.wikimedia.org/T369205#9971402 (bd808) I have a strong hunch that https://turnilo.wikimedia.org/ returning me HTTP 431 Request Header Fields Too Large responses is the same underlying issue.
[22:49:17] FIRING: [2x] SystemdUnitFailed: geoip_update_ipinfo.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:54:17] FIRING: [2x] SystemdUnitFailed: geoip_update_ipinfo.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:04:17] RESOLVED: [2x] SystemdUnitFailed: geoip_update_ipinfo.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed