[00:38:16] netops, Infrastructure-Foundations, ops-codfw, SRE: codfw: use old asw switches from row A and B as msw switches in row C and D - https://phabricator.wikimedia.org/T361871#9792816 (Papaul) Open→Resolved All the old mgmt switch are back in place
[00:53:02] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:23:02] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:48:02] FIRING: [3x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:48:02] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:36:11] Juniper released an advisory in regards to their OpenSSH version (since those automated vulnerability scanners detect that 7.5 what they use is outdated).
their descriptions are all in line with our previous estimations wrt OpenSSH updates when these were listed in the quarterly updates:
[07:36:12] https://supportportal.juniper.net/s/article/2024-05-Reference-Advisory-Junos-OS-and-Junos-OS-Evolved-Multiple-CVEs-reported-in-OpenSSH?language=en_US
[08:51:48] cool thanks for the heads up moritz
[08:53:06] agree we seem to be ok on most of them, user enumeration potentially but we are not hiding that anyway
[08:54:55] all those are in puppet.git anyway...
[08:54:57] I'll be upgrading serpens (LDAP server in codfw) to bullseye in a bit, should not be noticeable in practice since practically all LDAP requests are going against the replicas
[08:55:30] in practice most deployments will only have one root user anyway, so enumeration is kinda moot anyway :-)
[09:28:36] netops, Infrastructure-Foundations, SRE: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#9793592 (ayounsi) @cmooney what do you think of duplicating the other POPs allocation scheme? For example looking at eqiad as example, keep 2a02:ec80:a000::/40 as "reserved for future growth" Then...
[09:30:06] serpens upgraded to bullseye
[10:48:02] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:59:37] netops, Infrastructure-Foundations, SRE: Cloud IPv6 subnets - https://phabricator.wikimedia.org/T187929#9793972 (cmooney) >>! In T187929#9793592, @ayounsi wrote: > @cmooney what do you think of duplicating the other POPs allocation scheme? > For example looking at eqiad as example, keep 2a02:ec80:a00...
[13:38:43] Mail, collaboration-services, Infrastructure-Foundations, Znuny, Patch-For-Review: Clean up OTRS/Znuny addresses handles by gsuite - https://phabricator.wikimedia.org/T284145#9794649 (LSobanski) a:LSobanski
[13:39:26] Mail, collaboration-services, Infrastructure-Foundations, Znuny, Patch-For-Review: Clean up OTRS/Znuny addresses handles by gsuite - https://phabricator.wikimedia.org/T284145#9794646 (LSobanski) p:Medium→Low
[14:48:02] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:08:36] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Extend BGP peer automation via Netbox to include VMs - https://phabricator.wikimedia.org/T364480#9795483 (ops-monitoring-bot) Deployed homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.6.5 update to add modified...
[16:08:42] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9795486 (Papaul)
[16:09:21] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9795489 (Papaul)
[16:10:53] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9795496 (Papaul)
[16:34:22] tried to run the capirca netbox script with new timeout of 15 min...
but of course
[16:34:24] An exception occurred: JobTimeoutException: Task exceeded maximum timeout value (900 seconds)
[16:37:33] something more complex going on obviously
[16:37:48] on netbox-next it just completed in 35 seconds
[16:38:42] topranks: sorry I can't have a look right now, I have a hard stop to be afk in 15 minutes
[16:39:00] volans: no probs, it's not an issue
[16:39:15] or more to the point - the issue is the same as it has been
[16:39:32] I know but I hoped the timeout would fix it
[16:39:37] weird that it works fine on -next
[16:39:42] yeah
[16:39:44] do they have the same data?
[16:39:46] something fishy going on
[16:39:50] yeah
[16:40:01] I can't say for sure data is the same, but it should be close, same order of magnitude of hosts
[16:40:26] the difference is on API usage though
[16:40:53] as in.... -next isn't busy as it's not used for normal ops?
[16:40:54] I'm off tomorrow but could try to look on Thu if you remind me ;)
[16:40:57] yes
[16:41:00] yeah
[16:41:07] we do have all the timers and cookbooks and integrations
[16:41:08] np will try to, we need to eventually
[16:41:18] Mail, collaboration-services, Infrastructure-Foundations, Znuny: Clean up OTRS/Znuny addresses handles by gsuite - https://phabricator.wikimedia.org/T284145#9796112 (LSobanski) Open→Resolved Resolving as the change is now in place.
[16:41:21] it's the same pattern we've seen before, something must be blocking somewhere
[16:41:33] I want to have a look at the queries on postgres while it's running
[16:41:49] it may need that kind of low-level analysis yeah
[17:16:37] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9796326 (Papaul)
[17:24:21] netops, Infrastructure-Foundations, SRE: Extend BGP peer automation via Netbox to include VMs - https://phabricator.wikimedia.org/T364480#9796348 (cmooney) Open→Resolved Patch to Homer wmf plugin merged now, so BGP to VMs at POPs / on L3 switches now under automation too.
[18:01:20] netops, Data-Platform-SRE, Infrastructure-Foundations: an-worker1165.eqiad.wmnet and increased network activity resulting in page on May 13 2024 - https://phabricator.wikimedia.org/T364893#9796533 (CDanis) To add some context: The ports that saturated weren't ports for individual machines on the acc...
[19:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:57:20] netops, Data-Platform-SRE, Infrastructure-Foundations: an-worker1165.eqiad.wmnet and increased network activity resulting in page on May 13 2024 - https://phabricator.wikimedia.org/T364893#9797744 (cmooney) Thanks for the task and analysis. > it seems like it was an-worker1165.eqiad.wmnet and 10.64....
[23:48:02] FIRING: [2x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed