[06:58:56] re:I/F alerting, that's ok for me but I +1 what mori.tz said, we should have a central place with all the alerts and I think we're already *not* doing that with alertmanager rules for other teams...
[06:59:15] it becomes harder to know if your work has had any unexpected effect on other services
[07:48:41] netops, Infrastructure-Foundations, SRE, fundraising-tech-ops: Please update bootp helper on pfw3-eqiad to point to frpm1002 for fundraising subnets - https://phabricator.wikimedia.org/T332939 (ayounsi) Open→Resolved a:ayounsi Done!
[09:13:22] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (ayounsi) I pondered multiple options for the Netbox `server_bgp` custom field, feedback from ServiceOps welcome ba...
[10:41:17] SRE-tools, Infrastructure-Foundations, Spicerack: IcingaHosts.wait_for_downtimed() does not honor dry_run - https://phabricator.wikimedia.org/T315537 (SLyngshede-WMF) We're missing a "dry_run" for services and puppet, but Puppet doesn't need it, as the decorator also checks for _remote_hosts.
[10:59:26] SRE-tools, Infrastructure-Foundations, Spicerack: IcingaHosts.wait_for_downtimed() does not honor dry_run - https://phabricator.wikimedia.org/T315537 (SLyngshede-WMF) PuppetMaster Class needs dry_run, this can be done by letting the class inherit from RemoteHostsAdapter. Service class should have a...
[11:11:03] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (cmooney) Personally I think it's a big conceptual change to introduce a second separate automation-pipeline for th...
[11:13:25] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (cmooney) On the Netbox side I'm happy with the current status, or having it as a dropdown. I think it's good to k...
[13:24:23] netbox, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (cmooney) Hoping to kick-start some more discussion around this and try to close this out. I still firmly believe tha...
[13:45:07] netbox, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (ayounsi) Overall I agree it's an improvement to have the parent interfaces defined in Netbox. I lost a bit of context o...
[13:58:22] netbox, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (cmooney) >>! In T296832#8729090, @ayounsi wrote: > I lost a bit of context on how it will be done on a day to day basis,...
[15:27:17] netbox, netops, Infrastructure-Foundations, SRE, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (Volans) Looks ok to me too, I'm not sure about all the details involved if we need to patch things like the dns genera...
[15:36:48] Packaging, Infrastructure-Foundations, Wikidata, Wikidata-Query-Service, wdwb-tech: Migrate WDQS to Java 11 - https://phabricator.wikimedia.org/T316103 (Gehel)
[16:01:01] (SystemdUnitFailed) firing: (11) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:34:26] (SystemdUnitFailed) firing: (10) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:44:00] that links to &editPanel=13
[16:44:25] so when clicking the link I get "Permission to edit panel denied"
[16:46:24] also "generate_os_reports" doesn't show up in the dashboard
[16:46:38] XioNoX: i already fixed the os_reports one
[16:46:45] thx!
[16:47:06] although it shouldn't be in the label.
i'll take a look at that tomorrow
[16:51:33] ohh, I get it, there is no recovery because there are still 10 similar alerts
[16:51:43] yes exactly
[16:51:45] but the one you fixed brought it from 11 to 10
[16:51:56] yes sorry i should have been more explicit :)
[16:52:02] nah it's ok
[16:52:32] it makes sense to have the labels related to I/F in the IRC chan here though
[16:53:05] otherwise it would say, "11 alerts including 1 for I/F, good luck finding it"
[16:53:28] the "recovery" should be more explicit though
[16:53:51] i think the problem is that the count also includes silenced alerts https://alerts.wikimedia.org/?q=team%3Dinfrastructure-foundations&q=alertname%3DSystemdUnitFailed&q=%40state%3Dsuppressed
[16:54:16] so there were a total of 11 but 8 were silenced (and i have since fixed two)
[16:54:20] #two more
[16:54:42] I see
[16:54:50] ftr i think we should improve this, just explaining
[16:55:26] in the IRC message we shouldn't have the non-actionable (silenced) ones
[16:55:34] yes agree
[16:55:53] if I get interrupted I want to quickly know why
[16:59:35] (SystemdUnitFailed) firing: (9) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:01:25] should we have a permanent rule for all the -test and -dev hosts to be silenced?
[17:01:41] +1 i think that makes sense to me
[17:04:41] thanks
[17:04:48] https://phabricator.wikimedia.org/T333204
[17:20:37] netops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (ayounsi) > Aside from duplication of code what are the blockers to having the Kubernetes groups also in Homer? Th...
[18:11:17] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (Papaul) We moved the first batch of servers today, all went well.
[18:22:02] netops, Infrastructure-Foundations, observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (ayounsi)
[18:37:20] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (Papaul) @cmooney second batch proposal below |Host|U space|Existing port|New port| |cloudcephosd2002-de...
[21:01:01] (SystemdUnitFailed) firing: (4) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:24:11] (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:22:29] netops, Infrastructure-Foundations, SRE, fundraising-tech-ops: Please update bootp helper on pfw3-eqiad to point to frpm1002 for fundraising subnets - https://phabricator.wikimedia.org/T332939 (Dwisehaupt) Thanks! Verified working and runs good.