[00:20:25] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:20:25] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:20:25] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:51:20] topranks, XioNoX: anything on the netbox-next DB that you're experimenting with? can I import a fresh prod backup? the deploy to -next is failing due to: duplicate key value violates unique constraint "dcim_cablepath_origin_type_id_origin_id_41b6f814_uniq"
[08:51:34] and might be due to some experiment we've done there with the data
[08:53:22] * volans retrying, first deleting one cable
[08:54:27] volans: we're both off today
[08:54:38] doh sorry
[08:54:42] ignore my ping
[08:54:44] but yeah I think you can zap the db, nothing we'd be doing there would need to stay
[08:54:53] sounds like a duplicate cable id from a previous test though
[08:55:02] so yeah, delete that or just restore the db from prod, no worries
[08:55:45] thx, go enjoy your day off, sorry again for the ping
[09:21:02] slyngs: is/was netbox-next carrying some special config for OpenID Connect?
[09:21:09] after upgrading the venv I get: ModuleNotFoundError: No module named 'jose'
[09:21:17] from social_core/backends/open_id_connect.py", line 5
[09:21:41] jose is a dependency of the social_auth
[09:22:08] Did I not add that.... just a sec.
[09:22:26] none of them are part of https://gerrit.wikimedia.org/r/c/operations/software/netbox-deploy/+/1024838/1/frozen-requirements-bullseye.txt
[09:22:41] but is it the same setting in prod too?
[09:22:55] or do we have two different settings between -next and prod?
[09:23:38] Prod doesn't use OIDC, so the dependency is never "triggered"/used
[09:23:55] ok
[09:25:12] so I don't get how it was working before
[09:25:19] were some deps installed manually?
[09:25:39] how long are we planning to keep -next and prod with two different auth setups?
[09:26:21] The plan was to wait until after the upgrade
[09:27:02] ok
[09:27:56] Okay, "fun" story, the Jose module isn't actually a dependency of social_auth, because "You might not actually need it", but if you do need it, then it is a dependency
[09:28:18] rotfl
[09:28:32] and facepalm at the same time
[09:29:12] so should I just install it manually for now?
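For the 08:51 duplicate-key error above: a minimal sketch, not the procedure volans actually used, of how one could inspect which dcim_cablepath rows on netbox-next collide with the incoming prod dump before deleting the stale test cable. The table and column names come from the constraint name in the error message; the connection parameters and the example ID pair are placeholders, not real values from the incident.

```python
# Hypothetical sketch: list the existing cable-path rows on netbox-next that
# clash with the incoming dump on (origin_type_id, origin_id), i.e. the pair
# reported in the "dcim_cablepath_origin_type_id_origin_id_41b6f814_uniq"
# duplicate-key detail. Connection parameters and IDs are placeholders.
import psycopg2

ORIGIN_TYPE_ID = 123  # placeholder, taken from the error's "Key (...)=(...)" detail
ORIGIN_ID = 456       # placeholder

with psycopg2.connect(dbname="netbox", host="localhost", user="netbox") as conn:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id FROM dcim_cablepath "
            "WHERE origin_type_id = %s AND origin_id = %s",
            (ORIGIN_TYPE_ID, ORIGIN_ID),
        )
        for (row_id,) in cur.fetchall():
            print(f"conflicting dcim_cablepath row: id={row_id}")
```

In this case either dropping the stale row or zapping the whole database and restoring from prod, as agreed at 08:55, clears the conflict.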
[09:29:21] Yes, but there are two
[09:29:29] I'm trying to locate the correct one
[09:30:49] I'd also very much like to know how it's being pulled in
[09:31:58] https://github.com/python-social-auth/social-core/commit/013d27d291ae44063dcbfbbb3f1a96cc87251643
[09:32:44] we now have social_auth_core-4.3.0 and social_auth_app_django-5.0.0 on -next
[09:32:50] We can also update the social_core module, and then the dependency goes away
[09:33:29] for now I can just do pip install python-jose and hopefully we'll fix that in the repo itself once netbox is upgraded
[09:36:08] works :D
[09:36:38] okay, if we can hit at least 4.5.0 for social_auth_core the dependency is no longer required
[09:38:42] I think I know what happened, in any case it's probably my fault. I pulled in a new version of social_auth to get the CAS support. I did that manually, so either that pulled in jose or I did it manually as well
[09:40:49] volans: One more of my bugs
[09:41:21] We need this one as well: https://pypi.org/project/ApereoSocialPipeline/
[09:47:36] ok
[09:48:12] which version of social auth do you need?
[09:48:44] 4.3.0 is fine, if we have jose, otherwise 4.5.0 or newer
[09:50:24] ok
[09:51:20] installed ApereoSocialPipeline, how do I test it?
[09:52:25] It works now. If you're logged out of idp-test and then go to Netbox-Next and reauthenticate it should either work ... or fail if the module isn't there
[10:00:21] all seems to work fine on netbox-next, I'll deploy to netbox prod now
[10:12:49] I will hold off on lunch for a bit then :-)
[10:14:55] nah no worries
[10:17:21] The Django REST Framework is slowly driving me insane
[10:19:14] only slowly? :D
[10:20:22] Because at first it makes perfect sense, but then it tricks you, in a very sneaky way where it pretends to listen and then goes and does its own thing
[10:20:49] Import an authentication class that doesn't exist, it fails... Perfect, makes sense.
[10:21:24] Then you give it one that you based on the included TokenAuthentication, it says yes sure... but then never uses the code
[10:21:41] lol
[10:21:45] not nice
[10:22:14] I'm fairly sure I'm holding it wrong or something, but that's a bit tricky to work out when your code isn't even called
[10:34:20] slyngs: I'll delay the upgrade, too many people running cookbooks that interact with netbox right now :D
[10:43:52] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757114 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti7001.magru.wmnet with OS bookworm
[11:47:36] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757334 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti7001.magru.wmnet with OS bookworm com...
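A small, hypothetical sanity check related to the 09:21 ModuleNotFoundError: run inside the Netbox virtualenv after a deploy, it simply tries to import the OIDC backend and the optional modules it needs, so a missing dependency is caught before the service is restarted. The module paths come from the traceback and package names discussed above; the script itself is not part of netbox-deploy.

```python
# Hypothetical post-deploy check, not part of the netbox-deploy repo: confirm
# the optional social-auth OIDC dependencies are importable in the venv.
# Per the 09:36 note, social-auth-core >= 4.5.0 no longer needs python-jose;
# with 4.3.0 the backend imports it at load time, which is what broke here.
import importlib
import sys

REQUIRED = [
    "jose",                                  # provided by the python-jose package
    "social_core.backends.open_id_connect",  # the backend from the traceback
    "social_django",                         # social-auth-app-django integration
]

missing = []
for module in REQUIRED:
    try:
        importlib.import_module(module)
    except ModuleNotFoundError as exc:
        missing.append(f"{module}: {exc}")

if missing:
    print("missing OIDC dependencies:\n  " + "\n  ".join(missing))
    sys.exit(1)
print("all OIDC dependencies importable")
```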
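On the 10:17 DRF frustration: a frequent reason a subclass of TokenAuthentication "is never called" is that it is defined but never registered with the framework. A minimal sketch, not slyngs' actual code, of the pattern and the registration step; the class name and token lookup are purely illustrative.

```python
# Hypothetical sketch of a DRF authentication class based on the built-in
# TokenAuthentication; names and lookup logic are illustrative only.
from rest_framework.authentication import TokenAuthentication
from rest_framework.exceptions import AuthenticationFailed


def lookup_user_for_token(key):
    # Placeholder: a real implementation would validate the token against the
    # IdP or a database table and return a Django user object, or None.
    return None


class IdpTokenAuthentication(TokenAuthentication):
    """Accept externally minted tokens instead of DRF's own Token table."""

    keyword = "Bearer"  # header prefix; the built-in default is "Token"

    def authenticate_credentials(self, key):
        user = lookup_user_for_token(key)
        if user is None:
            raise AuthenticationFailed("Invalid token.")
        return (user, key)


# DRF only invokes classes that are registered; defining one is not enough.
# Either globally in settings.py:
#
#   REST_FRAMEWORK = {
#       "DEFAULT_AUTHENTICATION_CLASSES": [
#           "myapp.authentication.IdpTokenAuthentication",
#       ],
#   }
#
# or per view:
#
#   class MyViewSet(viewsets.ModelViewSet):
#       authentication_classes = [IdpTokenAuthentication]
```

If the class is only imported but not listed in DEFAULT_AUTHENTICATION_CLASSES (or in a view's authentication_classes), DRF silently keeps using the configured defaults, which matches the "your code isn't even called" symptom at 10:22.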
[11:51:07] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757342 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti7002.magru.wmnet with OS bookworm
[12:20:25] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:50:09] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757585 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti7002.magru.wmnet with OS bookworm com...
[12:51:21] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757586 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti7003.magru.wmnet with OS bookworm
[12:51:47] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757592 (10MoritzMuehlenhoff)
[13:20:25] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:45:25] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:51:11] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757863 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bullseye
[13:55:12] topranks || XioNoX: I made a change in kubernetes staging that leads to more prefixes being announced via BGP, apparently hitting a limit. Can you tell me what the prefix limit is?
[13:55:12] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757869 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti7003.magru.wmnet with OS bookworm com...
[13:55:13] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757870 (10MoritzMuehlenhoff)
[13:55:13] context: I went from ipv4 blocks of /26 to /30
[13:55:13] jayme: we're both off today I'm afraid
[13:55:13] oops
[13:55:13] in general it's better to keep the number down, but we can probably accommodate this
[13:55:13] For today, maybe better to roll back if that is possible?
[13:55:13] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, and 2 others: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9757873 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti7004.magru.wmnet with OS bookworm
[13:55:13] I can assist tomorrow morning and up the limit
[13:55:25] If that's not possible let me know, I can try to dig out where it is and show how to change it, but I'm on my phone
[13:55:27] yeah, I was trying to figure out whether going to /28 would avoid hitting the limit
[13:56:02] 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9757891 (10MoritzMuehlenhoff)
[13:56:23] don't worry, I'll roll back (or go to /28), please feel free to continue being off :)
[14:08:07] thanks, I'll link you in the morning, cheers
[14:09:43] I'm off tomorrow :)
[14:26:00] Any objections to us continuing with the change from private to public IP on lists1004? We started earlier but there was a change for dns7001 that wasn't merged that got pulled in with one of the netbox changes. We'd like to continue from where we left off and get this host reprovisioned and reimaged
[14:50:16] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9758116 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bullseye exe...
[14:54:26] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9758121 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti7004.magru.wmnet with OS bookworm comp...
[14:54:50] eoghan: that sounds sensible to me
[14:55:28] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9758122 (10MoritzMuehlenhoff)
[15:09:34] jhathaway: Thanks!
[15:28:10] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 10cloud-services-team (FY2023/2024-Q3-Q4), and 2 others: Remove elasticsearch-curator dependency from Spicerack/Elastic cookbooks - https://phabricator.wikimedia.org/T361647#9758252 (10RKemper) 05Open→03Resolved
[15:31:11] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9758278 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bullseye
[15:49:31] jayme: I'm around for a bit if needed, but yeah, for such changes a task might be more appropriate :)
[15:50:41] jayme: the limit so far is 50 prefixes per host
[15:58:10] XioNoX: thanks! I found a.lex with permissions to "clear bgp neighbor". My theory is that the /30 blocks made bird announce too many prefixes to the routers and then they got stuck in that error state
[15:58:47] as changing the block size to /28 did not clear out the error although bird was only trying to announce one prefix then...
[15:59:04] anyhow - nothing needed from your side I guess, as /28 is also fine by me
[16:01:20] jayme: yeah, the routers are configured to wait for human intervention when the limit is triggered, to prevent flapping
[16:04:58] ah, I see. Good theory then :D
[16:05:04] thanks for responding <3
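A back-of-the-envelope illustration of the 13:55 block-size change, assuming bird announces one prefix per allocated IPAM block and assuming, purely for illustration, a /24 per-node pool (the real staging allocation may differ): shrinking the block size multiplies the number of blocks carved from the same pool, which is how /30 blocks can blow past the 50-prefixes-per-host limit mentioned at 15:50.

```python
# Illustrative only: how many blocks (and thus announced prefixes, assuming
# one announcement per block) fit into a per-node pool of a given size.
def blocks_per_pool(pool_prefix_len: int, block_prefix_len: int) -> int:
    if block_prefix_len < pool_prefix_len:
        raise ValueError("block must not be larger than the pool")
    return 2 ** (block_prefix_len - pool_prefix_len)

PER_HOST_LIMIT = 50   # per the 15:50 message
POOL = 24             # hypothetical per-node IPv4 pool, /24

for block in (26, 28, 30):
    n = blocks_per_pool(POOL, block)
    status = "OK" if n <= PER_HOST_LIMIT else "over the limit"
    print(f"/{block} blocks in a /{POOL} pool: {n:3d} prefixes -> {status}")

# /26 -> 4 (OK), /28 -> 16 (OK), /30 -> 64 (over the 50-prefix limit)
```

Under these assumptions only the /30 case trips the limit, consistent with the sessions staying stuck until a manual "clear bgp neighbor" as described at 15:58 and 16:01.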
[16:08:33] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9758441 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bullseye exe...
[17:45:25] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:03:46] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9759001 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm
[18:11:43] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9759006 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7001.wikimedia.org with OS bookworm exe...
[18:49:42] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9759148 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7002.wikimedia.org with OS bookworm
[19:27:22] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9759320 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7002.wikimedia.org with OS bookworm exe...
[19:38:09] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9759364 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1002 for host dns7002.wikimedia.org with OS bookworm
[20:50:08] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-magru, 06Traffic: Q4:rack/setup/install magru misc servers - https://phabricator.wikimedia.org/T362730#9759563 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1002 for host dns7002.wikimedia.org with OS bookworm exe...
[21:25:28] just a heads-up that I rebooted pcc-worker1003 as it was claiming it was out of disk space, although it really wasn't. Things look OK post-reboot.
[21:28:51] I stand corrected... it's still saying no space on device. I don't have time to take a closer look but just FYI
[21:34:12] it seems to be out of inodes
[21:36:01] thanks for checking, I'm rolling out CFSSL to the Elastic hosts ATM
[21:45:26] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:56:34] we should re-format them with a better -i bytes-per-inode value
[22:12:52] also the cleanup job for old pcc runs doesn't work since it was moved over to NFS (unrelated), as the find doesn't find anything because the path is a symlink :sigh:
[22:18:53] also this command doesn't delete anything, it just finds them :facepalm:
[22:18:56] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/puppet_compiler.pp#30
[22:19:29] I've actually deleted what it would have found, and now:
[22:19:30] Filesystem Inodes IUsed IFree IUse% Mounted on
[22:19:32] /dev/sda1 1310720 229699 1081021 18% /
[22:31:28] jhathaway: FYI in case you want to open tasks/fix any of the above. I'm heading to bed ;)
[22:43:18] I'm cleaning up all the other pcc workers too
[22:48:46] {done}
[23:03:58] thanks volans, I'll take a look
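Putting the 21:34 inode finding together with the 22:12 and 22:18 notes on the pcc cleanup (a find that neither follows the symlinked path nor actually deletes anything): a hypothetical sketch of a cleanup that reports inode usage and really removes old runs. The paths and retention period are placeholders, not the values in profile::puppet_compiler.

```python
# Hypothetical sketch, not the actual profile::puppet_compiler cleanup:
# report inode usage and remove pcc output directories older than N days,
# resolving the symlink first so the real target directory is scanned
# (a plain `find` on a symlink without -L does not descend into it, and
# without -delete or -exec it only lists candidates anyway).
import os
import shutil
import time
from pathlib import Path

OUTPUT_DIR = Path("/srv/pcc/output")  # placeholder; on the workers this is a symlink
MAX_AGE_DAYS = 30                     # placeholder retention period


def inode_usage(path: Path) -> float:
    """Return the fraction of inodes in use on the filesystem holding `path`."""
    st = os.statvfs(path)
    return 1 - st.f_ffree / st.f_files


target = OUTPUT_DIR.resolve()         # follow the symlink to the real directory
cutoff = time.time() - MAX_AGE_DAYS * 86400

print(f"inode usage on {target}: {inode_usage(target):.0%}")
for run_dir in target.iterdir():
    if run_dir.is_dir() and run_dir.stat().st_mtime < cutoff:
        shutil.rmtree(run_dir)        # actually delete, unlike a find with no -delete
```

Re-creating the filesystem with a smaller -i bytes-per-inode value, as suggested at 21:56, would address the underlying inode shortage; a cleanup like this only keeps the symptom under control.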