[00:01:43] (SystemdUnitFailed) firing: (3) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:06:43] (SystemdUnitFailed) firing: (3) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:06:57] (SystemdUnitFailed) firing: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:21:34] netbox, Infrastructure-Foundations: Upgrade Netbox to 3.5.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[06:26:08] netbox, Infrastructure-Foundations: Upgrade Netbox to 3.5.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[06:51:28] I'm looking at ^
[06:53:15] XioNoX: ack, lmk if I can help
[06:53:43] volans: probably :)
[06:53:55] volans: the issue seems related to what shows up on https://netbox.wikimedia.org/api/extras/reports/
[06:54:23] ack, I'll check the accounting one that is failing completely
[06:54:38] the netbox_report_accounting_run.service failure is just a consequence of the report itself failing
[06:54:47] yeah
[06:55:00] full stacktrace https://www.irccloud.com/pastebin/CP0VLkI1/
[06:55:01] report debugging hasn't improved in 3.2.9 :D
[06:55:50] that traceback is useless, we need the traceback of why the report is failing
[06:57:06] https://github.com/netbox-community/netbox/issues/11614
[06:57:55] eh
[06:59:20] to get the proper one I've run
[06:59:20] python manage.py runreport -v3 accounting.Accounting
[06:59:37] as netbox user on netbox1002 with the venv activated
[06:59:44] nice
[06:59:50] I'll update the doc when I can reproduce
[06:59:58] no, wait
[07:00:01] this one too is useless...
[07:00:08] for test_name, attrs in job_result.data.items():
[07:00:08] AttributeError: 'NoneType' object has no attribute 'items'
[07:00:24] ah yeah, it's because there are no job results
[07:05:02] I was trying to see if something can be done through nbshell, but no luck so far
[07:05:33] I'm adding a try/except and no luck so far either, give me 5
[07:15:10] and it works fine on netbox-dev...
[07:17:42] XioNoX: ? doesn't work on -next either, I'll move debugging there
[07:17:49] https://netbox-next.wikimedia.org/extras/reports/results/4270493/
[07:18:47] volans: I was looking at https://netbox-next.wikimedia.org/api/extras/reports/accounting.Accounting/ no 500
[07:19:30] but still the report errored out
[07:19:37] so do we have 2 distinct bugs?
[07:20:18] not clear at that point, looks at least related to the accounting report
[07:23:49] volans: might be worth setting DEBUG = True in Netbox's config
[07:24:04] I'd rather not if possible
[07:24:07] leaks too much info
[07:27:54] interesting, now it keeps running :/
[07:28:30] uh
[07:28:43] volans: like runs fine or runs forever?
[07:29:02] runs forever or doesn't run at all, let me restart things in -next
[07:30:49] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.5.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[07:35:24] volans, XioNoX: ok to reboot the netbox hosts or you currently working on them?
[07:36:21] moritzm: which ones?
[07:37:52] netbox1002/2002
[07:37:53] moritzm: I'm debugging on netbox-dev right now
[07:38:10] I'd use the cookbook, it only reboots the production nodes
[07:38:16] but I can also do it later
[07:38:27] just wanted to do it before DC ops are around
[07:38:29] production for me is good anytime
[07:38:33] +1
[07:38:45] and before people start running cookbooks like crazy :D
[07:38:50] k, I'll do a quick headsup in -sre and proceed in ~ 5m
[07:41:14] k
[07:54:46] XioNoX, topranks : GRRRR the netbox-extras dir in netbox-next is back at Feb because of local modifications, could you please clean that up and keep it clean? It's ok to test but please don't let it stay out of sync for more than a week at a time
[07:55:07] * volans moves debugging to prod as it is not possible to debug it there
[07:55:13] given the false positives
[07:56:30] FYI, netbox is back up (but the cookbook is still running (waiting for the recovery of various Netbox reports in vain))
[07:56:38] eh...
[07:56:42] thx
[07:59:45] volans: fixed
[08:06:24] thx
[08:06:46] (I saved the diff in my home dir in case it's needed)
[08:08:06] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.5.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[08:16:42] (SystemdUnitFailed) firing: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:23:23] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.5.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[08:31:19] XioNoX: fix in https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/918365
[08:35:15] volans: how did you figure it out?
[08:35:43] removing things block by block
[08:35:52] :(
[08:36:22] was not raising an exception with a try/except because it was the inherent structure that was wrong
[08:36:35] we were not calling the parent init, not sure why, maybe there wasn't one in the past
[08:36:42] but that was wrong on our side
[08:36:53] the move to pre_run is just for convenience given it's there
[08:40:34] XioNoX: https://netbox-next.wikimedia.org/extras/reports/results/4270511/ :D
[08:41:02] volans: now it can go back to alerting 24/7 :)
[08:41:10] rotfl
[08:42:50] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.5.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[08:43:29] volans: see task description on https://phabricator.wikimedia.org/T336275 I cleaned up already lots of prerequisites for the 3.5 migration, with some pending patches (one can be merged now afaik)
[08:44:17] volans: do you know what's the impact of updating Django's SECRET_KEY? so it's >50 chars long? Can I do it now?
[08:46:11] I think you need to set SECRET_KEY_FALLBACKS
[08:46:24] to keep the old one around for decrypting signed things with the old one
[08:46:42] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:47:07] also there might be a way to do it without logging everyone out
[08:47:20] although they will re-login via the idp...
[08:47:27] I think it's fine to log people out
[08:47:40] it's a one time thing
[08:49:05] from what I read on https://docs.djangoproject.com/en/4.2/ref/settings/ people being logged out will be the largest impact?
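Editorial sketch for the root cause described at 08:36 above (a report class defining its own `__init__` without calling the parent's). This is a minimal, framework-free illustration, not the actual netbox-extras patch: `BaseReport`, `BrokenReport`, `FixedReport` and the `results` attribute are hypothetical stand-ins, not NetBox's real `Report` class. It shows why a try/except inside the report's own code may never fire: the failure happens in the parent's machinery, which expects state that the skipped parent `__init__` would have created.

```python
class BaseReport:
    """Hypothetical stand-in for a framework base class (not NetBox's Report)."""

    def __init__(self):
        # State the framework expects to exist before it runs any test_* method.
        self.results = {}

    def run(self):
        # The framework, not the subclass, records the outcome of each test.
        for name in dir(self):
            if name.startswith("test_"):
                getattr(self, name)()
                self.results[name] = "ok"
        return self.results


class BrokenReport(BaseReport):
    def __init__(self):
        # Bug: the parent __init__ is never called, so self.results is missing.
        self.threshold = 10

    def test_something(self):
        # A try/except here catches nothing: this method itself works fine.
        try:
            assert self.threshold > 0
        except Exception:
            raise


class FixedReport(BaseReport):
    def __init__(self):
        # Fix: initialise the parent first, then add subclass state.
        super().__init__()
        self.threshold = 10

    def test_something(self):
        assert self.threshold > 0


if __name__ == "__main__":
    try:
        BrokenReport().run()
    except AttributeError as exc:
        print(f"broken report failed outside its own code: {exc}")
    print(FixedReport().run())
```

In this sketch the error surfaces in `BaseReport.run`, after the subclass test has already returned, which is consistent with the "removing things block by block" debugging described above.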
[08:49:40] if that's true, not sure it's worth spending time on puppet code to implement SECRET_KEY_FALLBACKS
[08:50:22] https://code.djangoproject.com/ticket/22310
[08:50:57] I was wondering about cookies
[08:57:38] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations: PCC: worker out of disk space - https://phabricator.wikimedia.org/T336350 (aborrero)
[09:01:40] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations: PCC: worker out of disk space - https://phabricator.wikimedia.org/T336350 (aborrero)
[09:06:15] If Netbox uses the Django REST API framework resetting the secret key might also invalidate any tokens
[09:06:28] XioNoX, topranks: could either of you please run "sudo keyholder arm" on cumin2002? I don't have access to the homer passphrase in pwstore
[09:06:59] mortizm: let me have a look
[09:08:50] cheers
[09:09:56] this brings up the topic again, we should add some more people in @netops in pwstore
[09:10:19] I think mori.tz is a good candidate as most of the time the one rebooting them
[09:10:29] but in general we could add someone that has already root on the network devices
[09:10:43] volans: or we can hire someone :)
[09:11:03] so you can start slacking off? :-P
[09:11:59] mortizm: done now
[09:12:10] I also promise to learn how to spell your name :P
[09:12:39] volans: yeah I agree I think it might make sense to have a slightly wider group access to those
[09:13:15] and moritz would seem to be a good candidate, both given the requirement to use it and security focus
[09:13:47] and one of the two pwstore owners :D
[09:14:32] slyngs: yes they do use it
[09:15:10] so yeah I'd be a bit more careful with the secret key change, needs at least some testing in -next
[09:15:54] topranks: looks good, thanks
[09:17:31] and sgtm wrt changing the access, I'll doublecheck with Joanna and then add myself to the homer-key-passphrase secret (we can have a mix of groups and users, so this can be "access: @netops, jmm" without needing to add me to @netops)
[09:23:05] * jbond nevermind i see from -foundsations you did yuo the cookbook :)
[09:24:06] jbond: ?
[09:24:34] moritzm: +1
[09:25:04] ignore, that was meant to be a direct to mori.tzm
[09:26:07] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations, Jenkins: PCC: worker out of disk space - https://phabricator.wikimedia.org/T336350 (hashar) Looking on the Jenkins controller logs at https://integration.wikimedia.org/ci/log/WARNING/ **pcc-worker1003.puppet-diff...
[10:11:08] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (cmooney)
[10:33:20] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations, Jenkins: PCC: worker out of disk space - https://phabricator.wikimedia.org/T336350 (jbond) Open→Resolved a:jbond This was caused by an exhaustion of inodes in `/srv/jenkins/puppet-compiler`. ultimatel...
[10:39:44] XioNoX, topranks: is this something we should fix? https://netbox.wikimedia.org/extras/reports/results/4557524/
[10:39:57] same for https://netbox.wikimedia.org/extras/reports/results/4557509/
[10:41:45] volans: https://phabricator.wikimedia.org/T331519
[10:42:17] netops, Cloud-VPS, Infrastructure-Foundations, SRE, and 2 others: Upgrade cloudsw1-c8-eqiad and cloudsw1-d5-eqiad to Junos 20+ - https://phabricator.wikimedia.org/T316544 (cmooney) @aborrero @dcaro @Andrew I think we are in a position to look at doing this again? I've updated the list of servers...
[10:42:26] 2nd one is for papaul
[10:42:39] ok
[10:42:51] the Network one had a timeout in one of the checks, re-running works but takes 2 minutes
[10:45:57] volans: I've seen that before too.
[10:46:06] with the port-block consistency check?
[10:48:17] I've tried to look at the code of that to see if there are any obvious problems but I can't really see why it might be taking so long
[10:48:18] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-extras/+/refs/heads/master/reports/network.py#319
[10:49:43] ack, I'll have a look
[11:04:10] note that this will move to a validator and/or the interface_automation script so not sure it's worth spending too much time on the report
[11:04:26] unless the improvement there will be useful later on of course
[11:06:13] it's the test_matching_vlan the slow one
[11:19:10] that's doing a lot of queries IMHO and I think could be simplified
[11:52:05] bookworm installations in Ganeti are also working now. the mysterious crash in early startup which made it drop in a busybox shell could be tracked down to increased memory usage in the Linux kernel and/or glibc
[11:52:48] so with the 1G we've so far been using for small VMs this leads to steal-ctty segfaulting, which ends up in a cascade of other failures
[11:53:20] I'll update sre.ganeti.makevm to ensure we have at least 1.5G for newly created VMs
[12:00:06] bummer
[12:00:23] I don't envy you, it must not have been fun to track this down
[12:04:05] at least it works now :-)
[12:30:17] volans: ok yeah with the test_matching_vlan I can see it does a lot of lookups
[12:30:37] perhaps we should grab all the vlans and their associated prefixes first? then compare each device IP?
[12:30:39] I'm testing a patch to improve it
[12:30:49] ah ok thanks :)
[12:48:30] netops, Infrastructure-Foundations, SRE, cloud-services-team: cloudgw: review security policy for edge network - https://phabricator.wikimedia.org/T336368 (aborrero)
[12:50:01] netops, Infrastructure-Foundations, SRE, cloud-services-team: cloudgw: review security policy for edge network - https://phabricator.wikimedia.org/T336368 (aborrero)
[13:30:43] netops, Infrastructure-Foundations, SRE, cloud-services-team: cloudgw: review security policy for edge network - https://phabricator.wikimedia.org/T336368 (cmooney) @aborrero my apologies I messed up the vlan list for cloudgw2002. Should be ok now. ` cmooney@cloudsw1-b1-codfw> show arp no-resol...
[13:36:57] netops, Infrastructure-Foundations, SRE, cloud-services-team: cloudgw: review security policy for edge network - https://phabricator.wikimedia.org/T336368 (cmooney) @aborrero re-reading the description it sounds like there may be some other issues? Let me know if there is anything specific, the...
[13:45:59] topranks: https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/918479 should reduce the time from ~1m to ~15s for that test
[13:49:07] TIL .prefetch_related()
[13:50:07] fixed CI, sent too early :)
[13:50:34] volans: what does the ".select_related("_path")" do?
[13:51:20] that's for connected_endpoint, a bit hacky probably, and will need to be adapted for 3.5 as they renamed it to connected_endpoints that returns a list
[13:51:38] that's how the field is named
[13:51:42] I can add a comment
[13:52:10] yeah I started to write the patches to be 3.5 compatible
[13:55:54] added comment
[14:02:04] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations, Jenkins: PCC: worker out of disk space - https://phabricator.wikimedia.org/T336350 (hashar) Resolved→Open Thank you so much @jbond ! As a side track on T336356 I have tried to add inodes to the Grafana da...
[14:03:48] volans: great work thanks, also TIL on the prefetch stuff
[14:04:52] yw :) I didn't spend time to trick nbshell to tell me how many queries were performed before/after, but the *after* should be just 3 AIUI
[14:05:52] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations, Jenkins: PCC: worker out of disk space - https://phabricator.wikimedia.org/T336350 (hashar) Open→Resolved My bad, it has 87% FREE inodes or 13% usage: ` pcc-worker1003:~$ df -hi / Filesystem Inodes IUs...
[14:08:29] topranks: btw I was getting a nonetype error in -next but I ignored it because it might just be test data
[14:08:43] I was thinking to re-import a fresh DB to -next, thoughts?
[14:08:58] volans: +1 on the fresh db import
[14:09:27] I can't say for sure if some manual edits are the cause of the nonetype, but it's certainly not impossible
[14:10:32] total run time for the network report: PRE -> 5 minutes, 0.29 seconds POST: -> 0 minutes, 54.58 seconds
[14:11:02] * volans is happy
[14:12:58] impressive!
[14:16:25] yeah nice work
[14:16:43] Even though I +1'd the change, I was actually just logging some nits :P
[14:17:12] not important, thought maybe the vars "vlan" and "vlans", containing IPNetwork objects, were a bit confusing
[14:18:10] * sukhe lurking
[14:18:18] sorry, this is about the automatic vlan tagging?
[14:19:29] sukhe: no, although I am hopeful we may have some good news soon there
[14:19:39] :D
[14:19:48] this was on some cleanup work v.olans did, I was just suggesting a var name change for readability
[14:28:19] nice work all!
[14:28:53] topranks: followup https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/918495
[14:30:02] volans: thanks :)
[14:31:06] fixed one thing
[14:31:19] waiting for gerrit maintenance to be over before any follow up change/merge/deploy
[14:51:10] SRE-tools, Infrastructure-Foundations, SRE, serviceops, serviceops-collab: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666 (LSobanski)
[14:57:13] (DiskSpace) firing: Disk space apt1001:9100:/ 5.327% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=apt1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:36:54] netbox, Infrastructure-Foundations, Patch-For-Review: Upgrade Netbox to 3.5.x - https://phabricator.wikimedia.org/T336275 (ayounsi)
[16:33:13] CAS-SSO, Infrastructure-Foundations, GitLab (Auth & Access), Release-Engineering-Team (Priority Backlog 📥), User-brennen: GitLab sessions expire frequently - https://phabricator.wikimedia.org/T330359 (thcipriani) Keeping an eye on this one. hypothesis: 2fa seems to be expiring on the gitlab...
[16:52:18] topranks: doh, I saw that network flapped again, and when it times out it takes 5 minutes, weird, I wonder if there are some operations on netbox that lock some data needed for it to run
[16:53:08] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations, Release-Engineering-Team, Jenkins: PCC runs failing with complaints about disk space - https://phabricator.wikimedia.org/T335111 (hashar)
[16:53:10] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations, Jenkins: PCC: worker out of disk space - https://phabricator.wikimedia.org/T336350 (hashar)
[16:53:38] puppet-compiler, Continuous-Integration-Infrastructure, Infrastructure-Foundations, Release-Engineering-Team, Jenkins: PCC runs failing with complaints about disk space - https://phabricator.wikimedia.org/T335111 (hashar) That one got solved by @jbond today via T336350. The host was out of in...
[16:55:06] see https://netbox.wikimedia.org/extras/reports/results/4558303/
[17:01:42] (SystemdUnitFailed) firing: kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:06:42] (SystemdUnitFailed) resolved: kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:19:27] (SystemdUnitFailed) firing: (2) kube-controller-manager.service Failed on aux-k8s-ctrl1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:21:12] (SystemdUnitFailed) resolved: kube-controller-manager.service Failed on aux-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:36:01] volans: hmmm yeah
[17:37:13] (DiskSpace) resolved: Disk space apt1001:9100:/ 5.841% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=apt1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[17:37:56] did not expect that after the massive improvement, I had previously wondered whether it was some other factor, was looking at the cpu stats there, don't see any signs of it maxing out at any stage
[17:39:40] one potential improvement we could think about is not doing the pre-fetch for "connected_endpoint__untagged_vlan__prefixes"
[17:40:18] instead build a dict, before the loop over all interfaces, of all the vlan prefixes
[17:40:37] and then just get the "connected_endpoint__untagged_vlan__vid"
[17:41:17] not sure if that would make much difference. but with the current approach we will hit any given vlan potentially hundreds of times on different interfaces, pulling the prefixes each time
[17:41:31] I suspect that won't fix the timeout. But unusual why it happens with this function and not others
[17:49:00] volans: do you know off the top of your head if we do an apt-get upgrade during provisioning, and more importantly if we should (cc moritz)
[17:49:45] i just noticed that prometheus-ipmi-exporter was failed on sretest2002 because it needed upgrading with a package in bullseye-wikimedia/main
[17:50:52] perhaps we could/should upgrade everything in $distro-wikimedia and $distro-security
[18:01:35] actually I notice in the network graphs for the host there is a regular burst of data at 22 and 52 minutes past the hour
[18:01:56] the failed report ran at 15:50, maybe coincided with that? unsure what causes the traffic burst
[18:06:12] netops folks: I just filed https://phabricator.wikimedia.org/T336428
[18:06:34] I am not looking for actionables, just letting you know that I can't ping an lvs host in codfw from cumin2002, just in case :)
[18:50:14] jbond: with "provisioning" do you mean "reimage"? in that case no we don't do apt-get upgrade AFAIK, open to discuss if we should
[18:53:07] volans: yes i meant the reimage. lets see what mori.tzm thinks
[18:53:12] ack
[18:54:58] topranks: I thought more of a DB lock for some operations, but feels weird anyway. Yes we could factor out some part of it, I can try that approach too. Although if you see https://netbox.wikimedia.org/extras/reports/results/4558303/#test_matching_vlan it has a KeyError: 'device_role' that in turn causes the timeout, will need more time to dig into it
[19:16:18] hmm yeah, that is odd. will try to dig into it more tomorrow too.
[19:16:51] I did a quick check on netbox-next and there is no device object that throws an exception requesting device.device_role
[20:12:16] SRE-tools, SRE, Spicerack: Preserve SSH host key when re-imaging hosts - https://phabricator.wikimedia.org/T129180 (Dzahn)
[20:13:52] SRE-tools, Infrastructure-Foundations, SRE, Spicerack: Preserve SSH host key when re-imaging hosts - https://phabricator.wikimedia.org/T129180 (taavi) Could this be closed in favour of {T268344}?
[20:51:21] sukhe: I did some checks on that. Definitely a kind of crazy one.
[20:51:54] I updated the task, there was a switch config issue, and after resolving that there is some problem the uRPF filter is causing
[20:51:58] I'll look more tomorrow.
[20:52:16] Might be worth rebooting the box just in case the switch issue somehow has put things in an odd state.
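Editorial sketch of the idea floated at 17:40 (build a dict of all VLAN prefixes once, before looping over interfaces, then look each interface up by VLAN ID). This is a plain-Python illustration using only the standard `ipaddress` module. The data, the `build_vlan_index` and `find_mismatches` helpers, and the tuple layout are all assumptions made up for the example; the real `test_matching_vlan` report works on NetBox ORM objects and is not reproduced here.

```python
import ipaddress

# Hypothetical input: VLAN ID -> prefixes assigned to that VLAN.
vlan_prefixes_raw = {
    100: ["10.0.0.0/24", "2001:db8:100::/64"],
    200: ["10.0.1.0/24"],
}

# Hypothetical input: (device, interface IP, untagged VLAN ID on the switch port).
interfaces = [
    ("host1", "10.0.0.5", 100),
    ("host2", "10.0.1.7", 100),  # IP belongs to VLAN 200, port is in VLAN 100
    ("host3", "2001:db8:100::a", 100),
]


def build_vlan_index(raw):
    """Parse every prefix once, so the per-interface loop does no parsing or fetching."""
    return {
        vid: [ipaddress.ip_network(prefix) for prefix in prefixes]
        for vid, prefixes in raw.items()
    }


def find_mismatches(rows, vlan_index):
    """Return devices whose IP is not inside any prefix of their port's VLAN."""
    mismatches = []
    for device, ip, vid in rows:
        address = ipaddress.ip_address(ip)
        networks = vlan_index.get(vid, [])
        # Membership across IP versions is simply False, so mixed lists are safe.
        if not any(address in network for network in networks):
            mismatches.append(device)
    return mismatches


if __name__ == "__main__":
    index = build_vlan_index(vlan_prefixes_raw)
    print(find_mismatches(interfaces, index))
```

The point of the design is that each VLAN's prefixes are loaded once no matter how many interfaces share the VLAN. Whether this would help with the timeout discussed above is left open in the conversation itself.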