[02:08:59] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[06:09:00] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[08:52:58] Is anyone using netbox-next or idp-test? I'd like to run a few tests, but that may make netbox-next unavailable for a few minutes here and there
[08:53:17] go ahead for me
[08:53:58] +1
[08:55:15] Cool, everything is easily reverted if need be. I'll "grab" both servers
[09:12:08] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[09:13:38] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[09:19:40] Okay, so I actually managed to make netbox use OIDC just PLAIN, without my own CAS backend
[09:22:02] slyngs: that sounds like a good thing :)
[09:22:29] I believe it is :-)
[09:23:32] Okay, I'm just going to roll back and do a Puppet patch, and then I'll have taavi test that login works
[09:25:20] +1 thx
[09:31:10] To be fair it was probably broken because I did it wrong the first time around, but such is life :-)
[10:09:00] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[11:20:57] 10netbox, 10DC-Ops, 10Infrastructure-Foundations, 10SRE: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036 (10Volans)
[12:36:23] 10netbox, 10Infrastructure-Foundations, 10Patch-For-Review: taavi's netbox-next account is stuck - https://phabricator.wikimedia.org/T351950 (10SLyngshede-WMF) The extras.social_pipeline.add_user_to_groups never got added, hence Netbox isn't actually able to add groups. For some unknown reason group synchron...
[12:39:33] volans: I just messed up and accidentally deleted a device in Netbox :(
[12:39:46] I think I'm gonna need to do a db restore
[12:40:06] server or network device?
[12:40:44] ah switch doh
[12:40:47] yeah
[12:40:47] network device
[12:40:54] yeah too complex to rebuild
[12:40:55] ugh
[12:41:15] let's try to find a dump after 11:18
[12:41:28] that UI with the tick boxes of interfaces: delete at the bottom deletes the things ticked, delete at the top removes the device
[12:41:32] is what got me
[12:41:49] I know
[12:41:53] it's always that
[12:41:55] yeah I can probably do it, I'm looking at the wikitech page here
[12:42:07] although it gives a summary of what will be done
[12:42:33] psql-all-dbs-2023-11-29-11-37.sql.gz should do it
[12:42:39] yeah it's improved, it now warns you - I'm not defending myself, it was dumb
[12:42:55] thanks - I will give it a go
[12:43:03] I'm saying it's bad UI, you're not the first nor the last
[12:43:22] lmk if you need me to do it
[12:43:30] but I'd prefer to have others comfortable doing it too
[12:44:14] topranks: you could also start a sre.dns.netbox cookbook to just hold the lock (without letting it finish, leaving it at the prompt)
[12:44:33] also tell people in -sre and -dcops not to make changes in the Netbox UI
[12:44:44] volans: that is a good trick, nice one
[12:44:50] yes I will advise there now
[12:45:53] doh, I set the ttl of the lock too short, just 300s, didn't consider this use case :D
[12:48:37] I'll re-fire it in a few mins
[12:49:17] while true that sucker
[12:50:06] topranks: this change of yours will not be included in the restore
[12:50:09] https://netbox.wikimedia.org/extras/changelog/?request_id=408b7dae-8b11-4059-81de-3171c50b10a2
[12:50:31] so you'll have to redo those 2 by hand
[12:51:45] thanks yep, was trying to undo one of them anyway :P
[12:52:24] the undo just got waaay bigger :D
[12:54:58] (SystemdUnitFailed) firing: netbox_ganeti_ulsfo_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:00:02] volans: it says the db is in use and I can't delete it, in the postgres cli
[13:00:12] postgres=# DROP DATABASE netbox;
[13:00:12] ERROR: database "netbox" is being accessed by other users
[13:00:12] DETAIL: There is 1 other session using the database.
[13:00:26] Should I stop the netbox service or something perhaps?
[13:00:29] have you downtimed netbox?
[13:00:48] never happened to me
[13:00:53] no, sorry, I should have
[13:01:16] and IIRC there are always open connections, so it should have happened before too, maybe it's a newer postgres
[13:01:29] yes I'd say stop netbox for a moment and retry
[13:10:17] how's it going?
[13:14:26] volans: yeah I'm not making good progress here
[13:14:34] what's the issue?
[13:14:35] there are more sessions now than when I began
[13:15:05] I stopped the rq-netbox service on netbox1002, also on netbox2002, as there were connections from that
[13:15:27] but postgres on netboxdb1002 still has sessions from netbox1002 and netboxdb2002
[13:15:58] did I stop the wrong service?
[13:16:02] checking
[13:16:06] thanks man
[13:16:24] you didn't stop uwsgi
[13:16:28] that's netbox
[13:17:08] hmm... I did issue the command on netbox1002 - "sudo systemctl stop uwsgi-netbox.service"
[13:17:11] unless we changed the puppetization, as I don't see it
[13:17:39] I also see uwsgi-netbox-scriptproxy.service
[13:17:41] probably related
[13:17:56] Nov 29 13:08:39 netbox1002 systemd[1]: Stopping uwsgi-netbox uwsgi app...
[13:17:56] Nov 29 13:08:40 netbox1002 systemd[1]: uwsgi-netbox.service: Succeeded.
[13:18:36] yep yep
[13:18:40] stop the other one too
[13:18:46] the proxy for the scripts
[13:19:04] ok
[13:20:20] this failure mode is new btw
[13:22:15] yeah, I'd say the reports are running off systemd timers
[13:22:51] not sure if I should stop the timer execution, or how best to prevent them from kicking off for a while
[13:23:17] 10CFSSL-PKI, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (10MoritzMuehlenhoff)
[13:24:05] most of them are "ganeti-sync" services
[13:24:43] they go via the API
[13:24:47] so if netbox is down
[13:24:49] you don't care
[13:25:35] ok
[13:26:04] db dropped :)
[13:26:31] yay
[13:27:10] restoring from backup now
[13:27:22] volans: did you do anything?
[13:27:49] no
[13:28:34] https://netbox.wikimedia.org/dcim/devices/3929/
[13:28:37] yay :)
[13:29:00] ok - I think it was just a matter of either waiting until the reports completed
[13:29:00] (SystemdUnitFailed) firing: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:29:35] I manually killed this, which was the report running:
[13:29:42] https://www.irccloud.com/pastebin/9jMCs5EP/
[13:29:55] k
[13:29:58] (SystemdUnitFailed) firing: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:29:59] yeah the slow one
[13:30:13] it owned the tcp process that was connecting, I wasn't sure how to properly stop it via systemctl without disabling the timer completely etc.
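[Editor's note] The "database is being accessed by other users" error from 13:00 can also be cleared inside postgres itself, without chasing every client service: pg_stat_activity lists the blocking backends, and pg_terminate_backend() kicks them, after which DROP DATABASE succeeds. A minimal sketch; the SQL uses PostgreSQL's real catalog interfaces, while the small Python wrapper is only for illustration:

```python
# Sketch: build the SQL that terminates every *other* session connected to a
# database, so a subsequent DROP DATABASE no longer fails with
# "database ... is being accessed by other users".
# pg_stat_activity, pg_terminate_backend() and pg_backend_pid() are the real
# PostgreSQL interfaces; the wrapper function is illustrative only.

def terminate_other_sessions_sql(dbname: str) -> str:
    """Return SQL that kicks all sessions on `dbname` except our own."""
    # Keep the sketch simple: only plain identifiers, no quoting edge cases.
    assert dbname.isidentifier()
    return (
        "SELECT pg_terminate_backend(pid) "
        "FROM pg_stat_activity "
        f"WHERE datname = '{dbname}' AND pid <> pg_backend_pid();"
    )

print(terminate_other_sessions_sql("netbox"))
```

Run against the primary (e.g. via `sudo -u postgres psql`) this frees the database in one shot; stopping the uwsgi/rq clients first, as done above, just keeps them from reconnecting.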
[13:31:40] the timer is scheduled to run again in 18 mins for that, so I think it looks ok
[13:31:46] ack
[13:31:46] volans: thanks for the help!
[13:32:17] https://netbox.wikimedia.org/extras/changelog/ LGTM
[13:32:22] thanks for fixing
[13:32:26] feel free to improve the docs
[13:32:41] on how to stop the timers, and also maybe disable puppet and downtime (not sure if they are there)
[13:33:14] While we're at netbox, I'll be borrowing netbox-next again
[13:33:54] 5 EUR/10 minutes
[13:34:00] (SystemdUnitFailed) firing: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:34:11] volans: I will add some notes, I think the learnings I have are incomplete but we have most of the picture, I will add as much info as I can
[13:34:23] Do you accept DKK or Christmas cookies?
[13:34:36] Based on those prices I can only imagine how much I owe you in service credits for the downtime :D
[13:34:51] * volans coookieeeees
[13:34:58] (SystemdUnitFailed) firing: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:35:12] thanks topranks
[13:39:00] (SystemdUnitFailed) firing: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:48] topranks: some upstream issues where you could contribute your feedback:
[13:39:51] https://github.com/netbox-community/netbox/issues/13954
[13:39:58] (SystemdUnitFailed) firing: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:58] https://github.com/netbox-community/netbox/issues/8551
[13:40:15] oh nice, yep let me chime in :)
[13:40:34] I really think it's bad UI and I would instead open a bug
[13:40:47] asking that if there is a table with some rows selected, the top button be grayed out
[13:40:50] and disabled
[13:40:53] :D
[13:42:20] * volans tempted to open a bug for that
[13:44:00] (SystemdUnitFailed) firing: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:44:30] volans: we unfortunately have to indicate that when opening a new FR or bug report... https://usercontent.irccloud-cdn.com/file/yrF1BkDq/Screenshot%202023-11-29%20at%2014-43-37%20New%20Issue%20%C2%B7%20netbox-community_netbox.png
[13:44:31] topranks: did you restart everything? run puppet?
[13:44:58] (SystemdUnitFailed) resolved: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:45:18] volans: I restarted the services, didn't run puppet, let me do that now
[13:45:59] XioNoX: it's the same on demo.netbox.dev ;)
[13:46:02] so I can use that
[13:46:09] The proposal in the issue you linked I think would prevent it - i.e. if we had to type the name of the device when deleting
[13:46:19] volans: hahaha, you will be scolded by Jeremy
[13:46:26] topranks: also check icinga, and when all green remove the downtime if it doesn't expire soon
[13:48:09] XioNoX: I can poke the CEO directly now :-P
[13:50:11] volans: yep, it already expired, the report icinga alerts fired but they will get re-run in the next few mins and should clear
[13:50:19] k
[13:54:23] I guess you all found a topic to talk about for the podcast :-9
[13:54:48] volans: Shouldn't I be able to locate netbox-next log messages in logstash? I can find the production hosts, but not netbox-dev2002
[13:55:07] they are in /srv/log/netbox/...
[13:55:15] not sure if shipped to logstash
[13:55:58] Ah, looked in that, so it turns out that I'm just bad at logging :-(
[13:56:27] you might need to enable debug logging, as netbox doesn't log much
[13:56:30] in general
[13:56:33] nothing really useful
[13:56:51] moritzm: yeah, although not sure we want to spend the time saying "we're really dumb - we keep hitting the wrong button" :P
[13:57:03] This is my own log, so I doubt I can blame Netbox
[14:04:59] topranks: just word it as "the software had inadequate protections against experienced operators inadvertently performing a wrong, destructive action" and you'll sound much smarter
[14:05:30] haha
[14:06:47] you know there is some truth in the "experienced operator" bit, if I was less familiar with it I'd have been double-checking my steps more, instead I was moving fairly quickly given I "know what I'm doing"
[14:07:24] not the first time that assumption has landed me in trouble :(
[14:07:47] just reset your RAM every night so you're no longer familiar with the tools :D
[14:08:19] or we can ask upstream to randomly ask for captchas to resolve
[14:08:45] if you really want to remove that switch, please identify all fire hydrants
[14:09:02] lol
[14:09:21] you'll need them after we delete it and everything is on fire
[14:09:33] hahahaha
[14:09:55] a modern captcha would be something like "click on the arrows until this USB drive is oriented correctly"
[14:09:57] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[14:15:18] Yeeeeah, I've worked around yet another Netbox limitation
[14:15:38] taavi: Can you try to log in on https://netbox-next.wikimedia.org/
[14:15:55] sure, one moment
[14:16:07] I'll get one of those cookies in the meantime :-)
[14:18:26] slyngs: it works.. but it seems like now it creates a netbox group for every single group I'm in in LDAP (including cloud vps projects and toolforge tools), compared to only matching the ops/nda/wmf groups like netbox.w.o does. I also don't seem to be a django superuser on netbox-next so I can't access the admin panel (https://netbox-next.wikimedia.org/admin/) unlike the main netbox install
[14:19:14] Hmm, yeah, might need a limitation there and a "make user superuser / staff"
[14:19:14] the current one on netbox prod is using the CAS protocol with our custom code, so it was tailored to what we needed
[14:19:55] We need a pipeline for Social Auth to do the same with OIDC, Netbox doesn't really support groups if you do SSO
[14:19:56] this one is (trying to) use the OIDC default implementation, so unless they have knobs to define the groups to assign, yes it's possible we'll end up with a lot more groups assigned
[14:20:44] I'll have to make the knobs, so we can do pretty much whatever we want.
[14:21:24] taavi: Okay, thanks, I'll work out the superuser/staff bits
[14:21:24] :'( we'll end up keeping the netbox fork forever
[14:21:40] this time I had my hopes high :D
[14:21:47] or maybe we can send some of the knobs upstream?
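[Editor's note] The missing pipeline step named in the T351950 comment (extras.social_pipeline.add_user_to_groups) and the group explosion taavi describes above could be handled by a step that filters the incoming OIDC claims against an allowlist. A minimal self-contained sketch; the allowlist contents, the "groups" claim key, the superuser rule, and the plain User stand-in (instead of Django's real user/Group models) are all assumptions:

```python
# Sketch of a python-social-auth pipeline step in the spirit of
# extras.social_pipeline.add_user_to_groups (T351950): mirror only an
# allowlist of LDAP groups into Netbox instead of creating a group for
# every cloud-vps/toolforge membership, and grant superuser/staff from a
# designated group. All names here are assumptions, not the real config.

ALLOWED_GROUPS = {"ops", "nda", "wmf"}   # assumed allowlist
SUPERUSER_GROUPS = {"ops"}               # assumed: these get the admin panel

class User:
    """Stand-in for the Django user model, to keep the sketch runnable."""
    def __init__(self):
        self.groups = set()
        self.is_superuser = False
        self.is_staff = False

def add_user_to_groups(response, user, *args, **kwargs):
    """Pipeline step: `response` is the decoded OIDC claim set."""
    ldap_groups = set(response.get("groups", []))
    # Only mirror allowlisted groups; drop project-* and tool memberships.
    user.groups = ldap_groups & ALLOWED_GROUPS
    user.is_superuser = user.is_staff = bool(ldap_groups & SUPERUSER_GROUPS)

u = User()
add_user_to_groups({"groups": ["ops", "project-toolsbeta", "wmf"]}, u)
print(sorted(u.groups), u.is_superuser)  # ['ops', 'wmf'] True
```

In a real deployment the step would use Group.objects.get_or_create() and user.groups.set() on the Django ORM, but the filtering logic stays the same.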
[14:22:07] Yeeeeah, I tried, it comflicted with their commercial offering :-) [14:22:32] SSO==enterprise in every software unfortunately [14:22:34] What we can do it bundle the pipeline as a separate Python package. [14:22:57] It's all contained in a module anyway, it doesn't have to be in the Netbox repo [14:25:21] I'm going to do a separate repo, then we can build as a package and keep netbox unforked [14:29:37] wfm, if you don't need to specify imports in the netbox code but it's enough to list it in config [14:30:32] You just add the pipeline as a string to SOCIAL_AUTH_PIPELINE in the configuration, it just need to be a valid module path [14:31:23] k [14:31:35] It's going to be great :-) [14:31:58] Hard to test though [14:32:00] lol [14:36:22] topranks: should we resolve T352286 ? [14:36:23] T352286: Restore Netbox DB from before lsw1-e1-eqiad was removed - https://phabricator.wikimedia.org/T352286 [14:36:30] thx for opening it btw [14:44:31] let me add a little detail and do so - I did in a rush to have task id to run the downtime [14:45:12] <3 [15:00:56] 10CFSSL-PKI, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (10MoritzMuehlenhoff) [15:54:04] (SystemdUnitFailed) firing: prometheus-puppet-ca-exporter.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:59:01] (SystemdUnitFailed) resolved: prometheus-puppet-ca-exporter.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:28:59] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - 
https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [18:09:58] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [22:14:00] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
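[Editor's note] The SOCIAL_AUTH_PIPELINE wiring discussed around 14:30 (adding the custom step as a dotted module path, with no imports needed in the Netbox code) might look roughly like this in the Netbox configuration. The social_core entries are python-social-auth's stock default pipeline; the final entry reuses the extras.social_pipeline.add_user_to_groups name from the T351950 comment, and whether it would really be listed under that path is an assumption:

```python
# Netbox configuration.py fragment (sketch): python-social-auth calls these
# dotted paths in order, so the custom group-sync step can live in a
# separate package and the Netbox tree stays unforked.
SOCIAL_AUTH_PIPELINE = (
    "social_core.pipeline.social_auth.social_details",
    "social_core.pipeline.social_auth.social_uid",
    "social_core.pipeline.social_auth.auth_allowed",
    "social_core.pipeline.social_auth.social_user",
    "social_core.pipeline.user.get_username",
    "social_core.pipeline.user.create_user",
    "social_core.pipeline.social_auth.associate_user",
    "social_core.pipeline.social_auth.load_extra_data",
    "social_core.pipeline.user.user_details",
    # Custom step, listed by module path only - no import in netbox code.
    "extras.social_pipeline.add_user_to_groups",
)
```

Because the step is referenced purely by string, shipping it as its own Debian/PyPI package and appending one line here is enough to activate it.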