[02:08:59] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[06:09:00] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[08:52:58] Is anyone using netbox-next or idp-test? I'd like to run a few tests, but that may make netbox-next unavailable for a few minutes here and there
[08:53:17] go ahead for me
[08:53:58] +1
[08:55:15] Cool, everything is easily reverted if need be. I'll "grab" both servers
[09:12:08] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[09:13:38] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[09:19:40] Okay, so I actually managed to make netbox use OIDC just PLAIN, without my own CAS backend
[09:22:02] slyngs: that sounds like a good thing :)
[09:22:29] I believe it is :-)
[09:23:32] Okay, I'm just going to roll back and do a Puppet patch, and then I'll have taavi test that login works
[09:25:20] +1 thx
[09:31:10] To be fair it was probably broken because I did it wrong the first time around, but such is life :-)
[10:09:00] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[11:20:57] 10netbox, 10DC-Ops, 10Infrastructure-Foundations, 10SRE: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036 (10Volans)
[12:36:23] 10netbox, 10Infrastructure-Foundations, 10Patch-For-Review: taavi's netbox-next account is stuck - https://phabricator.wikimedia.org/T351950 (10SLyngshede-WMF) The extras.social_pipeline.add_user_to_groups never got added, hence Netbox isn't actually able to add groups. For some unknown reason group synchron...
[12:39:33] volans: I just messed up and accidentally deleted a device in Netbox :(
[12:39:46] I think I'm gonna need to do a db restore
[12:40:06] server or network device?
[12:40:44] ah switch doh
[12:40:47] yeah
[12:40:47] network device
[12:40:54] yeah too complex to rebuild
[12:40:55] ugh
[12:41:15] let's try to find a dump after 11:18
[12:41:28] that UI with the tick boxes of interfaces: delete at the bottom deletes the things ticked, delete at the top removes the device
[12:41:32] is what got me
[12:41:49] I know
[12:41:53] it's always that
[12:41:55] yeah I can probably do it, I'm looking at the wikitech page here
[12:42:07] although it gives a summary of what will be done
[12:42:33] psql-all-dbs-2023-11-29-11-37.sql.gz should do it
[12:42:39] yeah it's improved, it now warns you - I'm not defending myself, it was dumb
[12:42:55] thanks - I will give it a go
[12:43:03] I'm saying it's bad UI, you're not the first nor the last
[12:43:22] lmk if you need me to do it
[12:43:30] but I'd prefer to have others comfortable doing it too
[12:44:14] topranks: you could also start a sre.dns.netbox cookbook to just hold the lock (without letting it finish, leaving it at the prompt)
[12:44:33] also tell people in -sre and -dcops not to make changes in the Netbox UI
[12:44:44] volans: that is a good trick, nice one
[12:44:50] yes I will advise there now
[12:45:53] doh, I set the ttl of the lock too short, just 300s, didn't consider this use case :D
[12:48:37] I'll re-fire it in a few mins
[12:49:17] while true that sucker
[12:50:06] topranks: this change of yours will not be included in the restore
[12:50:09] https://netbox.wikimedia.org/extras/changelog/?request_id=408b7dae-8b11-4059-81de-3171c50b10a2
[12:50:31] so you'll have to redo those 2 by hand
[12:51:45] thanks yep, was trying to undo one of them anyway :P
[12:52:24] the undo just got waaay bigger :D
[12:54:58] (SystemdUnitFailed) firing: netbox_ganeti_ulsfo_sync.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:00:02] volans: it says the db is in use and I can't delete it, in the postgres cli
[13:00:12] postgres=# DROP DATABASE netbox;
[13:00:12] ERROR: database "netbox" is being accessed by other users
[13:00:12] DETAIL: There is 1 other session using the database.
[13:00:26] Should I stop the netbox service or something perhaps?
[13:00:29] have you downtimed netbox?
[13:00:48] never happened to me
[13:00:53] no, sorry, I should have
[13:01:16] and IIRC there are always open connections, so it should have happened before too, maybe it's a newer postgres
[13:01:29] yes I'd say stop netbox for a moment and retry
[13:10:17] how's it going?
[13:14:26] volans: yeah I'm not making good progress here
[13:14:34] what's the issue?
[13:14:35] there are more sessions now than when I began
[13:15:05] I stopped the rq-netbox service on netbox1002, also on netbox2002, as there were connections from that
[13:15:27] but postgres on netboxdb1002 still has sessions from netbox1002 and netboxdb2002
[13:15:58] did I stop the wrong service?
[13:16:02] checking
[13:16:06] thanks man
[13:16:24] you didn't stop uwsgi
[13:16:28] that's netbox
[13:17:08] hmm... I did issue the command on netbox1002 - "sudo systemctl stop uwsgi-netbox.service"
[13:17:11] unless we changed the puppetization, as I don't see it
[13:17:39] I also see uwsgi-netbox-scriptproxy.service
[13:17:41] probably related
[13:17:56] Nov 29 13:08:39 netbox1002 systemd[1]: Stopping uwsgi-netbox uwsgi app...
[13:17:56] Nov 29 13:08:40 netbox1002 systemd[1]: uwsgi-netbox.service: Succeeded.
[13:18:36] yep yep
[13:18:40] stop the other one too
[13:18:46] the proxy for the scripts
[13:19:04] ok
[13:20:20] this failure mode is new btw
[13:22:15] yeah, I'd say the reports are running off systemd timers
[13:22:51] not sure if I should stop the timer execution, or how best to prevent them from kicking off for a while
[13:23:17] 10CFSSL-PKI, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (10MoritzMuehlenhoff)
[13:24:05] most of them are "ganeti-sync" services
[13:24:43] they go via the API
[13:24:47] so if netbox is down
[13:24:49] you don't care
[13:25:35] ok
[13:26:04] db dropped :)
[13:26:31] yay
[13:27:10] restoring from backup now
[13:27:22] volans: did you do anything?
[13:27:49] no
[13:28:34] https://netbox.wikimedia.org/dcim/devices/3929/
[13:28:37] yay :)
[13:29:00] ok - I think it was just a matter of either waiting until the reports completed
[13:29:00] (SystemdUnitFailed) firing: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:29:35] I manually killed this, which was the report running:
[13:29:42] https://www.irccloud.com/pastebin/9jMCs5EP/
[13:29:55] k
[13:29:58] (SystemdUnitFailed) firing: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:29:59] yeah the slow one
[13:30:13] it owned the tcp process that was connecting, I wasn't sure how to properly stop it via systemctl without disabling the timer completely etc.
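[Editor's note] The "database is being accessed by other users" error from 13:00 can also be cleared inside postgres itself, without chasing every client service: pg_stat_activity lists the blocking backends, and pg_terminate_backend() kicks them, after which DROP DATABASE succeeds. A minimal sketch; the SQL uses PostgreSQL's real catalog interfaces, while the small Python wrapper is only for illustration:

```python
# Sketch: build the SQL that terminates every *other* session connected to a
# database, so a subsequent DROP DATABASE no longer fails with
# "database ... is being accessed by other users".
# pg_stat_activity, pg_terminate_backend() and pg_backend_pid() are the real
# PostgreSQL interfaces; the wrapper function is illustrative only.

def terminate_other_sessions_sql(dbname: str) -> str:
    """Return SQL that kicks all sessions on `dbname` except our own."""
    # Keep the sketch simple: only plain identifiers, no quoting edge cases.
    assert dbname.isidentifier()
    return (
        "SELECT pg_terminate_backend(pid) "
        "FROM pg_stat_activity "
        f"WHERE datname = '{dbname}' AND pid <> pg_backend_pid();"
    )

print(terminate_other_sessions_sql("netbox"))
```

Run against the primary (e.g. via `sudo -u postgres psql`) this frees the database in one shot; stopping the uwsgi/rq clients first, as done above, just keeps them from reconnecting.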
[13:31:40] the timer is scheduled to run again in 18 mins for that, so I think it looks ok
[13:31:46] ack
[13:31:46] volans: thanks for the help!
[13:32:17] https://netbox.wikimedia.org/extras/changelog/ LGTM
[13:32:22] thanks for fixing
[13:32:26] feel free to improve the docs
[13:32:41] on how to stop the timers, and also maybe disable puppet and downtime (not sure if they are there)
[13:33:14] While we're at netbox, I'll be borrowing netbox-next again
[13:33:54] 5 EUR/10 minutes
[13:34:00] (SystemdUnitFailed) firing: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:34:11] volans: I will add some notes, I think the learnings I have are incomplete but we have most of the picture, I will add as much info as I can
[13:34:23] Do you accept DKK or Christmas cookies?
[13:34:36] Based on those prices I can only imagine how much I owe you in service credits for the downtime :D
[13:34:51] * volans coookieeeees
[13:34:58] (SystemdUnitFailed) firing: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:35:12] thanks topranks
[13:39:00] (SystemdUnitFailed) firing: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:48] topranks: some upstream issues where you could contribute your feedback:
[13:39:51] https://github.com/netbox-community/netbox/issues/13954
[13:39:58] (SystemdUnitFailed) firing: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:39:58] https://github.com/netbox-community/netbox/issues/8551
[13:40:15] oh nice, yep let me chime in :)
[13:40:34] I really think it's bad UI and I would instead open a bug
[13:40:47] asking that if there is a table with some rows selected, the top button be grayed out
[13:40:50] and disabled
[13:40:53] :D
[13:42:20] * volans tempted to open a bug for that
[13:44:00] (SystemdUnitFailed) firing: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:44:30] volans: we unfortunately have to indicate that when opening a new FR or bug report... https://usercontent.irccloud-cdn.com/file/yrF1BkDq/Screenshot%202023-11-29%20at%2014-43-37%20New%20Issue%20%C2%B7%20netbox-community_netbox.png
[13:44:31] topranks: did you restart everything? run puppet?
[13:44:58] (SystemdUnitFailed) resolved: (10) check_netbox_uncommitted_dns_changes.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:45:18] volans: I restarted the services, didn't run puppet, let me do that now
[13:45:59] XioNoX: it's the same on demo.netbox.dev ;)
[13:46:02] so I can use that
[13:46:09] The proposal in the issue you linked I think would prevent it - i.e. if we had to type the name of the device when deleting
[13:46:19] volans: hahaha, you will be scolded by Jeremy
[13:46:26] topranks: also check icinga, and when all green remove the downtime if it doesn't expire soon
[13:48:09] XioNoX: I can poke the CEO directly now :-P
[13:50:11] volans: yep, it already expired, the report icinga alerts fired but they will get re-run in the next few mins and should clear
[13:50:19] k
[13:54:23] I guess you all found a topic to talk about for the podcast :-9
[13:54:48] volans: Shouldn't I be able to locate netbox-next log messages in logstash? I can find the production hosts, but not netbox-dev2002
[13:55:07] they are in /srv/log/netbox/...
[13:55:15] not sure if shipped to logstash
[13:55:58] Ah, looked in that, so it turns out that I'm just bad at logging :-(
[13:56:27] you might need to enable debug logging, as netbox doesn't log much
[13:56:30] in general
[13:56:33] nothing really useful
[13:56:51] moritzm: yeah, although not sure we want to spend the time saying "we're really dumb - we keep hitting the wrong button" :P
[13:57:03] This is my own log, so I doubt I can blame Netbox
[14:04:59] topranks: just word it as "the software had inadequate protections against experienced operators inadvertently performing a wrong, destructive action" and you'll sound much smarter
[14:05:30] haha
[14:06:47] you know there is some truth in the "experienced operator" bit, if I was less familiar with it I'd have been double-checking my steps more, instead I was moving fairly quickly given I "know what I'm doing"
[14:07:24] not the first time that assumption has landed me in trouble :(
[14:07:47] just reset your RAM every night so you're no longer familiar with the tools :D
[14:08:19] or we can ask upstream to randomly ask for captchas to resolve
[14:08:45] if you really want to remove that switch, please identify all fire hydrants
[14:09:02] lol
[14:09:21] you'll need them after we delete it and everything is on fire
[14:09:33] hahahaha
[14:09:55] a modern captcha would be something like "click on the arrows until this USB drive is oriented correctly"
[14:09:57] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[14:15:18] Yeeeeah, I've worked around yet another Netbox limitation
[14:15:38] taavi: Can you try to log in on https://netbox-next.wikimedia.org/
[14:15:55] sure, one moment
[14:16:07] I'll get one of those cookies in the meantime :-)
[14:18:26] slyngs: it works.. but it seems like now it creates a netbox group for every single group I'm in in LDAP (including cloud vps projects and toolforge tools), compared to only matching the ops/nda/wmf groups like netbox.w.o does. I also don't seem to be a django superuser on netbox-next so I can't access the admin panel (https://netbox-next.wikimedia.org/admin/) unlike the main netbox install
[14:19:14] Hmm, yeah, might need a limitation there and a "make user superuser / staff"
[14:19:14] the current one on netbox prod is using the CAS protocol with our custom code, so it was tailored to what we needed
[14:19:55] We need a pipeline for Social Auth to do the same with OIDC, Netbox doesn't really support groups if you do SSO
[14:19:56] this one is (trying to) use the OIDC default implementation, so unless they have knobs to define the groups to assign, yes it's possible we'll end up with a lot more groups assigned
[14:20:44] I'll have to make the knobs, so we can do pretty much whatever we want.
[14:21:24] taavi: Okay, thanks, I'll work out the superuser/staff bits
[14:21:24] :'( we'll end up keeping the netbox fork forever
[14:21:40] this time I had my hopes high :D
[14:21:47] or maybe we can send some of the knobs upstream?
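[Editor's note] The missing pipeline step named in the T351950 comment (extras.social_pipeline.add_user_to_groups) and the group explosion taavi describes above could be handled by a step that filters the incoming OIDC claims against an allowlist. A minimal self-contained sketch; the allowlist contents, the "groups" claim key, the superuser rule, and the plain User stand-in (instead of Django's real user/Group models) are all assumptions:

```python
# Sketch of a python-social-auth pipeline step in the spirit of
# extras.social_pipeline.add_user_to_groups (T351950): mirror only an
# allowlist of LDAP groups into Netbox instead of creating a group for
# every cloud-vps/toolforge membership, and grant superuser/staff from a
# designated group. All names here are assumptions, not the real config.

ALLOWED_GROUPS = {"ops", "nda", "wmf"}   # assumed allowlist
SUPERUSER_GROUPS = {"ops"}               # assumed: these get the admin panel

class User:
    """Stand-in for the Django user model, to keep the sketch runnable."""
    def __init__(self):
        self.groups = set()
        self.is_superuser = False
        self.is_staff = False

def add_user_to_groups(response, user, *args, **kwargs):
    """Pipeline step: `response` is the decoded OIDC claim set."""
    ldap_groups = set(response.get("groups", []))
    # Only mirror allowlisted groups; drop project-* and tool memberships.
    user.groups = ldap_groups & ALLOWED_GROUPS
    user.is_superuser = user.is_staff = bool(ldap_groups & SUPERUSER_GROUPS)

u = User()
add_user_to_groups({"groups": ["ops", "project-toolsbeta", "wmf"]}, u)
print(sorted(u.groups), u.is_superuser)  # ['ops', 'wmf'] True
```

In a real deployment the step would use Group.objects.get_or_create() and user.groups.set() on the Django ORM, but the filtering logic stays the same.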
[14:22:07] Yeeeeah, I tried, it comflicted with their commercial offering :-) [14:22:32] SSO==enterprise in every software unfortunately [14:22:34] What we can do it bundle the pipeline as a separate Python package. [14:22:57] It's all contained in a module anyway, it doesn't have to be in the Netbox repo [14:25:21] I'm going to do a separate repo, then we can build as a package and keep netbox unforked [14:29:37] wfm, if you don't need to specify imports in the netbox code but it's enough to list it in config [14:30:32] You just add the pipeline as a string to SOCIAL_AUTH_PIPELINE in the configuration, it just need to be a valid module path [14:31:23] k [14:31:35] It's going to be great :-) [14:31:58] Hard to test though [14:32:00] lol [14:36:22] topranks: should we resolve T352286 ? [14:36:23] T352286: Restore Netbox DB from before lsw1-e1-eqiad was removed - https://phabricator.wikimedia.org/T352286 [14:36:30] thx for opening it btw [14:44:31] let me add a little detail and do so - I did in a rush to have task id to run the downtime [14:45:12] <3 [15:00:56] 10CFSSL-PKI, 10Ganeti, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate Ganeti-rapi to use pki - https://phabricator.wikimedia.org/T350686 (10MoritzMuehlenhoff) [15:54:04] (SystemdUnitFailed) firing: prometheus-puppet-ca-exporter.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:59:01] (SystemdUnitFailed) resolved: prometheus-puppet-ca-exporter.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:28:59] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 3 others: Migrate roles to puppet7 - 
https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [18:09:58] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [22:14:00] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
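[Editor's note] The SOCIAL_AUTH_PIPELINE wiring discussed around 14:30 (adding the custom step as a dotted module path, with no imports needed in the Netbox code) might look roughly like this in the Netbox configuration. The social_core entries are python-social-auth's stock default pipeline; the final entry reuses the extras.social_pipeline.add_user_to_groups name from the T351950 comment, and whether it would really be listed under that path is an assumption:

```python
# Netbox configuration.py fragment (sketch): python-social-auth calls these
# dotted paths in order, so the custom group-sync step can live in a
# separate package and the Netbox tree stays unforked.
SOCIAL_AUTH_PIPELINE = (
    "social_core.pipeline.social_auth.social_details",
    "social_core.pipeline.social_auth.social_uid",
    "social_core.pipeline.social_auth.auth_allowed",
    "social_core.pipeline.social_auth.social_user",
    "social_core.pipeline.user.get_username",
    "social_core.pipeline.user.create_user",
    "social_core.pipeline.social_auth.associate_user",
    "social_core.pipeline.social_auth.load_extra_data",
    "social_core.pipeline.user.user_details",
    # Custom step, listed by module path only - no import in netbox code.
    "extras.social_pipeline.add_user_to_groups",
)
```

Because the step is referenced purely by string, shipping it as its own Debian/PyPI package and appending one line here is enough to activate it.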