[02:15:41] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:34] (DiskSpace) firing: Disk space build2001:9100:/ 4.185% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:34:29] (SystemdUnitFailed) firing: (2) generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:49:29] (SystemdUnitFailed) firing: (2) generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:19:30] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:45:41] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:32:49] (DiskSpace) firing: Disk space build2001:9100:/ 5.233% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:43:20] FYI, I'm rebooting the netboxdb servers in a few
[07:58:53] all done
[08:57:39] thx
[09:45:42] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:32:50] (DiskSpace) firing: Disk space build2001:9100:/ 5.212% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:11:50] slyngs, volans: I'd go ahead and decom the old debmonitor VMs, is there anything you still want to keep/save on the old VMs?
[12:26:25] moritzm: let me have a quick look after lunch
[12:26:29] I don't recall :D
[12:27:13] sure, no rush :-)
[12:43:37] I'm good, I'd hate to do it, but should everything else fail, there's always git
[12:54:16] what do you mean by "I'd hate to do it"? the new setup works flawlessly, I can't foresee any need to switch back at this point?
[13:13:24] moritzm, slyngs: has anyone checked if the periodic maintenance task in debmonitor is run correctly in the new setup?
[13:13:38] debmonitor-maintenance-gc.timer
[13:14:19] Looks good on 2003, just checking 1003
[13:14:51] it did run tonight and deleted some things
[13:14:54] so looks ok
[13:15:06] but
[13:15:06] You have 1 unapplied migration(s).
[13:15:21] Your project may not work properly until you apply the migrations for app(s): auth.
[13:15:46] Looking, sounds weird as there were no database changes
[13:16:30] might depend on the dependencies/django versions
[13:16:43] auth is not our own app IIRC
[13:16:54] No, that's supplied by Django
[13:18:00] $ sudo debmonitor migrate --plan
[13:18:00] Planned operations:
[13:18:00] auth.0012_alter_user_first_name_max_length
[13:18:00] Alter field first_name on use
[13:20:42] I guess we'll need to do that
[13:20:59] Yup, I can just do that, unless there are any objections
[13:21:14] +1 for me
[13:21:15] Should be fairly safe
[13:21:25] I'm tailing main.log
[13:21:53] Cool, and we're on debmonitor1003
[13:22:04] Just a sec, and I'll apply the migration
[13:22:22] Oh, it didn't ask... Okay, applied
[13:22:40] Normally it asks if you're sure :-)
[13:22:42] all good
[13:23:08] should we dream up a way to monitor that?
[13:23:29] Missing migrations that is
[13:24:32] not sure, the way we were deploying was running the migrations at each deployment
[13:24:42] with the new way... it needs some new way :D
[13:25:01] I took out the auto-apply from the debian package
[13:25:17] slyngs: FYI I just updated https://wikitech.wikimedia.org/w/index.php?title=MariaDB%2Fmisc&diff=2145558&oldid=2126423
[13:25:28] could you please double check it before I notify the DBAs?
[13:26:00] That's perfect
[13:26:02] doh maybe I should have put sly.ngs instead of simon... up to you :D
[13:27:20] Also just says Moritz
[13:27:43] and Volans? :D
[13:27:44] So I think we're fine.... until someone hires an additional Simon
[13:28:06] Fun fact: there are two of me, but one is Swedish
[13:28:46] :D
[13:33:22] slyngs: for manual interaction what's the best way now? before we were doing . /srv/deployment/debmonitor/venv/bin/activate and running DJANGO_SETTINGS_MODULE=debmonitor.settings.prod python manage.py
[13:33:46] I'm more lazy than that, just run "debmonitor"
[13:34:08] nice!
[13:34:14] Very :-)
[13:34:23] that solves my question D:
[13:34:38] Depending on what you need either sudo -u www-data or just sudo, both should be fine.
[13:34:59] slyngs, moritzm: if you agree I'd rather rm -rf /srv/log/debmonitor on both new hosts to avoid confusion as we're now logging in /var/log/debmonitor
[13:35:28] I do think it's "sudo debmonitor migrate" that does prompt for confirmation and just apply migrations
[13:35:34] moritzm: Yes please
[13:35:40] sounds good
[13:35:45] ok doing
[13:36:41] should I also delete /srv/deployment/debmonitor?
[13:37:12] Yes, that appeared because Puppet tried to apply the old configuration
[13:37:14] I guess puppet didn't clean up everything
[13:37:21] from the old setup
[13:37:40] No, I was considering just reimaging the 2003 host
[13:37:47] it might be a good idea
[13:39:34] {done} puppet noop
[13:40:03] Cool, I'll just note down that I should reimage 2003
[13:40:11] 1003 too?
[13:40:45] there was no /srv/deployment there but there was /srv/log/debmonitor
[13:40:58] No, that's not required, that was insetup until everything worked. The log was me debugging
[13:42:14] perfect, thx
[13:42:55] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10SLyngshede-WMF)
[13:43:14] I'll just go ahead and reimage 2003, we're running on 1003 currently
[13:43:37] ack
[13:44:21] sgtm
[13:45:02] so any remaining objections against decomming 1002/2002?
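For the "monitor missing migrations" idea raised above (13:23), one possible shape is sketched below. It is only a sketch, not the deployed setup: it assumes the debmonitor wrapper forwards arguments to Django's manage.py (as the migrate --plan paste above suggests), that the installed Django version supports migrate --check (which exits non-zero when unapplied migrations exist, available in recent Django releases), and the textfile path and metric name are invented for illustration.

  #!/bin/bash
  # Sketch: export "are there unapplied debmonitor migrations?" as a node_exporter
  # textfile metric. Path and metric name are assumptions, not taken from Puppet.
  OUTFILE=/var/lib/prometheus/node.d/debmonitor_migrations.prom
  if sudo -u www-data debmonitor migrate --check >/dev/null 2>&1; then
      STATUS=0   # all migrations applied
  else
      STATUS=1   # unapplied migrations found (or the check itself failed)
  fi
  # Write atomically so the collector never reads a half-written file.
  TMP=$(mktemp "${OUTFILE}.XXXXXX")
  echo "debmonitor_unapplied_migrations ${STATUS}" > "$TMP"
  mv "$TMP" "$OUTFILE"

Run from a systemd timer, this would surface through the same alerting path as the other textfile checks mentioned in this channel.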
[13:45:18] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host debmonitor2003.codfw.wmnet with OS bookworm
[13:45:28] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10SLyngshede-WMF) 05Open→03In progress p:05Triage→03Low
[13:45:42] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:50:30] moritzm: all good from my side, I don't have anything to keep at this point
[14:50:54] volans is there a host I can reimage to test that install-console CR?
[14:51:38] inflatador: usually any of the sretest* hosts are ok, just ask in here if anyone is using the one you pick, also check which OS they are running to re-install them in the same OS version
[14:54:13] volans ACK, will do
[14:56:46] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host debmonitor2003.codfw.wmnet with OS bookworm completed: - debmonitor2003 (**WARN**) - Downtimed on...
[15:14:46] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10SLyngshede-WMF) envoyproxy yet again failed to build its configuration file. Manually ran generation script and removed downtime.
[15:14:54] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10SLyngshede-WMF) 05In progress→03Resolved
[15:32:50] (DiskSpace) firing: Disk space build2001:9100:/ 5.207% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:35:52] moritzm: should we do some cleanup for build2001 space? ^^^
[15:36:43] I had a quick peek on Friday, there's a huge chunk (120G or so) related to the cloud image builds
[15:37:11] wow, that seems like a lot :)
[15:37:14] haven't looked closer yet, we're possibly missing some cleanup timer there
[15:41:47] ack, lmk if you want me to have a look
[15:42:15] sure, go ahead :-)
[15:42:44] Planning on reimaging sretest2003 in ~20m to test the reimage cookbook. If that's going to be a problem LMK
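Following the advice at 14:51 (pick a sretest* host, check whether anyone is using it, and keep its current OS), a quick pre-check might look like the snippet below. This is illustrative only: the cumin invocation and the placeholder flags are assumptions, not commands taken from this log.

  # Check which OS each sretest host currently runs before picking one to reimage.
  sudo cumin 'sretest*' 'grep PRETTY_NAME /etc/os-release'
  # Then keep the same OS when reimaging, reusing the cookbook flags seen later in
  # this log (placeholders, fill in the real values):
  # sudo cookbook sre.hosts.reimage --os <current-os> -t <task-id> sretest2003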
[15:45:41] SGTM, thx
[15:46:34] XioNoX ^^ any objections to the above reimage, I see you logged in there on Thursday
[15:46:45] he's out this week
[15:51:09] no worries, based on `last` he seems to have hit all test hosts at the same time last wk
[15:52:09] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[16:39:30] (SystemdUnitFailed) firing: (2) prometheus-dpkg-success-textfile.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:50:30] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney)
[16:55:23] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney)
[18:03:31] 10netops, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10Jhancock.wm) rack is physically prepped for tomorrow.
[18:24:33] looks like NIC firmware issues on sretest2003...updating now, but I'm going to use the VM (sretest2005) to test the cookbook
[18:27:18] inflatador: sretest2005 is special, it's part of the tests on routed ganeti
[18:27:25] it would probably not work
[18:27:35] and be stuck without some manual intervention
[18:27:39] it's a WIP
[18:28:01] volans apologies, should I abort?
[18:28:09] depends where you are :D
[18:28:14] volans nm, it's too late ;(
[18:28:30] sorry about that
[18:29:45] it's a test host, no big deal, maybe let arzhel know (but he's off this week)
[18:33:31] ACK, will drop a note in T300152
[18:33:31] T300152: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152
[18:34:15] thx
[18:52:33] 10netops, 10Ganeti, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10bking) @ayounsi Apologies for the trouble, I didn't realize `sretest2005` was in active use. Unfortunately, I reimaged it while I was working on T3...
[19:32:50] (DiskSpace) firing: Disk space build2001:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:53:33] o/ I have a weird problem with reimages (again!)
[19:55:05] I ran `sudo cookbook sre.hosts.reimage -t T351074 --os bullseye mw1386 --new` and for whatever reason the debian installer is sitting there waiting for an interactive install
[19:55:07] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[20:31:18] kamila_: have you looked at the console via mgmt?
[20:31:31] I see the host is in a busybox right now
[20:31:51] last time I looked it was in the interactive installer
[20:32:17] but waiting for which answer?
[20:32:31] https://usercontent.irccloud-cdn.com/file/2BajmnY2/image.png
[20:32:49] actually originally it wanted network config, I tried to proceed through that out of curiosity
[20:33:19] the "automated" installer is just the normal one with answers pre-answered, if an answer is not valid, not there, etc... it just asks for it
[20:33:27] there is no such thing as a non-interactive installer ;)
[20:33:39] right
[20:33:51] I think mw1387 is still on the original screen?
[20:34:23] (yes, I ran a bunch in a row again '^^)
[20:36:09] oh, not anymore...
[20:37:53] modules/profile/data/profile/installserver/preseed.yaml looks weird to me
[20:38:00] 1388 is though
[20:38:02] 06763bffb6a
[20:38:04] oh, did I break it?
[20:38:26] not sure yet
[20:38:53] things like mw138[0-368]
[20:38:56] look weird
[20:39:21] unless that's correctly doing 1380-1383,1386,1388
[20:39:27] and I'm just tired :D
[20:39:30] (SystemdUnitFailed) firing: (2) prometheus-dpkg-success-textfile.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:40:21] I believe that's correct
[20:42:14] it used to be 138[0-3] and I added 1386 and 1388
[20:42:28] (and I'm tired too and 1377 is not part of the pile I'm reimaging '^^)
[20:42:42] 1378 is in the original state if you check the console
[20:43:45] this is the log tab https://www.irccloud.com/pastebin/mdzSh0r5/
[20:47:54] and it's sitting there like this https://usercontent.irccloud-cdn.com/file/XkUyr5hc/image.png
[20:50:11] it should get it from dhcp
[20:51:56] hm, then that's weird and potentially bad :D
[20:53:11] agree, but it's already quite late over here... I might not be able to dig more tonight (have already worked until now basically)
[20:54:16] sure, no worries
[20:54:39] it's weird, it's happening in codfw too (mw2317)
[20:54:51] but it'll probably be weird tomorrow too :D
[20:55:48] could you please open a task with the host list and their status?
[20:55:56] so I know what to look at and what to expect tomorrow :)
[20:56:17] sorry for the trouble, I'm unsure what it could be as nothing that I know of changed in the last few days
[20:56:24] sure
[20:56:25] thank you!
[20:56:50] sorry for troubling you again, I seem to be a weird bug magnet :D
[20:57:03] ahahah
[20:57:13] nah it's not you, that's called hardware ;)
[20:57:34] let's pretend that that's true :D
[20:57:44] good night o/
[20:58:13] :D
[20:58:20] thx you too
[23:32:50] (DiskSpace) firing: Disk space build2001:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
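On the mw138[0-368] pattern discussed above (20:38 to 20:42): a character class matches a single character, so [0-368] matches the digits 0 through 3 plus 6 and 8, which is why the pattern covers mw1380-mw1383, mw1386 and mw1388 but not mw1384, mw1385 or mw1387. A quick way to convince yourself, assuming the preseed entries are evaluated as regular expressions (how preseed.yaml actually applies them is not shown in this log); the host list is only for illustration:

  #!/bin/bash
  # Show which hostnames the character class actually matches.
  for h in mw1380 mw1383 mw1384 mw1386 mw1387 mw1388; do
      if echo "$h" | grep -qE '^mw138[0-368]$'; then
          echo "$h: matches"
      else
          echo "$h: no match"
      fi
  done
  # Expected output: mw1380, mw1383, mw1386 and mw1388 match; mw1384 and mw1387 do not.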