[02:15:41] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:34] (DiskSpace) firing: Disk space build2001:9100:/ 4.185% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:34:29] (SystemdUnitFailed) firing: (2) generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:49:29] (SystemdUnitFailed) firing: (2) generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:19:30] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:45:41] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:32:49] (DiskSpace) firing: Disk space build2001:9100:/ 5.233% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:43:20] FYI, I'm rebooting the netboxdb servers in a few
[07:58:53] all done
[08:57:39] thx
[09:45:42] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:32:50] (DiskSpace) firing: Disk space build2001:9100:/ 5.212% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:11:50] slyngs, volans: I'd go ahead and decom the old debmonitor VMs, is there anything you still want to keep/save on the old VMs?
[12:26:25] moritzm: let me have a quick look after lunch
[12:26:29] I don't recall :D
[12:27:13] sure, no rush :-)
[12:43:37] I'm good, I'd hate to do it, but should everything else fail, there's always git
[12:54:16] what do you mean by "I'd hate to do it"? the new setup works flawlessly, I can't foresee any need to switch back at this point?
[13:13:24] moritzm, slyngs: has anyone checked if the periodic maintenance task in debmonitor is run correctly in the new setup?
[13:13:38] debmonitor-maintenance-gc.timer
[13:14:19] Looks good on 2003, just checking 1003
[13:14:51] it did run tonight and deleted some things
[13:14:54] so looks ok
[13:15:06] but
[13:15:06] You have 1 unapplied migration(s).
[13:15:21] Your project may not work properly until you apply the migrations for app(s): auth.
[13:15:46] Looking, sounds weird as there were no database changes
[13:16:30] might depend on the dependencies/django versions
[13:16:43] auth is not our own app IIRC
[13:16:54] No, that's supplied by Django
[13:18:00] $ sudo debmonitor migrate --plan
[13:18:00] Planned operations:
[13:18:00] auth.0012_alter_user_first_name_max_length
[13:18:00] Alter field first_name on use
[13:20:42] I guess we'll need to do that
[13:20:59] Yup, I can just do that, unless there are any objections
[13:21:14] +1 for me
[13:21:15] Should be fairly safe
[13:21:25] I'm tailing main.log
[13:21:53] Cool, and we're on debmonitor1003
[13:22:04] Just a sec, and I'll apply the migration
[13:22:22] Oh, it didn't ask... Okay, applied
[13:22:40] Normally it asks if you're sure :-)
[13:22:42] all good
[13:23:08] should we dream up a way to monitor that?
[13:23:29] Missing migrations that is
[13:24:32] not sure, the way we were deploying was running the migrations at each deployment
[13:24:42] with the new way... it needs some new way :D
[13:25:01] I took out the auto-apply from the debian package
[13:25:17] slyngs: FYI I just updated https://wikitech.wikimedia.org/w/index.php?title=MariaDB%2Fmisc&diff=2145558&oldid=2126423
[13:25:28] could you please double check it before I notify the DBAs?
[13:26:00] That's perfect
[13:26:02] doh maybe I should have put sly.ngs instead of simon... up to you :D
[13:27:20] Also just says Moritz
[13:27:43] and Volans? :D
[13:27:44] So I think we're fine.... until someone hires an additional Simon
[13:28:06] Fun fact: there are two of me, but one is Swedish
[13:28:46] :D
[13:33:22] slyngs: for manual interaction what's the best way now? before we were doing . /srv/deployment/debmonitor/venv/bin/activate and running DJANGO_SETTINGS_MODULE=debmonitor.settings.prod python manage.py
[13:33:46] I'm more lazy than that, just run "debmonitor"
[13:34:08] nice!
[13:34:14] Very :-)
[13:34:23] that solves my question D:
[13:34:38] Depending on what you need either sudo -u www-data or just sudo, both should be fine.
[13:34:59] slyngs, moritzm: if you agree I'd rather rm -rf /srv/log/debmonitor on both new hosts to avoid confusion as we're now logging in /var/log/debmonitor
[13:35:28] I do think it's "sudo debmonitor migrate" that does prompt for confirmation and just apply migrations
[13:35:34] moritzm: Yes please
[13:35:40] sounds good
[13:35:45] ok doing
[13:36:41] should I also delete /srv/deployment/debmonitor?
[13:37:12] Yes, that appeared because Puppet tried to apply the old configuration
[13:37:14] I guess puppet didn't clean up everything
[13:37:21] from the old setup
[13:37:40] No, I was considering just reimaging the 2003 host
[13:37:47] it might be a good idea
[13:39:34] {done} puppet noop
[13:40:03] Cool, I'll just note down that I should reimage 2003
[13:40:11] 1003 too?
[13:40:45] there was no /srv/deployment there but there was /srv/log/debmonitor
[13:40:58] No, that's not required, that was insetup until everything worked. The log was me debugging
[13:42:14] perfect, thx
[13:42:55] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10SLyngshede-WMF)
[13:43:14] I'll just go ahead and reimage 2003, we're running on 1003 currently
[13:43:37] ack
[13:44:21] sgtm
[13:45:02] so any remaining objections against decomming 1002/2002?
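For the "monitor missing migrations" idea raised above (13:23), one possible shape is sketched below. It is only a sketch, not the deployed setup: it assumes the debmonitor wrapper forwards arguments to Django's manage.py (as the migrate --plan paste above suggests), that the installed Django version supports migrate --check (which exits non-zero when unapplied migrations exist, available in recent Django releases), and the textfile path and metric name are invented for illustration.

  #!/bin/bash
  # Sketch: export "are there unapplied debmonitor migrations?" as a node_exporter
  # textfile metric. Path and metric name are assumptions, not taken from Puppet.
  OUTFILE=/var/lib/prometheus/node.d/debmonitor_migrations.prom
  if sudo -u www-data debmonitor migrate --check >/dev/null 2>&1; then
      STATUS=0   # all migrations applied
  else
      STATUS=1   # unapplied migrations found (or the check itself failed)
  fi
  # Write atomically so the collector never reads a half-written file.
  TMP=$(mktemp "${OUTFILE}.XXXXXX")
  echo "debmonitor_unapplied_migrations ${STATUS}" > "$TMP"
  mv "$TMP" "$OUTFILE"

Run from a systemd timer, this would surface through the same alerting path as the other textfile checks mentioned in this channel.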
[13:45:18] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1002 for host debmonitor2003.codfw.wmnet with OS bookworm
[13:45:28] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10SLyngshede-WMF) 05Open→03In progress p:05Triage→03Low
[13:45:42] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:50:30] moritzm: all good from my side, I don't have anything to keep at this point
[14:50:54] volans is there a host I can reimage to test that install-console CR?
[14:51:38] inflatador: usually any of the sretest* hosts are ok, just ask in here if anyone is using the one you pick, also check which OS they are running to re-install them in the same OS version
[14:54:13] volans ACK, will do
[14:56:46] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1002 for host debmonitor2003.codfw.wmnet with OS bookworm completed: - debmonitor2003 (**WARN**) - Downtimed on...
[15:14:46] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10SLyngshede-WMF) envoyproxy yet again failed to build its configuration file. Manually ran generation script and removed downtime.
[15:14:54] 10SRE-tools, 10Infrastructure-Foundations: Reimage debmonitor2003 - https://phabricator.wikimedia.org/T356638 (10SLyngshede-WMF) 05In progress→03Resolved
[15:32:50] (DiskSpace) firing: Disk space build2001:9100:/ 5.207% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[15:35:52] moritzm: should we do some cleanup for build2001 space? ^^^
[15:36:43] I had a quick peek on Friday, there's a huge chunk (120G or so) related to the cloud image builds
[15:37:11] wow, that seems like a lot :)
[15:37:14] haven't looked closer yet, we're possibly missing some cleanup timer there
[15:41:47] ack, lmk if you want me to have a look
[15:42:15] sure, go ahead :-)
[15:42:44] Planning on reimaging sretest2003 in ~20m to test the reimage cookbook. If that's going to be a problem LMK
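Following the advice at 14:51 (pick a sretest* host, check whether anyone is using it, and keep its current OS), a quick pre-check might look like the snippet below. This is illustrative only: the cumin invocation and the placeholder flags are assumptions, not commands taken from this log.

  # Check which OS each sretest host currently runs before picking one to reimage.
  sudo cumin 'sretest*' 'grep PRETTY_NAME /etc/os-release'
  # Then keep the same OS when reimaging, reusing the cookbook flags seen later in
  # this log (placeholders, fill in the real values):
  # sudo cookbook sre.hosts.reimage --os <current-os> -t <task-id> sretest2003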
[15:45:41] SGTM, thx
[15:46:34] XioNoX ^^ any objections to the above reimage, I see you logged in there on Thursday
[15:46:45] he's out this week
[15:51:09] no worries, based on `last` he seems to have hit all test hosts at the same time last wk
[15:52:09] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[16:39:30] (SystemdUnitFailed) firing: (2) prometheus-dpkg-success-textfile.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:50:30] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney)
[16:55:23] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney)
[18:03:31] 10netops, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10Jhancock.wm) rack is physically prepped for tomorrow.
[18:24:33] looks like NIC firmware issues on sretest2003...updating now, but I'm going to use the VM (sretest2005) to test the cookbook
[18:27:18] inflatador: sretest2005 is special, it's part of the tests on routed ganeti
[18:27:25] it would probably not work
[18:27:35] and be stuck without some manual intervention
[18:27:39] it's a WIP
[18:28:01] volans apologies, should I abort?
[18:28:09] depends where you are :D
[18:28:14] volans nm, it's too late ;(
[18:28:30] sorry about that
[18:29:45] it's a test host, no big deal, maybe let arzhel know (but he's off this week)
[18:33:31] ACK, will drop a note in T300152
[18:33:31] T300152: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152
[18:34:15] thx
[18:52:33] 10netops, 10Ganeti, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10bking) @ayounsi Apologies for the trouble, I didn't realize `sretest2005` was in active use. Unfortunately, I reimaged it while I was working on T3...
[19:32:50] (DiskSpace) firing: Disk space build2001:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[19:53:33] o/ I have a weird problem with reimages (again!)
[19:55:05] I ran `sudo cookbook sre.hosts.reimage -t T351074 --os bullseye mw1386 --new` and for whatever reason the debian installer is sitting there waiting for an interactive install
[19:55:07] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[20:31:18] kamila_: have you looked at the console via mgmt?
[20:31:31] I see the host is in a busybox right now
[20:31:51] last time I looked it was in the interactive installer
[20:32:17] but waiting for which answer?
[20:32:31] https://usercontent.irccloud-cdn.com/file/2BajmnY2/image.png
[20:32:49] actually originally it wanted network config, I tried to proceed through that out of curiosity
[20:33:19] the "automated" installer is just the normal one with answers pre-answered, if an answer is not valid, not there, etc... it just asks for it
[20:33:27] there is no such thing as a non-interactive installer ;)
[20:33:39] right
[20:33:51] I think mw1387 is still on the original screen?
[20:34:23] (yes, I ran a bunch in a row again '^^)
[20:36:09] oh, not anymore...
[20:37:53] modules/profile/data/profile/installserver/preseed.yaml looks weird to me
[20:38:00] 1388 is though
[20:38:02] 06763bffb6a
[20:38:04] oh, did I break it?
[20:38:26] not sure yet
[20:38:53] things like mw138[0-368]
[20:38:56] look weird
[20:39:21] unless that's correctly doing 1380-1383,1386,1388
[20:39:27] and I'm just tired :D
[20:39:30] (SystemdUnitFailed) firing: (2) prometheus-dpkg-success-textfile.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:40:21] I believe that's correct
[20:42:14] it used to be 138[0-3] and I added 1386 and 1388
[20:42:28] (and I'm tired too and 1377 is not part of the pile I'm reimaging '^^)
[20:42:42] 1378 is in the original state if you check the console
[20:43:45] this is the log tab https://www.irccloud.com/pastebin/mdzSh0r5/
[20:47:54] and it's sitting there like this https://usercontent.irccloud-cdn.com/file/XkUyr5hc/image.png
[20:50:11] it should get it from dhcp
[20:51:56] hm, then that's weird and potentially bad :D
[20:53:11] agree, but it's already quite late over here... I might not be able to dig more tonight (have already worked until now basically)
[20:54:16] sure, no worries
[20:54:39] it's weird, it's happening in codfw too (mw2317)
[20:54:51] but it'll probably be weird tomorrow too :D
[20:55:48] could you please open a task with the host list and their status?
[20:55:56] so I know what to look at and what to expect tomorrow :)
[20:56:17] sorry for the trouble, I'm unsure what it could be as nothing that I know of changed in the last few days
[20:56:24] sure
[20:56:25] thank you!
[20:56:50] sorry for troubling you again, I seem to be a weird bug magnet :D
[20:57:03] ahahah
[20:57:13] nah it's not you, that's called hardware ;)
[20:57:34] let's pretend that that's true :D
[20:57:44] good night o/
[20:58:13] :D
[20:58:20] thx you too
[23:32:50] (DiskSpace) firing: Disk space build2001:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
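On the mw138[0-368] pattern discussed above (20:38 to 20:42): a character class matches a single character, so [0-368] matches the digits 0 through 3 plus 6 and 8, which is why the pattern covers mw1380-mw1383, mw1386 and mw1388 but not mw1384, mw1385 or mw1387. A quick way to convince yourself, assuming the preseed entries are evaluated as regular expressions (how preseed.yaml actually applies them is not shown in this log); the host list is only for illustration:

  #!/bin/bash
  # Show which hostnames the character class actually matches.
  for h in mw1380 mw1383 mw1384 mw1386 mw1387 mw1388; do
      if echo "$h" | grep -qE '^mw138[0-368]$'; then
          echo "$h: matches"
      else
          echo "$h: no match"
      fi
  done
  # Expected output: mw1380, mw1383, mw1386 and mw1388 match; mw1384 and mw1387 do not.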