[00:04:30] (SystemdUnitFailed) firing: (3) man-db.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:51] (DiskSpace) firing: Disk space build2001:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:44:30] (SystemdUnitFailed) firing: (3) man-db.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:32:51] (DiskSpace) firing: Disk space build2001:9100:/ 1.627% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:45:45] (SystemdUnitFailed) firing: (2) man-db.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:30] (SystemdUnitFailed) firing: (2) man-db.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:51:51] moritzm: Regarding build2002, maybe we add something to clear out old pbuilder builds
[08:52:28] Hmm, actually that's not so bad
[08:54:42] Oh, there's a ton of Docker images
[08:54:43] pbuilder builds are already pruned by a job, the current excess storage is used by container builds
[08:55:12] Yeah, Docker images are not pruned :-)
[08:56:47] they are probably cleaned out if the build script terminates cleanly, but what we're seeing is the fallout of failed runs
[08:56:59] I'll add a systemd timer to clean them out
[08:57:51] We need docker image prune -a I think, because they aren't actually used on the host, but are also not dangling
[09:04:31] (SystemdUnitFailed) firing: (2) man-db.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:07:37] we have two existing systemd timers: docker-system-prune-dangling.service runs "docker system prune --force" and docker-system-prune-all.service runs "docker system prune --all --volumes --force"
[09:08:43] Neither of those will remove these images :-)
[09:09:08] We need a docker image prune -a
[09:09:20] another issue is that we don't have any good mechanism to retire images: "docker image ls" shows lots of unused old images dating back up to three years (e.g. PHP 7.2, golang on stretch)
[09:09:23] docker system just prunes containers
[09:10:08] If the build process pulls in the image on build we can just remove all of them once a week, the first build that needs one of the images will be a little slow though
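For reference, a minimal sketch of the command such a timer would need to run: unlike the existing docker system prune units, docker image prune -a also removes tagged images that no container references. The retention window below is an illustrative choice (mirroring the "older than a month and a half" cleanup mentioned later), not what the eventual patch uses.

    # remove all unused images, not just dangling ones; 1080h (~45 days) is an illustrative retention window
    docker image prune --all --force --filter "until=1080h"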
[09:10:28] slyngs: probably so. Can you make a patch? And let's add ServiceOps as reviewers
[09:10:38] Will do
[09:11:28] I'll just remove a few of the oldest images manually so we can build stuff
[09:11:33] cheers
[09:14:12] I may have accidentally deleted all of them... no sure way to tell, though
[09:14:21] Anyway, plenty of space
[09:14:43] another option could be to use debmonitor image "freshness", we do time-based GC there for images and we could check if the image exists in netbox and prune if not
[09:15:30] I see just 3% free
[09:15:34] On the registry, sure, but on the build hosts it seems less important
[09:15:57] Docker is bad command-line voodoo
[09:16:10] sudo docker image ls <- List your images
[09:16:26] sudo docker images ls
[09:16:26] REPOSITORY TAG IMAGE ID CREATED SIZE
[09:20:39] slyngs: unrelated, FYI on sso-debmon (cloud instance) puppet is failing because:
[09:20:42] Error: Systemd start for debmonitor-server failed!
[09:20:52] I'm not sure if you're getting the email about that
[09:21:05] I do not :-)
[09:21:06] if not I can add you as project admin I think
[09:22:04] Nice, that would save us from using you as a relay station :-)
[09:22:28] what's your wikitech user?
[09:22:35] (DiskSpace) resolved: Disk space build2001:9100:/ 5.631% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:22:38] found
[09:22:54] Cool, and I know why it broke
[09:23:12] mmmh you're already a member there
[09:23:45] moritzm: We may have forgotten about cloud when removing the old debmonitor deployment code from Puppet
[09:25:10] so I think you should already get an email like:
[09:25:11] [Cloud VPS alert][sso] Puppet failure on sso-debmon.sso.eqiad1.wikimedia.cloud
[09:25:59] Found it
[09:26:28] It got filtered to a separate mail folder
[09:26:52] but the cloud setup can simply use the deb as well, right?
[09:27:07] Yeah, we just need to upgrade from buster
[09:27:08] not sure what it's actually used for
[09:27:33] it was used for testing new things IIRC
[09:27:36] might be just a test ground, and if the test ground uses the legacy setup that's not what's in prod...
[09:28:00] so we can just as well trash it and recreate it if we actually need it in the future
[09:28:11] I had one a long time ago in another project, then john got this one in the sso project and we ditched my old one
[09:28:25] sure, it's nice to have one working for testing things
[09:29:03] Could we build a debmonitor-stage/next on Ganeti instead?
[09:36:33] I agree a staging host in prod would be useful, let's just remove the old buster setup and open a task with low prio to add a staging node for debmonitor?
[09:37:25] mmmh but a staging host in prod will have no data, while one in cloud will have the data of the other hosts in that project... up to you
[09:38:09] *no data = no data unless running debmonitor manually with the -s SERVER, --server SERVER flag
[09:43:12] slyngs: I see you've reimaged debmonitor2003 yesterday and it was a PASS, did you have to do anything manually in the debian-installer?
[09:43:30] I'm investigating reimage failures and trying to pinpoint when it started or what is affected
[09:43:47] No, but the envoyproxy yet again failed to build its config file in the first puppet run
[09:43:50] we could simply add a separate debmonitor-next DB
[09:44:11] and if we work on bigger changes we configure clients to submit to both instances in parallel
[09:44:19] A separate database is probably required if we do schema changes
[09:44:23] ack, thx
[09:44:27] or if there are breaking changes we only use -s for selected tests
[09:44:32] separate DB is ofc required
[09:44:48] what I mean is that there will be no live data by default, or stale data anyway
[09:45:35] but the client can already submit to two instances, so we can e.g. have the sretest* hosts submit always to both
[09:45:49] both == prod and next
[09:45:51] does it? I don't think so :D
[09:46:35] I thought so? but even if not, it's useful and simple to add
[09:46:51] especially now that we have a CLI-only package
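As a rough illustration of the dual-submission idea: only the -s/--server flag comes from the log above; the client binary name and both hostnames are assumptions for the sketch, not the real endpoints.

    # hedged sketch: report the same host data to prod and to a hypothetical "next" instance
    debmonitor-client --server debmonitor.discovery.wmnet         # production endpoint (hostname assumed)
    debmonitor-client --server debmonitor-next.discovery.wmnet    # hypothetical staging/next endpoint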
[10:56:52] is build2001 ok for me to relaunch build-production-images?
[10:57:49] dunno, at the moment there are 17GB free, not sure if simon is still cleaning up stuff
[11:14:32] (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:00:22] I'm reimaging sretest1001 to verify if the reimages work again
[12:22:04] claime: I'm not entirely sure if the images currently on the host are required, so I only deleted the oldest
[12:25:36] slyngs: I can remove like 3 I know for sure are not needed but they don't amount to much. I think what needs to happen in the near future is the build for some of the bigger images needs to move away from production-images
[12:26:30] Namely the spark and spark-build images, which are between 1 and 5GB and take hours to generate
[12:27:24] I've created https://gerrit.wikimedia.org/r/c/operations/puppet/+/997796 but again not entirely sure on the use case, so cleaning out the images could be annoying to some
[12:27:25] I'll check if we can remove some more or if it'll just cause them to be rebuilt
[13:01:10] Total reclaimed space: 23.31GB
[13:01:14] That's not that much
[13:01:24] but still
[13:01:34] (pruned everything older than a month and a half)
[13:14:16] FYI I'll have to leave early today, in about ~2h. So if you need anything from me let me know earlier
[13:19:05] I'll try to not come up with more reimage stuff today :D
[13:21:21] :D
[13:46:30] FYI I'm adding a 600GB disk image to vrts1002 on the ganeti cluster, it should be temporary for ~1 week for testing. (cc moritzm, we discussed this briefly on Friday last week)
[13:46:42] Context: https://phabricator.wikimedia.org/T355980
[13:50:45] +1
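For context, adding a temporary disk to a Ganeti instance is normally a gnt-instance modify on the cluster master; the sketch below uses stock Ganeti syntax with an assumed instance FQDN and disk index, and is not necessarily how it was done here.

    # add a 600 GB disk; the instance typically needs a restart before it sees the new disk
    sudo gnt-instance modify --disk add:size=600g vrts1002.eqiad.wmnet
    sudo gnt-instance reboot vrts1002.eqiad.wmnet
    # after the ~1 week of testing, drop the extra disk again (disk index 1 assumed)
    sudo gnt-instance modify --disk 1:remove vrts1002.eqiad.wmnet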
[14:08:05] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-swift-storage, 10ops-codfw: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (10Jhancock.wm) This rack is physically ready for tomorrow.
[14:17:54] 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[14:54:20] Is it possible to roll back a Netbox change? Somehow I deleted the host record instead of just its interfaces: https://netbox.wikimedia.org/extras/changelog/?request_id=7ba73cbb-db56-44b1-974b-34bf76172a7b
[14:54:38] topranks ^^
[14:54:41] inflatador: ouch, but don't worry, you are not the first one to do this
[14:54:44] you need volans
[14:54:58] https://wikitech.wikimedia.org/wiki/Netbox#Restore
[14:55:03] * topranks looking
[14:55:23] if no one is editing netbox we can restore from the hourly backup
[14:55:25] inflatador: I've done this a few times myself :(
[14:55:40] * volans wishes the netbox upgrade would happen soon, it should prevent this from happening again
[14:55:44] yeah seriously
[14:55:56] indeed yeah I'll be looking at the upgrade next month
[14:56:11] the potential for this is still there but they've improved the UI to make it more obvious, so hopefully that will be the end of it
[14:56:28] inflatador: I can help with the restore from the DB backup
[14:57:13] topranks: I will have to step out in ~15m
[14:57:25] the hourly backup should be @:37 IIRC
[14:57:27] thanks topranks and volans. Sorry for the trouble
[14:57:30] do you think you can take care of it?
[14:57:42] or I can do it quickly but I need to know now ;)
[14:57:56] volans: no probs, yes I can take care of it
[14:58:14] thanks man! lmk if you hit any issues
[14:58:32] psql-all-dbs-2024-02-06-14-37.sql.gz
[14:58:48] but check the changelog and tell people to not modify netbox
[14:58:55] or run cookbooks that touch netbox :D
[14:58:58] decommission, provision
[14:59:08] (reimage too, but that's less important)
[14:59:13] yep will do
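For reference, the restore discussed here is roughly what https://wikitech.wikimedia.org/wiki/Netbox#Restore describes: load the hourly dump named above back into Postgres while nothing writes to Netbox, losing any change made after the dump was taken. A minimal sketch only; the service unit name, backup path and database name are assumptions, not the exact commands used.

    # hedged sketch of a full restore from the 14:37 hourly dump
    sudo systemctl stop uwsgi-netbox                           # unit name assumed; stop the app so nothing writes mid-restore
    sudo -u postgres dropdb netbox                             # database name assumed; the pg_dumpall output recreates it
    zcat /srv/postgres-backup/psql-all-dbs-2024-02-06-14-37.sql.gz | sudo -u postgres psql postgres
    sudo systemctl start uwsgi-netbox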
[15:15:45] (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:17:07] inflatador: I think you should be ok to have another stab at that now
[15:17:07] https://netbox.wikimedia.org/dcim/devices/4906/
[15:17:39] topranks: now that I know you are also the go-to guy for Netbox backups, I will keep that in mind :P
[15:17:49] but I think I will not try to delete anything again
[15:18:38] haha well I am tasked with upgrading netbox this time so it's on me as long as we are having these little hiccups due to the old UI :P
[15:19:20] so the new version allows for easier rollbacks?
[15:19:36] topranks awesome, thanks again. Should I only delete the one interface with an IP this time?
[15:20:06] ah, now I get it... the "delete" in the upper right-hand corner deletes everything
[15:20:11] yeah that's the issue
[15:20:20] upper right - is for the "device" as a whole
[15:20:36] bottom of list on any tab - is for the elements on that tab
[15:21:15] it's a common one with the devs, the next upgrade will display a list of "things to be deleted" in the confirm dialog, so hopefully it will be clearer and we won't have this
[15:21:30] but all of us have done it, most several times :(
[15:23:25] FWIW I updated the server lifecycle page with a warning
[15:24:10] thanks
[15:39:31] (SystemdUnitFailed) firing: (11) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:44:31] (SystemdUnitFailed) firing: (11) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:45:45] (SystemdUnitFailed) firing: (10) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:58:47] 10netops, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1cb41722-6e24-4871-a903-cd...
[15:59:17] 10netops, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b2349fc0-73a1-418a-b3b8-28...
[16:03:20] 10netops, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a3c16d29-3284-4390-9f38-03...
[16:13:14] 10netops, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10cmooney) All hosts moved successfully, all now responding to pings fine and MAC forwarding...
[16:25:00] 10CAS-SSO, 10Infrastructure-Foundations: OpenID Connect logout does not log out of IdP on idp.wmcloud.org - https://phabricator.wikimedia.org/T356784 (10CCicalese_WMF)
[16:28:47] topranks hit another snag on the migration, cloudelastic1009.mgmt.eqiad.wmnet doesn't have a DNS record. Tried running `sre.dns.netbox` but it says there's nothing to change?
[16:29:06] https://netbox.wikimedia.org/search/?q=cloudelastic1009&obj_type= has the IP/FQDN for the mgmt interface
[16:29:27] hmm I was gonna say I assume it's missing from netbox
[16:30:18] indeed the dns entry is there and being returned by the authdns servers
[16:30:36] I think it may have happened when I ran the decommission cookbook after you restored netbox?
[16:30:38] inflatador: when you say it "doesn't have a DNS record", where are you seeing that?
[16:31:43] the reimage cookbook is complaining when I run it on cumin2002... also `host cloudelastic1009.mgmt.eqiad.wmnet` throws NXDOMAIN, again from cumin2002
[16:32:43] ok
[16:33:15] inflatador: you should wipe the caches
[16:33:17] my guess is some race condition and it's cached a negative DNS query for that host
[16:33:22] yeah
[16:33:41] sudo cookbook sre.dns.wipe-cache cloudelastic1009.mgmt.eqiad.wmnet
[16:34:28] ACK, that fixed it ;)
[16:35:02] cool!
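A quick way to tell a stale negative cache entry (as in this case) from a record that is genuinely missing is to compare the recursor's answer with an authoritative server's; the authdns hostname below is illustrative.

    host cloudelastic1009.mgmt.eqiad.wmnet                                 # via the local resolver; may return a cached NXDOMAIN
    dig +short cloudelastic1009.mgmt.eqiad.wmnet @dns1004.wikimedia.org    # ask an authoritative server directly (hostname illustrative)
    sudo cookbook sre.dns.wipe-cache cloudelastic1009.mgmt.eqiad.wmnet     # if only the recursors are stale, clear their caches as above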
[16:49:31] (SystemdUnitFailed) firing: (3) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:56:55] 10netops, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10MatthewVernon) swift backends look happy, thanks :)
[17:14:31] (SystemdUnitFailed) firing: (3) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:45:45] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:49:31] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:15:45] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:22:17] 10netops, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10cmooney) 05Open→03Resolved a:03cmooney Closing task, all looks good following change....
[21:22:27] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney)
[22:15:46] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed