[00:04:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) man-db.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:32:51] <jinxer-wm>	 (DiskSpace) firing: Disk space build2001:9100:/ 0% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[03:44:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) man-db.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:32:51] <jinxer-wm>	 (DiskSpace) firing: Disk space build2001:9100:/ 1.627% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[07:45:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) man-db.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) man-db.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:51:51] <slyngs>	 moritzm: Regarding build2002, maybe we add something to clear out old pbuilds
[08:52:28] <slyngs>	 Hmm, actually that's no so bad
[08:54:42] <slyngs>	 Oh, there's a ton of docker images
[08:54:43] <moritzm>	 pbuilder builds are already pruned by a job, the current excess storage is used by container builds
[08:55:12] <slyngs>	 Yeah, docker image are not pruned :-)
[08:56:47] <moritzm>	 they are probably cleaned out if the build script terminates cleanly,  but what we're seeing is the fallout of failed runs
[08:56:59] <moritzm>	 I'll add a systemd timer to clean them out
[08:57:51] <slyngs>	 We need docker image prune -a I think, because they aren't actually used on the host, but also not dangling
[09:04:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) man-db.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:07:37] <moritzm>	 we have two existing systemd timers: docker-system-prune-dangling.service runs "docker system prune --force" and docker-system-prune-all.service runs "docker system prune --all --volumes --force"
[09:08:43] <slyngs>	 Neither of those will remove these images :-)
[09:09:08] <slyngs>	 We need a docker image prune -a
[09:09:20] <moritzm>	 another issue is that we don't have any good mechanism to retire images: "docker image ls" shows lots of unused old images dating back to three years... (PHP 7.2, golang on stretch e.g.)
[09:09:23] <slyngs>	 docker system just prunes containers
[09:10:08] <slyngs>	 If the build process pulls in the image on build we can just remove all of them once a week, the first build that needs one of the image will be a little slow though
[09:10:28] <moritzm>	 slyngs: probably, so. can you make a patch? and let's add ServiceOps as reviewers
[09:10:38] <slyngs>	 Will do
[09:11:28] <slyngs>	 I'll just remove a few of the oldest images manually so we can build stuff
[09:11:33] <moritzm>	 cheers
[09:14:12] <slyngs>	 I may have accidentally delete all of them.... No sure way though
[09:14:21] <slyngs>	 Anyway, plenty of space
[09:14:43] <volans>	 another option could be to use debmonitor image "freshness", we do time-based GC there for images and we could check if the image exists in netbox and prune if not
[09:15:30] <volans>	 I see just 3% free
[09:15:34] <slyngs>	 On the registery sure, but on build it seems less important
[09:15:57] <slyngs>	 Docker is bad commandline voodoo
[09:16:10] <slyngs>	 sudo docker image ls <- List your images
[09:16:26] <slyngs>	 sudo docker images ls
[09:16:26] <slyngs>	 REPOSITORY   TAG       IMAGE ID   CREATED   SIZE
[09:20:39] <volans>	 slyngs: unrelated, FYI on sso-debmon (cloud instance) puppet is failing because:
[09:20:42] <volans>	 Error: Systemd start for debmonitor-server failed!
[09:20:52] <volans>	 I'm not sure if you're getting the email about that
[09:21:05] <slyngs>	 I do not :-)
[09:21:06] <volans>	 if not I can add you as project admin I think
[09:22:04] <slyngs>	 Nice, that would save us from using you as a relay-station :-)
[09:22:28] <volans>	 what's your wikitech user?
[09:22:35] <jinxer-wm>	 (DiskSpace) resolved: Disk space build2001:9100:/ 5.631% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=build2001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[09:22:38] <volans>	 found
[09:22:54] <slyngs>	 Cool, and I know why it broken
[09:23:12] <volans>	 mmmh you're already a member there
[09:23:45] <slyngs>	 moritzm: Wemay have forgotten about cloud when removing the old debmonitor deployment code from Puppet
[09:25:10] <volans>	 so I thibk you should already get an email like:
[09:25:11] <volans>	 [Cloud VPS alert][sso] Puppet failure on sso-debmon.sso.eqiad1.wikimedia.cloud
[09:25:59] <slyngs>	 Found it
[09:26:28] <slyngs>	 It got filtered to a separate mail folder
[09:26:52] <moritzm>	 but the cloud setup can simply use the deb as well, right?
[09:27:07] <slyngs>	 Yeah, we just need to upgrade from buster
[09:27:08] <moritzm>	 not sure what it' actually used
[09:27:33] <volans>	 it was used for testing new things IIRC
[09:27:36] <moritzm>	 might be just a test ground and if the test ground ues the legacy setup not in prod....
[09:28:00] <moritzm>	 so we can just as well thrash if and recreated if we acually need it in the future
[09:28:11] <volans>	 I had one long time ago in another project, then john got this one in the sso one and we ditched my old one
[09:28:25] <volans>	 sure, it's nice to have one working for testing things
[09:29:03] <slyngs>	 Could we build a debmonitor-stage/next on Ganeti instead?
[09:36:33] <moritzm>	 I agree a staging host in prod would be useful, let's just remove the old buster setup and open a task with low prio to add a staging node for debmonitor? 
[09:37:25] <volans>	 mmmh but a staging host in prod will have no data, while one in cloud will have the data of the other hosts in that project... up to you
[09:38:09] <volans>	 *no data= no data unless running manually debmonitor with the -s SERVER, --server SERVER flag
[09:43:12] <volans>	 slyngs: I see you've reimaged debmonitor2003 yesterday and it was a PASS, did you had to do anything manual in the debian-installer?
[09:43:30] <volans>	 I'm investigating reimage failures and trying to pinpoint when it started or what is affected
[09:43:47] <slyngs>	 No, but the envoyproxy yet again failed to build it's config file in the first puppet run
[09:43:50] <moritzm>	 we could simply add a separate debmonitor-next DB
[09:44:11] <moritzm>	 and if we work on bigger changes we configure clients to submit to both instances in parallel
[09:44:19] <slyngs>	 A separate database is probably required, if we do schema changes
[09:44:23] <volans>	 ack, thx
[09:44:27] <moritzm>	 or if there are breaking changes we only use -s for selected tests
[09:44:32] <volans>	 separate DB is ofc required
[09:44:48] <volans>	 what I mean is that there will be no live data by default, or stale nyway
[09:45:35] <moritzm>	 but the client can already submit to two instances, so we can e.g. have the sretest* hosts submit always to both
[09:45:49] <moritzm>	 both == prod and next
[09:45:51] <volans>	 does it? I don't thin so :D
[09:46:35] <moritzm>	 I thought so? but even if not, it's useful and simple to add
[09:46:51] <moritzm>	 especially now that we have a CLI only package
[10:56:52] <claime>	 is build2001 ok for me to relaunch build-production-images?
[10:57:49] <volans>	 dunno, at the moment there are 17GB free, not sure if simon is still clening up stuff
[11:14:32] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:00:22] <volans>	 I'm reimging sretest1001 to verify if the reimages work again
[12:22:04] <slyngs>	 claime: I'm not entirely sure if the images currently on host is required, so I only deleted the oldest
[12:25:36] <claime>	 slyngs: I can remove like 3 I know for sure are not needed but they don't amount to much. I think what needs to happen in the near future is the build for come of the bigger images needs to move away from production-images
[12:26:30] <claime>	 Namely the spark and spark-build images, that are between 1 and 5GB and take hours to generate
[12:27:24] <slyngs>	 I've created https://gerrit.wikimedia.org/r/c/operations/puppet/+/997796 but again not entirely sure on the use case, so cleaning out the images could be annoying to some
[12:27:25] <claime>	 I'll check if we can remove some more or if it'll just cause them to be rebuilt
[13:01:10] <claime>	 Total reclaimed space: 23.31GB
[13:01:14] <claime>	 That's not that much
[13:01:24] <claime>	 but still
[13:01:34] <claime>	 (pruned everything older than a month and a half)
[13:14:16] <volans>	 FYI I'll have to leave earlier today, in about ~2h. So if you need anything from me let me know earlier
[13:19:05] <kamila_>	 I'll try to not come up with more reimage stuff today :D 
[13:21:21] <volans>	 :D
[13:46:30] <eoghan>	 FYI I'm adding a 600gb disk image to vrts1002 on the ganeti cluster, it should be temporary for ~1 week for testing. (cc moritzm, we discussed this briefly on Friday last week)
[13:46:42] <eoghan>	 Context: https://phabricator.wikimedia.org/T355980
[13:50:45] <moritzm>	 +1
[14:08:05] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-swift-storage, 10ops-codfw: Migrate servers in codfw rack A2 from asw-a2-codfw to lsw1-a2-codfw - https://phabricator.wikimedia.org/T355861 (10Jhancock.wm) This rack is physically ready for tomorrow.
[14:17:54] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[14:54:20] <inflatador>	 Is it possible to roll back a Netbox change? Some how I deleted the host record instead of just its interfaces: https://netbox.wikimedia.org/extras/changelog/?request_id=7ba73cbb-db56-44b1-974b-34bf76172a7b
[14:54:38] <inflatador>	 topranks ^^ 
[14:54:41] <sukhe>	 inflatador: ouch, but don't worry, you are not the first one to do this
[14:54:44] <sukhe>	 you need volans
[14:54:58] <sukhe>	 https://wikitech.wikimedia.org/wiki/Netbox#Restore
[14:55:03] * topranks looking 
[14:55:23] <volans>	 if noone is editing netbox we can restore from the hourly backup
[14:55:25] <topranks>	 inflatador: I've done this a few times myself :(
[14:55:40] * volans wish the netbox upgrade will happen soon, it should prevent this from happening again
[14:55:44] <sukhe>	 yeah seriously
[14:55:56] <topranks>	 indeed yeah I'll be looking at the upgrade next month 
[14:56:11] <topranks>	 the potential for this is still there but they've improved the UI to make it more obvious so hopefully be the end 
[14:56:28] <topranks>	 inflatador: I can help with restore from db backup 
[14:57:13] <volans>	 topranks: I will have to step out in ~15m
[14:57:25] <volans>	 the hourly backup should be @:37 IIRC
[14:57:27] <inflatador>	 thanks topranks and volans . Sorry for the trouble
[14:57:30] <volans>	 do you think you can take care of it?
[14:57:42] <volans>	 or I can do it quickly but I need to know it now ;)
[14:57:56] <topranks>	 volans: no probs, yes I can take care of it 
[14:58:14] <volans>	 thanks man! lmk if you hit any issue
[14:58:32] <volans>	 psql-all-dbs-2024-02-06-14-37.sql.gz
[14:58:48] <volans>	 but check the changelog and tell people to not modify netbox
[14:58:55] <volans>	 or run cookbooks that touch netbox :D
[14:58:58] <volans>	 decommission, provision
[14:59:08] <volans>	 (reimage too but is less importnt)
[14:59:13] <topranks>	 yep will do 
[15:15:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:17:07] <topranks>	 inflatador: I think you should be ok to have another stab at that now 
[15:17:07] <topranks>	 https://netbox.wikimedia.org/dcim/devices/4906/
[15:17:39] <sukhe>	 topranks: now that I know you are also the go-to guy for Netbox backups, I will keep that in mind :P
[15:17:49] <sukhe>	 but I will I think not try to delete anything again
[15:18:38] <topranks>	 haha well I am tasked with upgrading netbox this time so it's on me as long as we are having these little hiccups due to the old UI :P
[15:19:20] <sukhe>	 so the new version allows for easier rollbacks?
[15:19:36] <inflatador>	 topranks awesome, thanks again. Should I only delete the one interface with an IP this time?
[15:20:06] <inflatador>	 ah, now I get it...the "delete" in the upper right hand corner deletes everything
[15:20:11] <topranks>	 yeah that's the issue 
[15:20:20] <topranks>	 upper right - is for the "device" as a whole 
[15:20:36] <topranks>	 bottom of list on any tab - is for the elements on that tab 
[15:21:15] <topranks>	 it's a common one with the devs, the next upgrade will display a list of "things to be deleted" in the confirm dialog, so hopefully be clearer and we won't have this 
[15:21:30] <topranks>	 but all of us have done it, most several times :(
[15:23:25] <inflatador>	 FWiW I updated the server lifecycle page with a warning
[15:24:10] <topranks>	 thanks 
[15:39:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (11) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:44:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (11) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:45:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:58:47] <wikibugs>	 10netops, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1cb41722-6e24-4871-a903-cd...
[15:59:17] <wikibugs>	 10netops, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b2349fc0-73a1-418a-b3b8-28...
[16:03:20] <wikibugs>	 10netops, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=a3c16d29-3284-4390-9f38-03...
[16:13:14] <wikibugs>	 10netops, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10cmooney) All hosts moved successfully, all now responding to pings fine and MAC forwarding...
[16:25:00] <wikibugs>	 10CAS-SSO, 10Infrastructure-Foundations: OpenID Connect logout does not log out of IdP on idp.wmcloud.org - https://phabricator.wikimedia.org/T356784 (10CCicalese_WMF)
[16:28:47] <inflatador>	 topranks hit another snag on the migration, cloudelastic1009.mgmt.eqiad.wmnet doesn't have a DNS record. Tried running `sre.dns.netbox` but it says there's nothing to change?
[16:29:06] <inflatador>	 https://netbox.wikimedia.org/search/?q=cloudelastic1009&obj_type= has the IP/FQDN for mgmt iface
[16:29:27] <topranks>	 hmm I was gonna say I assume it's missing from netbox 
[16:30:18] <topranks>	 indeed the dns entry is there and being returned by authdns servers 
[16:30:36] <inflatador>	 I think it may have happened when I ran the decommission cookbook after you restored netbox?
[16:30:38] <topranks>	 inflatador: when you say it "doesn't have a DNS record" where are you seeing that?
[16:31:43] <inflatador>	 the reimage cookbook is complaining when I run it on cumin2002...also `host cloudelastic1009.mgmt.eqiad.wmnet` throws NXDOMAIN...again from cumin2002
[16:32:43] <topranks>	 ok
[16:33:15] <sukhe>	 inflatador: you should wipe the caches
[16:33:17] <topranks>	 my guess is some race condition and it's cached a negative dns query for that host
[16:33:22] <topranks>	 yeah
[16:33:41] <sukhe>	 sudo cookbook sre.dns.wipe-cache cloudelastic1009.mgmt.eqiad.wmnet
[16:34:28] <inflatador>	 ACK, that fixed it ;)
[16:35:02] <topranks>	 cool!
[16:49:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:56:55] <wikibugs>	 10netops, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10MatthewVernon) swift backends look happy, thanks :)
[17:14:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) update-ubuntu-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:45:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:49:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:15:45] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:22:17] <wikibugs>	 10netops, 10Data-Persistence, 10Data-Persistence-Backup, 10Infrastructure-Foundations, and 3 others: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 (10cmooney) 05Open→03Resolved a:03cmooney Closing task, all looks good following change....
[21:22:27] <wikibugs>	 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-codfw: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney)
[22:15:46] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed