[01:18:06] * bd808 off
[04:06:58] * taavi paged for toolsdb
[04:07:10] <taavi>	 set it back as read-write
[11:00:38] <dhinus>	 sorry taavi for the page :( I was hoping the index that was added yesterday would help, but apparently it's not that... I will try again today tweaking other variables in the config
[13:05:52] <taavi>	 dhinus: are you aware what's up with the tools-db-2 replication alert?
[13:06:27] <dhinus>	 sorry yes, should be fixed now
[13:06:51] <dhinus>	 I tried stopping replication to see if it improved things on the primary, but I didn't expect to alert
[13:06:54] <dhinus>	 let me check why it did
[13:08:14] <dhinus>	 oh I know why, replication was broken, and stopping and restarting it actually fixed it
[13:08:34] <dhinus>	 the alert is not working correctly when replication is just down, it only alerts when it sees a long lag
[13:09:08] <dhinus>	 it will take a few hours to catch up, I'll acknowledge the alert
[13:09:31] <dhinus>	 we should make sure that alert (or a different one) triggers when replication is down
[13:10:05] <dhinus>	 I think it's the usual prometheus issue with "no data is interpreted as good data"
[13:50:59] <taavi>	 i wonder if it's easier to just export pt-heartbeat data and alert based on that, since it should notice everything
[13:56:33] <arnaudb>	 taavi: it's an idea that I second, but we should implement some of the tweaks that have been added to wmf's pt-heartbeat
[13:57:34] <arnaudb>	 I've checked our mysqld exporter version (we don't use the github releases but the debian ones) and it should be compliant with most of our use case. We could try and package it ourselves if we needed to add some more stuff, it does not seem to be too much trouble
[14:02:29] <dhinus>	 I've created T350943, please add a comment there and we can discuss the best option
[14:02:30] <stashbot>	 T350943: [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943
[14:03:26] <dhinus>	 I think monitoring a couple more existing prometheus things should be enough to catch all the errors
[14:33:47] <arnaudb>	 we have most of the useful metrics already available, for some others it would require editing mysqld exporter arguments
[15:03:52] <andrewbogott>	 dhinus: reimages going ok?
[15:07:09] <dhinus>	 andrewbogott: today I'm focusing exclusively on T349695
[15:07:10] <stashbot>	 T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695
[15:07:33] <dhinus>	 because it crashed again last night
[15:07:41] <andrewbogott>	 ok!
[15:08:10] <andrewbogott>	 Want me to drain/reimage some cloudvirts inbetween my 'day off' activities?
[15:08:49] <dhinus>	 if you really are bored :) maybe we should test moving a couple VMs to one of the reimaged hosts?
[15:09:04] <dhinus>	 to test they run fine on bookworm before moving more?
[15:09:15] <andrewbogott>	 sure
[15:10:21] <andrewbogott>	 thanks for working on the db issue!
[15:11:27] <taavi>	 +1
[15:19:59] <andrewbogott>	 topranks: when you have a moment can you check the port/switch setup for cloudvirt106[2-7] and confirm that they conform to the new network setup?  Meanwhile I'm going to reimage them with bookworm...
[15:20:18] <dhinus>	 I'm hopeful that it won't alert during the weekend, and at least now I found a good graph of the available memories that we can use to "predict" if there's an OOM coming
[15:20:18] <legoktm>	 a real user crontab line:
[15:20:18] <legoktm>	 > 17 22 * * Sun /usr/bin/jsub -N cron-5 -once -quiet toolforge-jobs load /data/project/commons-delinquent/delinker_job.yaml
[15:20:19] <andrewbogott>	 ^ dhinus that's new hardware which will allow us to decom some of the lower-number cloudvirts rather than reimage.
[15:22:31] <dhinus>	 andrewbogott: nice, I saw they were there waiting :)
[15:22:34] <taavi>	 legoktm: at least that's going to fail every time, so turning off the grid won't break it even further
[15:22:58] <legoktm>	 is toolforge-jobs not installed on the grid execution hosts? :facepalm:
[15:23:24] <taavi>	 no :D
[15:23:40] <dhinus>	 maybe we should run the whole grid as a toolforge job :D
[15:24:06] <taavi>	 that's almost as good of an idea as using toolforge as the undercloud for the cloud vps openstack control plane
[15:24:13] <dhinus>	 hahahah exactly
[15:25:20] <taavi>	 legoktm: although now I'm tempted to symlink it to /bin/true and see what happens :D
[15:28:55] <taavi>	 omg I only now realized why they did that. they're used to the grid being unreliable and losing the job, and don't know that k8s simply does not do that
[15:41:23] <topranks>	 andrewbogott: just checked those and yes they were all good, looks like either me or papaul set the right vlans on them a couple of weeks back 
[15:41:41] <andrewbogott>	 great, thanks for checking!
[15:44:44] <taavi>	 lol, commons-delinquent is not the only tool that has tried running jobs framework jobs via the sge cron node, although all other tools have realized it does not work and commented the job
[15:45:05] <taavi>	 legoktm: are you in communication with the tool maintainers or should I just comment the crontab line?
[15:45:23] <legoktm>	 No I'm not
[15:45:40] <legoktm>	 I was trying to find SteinsplitterBot's code and saw it
[15:45:44] <taavi>	 ack, I'll deal with it
[15:45:56] <legoktm>	 Unfortunately I think they had it off Toolforge :(
[16:06:08] <taavi>	 andrewbogott: are the new cloudvirts already on bookworm?
[16:06:21] <andrewbogott>	 no, I'm reimaging them now
[16:06:24] <taavi>	 ah
[16:14:12] <andrewbogott>	 topranks: trying to put cloudvirt1062 into service and I get
[16:14:13] <andrewbogott>	 Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, failed to resolve 'cloudvirt1062.private.eqiad.wikimedia.cloud' (file: /etc/puppet/modules/profile/manifests/wmcs/cloud_private_subnet.pp, line: 15, column: 68) on node cloudvirt1062.eqiad.wmnet
[16:14:37] <andrewbogott>	 That means something was missed in netbox right?
[16:18:53] * andrewbogott has almost totally lost track of how dns is managed these days
[16:22:03] <topranks>	 yeah, that's a good point
[16:22:22] <topranks>	 the IPs for the private network are allocated here (for instance for rack E4 which this is in):
[16:22:23] <topranks>	 https://netbox.wikimedia.org/ipam/prefixes/655/ip-addresses/
[16:22:56] <andrewbogott>	 Is that something I should be doing, or leave to the netops folks?
[16:23:18] <topranks>	 I'll make sure to add that to my task to automate the network bits when adding servers (T346428)
[16:23:19] <stashbot>	 T346428: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428
[16:23:32] <topranks>	 andrewbogott: let me take a look at it for now
[16:23:39] <andrewbogott>	 thank you!
[16:34:03] <topranks>	 andrewbogott: ok I think you should be good now, I've allocated IPs and the dns entries have been created 
[16:34:07] <topranks>	 https://www.irccloud.com/pastebin/Oahspp3F/
[16:35:04] <andrewbogott>	 topranks: dig is working but puppet isn't, any idea what the negative cache limit is for that domain?
[16:36:39] <topranks>	 TTL on the zone is 1 hour 
[16:36:44] <topranks>	 I ran "sudo cookbook sre.dns.wipe-cache cloudvirt1062.private.eqiad.wikimedia.cloud" which might help 
[16:37:24] <andrewbogott>	 yep, better now. thanks!
[17:21:27] <andrewbogott>	 topranks: have a theory for this?
[17:21:30] <andrewbogott>	 https://www.irccloud.com/pastebin/NDPQNbPV/
[17:22:04] <andrewbogott>	 (and, sorry about all the pings, this can wait for next week if you prefer)
[17:23:19] <topranks>	 that's mad
[17:23:23] <topranks>	 https://www.irccloud.com/pastebin/p8kjbg8q/
[17:23:44] <topranks>	 Literally no clue what's happening for you 
[17:23:59] <topranks>	 can you do a "dig +nsid cloudvirt1064.private.eqiad.wikimedia.cloud" and paste it?
[17:24:21] <topranks>	 ugh, irccloud somehow puts in a "http" url when it sees a hostname that's the second time it happened 
[17:24:31] <andrewbogott>	 ummmmm
[17:24:34] <andrewbogott>	 https://www.irccloud.com/pastebin/Wqo3uEAc/
[17:24:42] <andrewbogott>	 So as soon as you looked everything started working
[17:25:23] <andrewbogott>	 that or it's alternating somehow
[17:25:57] * andrewbogott does watch -g
[17:25:58] <topranks>	 I was wondering were they missing from some of our dns servers and not others 
[17:26:06] <topranks>	 but checking manually against them all seem to be ok 
[17:26:58] <andrewbogott>	 I think you looked and that fixed it
[17:27:16] <topranks>	 my talents know no bounds it seems :)
[17:27:40] <topranks>	 I guess let's just ignore for now, if it happens again ping me.. definitely a little odd 
[17:28:14] <andrewbogott>	 yep. Thanks
[17:56:54] <dhinus>	 I tweaked our replication alerts for toolsdb, now they should catch a few additional problems
[17:57:07] <dhinus>	 details in T350943
[17:57:08] <stashbot>	 T350943: [toolsdb] no alert if replication stops because of IO error - https://phabricator.wikimedia.org/T350943
[17:57:29] <dhinus>	 I'm not resolving that task because my brain is a bit fried now, so I want to double check those on Monday :)
[17:59:13] <dhinus>	 regarding the OOM issues, I hope the db will not crash during the weekend, because I decreased the innodb buffer pool and it should take a while before all the memory is exhauted again
[17:59:42] <dhinus>	 *exhausted. but we still need to find more long-term solutions
[18:00:02] <dhinus>	 details in T349695
[18:00:02] <stashbot>	 T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695
[18:01:59] * dhinus calls it a day :)
[20:19:35] <andrewbogott>	 I'm in the middle of reimaging cloudvirt1062-67 and need to go AFK for a bit. If things alert for those hosts please ack or disregard. sorry!