[08:38:51] morning
[08:45:58] o/
[09:02:20] morning
[09:28:38] morning! I'm back, feeling mostly ok after the flu :)
[09:47:26] \o/
[13:01:05] in case anyone is comfortable reviewing some striker code, https://gerrit.wikimedia.org/r/c/labs/striker/+/992131 unbreaks currently completely broken functionality. works fine locally, although setting up the environment took quite a bit of work
[14:34:29] If you have any outgoing info to report to the SRE meeting, please add it to the etherpad.
[14:49:30] taavi: I'm looking at those striker patches. When you get a second can you doublecheck the pcc results on https://gerrit.wikimedia.org/r/c/operations/puppet/+/992192 ?
[14:53:56] andrewbogott: those patches will affect all the control nodes at once?
[14:54:20] (sorry for the lack of context on my side, I have not been following the move closely)
[14:55:06] dcaro: yes. My experience in codfw1dev is that the change requires a total tear-down and rebuild of the galera cluster; easiest to do it once rather than three times. So there will be an api outage during the switch
[14:55:36] (and when I added all three control nodes to the Hosts: line the linter rejected my commit message)
[14:58:56] okok, do you know what those prometheus grants are for?
[14:59:06] (do they have to be applied manually somewhere?)
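(The grants question gets answered a bit further down: they do have to be applied by hand. A hypothetical helper for that manual step; the grants-file path is the one named in the conversation, but feeding it to the root mysql client via sudo is an assumption about how these hosts are set up.)

```python
import shlex


def apply_grants_cmd(
    grants_file: str = "/etc/prometheus/prometheus_performance_grants.mysql",
) -> str:
    """Build the one-liner to apply a grants file by hand on a cloudcontrol.

    Assumption: the root mysql client, invoked via sudo, can source the
    grants file directly; quoting guards against odd paths.
    """
    return f"sudo mysql < {shlex.quote(grants_file)}"
```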
[15:00:13] '/etc/prometheus/prometheus_performance_grants.mysql'
[15:00:21] I am /pretty sure/ they get applied automatically but let's see if Taavi agrees
[15:01:08] okok
[15:01:27] lgtm as it is, but yep, second pair of eyes would be nice
[15:01:59] andrewbogott: I think you can add multiple Hosts: lines in a patch
[15:02:03] hm, no I bet I'm wrong and they have to be manually applied
[15:02:08] dhinus: you can until it's > 80 chars
[15:02:40] andrewbogott: no, like multiple Hosts: lines that can each be up to 80 chars
[15:02:51] that's what I meant yeah :)
[15:02:56] Oh I see, nice
[15:03:10] Anyway in this case the diff will be equivalent for each cloudcontrol
[15:03:15] and yeah, you need to apply those grants manually. I'm pretty sure we're (ab)using the prometheus user in a few places, otherwise I'd just change it to unix_socket authentication
[15:04:18] * andrewbogott applies those grants in codfw1dev
[15:05:39] I am going to forge ahead and cause a brief openstack API outage if no one has objections (want to get that change in place early in the day so anything that breaks happens while I'm awake)
[15:07:30] * dcaro ack
[15:07:44] go for it
[15:19:34] hm, this is going poorly
[15:25:40] oops, ask for help if you want it
[15:32:36] possibly https://gerrit.wikimedia.org/r/c/operations/puppet/+/992204
[15:33:53] LGTM, not sure why it's needed now though, maybe something else opened it? (or depending on the network it does tcp/udp :/)
[15:38:06] just did a quick netcat test, tcp seems open for sure
[15:38:35] it might be a red herring, but might as well start by following the docs :)
[15:39:09] * andrewbogott waiting impatiently for jenkins
[15:41:28] from the logs on cloudcontrol1006, it was able to connect to tcp://172.20.1.3
[15:42:57] but 1007 failed trying to connect to 172.20.2.32:4567
[15:43:42] probably because galera wasn't accepting syncs yet?
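(The quick netcat test mentioned above can be sketched in Python; `tcp_port_open` is a hypothetical helper, and the example host is just the node from the log. Note this only proves TCP reachability; if Galera is also using UDP on 4567, a connect test can't confirm that side, which matches the "tcp/udp :/" caveat.)

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# e.g. the Galera replication port check from the conversation:
# tcp_port_open("cloudcontrol1007.private.eqiad.wikimedia.cloud", 4567)
```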
[15:44:04] I'm restarting everything on 1006 now, let's see if it works this time
[15:44:09] maybe, cloudcontrol1006 stopped because it was unable to form quorum it says
[15:44:13] ack, I'll tail logs
[15:45:40] "Requesting state transfer: success, donor: 1" seems good
[15:45:51] the new firewall rules don't seem to be there on 1006 yet
[15:46:52] got restarted on 1006?
[15:47:24] [Warning] WSREP: 1.0 (cloudcontrol1005.private.eqiad.wikimedia.cloud): State transfer to 0.0 (cloudcontrol1006.private.eqiad.wikimedia.cloud) failed: -110 (Connection timed out)
[15:47:39] i think puppet just ran there
[15:47:41] now it seems it's syncing stuff
[15:47:50] rsync at least
[15:48:09] looks good now
[15:48:28] probably puppet set the firewall rules in that last run?
[15:48:46] yeah, lots of noise but I think it's working?
[15:49:01] I think so, I had not seen the rsync transfers before
[15:49:04] seems so
[15:49:09] puppet is incredibly slow on this cold morning so the puppet run to reset the firewall rules is still in progress, but yeah it probably made the fw change
[15:49:45] Jan 22 15:48:46 cloudcontrol1005 mariadbd[745101]: 2024-01-22 15:48:46 0 [Note] WSREP: Member 0.0 (cloudcontrol1006.private.eqiad.wikimedia.cloud) synced with group.
[15:49:50] So either it was the udp thing or I just needed to try six times for it to work
[15:49:54] * andrewbogott starts 1007 now
[15:50:24] xd
[15:51:24] dhinus: reading the mysql binlog to extract the transaction that's stuck, it's taking >20min, is that expected?
[15:52:17] nope, it's usually very quick
[15:52:27] what command are you using?
[15:53:43] mysqlbinlog --base64-output=decode-rows --verbose --start-position=28815896 /srv/labsdb/binlogs/log.062775
[15:53:54] that's toolsdb not galera right?
[15:54:00] yes sorry, unrelated :)
[15:55:30] ok, 1007 is now showing ready. So I think that's over.
[15:55:40] thx dcaro & taavi
[15:55:45] thank you!
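(That mysqlbinlog invocation streams the decoded log from --start-position all the way to the end of the file, which is what makes naively capturing its output expensive. When only the first events matter, capping the read and killing the process early avoids buffering everything; `head_of_command` is a hypothetical helper, not an existing spicerack function.)

```python
import subprocess


def head_of_command(cmd: list[str], max_lines: int = 100) -> list[str]:
    """Run cmd but keep only its first max_lines output lines.

    The process is terminated once enough lines have been read, so a
    command with enormous output never gets fully buffered in memory.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    lines: list[str] = []
    try:
        for line in proc.stdout:
            lines.append(line.rstrip("\n"))
            if len(lines) >= max_lines:
                break
    finally:
        proc.kill()
        proc.wait()
    return lines


# The binlog read from the conversation, capped at 100 lines:
# head_of_command(["mysqlbinlog", "--base64-output=decode-rows", "--verbose",
#                  "--start-position=28815896", "/srv/labsdb/binlogs/log.062775"])
```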
[15:55:53] dcaro: that command works immediately for me
[15:56:03] I'm going to restart openstack services to get things unstuck that might be upset about the db outage
[15:56:22] wow... I think spicerack was using >60G ram on my laptop, and finally crashed
[15:57:03] probably the logs were too many, and was transferring the whole of it?
[15:57:14] (and spicerack was buffering them)
[15:58:27] are you talking about the mysqlbinlog command? that's expected to produce a very long output, but you only care about the first dozen lines or so
[15:58:59] running it via spicerack might definitely break
[15:59:19] maybe we could add "head -n 100" or something
[16:01:45] there's also a --stop-position, can I use that?
[16:02:17] yep, though I'm not sure if the increment is always 1
[16:03:15] in that sense, what am I looking for/
[16:03:16] ?
[16:04:20] this is where my understanding of the binlog is not deep enough :D you want to show a single "event" from the log, in theory
[16:04:55] but each event can take up more than 1 integer in the "start/stop position"
[16:05:20] I see, there's actually huge gaps in the integers between lines
[16:05:45] in this case it's a delete query that causes many lines to be deleted, so I guess each actual deleted record will bump the integer
[16:05:47] after 28815896, goes 28815938
[16:06:12] hmm no then your number doesn't match my explanation, I was expecting an even bigger gap :)
[16:06:33] oh, interesting, so the logs are per actions on the DB, not queries (we are not replicating queries I guess, should be slower)
[16:06:43] there's more numbers until the delete
[16:07:21] exactly, and that's the root cause of the replication lag
[16:07:23] the delete is at 28815981, and the next non-empty number is 28922538
[16:07:44] I'm trying to write a cookbook to help there
[16:08:31] I imagined that was your idea, I like it :) but it's not super easy to extract something meaningful
[16:09:03] you might also explore using SQL instead of the mysqlbinlog CLI https://mariadb.com/kb/en/show-binlog-events/
[16:11:22] this yields something useful, but "3" is very arbitrary and might not always work SHOW BINLOG EVENTS IN 'log.062775' FROM 28815896 LIMIT 3;
[16:34:59] interesting
[16:35:30] I have found also that at least mysqlbinlog will fail if you try to start at a position that does not exist
[16:36:09] Is this correctly formatted for getting cli access in codfw1dev? https://www.irccloud.com/pastebin/KWhbNAjy/
[16:41:48] Oh found the issue, indeed that is the right format, but had an old key
[17:04:14] andrewbogott: I tried to do a trove backup (testbackup in the metricsinfra project) and it doesn't seem to work
[17:04:25] well dang
[17:04:31] I wonder if it only works with fresh instances
[17:16:29] * dcaro off
[18:12:52] actually now that I've thought about it for a minute, I'm sure it only works on fresh instances. I need to figure out some way to refresh the guest config
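(One way to avoid the arbitrary LIMIT 3: keep reading SHOW BINLOG EVENTS rows until the transaction's commit marker. With row-based replication of InnoDB tables each transaction ends in an Xid event, so that event is a natural stopping point; `transaction_events` is a hypothetical sketch, not cookbook code, and it ignores non-transactional tables, whose transactions end in a COMMIT Query event instead.)

```python
def transaction_events(
    rows: list[tuple], start_pos: int
) -> list[tuple]:
    """Collect (Pos, Event_type, Info) rows from start_pos up to and
    including the Xid event that commits the transaction.

    `rows` mirrors the Pos / Event_type / Info columns of
    SHOW BINLOG EVENTS; stopping at Xid replaces guessing a LIMIT.
    """
    out = []
    for pos, event_type, info in rows:
        if pos < start_pos:
            continue
        out.append((pos, event_type, info))
        if event_type == "Xid":  # commit marker for InnoDB transactions
            break
    return out
```

Used against the positions from the conversation, the big delete and its commit would come back as one unit regardless of how many row events sit between 28815981 and 28922538.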