[08:38:51] morning
[08:45:58] o/
[09:02:20] morning
[09:28:38] morning! I'm back, feeling mostly ok after the flu :)
[09:47:26] \o/
[13:01:05] in case anyone is comfortable reviewing some striker code, https://gerrit.wikimedia.org/r/c/labs/striker/+/992131 unbreaks currently completely broken functionality. works fine locally, although setting up the environment took quite a bit of work
[14:34:29] If you have any outgoing info to report to the SRE meeting, please add it to the etherpad.
[14:49:30] taavi: I'm looking at those striker patches. When you get a second can you doublecheck the pcc results on https://gerrit.wikimedia.org/r/c/operations/puppet/+/992192 ?
[14:53:56] andrewbogott: those patches will affect all the control nodes at once?
[14:54:20] (sorry for the lack of context on my side, I have not been following the move closely)
[14:55:06] dcaro: yes. My experience in codfw1dev is that the change requires a total tear-down and rebuild of the galera cluster; easiest to do it once rather than three times. So there will be an api outage during the switch
[14:55:36] (and when I added all three control nodes to the Hosts: line the linter rejected my commit message)
[14:58:56] okok, do you know what those prometheus grants are for?
[14:59:06] (do they have to be applied manually somewhere?)
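(The grants question gets answered a bit further down: they do have to be applied by hand. A hypothetical helper for that manual step; the grants-file path is the one named in the conversation, but feeding it to the root mysql client via sudo is an assumption about how these hosts are set up.)

```python
import shlex


def apply_grants_cmd(
    grants_file: str = "/etc/prometheus/prometheus_performance_grants.mysql",
) -> str:
    """Build the one-liner to apply a grants file by hand on a cloudcontrol.

    Assumption: the root mysql client, invoked via sudo, can source the
    grants file directly; quoting guards against odd paths.
    """
    return f"sudo mysql < {shlex.quote(grants_file)}"
```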
[15:00:13] '/etc/prometheus/prometheus_performance_grants.mysql'
[15:00:21] I am /pretty sure/ they get applied automatically but let's see if Taavi agrees
[15:01:08] okok
[15:01:27] lgtm as it is, but yep, second pair of eyes would be nice
[15:01:59] andrewbogott: I think you can add multiple Hosts: lines in a patch
[15:02:03] hm, no I bet I'm wrong and they have to be manually applied
[15:02:08] dhinus: you can until it's > 80 chars
[15:02:40] andrewbogott: no, like multiple Hosts: lines that can each be up to 80 chars
[15:02:51] that's what I meant yeah :)
[15:02:56] Oh I see, nice
[15:03:10] Anyway in this case the diff will be equivalent for each cloudcontrol
[15:03:15] and yeah, you need to apply those grants manually. I'm pretty sure we're (ab)using the prometheus user in a few places, otherwise I'd just change it to unix_socket authentication
[15:04:18] * andrewbogott applies those grants in codfw1dev
[15:05:39] I am going to forge ahead and cause a brief openstack API outage if no one has objections (want to get that change in place early in the day so anything that breaks happens while I'm awake)
[15:07:30] * dcaro ack
[15:07:44] go for it
[15:19:34] hm, this is going poorly
[15:25:40] oops, ask for help if you want it
[15:32:36] possibly https://gerrit.wikimedia.org/r/c/operations/puppet/+/992204
[15:33:53] LGTM, not sure why it's needed now though, maybe something else opened it? (or depending on the network it does tcp/udp :/)
[15:38:06] just did a quick netcat test, tcp seems open for sure
[15:38:35] it might be a red herring, but might as well start by following the docs :)
[15:39:09] * andrewbogott waiting impatiently for jenkins
[15:41:28] from the logs on cloudcontrol1006, it was able to connect to tcp://172.20.1.3
[15:42:57] but 1007 failed trying to connect to 172.20.2.32:4567
[15:43:42] probably because galera wasn't accepting syncs yet?
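(The quick netcat test mentioned above can be sketched in Python; `tcp_port_open` is a hypothetical helper, and the example host is just the node from the log. Note this only proves TCP reachability; if Galera is also using UDP on 4567, a connect test can't confirm that side, which matches the "tcp/udp :/" caveat.)

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# e.g. the Galera replication port check from the conversation:
# tcp_port_open("cloudcontrol1007.private.eqiad.wikimedia.cloud", 4567)
```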
[15:44:04] I'm restarting everything on 1006 now, let's see if it works this time
[15:44:09] maybe, cloudcontrol1006 stopped because it was unable to form quorum it says
[15:44:13] ack, I'll tail logs
[15:45:40] "Requesting state transfer: success, donor: 1" seems good
[15:45:51] the new firewall rules don't seem to be there on 1006 yet
[15:46:52] got restarted on 1006?
[15:47:24] [Warning] WSREP: 1.0 (cloudcontrol1005.private.eqiad.wikimedia.cloud): State transfer to 0.0 (cloudcontrol1006.private.eqiad.wikimedia.cloud) failed: -110 (Connection timed out)
[15:47:39] i think puppet just ran there
[15:47:41] now it seems it's syncing stuff
[15:47:50] rsync at least
[15:48:09] looks good now
[15:48:28] probably puppet set the firewall rules in that last run?
[15:48:46] yeah, lots of noise but I think it's working?
[15:49:01] I think so, I had not seen the rsync transfers before
[15:49:04] seems so
[15:49:09] puppet is incredibly slow on this cold morning so the puppet run to reset the firewall rules is still in progress, but yeah it probably made the fw change
[15:49:45] Jan 22 15:48:46 cloudcontrol1005 mariadbd[745101]: 2024-01-22 15:48:46 0 [Note] WSREP: Member 0.0 (cloudcontrol1006.private.eqiad.wikimedia.cloud) synced with group.
[15:49:50] So either it was the udp thing or I just needed to try six times for it to work
[15:49:54] * andrewbogott starts 1007 now
[15:50:24] xd
[15:51:24] dhinus: reading the mysql binlog to extract the transaction that's stuck, it's taking >20min, is that expected?
[15:52:17] nope, it's usually very quick
[15:52:27] what command are you using?
[15:53:43] mysqlbinlog --base64-output=decode-rows --verbose --start-position=28815896 /srv/labsdb/binlogs/log.062775
[15:53:54] that's toolsdb not galera right?
[15:54:00] yes sorry, unrelated :)
[15:55:30] ok, 1007 is now showing ready. So I think that's over.
[15:55:40] thx dcaro & taavi
[15:55:45] thank you!
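(That mysqlbinlog invocation streams the decoded log from --start-position all the way to the end of the file, which is what makes naively capturing its output expensive. When only the first events matter, capping the read and killing the process early avoids buffering everything; `head_of_command` is a hypothetical helper, not an existing spicerack function.)

```python
import subprocess


def head_of_command(cmd: list[str], max_lines: int = 100) -> list[str]:
    """Run cmd but keep only its first max_lines output lines.

    The process is terminated once enough lines have been read, so a
    command with enormous output never gets fully buffered in memory.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    lines: list[str] = []
    try:
        for line in proc.stdout:
            lines.append(line.rstrip("\n"))
            if len(lines) >= max_lines:
                break
    finally:
        proc.kill()
        proc.wait()
    return lines


# The binlog read from the conversation, capped at 100 lines:
# head_of_command(["mysqlbinlog", "--base64-output=decode-rows", "--verbose",
#                  "--start-position=28815896", "/srv/labsdb/binlogs/log.062775"])
```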
[15:55:53] dcaro: that command works immediately for me
[15:56:03] I'm going to restart openstack services to get things unstuck that might be upset about the db outage
[15:56:22] wow... I think spicerack was using >60G ram on my laptop, and finally crashed
[15:57:03] probably the logs were too many, and was transferring the whole of it?
[15:57:14] (and spicerack was buffering them)
[15:58:27] are you talking about the mysqlbinlog command? that's expected to produce a very long output, but you only care about the first dozen lines or so
[15:58:59] running it via spicerack might definitely break
[15:59:19] maybe we could add "head -n 100" or something
[16:01:45] there's also a --stop-position, can I use that?
[16:02:17] yep, though I'm not sure if the increment is always 1
[16:03:15] in that sense, what am I looking for/
[16:03:16] ?
[16:04:20] this is where my understanding of the binlog is not deep enough :D you want to show a single "event" from the log, in theory
[16:04:55] but each event can take up more than 1 integer in the "start/stop position"
[16:05:20] I see, there's actually huge gaps in the integers between lines
[16:05:45] in this case it's a delete query that causes many lines to be deleted, so I guess each actual deleted record will bump the integer
[16:05:47] after 28815896, goes 28815938
[16:06:12] hmm no then your number doesn't match my explanation, I was expecting an even bigger gap :)
[16:06:33] oh, interesting, so the logs are per actions on the DB, not queries (we are not replicating queries I guess, should be slower)
[16:06:43] there's more numbers until the delete
[16:07:21] exactly, and that's the root cause of the replication lag
[16:07:23] the delete is at 28815981, and the next non-empty number is 28922538
[16:07:44] I'm trying to write a cookbook to help there
[16:08:31] I imagined that was your idea, I like it :) but it's not super easy to extract something meaningful
[16:09:03] you might also explore using SQL instead of the mysqlbinlog CLI https://mariadb.com/kb/en/show-binlog-events/
[16:11:22] this yields something useful, but "3" is very arbitrary and might not always work SHOW BINLOG EVENTS IN 'log.062775' FROM 28815896 LIMIT 3;
[16:34:59] interesting
[16:35:30] I have found also that at least mysqlbinlog will fail if you try to start at a position that does not exist
[16:36:09] Is this correctly formatted for getting cli access in codfw1dev? https://www.irccloud.com/pastebin/KWhbNAjy/
[16:41:48] Oh found the issue, indeed that is the right format, but had an old key
[17:04:14] andrewbogott: I tried to do a trove backup (testbackup in the metricsinfra project) and it doesn't seem to work
[17:04:25] well dang
[17:04:31] I wonder if it only works with fresh instances
[17:16:29] * dcaro off
[18:12:52] actually now that I've thought about it for a minute, I'm sure it only works on fresh instances. I need to figure out some way to refresh the guest config
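(One way to avoid the arbitrary LIMIT 3: keep reading SHOW BINLOG EVENTS rows until the transaction's commit marker. With row-based replication of InnoDB tables each transaction ends in an Xid event, so that event is a natural stopping point; `transaction_events` is a hypothetical sketch, not cookbook code, and it ignores non-transactional tables, whose transactions end in a COMMIT Query event instead.)

```python
def transaction_events(
    rows: list[tuple], start_pos: int
) -> list[tuple]:
    """Collect (Pos, Event_type, Info) rows from start_pos up to and
    including the Xid event that commits the transaction.

    `rows` mirrors the Pos / Event_type / Info columns of
    SHOW BINLOG EVENTS; stopping at Xid replaces guessing a LIMIT.
    """
    out = []
    for pos, event_type, info in rows:
        if pos < start_pos:
            continue
        out.append((pos, event_type, info))
        if event_type == "Xid":  # commit marker for InnoDB transactions
            break
    return out
```

Used against the positions from the conversation, the big delete and its commit would come back as one unit regardless of how many row events sit between 28815981 and 28922538.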