[01:24:31] PROBLEM - MariaDB sustained replica lag on s4 on db1248 is CRITICAL: 33 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[01:25:31] RECOVERY - MariaDB sustained replica lag on s4 on db1248 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1248&var-port=9104
[05:06:11] I am going to switch s2 codfw master
[08:00:06] volans: we have a 1:1, arnaudb and me, now, feel free to jump in to listen in the background for debugging & error handling. also goes for marostegui
[08:00:19] thanks jynus <3
[08:01:34] link? :)
[08:02:16] volans: shared it privately
[08:16:29] wow revision in wikidata is 419G
[10:16:07] marostegui: do you know how long the bigint schema change takes per host, more or less?
[10:16:24] on s8 I don't know yet
[10:16:33] s2 I am more interested in
[10:16:41] Actually I am doing it right now there
[10:16:43] Let me check
[10:17:10] 2024-07-03 00:05:24.218632 db-mysql db1225:3312
[10:17:12] that was the start
[10:17:17] yeah, sadly backups cannot run at the same time, so waiting for it to finish to retry them
[10:17:24] 2024-07-03 09:24:47.182336 db-mysql db1225:3312
[10:17:29] it is now running on the last wiki
[10:17:36] so 10 hours or so?
[10:17:40] So I guess it is safe to assume it will be done around 11 or 12
[10:17:41] I see, thanks
[10:17:54] will then retry the backup when I see it has finished
[10:17:58] thank you
[10:18:05] I will ping you when it is done
[10:18:20] ^ arnaudb I know you are having lunch but this is something you may have to handle
[10:18:36] (we discussed this as a typical example of a backup failing, marostegui
[10:18:37] )
[10:18:47] jynus: excellent! I will also be gone after you, so :)
[10:19:12] it retried on its own, but the schema change was also ongoing, so in this case it needed human intervention
[10:19:24] yeah, these revision schema changes are long :(
[10:19:28] although it would run anyway tomorrow, so not a big deal if not
[10:19:56] this was more of an example for the discussion around dbbackups, so it was great (helping debugging)
[10:20:04] good!
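As a quick illustration of the "10 hours or so" estimate above, here is a minimal sketch of the elapsed-time arithmetic, using the two timestamps quoted from the schema-change log at [10:17:10] and [10:17:24]; the variable names are only for illustration.

```python
from datetime import datetime

# Timestamps copied verbatim from the db-mysql log lines quoted at [10:17:10] and [10:17:24].
fmt = "%Y-%m-%d %H:%M:%S.%f"
start = datetime.strptime("2024-07-03 00:05:24.218632", fmt)
end = datetime.strptime("2024-07-03 09:24:47.182336", fmt)

elapsed = end - start
print(elapsed)                                    # 9:19:22.963704
print(f"~{elapsed.total_seconds() / 3600:.1f}h")  # ~9.3h on this host, i.e. the "10 hours or so" estimate
```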
[10:38:31] jynus: db1225:3312 finished
[10:41:10] great
[10:41:30] running "remote-backup-mariadb s2" at cumin1002
[10:43:52] k
[10:44:22] it is on a screen session
[10:46:49] volans: I forgot to mention that a currently running backup is also considered as a failed backup according to monitoring, so prometheus errors may go down
[10:47:08] as running != terminated successfully
[10:47:56] got it, thx
[11:15:53] * volans lunch
[11:25:55] marostegui: are you still busy with the old s8 eqiad master?
[11:27:41] it was pooled, so no, I assume :D
[12:15:36] Amir1: you can go for it
[12:18:17] in the meeting I said that all logs for bacula are on the director; allow me to qualify that statement with: "most likely you will only need to check the director"
[12:18:43] sure sure
[12:18:45] there will be logs for the client and the storage daemons, but only in very specific cases you will need to look at those
[12:19:19] e.g. if the network is not working, you may have to look at those, or if something is purely disk-based, or a bug
[12:19:33] but for the most part the errors are sent to the director if they can contact it
[15:23:59] jynus: o/ if you have time, really low priority - IIUC we don't backup /srv/private on puppetmasters, is there a historical reason besides the fact that we replicate it to multiple nodes?
[15:25:18] I am reviewing all the steps needed to move our usage of the private puppet repo to the new puppetserver nodes, that's why I am asking
[15:26:46] elukey: are you sure that is true? I believe someone told me the same and upon inspection something else happened (e.g. all of /srv is backed up or something else)
[15:31:22] jynus: nono I think this is the case from a quick inspection, but I am pretty sure I could be wrong
[15:33:18] did you change the role or profile of those servers recently?
[15:33:29] Because I remember explicitly a discussion about backing up those
[15:33:57] nono no changes
[15:36:04] do you have the manifest of the profile for a quick blame?
[15:40:44] So this is what I remember: moritz asked about how encryption worked to back up private stuff, and after some discussion, given encryption at rest and encryption on the wire were used, that regular backups were ok, but not perfect because the backup uses keys stored there
[15:41:07] so we said to find an additional way for backups, but for now, backing that up was better than nothing
[15:41:10] that is what I remember
[15:41:14] many years ago
[15:41:19] in profile::puppetmaster::frontend I see two backup::set, var-lib-puppet-ssl and var-lib-puppet-volatile
[15:42:27] or maybe it was the pw repo and I am confusing it?
[15:42:58] totally new to the details of the puppet infra, I don't have any context :(
[15:43:23] I can try to ask Joe and Alex
[15:46:28] I still think it is a good thing you have brought up
[15:46:44] because personally I thought it was being backed up
[15:47:29] I'll try to ask around and report back, thanks!
[15:47:30] is it a local git repo?
[15:47:47] because if it is cloned from somewhere else, maybe it is backed up somewhere else?
[15:48:22] in theory no, we have two authoritative repos on puppetmaster1001/2001 and the other puppetmaster nodes (backends) have bare repos
[15:48:39] on puppetserver nodes (puppet 7) all hosts have a writable repo
[15:48:50] but I think we never tested committing in there
[15:49:04] then a post-commit hook pushes to the other nodes to keep them in sync
[15:49:06] mmm, maybe the 7 migration moved things around?
[15:49:38] but is srv private the authoritative location, or is it pushed somewhere?
[15:49:55] e.g. are some of the other locations backed up?
[15:50:54] not afaics
[15:51:00] on puppetserver nodes (the new ones) we have
[15:51:00] Backed up on this host: etc-puppet-puppetserver-ca
[15:51:00] Backed up on this host: srv-puppet_fileserver-volatile
[15:51:12] but we don't push elsewhere
[15:51:20] (brb)
[15:51:29] and how does the private stuff integrate into the rest?
[15:57:18] there are 4 copies of that, and 2 probably shouldn't be there
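As a footnote to the "quick inspection" of backup::set declarations above, here is a rough sketch of one way to list them from a local checkout of the puppet repo; the checkout path, filename pattern, and regex are assumptions for illustration, not the tooling actually used in the conversation.

```python
#!/usr/bin/env python3
"""List backup::set declarations per manifest in a local puppet checkout,
to see at a glance which paths each profile declares for backup.
The PUPPET_REPO path below is a hypothetical local checkout location."""
import pathlib
import re

PUPPET_REPO = pathlib.Path("~/puppet/modules/profile/manifests").expanduser()  # assumption
BACKUP_SET = re.compile(r"""backup::set\s*\{\s*['"]([^'"]+)['"]""")

for manifest in sorted(PUPPET_REPO.rglob("*.pp")):
    # Collect every backup::set resource title declared in this manifest.
    sets = BACKUP_SET.findall(manifest.read_text(encoding="utf-8", errors="ignore"))
    if sets:
        print(f"{manifest.relative_to(PUPPET_REPO)}: {', '.join(sets)}")
```

Run against profile::puppetmaster::frontend, a grep of this kind would be expected to surface the var-lib-puppet-ssl and var-lib-puppet-volatile sets mentioned above, and nothing for /srv/private.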