[12:13:49] s1 codfw snapshot wrong_size 7 hours ago 1.1 TB -10.5 % The previous backup had a size of 1.3 TB, a change larger than 5.0%.
[12:40:53] Should I ignore mail from: (I am sorry, I may have lost the log if discussed already)
[12:45:12] I don't think you should ignore it jynus
[12:45:23] I don't recall it being discussed before
[12:45:25] I will, I am not a DBA :-D
[12:45:31] but flagging it
[12:45:34] aha
[12:45:39] anyway, I think I found the reason
[12:45:49] I think it is a schema change
[12:46:31] possibly in the realm of Amir1 (not sure) - the table may just need some allowlist changes
[12:49:16] another case of not jumping into it because I see no immediate alarm by itself, but worth creating a ticket
[12:49:30] let me take a look
[12:50:08] jynus: arnaudb: I fixed it
[12:50:14] thanks!
[12:50:29] regarding backup size, that's the enwiki pagelinks PK change
[12:50:41] nice, great job, Amir1
[13:24:04] elukey: o/
[13:25:45] urandom: o/
[13:26:49] urandom: ready for restbase?
[13:27:01] as ready as I'll ever be :)
[13:27:46] ok so overall process that I have in mind:
[13:27:51] 1) disable puppet on restbase*
[13:28:00] 2) merge the truststore change
[13:28:17] 3) force a puppet run on a restbase2* node, and restart the instances
[13:28:24] 4) sanity check
[13:28:38] if all goes fine, we force puppet on all nodes and start the codfw restart via cookbook
[13:28:41] does that sound good?
[13:29:20] elukey: sounds good
[13:29:45] all right!
[13:29:50] puppet already disabled, merging
[13:33:56] running puppet on 2021
[13:35:52] urandom: ok so puppet ran fine, and I checked the content of /etc/ssl/localcerts/wmf-java-cacerts
[13:36:01] Your keystore contains 2 entries
[13:36:02] rootca, Apr 24, 2024, trustedCertEntry,
[13:36:02] Certificate fingerprint (SHA-256): E9:65:10:FE:10:6C:5C:53:D6:64:D6:3E:78:71:AA:A3:39:19:82:DC:E2:04:A4:9A:A0:01:EB:49:8E:BB:6F:A0
[13:36:04] wikimedia_internal_root_ca, Apr 24, 2024, trustedCertEntry,
[13:36:07] Certificate fingerprint (SHA-256): A4:FA:4E:BD:7A:7D:26:DE:FB:92:78:19:67:51:C5:3B:03:56:30:C2:AA:1F:99:9C:70:3A:8B:34:97:FF:B5:2F
[13:36:10] that is good
[13:36:29] I'll use the cookbook to restart the instances
[13:37:30] elukey: yeah, looks good to me
[13:37:54] lovely, the cookbook broke when stopping cassandra-a
[13:38:09] was the unit interrupted?
[13:38:20] was puppet disabled for the restart?
[13:38:36] * urandom wants to fix this
[13:38:46] 2024-04-24 13:37:16,972 INFO [a] Stopping service cassandra-a
[13:38:49] 2024-04-24 13:37:20,047 ERROR [a] b'Job for cassandra-a.service canceled.'
[13:38:52] 2024-04-24 13:37:20,048 ERROR [a] systemctl command returned exit code 1
[13:38:54] yes
[13:38:55] Traceback (most recent call last): File "/usr/bin/c-foreach-restart", line 60, in 
[13:39:12] that happens if the scheduled puppet run hits during the cookbook
[13:40:02] never seen it before, does the cookbook need to be changed to disable puppet?
[13:40:07] yes
[13:40:16] also it would not completely avoid these cases
[13:40:23] why so?
[13:40:43] you can disable puppet, but a run can already be in progress, and if you hit the node then the failure would happen
[13:40:55] or is it if it starts after the restart?
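(The keystore listing pasted at 13:36 above is the output of `keytool -list`. Below is a minimal sketch of that sanity-check step, not part of any cookbook; only the keystore path and the two alias names come from the log, while the `changeit` store password is the stock Java default and is purely an assumption here.)

```python
#!/usr/bin/env python3
"""Hypothetical sketch of the truststore sanity check described above.
Only the keystore path and expected alias names come from the log; the
store password is the stock Java default and is an assumption."""
import subprocess

KEYSTORE = "/etc/ssl/localcerts/wmf-java-cacerts"
EXPECTED_ALIASES = {"rootca", "wikimedia_internal_root_ca"}


def check_truststore() -> None:
    # List the keystore entries via the JRE's keytool.
    out = subprocess.run(
        ["keytool", "-list", "-keystore", KEYSTORE, "-storepass", "changeit"],
        capture_output=True, text=True, check=True,
    ).stdout
    missing = {alias for alias in EXPECTED_ALIASES if alias not in out}
    if missing:
        raise RuntimeError(f"truststore {KEYSTORE} is missing aliases: {missing}")
    # Print the listing so the SHA-256 fingerprints can be eyeballed against
    # the values pasted in the channel.
    print(out)


if __name__ == "__main__":
    check_truststore()
```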
[13:41:04] (anyway, restarted the cookbook for 2021)
[13:41:59] right, c-foreach-restart does a start and stop, and if the puppet run fires between them, its own attempt to start cancels the script's start (or perhaps the other way)
[13:42:57] sorry... c-foreach-restart does a stop and start, and you might expect :)
[13:43:06] s/and/as/
[13:43:10] * elukey nods
[13:43:19] * urandom doubles down on his morning coffee
[13:44:50] anyway... it doesn't prevent the node from starting, but it does cause the restart script to exit
[13:45:06] so it can be pretty annoying
[13:48:25] urandom: 2021 ready for a sanity check!
[13:50:58] elukey: lgtm
[13:51:35] all right, running puppet in codfw and kicking off the roll restart
[13:57:03] urandom: codfw roll restart started
[14:13:00] urandom: the cookbook failed :(
[14:13:12] same error?
[14:13:33] yep
[14:13:43] has it happened with this frequency in the past?
[14:13:44] was puppet enabled?
[14:13:45] yes
[14:14:12] I mean, I've had times when I was lucky, and others when I wasn't
[14:14:35] I just disable puppet before the run now
[14:14:50] okok sorry, I didn't get this part, doing it
[14:15:04] which is not great because you also have to remember to re-enable it, which I recently forgot
[14:15:24] we can add it to the cookbook, should be one line
[14:15:27] forgetting is obviously going to happen, which is why this needs to be integrated into the cookbook
[14:16:00] yeah, pretty sure I left myself a ticket open on this, I just haven't followed through :(
[15:18:46] urandom: filed a change to fix the cookbook (we'll need another one later on), atm we are halfway through the codfw restarts and nothing exploded
[15:19:06] due to time constraints, would it be ok if I leave the eqiad roll restart to you?
[15:24:47] elukey: yes, I can do them
[15:37:41] greetings, data-persistence! following up on my post from the 9th: we're now targeting the mediawiki infra deployment window (UTC late) on Tuesday 4/30 for the etcd maintenance I mentioned.
[15:37:41] to recap: this is a 30m-1h window where non-emergency conftool actions (e.g., dbctl, confctl) are discouraged.
[15:37:41] would that be acceptable for you folks in terms of long-running operations such as schema changes?
[15:38:20] urandom: ok! So I ran puppet on restbase1* and checked the new truststore, all good
[15:38:55] elukey: awesome
[15:39:22] hopefully with the new cookbook
[15:41:47] arnaudb: FYI, as it was recommended I coordinate with you at the time :)
[15:49:47] urandom: new cookbook in, should be ready when you start the roll restart later on
[15:50:15] 👍
[16:44:05] urandom: codfw restarted!
[16:44:28] elukey: great, thanks; I'll get started on eqiad
[16:44:48] super, lemme know how it goes
[16:45:44] (going afk for the evening, ttl!)
[16:50:32] enjoy your evening
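(For reference, the cookbook fix discussed above amounts to ordering the per-node work so that the puppet agent is disabled before c-foreach-restart's stop/start and always re-enabled afterwards, so a scheduled agent run can no longer cancel the systemctl job. The sketch below is hypothetical and is not the actual spicerack cookbook or c-foreach-restart; the instance names and restart command are illustrative.)

```python
#!/usr/bin/env python3
"""Hypothetical sketch of the ordering the fixed cookbook enforces on each
node: disable the puppet agent, restart the Cassandra instances, then always
re-enable the agent. Not the real cookbook; instance names are illustrative."""
import subprocess

INSTANCES = ["cassandra-a", "cassandra-b"]  # assumed multi-instance layout
REASON = "cassandra roll-restart"


def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)


def roll_restart_node() -> None:
    # Disable the agent first so a scheduled puppet run cannot race the
    # stop/start and cancel the systemctl job, as seen in the log above.
    run("puppet", "agent", "--disable", REASON)
    try:
        for instance in INSTANCES:
            # Restart one instance at a time.
            run("systemctl", "restart", f"{instance}.service")
    finally:
        # Always re-enable, even if a restart fails, so nobody has to
        # remember to do it by hand.
        run("puppet", "agent", "--enable")


if __name__ == "__main__":
    roll_restart_node()
```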