[09:43:48] I thought cumin was installed properly on cumin01 on mariadbtest (cloud). It is not. Can I install it following https://wikitech.wikimedia.org/wiki/Help:Cumin_master and give it access to db hosts?
[09:44:22] marostegui: checking the patch, s3 is now compressing, it is the last snapshot job
[09:44:37] jynus: no problem from my side regarding cumin
[09:44:47] And thanks, let me know when the snapshot is done, I am fully ready from my side
[09:49:08] I will stop bacula at :05
[09:49:28] no problem
[10:18:28] looks like s3 finished \o/
[10:18:37] can you confirm jynus?
[10:19:52] yeah
[10:19:57] I confirm
[10:20:05] So we are good to go from your side?
[10:20:08] yes
[10:20:11] Ok!
[10:20:34] Moving to -operations
[10:30:18] m2 will be easier as there is no bacula or dbbackups involved
[10:33:23] yeah, I need to check all the services and see which ones really require coordination
[10:33:43] as I don't think like 5 seconds of RO or less is worth coordinating with many stakeholders anymore
[10:34:25] I don't think the ro is the problem
[10:34:34] but the restart of services if they get stuck
[10:35:16] for example, I think the fork of otrs is there, but I don't have much visibility
[10:35:35] otrs was never a problem before for switchovers
[10:35:42] ok
[10:36:13] i think it has always been mostly etherpad
[10:36:16] (on m1)
[10:37:00] And I think debmonitor required it, once, but I cannot remember
[10:43:22] marostegui: required what?
[10:43:28] :D
[10:43:29] a restart
[10:44:03] I think it should reconnect on failure, but if the db is still up but RO maybe it doesn't
[12:15:20] marostegui: Amir1: are you able to suggest some canary db hosts i can migrate to puppet7 as a smoke test
[12:15:33] jbond: you can use db1124
[12:15:43] marostegui: thanks
[12:38:30] marostegui: that one has been migrated; when you have some time can you check it over, including a mysql restart, and if all is ok let me know which hosts/roles to proceed with, thanks
[12:38:47] jynus: do you have a canary backup host(s) i can test the migration on
[12:38:54] jbond: wilco
[12:39:19] thanks :)
[12:39:32] jbond: can you also test db1133 which is bullseye? (db1124 is bookworm)
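A minimal sketch of the reconnect problem raised at 10:44: if the database stays up but flips to read-only during a switchover, a client that only reconnects on connection failure never notices. This is not Debmonitor's actual code; the function name and connection parameters are placeholders, assuming a pymysql client.

```python
# Sketch only (assumed pymysql client, placeholder connection parameters):
# treat "server is read-only" like a disconnect, not a fatal error.
import time
import pymysql

READ_ONLY_ERR = 1290  # ER_OPTION_PREVENTS_STATEMENT: write refused because of --read-only


def write_with_reconnect(conn_kwargs, query, args=None, retries=3):
    """Run one write, re-opening the connection if the primary is read-only."""
    for attempt in range(retries):
        conn = pymysql.connect(**conn_kwargs)
        try:
            with conn.cursor() as cur:
                cur.execute(query, args)
            conn.commit()
            return
        except pymysql.MySQLError as exc:
            if not exc.args or exc.args[0] != READ_ONLY_ERR:
                raise
            # Server is up but read-only (mid-switchover): back off and retry,
            # re-resolving the primary on the next connect.
            time.sleep(2 ** attempt)
        finally:
            conn.close()
    raise RuntimeError("primary still read-only after %d attempts" % retries)
```

The same idea applies to a long-lived connection pool: treating error 1290 as a signal to drop and re-establish the connection avoids the "service needs a manual restart after the switchover" situation described above.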
[12:39:48] marostegui: sure thing
[12:39:55] ta
[12:59:46] jbond: there are really no canaries for backups, just apply it on backup1001 and if something breaks we will know instantly
[13:00:31] maybe dbprov2001 also for variety
[13:00:56] and backup2010 for bookworm
[13:02:15] jynus: ack will do thanks
[13:13:06] marostegui: db1133 is done now as well
[13:13:14] great thanks
[13:13:15] I will test it
[13:13:22] thx
[13:40:55] jynus: those three machines have been migrated, please have a poke and let me know if you see any issues
[13:42:43] sure, thank you
[13:47:33] run-puppet-agent seems very slow, normally it takes much less time
[13:47:40] "Waited 120 seconds and a preceding puppet run is still ongoing, aborting"
[13:50:11] backup2010 last run was 36s, earlier runs were 38s; dbprov2001 01:25, earlier runs 39s
[13:50:51] this is backup1001
[13:50:52] db1133 is now 01:13, was 35s
[13:51:31] it is still running, well over a minute
[13:51:38] backup1001's last unchanged run is 02:48, was 47s last night
[13:51:59] jbond: ^^^ as I have no idea what the performance of p7 vs p5 is like
[13:53:35] volans: thanks, i think the caching CR that jesse is working on should help fix this
[13:53:40] (cc jynus)
[13:53:48] Notice: Applied catalog in 290.09 seconds
[13:53:55] that's almost 5 minutes
[13:53:57] but ftr when i ran puppet on backup2001 it was 45 secs
[13:54:09] jynus: what host, and is it repeatable?
[13:54:21] this is backup1001 which reads lots of exported resources
[13:54:28] but it was almost instant before
[13:54:28] both dbprov and backup finished in 45 secs for me
[13:54:45] jbond: last 3 runs https://puppetboard.wikimedia.org/node/backup1001.eqiad.wmnet
[13:54:49] backup2001 != backup1001
[13:57:14] https://puppetboard.wikimedia.org/report/backup1001.eqiad.wmnet/7e6c40b1242b6ce4982be9f448ec4b607c1eb10d
[13:58:38] compared to the 25 seconds of: https://puppetboard.wikimedia.org/report/backup1001.eqiad.wmnet/ddaff01f3910e19772f79e0a14642b52ee5d6e10
[13:58:48] jynus: i'm looking
[13:59:33] it seems there was a peak on puppetdb over the last 30 mins https://grafana.wikimedia.org/d/Ii5pUfqMk/puppetdb-thanos?orgId=1&from=now-3h&to=now
[14:02:19] I think it is the "Info: Loading facts" step that is slow
[14:02:45] but that shouldn't be, because the client hasn't been changed (?)
[14:02:49] i don't think so, as `sudo facter -p` runs quite quickly
[14:03:36] anyway, this is not something to be immediately worried about, but please look into whether there is something that could explain it
[14:04:57] I think 25 seconds -> 290 is worth researching (even if we don't revert at the moment)
[14:05:10] jynus: yes, i agree, i'm looking at it
[14:06:39] everything else looks fine, backup worked, prometheus exported metrics worked, etc.
[14:06:48] jynus: ack, that's good
[14:07:01] db1124 is also taking ages to run
[14:07:18] so mostly saying to pause and evaluate
[14:07:39] maybe it is something as simple as "new puppet master doesn't have enough resources"
[14:07:43] Notice: Applied catalog in 163.17 seconds
[14:07:59] or a missing index on postgres, or whatever
[14:14:43] hmm, looks like there is a bit of pressure on the puppet servers
[14:17:09] this is a good thing, we can react before extending it to more servers :-)
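The slow-run investigation above is essentially a comparison of "Applied catalog in N seconds" figures before and after the puppet7 migration. A small sketch of that check, not an existing WMF tool; the function names and the 3x threshold are arbitrary, and the only numbers taken from the conversation are backup1001 (25s to 290.09s) and db1133 (35s to 1:13).

```python
# Sketch: pull the catalog apply time out of `puppet agent -t` output and flag
# hosts whose run time regressed badly against a known-good baseline.
import re

APPLIED_RE = re.compile(r"Applied catalog in ([\d.]+) seconds")


def catalog_apply_seconds(agent_output: str) -> float:
    """Extract the catalog apply time from puppet agent output."""
    match = APPLIED_RE.search(agent_output)
    if match is None:
        raise ValueError("no 'Applied catalog' line found")
    return float(match.group(1))


def flag_regressions(current: dict, baseline: dict, factor: float = 3.0) -> dict:
    """Return hosts whose apply time grew by more than `factor` times."""
    return {
        host: (baseline[host], seconds)
        for host, seconds in current.items()
        if host in baseline and seconds > baseline[host] * factor
    }


if __name__ == "__main__":
    baseline = {"backup1001.eqiad.wmnet": 25.0, "db1133.eqiad.wmnet": 35.0}
    current = {"backup1001.eqiad.wmnet": 290.09, "db1133.eqiad.wmnet": 73.0}
    print(flag_regressions(current, baseline))
    # only backup1001 exceeds the 3x threshold:
    # {'backup1001.eqiad.wmnet': (25.0, 290.09)}
```

In practice the same before/after numbers can be read straight off the puppetboard report links quoted above; the point is simply to compare against a per-host baseline rather than an absolute limit, since backup1001 is expected to be slower than most hosts due to its exported resources.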
[14:29:13] btullis: another day, and aqs1012 is still down. I'm sure dcops will get around to that machine eventually, but there is probably a threshold past which waiting for the ideal outcome (which in this case would be getting the partitions back fully intact) is actually worse than having it done messier/sooner. :(
[14:29:53] we've been down two days; if we had it back up today, I'd be looking to see if we could rerun the last couple of days' worth of imports. if not, we'd have to do a full repair
[14:56:59] (PuppetFailure) firing: Puppet has failed on thanos-be1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:06:59] (PuppetFailure) resolved: Puppet has failed on thanos-be1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:39:03] jynus: marostegui: puppet runs should be quick again
[15:40:15] let me see
[15:41:09] jbond: Notice: Applied catalog in 24.27 seconds
[15:41:11] \o/
[15:41:40] \o/
[15:46:37] jbond: what was it?
[15:47:10] jynus: we still need to dig into it but we added caching, which has given a massive boost
[15:47:17] * jbond in meeting now so can't expand
[15:47:29] yeah, sorry. Good job!
[15:47:33] no thx
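The aqs1012 exchange above boils down to a trade-off: once the host is back, rerun the imports it missed, or fall back to a full repair if the outage has gone on too long. A toy sketch of that decision follows; the 3-day rerun window and the function name are made up for illustration, and only the "down two days" figure comes from the conversation.

```python
# Sketch of the rerun-vs-full-repair decision, under assumed parameters.
from datetime import datetime, timedelta, timezone


def recovery_plan(down_since: datetime,
                  rerun_window: timedelta = timedelta(days=3)) -> str:
    """Pick a recovery strategy based on how long the host has been down."""
    downtime = datetime.now(timezone.utc) - down_since
    if downtime <= rerun_window:
        # Short enough outage: re-run just the imports that were missed.
        missed_days = max(downtime.days, 1)
        return "rerun the last %d day(s) of imports" % missed_days
    # Too much has been missed to backfill incrementally.
    return "full repair (outage exceeded the rerun window)"


if __name__ == "__main__":
    # Hypothetical timestamp: the host went down two days ago, as in the chat.
    print(recovery_plan(datetime.now(timezone.utc) - timedelta(days=2)))
    # -> rerun the last 2 day(s) of imports
```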