[10:03:34] good morning
[10:05:13] it seems all backups are failing?
[10:05:55] yeah, I created https://phabricator.wikimedia.org/T351617
[10:06:32] they may be running but not reporting, I will check
[10:06:59] yeah, I didn't spend much time on it given that you were coming back today
[10:07:46] last s1 dump is from 2023-11-21--00-00-05
[10:08:14] and last snapshot is from 2023-11-20--00-00-01
[10:08:49] so the backups are happening, but with 0 monitoring
[10:09:38] "Can't connect to MySQL server on 'localhost' ([Errno 111] Connection refused)"
[10:09:47] :-(
[10:11:07] it is not reading the new config file, despite the new config file being correct
[10:11:21] is that related to the puppet migration?
[10:11:35] I don't know yet
[10:11:59] I am an idiot
[10:12:08] stats_file: '/etc/wmfbackups/statistics.cnf'
[10:12:27] I changed the mysql connection file
[10:12:37] but I am still pointing to the old one
[10:12:57] so it loads 0 config and tries connecting with default parameters (localhost)
[10:13:07] so yeah, just a puppet fix will work
[10:26:11] I am applying https://gerrit.wikimedia.org/r/c/operations/puppet/+/976158, which should solve the monitoring issue
[10:26:20] great!
[10:27:15] or wait
[10:27:26] maybe I should stop hardcoding such a variable
[10:55:26] jynus: am I able to progress with migrating the backup roles?
[10:55:42] jbond: I am fixing the last issue right now
[10:55:53] jynus: ack, thanks
[10:56:04] marostegui: any db stuff I can migrate, roles or hosts?
[10:56:31] jbond: No, I need time to check stuff
[10:56:47] marostegui: ack
[11:09:06] jbond: not sure if important, but I got a 500 error on cumin2002 while running puppet
[11:09:53] I'll double check, but it's probably transient
[11:10:12] yeah, not worried about that
[11:10:26] just in case there was still tuning needed for load or something
[11:10:53] jynus: I think the load is good now, but we have an issue when puppet-merge runs
[11:11:07] ah, interesting
[11:11:34] knowing that is already useful to me
[11:20:14] jynus: the task is https://phabricator.wikimedia.org/T350809 (just reopened it)
[11:21:09] thank you, again - not worried about it, just knowing it can happen and why is already useful
[11:58:50] https://phabricator.wikimedia.org/T351617#9348521
[13:21:11] See the recoveries ongoing now on -operations
[14:24:36] Hi folks, could I get a +1 to expand our envoy rollout to one more codfw node and one eqiad node, please? I aim to deploy tomorrow morning, assuming the existing envoy node behaves itself overnight. https://gerrit.wikimedia.org/r/c/operations/puppet/+/976229
[14:25:03] As well as the existing swift monitoring, you can see the new envoy graphs for the one codfw-swift node: https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-destination=All&var-origin=swift&var-origin_instance=All
[14:37:13] Emperor: done!
[14:46:18] Emperor: lgtm, but out of curiosity, why does it require a reimage?
[14:55:20] because the nginx puppetry doesn't have a present/absent parameter you can use to remove all the nginx puppet resources
[14:55:42] (so rather than trying to remove them all by hand, just reimage to start from a clean slate)
[14:56:54] ah, gotcha
[15:00:01] urandom: you may be the wrong person to ask, but it came up in a CR you sent my way, so: why is profile::installserver::preseed::preseed_per_hostname: set in hieradata/role/common/apt_repo.yaml?
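The question just above comes down to how role-scoped hiera data is resolved: in this layout, a key set in hieradata/role/common/apt_repo.yaml is applied to hosts running role(apt_repo), which is why an installserver profile key living there looks surprising. The sketch below is a simplified, hypothetical Python illustration of that role-scoped lookup only; the real hierarchy has many more levels and is defined in the puppet repo's hiera configuration, and the role_lookup helper, the relative "hieradata" path, and the PyYAML usage are all invented for illustration.

    # Hypothetical, simplified model of role-scoped hiera lookup; not the real
    # hierarchy, which has more levels (host, site, common, ...).
    from pathlib import Path

    import yaml  # PyYAML

    HIERADATA = Path("hieradata")

    def role_lookup(key: str, role: str):
        """Return a key's value from hieradata/role/common/<role>.yaml, if present."""
        role_file = HIERADATA / "role" / "common" / f"{role}.yaml"
        if not role_file.exists():
            return None
        data = yaml.safe_load(role_file.read_text()) or {}
        return data.get(key)

    # With the key living in apt_repo.yaml, role(apt_repo) hosts pick it up from
    # that file; an installserver host would only see it via its own role file.
    key = "profile::installserver::preseed::preseed_per_hostname"
    print(role_lookup(key, "apt_repo"))       # the configured value
    print(role_lookup(key, "installserver"))  # None, unless also set in installserver.yaml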
[15:02:09] there was an email to ops@ about a week ago; netboot.cfg is now being generated, to make it less error-prone
[15:02:53] OK, but why in the apt_repo hiera file?
[15:03:04] oh, right... yeah, I wondered about that myself
[15:03:36] do the install server and the apt repo run on the same machine?
[15:04:00] that wouldn't necessarily justify it, but it might explain it?
[15:05:08] no - role(installserver) is install[12]004,3003,[456]002; apt_repo is apt[12]001 and apt1002
[15:06:08] brouberol: do you know why apt_repo.yaml was used for this hiera stuff rather than an installserver hiera file?
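Going back to the statistics reporting failure from the morning: as described at 10:09-10:13, the backups themselves were running, but the reporting step was still pointed at the old connection file, so it loaded no configuration and fell back to client defaults, producing the "Can't connect to MySQL server on 'localhost'" error. Below is a minimal, hypothetical Python sketch of that failure mode; the read_connection_options helper, the [client] section name, and the fallback values are illustrative, not the actual wmfbackups code.

    # Hypothetical sketch of the failure mode: reading connection options from an
    # ini-style .cnf file that is missing (or has no [client] section) silently
    # yields an empty dict, so the caller ends up with library defaults.
    import configparser

    DEFAULTS = {"host": "localhost", "port": 3306}  # typical MySQL client defaults

    def read_connection_options(path: str) -> dict:
        """Return the [client] options from an ini-style .cnf file, or {}."""
        parser = configparser.ConfigParser()
        parser.read(path)  # configparser silently ignores files it cannot open
        if not parser.has_section("client"):
            return {}
        return dict(parser.items("client"))

    # Using the stats_file path quoted in the log; when the configured path is
    # stale (the real connection file was moved or renamed), the result is empty,
    # the connection is attempted against localhost, and on a host with no local
    # MySQL server that shows up as "[Errno 111] Connection refused".
    options = {**DEFAULTS, **read_connection_options("/etc/wmfbackups/statistics.cnf")}
    print(options)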