[08:19:56] jayme, effie: same question as last time. There's an undeployed admin_ng change on dse-k8s-eqiad related to cert-manager-webhook-calico. Is that safe to deploy?
[08:19:59] Thanks!
[08:57:01] brouberol: hmm. AIUI that has been rolled out. Let me double check
[08:59:11] brouberol: ah, I see it has only been deployed to wikikube last week. Safe to deploy to dse
[08:59:23] gotcha, thanks
[08:59:44] I'll take care of ml and aux
[08:59:45] which reminds me, we still need to build the system that reminds us of unreleased diffs
[08:59:53] at some point
[08:59:57] yeah...
[09:00:06] re deployment: thanks!
[09:41:17] slyngs: would you mind reverting|fixing https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026790 ?
[09:41:30] vgutierrez: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1028774
[09:41:38] No I would not :-)
[09:42:13] I'm guessing that instance is ready to get requests, right?
[09:43:04] Ah, there's some firewalling. I'll just roll back and redo the patch correctly once that's fixed
[09:43:37] slyngs: yeah, please revert and merge with a valid URL as soon as the instance is ready to get requests AND cloudtestidm.wikimedia.org is on the SAN list of the instance certificate :)
[09:44:15] vgutierrez: Would you do a quick sanity check on the revert: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1028574
[09:44:18] please run PCC the next time :)
[09:44:25] Absolutely :-)
[09:47:45] vgutierrez: Rolled back, and Puppet works on the cp hosts again
[09:47:54] Sorry about that
[09:48:41] thx :)
[09:53:32] did anyone do anything on some cloudceph* hosts from cumin2002 a few minutes ago?
[09:54:15] dcaro: that should be available in the cumin logs
[09:54:23] it seems it broke the user@0.service (T364376)
[09:54:23] T364376: cloudcephosd: the service unit user@0.service is in failed status - https://phabricator.wikimedia.org/T364376
[09:54:25] looking
[09:56:55] cumin logs don't seem to have anything related
[10:00:36] dcaro: I've been installing glibc security updates on buster hosts and there are still some cloudceph ones among them
[10:00:51] but that only updates the glibc packages, I didn't restart anything or similar
[10:01:22] it's a weird failure on the user@0 service
[10:01:52] which host is that? I can have a look
[10:02:05] it does not seem to have any lasting effect (ceph is up, hosts seem ok, everything running as expected)
[10:02:12] but root failed to start a session at some point
[10:02:20] cloudcephosd1031 <- one of the 6 hosts affected
[10:03:53] ah, I think I know what that is, let me confirm
[10:07:05] no, it's not that, but it might be somewhat related: we had seen that on hosts with high I/O load (we apply a bandaid on the swift backends) systemd session creation sporadically fails; this got triggered by debdeploy in particular
[10:07:25] which is the ingestion tool that runs under a system user after updates
[10:07:28] https://phabricator.wikimedia.org/T199911
[10:08:06] in this case it affected the root user, though, so while it might have been triggered by the libc update, it's not the same issue
[10:08:34] did this only happen on the buster nodes, or also on some of the bullseye ones?
[10:44:35] looking (/me got in a meeting)
[10:44:54] some bullseye ones too
[10:45:08] cloudcephosd1031 is bullseye
[10:48:33] possibly the cloudceph hosts could use a similar toil class like in T199911, then; not sure. For the current servers it should be enough to simply run "systemctl reset-failed" on the failed units, there's no real impact otherwise
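A minimal sketch of the cleanup moritzm suggests, run on each affected host (e.g. cloudcephosd1031); the unit name comes from T364376, the rest is standard systemd:

```
# Check whether the unit is in the failed state (prints "failed" if so)
systemctl is-failed user@0.service
# Clear the failed state; this only resets systemd's bookkeeping for the
# unit, it does not start or stop anything
systemctl reset-failed user@0.service
# Verify: should now report something other than "failed" (e.g. "inactive")
systemctl is-failed user@0.service
```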
[10:48:34] T199911: Systemd session creation fails under I/O load - https://phabricator.wikimedia.org/T199911
[11:06:01] install7001 failed to be backed up tonight - checking why
[11:06:56] Fatal error: bsockcore.c:208 Unable to connect to Client: install7001.wikimedia.org-fd on install7001.wikimedia.org:9102. ERR=Connection refused
[11:07:12] so either a TCP or firewall issue (or the daemon not running)
[11:08:18] uf, I have 221 ms ping to that host
[11:13:46] jynus: install7001 was reinstalled this morning
[11:15:00] let me retry it then
[11:16:52] it worked now, so just bad timing
[11:19:25] ack
[11:25:53] moritzm: will go with the reset-failed for now, will consider the toil class if it happens again/often, thanks!
[11:26:11] sounds good
[12:57:23] hello on-callers! As an FYI, in a bit Empe*ror and I are going to move ms-fe1009's TLS cert to PKI. We are going to depool, deploy, check etc., but since this is Swift I just wanted to mention it
[12:57:42] (sorry, in a bit == in a couple of hours)
[12:58:18] More info in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026927 if you want to chime in
[13:03:28] or run away :-P
[13:03:28] elukey: for how low are you leaving the ms-fe cluster in this split state?
[13:03:37] *how long
[13:09:13] vgutierrez: I hope for one/two days max; then the plan would be to move either eqiad or codfw entirely, wait a bit, and complete. Does that sound ok? Otherwise I am open to suggestions
[13:20:48] elukey: sounds good, just take a look at the ATS backends dashboard and check that it doesn't impact connection reuse
[13:21:50] that would be https://grafana.wikimedia.org/goto/8N1sK5LIR?orgId=1
[13:23:26] vgutierrez: ack! Are you worried about something in particular?
[13:24:07] That for some reason ATS would treat it as a different origin server
[13:24:34] And refuse to reuse connections against ms-fe1009
[13:28:00] ah wow okok
[14:56:59] elukey: I'm around as and when you want to work on ms-fe1009
[14:57:08] Emperor: o/
[14:57:12] just depooled ms-fe1009
[15:00:48] Cool. I'm not expecting drama, just here in case needed :)
[15:01:20] Emperor: anything I'd need to check to verify traffic is drained? Or can I proceed?
[15:02:46] elukey: go ahead (there are a few options, but "top" is the quickest and dirtiest)
[15:04:03] all right, running puppet now
[15:06:26] done! Envoy is up and running with the new cert
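For reference, a sketch of the kind of s_client check described in the next messages; the host FQDN and SNI name here are assumptions based on the conversation, not a confirmed invocation:

```
# Connect to the TLS terminator on ms-fe1009 (FQDN assumed), using one of the
# expected SAN names as SNI, then print the certificate's SAN list
echo | openssl s_client -connect ms-fe1009.eqiad.wmnet:443 \
        -servername swift.svc.eqiad.wmnet 2>/dev/null \
    | openssl x509 -noout -text \
    | grep -A1 'Subject Alternative Name'
```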
[15:06:36] I checked with openssl's s_client and it looks good
[15:06:47] X509v3 Subject Alternative Name:
[15:06:48] DNS:swift-rw.discovery.wmnet, DNS:upload.wikimedia.org, DNS:swift.svc.eqiad.wmnet, DNS:ms-fe.svc.eqiad.wmnet, DNS:swift.discovery.wmnet, DNS:swift-ro.discovery.wmnet
[15:07:54] not sure if there is anything else to verify on the new config; otherwise we can repool
[15:08:58] Emperor: --^
[15:09:16] just looking myself
[15:09:22] super, thanks
[15:09:36] we've changed the CN, I see, from swift_eqiad to swift.discovery.wmnet
[15:10:07] and the signature type from sha256WithRSAEncryption to ecdsa-with-SHA512
[15:10:07] yes, in theory it shouldn't matter; swift_eqiad was the name of the cergen config
[15:10:32] yep yep, the latter is better/more secure
[15:10:50] it is a default for discovery IIRC, we use it elsewhere
[15:11:31] looks good to me, probably time to repool and look for 🔥
[15:12:15] super, doing it
[15:13:15] {done}
[15:15:46] Emperor: I see a lot of TCP connections in ESTABLISHED for port :443, coming from k8s and cpXXXX nodes
[15:16:11] not an indication that we are 100% good, but it is a good sign :D
[15:17:01] nothing pops up in https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&var-site=All&var-cluster=upload&var-origin=swift.discovery.wmnet&from=now-3h&to=now either
[15:17:15] I'd say we are good :)
[15:18:17] huh, grafana is saying 503 to me
[15:19:02] right, it's back, and yes that looks OK to me
[15:19:05] thanks!
[15:19:43] thank you for the help! I'll file patches to move the rest of eqiad and then codfw later on
[15:19:59] we can probably do the rest of eqiad on Thursday?
[15:20:53] 👍
[16:29:03] jhathaway: can you visit https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026682 sometime today? I'm still putting out the periodic fires that that change is meant to prevent.
[16:29:31] andrewbogott: yup, sorry for the delay
[16:29:45] no worries! I'm not 100% sure that my comments make sense anyway
[17:31:16] does anyone know if production pybal has any nodes outside of the 'wmnet' or 'wikimedia.org' domains?
[17:33:35] you mean backend nodes, not incoming public services, right?
[17:34:45] bblack: correct. Context is T363702, just wanna make sure I get all the backend nodes
[17:34:46] T363702: LVS hosts: Monitor/alert on when pooled nodes are outside broadcast domain - https://phabricator.wikimedia.org/T363702
[17:36:59] I can't think of any, but I don't see why we'd arbitrarily restrict/filter, either
[17:39:30] Agreed. Once T364037 is done, we should be able to stop caring about domains
[17:39:31] T364037: Investigate why pools.json does not match https://config-master.wikimedia.org/pybal/${datacenter}/${service} T363702 - https://phabricator.wikimedia.org/T364037
[17:40:17] Technically we could read the config straight from etcd, but that seems expensive compared to config-master
[18:49:38] anyone know if acmechief can be configured to supply its certs in a specific order, ideally private key before public key?
[18:50:04] ?
[18:50:15] What do you mean?
[18:50:34] You should get all the TLS material on the same puppet run
[18:51:07] I want to use postfix's smtpd_tls_chain_files setting, https://www.postfix.org/postconf.5.html#:~:text=smtpd_tls_chain_files%20(default%3A%20empty)
[18:51:25] which insists that the private key must come before the public key in the file
[18:51:33] Ugh
[18:51:52] indeed
[18:52:19] We provide .chained.crt.key but you cannot pick the order inside the file
[18:52:37] I assumed so, but thought I would ask
[18:52:47] perhaps I'll just use a single cert for now
[18:53:03] and put it on the old endless TODO list
[18:53:35] Open a task and tag acme-chief please, and I'll take a look
[18:53:45] vgutierrez: will do
[18:53:58] Cheers
[19:14:04] jhathaway: puppet's `concat` should be able to make a file like what you need
[19:14:32] hmm, true, I guess I could use the existing files on disk as sources
[19:15:12] can't you use acmechief as the source and concat private+cert into the target file?
[19:29:54] Yeah
[19:30:07] That would definitely be faster
[19:34:45] vgutierrez: would I always request the live route?
[19:48:21] new & live always point to the same dir, so I suppose that doesn't matter
[19:49:40] I wonder how hard it is to spin up a test acmechief instance pointed to a local pebble server
[20:24:45] Hello team, I just migrated the logstash certificates to CFSSL. Things seem to be working well from my side, but since it is a mission-critical service I'd greatly appreciate it if someone else could confirm that the dashboards load after the changes: logstash.wikimedia.org
[20:25:09] stuff loads for me
[20:25:17] zabe: Thanks for taking a look!
[20:26:31] yw
[21:26:06] jhathaway: we use pebble for acmechief tests
[21:26:35] And yes.. new and live point to the same directory after the staging time has passed
[21:26:55] nod, I would like to get acmechief wired up inside of dcl, so I could test this more easily
[21:27:16] but I'm not sure how much effort it would take to get acmechief + pebble + dns working correctly
[21:27:33] first, make a puppetized install of pebble so you can run it all inside dcl ;)
[21:28:00] yeah, that would be the goal
[21:28:40] I wonder why you're going down that route instead of what volans suggested though :)
[21:28:54] I guess some of us enjoy some pain
[21:30:34] vgutierrez: I am going to try volans' suggestion, but it would be nice to be able to test the puppet code
[21:32:23] acmechief test instances issue the same set of certificates against the LE staging environment
[21:32:38] Maybe that could be useful
[21:33:50] hmm, that might be; how do I reach a test instance?
[21:36:47] We've got a hiera variable that points to the acmechief instance that's going to be used to retrieve the TLS material
[21:37:05] It just needs to be set to the FQDN of the test one
[21:38:35] is the test one available publicly? Can I hit its endpoint from WMCS or my laptop?
[21:43:41] nope
[21:43:50] It's in the production realm
[21:45:18] But maybe it's easier for you to set up something like that.. an acmechief instance against the LE staging environment rather than mocking LE entirely with pebble
[21:46:32] Anyways.. we could follow this up asynchronously on some task. E_TOOLATE here
[21:48:07] yup, sorry, night
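Circling back to the smtpd_tls_chain_files discussion above: a sketch, in shell form, of the concat approach volans suggested (a puppet `concat` resource would do the same declaratively); the acme-chief source paths here are hypothetical placeholders, not the real on-disk layout:

```
# Build a combined PEM with the private key FIRST, as smtpd_tls_chain_files
# requires; the source paths below are hypothetical acme-chief material
cat /etc/acmecerts/mail/live/rsa-2048.key \
    /etc/acmecerts/mail/live/rsa-2048.chained.crt \
    > /etc/postfix/smtpd-chain.pem
chmod 0600 /etc/postfix/smtpd-chain.pem
# Point postfix at the combined file and reload
postconf -e 'smtpd_tls_chain_files = /etc/postfix/smtpd-chain.pem'
systemctl reload postfix
```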
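And on the pebble side: upstream pebble can be run standalone for local ACME testing; a hedged sketch based on its upstream README (flags and paths may differ by version, and this is not WMF's acme-chief test setup):

```
# From a checkout of https://github.com/letsencrypt/pebble
go run ./cmd/pebble -config ./test/config/pebble-config.json
# The ACME directory then answers at https://localhost:14000/dir; pebble mints
# its own server cert, so relax TLS verification when poking at it:
curl -sk https://localhost:14000/dir
```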