[07:00:01] hello folks
[07:00:07] reimaging kubernetes1012
[07:09:33] my plan is to do 1012/1013 today, then 1014/1017 tomorrow
[07:09:51] (that should complete the reimages)
[07:09:58] (excluding the control plane nodes)
[07:17:32] <_joe_> great
[07:17:47] <_joe_> the kube masters are still on stretch?
[07:23:33] buster IIRC
[07:43:03] 1012 done
[08:12:49] started with 1013
[08:30:19] masters are on buster, but ml is already experimenting with masters on bullseye. I would assume there are no big issues hidden there
[08:32:27] <_joe_> yeah me too, and it's also less urgent ofc
[08:33:23] true. But I would really like having everything on the same base (kernel + docker wise especially)
[08:37:33] the partman recipe is already configured, and we don't need to change the vm's virtual disks since we are already using overlayfs in there, and the kube-api packages are already on bullseye
[08:37:53] so it should just be an in place reimage for all control plane nodes
[08:43:02] <_joe_> oh sure I was not suggesting not to do it
[08:47:31] understood :)
[08:55:22] 1013 done, 2 nodes left
[08:55:27] * elukey sees the light at the end of the tunnel
[09:55:01] jayme: since I am very close to finish, I'd do kubernetes1014 and 1017 today as well. Anything against it? Too much?
[09:59:35] no objections. Pull through I'd say :)
[10:10:09] started 1014 :)
[10:45:28] 1014 done, 1017 started
[10:45:30] last oneeee
[11:10:42] _joe_: can you kind of confirm rendering.svc.*.w is dead? (https://phabricator.wikimedia.org/T304237#7790839) ..from what I found it was maybe the predecessor of thumbor
[11:10:56] <_joe_> I deny everything
[11:11:17] <_joe_> I will not be held accountable for the horror that the imagescalers were
[11:11:23] <_joe_> but yes, it's gone
[11:11:25] <_joe_> for years :)
[11:12:24] ack. Thanks :)
[11:29:18] aaand all workers on bullseye!
[11:29:29] \o/
[11:30:01] serviceops, Prod-Kubernetes, Kubernetes, Machine-Learning-Team (Active Tasks), Patch-For-Review: Move kubernetes workers to bullseye and docker to overlayfs - https://phabricator.wikimedia.org/T300744 (elukey)
[11:30:28] yay 🎉
[11:31:05] great! Thank you so much elukey
[11:31:57] 👏👏👏
[11:35:45] jayme: one last nit, probably not important, but I saw alerts.w.o firing on IRC for the calico pod not running on kubernetes1014, but afaics everything is good and I don't see the alarm anymore on the UI
[11:35:56] (I didn't see the recovery for it)
[11:36:38] when you have a moment can you double check that everything is ok? (just to be paranoid)
[11:37:30] yeah, I was looking as well .. there was a recovery right before the alert fired again
[11:38:05] that's not correct. Actually some minutes before ..hmm
[11:38:12] I'll double check the node
[11:39:50] thanks a lot
[11:40:03] I need to run afk in a bit, lemme know if anything pops up
[11:40:10] wilco
[11:46:13] I think it's fine
[12:05:11] serviceops, Infrastructure-Foundations, SRE, Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (JMeybohm) >>! In T304237#7790839, @Volans wrote: > ` > root@puppetmaster1001:~# for file in $(ls /var/lib/puppet/server/ssl/ca/signe...
[15:14:28] cleanup patch for k8s - https://gerrit.wikimedia.org/r/c/operations/puppet/+/773520
[15:26:50] serviceops, Prod-Kubernetes, Kubernetes: High API server request latencies (LIST) - https://phabricator.wikimedia.org/T303184 (JMeybohm)
[15:28:55] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Implement POC for istio ingress - https://phabricator.wikimedia.org/T290966 (JMeybohm)
[15:28:58] serviceops, Prod-Kubernetes, Kubernetes: High API server request latencies (LIST) - https://phabricator.wikimedia.org/T303184 (JMeybohm) Resolved→Open Reopen as this is not resolved and will most likely hit us again until we update istio/cert-manager/kubernetes
[15:29:04] serviceops, Prod-Kubernetes, Kubernetes: High API server request latencies (LIST) - https://phabricator.wikimedia.org/T303184 (JMeybohm) p:Triage→Medium
[15:34:40] serviceops, Security-Team, GitLab (CI & Job Runners), Patch-For-Review, and 2 others: Setup GitLab Runner in trusted environment - https://phabricator.wikimedia.org/T295481 (Jelto)
[15:36:46] hi, someone familiar with mediawiki k8s here? I have a backup problem: https://gerrit.wikimedia.org/r/c/operations/puppet/+/773559
[15:39:33] in a meeting currently, can take a look in ~30m
[15:39:39] thanks!
[16:05:20] jynus: commented on CR
[16:07:29] sadly, that wouldn't work
[16:09:35] serviceops, GitLab (Infrastructure): GitLab minor version upgrade: 14.9.x - https://phabricator.wikimedia.org/T304622 (Jelto)
[16:09:42] hm, then I guess I did not understand what the problem is
[16:09:44] serviceops, GitLab (Infrastructure): GitLab minor version upgrade: 14.9.x - https://phabricator.wikimedia.org/T304622 (Jelto) p:Triage→Medium
[16:10:13] <_joe_> jynus: ah sorry that's on me
[16:10:23] jayme: the problem is: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=backup1001&service=puppet+last+run
[16:10:26] <_joe_> i was used to just needing to add the definition in one place
[16:10:33] <_joe_> and btw
[16:10:43] <_joe_> the first puppet run went fine after the change...
[16:10:52] yeah, it is through a resource
[16:11:07] <_joe_> yeah i realize that now
[16:11:10] that breaks :-) as it is a missing reference to a non defined fileset
[16:11:19] <_joe_> jynus: it's ok if i fix it?
[16:11:19] can we just comment it for now?
[16:11:24] if you want it, sure
[16:11:26] <_joe_> it will take ~ 20 minutes
[16:11:44] yeah, I was only proposing this as I am not comfortable with that codebase
[16:11:58] so I tried to give a quick temporary fix
[16:12:09] as I thought that was code not into proper production yet
[16:12:54] I think I do get what the error is, but not why removing slashes from the resource name does not fix it
[16:13:14] jayme: the parameter there must be an identifier, not a path
[16:13:30] it fails because such an identifier is not defined
[16:13:44] and that identifier has to be defined somewhere else?
[16:13:47] yes
[16:13:54] that was what I was missing :)
[16:13:56] thanks
[16:14:02] it says backup using config X
[16:14:14] but config X didn't exist :-)
[16:14:52] yeah..makes total sense. I was wondering (from other usages in puppet) where it would get context from (like the path to backup)
[16:15:12] a fileset is a list of included and excluded dirs
[16:15:26] yeah
[16:15:52] yes
[16:17:03] summary of the 3 steps to setup a new backup is at: https://wikitech.wikimedia.org/wiki/Bacula#Adding_a_new_client (setup a fileset (or use an existing one), add the profile for the software, and the backup set for the definition (collected through a resource))
[16:21:01] ack
[16:21:45] <_joe_> jynus: https://gerrit.wikimedia.org/r/c/operations/puppet/+/773581 should fix our problem
[16:29:49] <_joe_> jynus: patch merged, can you check if this unbreaks your situation?
[16:30:05] <_joe_> if not just merge that commenting out for now, I have a meeting
[16:30:35] thank you!
[16:39:46] Notice: Applied catalog in 21.53 seconds with no errors, thank you for your help!
[17:05:20] serviceops, SRE, Traffic, envoy, Patch-For-Review: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (RLazarus) In progress→Resolved
[17:05:28] serviceops, SRE, Traffic, envoy, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus)
[17:12:02] serviceops, Analytics, Data-Engineering, Event-Platform, and 2 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (BTullis)
[17:17:40] serviceops, Data-Engineering-Radar, MW-on-K8s: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (EChetty)
[17:20:40] serviceops, Data-Engineering-Radar, MW-on-K8s: IPInfo MediaWiki extension depends on presence of maxmind db in the container/host - https://phabricator.wikimedia.org/T288375 (BTullis) Could we deploy the GeoIP databases to the kube-workers and then mount it to the mw pods as a readonly hostpath volum...
[17:29:51] hey folks, reporting in here as well
[17:30:00] there seems to be a big pack of nodes failing puppet
[17:30:11] the Exec verify envoy config fails
[17:30:16] could it be https://gerrit.wikimedia.org/r/c/operations/puppet/+/773364 ?
[17:30:35] I see the following running the check manually
[17:30:37] [2022-03-24 17:26:55.856][30486][critical][main] [source/server/config_validation/server.cc:62] error initializing configuration '/tmp/.envoyconfig/envoy.yaml': Proto constraint validation failed (field: "upstream_protocol_options", reason: is required): common_http_protocol_options {
[17:30:42] idle_timeout {
[17:30:45] seconds: 4
[17:30:47] }
[17:30:50] }
[17:34:06] jayme, _joe_ --^
[17:34:11] and rzl
[17:34:37] <_joe_> rzl mostly
[17:35:41] fyi puppet is running successfully so the widespread puppet alert will clear, but the bad config is still in place, it's just that verify-envoy-config failed
[17:36:27] <_joe_> yes
[17:36:38] in place in the staging area that is, the running config should still be good
[17:36:43] <_joe_> so given that failed, the config is still good
[17:36:54] yep this was my impression as well
[17:37:00] <_joe_> yes thankfully we have protections in place
[17:37:03] no immediate issue but not great nonetheless
[17:40:27] yes agree
[17:42:12] <_joe_> so the good thing is that even if envoy restarts, the good config will still be used
[17:42:23] <_joe_> so we are actually safe
[17:42:44] <_joe_> I'd say we wait for rzl to take a look, maybe he has a quick solution
[17:44:46] yes it's all safe. however it's worth noting that the only current indicator for this is the widespread puppet failures alert which will clear after ~30 mins so could go unnoticed.
[17:45:34] would almost be better to have puppet fail on every run, but i don't think the validate_cmd functionality is good enough as it can't handle config fragments/includes
[17:46:07] as everything ultimately ends up in one file, concat may actually be a good use case here
[17:46:35] <_joe_> jbond: we can just save a state file every time build-envoy runs
[18:01:17] back sorry -- yeah that's me, will roll back
[18:04:20] revert is https://gerrit.wikimedia.org/r/c/operations/puppet/+/773532, will merge as soon as jenkins passes
[18:05:45] must have just missed you before leaving a comment :)
[18:06:33] back from vacation. now going through backlog of mails and phab and all that
[18:06:34] part of why I let my guard down is that we use that field in the tests for build_envoy_config.py -- I expected the tests to fail
[18:07:05] ohh but I guess we don't have envoy installed in the CI environment so the actual validation gets mocked out :(
[18:07:11] mutante: wb!
[18:07:19] thanks Reuven
[18:08:30] serviceops, SRE, Traffic, envoy, Patch-For-Review: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (RLazarus) Resolved→In progress
[18:36:01] oops, the "good" test configs haven't validated in a while either
[18:36:11] I'll fix that before rolling forward
[21:26:33] serviceops, SRE, envoy: Better automated validation of Puppet-generated Envoy configs - https://phabricator.wikimedia.org/T304660 (RLazarus) p:Triage→Medium
[21:55:07] serviceops, Infrastructure-Foundations, SRE, Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (Dzahn)
[22:12:43] mutante: ^ did you mean to check that off, on T304237? I don't think those reverts are done
[22:14:57] rzl: Yea, i did that because I followed the link given and it was merged
[22:15:20] ah - no, the item is to *revert* those :)
[22:15:47] oh, i'm sorry. reverting my edit
[22:15:50] we merged them at the time but it's a temporary fix, we opened the task to make sure we didn't leave it forever
[22:16:02] no worries! thanks
[22:16:17] somehow interpreted that as what needs to be merged. ack
[22:16:34] serviceops, Infrastructure-Foundations, SRE, Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (Dzahn)
[23:46:30] serviceops, SRE, Znuny, Patch-For-Review: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (Dzahn) @Arnoldokoth Are you already aware of this change?