[07:42:21] 10serviceops, 10serviceops-collab, 10Release-Engineering-Team (Radar): give releng access to logs to debug buildkit-to-wmf-registry publishing - https://phabricator.wikimedia.org/T322579 (10Joe) >>! In T322579#8377279, @dduvall wrote: > Thanks for filing this! > > This is what would be helpful for us in deb...
[08:12:21] headsup: I'm going to temporarily (less than 30 mins each) sequentially switch kubetcd100[46] to DRBD to migrate them off their current nodes (for the Bullseye reimages)
[08:16:35] moritzm: thanks. have fun!
[08:20:17] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Deploy new mw-debug service - https://phabricator.wikimedia.org/T321201 (10JMeybohm) >>! In T321201#8366250, @Clement_Goubert wrote: > [...] > @JMeybohm @Joe @akosiaris If this seems like the right way, I will start writing the "Kubernetes/Remove_a_service" wiki...
[08:53:10] Heya
[08:54:27] jayme: It did, although I had to run it twice for codfw, for some reason. Maybe a helmfile sync would have helped.
[08:54:49] hm, that's odd
[08:54:57] so the first run did not change anything?
[08:55:16] _joe_: jayme: https://gerrit.wikimedia.org/r/c/operations/puppet/+/854059 < Wondering about this approach or just marking the service directory recurse+force
[08:56:51] jayme: COMBINED OUTPUT:
[08:56:53] Error: UPGRADE FAILED: an error occurred while rolling back the release. original upgrade error: failed to refresh resource information: Get "https://kubemaster.svc.codfw.wmnet:6443/apis/rbac.authorization.k8s.io/v1/namespaces/eventgate-analytics/rolebindings/deploy": dial tcp 10.2.1.8:6443: connect: connection refused: no Namespace with the name "mwdebug" found
[08:57:35] May have just been a blip though
[08:57:51] connection refused sounds like a blip indeed
[08:57:54] <_joe_> claime: I guess related to moritzm's work
[08:58:09] _joe_: That was last night
[08:58:29] <_joe_> ah wait
[08:58:43] <_joe_> no namespace with the name "mwdebug" found
[08:58:56] <_joe_> that's before you removed it right
[08:59:10] <_joe_> I thought you were trying to apply something now
[08:59:53] _joe_: It was removed from staging-codfw and staging-eqiad
[09:00:06] I was trying to remove it from codfw
[09:00:48] <_joe_> then yeah, network blip I guess
[09:01:01] <_joe_> or kube master blip
[09:01:25] That's what I figured because a retry 2 minutes later worked
[09:03:02] <_joe_> https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&var-datasource=thanos&var-site=codfw&var-cluster=k8s&from=now-24h&to=now&viewPanel=6 something happened around 6 pm
[09:03:33] <_joe_> 6 pm our TZ I mean
[09:03:49] <_joe_> err 5 pm, DST is over
[09:05:09] not completely uncommon. That pattern is seen on changes in primary scheduler/kube-controller-manager
[09:05:24] That was right around when I did the applies, 1600UTC
[09:05:37] (which might be the result of the blip)
[09:06:02] <_joe_> brb
[09:16:26] kubetcd100[46] are back on "plain" disks
[09:34:06] 10serviceops, 10Dumps-Generation, 10SRE, 10MW-1.39-notes, and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10ArielGlenn)
[09:47:51] effie: Sorry I was wrong yesterday about T322360, it was incident related, but not to the swift one we discussed yesterday
[09:48:35] alright
[09:58:04] 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Priority Backlog 📥): WMF container registry does not accept a manifest list (aka OCI manifest index, or "fat" manifest) - https://phabricator.wikimedia.org/T322453 (10JMeybohm) I took a quick look and AIUI our registry does support `appl...
[10:05:41] 10serviceops, 10MW-on-K8s, 10Observability-Logging, 10SRE: Keep calculating latencies for MediaWiki requests that happen k8s - https://phabricator.wikimedia.org/T276095 (10Joe) After a discussion with @fgiunchedi - given we're going to stream apache logs in json format to kafka, we can just use benthos to...
[11:19:46] 10serviceops, 10SRE, 10Thumbor, 10Security: Filter potentially harmful PostScript commands in Commons upload/thumbor - https://phabricator.wikimedia.org/T210833 (10jijiki)
[11:20:59] 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Roll out remote-DC gutter pool for /*/mw-wan/ - https://phabricator.wikimedia.org/T258779 (10jijiki) p:05Triage→03High
[11:44:30] <_joe_> hnowlan, jayme https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/837495 has all your comments addressed as far as I can tell
[14:01:18] elukey: calico docs are a pain sometimes... https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/854520
[14:02:35] I wonder if we should move to the operator at some point as the manifest installation is kind of unsupported it seems
[14:06:19] jayme: o/ I see some diffs for admin though, are they expected? (outside fixtures)
[14:08:06] hmm, no. That's unexpected
[14:08:45] maybe the version pinning thing does not work in CI...
[14:10:18] at least we have not yet silently upgraded calico :)
[14:15:25] jayme: Mind if I disable puppet on deploy hosts so I can merge this https://gerrit.wikimedia.org/r/c/operations/puppet/+/854059 as safely as possible?
[14:15:34] Actually I'll wait until the end of the deploy window
[14:15:49] yeah, looks like the version pinning does not work in CI. hmpf
[14:16:10] claime: yeah, fine by me after the window
[14:16:15] ack
[14:17:47] ...there are no words for how eager I am to go down the deployment-charts CI rabbithole (again)
[14:18:14] You know you want to :p
[14:19:34] you better prepare as someone has to review the mess I'm about to make :-)
[14:20:02] 10serviceops, 10Shellbox, 10SyntaxHighlight: Install pygments in Shellbox container with pip, not a Debian package - https://phabricator.wikimedia.org/T320848 (10akosiaris) @Legoktm, technically speaking, are we talking about `pip install --no-binary pygments` ? Not sure this is supported in blubber, at leas...
[14:20:03] * claime braces
[14:33:47] 10serviceops, 10MW-on-K8s, 10Observability-Logging, 10SRE: Keep calculating latencies for MediaWiki requests in the WikiKube environment - https://phabricator.wikimedia.org/T276095 (10akosiaris)
[14:43:42] <_joe_> jayme: what doesn't work?
[14:44:20] _joe_: pinning of specific chart versions in admin_ng - something like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/838134
[14:44:33] so https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/838137/1 should not produce a diff
[14:44:42] (for admin_ng deployments)
[14:45:06] <_joe_> ugh
[14:45:12] <_joe_> you're templating out charts
[14:45:16] <_joe_> yes ofc it won't work
[14:45:24] <_joe_> we pattern-extract the chart IIRC
[14:45:50] I'm templating in versions - not templating out charts
[14:45:52] <_joe_> uhm
[14:45:57] <_joe_> actually it should work
[14:46:05] <_joe_> what "doesn't work" then?
[14:46:08] yes. I do think it should
[14:46:46] if you look at the output of https://integration.wikimedia.org/ci/job/helm-lint/8245/console
[14:46:58] there should not be a diff for admin AIUI
[14:47:08] (and there is none in prod)
[14:47:08] <_joe_> there is none in the gate and submit job
[14:47:18] <_joe_> https://integration.wikimedia.org/ci/job/helm-lint/8190/console
[14:47:22] <_joe_> it's a rebase issue
[14:48:14] <_joe_> uh wait
[14:48:25] <_joe_> that job you linked is not for the change you just showed me
[14:48:35] <_joe_> ah the one after
[14:48:50] that is correct and I was not implying so :)
[14:49:32] <_joe_> why are you committing the tgz of the chart to the repo, btw?
[14:49:43] veeerry different story
[14:49:59] because https://gerrit.wikimedia.org/r/c/operations/docker-images/docker-report/+/826859
[14:50:39] <_joe_> ok, why not merge that first?
[14:50:51] <_joe_> ah the current chart is broken?
[14:50:59] yes, kinda
[14:51:03] <_joe_> can't we just remove it from chartmuseum
[14:51:47] as said, different story. I'm not sure (it was some time ago) if CI is even ready for dependencies
[14:52:29] because we use the local versions, e.g. we would need to rewrite dependencies from https URIs to file URIs etc
[14:53:01] pushing the tar was the easy way around at the time
[14:53:21] <_joe_> I don't have the headspace to dive into this, but I'm pretty convinced that tar is the source of the issue somewhat
[14:53:37] <_joe_> there's 5 levels of helm chart caching involved
[14:55:01] I'm not so sure. The tar contains just the CRDs. If that were to create a strange diff, it would be a diff in just the CRDs
[14:55:49] there are other changes (that don't include a tar) that also create this kind of diff (like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/854520)
[14:57:41] <_joe_> so, this seems simpler to check
[14:57:42] I guess the clue is somewhere in the patching charts area
[14:58:11] rephrase: patching the "chart:" key in helmfiles
[14:58:31] <_joe_> sorry, what is wrong with this second change?
[14:58:38] <_joe_> the diff seems reasonable to me
[14:58:57] <_joe_> you're bumping the calico chart
[14:59:12] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/854520 you mean? Same thing. calico and calico-crds are pinned the same way for admin deployments
[14:59:16] <_joe_> and the diffs seem a bit awkward as in it's mixing stuff
[14:59:33] <_joe_> what is changing and shouldn't?
[14:59:49] there should not be a diff in admin
[14:59:49] <_joe_> the diffs I see in the console log seem to check out with the changes you made
[14:59:55] <_joe_> uh
[15:00:08] because the version is pinned
[15:00:11] <_joe_> lol ok, this is your fault actually
[15:00:21] yes, it probably is :)
[15:00:24] <_joe_> you did change behaviour to always use the local chart in the path
[15:00:32] <_joe_> and not a remote one
[15:00:37] yes ...said so a minute ago :)
[15:00:39] content.gsub!(%r{^(\s*chart:\s+["']{0,1})wmf-stable/}, "\\1#{charts_dir.chomp('/').concat('/')}")
[15:00:41] <_joe_> you can't have both
[15:01:11] <_joe_> else you might have to have two versions of the chart in our charts repo
[15:01:20] <_joe_> or, you can check if there is a pinned version
[15:01:24] <_joe_> and skip the gsub
[15:01:30] <_joe_> that's probably easiest?
[15:01:58] IIRC the gsub was invented for services
[15:02:10] maybe it's not wise to reuse it for admin_ng...
[15:02:15] not sure
[15:02:19] 10serviceops, 10Arc-Lamp, 10Performance-Team (Radar): Expand RAM on arclamp hosts and move them to baremetal - https://phabricator.wikimedia.org/T316223 (10akosiaris)
[15:02:21] <_joe_> no it was invented so that we would never miss a chart change in our diffs
[15:02:32] <_joe_> which applies to admin too
[15:02:40] hm..indeed
[15:02:50] <_joe_> and the issue of the version pinning would be there in services as well one day
[15:02:56] <_joe_> I can look into this later
[15:03:06] <_joe_> as in, in like half an hour
[15:03:09] funny/sad that helmfile does simply ignore the "version:" parameter in that case
[15:03:25] <_joe_> "hey I'll get what I find, sorry"
[15:03:53] yeah...I think it should probably fail then
[15:04:51] <_joe_> as a once brilliant engineer turned excel wrangler once said, helm is "gen 0 tooling"
[15:04:53] <_joe_> and it shows
[15:05:39] 10serviceops, 10Service-Architecture: Standards and health score for existing and new services - https://phabricator.wikimedia.org/T88643 (10jijiki) 05Open→03Declined Given that we are actively doing adding SLOs to services (which is indeed addressing some things as already stated), I am bluntly closing th...
[15:07:26] <_joe_> damn you and your templating version, it's harder than I hoped :P
[15:07:36] yeah...
[15:07:47] <_joe_> I could in theory just get "version: ..." and that would be ok
[15:08:11] but there might be multiple releases
[15:08:47] <_joe_> I'm not saying you shouldn't do this
[15:09:08] I did not understand that...also I kind of have to :)
[15:10:37] <_joe_> I'm just cursing how complex this becomes
[15:10:43] I would also say it's mainly Lucas fault because he noticed :D
[15:11:58] <_joe_> nope, it's fully up your alley dude
[15:12:25] worth a try 🤷
[15:16:49] as the helmfile is not valid YAML at patch point..could we "helmfile build" it, then patch, then lint/template?
[15:17:27] in that case we could at least properly parse it as it's YAML (at least I think it is)
[15:18:28] 10serviceops, 10Observability-Metrics, 10SRE, 10Patch-For-Review: Strongswan Icinga check: do not report issues about depooled hosts - https://phabricator.wikimedia.org/T148976 (10jijiki) 05Open→03Declined Strongswan is going away because we do not need it anymore. We were using it for redis_sessions T...
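A minimal sketch of the "check if there is a pinned version and skip the gsub" idea from the exchange above, assuming a helper that receives the raw helmfile text and the local charts directory. The method name, guard regex and structure are illustrative, not the actual deployment-charts CI code:

```ruby
# Minimal sketch, not the real CI code: keep the existing chart-path rewrite,
# but leave a helmfile untouched when it pins a chart version, since helmfile
# silently ignores "version:" once the chart points at a local directory.
def patch_chart_paths(content, charts_dir)
  # Coarse, file-level guard: any "version:" line disables patching for the
  # whole file, which also skips unpinned releases in the same helmfile --
  # the "strange edge case" mentioned a bit further down.
  return content if content =~ /^\s*version:\s+\S/

  local_prefix = charts_dir.chomp('/').concat('/')
  content.gsub(%r{^(\s*chart:\s+["']{0,1})wmf-stable/}, "\\1#{local_prefix}")
end
```

The real code uses gsub! in place; the sketch returns a new string only to stay side-effect free.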
[15:19:01] <_joe_> jayme: yeah but sadly we don't know what values to feed it
[15:19:04] <_joe_> basically
[15:19:34] <_joe_> I'm looking specifically to admin, we don't even pass it the calico helmfile directly, so not sure it will even be patched
[15:19:57] helmfile_glob = File.join(dir, '**/helmfile*.yaml')
[15:20:36] AIUI that should do the trick for all helmfiles, no?
[15:21:03] <_joe_> ah, right
[15:21:10] <_joe_> ok, but when we run patch_helmfile
[15:21:32] <_joe_> do we know which values files to feed each helmfile?
[15:21:47] <_joe_> that's what I was trying to find and I think the answer is "no"
[15:21:52] <_joe_> so some refactoring will be needed
[15:24:04] that should be just passing environment down to patch_helmfile or am I wrong?
[15:25:12] <_joe_> I'm not sure it's enough
[15:25:38] <_joe_> ~/Code/WMF/operations/deployment-charts/helmfile.d/admin_ng/calico (master =)$ helmfile -e codfw build
[15:25:39] <_joe_> in ./helmfile.yaml: error during helmfile.yaml.part.0 parsing: template: stringTemplate:5:27: executing "stringTemplate" at <.Values.chartVersions>: map has no entry for key "chartVersions"
[15:25:57] <_joe_> just to make an example
[15:26:05] but that is not what happens
[15:26:16] in admin_ng we only use the "master" helmfile
[15:26:36] ~/Code/WMF/operations/deployment-charts/helmfile.d/admin_ng$ helmfile -e codfw build
[15:26:56] <_joe_> ok
[15:27:30] 10serviceops, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (10Vgutierrez)
[15:27:34] <_joe_> so how do you propose we find that a specific helmfile has a version then?
[15:28:40] <_joe_> given how we're patching dependent helmfiles
[15:30:02] <_joe_> tbh, I think if we just check for "version: {{ $version }}" and in that case don't patch the chart path
[15:30:15] <_joe_> will cover 99.9% of cases and be less error prone
[15:30:50] <_joe_> the complexity added by trying to do that heuristics correctly would probably mean we should just ditch how that whole thing works
[15:31:15] re: 99.9%: I agree. But it will also create a strange edge case
[15:31:25] <_joe_> which we can document?
[15:31:30] <_joe_> but yeah the alternative is
[15:31:41] <_joe_> we run helmfile -e build
[15:31:53] <_joe_> for each chart there, check if the version is pinned
[15:32:00] <_joe_> keep a list of such charts
[15:32:10] <_joe_> and then skip the patching for those
[15:33:21] I'm gonna merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/854059 so I'll be disabling puppet on deploy hosts for a bit
[15:33:42] Just so I can test safely on deploy2002
[15:34:23] <_joe_> ack
[15:34:38] <_joe_> claime: as long as the disable message has your signature, just do it
[15:34:46] ack
[15:34:49] <_joe_> if anyone will need puppet to run they will contact you
[15:35:21] <_joe_> jayme: I'm trying to take a stab at it
[15:35:41] _joe_: not sure the plan is correct
[15:37:19] <_joe_> jayme: why?
[15:38:42] 10serviceops: Add IRC SRE bot for SAL !log actions to #wikimedia-serviceops - https://phabricator.wikimedia.org/T213196 (10jijiki) 05Open→03Declined Given that there has not been any update in this task from #serviceops, I will bluntly close it. We'll reopen if there is interest :)
[15:38:56] if we "helmfile -e build" a helmfile, we're getting a complete artefact for that which we can YAML.safe_load. Then "for release in yaml.get('releases',[])" and patch the "chart:" key if the version is not pinned
[15:39:23] <_joe_> yes?
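A rough sketch of the "helmfile -e build, then inspect" half of the proposal above, assuming a hypothetical pinned_releases helper, Open3 for running helmfile, and that the build output parses as one or more YAML documents carrying a releases list; none of this is the real CI code. As the discussion goes on to point out, the build output is a state file and can't simply be fed back into helmfile template, so this covers only the detection of pinned releases:

```ruby
require 'open3'
require 'yaml'

# Rough sketch: render the merged helmfile state for one environment and list
# the names of releases that pin an explicit chart version.
def pinned_releases(helmfile_dir, env)
  out, status = Open3.capture2('helmfile', '-e', env, 'build', chdir: helmfile_dir)
  raise "helmfile build failed for #{env}" unless status.success?

  # `helmfile build` can emit several YAML documents (one per sub-helmfile),
  # so walk all of them and collect releases that carry a "version:" key.
  YAML.load_stream(out)
      .flat_map { |doc| (doc || {}).fetch('releases', []) }
      .select   { |release| release['version'] }
      .map      { |release| release['name'] }
end

# Hypothetical usage, mirroring the command shown above:
#   pinned_releases('helmfile.d/admin_ng', 'codfw')
```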
[15:39:30] <_joe_> (that's not what I said, btw)
[15:40:01] <_joe_> also I love that you write python and this is ruby lol
[15:40:22] then write the patched artifact back as helmfile.yaml and run "helmfile -e template -f helmfile.yaml"
[15:40:53] <_joe_> that's not what I said
[15:41:02] wasn't trying to say this is what you proposed. I think that would be enough and we don't need to keep a list or something
[15:41:18] <_joe_> the artifact is not a valid helmfile
[15:41:23] <_joe_> it's a helmfile status file
[15:41:29] fuu...it's not?
[15:41:33] <_joe_> I don't think you can feed it directly to template
[15:41:37] <_joe_> yeah, nope :P
[15:41:43] oh hell
[15:41:51] <_joe_> yes
[15:45:38] <_joe_> well turns out with some time to think it's simpler than we thought
[15:45:40] 10serviceops, 10Parsoid (Third-party): parsoid apt repo rolled back breaks updates - https://phabricator.wikimedia.org/T264546 (10jijiki) 05Open→03Declined This is for parsoidJS I reckon, which we have moved away from.
[15:46:21] it is?
[15:46:35] <_joe_> yes
[15:47:02] uh
[15:47:12] read the values files instead
[15:47:22] <_joe_> no
[15:47:29] <_joe_> how do you know what the values file is?
[15:47:39] <_joe_> you need to build helmfile first :P
[15:47:45] hrhr, indeed
[15:47:58] <_joe_> don't worry, I found a good way to wire this in
[15:48:39] sure, but I obviously want to know *now* :)
[15:49:20] <_joe_> oh simply @fixtures.values.each |env| helmfile build, find pinned charts, stash in a dict
[15:49:53] 10serviceops, 10Parsoid, 10RESTBase: Decommission Parsoid/JS from the Wikimedia cluster - https://phabricator.wikimedia.org/T241207 (10jijiki) @Dzahn If there is anything left here, I reckon we can mark it as resolved? Thank you!
[15:53:16] okay, cool. AIUI that would then still pin all releases of said chart (assuming we have >1) even if only one is pinned, right?
[15:53:37] (not saying this is a problem we have right now)
[16:00:19] <_joe_> can we have multiple versions of the same chart under the same helmfile?
[16:00:27] <_joe_> that doesn't seem probable?
[16:00:46] absolutely
[16:00:55] different releases
[16:01:21] All done, we can now absent services in hieradata/common/profile/kubernetes/deployment_server.yaml
[16:02:32] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Allow absenting profile::kubernetes::deployment_server::services - https://phabricator.wikimedia.org/T322298 (10Clement_Goubert) 05In progress→03Resolved
[16:02:34] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Deploy new mw-debug service - https://phabricator.wikimedia.org/T321201 (10Clement_Goubert)
[16:06:10] 10serviceops, 10Foundational Technology Requests, 10Prod-Kubernetes, 10Shared-Data-Infrastructure, and 2 others: Update Kubernetes clusters to v1.23 - https://phabricator.wikimedia.org/T307943 (10JMeybohm)
[16:08:30] <_joe_> jayme: I think we can outmaneuver that too
[16:30:26] that would be nice ofc
[16:52:28] https://wikitech.wikimedia.org/wiki/Kubernetes/Remove_a_service feedback and additions welcome
[16:59:35] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Deploy new mw-debug service - https://phabricator.wikimedia.org/T321201 (10Clement_Goubert) 05In progress→03Resolved All cleaned up.
[17:04:15] 10serviceops, 10MW-on-K8s, 10Patch-For-Review: Deploy new mw-debug service - https://phabricator.wikimedia.org/T321201 (10Clement_Goubert) I lied.
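A sketch of the "@fixtures.values.each |env| helmfile build, find pinned charts, stash in a dict" plan above, reusing the hypothetical pinned_releases helper from the earlier sketch. Keying the result per environment and per release name (rather than per chart) is one way to address the multiple-releases concern raised above, so only the release that actually pins a version is skipped while other releases of the same chart still get their "chart:" path patched. Again an assumption-laden sketch, not the real deployment-charts CI code:

```ruby
# Sketch: build each fixture environment once and remember which releases pin
# a chart version, e.g. { "codfw" => ["calico", "calico-crds"], ... }
# (example output, assuming those releases pin a version).
def pinned_releases_by_env(helmfile_dir, environments)
  environments.each_with_object({}) do |env, pinned|
    pinned[env] = pinned_releases(helmfile_dir, env)
  end
end

# Later, while patching chart paths for a given environment, something like
#   next if pinned.fetch(env, []).include?(release_name)
# would leave the pinned releases' "chart:" entries untouched.
```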
[17:21:09] 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Priority Backlog 📥): WMF container registry does not accept a manifest list (aka OCI manifest index, or "fat" manifest) - https://phabricator.wikimedia.org/T322453 (10dduvall) Thanks for debugging this further, @JMeybohm and @hashar. In...
[17:22:12] 10serviceops, 10MW-on-K8s, 10SRE: Sandbox/limit child processes within a container runtime - https://phabricator.wikimedia.org/T252745 (10Joe) 05Open→03Resolved a:03Joe This task can be considered resolved given we've deployed shellbox.
[17:26:15] <_joe_> jayme: I hate you
[17:28:13] 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Priority Backlog 📥): Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push - https://phabricator.wikimedia.org/T322453 (10dduvall)
[17:31:37] I'm off, see you tomorrow
[17:50:20] 10serviceops, 10Continuous-Integration-Infrastructure, 10Datacenter-Switchover, 10Release-Engineering-Team (Priority Backlog 📥): Create a runbook for switching CI master - https://phabricator.wikimedia.org/T256396 (10jijiki)
[17:55:14] * _joe_ too
[17:56:21] 10serviceops, 10Parsoid, 10RESTBase: Decommission Parsoid/JS from the Wikimedia cluster - https://phabricator.wikimedia.org/T241207 (10Dzahn) 05Open→03Resolved a:03Dzahn @jijiki There is a small detail left but I am not going to work on it. It requires changes to scap config first. I don't mind if it...
[18:12:41] 10serviceops, 10Prod-Kubernetes, 10Kubernetes: Drop the use of nonexisting groups in kubernetes infrastructure_users - https://phabricator.wikimedia.org/T290963 (10jijiki)
[18:12:45] 10serviceops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10jijiki)
[18:13:15] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Import istio 1.1x (k8s 1.23 dependency) - https://phabricator.wikimedia.org/T322193 (10jijiki)
[18:46:51] 10serviceops, 10GitLab (Infrastructure), 10Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (10jijiki) p:05High→03Unbreak!
[18:47:05] 10serviceops, 10GitLab (Infrastructure), 10Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (10jijiki) p:05Unbreak!→03High
[19:42:46] 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Priority Backlog 📥): Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push - https://phabricator.wikimedia.org/T322453 (10dduvall) @JMeybohm can you provide the nginx access log entries from that ti...
[20:15:52] 10serviceops, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Priority Backlog 📥): Buildkit erroring with "cannot reuse body, request must be retried" upon multi-platform push - https://phabricator.wikimedia.org/T322453 (10dduvall) I enabled debug logging for buildkitd on the gitlab-runner hosts an...
[20:38:58] 10serviceops, 10GitLab, 10Release-Engineering-Team (Priority Backlog 📥): Build and import new release of jwt-authorizer (1.1.0) - https://phabricator.wikimedia.org/T322691 (10dduvall)