[00:32:38] serviceops, MW-1.40-notes (1.40.0-wmf.6; 2022-10-17), PHP 7.4 support, Patch-For-Review, Platform Team Workboards (Clinic Duty Team): Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11 - https://phabricator.wikimedia.org/T292552 (tstarling) I reviewed the list of p...
[01:38:46] serviceops, MW-1.40-notes (1.40.0-wmf.8; 2022-10-31), PHP 7.4 support, Patch-For-Review, Platform Team Workboards (Clinic Duty Team): Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11 - https://phabricator.wikimedia.org/T292552 (tstarling) The script is now runnin...
[02:48:58] serviceops, MW-1.40-notes (1.40.0-wmf.7; 2022-10-24), PHP 7.4 support, Patch-For-Review, Platform Team Workboards (Clinic Duty Team): Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11 - https://phabricator.wikimedia.org/T292552 (tstarling)
[02:57:12] serviceops, MW-1.40-notes (1.40.0-wmf.7; 2022-10-24), PHP 7.4 support, Patch-For-Review, Platform Team Workboards (Clinic Duty Team): Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11 - https://phabricator.wikimedia.org/T292552 (tstarling)
[02:57:53] serviceops, MW-1.40-notes (1.40.0-wmf.7; 2022-10-24), PHP 7.4 support, Patch-For-Review, Platform Team Workboards (Clinic Duty Team): Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11 - https://phabricator.wikimedia.org/T292552 (tstarling)
[02:58:42] serviceops, MW-1.40-notes (1.40.0-wmf.7; 2022-10-24), PHP 7.4 support, Patch-For-Review, Platform Team Workboards (Clinic Duty Team): Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11 - https://phabricator.wikimedia.org/T292552 (tstarling) Open→Resolved a:...
[02:58:44] serviceops, Performance-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432 (tstarling)
[08:34:29] serviceops, MW-on-K8s, SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (Clement_Goubert) Settling on `mw-web` as there's been no contrary opinion in a week.
[08:36:23] hello folks
[08:36:39] I added labels to coredns resources, and I am rolling them out to ml-serve clusters
[08:36:46] (no pod restarts so far)
[08:36:56] do you want me to roll it out also to your clusters?
[08:45:35] elukey: yeah, feel free to do so <3
[08:51:54] ack!
[09:13:05] all rolled out
[09:14:54] the last step is to archive https://gerrit.wikimedia.org/r/admin/repos/operations/debs/coredns, since we are now building coredns via a multi-stage build. Ok for everybody?
[09:15:11] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Import coredns 1.8.x (k8s 1.23 dependency) - https://phabricator.wikimedia.org/T321159 (elukey) Last step is to archive https://gerrit.wikimedia.org/r/admin/repos/operations/debs/coredns
[09:19:39] elukey: should we wait with that until we have rolled out the new coredns image as well?
[09:19:56] I would not suspect that we need to change the old one, but who knows
[09:20:23] jayme: yes yes it is more conservative for sure, I'll leave the ticket pending then
[10:43:54] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (Clement_Goubert)
[10:44:57] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (Clement_Goubert) Open→In progress p:Triage→High
[10:45:09] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[11:14:14] jayme: or _joe_: or elukey: (anyone, really) - Are there restrictions on writing to the file systems of our kubernetes pods in production? I seem to remember something about it, but can't find a reference right now.
[11:20:06] btullis: we don't have that restriction right now, but I would really encourage you to avoid that
[11:21:16] we should really switch to read only root fs at some point plus writing to overlayfs is usually slower than expected. If you need to write somewhere, use an emptyDir
[11:23:01] jayme: Thanks. Unfortunately, I have come across this behaviour in the spark-operator. A spark-driver pod is submitted with a particular jar file containing the application. Then the driver starts spark-executor pods and the jar is transferred to each executor and written to /opt/spark/work-dir/ before being executed.
[11:24:01] emptyDir should be fine for that!
[11:27:32] <_joe_> ^^
[11:28:10] <_joe_> yes my recommendation is a) avoid high-throughput I/O in general and b) use emptyDirs whenever you still need low-bandwidth i/o
[11:30:39] OK. I can try that. Thanks. I've discovered that upstream does it by running the spark jobs with a gid of 0, and they have an interaction with `su` that I still haven't figured out 100% yet.
[11:30:40] https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile#L40
[11:32:52] They use this gid 0 to write the jar file to `/opt/spark/work-dir/` which is owned by `root:root` mode `0775`
[11:34:39] This makes me pretty uneasy, but until I know why they need pam_wheel and su support, I don't know if a workaround to drop these privileges is going to work in the long run. Does that make sense?
[11:55:35] <_joe_> yes
[11:55:53] <_joe_> but I don't think we're going to start allowing running payloads as root on k8s
[11:56:24] was about to say. That's not going to work anyways
[11:57:06] and the owning group of the emptyDir you can specify with securityContext.fsGroup - if that helps
[12:00:19] btullis: bear in mind that most upstream Dockerfiles tend to be of bad quality and not ready to use in production. It's not the fault of upstream projects alone ofc, it's notoriously hard to write a Dockerfile correctly. Which is why with releng we decided to codify how to build a correct dockerfile with blubber. What you see there is a prime
[12:00:19] example of why we went down that road.
[12:00:53] e.g. ln -s /lib /lib64 ? that reeks of cargo culting
[12:01:32] or the fact they have an entrypoint.sh (a bad, but very popular pattern)
[12:05:30] they do have ARG spark_uid=185 though and a USER directive, why on earth do they need to run spark jobs with gid of 0 ?
[12:06:04] rm /bin/sh && \
[12:06:04] ln -sv /bin/bash /bin/sh && \
[12:06:10] AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAARGGGGGGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHH
[12:06:57] I think I might have been heard 2 towns over in every direction
[12:07:10] lol
[12:07:20] quality™
[12:08:19] Thanks all. Yes, the more I work with this operator code, the less I like it. I have asked a question on the Kubernetes Slack Spark channel to see if anyone can enlighten me as to why they need root privileges: https://kubernetes.slack.com/archives/CALBDHMTL/p1666870923061949
[12:09:23] <_joe_> akosiaris: entrypoint.sh is a bad pattern if you don't know how to write it :P
[12:09:36] <_joe_> (e.g. using exec, for instance)
[12:10:02] _joe_: yup, which is true for the majority of people writing it
[12:10:10] I certainly heard the scream all the way here in Athens from where you are :-P
[12:10:23] <_joe_> apergos: that's pretty far away!
[12:10:30] indeed!
[12:10:51] The only thing that occurs to me is that maybe it's something to do with user impersonation for HDFS. I don't know why they need to use bash instead of sh either.
[12:11:13] <_joe_> btullis: more interestingly, why you would need to remove it and relink it
[12:14:04] <_joe_> uh that dockerfile is something
[12:14:19] <_joe_> akosiaris: add "use a true init" as part of the antipatterns
[12:14:26] OK, so I think it's fair to say that getting spark jobs to work on the dse-k8s cluster is going to be a bigger job than expected.
[12:15:34] <_joe_> btullis: I would say you mostly need to trim some fat
[12:15:45] <_joe_> for instance, stuff like https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L30-L37
[12:17:08] for the love of...
[12:18:01] adding a user in the entrypoint....people are writing configuration management systems in bash all over again just to run them in containers...
[12:19:09] _joe_: Most certainly, but I need to understand why they thought this pattern was the best to use in the first place. It still doesn't explain why they need `su` at all. I've searched for it in the code, but can't explain it yet.
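(Editor's sketch, not part of the channel log.) The advice given above, namely to write only to an emptyDir, set its owning group with securityContext.fsGroup, and not run as root, could look roughly like the following pod manifest, built here as a Python dict and printed as JSON (which Kubernetes accepts alongside YAML). The pod name, image, and volume name are hypothetical; UID/GID 185 only mirrors the `ARG spark_uid=185` quoted from the upstream Dockerfile; whether Spark runs cleanly with a read-only root filesystem is an untested assumption.

```python
import json

# Assumption: 185 mirrors `ARG spark_uid=185` from the upstream Dockerfile.
SPARK_UID = 185


def build_executor_pod(name: str, image: str) -> dict:
    """Return a Pod manifest that writes only to an emptyDir and never runs as root."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pod-level security context: non-root user and group, and
            # fsGroup so the emptyDir is group-owned by that non-root GID.
            "securityContext": {
                "runAsNonRoot": True,
                "runAsUser": SPARK_UID,
                "runAsGroup": SPARK_UID,
                "fsGroup": SPARK_UID,
            },
            "containers": [
                {
                    "name": "spark",
                    "image": image,
                    # Optional hardening; Spark may need further emptyDirs
                    # (for example for temporary files) if this is enabled.
                    "securityContext": {"readOnlyRootFilesystem": True},
                    "volumeMounts": [
                        {"name": "work-dir", "mountPath": "/opt/spark/work-dir"}
                    ],
                }
            ],
            # Scratch space that lives only as long as the pod does.
            "volumes": [{"name": "work-dir", "emptyDir": {}}],
        },
    }


# Hypothetical name and image, for illustration only.
pod = build_executor_pod("spark-executor-example", "example.invalid/spark:latest")
print(json.dumps(pod, indent=2))
```

With this layout the jar would land in the emptyDir mounted at /opt/spark/work-dir instead of the container's overlay filesystem, so gid 0 is not needed for that particular write. It does not answer the open question about `su` and pam_wheel.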
[12:19:38] <_joe_> btullis: I usually assume the people who write dockerfiles in repos are devs
[12:22:00] I am reading "in case of Openshift environments, where arbitrary UIDs are used to run containers" and realizing their needs and our needs ain't the same
[14:30:40] serviceops, MW-on-K8s, Patch-For-Review: Deploy new mw-debug service - https://phabricator.wikimedia.org/T321201 (Clement_Goubert) - [ ] check deployment window - [ ] pause the deployment script on deploy1002: `sudo touch /var/lib/deploy-mwdebug/pause` - [ ] `cd /etc/helmfile-defaults/mediawiki/relea...
[14:31:26] Someone to review https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/849501 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/849502/2 as well as stand by while I do the mw-debug on k8s switcheroo?
[14:47:14] jayme, _joe_ ? Sorry to ping, but I'd like to do this today if possible
[14:48:21] <_joe_> claime: sorry, in a meeting but I'll be free in 13 mins
[14:48:26] <_joe_> didn't I give you a +1?
[14:49:01] On the script only, it got erased when I changed the hardcoded file name
[14:49:40] But 13 minutes from now is more than ok
[15:04:18] <_joe_> you got two +1's
[15:04:37] Thanks, let's go :p
[15:05:04] sorry, was in a meeting as well
[15:05:25] No problem at all
[15:12:52] <_joe_> I have another meeting in 18 minutes btw
[15:18:38] ack
[15:23:59] _joe_: All donw
[15:24:03] done*
[15:24:16] Checked a page with the debug extension, all seems ok
[15:24:32] <_joe_> it works yes
[15:24:39] I'll prepare the removal patch for the old deployment
[15:24:41] <_joe_> well done!
[15:24:44] ty
[15:27:28] serviceops, MW-on-K8s, Patch-For-Review: Deploy new mw-debug service - https://phabricator.wikimedia.org/T321201 (Clement_Goubert) Switch done, only cleanup in deployment-charts left.
[15:30:13] wow yall seen this dyff yaml differ? might be really nice for helmfile apply diffs: https://github.com/homeport/dyff#use-cases-and-examples
[15:31:16] <_joe_> it's slightly nicer than my 4 lines of python, indeed :D
[15:31:27] <_joe_> ottomata: we could use it in CI
[15:31:42] would also be nice if integrated in CI somehow to show diffs to make them more easily reviewable
[15:31:52] <_joe_> we already have diffs in CI
[15:32:01] i was looking for something to make reviewing event schema changes easier and came across this
[15:32:15] _joe_: oh? semantic ones like this or just git file diffs?
[15:32:17] <_joe_> ottomata: look at the CI output
[15:32:26] * ottomata digging up a change...
[15:32:32] <_joe_> no they're better than just git diffs
[15:32:50] <_joe_> they're on the produced yaml and have anchors that should be understandable
[15:32:57] <_joe_> not as nice as those though
[15:33:35] aye
[15:33:36] https://integration.wikimedia.org/ci/job/helm-lint/8023/console ?
[15:34:08] <_joe_> https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/.rake_modules/tester/asset.rb#139
[15:34:31] Oh btw that diff command isn't portable :P
[15:34:42] GNU diff on macos doesn't have --color=always :')
[15:35:10] <_joe_> that's why you have rake run_locally
[15:35:16] ;p;
[15:35:23] For those who don't know
[15:35:35] I ran rake run_locally for helm-linter on my M1
[15:35:38] 3h runtime
[16:00:31] serviceops, Continuous-Integration-Infrastructure, Patch-For-Review, Release-Engineering-Team (CI & Testing services), Test-Coverage: Upgrade our php-xdebug package for php7.2 - https://phabricator.wikimedia.org/T234418 (Tgr)
[16:15:15] serviceops, MW-on-K8s, SRE: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (Krinkle)
[16:34:47] hi all anyone able to take a look and +1 https://gerrit.wikimedia.org/r/c/operations/puppet/+/843001/1/hieradata/common/mediawiki.yaml
[16:34:56] adding a new wiki to mediawiki
[16:35:39] Amir1: possibly as I see you commented on the task
[17:05:22] <_joe_> we can probably reduce the resource allocation for sessionstore https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&from=now-90d&to=now&viewPanel=80
[17:06:08] serviceops, Performance-Team, Patch-For-Review, Performance-Team-publish: Re-evaluate need for "cool-off bounce" in WANObjectCache - https://phabricator.wikimedia.org/T321634 (Krinkle) p:Triage→Medium From talking with Aaron and Tim, we'd like to quantify how much bandwidth ParserOutput's...
[17:58:45] jbond: it looks good, first we need to remove the redirect, then the second patch
[17:59:50] that would also be big so also worth cleaning up afterwards
[17:59:58] ~.
[18:00:22] ignore that hung terminal
[18:00:38] Amir1: thanks I'll merge the redirect one now and the other in an hour or so
[18:00:53] thanks
[18:01:26] bp
[18:01:29] np
[18:01:52] (cc zabe)
[18:04:23] nice, thanks
[19:03:29] zabe: both changes merged should roll out over the next 30 mins
[19:54:09] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (Cmjohnson) I added these to netbox but when I ran the dns script and home, nothing changed.
[23:19:59] serviceops, RESTBase, Wikipedia-iOS-App-Backlog, iOS-app-feature-Performance, and 3 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365 (Krinkle) >>! In T319365#8289873, @daniel wrote: > Jdlrobson just pointed me to {T214000} and {T21...