[07:08:53] <wikibugs>	 10serviceops: Test running php7.2 and php7.4 in parallel on the beta cluster - https://phabricator.wikimedia.org/T295578 (10Majavah) Puppet is failing on deployment-parsoid12: ` Mar 03 07:07:58 deployment-parsoid12 php[31713]: PHP Warning:  PHP Startup: Unable to load dynamic library 'wddx.so' (tried: /usr/lib/p...
[07:42:55] <elukey>	 hello folks
[07:43:02] <elukey>	 kubernetes2018 looks good afaics
[07:43:16] <elukey>	 if you are ok I can add the remaining 4 new nodes
[07:44:27] <elukey>	 (code reviews already +1ed)
[07:47:06] <jayme>	 elukey: I'm in for it!
[07:47:28] <elukey>	 ah! the laptop works :D
[07:48:07] <jayme>	 nono, the tech will be here at some point from 8.00 to 10.00Z :)
[07:48:12] <elukey>	 ahahah okok
[07:48:30] <elukey>	 all right then, going to do the last checks on the nodes and I'll merge my patches
[07:49:01] <jayme>	 great!
[08:04:01] <jayme>	 off for tech
[09:00:32] <elukey>	 starting to add new nodes to codfw
[09:03:36] <elukey>	 lovely I found a bug in site.pp
[09:04:26] <elukey>	 re-running pcc with a broader testbed
[09:04:28] <apergos>	 o.O
[09:04:49] <elukey>	 (basically I included also the new eqiad nodes in site.pp, instead I wanted only the codfw ones)
[09:04:58] <elukey>	 I blame jayme for not catching it
[09:05:00] <elukey>	 :D
[09:05:02] <apergos>	 woops!
[09:05:59] <elukey>	 it would have worked anyway
[09:06:24] <apergos>	 whew!
[09:06:37] <elukey>	 I had /kubernetes20(1[89]|2[0-2])\.(eqiad|codfw)\.wmnet/
[09:06:51] <elukey>	 so the 20xx was safe enough
[09:06:54] <apergos>	 right
[09:07:00] <apergos>	 you're ok :-D
[09:07:15] <apergos>	 a bit of early morning copy-paste eh? 
[09:07:38] <elukey>	 I did it yesterday afternoon soooo no excuse :D
[09:07:52] <elukey>	 messing up with site.pp is always a little risk
[09:08:11] <elukey>	 at least for me, I haven't really get comfortable in years modifing that file
[09:09:35] <apergos>	 that's one of the things where I'm not at all risk averse
[09:09:49] <apergos>	 I'm just "lah dee dah let's plop this change in"
[09:10:07] <apergos>	 weird how different things are in someone's comfort zone for different people
[09:10:23] <apergos>	 anyways, no harm no foul :-)
[09:16:02] <elukey>	 my fear is that a wrong regex can cause other nodes to start running different puppet configs
[09:21:01] <elukey>	 all right running puppet on the new nodes
[09:24:08] <apergos>	 go go go :-)
[09:25:17] <elukey>	 so weirdly, kubelet and kubeproxy need another restart 
[09:25:24] <elukey>	 I mean after the first puppet run
[09:25:32] <apergos>	 huh
[09:25:38] <elukey>	 I tried to check logs and journal etc.. but didn't find much
[09:25:43] <apergos>	 some ordering thing not quite right?
[09:25:49] <elukey>	 puppet triggers a refresh after config but they fail to start
[09:25:51] <elukey>	 seems so yes
[09:29:32] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: setup/install kubernetes20[1(89)|2(012)] - https://phabricator.wikimedia.org/T302208 (10elukey) The first puppet run seems to always end up in:  ` Notice: /Stage[main]/Profile::Kubernetes::Node/K8s::Kubeconfig[/etc/kubernetes/kubelet_config]/File[/etc/kube...
[09:38:34] <jayme>	 elukey: hmm...I had seen the "(eqiad|codfw)" but I thought it's just for your convenience so that you can change for kubernetes(1|2)... later
[09:38:46] <elukey>	 jayme: yes yes all excuses! 
[09:38:48] <elukey>	 :D
[09:39:00] <elukey>	 you are definitely right, I thought it would have been a problem but it wasn't
[09:39:00] <jayme>	 but what was the problem with it?
[09:39:06] <jayme>	 ah, ok
[09:39:20] <elukey>	 I added only "codfw" that seems more correct, and merged
[09:39:35] <elukey>	 I am not running puppet on 2022
[09:39:45] <elukey>	 will do some checks and then run homer
[09:40:02] <elukey>	 kube-proxy and kubelet fail to start, when refreshed during the first puppet run
[09:40:06] <elukey>	 I added some info in the task
[09:40:12] <elukey>	 but can't find why
[09:46:50] <apergos>	 does puppet claim it is starting them, when you look at the puppet logs?
[09:47:20] <elukey>	 it is refreshing them after changing their config
[09:47:38] <apergos>	 huh
[09:47:56] <apergos>	 but after that there must be additional changes which then are sufficient that when you restart manually it is ok
[09:47:59] <apergos>	 irritating
[09:50:15] <elukey>	 ok all new nodes up and running!
[09:52:06] <elukey>	 so a total of 5 nodes on bullseye in codfw
[09:52:37] <jayme>	 hmm...kubelet not starting is indeed weird
[09:52:40] <jayme>	 but \o/
[09:54:03] <jayme>	 I'd say we cordon the nodes that are due to be decommed already to move workload off of them with consecutive deployments
[09:55:40] <wikibugs>	 10serviceops, 10Prod-Kubernetes, 10Patch-For-Review: setup/install kubernetes20[1(89)|2(012)] - https://phabricator.wikimedia.org/T302208 (10elukey) All new nodes up and running!  I didn't spot anything weird, all bgp sessions seem to be up, pods scheduled on the new nodes (for the moment only calico/istio).
[09:57:20] <elukey>	 jayme: seems good yes, are those 200[1-4]?
[09:57:46] <jayme>	 elukey: yep
[09:58:08] <elukey>	 is there a task for the decom or should I use the new nodes one to !log etc..
[10:00:27] <jayme>	 I don't know of a decom task (just https://phabricator.wikimedia.org/T302208) - I'd say use that for log of the cordon and we create a decom parent task later when we're done with updateing all nodes (apart from the to-decom-ones) and nothing catched fire
[10:03:57] <elukey>	 ok sounds good. I am proceeding with the cordon then, is there more that needs to be done as prep-step?
[10:04:47] <jayme>	 prep step for cordon? No. That will do nothing apart from preventing the sheduler to place any new workload onto the nodes
[10:06:07] <elukey>	 ah yes so we don't drain, just avoid new pods
[10:06:09] <elukey>	 okok
[10:07:22] <jayme>	 yep, that's what I would do.
[10:16:13] <wikibugs>	 10serviceops, 10decommission-hardware: decommission rdb100[56].eqiad.wmnet - https://phabricator.wikimedia.org/T273139 (10akosiaris) Taking over T281217 to finish this.
[10:18:04] <elukey>	 jayme: done :)
[10:18:11] <jayme>	 yay...new mainboard, now my external display turns black from time to time
[10:18:14] <jayme>	 elukey: thanks
[10:18:50] <akosiaris>	 elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/767732 for ores. 
[10:19:16] <akosiaris>	 I can do the puppet apply celery restart uwsgi restart dance, just lmk if it is ok
[10:19:24] <akosiaris>	 thanks!
[10:19:37] <akosiaris>	 and yes, this is ... oooooold
[10:20:33] <elukey>	 akosiaris: o/ so this will move the celery queue on another redis node right? Not related to the score cache
[10:20:46] <elukey>	 also we have cookbooks to roll restart ores daemons if you want
[10:21:15] <akosiaris>	 I think it's both
[10:21:28] <akosiaris>	 oh, +1 on the cookbooks, I 'd love to use them
[10:21:53] <akosiaris>	 rdb1011 is already syncing from rdb1005, so the impact should be too big
[10:21:58] <akosiaris>	 shouldn't*
[10:22:09] <elukey>	 ahhh this I didn't know
[10:22:14] <elukey>	 yes then super good
[10:23:13] <elukey>	 yes score cache on the same node, different redis instance
[10:23:33] <elukey>	 akosiaris: just to be sure, all instances have been syncing to 1011 right?
[10:24:32] <akosiaris>	 rdb1012 is being switched right now to sync from rdb1011 (that will be the new master), rdb1011 has been syncing from rdb1005 for >1 year by now
[10:24:47] <elukey>	 jayme: I'll leave you to close the task, not sure if you want to do a last check before closing etc..
[10:24:59] <elukey>	 akosiaris: lol
[10:25:00] <akosiaris>	 I am just wrapping up the last few loose ends
[10:25:12] <elukey>	 okok
[10:25:47] <jayme>	 elukey: ack. I will at least create the decom followup before closing. Thanks
[10:40:51] <wikibugs>	 10serviceops, 10Parsoid: Move testreduce to nodejs 12 - https://phabricator.wikimedia.org/T301303 (10LSobanski) Is this a blocker for any production work? If yes, what are the time expectations for this to happen?  Side question, has using Docker been considered to solve this going forward?
[10:42:58] <elukey>	 one question related to istio
[10:43:36] <elukey>	 the more I am using istio + knative + kserve the more I see that most of the nice things (like circuit breaking, service catalog, etc..) all assume that there is a mesh
[10:43:41] <elukey>	 (even without mtls)
[10:44:27] <elukey>	 I hoped to use the egress gateway as intermediate solution for egress traffic (to control / rate-limit it)
[10:44:48] <elukey>	 but IIUC all the nice configs are for the istio-proxy sidecars
[10:45:05] <elukey>	 so now I am starting to check what it takes to deploy https://github.com/istio/istio/tree/master/cni
[10:46:08] * jayme runns
[10:46:36] <elukey>	 that is similar to calico's use case, so I'd create a gerrit repo and build a debian package
[10:47:05] <elukey>	 istioctl should have a way to point its config to binaries on the node
[10:47:19] <elukey>	 (there is a istio-install-cni daemonset that seems doing it but it looks weird)
[10:47:37] <jayme>	 yeah, we should really not do that with daemonsets
[10:48:05] <elukey>	 I am reading https://istio.io/latest/docs/setup/additional-setup/cni/#hosted-kubernetes-settings
[10:48:06] <jayme>	 but packaging the CNI is fine I guess. You'll have to take a look on how to chain them, though
[10:48:10] <elukey>	 and it looks relatively easy
[10:48:45] <elukey>	 chain with calico's?
[10:48:59] <elukey>	 (asking due to a huge ignorance related to cnis)
[10:49:30] <jayme>	 yes, exactly
[10:50:07] <elukey>	 also afaics istioctl allows to specify the install dir, it doesn't seem to allow bypassing install-cni
[10:50:10] <elukey>	 uff
[10:50:15] <elukey>	 will read the helm charts
[11:34:48] <wikibugs>	 10serviceops, 10Product-Infrastructure-Team-Backlog, 10SRE, 10Maps (Geoshapes), and 2 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10MSantos)
[11:36:26] <elukey>	 akosiaris: I am stepping out for lunch, feel free to roll restart ores if you want anytime
[11:36:34] <elukey>	 (I'll be back in ~2h)
[11:44:40] <akosiaris>	 istio ships a cni plugin? 
[11:44:45] <akosiaris>	 that's news to me 
[11:45:13] <akosiaris>	 ah wait, that's the thing doing the iptables redirection? 
[11:45:16] <_joe_>	 *shivers*
[11:45:38] <akosiaris>	 yeah, that's it
[11:46:13] <akosiaris>	 it used to use initcontainers but now moving to that. I am not surprised. I really don't like initContainers much
[11:46:45] <akosiaris>	 but it also feels weird for a service mesh to mess with pod networks
[12:04:44] <_joe_>	 well if you want applications to just be transparently redirected to your local service proxy, that is probably a good idea though
[12:05:04] <_joe_>	 we could write our own CNI plugin and remove the need for calling localhost from the applications 
[12:05:07] <_joe_>	 :P
[12:39:08] <jayme>	 uff...lenovo support send me back to runnning the diagnosis tool for 4h because the "new" mainboard has a different defect (which will potentially not be detected by diag because it's a slack joint)
[13:02:37] <_joe_>	 what's your new defect?
[13:11:31] <wikibugs>	 10serviceops, 10MW-on-K8s, 10Patch-For-Review, 10Release-Engineering-Team (Done by Feb 23 🧟): Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (10Joe)
[13:12:40] <jayme>	 the external monitor is switching off and on randomly when connected via the primary USB-C/thunderbold port
[13:14:47] <jayme>	 not via the secondary - that's why I'm assuming a slack joint
[13:23:45] <wikibugs>	 10serviceops, 10GitLab (Infrastructure): Automate setup of GitLab test instance - https://phabricator.wikimedia.org/T302976 (10Jelto)
[13:23:56] <wikibugs>	 10serviceops, 10GitLab (Infrastructure): Automate setup of GitLab test instance - https://phabricator.wikimedia.org/T302976 (10Jelto) p:05Triage→03Low
[13:24:35] <wikibugs>	 10serviceops, 10GitLab (Infrastructure), 10Patch-For-Review: Migrate gitlab-test instance to puppet - https://phabricator.wikimedia.org/T297411 (10Jelto)
[13:32:39] <wikibugs>	 10serviceops, 10GitLab (Infrastructure), 10Patch-For-Review: Migrate gitlab-test instance to puppet - https://phabricator.wikimedia.org/T297411 (10Jelto) 05In progress→03Resolved a:03Jelto I created a dedicated task to automate the test instance creation: T302976  I also changed the flavor of `gitlab-p...
[13:47:06] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10SRE-tools, 10Patch-For-Review: Add a kubernetes module to spicerack - https://phabricator.wikimedia.org/T300879 (10Joe) a:03Joe
[13:47:18] <wikibugs>	 10serviceops, 10Infrastructure-Foundations, 10SRE-tools, 10Patch-For-Review: Add a kubernetes module to spicerack - https://phabricator.wikimedia.org/T300879 (10Joe) p:05Triage→03Medium
[14:44:28] <elukey>	 akosiaris: about the cni plugin - I am reading https://istio.io/latest/docs/setup/additional-setup/cni/ and there seem to be not a lot of space for a solution that doesn't involve the install container bits
[14:44:56] <elukey>	 IIUC istioctl/operator can create a daemonset that deploys  the istio-cni binaries on the worker nodes
[14:45:15] <elukey>	 setting up their config etc..
[14:45:31] <elukey>	 and also managing upgrades for you transparently
[14:45:42] <elukey>	 it doesn't look great though
[16:14:21] <akosiaris>	 hnowlan: I was trying to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/767742 and I see that https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/764409 isn't deployed. Should I deploy it? It looks like the only change is https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/741937
[16:14:52] <akosiaris>	 elukey: a daemonset messing with the cni binaries on the node? yeah you are right, it doesn't look great for our env
[16:16:07] <elukey>	 akosiaris: IIUC it deploys the cni binaries on /opt/blabla (basically what our calico package does IIUC)
[16:16:17] <elukey>	 the main problem is that I don't see an alternative
[16:16:47] <akosiaris>	 how often do those CNI binaries change?
[16:16:54] <elukey>	 (namely istioctl seems to allow only to specify what dirs are the target)
[16:17:09] <elukey>	 in theory they should be coupled with the istio release
[16:17:10] <akosiaris>	 we could try to include the cni binaries in our package 
[16:17:50] <akosiaris>	 wow, every release? that is fast
[16:18:14] <elukey>	 I mean potentially
[16:18:26] <elukey>	 I was reading https://istio.io/latest/docs/setup/additional-setup/cni/#hosted-kubernetes-settings
[16:18:45] <elukey>	 initially I thought that it would have allowed us to deploy the binaries using our package
[16:19:09] <elukey>	 but it seems, from the description, that it is only to specify where istio-cni-install should deploy stuff
[16:20:17] <elukey>	 trying to check https://github.com/istio/istio/tree/release-1.9/manifests/charts/istio-cni
[16:20:28] <elukey>	 (this is what istioctl executes in theory)
[16:22:43] <elukey>	 https://github.com/istio/istio/blob/release-1.9/manifests/charts/istio-cni/templates/daemonset.yaml
[16:24:35] <hnowlan>	 akosiaris: if it'd be okay I'll revert that change, it's not behaving as I'd like
[16:24:44] <hnowlan>	 one sec
[16:24:57] <akosiaris>	 hnowlan: fine by me :-). Thanks for the quick response
[16:25:33] <hnowlan>	 oh, if I revert a chart I'll need to bump the version number too right? 
[16:29:06] <_joe_>	 yes
[16:29:43] <elukey>	 mmm maybe I could do something like this:
[16:30:00] <elukey>	 1) create a debian package that deploys istio cni binaries (like we do for calico) + configs
[16:30:25] <elukey>	 2) add to our docker registry a install-cni istio image that basically does nothing
[16:30:39] <elukey>	 but yeah not great either
[16:32:59] <akosiaris>	 hnowlan: yup. 
[16:35:02] <elukey>	 this is the docker image https://github.com/istio/istio/blob/release-1.9/cni/deployments/kubernetes/Dockerfile.install-cni
[16:35:06] <elukey>	 err docker file
[16:37:22] <jayme>	 elukey: I think I have questions :)
[16:38:20] <jayme>	 but I need to leave for today. If you have some time tomorrow, we can discuss if you like :)
[16:38:47] <elukey>	 jayme: I have a lot of questions too :) yeah tomorrow is fine
[16:39:01] <jayme>	 cool
[16:39:03] <jayme>	 o/
[17:07:48] <hnowlan>	 akosiaris: deployed your change 
[17:07:59] <akosiaris>	 hnowlan: thanks!
[17:26:44] <wikibugs>	 10serviceops, 10Product-Infrastructure-Team-Backlog, 10SRE, 10Maps (Geoshapes), and 2 others: New Service Request geoshapes - https://phabricator.wikimedia.org/T274388 (10akosiaris) >>! In T274388#7744335, @MSantos wrote: >> Set up the traffic layer to send traffic to the service (if needed). This is a bit...
[17:51:18] <wikibugs>	 10serviceops, 10Parsoid: Move testreduce to nodejs 12 - https://phabricator.wikimedia.org/T301303 (10Arlolra) > Is this a blocker for any production work? If yes, what are the time expectations for this to happen?  No, testreduce is only used for the large scale testing we do (roundtrip testing, visual differe...
[19:13:45] <wikibugs>	 10serviceops, 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, and 2 others: Kafka 2.x Upgrade Plan - https://phabricator.wikimedia.org/T302610 (10odimitrijevic)
[19:16:40] <wikibugs>	 10serviceops, 10Data-Engineering, 10Data-Engineering-Kanban, 10SRE, and 2 others: Kafka 2.x Upgrade Plan - https://phabricator.wikimedia.org/T302610 (10odimitrijevic) @elukey I updated the task description as I ask for it :)
[19:32:00] <wikibugs>	 10serviceops, 10MW-on-K8s, 10Patch-For-Review, 10Release-Engineering-Team (Done by Feb 23 🧟): Build MediaWiki images for kubernetes on the deployment servers - https://phabricator.wikimedia.org/T297673 (10dancy)
[19:36:19] <wikibugs>	 10serviceops, 10MW-on-K8s, 10Patch-For-Review, 10Release-Engineering-Team (Done by Feb 23 🧟): Build MediaWiki images for kubernetes on the deployment servers - https://phabricator.wikimedia.org/T297673 (10dancy)
[23:16:32] <subbu>	 mutante: https://parsoid-rt-tests.wikimedia.org/ has reverted back to its old state ... where the css isn't served because of the "static" name and it is also back to being cached.
[23:50:04] <mutante>	 subbu: I have not changed anything about it. It must be changes on traffic side. And I'm going on vacation in like an hour. Could you please create a ticket and ask them about it. the fix last time was to add it to their list of alternate_domains. https://gerrit.wikimedia.org/r/c/operations/puppet/+/749574 and it's still on that list.
[23:50:38] <subbu>	 will do. enjoy your vacation! :)
[23:50:39] <mutante>	 I highly suspect a varnish change 
[23:50:48] <mutante>	 because that whole thing was about the special treatment of   "if (req.url ~ "^/static/")
[23:50:52] <mutante>	 thanks subbu 
[23:51:27] <mutante>	 I also suspect it would work if CSS would not happen to be under ./static/
[23:58:23] <subbu>	 ya.