[00:44:21] serviceops, Release-Engineering-Team, Datacenter-Switchover: Switch deployment server to codfw (July 2021) - https://phabricator.wikimedia.org/T285820 (Legoktm)
[00:45:25] serviceops, Release-Engineering-Team, Datacenter-Switchover: Switch deployment server to codfw (July 2021) - https://phabricator.wikimedia.org/T285820 (Legoktm)
[01:52:35] serviceops, SRE, Datacenter-Switchover, Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (Legoktm) Bugs filed as a result of today's switchover: * {T285802} * {T260297} * {T285806} * {T285804} * {T285803} * {T285800}
[02:11:00] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Legoktm) @wiki_willy I think we can do this ASAP now that we've switched over to codfw.
[02:20:28] serviceops, SRE, Datacenter-Switchover, Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (Legoktm) Today's summary: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/ENL3P5SA7RSOHPN4ILMXQ2BGBF5XR776/
[02:55:20] serviceops, SRE, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (wkandek)
[05:13:01] serviceops, SRE, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (Joe) A few points: * Anyone who works in the technical wikimedia community should be subscribed to wikitech-l * Anyone who releases software should...
[05:40:31] legoktm: I had a quick look around, I didn't find any error messages I hadn't seen before, I pooled it back, monitored the number of free workers for a bit, and put it back in the pool
[05:41:12] if it will happen again on this or another server, we can further investigate
[07:09:58] serviceops, Prod-Kubernetes, Kubernetes, Release-Engineering-Team (Radar): CI pipeline/job to build and release helm chart artifacts - https://phabricator.wikimedia.org/T257333 (JMeybohm) >>! In T257333#7184596, @thcipriani wrote: > Is this task superseded by the cronjob on the deployment machine...
[08:25:11] folks I have a problem with the istiod pod that is a bit strange:
[08:25:39] pkica Failed to write secret to CA (error: Post "https://10.64.77.1:443/api/v1/namespaces/istio-system/secrets": x509: cannot validate certificate for 10.64.77.1 because it doesn't contain any IP SANs). Abort.
[08:26:10] so 10.64.77.1 is the kube api endpoint, that of course doesn't have IPs among SANs
[08:26:34] I tried to follow up with upstream on slack but the few answers that I got were related to "check your LB config"
[08:26:47] has anybody of you already encountered a similar issue?
[08:27:04] (a pod trying to contact the kubeapi via https and failing to do so)
[08:34:17] for example
[08:34:17] elukey@ml-serve-ctrl1001:~$ kubectl exec calico-node-58d4r -n kube-system -- env | grep KUBERNETES_SERVICE
[08:34:20] KUBERNETES_SERVICE_PORT=6443
[08:34:23] KUBERNETES_SERVICE_HOST=ml-ctrl.svc.eqiad.wmnet
[08:34:24] (for calico)
[08:34:26] KUBERNETES_SERVICE_PORT_HTTPS=443
[08:34:29] this looks good
[08:34:30] so I suppose it is not a misconfig of the cluster
[08:34:53] maybe istiod wrongly populates KUBERNETES_SERVICE_HOST
[08:37:12] elukey: my bouncer is wracked, so I probably missed something
[08:37:36] the question part before your example I guess :D
[08:40:03] jayme: all my spam in https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-serviceops/20210630.txt :D
[08:43:20] ah, indeed :)
[08:43:33] * jayme reading spam
[08:44:03] ah
[08:44:23] so, puppet IP SAN limitation
[08:44:57] you'll need to rewrite the api-server address
[08:45:36] let me look up where we did it
[08:45:50] ah ok so it already happened!
[08:46:08] I was trying to override the KUBERNETES_SERVICE_HOST env var of the istiod pod
[08:46:14] failing miserably of course
[08:48:03] elukey: we have .kubernetesApi in helmfile.d/admin_ng/values/common.yaml
[08:48:25] and use that to override the environment variables
[08:48:54] see charts/eventrouter/templates/_helpers.tpl for example
[08:49:09] or charts/calico/templates/configmap.yaml
[08:49:09] ahhh KUBERNETES_SERVICE_HOST: "{{ .Values.kubernetesApi.host }}"
[08:49:18] ack
[08:49:23] but be aware
[08:49:26] okok so I am on the right track, I just need to find how
[08:49:33] there ofc is a special case :)
[08:50:10] the default value from helmfile.d/admin_ng/values/common.yaml will only start working when coredns is available
[08:50:42] serviceops, SRE, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (Kormat) >>! In T285806, @Legoktm wrote: > Personally, I (@Legoktm) don't really understand why people aren't subscribed wikitech-l given that's wher...
[08:50:45] and my use case is even more lovely since I use istioctl
[08:51:09] that's why we use the "external" address for stuff like coredns and calico, see helmfile.d/admin_ng/values/eqiad/coredns-values.yaml
[08:51:35] makes sense yes
[08:51:47] but I guess you deploy istio after having coredns and calico already running, right?
[08:51:52] exactly
[08:52:04] but for some reason it picks up the svc of the kubeapi
[08:52:09] in that case, the cluster internal name from helmfile.d/admin_ng/values/common.yaml should be fine
[08:52:12] the IP of the svc
[08:52:32] yeah, you get that as default from kubernetes in every pod
[08:53:12] because it is the right thing ... as long as your CA supports IP SAN :P
[08:54:10] in my head having IPs in SANs is confusing, I thought our setup was good, but kubernetes thinks otherwise :D
[08:54:37] (but I have little experience with IP SANs so it may be better than what I have in mind)
[08:57:39] serviceops, Datacenter-Switchover: During DC switch, helm-charts failed verification because it doesn't have a service IP - https://phabricator.wikimedia.org/T285707 (Volans) I guess that the important part here is to make an explicit decision if the setup for helm-charts (discovery record without svc re...
[08:57:56] serviceops, Traffic, Datacenter-Switchover: During DC switch, helm-charts failed verification because it doesn't have a service IP - https://phabricator.wikimedia.org/T285707 (Volans)
[09:02:59] jayme: I had to manually kubectl edit deployment -n istio-system but it worked!
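For readers following the x509 error quoted at 08:25:39, here is a deliberately simplified Python model of why connecting to the API server by IP fails while connecting by name works. It is a sketch, not the actual Go TLS verification: wildcard matching is left out, and the SAN list used in the example (a single DNS name, no IP entries) is an assumption based on the calico environment shown at 08:34, not a dump of the real certificate.

```python
import ipaddress


def is_ip(host):
    """Return True if host parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def host_matches_sans(host, dns_sans, ip_sans):
    """Simplified SAN check: IP hosts are only compared against IP SANs,
    DNS hosts only against DNS SANs (no wildcard handling)."""
    if is_ip(host):
        target = ipaddress.ip_address(host)
        return any(ipaddress.ip_address(san) == target for san in ip_sans)
    return host.lower() in (name.lower() for name in dns_sans)


# Assumed certificate contents: one DNS SAN and no IP SANs.
ASSUMED_DNS_SANS = ["ml-ctrl.svc.eqiad.wmnet"]
ASSUMED_IP_SANS = []

if __name__ == "__main__":
    for host in ("10.64.77.1", "ml-ctrl.svc.eqiad.wmnet"):
        ok = host_matches_sans(host, ASSUMED_DNS_SANS, ASSUMED_IP_SANS)
        print(host, "validates" if ok else "does not validate")
```

Under these assumptions the service IP can never validate, which is why overriding KUBERNETES_SERVICE_HOST with a DNS name covered by the certificate is the workaround discussed above.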
[09:05:59] I mean one pod out of three works, istioctl is still not happy, but progress :)
[09:31:51] serviceops, SRE, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (LSobanski)
[09:44:27] serviceops, SRE, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (LSobanski) We are not yet at the point where DC switch is a non event and even when we get there, it's still an operation that can cause broad impac...
[09:52:12] serviceops, SRE, Datacenter-Switchover: Document communication expectations around planning a DC switchover - https://phabricator.wikimedia.org/T285806 (Kormat)
[10:33:04] Hi all!
[10:34:16] We (WMSE) have to repos in gerrit that due to legacy of third parties use python2.7. Blubber attempts at installing pip-21 that since january no longer support python2.
[10:34:23] What can I do to get around this?
[10:34:43] Step 6/20 : RUN python2.7 "-m" "easy_install" "pip" && python2.7 "-m" "pip" "install" "-U" "setuptools" "wheel" "tox" ---> Running in ab213147fadb
[10:34:46] Searching for pip
[10:34:49] Reading https://pypi.org/simple/pip/
[10:34:51] Downloading https://files.pythonhosted.org/packages/4d/0c/3b63fe024414a8a48661cf04f0993d4b2b8ef92daed45636474c018cd5b7/pip-21.1.3.tar.gz#sha256=b5b1eb91b36894bd01b8e5a56a422c2f3838573da0b0a1c63a096bb454e3b23f
[10:34:55] Best match: pip 21.1.3
[10:35:38] From https://pypi.org/project/pip/ : Note: pip 21.0, in January 2021, removed Python 2 support, per pip’s Python 2 support policy. Please migrate to Python 3.
[10:36:25] /s/to/two
[10:43:00] serviceops, SRE: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (MoritzMuehlenhoff) So to summarise; the plan is to reimage mwmaint1002 now that eqiad is passive and the reimage mwmaint2002 once eqiad is primary again?
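On the pip failure reported at 10:34 above: since pip 21.0 dropped Python 2, a Python 2.7 bootstrap has to request an older release explicitly. The Python sketch below shows one way to pick the requirement string per interpreter version. The function names are hypothetical and this is not Blubber's actual implementation; it only illustrates the idea of constraining pip below 21 for Python 2.

```python
def pip_requirement(python_version):
    """Return a pip requirement specifier suitable for the interpreter.

    python_version is a (major, minor) tuple. pip 21.0 removed Python 2
    support, so Python 2 interpreters are limited to pip<21.
    """
    major = python_version[0]
    if major < 3:
        return "pip<21"
    return "pip"


def bootstrap_command(python_binary, python_version):
    """Build the argv for bootstrapping pip via easy_install (hypothetical helper)."""
    return [python_binary, "-m", "easy_install", pip_requirement(python_version)]


if __name__ == "__main__":
    print(" ".join(bootstrap_command("python2.7", (2, 7))))
    print(" ".join(bootstrap_command("python3", (3, 7))))
```

The argv list form avoids shell quoting problems with the `<` character in the specifier.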
[11:08:15] Oh, am I on an old version of blubber?
[11:09:49] I might be, but the prebuilt binaries are from 2019 :C https://releases.wikimedia.org/blubber/
[11:13:40] Built a new version locally. Let's see if that works out.
[11:13:45] Found this line of code https://gerrit.wikimedia.org/g/blubber/+/459234d2acb785a0831e12f653a72df3e3c34272/config/python.go#198
[11:14:06] So it shouldn't be selecting a newer pip
[11:16:45] Best match: pip 20.3.4
[11:17:09] Yeah, so a new blubber binary should be deployed to releases.wikimedia.org/blubber
[11:20:44] Oh, this should of course have all been posted in #releng
[11:20:47] Going there now
[11:35:31] serviceops, MW-on-K8s, Release Pipeline (Blubber), Release-Engineering-Team (Seen): Blubber needs to check if a user is present before creating it as part of its runs stanza - https://phabricator.wikimedia.org/T268819 (Jdforrester-WMF) Is this done? Does the new image need to be deployed to blubb...
[13:10:52] docker-registry is now also switched to using the "light" variant of nginx. it was either noop (just package install) or on registry1004 nginx got reloaded because puppet regenerated config with the same content but just some "deny" lines in a different order from before
[13:12:28] fancy
[13:13:01] eqiad registries are passive anyways
[13:14:34] mutante: I'll go cleanup the now unused mods packages tomorrow (along with other services which moved to -light)
[13:14:36] ACK.. next I am going to reimage the now passive mwmaint eqiad
[13:14:57] also various libs, like libxslt, which gets pulled in by nginx-mod-xslt
[13:14:59] moritzm: ok cool, in this case I did not bother to do manual service restart on all machines ..so far
[13:15:10] but on one of them at least it got refreshed and no issue
[13:17:19] the service restart is happening automatically by puppet, no need for a manual restart
[13:18:57] I could not confirm this on all of them. most just did the package nginx-light install but did not trigger a refresh while on another the content of the nginx site config changed and triggered the refresh. the change happens when the order of config lines changes
[13:22:28] well, you can't exchange the nginx package without a service restart :-) the postrm of -full will stop it and the postinst of -light will start it
[13:26:25] ACK, yes, if the package does it, it makes sense it's not in puppet output
[14:30:09] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (wiki_willy) Thanks @Legoktm, much appreciated!
[14:32:48] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn) Yes, we can start with the lower hanging fruit like canaries here: https://gerrit.wikimedia.org/r/c/operatio...
[14:41:33] serviceops, GitLab: GitLab replica in codfw - https://phabricator.wikimedia.org/T285867 (Jelto)
[14:42:14] serviceops, GitLab: GitLab replica in codfw - https://phabricator.wikimedia.org/T285867 (Jelto)
[14:42:16] serviceops, GitLab, SRE, vm-requests: codfw: 1 of VMs requested for gitlab - https://phabricator.wikimedia.org/T285456 (Jelto)
[14:43:02] serviceops, GitLab, SRE, vm-requests: codfw: 1 of VMs requested for gitlab - https://phabricator.wikimedia.org/T285456 (Jelto) Open→Resolved
[14:43:04] serviceops, GitLab: GitLab replica in codfw - https://phabricator.wikimedia.org/T285867 (Jelto)
[14:43:08] qq - is it ok to have packages like net-tools on docker images (like istio ones) to help debugging pods?
[14:43:26] i'd say no :)
[14:43:56] if you need tooling, nsenter from the node your container is running on I'd say
[14:44:28] serviceops, GitLab: GitLab replica in codfw - https://phabricator.wikimedia.org/T285867 (Dzahn) https://gerrit.wikimedia.org/r/c/operations/puppet/+/702126 prevents that backups will be created twice. Bacula will only backup the active server, not the replica.
[14:44:52] why not? (asking because of ignorance, I don't see the harm of having netstat for example)
[14:45:47] serviceops, GitLab: GitLab replica in codfw - https://phabricator.wikimedia.org/T285867 (Dzahn)
[14:45:58] also no idea about nsenter
[14:46:00] :)
[14:46:46] serviceops, GitLab: request service IP / DNS name for gitlab-failover, apply puppet role on gitlab2001 - https://phabricator.wikimedia.org/T285870 (Dzahn)
[14:56:12] serviceops, Release-Engineering-Team, Datacenter-Switchover: Switch deployment server to codfw (July 2021) - https://phabricator.wikimedia.org/T285820 (brennen) Offhand, I think that's likely to be fine, but cc @thcipriani for awareness around train planning. (And adding to the team etherpad.)
[14:56:27] serviceops, Datacenter-Switchover, Release-Engineering-Team (Radar): Switch deployment server to codfw (July 2021) - https://phabricator.wikimedia.org/T285820 (brennen)
[15:00:01] serviceops, GitLab, Infrastructure-Foundations: request service IP / DNS name for gitlab-failover, apply puppet role on gitlab2001 - https://phabricator.wikimedia.org/T285870 (Jelto)
[15:04:41] tomorrow EU morning we will start with decom of eqiad appservers, starting with canaries and it will be knowledge transfer, how to do decoms
[15:21:11] elukey: every additional package increases the attack surface of a container/system. It could happen that a security vulnerability is found in net-tools in the future (especially because this is network stuff). So I would also say we should not add additional packages just for debugging.
[15:21:16] elukey: like jayme said, using nsenter would be a better solution. Every container lives inside a namespace on the host machine. So you have to take a look on which node the pod is running, find the container on the node and use linux command line tool nsenter to enter the container namespace from the node.
[15:24:59] serviceops, SRE, Thumbor, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (JoKalliauer)
[15:26:09] jelto: thanks for the context, I am playing with nsenter right now, TIL for today. I have to say that I doubt net-tools could cause any attack vector, but I agree that if there are other tools there is no reason to add more things to a docker image. I hoped that debugging pods would have been less annoying, but I am probably too new to kubernetes to judge
[15:31:01] serviceops, Datacenter-Switchover, Release-Engineering-Team (Radar): Switch deployment server to codfw (July 2021) - https://phabricator.wikimedia.org/T285820 (thcipriani) >>! In T285820#7187345, @brennen wrote: > Offhand, I think that's likely to be fine, but cc @thcipriani for awareness around trai...
[15:32:11] ok found the right combination of parameters of nsenter, and netstat worked fine
[15:32:45] not sure how soon I'll forget how to use it but it was definitely a good tip, thanks to both!
[15:34:48] serviceops, MW-on-K8s: MW container image build workflow vs docker-registry caching - https://phabricator.wikimedia.org/T282824 (dancy)
[15:37:54] serviceops, MW-on-K8s, Release Pipeline (Blubber), Release-Engineering-Team (Seen): Blubber needs to check if a user is present before creating it as part of its runs stanza - https://phabricator.wikimedia.org/T268819 (dancy) Open→Resolved >>! In T268819#7186716, @Jdforrester-WMF wrote: >...
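The nsenter workflow described at 15:21:16 (find the node, find the container, enter its namespace from the node) can be sketched as the two commands built below. This is an illustration under assumptions: the log does not record which parameters were actually used at 15:32, the container runtime is assumed to be Docker, and the helper names and the example container ID and PID are hypothetical. Both commands would need to run as root on the node hosting the pod.

```python
def pid_lookup_command(container_id):
    """argv that prints the host PID of a container's main process.

    Assumes a Docker runtime; `docker inspect --format` takes a Go template.
    """
    return ["docker", "inspect", "--format", "{{.State.Pid}}", container_id]


def nsenter_command(pid, tool_argv):
    """argv that runs a host-installed tool inside the network namespace of pid.

    Only the network namespace is entered (--net), so the tool binary comes
    from the node and nothing has to be installed in the container image.
    """
    return ["nsenter", "--target", str(pid), "--net"] + list(tool_argv)


if __name__ == "__main__":
    # Hypothetical container ID and PID, for illustration only.
    print(" ".join(pid_lookup_command("abc123")))
    print(" ".join(nsenter_command(4242, ["netstat", "-tlnp"])))
```

Because only the network namespace is switched, netstat sees the pod's sockets while still being the node's own binary, which is the point of the advice against shipping net-tools in images.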
[15:40:20] elukey: you could put an alias into modules/admin/files/home/elukey/.bash_profile so you never have to remember it
[15:41:56] mutante: ahahah yes I can try, I try to keep useful stuff all on wikitech so I can check when needed
[15:42:09] my brain acts as LRU cache, sometimes I forget things :D
[15:43:46] elukey: hehe, same here. I rely solely on bash history and browser history and without them.. lost :)
[15:47:25] serviceops, Datacenter-Switchover, Release-Engineering-Team (Radar): Switch deployment server to codfw (July 2021) - https://phabricator.wikimedia.org/T285820 (Legoktm) >>! In T285820#7187576, @thcipriani wrote: > For clarity, @Legoktm should we halt deployments for this switchover? No one should be...
[15:52:22] created https://wikitech.wikimedia.org/wiki/User:Elukey/MachineLearning/kfserving#General_Debugging :)
[16:46:44] hey folks, I just saw https://phabricator.wikimedia.org/T249929
[16:47:03] after istio tells me
[16:47:03] horizontal-pod-autoscaler unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server could not find the requested resource (get pods.metrics.k8s.io)
[16:47:20] I guess that we don't have HPA anywhere and I am the first one right?
[16:51:24] I remember us not having the metrics server setup that is needed for HPA
[16:53:08] wkandek: thanks!
[19:21:10] serviceops, observability, Performance-Team (Radar), Wikimedia-maintenance-script-run: Ingest logs from scheduled maintenance scripts at WMF in Logtash - https://phabricator.wikimedia.org/T285896 (Krinkle)
[19:25:34] serviceops, Wikimedia-General-or-Unknown, observability, Performance-Team (Radar), Sustainability (Incident Followup): Ingest logs from scheduled maintenance scripts at WMF in Logtash - https://phabricator.wikimedia.org/T285896 (Krinkle)
[19:29:35] serviceops, Wikimedia-General-or-Unknown, observability, Performance-Team (Radar), Sustainability (Incident Followup): Ingest logs from scheduled maintenance scripts at WMF in Logstash - https://phabricator.wikimedia.org/T285896 (Aklapper)