[01:05:44] oops, we probably shouldn't have both https://wikitech.wikimedia.org/wiki/Kubectl and https://wikitech.wikimedia.org/wiki/Kubernetes/Kubectl
[01:05:59] too late in my day but I'll merge and redirect tomorrow, if no one beats me to it
[01:26:36] done! pages merged / redirected
[01:26:41] afk
[02:24:00] mutante: ah thanks! <3
[06:08:42] good morning!
[06:09:07] I found more interesting info about the tcp i/o timeout that I mentioned yesterday
[06:09:43] if I nsenter the knative controller (the container that tries to fetch the docker image's digest/sha) I see:
[06:09:46] elukey@ml-serve1004:~$ sudo nsenter -t 2183 -n telnet docker-registry.discovery.wmnet 443
[06:09:49] Trying 10.2.1.44...
[06:09:52] elukey@ml-serve1004:~$ sudo nsenter -t 2183 -n telnet docker-registry.wikimedia.org 443 -4
[06:09:55] Trying 208.80.154.224...
[06:09:57] Connected to dyna.wikimedia.org.
[06:10:06] that explains the issues that I am seeing
[06:14:19] (the ml-serve cluster still has GlobalNetworkPolicies: {}, we have a task for it but so far no restrictions)
[06:14:41] (s/still has/has)
[06:21:50] <_joe_> elukey: yes, that is what I told you already
[06:22:09] <_joe_> I think the firewall is on the side of the registry
[06:22:24] <_joe_> let me finish dealing with clinic duty and I'm with you
[06:23:55] no rush, I was confused by the fact that most of our images are already pulled from the docker registry's discovery endpoint, but now I realize that in this case the connection is made from pod IPs
[06:24:01] not the underlying hosts
[06:26:08] but the docker registry's profile says DOMAIN_NETWORKS
[06:28:30] <_joe_> that also includes the pod IPs
[06:29:45] I'll try to narrow down where I get blocked :)
[07:04:44] to keep the archives happy - I need to add the pod/svc IPs to network's data.yaml (TIL)
[07:05:12] <_joe_> yeah that file is what posterity will remember akosiaris for
[07:05:24] <_joe_> his biggest contribution to open knowledge
[07:06:25] * _joe_ imagines akosiaris staring at network::constants' code and declaring "Exegi monumentum aere perennius"
[07:21:52] https://gerrit.wikimedia.org/r/c/operations/puppet/+/724933 :)
[08:16:18] wait what?
[08:16:30] the pod is trying to fetch a docker image ?
[08:16:38] why would it do that?
[08:17:22] akosiaris: no no, only its digest, it is needed by knative's controller to couple it with a new revision
[08:18:41] https://knative.dev/docs/developer/serving/tag-resolution/
[08:19:00] heh, you know, when that cluster has an outage, I would love to see who is going to be able to debug anything in it. The more I hear about kf, the more I paint a picture of badly boiled spaghetti with a lot of cream sauce to hide the fact
[08:19:11] you can bash that btw
[08:20:47] I don't think it is that bad, the various gears are not easy to pick up at the beginning but so far everything looks legit (even if very new and changing, of course). I mean, I expected way worse, and my k8s experience is low, so this is probably why I am a little positive about the stack :)
[08:21:34] in this case what knative is trying to do, IIUC, is just to get a more precise reference of what docker image is being used for a revision so that it can roll back to it etc. if needed
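The tag-resolution page linked above also documents a way to opt specific registries out of digest resolution entirely, via Knative Serving's `config-deployment` ConfigMap, which is one escape hatch if the controller keeps timing out against the registry. A minimal sketch, assuming a Knative release that uses the kebab-case key (older releases spell it `registriesSkippingTagResolving`):

```yaml
# Sketch only: skip tag-to-digest resolution for the internal registries so the
# knative controller never has to reach them over the network. Key spelling
# varies by Knative Serving version; the registry names mirror the ones in
# the log above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving
data:
  registries-skipping-tag-resolving: "docker-registry.discovery.wmnet,docker-registry.wikimedia.org"
```

Skipping resolution trades away the roll-back-to-exact-digest behaviour described just above, so fixing the network path (as done further down with data.yaml) is the more complete fix.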
[08:22:54] now you are free to tell me to shut up and undeploy the kf stack now :D
[08:23:34] that won't be me ;-)
[08:24:18] too kind :)
[08:25:02] in any case, I left some doubts/question marks on https://gerrit.wikimedia.org/r/c/operations/puppet/+/724933, if you have time to review it later on I'd be sooo grateful :)
[08:25:28] and already reviewed, super quick :D
[08:47:00] https://gerrit.wikimedia.org/r/c/operations/puppet/+/724940 and followup elukey:
[08:47:10] I think my intent is clear there
[08:47:19] it is yes!
[08:47:26] it should allow you to add the mlserve cluster in a much easier way
[08:47:41] let me run a PCC though first
[08:48:06] do you think that I should add a slice for the ml-serve cluster? For the moment I'd just need it to be part of domain_networks, no other real needs
[08:48:26] (it can be added at a later stage too)
[08:49:18] definitely add a slice for later, but the ml-serve cluster (node and pods) should already be part of $domain_networks anyway
[08:49:41] it's slicing per $::realm, so the moment you add them, they are going to be there
[08:49:54] em, I am actually contradicting myself, I just saw that
[08:50:15] in my pcc they get added to DOMAIN_NETWORKS, that is what I needed
[08:50:24] I meant it will be part of $domain_networks once you merge your change. But yes please, do add a slice specifically for mlserve pods/svcs
[08:50:38] ack perfect, will update my patch after yours
[08:53:46] <_joe_> akosiaris: do I remember incorrectly, or were we supposed to feed all that data to puppet from netbox?
[08:56:50] you remember correctly
[08:57:13] <_joe_> that would be quite a bit better indeed
[09:00:32] elukey: change merged, feel free to rebase and amend yours on top of it
[09:00:40] akosiaris: I need to run a pcc on maps* hosts
[09:01:42] effie: go ahead
[09:01:50] yes yes
[09:08:00] akosiaris: please revert
[09:08:17] let me run puppet on maps, but it might fail
[09:09:03] yeah, it fails
[09:09:13] I need to run, I can have a look later
[09:10:39] effie: I got a followup specifically for maps ready
[09:10:49] ah ok, let me abandon the revert
[09:10:50] https://gerrit.wikimedia.org/r/c/operations/puppet/+/724941
[09:11:00] I'll merge that one instead
[09:11:47] alright, if pcc agrees, great
[09:11:48] bbl
[09:24:20] yeah, I should have coded the first patch a bit more defensively. I did create an old compat var for ferm but not for puppet. I needed one there too
[09:24:31] my mistake, but the followup fixes it. I'll merge it
[10:52:03] 10serviceops, 10Kubernetes, 10Sustainability (Incident Followup): Investigate whether running >1 replicas of calico-typha is feasible and prudent - https://phabricator.wikimedia.org/T292077 (10akosiaris) [Docs](https://docs.projectcalico.org/getting-started/kubernetes/self-managed-onprem/onpremises#install-c...
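On the typha task just above: the calico docs linked there treat calico-typha as an ordinary Deployment, so running more than one replica is largely a matter of raising `spec.replicas` (keeping it below the node count is assumed here, since typha runs host-networked on a fixed port, which is presumably why akosiaris suggests two replicas for a four-node cluster further down). A rough, hypothetical fragment rather than the manifest actually in use:

```yaml
# Hypothetical fragment: two typha replicas. typha uses host networking and a
# fixed port (5473 by default), so replicas beyond the node count cannot all
# schedule; the image reference is a placeholder.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: calico-typha
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      k8s-app: calico-typha
  template:
    metadata:
      labels:
        k8s-app: calico-typha
    spec:
      hostNetwork: true
      containers:
        - name: calico-typha
          image: docker-registry.discovery.wmnet/calico/typha:placeholder  # placeholder tag
          ports:
            - name: calico-typha
              containerPort: 5473
              hostPort: 5473
              protocol: TCP
```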
[11:49:10] 10serviceops, 10Scap, 10Patch-For-Review: Scap error when deploying kartotherian - https://phabricator.wikimedia.org/T291990 (10Jgiannelos) I just ran a deployment on one of the maps nodes and it looks like it works with 3.17.1. Thanks @jijiki.
[11:49:20] 10serviceops, 10Release-Engineering-Team, 10Scap, 10Patch-For-Review: scap's canary check gives confusing logstash link - https://phabricator.wikimedia.org/T291870 (10hashar) The [[ https://logstash.wikimedia.org/app/dashboards#/view/1c3a4d80-35c2-11e7-b186-d1bc9cbdde4c | scap canary dashboard ]] is essent...
[11:50:27] 10serviceops, 10Scap, 10Release-Engineering-Team (Doing): Deploy Scap version 4.0.1 - https://phabricator.wikimedia.org/T291095 (10jijiki) @dancy it would be lovely if we can speed this up, right now we have `deploy1002` and `maps*` on version 3.17.1, and the rest on version 4.0.0.
[11:54:43] 10serviceops, 10Scap, 10Release-Engineering-Team (Doing): Deploy Scap version 4.0.1 - https://phabricator.wikimedia.org/T291095 (10hashar) There are some more Python 3 related issues that need to be addressed such as {T291990} - https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/724527
[12:09:16] 10serviceops, 10SRE, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10hnowlan) wikidiff2 1.13.0 is now installed on the beta cluster.
[12:09:28] 10serviceops, 10SRE, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10hnowlan)
[12:16:20] 10serviceops, 10SRE, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10Joe) @hnowlan please remember to also rebuild the corresponding docker image when rolling out to p...
[12:17:53] 10serviceops, 10SRE, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10hnowlan)
[12:47:58] 10serviceops, 10SRE, 10wikidiff2, 10Community-Tech (CommTech-Sprint-10), 10Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (10Daimona) Thanks @hnowlan. @dom_walden, @imaigwilo you should now be able to test the related ticke...
[13:20:39] elukey: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/724957
[13:21:05] you probably will want a similar change for mlserve. But you got 4 nodes IIRC, so something like 2? 4 btw would be detrimental.
[13:34:20] quick q: swift and swift-ro on codfw have been depooled at the DNS discovery level for quite some time now. Should we repool them ?
[13:41:36] 10serviceops, 10Kubernetes, 10Patch-For-Review, 10Sustainability (Incident Followup): Investigate whether running >1 replicas of calico-typha is feasible and prudent - https://phabricator.wikimedia.org/T292077 (10akosiaris) p:05Medium→03Low `services/eqiad` and `services/codfw` clusters are now running...
[13:41:54] akosiaris: ack yes!
[14:03:19] 10serviceops, 10MW-on-K8s, 10SRE, 10SRE Observability, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Joe) After much deliberation, @akosiaris and I decided we'll go the following way: * Install an rsyslogd sidecar that will be used by mediawiki...
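To make the sidecar plan in the comment above a bit more concrete, here is a rough sketch of the shape being discussed: the mediawiki container and an rsyslogd sidecar sharing a socket directory through an emptyDir volume. Image names, mount paths and volume names are invented for illustration and are not the actual chart.

```yaml
# Sketch only: mediawiki logs to a unix socket owned by an rsyslogd sidecar.
# Everything below (image refs, mount path, volume name) is a placeholder.
spec:
  containers:
    - name: mediawiki
      image: docker-registry.discovery.wmnet/mediawiki:placeholder
      volumeMounts:
        - name: syslog-socket
          mountPath: /var/run/mw-syslog   # mediawiki/php-fpm writes syslog messages here
    - name: rsyslog
      image: docker-registry.discovery.wmnet/rsyslog:placeholder
      volumeMounts:
        - name: syslog-socket
          mountPath: /var/run/mw-syslog   # rsyslogd opens its input socket here and ships logs out
  volumes:
    - name: syslog-socket
      emptyDir: {}
```

The hostPath variant _joe_ floats later in the day would replace the emptyDir with a hostPath mount of a socket opened by the node's own rsyslogd.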
[15:50:36] akosiaris _joe_ about https://phabricator.wikimedia.org/T288851
[15:50:54] would it make sense to have rsyslog as a daemonset ?
[15:51:00] <_joe_> no
[15:51:32] why ?
[15:51:33] we already have rsyslog running on the nodes
[15:51:35] <_joe_> it would need to accept logs from any pod, and logging to it would anyways be tricky, the usual issues with daemonsets
[15:51:50] <_joe_> we would've rather used that, yes
[15:52:02] alright then
[15:52:11] <_joe_> I even thought of having the on-host rsyslogd open a socket, and then export it via hostpath
[15:52:21] <_joe_> it would be in many ways the simplest solution
[15:53:16] btw we need to increase the pod memory limits, I think we are almost maxed out
[15:53:24] since we are adding another sidecar
[15:53:24] <_joe_> which pods?
[15:53:33] mediawiki pods
[15:53:35] <_joe_> oh mediawiki
[15:53:37] <_joe_> possibly
[15:53:37] limitranges
[15:53:42] they are right on the limit IIRC
[15:53:43] <_joe_> yeah I got it now
[15:53:56] sorry, didn't phrase it properly
[15:54:20] max:
[15:54:20] memory: "4Gi"
[15:54:20] cpu: "9"
[15:54:26] yeah that 4 needs a bump
[15:54:29] <_joe_> akosiaris: we could also use the hostpath :D
[15:54:35] but that's a job for tomorrow
[15:54:41] <_joe_> yeah definitely
[15:55:31] I like the idea of using the host's rsyslog, but I have no idea of the complexity that comes with it
[15:59:09] it's quite a bit
[15:59:15] it's not fully off the table
[15:59:47] but it is possible having a dedicated one is saner.
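For the limit bump discussed above, a hedged sketch of what the LimitRange change could look like; the object name, namespace and the new 5Gi ceiling are placeholders, not the value that was eventually agreed on.

```yaml
# Sketch only: raise the memory ceiling so the extra rsyslog sidecar fits.
# Whether this applies at Pod or Container scope depends on the real LimitRange.
apiVersion: v1
kind: LimitRange
metadata:
  name: mediawiki-limits      # placeholder name
  namespace: mediawiki        # placeholder namespace
spec:
  limits:
    - type: Pod
      max:
        memory: "5Gi"   # previously 4Gi, per the snippet pasted above
        cpu: "9"
```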
[16:02:35] 10serviceops: Productionise thumbor1005, thumbor1006, thumbor2005 and thumbor2006 - https://phabricator.wikimedia.org/T285477 (10Papaul)
[16:03:37] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Papaul)
[16:04:37] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Papaul)
[16:05:00] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (10Papaul) 05Open→03Resolved complete
[16:05:22] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10Papaul)
[16:05:51] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (10Papaul) 05Open→03Resolved complete
[16:13:37] 10serviceops: Migrate WMF Production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (10Reedy) >>! In T271736#7373538, @Reedy wrote: >>>! In T271736#7335188, @tstarling wrote: >> Reading https://github.com/ruflin/Elastica/issues/1913 , it looks like the way out of that infinite regression is...
[17:12:57] 10serviceops, 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Event-Platform: Enable envoy tls proxy logging from eventgate - https://phabricator.wikimedia.org/T291856 (10odimitrijevic) p:05Triage→03High
[17:16:07] 10serviceops, 10Analytics-Radar, 10Data-Engineering, 10Platform Engineering, 10Wikibase change dispatching scripts to jobs: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (10odimitrijevic)
[20:06:41] 10serviceops, 10SRE: Remove libvips-tools from mediawiki appservers - https://phabricator.wikimedia.org/T290802 (10Legoktm) 05Stalled→03Open This is unblocked now that Special:VipsTest has been disabled.
[20:30:43] 10serviceops, 10MediaWiki-General, 10SRE, 10observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10Krinkle) >>! In T240685#7392652, @gerritbot wrote: > Change 721626 **merged** by jenkins-bot: > %%%[mediawiki/core@master] Metrics: Implement statsd-exporter...
[21:00:13] 10serviceops, 10Release-Engineering-Team, 10GitLab (Infrastructure), 10User-brennen: GitLab minor release: 14.3.1 - https://phabricator.wikimedia.org/T292256 (10brennen)
[21:00:38] 10serviceops, 10Release-Engineering-Team, 10GitLab (Infrastructure), 10User-brennen: GitLab minor release: 14.3.1 - https://phabricator.wikimedia.org/T292256 (10brennen) p:05Triage→03Medium