[06:58:05] good morning!
[06:58:41] I am able to query the istio ingress gw with HTTPS and target the various backends with ML models \o/
[06:59:19] we are almost ready to get a VIP for inference.discovery.wmnet
[07:19:00] serviceops, Prod-Kubernetes, Toolhub, Kubernetes: Maintenance environment needed for running one-off commands - https://phabricator.wikimedia.org/T290357 (JMeybohm)
[07:20:33] elukey: yay
[07:21:26] I kind of rebuilt that in my minikube on friday I guess. :) Do I remember correctly that you included the puppet-ca deb package in the istio ingress container?
[07:23:10] jayme: o/ I am using the buster seed image, it is included in it? Otherwise I don't recall having added it
[07:23:22] if it is not then I am a little confused why TLS termination works :D
[07:23:32] checking
[07:23:47] I don't think it's included in the buster images tbh
[07:24:23] see my day started happily and now I am sad again
[07:24:35] oh..I'm so sorry :|
[07:24:43] ahahahahha
[07:24:55] nono please I was joking, this is a great point, I am checking now
[07:26:04] nope wmf-certificates is not included
[07:26:22] hmm..interesting
[07:26:33] maybe you don't tls from ingress to service?
[07:27:44] yes I was about to say that, caffeine level too low.. the istio ingress gw is not a TLS client in my case, it just terminates TLS and then uses HTTP to connect to the backend nodes
[07:28:46] that makes things clear :)
[07:28:46] if we were to use the full mesh TLS etc.. then yes we'd need the cert packages on the images (or probably cert-manager handling this for us)
[07:29:01] jayme: thanks for the brainbounce, I am happy again :D
[07:38:25] jayme: the kfserving stack seems to be doing a "nice" work in configuring istio - knative takes care of the base Gateway config (so HTTP/HTTPS etc..) and kfserving adds the VirtualService backends. In Knative's config there is a setting (among the istio configs) to select the target backends / virtualhosts to enable, and when deployed it creates istio routes
[07:39:07] that can be inspected via istioctl
[07:39:26] (I am using "nice" since I can't really be happy about k8s on monday morning)
[07:40:51] so the idea is to have different groups of ML models/backends in various namespaces (like grouping some ORES models, etc..) and then a single istio gateway to target them (using knative to select which ones)
[07:41:19] how'd you inspect with istioctl?
[07:41:23] this starts to look like something understandable (after months of desperation)
[07:42:39] bd808_: regarding logstash: Your logs should totally show up there but I'm unable to gather them via the "App Logs" dashboard for some (probably kibana) reason. Please fall back to Kibana Discover mode in that case (https://logstash.wikimedia.org/goto/7a29c4467206534cd66d8caa9c0cad3f)
[07:42:46] jayme: https://phabricator.wikimedia.org/P17224
[07:43:11] sweet
[07:43:58] left over Q from friday: Is the istio 1.6 binary still needed for something (shipped in the istio package)?
[07:46:17] nono I think we can remove it, the idea was to leave it there as "supported" version until we proved that 1.9.5 worked fine
[11:26:13] serviceops, SRE: [Cloud VPS alert][packaging] Puppet failure on builder-envoy-03.packaging.eqiad.wmflabs - https://phabricator.wikimedia.org/T290430 (JMeybohm) Open→Resolved p:Triage→Medium
[13:09:10] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto) >>! In T251305#6886431, @JMeybohm wrote: > helm test annotations changed a bit: > >> Note that until Helm v3, the job definition needed to contain one of these helm test hook ann...
[13:42:43] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (JMeybohm) >>! In T251305#7334407, @Jelto wrote: > I checked all charts for deprecated and removed helm annotations. We don't use `"helm.sh/hook": test-failure`. This annotation is remove...
[13:44:10] serviceops, MW-on-K8s, SRE: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (akosiaris) With @jijiki we went ahead and create some percentiles comparisons between `mw2254` and `pinkunicorn`. We chose to have the exact same number of php fpm workers (96) as an inv...
[13:53:10] serviceops, GitLab, Release-Engineering-Team (Next), User-brennen: GitLab major version upgrade: 14.x - https://phabricator.wikimedia.org/T289802 (MoritzMuehlenhoff) I've uploaded 14.0.10, we can bump the import hook after the initial update is complete.
[14:04:34] serviceops, Lift-Wing, Kubernetes, Machine-Learning-Team (Active Tasks): Discussion: dedicated directory in the deployment-chart repository for ML services - https://phabricator.wikimedia.org/T286791 (elukey) Coming back to this task :) For the `admin_ng` directory Joe came up with this trick to...
[16:36:10] anybody familiar with wikifeefs?
[16:36:14] *wikifeeds?
[16:36:21] we are trying to debug an alarm but it is very weird
[16:59:55] looks like shellbox is generating a lot of logs these days
[17:29:07] some notes from the debugging that we did:
[17:30:56] hnowlan: I see "The node was low on resource: ephemeral-storage" as cause of eviction
[17:58:51] serviceops, SRE: Pods in evicted state for various namespaces in k8s main - https://phabricator.wikimedia.org/T290444 (elukey) p:Triage→High
[18:00:49] serviceops, SRE: Pods in evicted state for various namespaces in k8s main - https://phabricator.wikimedia.org/T290444 (hnowlan) My running theory on this is that shellbox is currently generating a lot of logs (dozens of lines a second) - the file is 12GB on kubernetes2017 atm but could easily be other se...
[18:09:06] serviceops, SRE, Wikifeeds: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (elukey)
[18:12:18] serviceops, SRE, Wikifeeds: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (hnowlan)
[18:19:25] serviceops, Prod-Kubernetes, Shellbox, Kubernetes: Docker container logs (stdout, stderr) can grow quite large - https://phabricator.wikimedia.org/T289578 (JMeybohm) Discovered again by our dear colleagues @hnowlan and @elukey in T290444 Official CRI docs suggest setting max-size in docker config...
[18:35:01] serviceops, SRE: Pods in evicted state for various namespaces in k8s main - https://phabricator.wikimedia.org/T290444 (JMeybohm) Evictions actually happened this morning: ` # kubectl -n wikifeeds get po --field-selector=status.phase=Failed -o custom-columns="NAME:.metadata.name,STATUS:.status.reason,TIME...
[20:05:42] serviceops, SRE, Wikifeeds: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (JMeybohm) https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?viewPanel=17&orgId=1&from=now-7d&to=now&var-datasource=thanos&var-site=codfw&var-prometheus...
[20:43:16] serviceops, SRE, Wikifeeds: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (JMeybohm) In logstash there is a huge amount of 503/504 upstream errors reported by wikifeeds (the app, not tls-proxy) (https://logstash.wikimedia.org/goto/6f2cd8f9fe...
[21:22:10] serviceops, SRE, Wikifeeds: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (JMeybohm) I'm pretty tired already, but I kind of feel stuck at the point of wikifeeds envoy keep failing with UF **to restbase** (if I'm not reading this wrong) and...
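
Editor's note on the eviction check quoted at 18:35:01: the kubectl command in the log is truncated, so it is not completed here. As a rough sketch of the same idea, the snippet below filters a parsed PodList (the JSON shape that `kubectl get pods -o json` returns) for pods whose `status.phase` is `Failed` and whose `status.reason` is `Evicted`. The function name `evicted_pods`, the pod names, and the sample data are hypothetical and only for illustration; they are not taken from the cluster discussed above.

```python
def evicted_pods(pod_list):
    """Return (namespace, name, message) for each evicted pod in a PodList dict.

    Assumes the standard Pod fields: evicted pods are reported with
    status.phase == "Failed" and status.reason == "Evicted".
    """
    rows = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
            meta = pod.get("metadata", {})
            rows.append(
                (
                    meta.get("namespace", ""),
                    meta.get("name", ""),
                    status.get("message", ""),
                )
            )
    return rows


# Hypothetical sample data, shaped like `kubectl get pods -o json` output.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "wikifeeds", "name": "example-pod-1"},
            "status": {
                "phase": "Failed",
                "reason": "Evicted",
                "message": "The node was low on resource: ephemeral-storage.",
            },
        },
        {
            "metadata": {"namespace": "wikifeeds", "name": "example-pod-2"},
            "status": {"phase": "Running"},
        },
    ]
}

if __name__ == "__main__":
    for namespace, name, message in evicted_pods(SAMPLE):
        print(f"{namespace}/{name}: {message}")
```

To try it against real data, one option would be to save the output of `kubectl get pods --all-namespaces -o json` to a file, load it with `json.load`, and pass the resulting dict to `evicted_pods`.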