[09:51:37] serviceops, Data-Catalog, Data-Engineering, SRE, Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (JMeybohm) p:Triage→Medium
[11:14:12] hello folks
[11:14:30] I tried to fix the calico, cfssl-issuer, knative-serving charts with https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/768681
[11:14:46] but I am getting an error from CI (added more info to the error msg)
[11:15:06] all things like
[11:15:07] "Error: found in Chart.yaml, but missing in charts/ directory: calico-crds"
[11:15:29] not sure if this requires a helm dependency update or similar
[11:15:37] or if I am missing something in the new config
[11:42:52] (need to go, will keep working on it later)
[12:44:35] elukey: hm, yea. I think with proper dependencies defined this now needs a "helm dependency build" prior to linting
[12:45:05] if we don't want to check in a copy of the dependency at least...
[13:26:40] serviceops, Add-Link, Growth-Team (Current Sprint): Investigate increase and fluctuation in max CPU for linkrecommendation-internal container - https://phabricator.wikimedia.org/T303177 (kostajh)
[13:27:04] serviceops, Add-Link, Growth-Team (Current Sprint): Investigate increase and fluctuation in max CPU for linkrecommendation-internal container - https://phabricator.wikimedia.org/T303177 (kostajh)
[13:29:47] serviceops, Add-Link, Growth-Team (Current Sprint): Investigate increase and fluctuation in max CPU for linkrecommendation-internal container - https://phabricator.wikimedia.org/T303177 (kostajh) It seems to start when this change is deployed: > 08:08 urbanecm@deploy1002: Synchronized wmf-config/Pro...
[13:35:38] serviceops, Add-Link, Growth-Team (Current Sprint): Investigate increase and fluctuation in max CPU for linkrecommendation-internal container - https://phabricator.wikimedia.org/T303177 (kostajh) p:Triage→Medium
[14:13:40] jayme: thanks, I tried to add dependency build just before lint and it worked, going to find a good place to add it and then I'll update the CR
[14:26:36] serviceops, Add-Link, Growth-Team (Current Sprint): Investigate increase and fluctuation in max CPU for linkrecommendation-internal container - https://phabricator.wikimedia.org/T303177 (JMeybohm) This change seems to also come with a lower avg latency. The values from [[ https://grafana-rw.wikimedia...
[14:35:46] thanks for taking care of that elukey
[14:36:46] jayme: qq - do we want to add the Chart.lock file? (generated via dependency update)
[14:37:12] IIUC if not present helm dependency build will do it, but the other charts that we have atm all have the Chart.lock file
[14:38:25] yeah, it's vendoring the exact dependency version AIUI. So in my understanding it's wise to have that in git
[14:39:42] ofc. that would require a manual "helm dependency update" to build a new lock file if one changes the dependencies...
[14:39:49] I am wondering if it is enough to make the actual lint succeed, otherwise we'd see errors for the other charts that have deps, no?
[14:40:02] (trying)
[14:40:32] no, weird, still failing
[14:43:07] I think the other charts with dependencies (and without typo) do have proper Chart.lock files in place already
[14:43:48] one thing that the other charts do is to ship the charts dir
[14:44:02] containing stuff like calico-crds-0.1.0.tgz
[14:44:25] trying with those as well (it is generated by helm dependency update)
[14:51:24] yeah it works
[14:51:27] updating the CR
[14:57:10] hmm..weird. I thought that would not be needed as helm dependency build will download them again in case they are not there
[15:10:30] jayme: yes exactly it is my understanding as well, but the other deps in other charts are managed "manually"
[15:16:54] serviceops, Prod-Kubernetes, Kubernetes: High API server request latencies (LIST) for istio API groups - https://phabricator.wikimedia.org/T303184 (JMeybohm)
[15:24:12] serviceops, Prod-Kubernetes, Kubernetes: High API server request latencies (LIST) for istio API groups - https://phabricator.wikimedia.org/T303184 (elukey) Two interesting things logged on one of the istiod pods: ` {"level":"error","time":"2022-03-07T14:24:56.360678Z","scope":"klog","msg":"error ret...
[15:47:09] serviceops, Prod-Kubernetes, Kubernetes: High API server request latencies (LIST) for istio API groups - https://phabricator.wikimedia.org/T303184 (JMeybohm) >>! In T303184#7757122, @elukey wrote: > Two interesting things logged on one of the istiod pods: > > ` > {"level":"error","time":"2022-03-07T...
[16:01:08] serviceops, SRE: enhance otrs alerting - https://phabricator.wikimedia.org/T303190 (Arnoldokoth)
[16:02:25] serviceops, SRE: investigate otrs database grants - https://phabricator.wikimedia.org/T303191 (Arnoldokoth)
[16:13:26] serviceops, Prod-Kubernetes, Kubernetes: High API server request latencies (LIST) for istio API groups - https://phabricator.wikimedia.org/T303184 (JMeybohm) Killing `istiod-69d679d8b5-hm64j` actually brought the latency down again
[16:31:03] serviceops, Prod-Kubernetes, Kubernetes: High API server request latencies (LIST) for istio API groups - https://phabricator.wikimedia.org/T303184 (JMeybohm)
[16:36:32] serviceops, Prod-Kubernetes, Kubernetes: High API server request latencies (LIST) for istio API groups - https://phabricator.wikimedia.org/T303184 (JMeybohm)
[17:03:37] Another resource bump for jobqueue if anyone has a sec https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/768760
[17:03:48] s/bump/change/ I guess
[17:03:56] Do any k8s services run without CPU limits?
[17:12:13] I don't think so. At least no "workload services"
[20:06:50] skimming the Envoy release notes in prep for the 1.15 -> 1.18 upgrade, and this is tickling something in the back of my head:
[20:06:56] > * http: no longer adding content-length: 0 for requests which should not have bodies. This behavior can be temporarily reverted by setting `envoy.reloadable_features.dont_add_content_length_for_bodiless_requests` false.
[20:07:47] did we have an issue with envoy involving `Content-Length: 0` on empty-body responses? anybody remember? I feel like there was something but that's all I've got
[20:08:01] scuse me, empty-body requests I mean
[20:11:41] I don't recall that, but it triggered a related memory, which the breadcrumbs of might offer some insight into why this ever matters
[20:11:43] ohh I think I was thinking of https://phabricator.wikimedia.org/T288815
[20:11:45] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/templates/wikimedia-frontend.vcl.erb#1147
[20:11:54] -> https://phabricator.wikimedia.org/T64245
[20:12:12] (about us *adding* CL:0 to empty-body responses to fix some remote proxy, way back when)
[20:12:29] ha, cool
[20:13:39] T288815 is resolved and was also on the response side (rather than the request side which is what this update touches) so I think we're good
[20:14:09] but that's neat to read about, what a mess
[20:22:57] serviceops, SRE, Znuny: investigate otrs database grants - https://phabricator.wikimedia.org/T303191 (Peachey88)
[20:23:06] serviceops, SRE, Znuny: enhance otrs alerting - https://phabricator.wikimedia.org/T303190 (Peachey88)
[21:11:44] serviceops, SRE, Znuny: enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (Aklapper)
[21:12:00] serviceops, SRE, Znuny: enhance Znuny (otrs) alerting - https://phabricator.wikimedia.org/T303190 (Aklapper) Hi @Arnoldokoth, the lack of a task description makes it hard for others to help or contribute, for a triager/tester to figure out at some point in the future whether this is still a valid tas...
[21:48:20] neat, this is probably worth looking into https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/cluster.proto#envoy-v3-api-msg-config-cluster-v3-cluster-preconnectpolicy
[21:49:58] I doubt the connection overhead is a major performance drag the way we have things configured, but
[22:53:49] serviceops, SRE, Traffic, envoy: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (RLazarus)
[22:55:33] serviceops, SRE, Traffic, envoy, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus)
[22:55:41] serviceops, SRE, Traffic, envoy: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (RLazarus) Open→Stalled p:Triage→Low
[23:02:34] serviceops, SRE, Traffic, envoy: Refactor envoy access_log_path to access loggers - https://phabricator.wikimedia.org/T303231 (RLazarus)
[23:05:08] serviceops, SRE, Traffic, envoy: Refactor envoy access_log_path to access loggers - https://phabricator.wikimedia.org/T303231 (RLazarus) p:Triage→Medium