[06:10:09] serviceops, DBA, Phabricator, serviceops-collab, and 2 others: sort out mysql privileges for phab1004/phab2002 - https://phabricator.wikimedia.org/T315713 (Marostegui) @Aklapper this is now fixed: ` aklapper@phab1001:~$ mysql Reading table information for completion of table and column names You...
[08:26:38] hello folks
[08:26:57] helm is not happy for staging-eqiad and eventstreams-internal
[08:26:58] 6 Tue Sep 20 20:02:32 2022 superseded eventstreams-0.5.0 Upgrade "main" failed: failed to create resource: Service "eventstreams-main-tls-service" is invalid: spec.ports[0].nodePort: Invalid value: 4992: provided port is already allocated
[08:27:02] 7 Tue Sep 20 20:02:33 2022 failed eventstreams-0.4.1 Rollback "main" failed: no NetworkPolicy with the name "eventstreams-internal-main" found
[08:27:56] seems related to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/833447
[08:28:47] ah ok https://phabricator.wikimedia.org/T310721 mentions the issue
[09:01:40] yeah, that change really messed up the release :/
[09:07:30] jayme: o/
[09:08:14] I did ack the alert for now
[09:45:18] serviceops, serviceops-collab, GitLab (Infrastructure), Patch-For-Review: Migrate gitlab-test instance to puppet - https://phabricator.wikimedia.org/T297411 (Jelto) `gitlab-prod-1001` had failing puppet runs (root@ mail): ` Sep 24 08:12:47 gitlab-prod-1001 systemd[1]: Starting OpenBSD Secure She...
[09:45:31] serviceops, serviceops-collab, GitLab (Infrastructure), Patch-For-Review: Migrate gitlab-test instance to puppet - https://phabricator.wikimedia.org/T297411 (Jelto) Open→Resolved
[09:53:07] serviceops, serviceops-collab, GitLab (Infrastructure): Migrate gitlab-test instance to bullseye - https://phabricator.wikimedia.org/T318521 (Jelto)
[10:29:23] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Followup): Raise an alarm on container restarts/OOMs in kubernetes - https://phabricator.wikimedia.org/T256256 (Aklapper) a: akosiaris→None Removing task assignee due to inactivity as this open task has been assigned for more tha...
[11:28:43] serviceops, serviceops-collab, Patch-For-Review: move micro sites from ganeti to kubernetes - https://phabricator.wikimedia.org/T300171 (Jelto) ### Usage of micro sizes @Dzahn as discussed last week, a short Turnilo overview about usage of the micro sites mentioned above: https://w.wiki/5kCz Keep in...
[12:02:42] serviceops, MW-on-K8s, Patch-For-Review, Release-Engineering-Team (Bonus Level 🕹️): Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (jnuche) Open→Resolved Scap is already generating Helmfile configuration files as described here...
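An aside on the 08:26–08:27 helm errors above: the upgrade failed because the chart asked for a NodePort (4992) that another Service in the cluster already held. A rough pre-flight check along the following lines can surface such collisions before a `helmfile apply`. This is a hypothetical helper, not part of the actual deployment-charts tooling; it only assumes the official `kubernetes` Python client and a working kubeconfig.

```python
# Hypothetical pre-flight check: list every NodePort already allocated in the
# cluster and see whether the port a chart pins (4992 in the error above) is free.
from kubernetes import client, config

WANTED_NODE_PORT = 4992  # the port the eventstreams chart tried to claim


def allocated_node_ports() -> dict[int, str]:
    """Map every allocated NodePort to the Service that owns it."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    ports: dict[int, str] = {}
    for svc in v1.list_service_for_all_namespaces().items:
        for port in (svc.spec.ports or []):
            if port.node_port:
                ports[port.node_port] = f"{svc.metadata.namespace}/{svc.metadata.name}"
    return ports


if __name__ == "__main__":
    owner = allocated_node_ports().get(WANTED_NODE_PORT)
    if owner:
        print(f"nodePort {WANTED_NODE_PORT} is already allocated by {owner}")
    else:
        print(f"nodePort {WANTED_NODE_PORT} is free")
```

The second error at 08:27:02 is a separate symptom: helm's rollback to 0.4.1 referenced a NetworkPolicy ("eventstreams-internal-main") it could not find, which is what left the release in the failed state noted in T310721.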
[12:02:54] serviceops, MW-on-K8s, Patch-For-Review, Release-Engineering-Team (Bonus Level 🕹️): Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (jnuche)
[12:03:24] serviceops, MW-on-K8s, Patch-For-Review, Release-Engineering-Team (Bonus Level 🕹️): Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (jnuche)
[12:15:38] serviceops, SRE: Update node 14/16 base images - https://phabricator.wikimedia.org/T318541 (MoritzMuehlenhoff)
[12:15:45] serviceops, SRE: Update node 14/16 base images - https://phabricator.wikimedia.org/T318541 (MoritzMuehlenhoff) p: Triage→Medium
[12:58:07] * inflatador reports for duty
[13:21:22] I'm shadowing service ops for the next 2 wks, so if anyone has a ticket or something they wanna work on together, reach out!
[13:33:14] hi inflatador o/
[13:34:04] unfortunately we won't be having a team meeting today
[13:34:38] aiui you've already been pointed towards flink-ish tasks, right? :)
[13:37:27] jayme yeah, I think that's the priority from gehel (aka my boss)
[14:37:11] <_joe_> inflatador: welcome!
[14:37:34] serviceops, serviceops-collab, Patch-For-Review: move micro sites from ganeti to kubernetes - https://phabricator.wikimedia.org/T300171 (JMeybohm) Sounds good to me!
[14:40:44] <_joe_> I was taking a look at the mediawiki-errors logstash dashboard, it seems like the errors are evenly distributed between php versions if we exclude the new more stringent notices
[14:48:30] <_joe_> and I stumbled upon T313973
[14:48:45] <_joe_> jayme: did you ever try to find out what was wrong there?
[14:49:26] _joe_: yes. Actually I wanted to chat with you about it last week...but time
[14:50:06] <_joe_> jayme: ok, we have time now
[14:50:08] my current level of understanding is that connections get terminated in flight
[14:50:18] <_joe_> and not due to timeouts?
[14:50:28] <_joe_> I see a lot of conflicting info on that task
[14:50:44] <_joe_> do we have evidence that failed requests take 10 seconds?
[14:50:56] <_joe_> actually, scratch that, I see now UC
[14:50:58] <_joe_> not UT
[14:51:10] <_joe_> ok, lol, I think I know what the problem is
[14:51:14] some terminate due to timeouts but most do not AIUI
[14:51:36] please elaborate :)
[14:51:39] <_joe_> yeah I think there is a problem either between the backend envoy and the ingress
[14:51:47] <_joe_> or between the ingress and the service
[14:52:03] <_joe_> we usually set keepalive to 4 seconds for nodejs services
[14:52:23] <_joe_> because the damn nodejs service has a keepalive timeout of ~ 5 seconds
[14:52:45] <_joe_> so let me try this the easy way
[14:53:31] ah...and ingress does not know, keeping connections open
[14:53:53] that's why we see 1h connection length in service-proxy on appservers?
[14:54:20] <_joe_> yeah so we need one in the chain of envoys to close connections before that damn nodejs
[14:54:40] (see recently added connection metrics in https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry)
[14:54:59] <_joe_> jayme: https://gerrit.wikimedia.org/r/835205
[14:55:19] 🤦
[14:55:50] <_joe_> I am 70% sure it will fix the issue
[14:55:55] <_joe_> minus the timeouts
[14:56:10] but...doesn't the service-proxy on image-suggestion side do that already?
[14:57:19] <_joe_> the tls terminator? no we didn't make the keepalive configurable there IIRC
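An aside on the 14:52–14:57 exchange: the convention _joe_ describes is that whatever talks to a Node.js service must stop reusing an idle keep-alive connection before Node's ~5 s keep-alive timeout fires, otherwise a request can be written onto a socket the server is closing. A minimal client-side sketch of that rule follows; it is illustrative only (the production approach is on the envoy side, per the change linked at 14:54), and the URL is a placeholder.

```python
# Minimal sketch: cap how long an idle keep-alive connection may be reused.
# 4.0 s mirrors the "stay below Node's ~5 s keep-alive" convention above.
import asyncio
import aiohttp

SERVICE_URL = "http://localhost:8080/healthz"  # hypothetical endpoint


async def main() -> None:
    # Reuse idle connections for at most 4 s, so the client always abandons a
    # connection before the upstream's ~5 s keep-alive timeout can close it
    # underneath an in-flight request.
    connector = aiohttp.TCPConnector(keepalive_timeout=4.0)
    async with aiohttp.ClientSession(connector=connector) as session:
        for _ in range(3):
            async with session.get(SERVICE_URL) as resp:
                print(resp.status)
            await asyncio.sleep(1)


if __name__ == "__main__":
    asyncio.run(main())
```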
[14:57:24] <_joe_> at least not on k8s
[14:57:39] <_joe_> anyways, the important part is that the downstream client and the upstream service "agree"
[14:58:18] I remember some kind of comment along those lines ...
[14:58:19] <_joe_> given how envoy works, 1 client connection -> 1 upstream connection, always
[14:58:38] <_joe_> well 1 client established connection I mean
[14:59:58] we do have an idle timeout of 4.5s set for the service-proxy
[15:01:24] <_joe_> on k8s?
[15:01:27] yep
[15:01:29] <_joe_> you mean I did it there too?
[15:01:35] <_joe_> I was sure I did not lol
[15:01:57] <_joe_> but still, as explained, sadly what counts is the final envoy in the chain I fear
[15:02:48] <_joe_> anyways, let's see if this change has any effect
[15:03:03] <_joe_> I want to clean house before making the final php 7.4 push tomorrow
[15:03:19] I fear I fail to follow. This https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/common_templates/0.4/_tls_helpers.tpl#238 is what ends up in the final envoy or am I wrong?
[15:04:45] <_joe_> no sorry with "final" I meant the client
[15:04:51] ah
[15:05:17] <_joe_> one day we will need to understand why envoys can't talk tls 1.3 to each other btw
[15:05:52] but then, still...image-suggestion is not a nodejs service
[15:07:02] or I'm totally off the track here
[15:07:23] but AIUI it's just the cassandra gateway thing (and that's golang)
[15:07:54] <_joe_> ok, yeah then uhm
[15:07:56] <_joe_> :P
[15:08:00] eheh
[15:08:02] <_joe_> kask has no such issues
[15:08:23] <_joe_> nor any other net/http based golang service I've experimented with
[15:10:27] I've not seen that as well
[15:12:43] so what looks odd in the metrics is the destroyed connections as well as the session length
[15:12:44] https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=image-suggestion&from=now-24h&to=now
[15:13:05] appservers (envoy) -> ingress
[15:16:30] ingress sees quite regular downstream connection terminations (DC) from image-suggestion https://grafana-rw.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s&var-namespace=image-suggestion&var-backend=All&var-response_code=All&var-quantile=0.95&var-quantile=0.99&from=now-24h&to=now
[15:16:44] serviceops, Discovery-Search (Current work): Coordinate with ServiceOps Team about a rework of the Search Update Pipeline - https://phabricator.wikimedia.org/T317283 (TJones) a: Gehel
[15:18:33] and the service-proxy (on image-suggestion side) reports quite some destroyed connections as well https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=cassandra-http-gateway&var-destination=All&from=now-24h&to=now
[15:28:04] <_joe_> case in point, the errors on the mediawiki side ended with my change it seems
[15:28:40] <_joe_> yes
[15:28:59] <_joe_> still too early to call, but it looks somewhat promising
[15:29:06] <_joe_> we'll know tomorrow morning
[15:29:48] let's resync then. I would like to figure out why (in both cases)
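One way to read _joe_'s "what counts is the final envoy in the chain" point: idle timeouts should shrink as you move from the service toward the client, so every hop abandons a connection before its upstream does. The toy checker below makes that ordering explicit; only the 4.5 s service-proxy idle timeout and Node's ~5 s keep-alive come from the discussion, the other figures (and the whole chain layout) are made-up placeholders.

```python
# Toy model: each hop's idle timeout must be strictly below the next upstream
# hop's, otherwise the downstream side may reuse a connection its upstream has
# already decided to close.
HOPS_AFTER = [
    # (name, idle timeout in seconds), ordered client -> service
    ("appserver envoy", 4.0),                 # hypothetical: capped by the 14:54 change
    ("istio ingressgateway", 4.2),            # hypothetical
    ("service-proxy (TLS terminator)", 4.5),  # from the discussion
    ("node.js service", 5.0),                 # Node's ~5 s keep-alive, from the discussion
]
# Hypothetical pre-change state: the first hop never gives up on idle connections.
HOPS_BEFORE = [("appserver envoy", float("inf"))] + HOPS_AFTER[1:]


def check_chain(hops: list[tuple[str, float]]) -> None:
    """Print, for each adjacent pair, whether the downstream hop closes first."""
    for (down_name, down_t), (up_name, up_t) in zip(hops, hops[1:]):
        verdict = "ok" if down_t < up_t else "RACE POSSIBLE"
        print(f"{down_name} ({down_t}s) -> {up_name} ({up_t}s): {verdict}")


if __name__ == "__main__":
    check_chain(HOPS_BEFORE)
    print()
    check_chain(HOPS_AFTER)
```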
[15:30:27] *than..as in tomorrow
[15:37:34] <_joe_> as I said, it's a chain of race conditions for sure, in this case I fear the ingress plays a role
[15:40:59] yeah...probably :/
[15:43:32] <_joe_> so what we obtain limiting the duration of connections is to make the race condition improbable enough that it is practically gone from our systems
[15:48:29] <_joe_> I would say we can talk more tomorrow
[16:21:04] serviceops, Dumps-Generation, Patch-For-Review, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Jdforrester-WMF)
[16:24:52] serviceops, Dumps-Generation, Patch-For-Review, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (taavi)
[16:31:46] serviceops, Dumps-Generation, Patch-For-Review, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Jdforrester-WMF)
[17:39:57] serviceops, SRE, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Krinkle)
[18:06:06] serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (Papaul) a: Papaul→None
[18:06:17] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (Papaul)
[20:22:27] serviceops, Dumps-Generation, Patch-For-Review, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Krinkle)
[20:22:40] serviceops, Dumps-Generation, Patch-For-Review, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Krinkle)
[20:22:50] serviceops, MediaWiki-extensions-WikimediaEvents, Performance-Team, MW-1.39-notes (1.39.0-wmf.26; 2022-08-22): Allow assigning each user to a specific php engine by setting a PHP_ENGINE cookie - https://phabricator.wikimedia.org/T311388 (Krinkle) Open→Resolved a: Krinkle
[20:22:58] serviceops, SRE, Traffic, Performance-Team (Radar): Split edge caches between php versions - https://phabricator.wikimedia.org/T311479 (Krinkle) a: Krinkle→Joe Given the rollout of the PHP74 cookie campaign, I assume this has since been resolved.
[20:23:31] serviceops, SRE, Traffic, Performance-Team (Radar): Split edge caches between php versions - https://phabricator.wikimedia.org/T311479 (Krinkle) Open→Resolved a: Krinkle
[21:27:25] serviceops, Release Pipeline: Clean-up / delete old versions of service pipeline created docker images from the public docker registry? - https://phabricator.wikimedia.org/T307797 (bd808) The #toolhub and #wikimedia-developer-portal projects are both also publishing an image for each merged commit and wi...
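Footnote on the 15:43 remark that limiting connection duration makes the race "practically gone": the crude simulation below shows the intuition. It is a toy model under assumed numbers (a ~5 s server idle close, a few milliseconds of ambiguity around the close, exponentially distributed gaps between reuses), not measured production data; once the client never reuses a connection older than the server's idle deadline, no request can land in the close window at all.

```python
# Toy Monte Carlo: how often does a reused keep-alive connection get a request
# written onto it right as the server's idle-close deadline fires?
import random

SERVER_IDLE_CLOSE = 5.0  # assumed: server closes idle connections after ~5 s
RACE_WINDOW = 0.005      # assumed: ~5 ms of ambiguity around the close
REQUESTS = 500_000


def failure_rate(client_idle_cap: float) -> float:
    """Fraction of requests that hit the close window, given a client-side idle cap."""
    failures = 0
    for _ in range(REQUESTS):
        idle_gap = random.expovariate(1 / 2.0)  # assumed ~2 s mean gap between reuses
        if idle_gap > client_idle_cap:
            continue  # client opens a fresh connection: no race possible
        if SERVER_IDLE_CLOSE - RACE_WINDOW < idle_gap < SERVER_IDLE_CLOSE + RACE_WINDOW:
            failures += 1
    return failures / REQUESTS


if __name__ == "__main__":
    print("no client-side cap :", failure_rate(float("inf")))
    print("4 s client-side cap:", failure_rate(4.0))
```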