[06:10:09] serviceops, DBA, Phabricator, serviceops-collab, and 2 others: sort out mysql privileges for phab1004/phab2002 - https://phabricator.wikimedia.org/T315713 (Marostegui) @Aklapper this is now fixed: ` aklapper@phab1001:~$ mysql Reading table information for completion of table and column names You...
[08:26:38] hello folks
[08:26:57] helm is not happy for staging-eqiad and eventstreams-internal
[08:26:58] 6 Tue Sep 20 20:02:32 2022 superseded eventstreams-0.5.0 Upgrade "main" failed: failed to create resource: Service "eventstreams-main-tls-service" is invalid: spec.ports[0].nodePort: Invalid value: 4992: provided port is already allocated
[08:27:02] 7 Tue Sep 20 20:02:33 2022 failed eventstreams-0.4.1 Rollback "main" failed: no NetworkPolicy with the name "eventstreams-internal-main" found
[08:27:56] seems related to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/833447
[08:28:47] ah ok https://phabricator.wikimedia.org/T310721 mentions the issue
[09:01:40] yeah, that change really messed up the release :/
[09:07:30] jayme: o/
[09:08:14] I did ack the alert for now
[09:45:18] serviceops, serviceops-collab, GitLab (Infrastructure), Patch-For-Review: Migrate gitlab-test instance to puppet - https://phabricator.wikimedia.org/T297411 (Jelto) `gitlab-prod-1001` had failing puppet runs (root@ mail): ` Sep 24 08:12:47 gitlab-prod-1001 systemd[1]: Starting OpenBSD Secure She...
[09:45:31] serviceops, serviceops-collab, GitLab (Infrastructure), Patch-For-Review: Migrate gitlab-test instance to puppet - https://phabricator.wikimedia.org/T297411 (Jelto) Open→Resolved
[09:53:07] serviceops, serviceops-collab, GitLab (Infrastructure): Migrate gitlab-test instance to bullseye - https://phabricator.wikimedia.org/T318521 (Jelto)
[10:29:23] serviceops, ChangeProp, Kubernetes, Sustainability (Incident Followup): Raise an alarm on container restarts/OOMs in kubernetes - https://phabricator.wikimedia.org/T256256 (Aklapper) a: akosiaris→None Removing task assignee due to inactivity as this open task has been assigned for more tha...
[11:28:43] serviceops, serviceops-collab, Patch-For-Review: move micro sites from ganeti to kubernetes - https://phabricator.wikimedia.org/T300171 (Jelto) ### Usage of micro sizes @Dzahn as discussed last week, a short Turnilo overview about usage of the micro sites mentioned above: https://w.wiki/5kCz Keep in...
[12:02:42] serviceops, MW-on-K8s, Patch-For-Review, Release-Engineering-Team (Bonus Level 🕹️): Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (jnuche) Open→Resolved Scap is already generating Helmfile configuration files as described here...
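An aside on the 08:26–08:27 helm errors above: the upgrade failed because the chart asked for a NodePort (4992) that another Service in the cluster already held. A rough pre-flight check along the following lines can surface such collisions before a `helmfile apply`. This is a hypothetical helper, not part of the actual deployment-charts tooling; it only assumes the official `kubernetes` Python client and a working kubeconfig.

```python
# Hypothetical pre-flight check: list every NodePort already allocated in the
# cluster and see whether the port a chart pins (4992 in the error above) is free.
from kubernetes import client, config

WANTED_NODE_PORT = 4992  # the port the eventstreams chart tried to claim


def allocated_node_ports() -> dict[int, str]:
    """Map every allocated NodePort to the Service that owns it."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    ports: dict[int, str] = {}
    for svc in v1.list_service_for_all_namespaces().items:
        for port in (svc.spec.ports or []):
            if port.node_port:
                ports[port.node_port] = f"{svc.metadata.namespace}/{svc.metadata.name}"
    return ports


if __name__ == "__main__":
    owner = allocated_node_ports().get(WANTED_NODE_PORT)
    if owner:
        print(f"nodePort {WANTED_NODE_PORT} is already allocated by {owner}")
    else:
        print(f"nodePort {WANTED_NODE_PORT} is free")
```

The second error at 08:27:02 is a separate symptom: helm's rollback to 0.4.1 referenced a NetworkPolicy ("eventstreams-internal-main") it could not find, which is what left the release in the failed state noted in T310721.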
[12:02:54] serviceops, MW-on-K8s, Patch-For-Review, Release-Engineering-Team (Bonus Level 🕹️): Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (jnuche)
[12:03:24] serviceops, MW-on-K8s, Patch-For-Review, Release-Engineering-Team (Bonus Level 🕹️): Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (jnuche)
[12:15:38] serviceops, SRE: Update node 14/16 base images - https://phabricator.wikimedia.org/T318541 (MoritzMuehlenhoff)
[12:15:45] serviceops, SRE: Update node 14/16 base images - https://phabricator.wikimedia.org/T318541 (MoritzMuehlenhoff) p: Triage→Medium
[12:58:07] * inflatador reports for duty
[13:21:22] I'm shadowing service ops for the next 2 wks, so if anyone has a ticket or something they wanna work on together, reach out!
[13:33:14] hi inflatador o/
[13:34:04] unfortunately we won't be having a team meeting today
[13:34:38] aiui you've already been pointed towards flink-ish tasks, right? :)
[13:37:27] jayme yeah, I think that's the priority from gehel (aka my boss)
[14:37:11] <_joe_> inflatador: welcome!
[14:37:34] serviceops, serviceops-collab, Patch-For-Review: move micro sites from ganeti to kubernetes - https://phabricator.wikimedia.org/T300171 (JMeybohm) Sounds good to me!
[14:40:44] <_joe_> I was taking a look at the mediawiki-errors logstash dashboard, it seems like the errors are evenly distributed between php versions if we exclude the new more stringent notices
[14:48:30] <_joe_> and I stumbled upon T313973
[14:48:45] <_joe_> jayme: did you ever try to find out what was wrong there?
[14:49:26] _joe_: yes. Actually I wanted to chat with you about it last week...but time
[14:50:06] <_joe_> jayme: ok, we have time now
[14:50:08] my current level of understanding is that connections get terminated in flight
[14:50:18] <_joe_> and not due to timeouts?
[14:50:28] <_joe_> I see a lot of conflicting info on that task
[14:50:44] <_joe_> do we have evidence that failed requests take 10 seconds?
[14:50:56] <_joe_> actually, scratch that, I see now UC
[14:50:58] <_joe_> not UT
[14:51:10] <_joe_> ok, lol, I think I know what the problem is
[14:51:14] some terminate due to timeouts but most do not AIUI
[14:51:36] please elaborate :)
[14:51:39] <_joe_> yeah I think there is a problem either between the backend envoy and the ingress
[14:51:47] <_joe_> or between the ingress and the service
[14:52:03] <_joe_> we usually set keepalive to 4 seconds for nodejs services
[14:52:23] <_joe_> because the damn nodejs service has a keepalive timeout of ~ 5 seconds
[14:52:45] <_joe_> so let me try this the easy way
[14:53:31] ah...and ingress does not know, keeping connections open
[14:53:53] that's why we see 1h connection length in service-proxy on appservers?
[14:54:20] <_joe_> yeah so we need one in the chain of envoys to close connections before that damn nodejs
[14:54:40] (see recently added connection metrics in https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry)
[14:54:59] <_joe_> jayme: https://gerrit.wikimedia.org/r/835205
[14:55:19] 🤦
[14:55:50] <_joe_> I am 70% sure it will fix the issue
[14:55:55] <_joe_> minus the timeouts
[14:56:10] but...doesn't the service-proxy on image-suggestion side do that already?
[14:57:19] <_joe_> the tls terminator? no we didn't make the keepalive configurable there IIRC
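An aside on the 14:52–14:57 exchange: the convention _joe_ describes is that whatever talks to a Node.js service must stop reusing an idle keep-alive connection before Node's ~5 s keep-alive timeout fires, otherwise a request can be written onto a socket the server is closing. A minimal client-side sketch of that rule follows; it is illustrative only (the production approach is on the envoy side, per the change linked at 14:54), and the URL is a placeholder.

```python
# Minimal sketch: cap how long an idle keep-alive connection may be reused.
# 4.0 s mirrors the "stay below Node's ~5 s keep-alive" convention above.
import asyncio
import aiohttp

SERVICE_URL = "http://localhost:8080/healthz"  # hypothetical endpoint


async def main() -> None:
    # Reuse idle connections for at most 4 s, so the client always abandons a
    # connection before the upstream's ~5 s keep-alive timeout can close it
    # underneath an in-flight request.
    connector = aiohttp.TCPConnector(keepalive_timeout=4.0)
    async with aiohttp.ClientSession(connector=connector) as session:
        for _ in range(3):
            async with session.get(SERVICE_URL) as resp:
                print(resp.status)
            await asyncio.sleep(1)


if __name__ == "__main__":
    asyncio.run(main())
```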
[14:57:24] <_joe_> at least not on k8s
[14:57:39] <_joe_> anyways, the important part is that the downstream client and the upstream service "agree"
[14:58:18] I remember some kind of comment along those lines ...
[14:58:19] <_joe_> given how envoy works, 1 client connection -> 1 upstream connection, always
[14:58:38] <_joe_> well 1 client established connection I mean
[14:59:58] we do have an idle timeout of 4.5s set for the service-proxy
[15:01:24] <_joe_> on k8s?
[15:01:27] yep
[15:01:29] <_joe_> you mean I did it there too?
[15:01:35] <_joe_> I was sure I did not lol
[15:01:57] <_joe_> but still, as explained, sadly what counts is the final envoy in the chain I fear
[15:02:48] <_joe_> anyways, let's see if this change has any effect
[15:03:03] <_joe_> I want to clean house before making the final php 7.4 push tomorrow
[15:03:19] I fear I fail to follow. This https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/common_templates/0.4/_tls_helpers.tpl#238 is what ends up in the final envoy or am I wrong?
[15:04:45] <_joe_> no sorry with "final" I meant the client
[15:04:51] ah
[15:05:17] <_joe_> one day we will need to understand why envoys can't talk tls 1.3 to each other btw
[15:05:52] but then, still...image-suggestion is not a nodejs service
[15:07:02] or I'm totally off the track here
[15:07:23] but AIUI it's just the cassandra gateway thing (and that's golang)
[15:07:54] <_joe_> ok, yeah then uhm
[15:07:56] <_joe_> :P
[15:08:00] eheh
[15:08:02] <_joe_> kask has no such issues
[15:08:23] <_joe_> nor any other net/http based golang service I've experimented with
[15:10:27] I've not seen that as well
[15:12:43] so what looks odd in the metrics is the destroyed connections as well as the session length
[15:12:44] https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=image-suggestion&from=now-24h&to=now
[15:13:05] appservers (envoy) -> ingress
[15:16:30] ingress sees quite regular downstream connection terminations (DC) from image-suggestion https://grafana-rw.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s&var-namespace=image-suggestion&var-backend=All&var-response_code=All&var-quantile=0.95&var-quantile=0.99&from=now-24h&to=now
[15:16:44] serviceops, Discovery-Search (Current work): Coordinate with ServiceOps Team about a rework of the Search Update Pipeline - https://phabricator.wikimedia.org/T317283 (TJones) a: Gehel
[15:18:33] and the service-proxy (on image-suggestion side) reports quite some destroyed connections as well https://grafana-rw.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s?orgId=1&var-datasource=thanos&var-site=eqiad&var-prometheus=k8s&var-app=cassandra-http-gateway&var-destination=All&from=now-24h&to=now
[15:28:04] <_joe_> case in point, the errors on the mediawiki side ended with my change it seems
[15:28:40] <_joe_> yes
[15:28:59] <_joe_> still too early to call, but it looks somewhat promising
[15:29:06] <_joe_> we'll know tomorrow morning
[15:29:48] let's resync then. I would like to figure out why (in both cases)
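One way to read _joe_'s "what counts is the final envoy in the chain" point: idle timeouts should shrink as you move from the service toward the client, so every hop abandons a connection before its upstream does. The toy checker below makes that ordering explicit; only the 4.5 s service-proxy idle timeout and Node's ~5 s keep-alive come from the discussion, the other figures (and the whole chain layout) are made-up placeholders.

```python
# Toy model: each hop's idle timeout must be strictly below the next upstream
# hop's, otherwise the downstream side may reuse a connection its upstream has
# already decided to close.
HOPS_AFTER = [
    # (name, idle timeout in seconds), ordered client -> service
    ("appserver envoy", 4.0),                 # hypothetical: capped by the 14:54 change
    ("istio ingressgateway", 4.2),            # hypothetical
    ("service-proxy (TLS terminator)", 4.5),  # from the discussion
    ("node.js service", 5.0),                 # Node's ~5 s keep-alive, from the discussion
]
# Hypothetical pre-change state: the first hop never gives up on idle connections.
HOPS_BEFORE = [("appserver envoy", float("inf"))] + HOPS_AFTER[1:]


def check_chain(hops: list[tuple[str, float]]) -> None:
    """Print, for each adjacent pair, whether the downstream hop closes first."""
    for (down_name, down_t), (up_name, up_t) in zip(hops, hops[1:]):
        verdict = "ok" if down_t < up_t else "RACE POSSIBLE"
        print(f"{down_name} ({down_t}s) -> {up_name} ({up_t}s): {verdict}")


if __name__ == "__main__":
    check_chain(HOPS_BEFORE)
    print()
    check_chain(HOPS_AFTER)
```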
[15:30:27] *than..as in tomorrow
[15:37:34] <_joe_> as I said, it's a chain of race conditions for sure, in this case I fear the ingress plays a role
[15:40:59] yeah...probably :/
[15:43:32] <_joe_> so what we obtain limiting the duration of connections is to make the race condition improbable enough that it is practically gone from our systems
[15:48:29] <_joe_> I would say we can talk more tomorrow
[16:21:04] serviceops, Dumps-Generation, Patch-For-Review, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Jdforrester-WMF)
[16:24:52] serviceops, Dumps-Generation, Patch-For-Review, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (taavi)
[16:31:46] serviceops, Dumps-Generation, Patch-For-Review, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Jdforrester-WMF)
[17:39:57] serviceops, SRE, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Krinkle)
[18:06:06] serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (Papaul) a: Papaul→None
[18:06:17] serviceops, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (Papaul)
[20:22:27] serviceops, Dumps-Generation, Patch-For-Review, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Krinkle)
[20:22:40] serviceops, Dumps-Generation, Patch-For-Review, Performance-Team (Radar): Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Krinkle)
[20:22:50] serviceops, MediaWiki-extensions-WikimediaEvents, Performance-Team, MW-1.39-notes (1.39.0-wmf.26; 2022-08-22): Allow assigning each user to a specific php engine by setting a PHP_ENGINE cookie - https://phabricator.wikimedia.org/T311388 (Krinkle) Open→Resolved a: Krinkle
[20:22:58] serviceops, SRE, Traffic, Performance-Team (Radar): Split edge caches between php versions - https://phabricator.wikimedia.org/T311479 (Krinkle) a: Krinkle→Joe Given the rollout of the PHP74 cookie campaign, I assume this has since been resolved.
[20:23:31] serviceops, SRE, Traffic, Performance-Team (Radar): Split edge caches between php versions - https://phabricator.wikimedia.org/T311479 (Krinkle) Open→Resolved a: Krinkle
[21:27:25] serviceops, Release Pipeline: Clean-up / delete old versions of service pipeline created docker images from the public docker registry? - https://phabricator.wikimedia.org/T307797 (bd808) The #toolhub and #wikimedia-developer-portal projects are both also publishing an image for each merged commit and wi...
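Footnote on the 15:43 remark that limiting connection duration makes the race "practically gone": the crude simulation below shows the intuition. It is a toy model under assumed numbers (a ~5 s server idle close, a few milliseconds of ambiguity around the close, exponentially distributed gaps between reuses), not measured production data; once the client never reuses a connection older than the server's idle deadline, no request can land in the close window at all.

```python
# Toy Monte Carlo: how often does a reused keep-alive connection get a request
# written onto it right as the server's idle-close deadline fires?
import random

SERVER_IDLE_CLOSE = 5.0  # assumed: server closes idle connections after ~5 s
RACE_WINDOW = 0.005      # assumed: ~5 ms of ambiguity around the close
REQUESTS = 500_000


def failure_rate(client_idle_cap: float) -> float:
    """Fraction of requests that hit the close window, given a client-side idle cap."""
    failures = 0
    for _ in range(REQUESTS):
        idle_gap = random.expovariate(1 / 2.0)  # assumed ~2 s mean gap between reuses
        if idle_gap > client_idle_cap:
            continue  # client opens a fresh connection: no race possible
        if SERVER_IDLE_CLOSE - RACE_WINDOW < idle_gap < SERVER_IDLE_CLOSE + RACE_WINDOW:
            failures += 1
    return failures / REQUESTS


if __name__ == "__main__":
    print("no client-side cap :", failure_rate(float("inf")))
    print("4 s client-side cap:", failure_rate(4.0))
```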