[06:18:23] serviceops: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238 (elukey) NEW
[06:50:59] serviceops, Wikifeeds: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238#9916406 (Aklapper)
[07:28:24] serviceops, Wikifeeds: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238#9916422 (elukey) The only thing that matches seems to be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1019290, that could kinda make sense - Wikifeeds calls Restbase via t...
[09:09:22] serviceops, Machine-Learning-Team, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9916776 (JMeybohm) >>! In T365253#9909637, @elukey wrote: > I have built and uploaded the new dragonfly packages to bookworm-wikimedia, and updated the ml-sta...
[09:10:49] serviceops, Machine-Learning-Team, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9916784 (elukey) Open→Resolved
[09:23:20] serviceops, Prod-Kubernetes, Kubernetes: Create ValidationAdmissionPolicies to replace mediawiki PSP - https://phabricator.wikimedia.org/T368251 (JMeybohm) NEW
[09:26:28] serviceops, Prod-Kubernetes, Kubernetes: Create ValidationAdmissionPolicies to replace mediawiki PSP - https://phabricator.wikimedia.org/T368251#9916868 (JMeybohm) p:Triage→High Although this is not technically blocking the k8s upgrade (because ValidationAdmissionPolicies require k8s >=1.26 a...
[09:31:24] serviceops, Wikifeeds: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238#9916906 (elukey) The rationale for the `4s` keepalive is in T263043. I am wondering if the limitation still holds, or if we could test a higher keepalive to see if the tlsproxy's t...
[09:34:31] serviceops, Infrastructure-Foundations, Data-Platform-SRE (2024.06.17 - 2024.07.07), Patch-For-Review: Create a helm chart for the cloudnativepg postgresql operator - https://phabricator.wikimedia.org/T364797#9916914 (brouberol) @akosiaris I was reading the operator code and found out that you ca...
[09:38:59] serviceops, Wikifeeds: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238#9916955 (elukey) I also tried to run `perf` to catch what is causing the CPU usage, but probably due to the lack of symbols it is not straightforward to get what's happening. The a...
[09:42:06] serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9916968 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1407.eqiad.wm...
[09:42:41] serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9916971 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1420.eqiad.wm...
[09:55:00] serviceops, Wikifeeds: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238#9917007 (elukey) After a chat with Janis, this is probably due to T354532.
[09:58:57] serviceops, MW-on-K8s, Prod-Kubernetes, Kubernetes: Create ValidationAdmissionPolicies to replace mediawiki PSP - https://phabricator.wikimedia.org/T368251#9917017 (Clement_Goubert)
[10:02:34] claime: o/ not urgent but when you have a moment https://phabricator.wikimedia.org/T368238
[10:02:54] Janis pointed me to the envoy work that you did to limit concurrency etc..
[10:03:22] there are still some question marks (why with cfssl? why not showing up for mobileapps that have the same discovery config? etc..)
[10:25:37] serviceops, MW-on-K8s: glogger crashes regularly in mw-on-k8s containers - https://phabricator.wikimedia.org/T363342#9917052 (Joe) The problem seems to arise because we allocate a byte slice of size `len(line)`, but somehow we try to copy over bytes past that point. This is caused by this code, that som...
[10:32:17] serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9917087 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1420.eqiad.wmnet...
[10:37:04] serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9917094 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1407.eqiad.wmnet...
[10:55:24] serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9917115 (Clement_Goubert) Open→In progress p:Triage→Medium
[11:17:10] heads up: I did not see any issues with test deployments in k8s staging-codfw and will therefore switch staging(-eqiad) from PSP to PSS (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1049123) for everything that used to use the restricted PSP (so basically all but mediawiki and kube-system).
[11:24:28] serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9917198 (SGupta-WMF) Thank you @Scott_French for the detailed explanation ....
[11:25:32] serviceops, Cloud-Services, SRE, Patch-For-Review: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9917200 (MoritzMuehlenhoff) CAS 7.0 (what we are currently migrating to) removed the memcached backend. As such, this change won't be nee...
[11:25:45] serviceops, Cloud-Services, SRE, Patch-For-Review: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9917201 (MoritzMuehlenhoff)
[11:54:09] serviceops, [DEPRECATED] wdwb-tech, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9917252 (Jgiannelos)
[11:54:44] serviceops, Content-Transform-Team-WIP, Push-Notification-Service, Essential-Work, Patch-For-Review: Upgrade push notifications to node18 - https://phabricator.wikimedia.org/T367272#9917272 (Jgiannelos) Open→Resolved
[12:05:54] serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9917301 (Clement_Goubert)
[12:56:53] 👋 After this patch we allow connections from pcs to eventgate-main https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1048009. Is there any way i can point only PCS staging to eventgate-main staging? I didn't find any discovery or listener that we could use.
[12:57:42] The reason is that eventgate-main staging events don't end up in production topics and might be easier to test
[13:13:23] serviceops, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371#9917489 (Jdforrester-WMF)
[13:13:44] serviceops, SRE, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995#9917490 (Jdforrester-WMF)
[13:15:03] serviceops, SRE, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995#9917491 (Jdforrester-WMF)
[13:15:24] serviceops, ChangeProp, EventStreams, Recommendation-API, and 3 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750#9917492 (Jdforrester-WMF)
[13:16:03] <_joe_> nemo-yiannis: yes
[13:17:31] <_joe_> nemo-yiannis: all of staging uses the same cname to reach services externally
[13:17:38] <_joe_> staging.svc.eqiad.wmnet:$PORT
[13:17:53] ! TIL! that is awesome
[13:17:54] <_joe_> but you can also use the cluster-internal service name
[13:18:04] serviceops, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371#9917493 (Jdforrester-WMF)
[13:18:29] nemo-yiannis: another way that we have done this in the past, is to declare a temporary or development 'dev' or release candidate 'rc' stream. This would be like hosting a dev version of an API endpoint.
[13:18:29] https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration#Stream_versioning
[13:18:40] but producing to staging is good too, whichever you prefer!
[13:18:58] _joe_: what is the cluster-internal service name in staging?
[13:19:10] thanks _joe_ , is there any network/firewall config that needs to be changed or can i just go ahead and update the config to use staging.svc.eqiad.wmnet:port ?
[13:19:45] i think using staging should be fine ottomata
[13:19:53] serviceops, [DEPRECATED] wdwb-tech, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9917496 (Jdforrester-WMF)
[13:20:07] ottomata: do i need to explicitly prefix the topic with staging or it happens automagically ?
[13:20:36] nemo-yiannis: it will happen automatically: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/eventgate-main/values-staging.yaml#17
[13:20:42] 👍
[13:20:45] <_joe_> ottomata: same as in the main clusters .svc.cluster.local
[13:21:49] oh, i see that is addressable from other staging services? like eventgate-main.svc.cluster.local resolves to staging from other pods in staging, and eqiad from other pods in eqiad?
[13:22:13] <_joe_> it's eventgate-production-tls-service.eventgate-main...
[13:22:22] <_joe_> going from memory, we should check
[13:35:04] serviceops, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371#9917597 (Jdforrester-WMF)
[13:36:13] serviceops, SRE, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995#9917600 (Jdforrester-WMF)
[13:37:26] serviceops, ChangeProp, EventStreams, Recommendation-API, and 2 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750#9917618 (Jdforrester-WMF)
[13:39:21] serviceops, [DEPRECATED] wdwb-tech, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9917593 (Jdforrester-WMF)
[13:53:54] ottomata: o/ are you folks planning to rollout Eventgate by any chance?
[13:54:16] (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1040862)
[13:54:34] otherwise I can do it, I'd need to see those images rolled out in prod to get package upgrades
[14:54:20] serviceops, Wikifeeds: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238#9917943 (Clement_Goubert) >>! In T368238#9916955, @elukey wrote: > I also tried to run `perf` to catch what is causing the CPU usage, but probably due to the lack of symbols is it...
[15:27:19] claime: so far the change looks good - https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?orgId=1&from=now-1h&to=now
[15:28:02] latency seems to have improved as well
[15:28:03] So much better when you're not splitting your bucket in 100 x)
[15:28:18] yep :)
[15:28:42] applying to eqiad as well.. maybe there is space for further tuning? Like lowering concurrency to say 6/8?
[15:30:05] serviceops, Wikifeeds, Patch-For-Review: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238#9918087 (elukey) After https://gerrit.wikimedia.org/r/1049197 the throttling dropped a lot, and the overall latency seems good. The throttling is not zero tho...
[15:30:51] we shaved ~300/400ms of latency with this change, afaics
[15:32:13] (in codfw at least, eqiad may have less benefits)
[15:32:21] anyway, all deployed, I'll recheck tomorrow :)
[15:37:56] serviceops, Infrastructure-Foundations: Cleanup old Docker images running Debian Stretch - https://phabricator.wikimedia.org/T367427#9918200 (elukey) ` MariaDB [debmonitor]> insert into src_packages_os(id, name) values (3, 'Debian 10'), (4, 'Debian 11'), (5, 'Debian 12'); Query OK, 3 rows affected (0.001...
[15:45:42] serviceops, Infrastructure-Foundations: Cleanup old Docker images running Debian Stretch - https://phabricator.wikimedia.org/T367427#9918237 (elukey)
[15:57:05] elukey: Maybe yeah, it's a bit trial and error finding the right balance between threads not getting overwhelmed by requests and not throttling too much
[15:58:42] claime: my hunch is that concurrency >= 4 is probably just necessary to keep latency low in the face of bursty load
[15:58:47] yep
[15:58:56] I've seen some configurations with concurrency of 1-2 and that just seems right out
[15:58:57] Even with only 1 cpu max
[15:59:10] why is everything queueing theory
[15:59:30] cdanis: https://phabricator.wikimedia.org/T354532 that's basically what j.ayme found out there
[16:00:22] I am wondering if anything else that uses the envoy mesh is affected by this
[16:00:38] maybe with less impact than wikifeeds
[16:00:57] Probably everything that has a decent amount of rps in or outbound through the service mesh is somewhat affected yes
[16:04:13] elukey: we did! it looks like that was not reflected in task? https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils#building-project-conda-distribution-environments
[16:04:18] oops wrong link!
[16:04:34] T344730
[16:04:40] https://phabricator.wikimedia.org/T344730
[16:05:38] ah nice thanks!
[16:05:58] i'll ask sandra to update the task
[16:07:17] elukey: sandra deployed on june 12: https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2024-06-12
[16:11:00] okok thanks! I thought Sandra and I were supposed to do it together, this is why I was asking :)
[18:41:35] serviceops, Prod-Kubernetes, Patch-For-Review: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978#9919034 (Scott_French) Alright, first the good news: I was able to deploy the mediawiki changes to mw-debug and canary releases for on...
[21:18:49] serviceops, DC-Ops, ops-codfw, SRE, Traffic: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165#9919461 (BCornwall) a:Papaul→Jhancock.wm
[23:48:34] serviceops, Prod-Kubernetes, Patch-For-Review: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978#9919727 (Scott_French) I've manually updated prometheus queries that previously limited `envoy_cluster_name` to "local_service" to be...