[06:18:23] <wikibugs>	 serviceops: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238 (elukey) NEW
[06:50:59] <wikibugs>	 serviceops, Wikifeeds: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238#9916406 (Aklapper)
[07:28:24] <wikibugs>	 serviceops, Wikifeeds: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238#9916422 (elukey) The only thing that matches seems to be https://gerrit.wikimedia.org/r/c/operations/puppet/+/1019290, that could kinda makes sense - Wikifeeds calls Restbase via t...
[09:09:22] <wikibugs>	 serviceops, Machine-Learning-Team, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9916776 (JMeybohm) >>! In T365253#9909637, @elukey wrote: > I have built and uploaded the new dragonfly packages to bookworm-wikimedia, and updated the ml-sta...
[09:10:49] <wikibugs>	 serviceops, Machine-Learning-Team, Kubernetes: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253#9916784 (elukey) Open→Resolved
[09:23:20] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes: Create ValidationAdmissionPolicies to replace mediawiki PSP - https://phabricator.wikimedia.org/T368251 (JMeybohm) NEW
[09:26:28] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes: Create ValidationAdmissionPolicies to replace mediawiki PSP - https://phabricator.wikimedia.org/T368251#9916868 (JMeybohm) p:Triage→High Although this is not technically blocking the k8s upgrade (because ValidationAdmissionPolicies require k8s >=1.26 a...
[09:31:24] <wikibugs>	 serviceops, Wikifeeds: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238#9916906 (elukey) The rationale for the `4s` keepalive is in T263043. I am wondering if the limitation still holds, or if we could test a higher keepalive to see if the tlsproxy's t...
[09:34:31] <wikibugs>	 serviceops, Infrastructure-Foundations, Data-Platform-SRE (2024.06.17 - 2024.07.07), Patch-For-Review: Create a helm chart for the cloudnativepg postgresql operator - https://phabricator.wikimedia.org/T364797#9916914 (brouberol) @akosiaris I was reading the operator code and found out that you ca...
[09:38:59] <wikibugs>	 serviceops, Wikifeeds: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238#9916955 (elukey) I also tried to run `perf` to catch what is causing the CPU usage, but probably due to the lack of symbols is it not straightforward to get what's happening. The a...
[09:42:06] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9916968 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1407.eqiad.wm...
[09:42:41] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9916971 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw1420.eqiad.wm...
[09:55:00] <wikibugs>	 serviceops, Wikifeeds: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238#9917007 (elukey) After a chat with Janis, this is probably due to T354532.
[09:58:57] <wikibugs>	 serviceops, MW-on-K8s, Prod-Kubernetes, Kubernetes: Create ValidationAdmissionPolicies to replace mediawiki PSP - https://phabricator.wikimedia.org/T368251#9917017 (Clement_Goubert)
[10:02:34] <elukey>	 claime: o/ not urgent but when you have a moment https://phabricator.wikimedia.org/T368238
[10:02:54] <elukey>	 Janis pointed me to the envoy work that you did to limit concurrency etc..
[10:03:22] <elukey>	 there are still some question marks (why with cfssl? why not showing up for mobileapps that have the same discovery config? etc..)
[10:25:37] <wikibugs>	 serviceops, MW-on-K8s: glogger crashes regularly in mw-on-k8s containers - https://phabricator.wikimedia.org/T363342#9917052 (Joe) The problem seems to arise because we allocate a byte slice of size `len(line)`, but somehow we try to copy over bytes past that point.  This is caused by this code, that som...
[10:32:17] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9917087 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1420.eqiad.wmnet...
[10:37:04] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9917094 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw1407.eqiad.wmnet...
[10:55:24] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9917115 (Clement_Goubert) Open→In progress p:Triage→Medium
[11:17:10] <jayme>	 heads up: I did not see any issues with test deployments in k8s staging-codfw and will therefore switch staging(-eqiad) from PSP to PSS (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1049123) for everything that used to use the restricted PSP (so basically all but mediawiki and kube-system).
[11:24:28] <wikibugs>	 serviceops, SRE, Data Products (Data Products Sprint 15), Patch-For-Review, Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9917198 (SGupta-WMF) Thank you @Scott_French for the detailed explanation ....
[11:25:32] <wikibugs>	 serviceops, Cloud-Services, SRE, Patch-For-Review: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9917200 (MoritzMuehlenhoff) CAS 7.0 (what we are currently migrating to) removed the memcached backend. As such, this change won't be nee...
[11:25:45] <wikibugs>	 serviceops, Cloud-Services, SRE, Patch-For-Review: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9917201 (MoritzMuehlenhoff)
[11:54:09] <wikibugs>	 serviceops, [DEPRECATED] wdwb-tech, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9917252 (Jgiannelos)
[11:54:44] <wikibugs>	 serviceops, Content-Transform-Team-WIP, Push-Notification-Service, Essential-Work, Patch-For-Review: Upgrade push notifications to node18 - https://phabricator.wikimedia.org/T367272#9917272 (Jgiannelos) Open→Resolved
[12:05:54] <wikibugs>	 serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Seen): Set all appservers to pooled=inactive in scap - https://phabricator.wikimedia.org/T368058#9917301 (Clement_Goubert)
[12:56:53] <nemo-yiannis>	 👋 After this patch we allow connections from pcs to eventgate-main https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1048009. Is there any way i can point only PCS staging to eventgate-main staging? I didn't find any discovery or listener that we could use.
[12:57:42] <nemo-yiannis>	 The reason is that eventgate-main staging events don't end up in production topics, so it might be easier to test
[13:13:23] <wikibugs>	 serviceops, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371#9917489 (Jdforrester-WMF)
[13:13:44] <wikibugs>	 serviceops, SRE, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995#9917490 (Jdforrester-WMF)
[13:15:03] <wikibugs>	 serviceops, SRE, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995#9917491 (Jdforrester-WMF)
[13:15:24] <wikibugs>	 serviceops, ChangeProp, EventStreams, Recommendation-API, and 3 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750#9917492 (Jdforrester-WMF)
[13:16:03] <_joe_>	 nemo-yiannis: yes
[13:17:31] <_joe_>	 nemo-yiannis: all of staging uses the same cname to reach services externally
[13:17:38] <_joe_>	 staging.svc.eqiad.wmnet:$PORT
[13:17:53] <ottomata>	 !  TIL!  that is awesome
[13:17:54] <_joe_>	 but you can also use the cluster-internal service name
[13:18:04] <wikibugs>	 serviceops, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371#9917493 (Jdforrester-WMF)
[13:18:29] <ottomata>	 nemo-yiannis: another way that we have done this in the past is to declare a temporary or development 'dev' or release candidate 'rc' stream.  This would be like hosting a dev version of an API endpoint.
[13:18:29] <ottomata>	 https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration#Stream_versioning
[13:18:40] <ottomata>	 but producing to staging is good too, whichever you prefer!
[13:18:58] <ottomata>	 _joe_:  what is the cluster-internal service name in staging?  
[13:19:10] <nemo-yiannis>	 thanks _joe_ , is there any network/firewall config that needs to be changed or can i just go ahead and update the config to use staging.svc.eqiad.wmnet:port ?
[13:19:45] <nemo-yiannis>	 i think using staging should be fine ottomata
[13:19:53] <wikibugs>	 serviceops, [DEPRECATED] wdwb-tech, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9917496 (Jdforrester-WMF)
[13:20:07] <nemo-yiannis>	 ottomata: do i need to explicitly prefix the topic with staging or it happens automagically ?
[13:20:36] <ottomata>	 nemo-yiannis: it will happen automatically: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/eventgate-main/values-staging.yaml#17
[13:20:42] <nemo-yiannis>	 👍
[13:20:45] <_joe_>	 ottomata: same as in the main clusters .svc.cluster.local
[13:21:49] <ottomata>	 oh, i see that is addressable from other staging services?  like eventgate-main.svc.cluster.local resolves to staging from other pods in staging, and eqiad from other pods in eqiad?
[13:22:13] <_joe_>	 it's eventgate-production-tls-service.eventgate-main...
[13:22:22] <_joe_>	 going from memory, we should check
[13:35:04] <wikibugs>	 serviceops, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371#9917597 (Jdforrester-WMF)
[13:36:13] <wikibugs>	 serviceops, SRE, API Platform (RESTBase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995#9917600 (Jdforrester-WMF)
[13:37:26] <wikibugs>	 serviceops, ChangeProp, EventStreams, Recommendation-API, and 2 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750#9917618 (Jdforrester-WMF)
[13:39:21] <wikibugs>	 serviceops, [DEPRECATED] wdwb-tech, Citoid, Content-Transform-Team-WIP, and 10 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118#9917593 (Jdforrester-WMF)
[13:53:54] <elukey>	 ottomata: o/ are you folks planning to rollout Eventgate by any chance?
[13:54:16] <elukey>	 (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1040862)
[13:54:34] <elukey>	 otherwise I can do it, I'd need to see those images rolled out in prod to get package upgrades
[14:54:20] <wikibugs>	 serviceops, Wikifeeds: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238#9917943 (Clement_Goubert) >>! In T368238#9916955, @elukey wrote: > I also tried to run `perf` to catch what is causing the CPU usage, but probably due to the lack of symbols is it...
[15:27:19] <elukey>	 claime: so far the change looks good - https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?orgId=1&from=now-1h&to=now
[15:28:02] <elukey>	 latency seems to have improved as well
[15:28:03] <claime>	 So much better when you're not splitting your bucket in 100 x)
[15:28:18] <elukey>	 yep :)
[15:28:42] <elukey>	 applying to eqiad as well.. maybe there is space for further tuning? Like lowering concurrency to say 6/8?
[15:30:05] <wikibugs>	 serviceops, Wikifeeds, Patch-For-Review: Wikifeeds' tls proxy cpu usage heavily increased in April - https://phabricator.wikimedia.org/T368238#9918087 (elukey) After https://gerrit.wikimedia.org/r/1049197 the throttling dropped a lot, and the overall latency seems good. The throttling is not zero tho...
[15:30:51] <elukey>	 we shaved ~300/400ms of latency with this change, afaics
[15:32:13] <elukey>	 (in codfw at least, eqiad may have less benefits)
[15:32:21] <elukey>	 anyway, all deployed, I'll recheck tomorrow :)
[15:37:56] <wikibugs>	 serviceops, Infrastructure-Foundations: Cleanup old Docker images running Debian Stretch - https://phabricator.wikimedia.org/T367427#9918200 (elukey) ` MariaDB [debmonitor]> insert into src_packages_os(id, name) values (3, 'Debian 10'), (4, 'Debian 11'), (5, 'Debian 12'); Query OK, 3 rows affected (0.001...
[15:45:42] <wikibugs>	 serviceops, Infrastructure-Foundations: Cleanup old Docker images running Debian Stretch - https://phabricator.wikimedia.org/T367427#9918237 (elukey)
[15:57:05] <claime>	 elukey: Maybe yeah, it's a bit trial and error finding the right balance between threads not getting overwhelmed by requests and not throttling too much
[15:58:42] <cdanis>	 claime: my hunch is that concurrency >= 4 is probably just necessary to keep latency low in the face of bursty load
[15:58:47] <claime>	 yep
[15:58:56] <cdanis>	 I've seen some configurations with concurrency of 1-2 and that just seems right out
[15:58:57] <claime>	 Even with only 1 cpu max
[15:59:10] <cdanis>	 why is everything queueing theory
[15:59:30] <claime>	 cdanis: https://phabricator.wikimedia.org/T354532 that's basically what j.ayme found out there
[16:00:22] <elukey>	 I am wondering if anything else that uses the envoy mesh is affected by this 
[16:00:38] <elukey>	 maybe with less impact than wikifeeds
[16:00:57] <claime>	 Probably everything that has a decent amount of rps in or outbound through the service mesh is somewhat affected yes
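Note on the exchange above: the "splitting your bucket in 100" remark can be illustrated with some back-of-the-envelope arithmetic. This is only a sketch under stated assumptions: the default Linux CFS bandwidth period of 100 ms, a container CPU limit of 1 ("only 1 cpu max"), all worker threads busy at once, and hypothetical thread counts of 100, 8 and 4. The function name `quota_per_thread_ms` is made up for illustration; none of these numbers are taken from the actual Wikifeeds or envoy configuration.

```python
# Illustrative arithmetic only; the values below are assumptions, not the real config.
CFS_PERIOD_MS = 100.0  # default Linux CFS bandwidth period (cpu.cfs_period_us = 100000)


def quota_per_thread_ms(cpu_limit: float, busy_threads: int,
                        period_ms: float = CFS_PERIOD_MS) -> float:
    """CPU time (ms) each thread gets per CFS period if all threads are busy
    and the container's quota is shared evenly between them."""
    if cpu_limit <= 0 or busy_threads <= 0:
        raise ValueError("cpu_limit and busy_threads must be positive")
    # The whole container may use cpu_limit * period_ms of CPU time per period.
    return cpu_limit * period_ms / busy_threads


if __name__ == "__main__":
    for threads in (100, 8, 4):  # hypothetical concurrency settings
        share = quota_per_thread_ms(cpu_limit=1.0, busy_threads=threads)
        print(f"{threads:>3} busy threads -> {share:.2f} ms of CPU per thread per period")
```

With many busy threads the shared quota is exhausted early in each period and every thread is throttled until the next one, which is consistent with the throttling and latency drop reported after lowering concurrency. The even split is a simplification; real scheduling is burstier.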
[16:04:13] <ottomata>	 elukey: we did! it looks like that was not reflected in task? https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils#building-project-conda-distribution-environments
[16:04:18] <ottomata>	 oops wrong link!
[16:04:34] <ottomata>	 T344730
[16:04:40] <ottomata>	 https://phabricator.wikimedia.org/T344730
[16:05:38] <elukey>	 ah nice thanks!
[16:05:58] <ottomata>	 i'll ask sandra to update the task
[16:07:17] <ottomata>	 elukey: sandra deployed on june 12: https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2024-06-12
[16:11:00] <elukey>	 okok thanks! I thought Sandra and I were supposed to do it together, this is why I was asking :)
[18:41:35] <wikibugs>	 serviceops, Prod-Kubernetes, Patch-For-Review: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978#9919034 (Scott_French) Alright, first the good news: I was able to deploy the mediawiki changes to mw-debug and canary releases for on...
[21:18:49] <wikibugs>	 serviceops, DC-Ops, ops-codfw, SRE, Traffic: lvs2011 Memory failure on slot B1 - https://phabricator.wikimedia.org/T368165#9919461 (BCornwall) a:Papaul→Jhancock.wm
[23:48:34] <wikibugs>	 serviceops, Prod-Kubernetes, Patch-For-Review: Update all helm modules and charts to be compatible with the restricted PSS - https://phabricator.wikimedia.org/T362978#9919727 (Scott_French) I've manually updated prometheus queries that previously limited `envoy_cluster_name` to "local_service" to be...