[09:34:58] serviceops, Maps, SRE: Move maps/karthoterian to PKI/cfssl - https://phabricator.wikimedia.org/T360778#9706160 (jijiki)
[09:35:00] serviceops, SRE, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9706161 (jijiki)
[09:37:07] serviceops, SRE, Epic, Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#9706185 (jijiki)
[09:37:16] serviceops, Maps, SRE: Move maps/karthoterian to PKI/cfssl - https://phabricator.wikimedia.org/T360778#9706189 (jijiki) a:jijiki
[09:38:17] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9706195 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2412.codfw.wmnet with OS bullseye
[09:38:41] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9706199 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2413.codfw.wmnet with OS bullseye
[09:39:09] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9706204 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2414.codfw.wmnet with OS bullseye
[09:39:40] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9706209 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2415.codfw.wmnet with OS bullseye
[09:40:03] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9706213 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2416.codfw.wmnet with OS bullseye
[09:40:33] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9706225 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2417.codfw.wmnet with OS bullseye
[09:41:03] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9706227 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2418.codfw.wmnet with OS bullseye
[09:42:14] bblack: oh, sorry, I read your message back then and then forgot to revisit it the next morning. The TL;DR is ~12k rps before we have major problems across the entire board. It's a bit less than that depending on which specific thing we talk about, e.g. external API is sized differently from the internal API, differently from the available capacity
[09:42:14] from web browsers. You can see the split at the level of granularity I assume you want, here: https://w.wiki/9j9j (you'll need to be logged in to grafana-rw). As you can see, it's indeed just php-fpm worker processes, regardless of legacy vs mw-on-k8s.
[09:44:12] So for external API, around 3k rps is what we can currently do (this is changing as we proceed with the migration). I'd give as a number something like 1/10th to 1/5th of that.
[09:44:53] the spikes you'll note btw are deployments. Deployments on k8s work by spinning up new instances, adding capacity, moving traffic over, shutting down the old instances and so on in batches of 3% until they reach 100%
[09:47:01] by end of Q1 next FY we hope to also be able to more dynamically respond to increased rps demands by automatically spawning some more instances, but this will take a while to set up the infra for. And most importantly, tuning it to not end up killing the rest of production in such events is going to prove interesting
[09:47:27] that being said, we already do this manually today (thanks mostly to claim.e), so it should be doable.
[09:47:36] Human Pod Autoscaler
[09:47:52] I'm going to change my job title to this
[09:49:14] lol
[09:49:46] More seriously, to add to what a.kosiaris just mentioned, we are able to run a little hotter in terms of saturation of php workers in k8s (less shared cache contention because fewer workers share the cache) before running into latency issues, and the fact that we now differentiate between external and internal API calls will give us a lot more flexibility even without automatic scaling
[10:15:28] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9706295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2414.codfw.wmnet with OS bullseye completed: - mw24...
[10:19:28] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9706330 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2417.codfw.wmnet with OS bullseye completed: - mw24...
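
Note on the 09:44 messages: the arithmetic behind the two figures mentioned there (3% deployment batches, and 1/10th to 1/5th of the ~3k rps external API capacity) can be sketched as below. This is only an illustration under stated assumptions: the function names `rollout_steps` and `fraction_range` are hypothetical, the batch model simply accumulates 3% per step and caps at 100%, and nothing here reflects how the actual deployment tooling is implemented.

```python
import math


def rollout_steps(batch_pct: float = 3.0) -> list[float]:
    """Cumulative percentage of instances replaced after each batch.

    Hypothetical model: every batch adds batch_pct, capped at 100%.
    """
    n_batches = math.ceil(100 / batch_pct)
    return [min(100.0, batch_pct * i) for i in range(1, n_batches + 1)]


def fraction_range(capacity_rps: float) -> tuple[float, float]:
    """Return (1/10th, 1/5th) of a capacity figure, in rps."""
    return capacity_rps / 10, capacity_rps / 5


steps = rollout_steps(3.0)
print(len(steps), steps[-1])  # 34 100.0
print(fraction_range(3000))   # (300.0, 600.0)
```

With these assumptions, a full rollout in 3% batches takes 34 steps, and 1/10th to 1/5th of 3k rps works out to roughly 300 to 600 rps.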
[10:22:34] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9706359 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2416.codfw.wmnet with OS bullseye completed: - mw24...
[10:25:45] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9706361 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2415.codfw.wmnet with OS bullseye completed: - mw24...
[10:28:57] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9706364 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2412.codfw.wmnet with OS bullseye completed: - mw24...
[10:32:05] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9706368 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2418.codfw.wmnet with OS bullseye completed: - mw24...
[10:36:27] serviceops, MW-on-K8s, Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9706381 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2413.codfw.wmnet with OS bullseye completed: - mw24...
[10:36:38] serviceops, Machine-Learning-Team, MW-on-K8s, SRE: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316 (Clement_Goubert) NEW
[10:36:51] serviceops, Machine-Learning-Team, MW-on-K8s, SRE: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9706397 (Clement_Goubert) p:Triage→Medium
[11:24:17] serviceops: Package latest version of prometheus-memcached-exporter (v0.14.2) - https://phabricator.wikimedia.org/T350807#9706587 (jijiki)
[11:27:19] serviceops: Repackage memkeys for debian bookworm - https://phabricator.wikimedia.org/T362160#9706604 (jijiki)
[11:28:09] serviceops: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9706613 (jijiki)
[11:28:10] serviceops: Package latest version of prometheus-memcached-exporter (v0.14.2) - https://phabricator.wikimedia.org/T350807#9706612 (jijiki)
[11:28:14] serviceops: Repackage memkeys for debian bookworm - https://phabricator.wikimedia.org/T362160#9706606 (jijiki) Open→Resolved a:jijiki Built and uploaded
[11:29:28] serviceops: Package latest version of prometheus-memcached-exporter (v0.14.2) - https://phabricator.wikimedia.org/T350807#9706616 (jijiki) Open→Resolved a:jijiki Built and repackaged
[11:29:31] serviceops: Package latest version of prometheus-memcached-exporter (v0.14.2) - https://phabricator.wikimedia.org/T350807#9706624 (jijiki)
[11:44:00] serviceops, Machine-Learning-Team, MW-on-K8s, SRE, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9706666 (Clement_Goubert)
[11:49:13] serviceops, Machine-Learning-Team, MW-on-K8s, SRE, Patch-For-Review: Migrate ml-services to mw-api-int - https://phabricator.wikimedia.org/T362316#9706673 (Clement_Goubert) Aaaand I just realized they all use http and not https, so now I can change them all.
[12:05:24] serviceops, Shellbox, Wikibase-Quality-Constraints, Wikidata, and 4 others: [SW] [WBQC] shellbox-constraints returning 500 on preg_match error - https://phabricator.wikimedia.org/T362084#9706725 (Lucas_Werkmeister_WMDE)
[12:08:05] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Move 70% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T360763#9706729 (Clement_Goubert) In progress→Resolved
[12:14:29] serviceops, MW-on-K8s, SRE, Traffic: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323 (Clement_Goubert) NEW
[12:14:53] serviceops, MW-on-K8s, SRE, Traffic: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9706771 (Clement_Goubert) p:Triage→High
[12:32:54] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9706818 (Clement_Goubert)
[12:34:07] serviceops, ops-codfw, SRE: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9706836 (Papaul)
[12:36:10] serviceops, MW-on-K8s, wikitech.wikimedia.org: Migrate Wikitech to Kubernetes - https://phabricator.wikimedia.org/T292707#9706839 (jijiki) a:jijiki
[12:39:53] serviceops, SRE: Memcached, mcrouter in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711#9706842 (jijiki)
[12:39:54] serviceops, MW-on-K8s, SRE: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327#9706844 (jijiki)
[12:39:55] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9706843 (jijiki)
[12:40:44] serviceops, MW-on-K8s, MediaWiki-Platform-Team (Radar), Patch-For-Review: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690#9706845 (jijiki)
[12:40:45] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9706846 (jijiki)
[12:44:50] serviceops, Observability-Logging, Prod-Kubernetes, Kubernetes, Patch-For-Review: Evaluate and enable audit logging for kube-apiserver - https://phabricator.wikimedia.org/T290020#9706851 (JMeybohm)
[12:48:14] serviceops, MW-on-K8s, SRE, Traffic: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9706856 (Clement_Goubert)
[12:48:48] serviceops, Observability-Logging, Prod-Kubernetes, Kubernetes, Patch-For-Review: Evaluate and enable audit logging for kube-apiserver - https://phabricator.wikimedia.org/T290020#9706867 (JMeybohm)
[12:51:03] serviceops, MW-on-K8s, SRE, Traffic: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) - https://phabricator.wikimedia.org/T362323#9706874 (Clement_Goubert)
[15:38:25] serviceops, Abstract Wikipedia team, function-evaluator: Split the monolithic function-evaluator service up in production so we have differently-scalable pods for python 3.7 vs. python 3.8 vs. … - https://phabricator.wikimedia.org/T343389#9707716 (Jdforrester-WMF)
[15:38:58] serviceops, Observability-Logging, Prod-Kubernetes, Kubernetes, Patch-For-Review: Evaluate and enable audit logging for kube-apiserver - https://phabricator.wikimedia.org/T290020#9707712 (colewhite) Per conversation, the team would like to explore the option of a custom index for k8s audit lo...
[18:36:40] serviceops, Release-Engineering-Team, Scap, Patch-For-Review, Sustainability (Incident Followup): scap should check if it is running within a tmux/screen - https://phabricator.wikimedia.org/T361724#9708306 (jijiki) @dancy it would be great if someone could finish this soon. While scap now doe...
[18:51:08] serviceops, SRE: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9708317 (jijiki)
[18:52:19] serviceops: Sunset onhost memcached on mediawiki servers and puppet - https://phabricator.wikimedia.org/T345740#9708319 (jijiki) a:jijiki
[18:53:04] serviceops: Memcache improvements and essential work (FY 23-24) - https://phabricator.wikimedia.org/T352880#9708321 (jijiki)
[18:53:05] serviceops: Sunset onhost memcached on mediawiki servers and puppet - https://phabricator.wikimedia.org/T345740#9708320 (jijiki)
[18:57:01] serviceops: Upgrade memcache and memcached gutter pools to Bookworm - https://phabricator.wikimedia.org/T352891#9708323 (jijiki) p:Triage→High a:jijiki
[20:38:03] serviceops, Deployments, Release-Engineering-Team: MediaWiki deploy servers should not be mediawiki installation targets - https://phabricator.wikimedia.org/T329857#9708545 (dancy) @Clement_Goubert I noticed the `/srv/mediawiki.old.20230424.T329857` directory on deploy1002.eqiad.wmnet toda...