[07:50:37] serviceops, ChangeProp, WMF-JobQueue: Add node-rdkafka metrics for changeprop - https://phabricator.wikimedia.org/T341661 (elukey) As far as I can read from T145099#2738967 it should be sufficient to add the `node-rdkafka-statsd` package and the necessary kafkaConsumer configuration to start emitting...
[09:19:51] serviceops, MW-on-K8s: Migrate mwmaint server functionality to mw-on-k8s - https://phabricator.wikimedia.org/T341560 (Joe)
[09:20:22] serviceops, MW-on-K8s, noc.wikimedia.org, Patch-For-Review: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 (Joe) In progress→Resolved noc.wikimedia.org is migrated. I have some additional improvements I want to make but the task is solved.
[09:54:39] claime: o/
[09:55:00] there's no aqs.discovery.wmnet record - I suspect because at the time there were no codfw aqs instances. Any reason I shouldn't create a discovery record for it? https://gerrit.wikimedia.org/r/c/operations/dns/+/943616
[09:55:08] if you have a min I'd need to brainbounce with you about cpu throttling, since I'd need to wrap my head around it
[09:55:12] https://grafana.wikimedia.org/d/hyl18XgMk/kubernetes-container-details?orgId=1&var-datasource=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=knative-serving&var-pod=autoscaler-69b79469f7-f85xz&var-container=All&from=now-1h&to=now
[09:55:14] (it would make routing requests to the knowledge-gap endpoint much easier)
[09:56:47] hnowlan: o/ service.yaml in puppet doesn't have the discovery settings afaics
[09:57:07] ahh
[09:57:34] cool, that can be fixed. Is there a reason you know of that they weren't created?
[09:57:46] IIRC from https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service it is needed, otherwise the dns reload will error out
[09:58:06] hnowlan: nono I think probably historic ones, since it has been eqiad-only for so long
[09:59:22] hnowlan: the only thing that I can recall is that for some calls aqs uses druid-public, that is eqiad-only
[09:59:40] and I don't think we have TLS between aqs and druid
[10:00:16] it is all public data but... anyway, a discovery record should do just fine
[10:02:06] claime: my question about the above graph would be - why do I see throttling with a high limit and low cpu usage? Is it something related to how CFS splits the 100ms slot across cpus?
[10:02:24] or maybe a miss in our graphs, so far raising limit worked to decrease the throttling
[10:10:07] serviceops, MW-on-K8s: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 (Clement_Goubert)
[10:10:19] elukey: ah wait backlogging
[10:11:23] Yeah looks like the same issue we have with mw-on-k8s
[10:12:01] lovely at least I am not crazy
[10:12:05] Raising the limit will work to decrease the throttling because you give more CPU time per timebucket, so it'll be throttle less time
[10:12:15] throttled*
[10:12:47] let me check the actual cfs graphs
[10:14:26] https://grafana.wikimedia.org/goto/lNrWfCqVk?orgId=1
[10:14:32] that's a lot of throttling
[10:15:01] I don't think we have sub-second resolution so it'll be hard to say for sure
[10:15:34] But I'd wager it consumes all its CPU allocation in a lot less than 100ms (the default quota timebucket), and then has to wait the rest of the time, rinse, repeat
[10:16:37] makes sense yes
[10:16:44] You can try to make the quota period calculation smaller too (CFS Bandwidth Slice Tuning in https://danluu.com/cgroup-throttling/)
[10:16:44] this is what I was thinking as well
[10:17:04] Sorry, CFS Period Tuning
[10:17:45] "Taking the default values of a CFS bandwidth slice of 5ms and CFS period of 100ms, in the worst case, a highly parallel application could exhaust all of its quota in the first bandwidth slice leaving 95ms of throttled time before any thread could be scheduled again."
[10:19:19] the slice is new to me, need to read a bit
[10:19:23] <_joe_> hnowlan: do we have codfw aqs now?
[10:19:37] <_joe_> if we do, just create the record yes
[10:25:10] claime: okok and then we are back to the "remove limits" conversation that we had yesterday
[10:25:17] now I get a better picture okok
[10:25:34] will read more, so much that I don't currently get
[10:25:35] elukey: Yeah, or tune the band, or a number of different quasi-solutions
[10:25:51] That really are just choosing tradeoffs depending on your load type
[11:02:28] _joe_: yep
[11:03:05] <_joe_> TIL
[11:13:44] serviceops, MW-on-K8s: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 (Clement_Goubert) p:Triage→High a:Clement_Goubert
[11:15:04] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[11:15:45] serviceops, MW-on-K8s, Patch-For-Review: mw-on-k8s app container CPU throttling at low average load - https://phabricator.wikimedia.org/T342748 (Clement_Goubert) Open→Stalled Blocked by {T343306}
[11:49:26] serviceops, MW-on-K8s: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 (Clement_Goubert) appservers mw145[1-2].eqiad.wmnet to be renamed and reimaged to kubernetes102[5-6].eqiad.wmnet
[12:17:55] serviceops, MW-on-K8s: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 (Clement_Goubert) mw145[1-2].eqiad.wmnet are actually older generation than what we want for wikikube, I will be taking mw1497 and mw1498 instead.
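Editorial sketch for the CFS throttling exchange above (10:02 to 10:25): a toy model of why a container can show throttling with a high limit and low average CPU usage. It assumes the default 100ms period mentioned in the log, treats the limit as a per-period quota, and lets a burst of work be drained by several parallel threads. The function name, parameters and workload numbers are hypothetical, and the model ignores the 5ms bandwidth slices and everything else the real scheduler does.

```python
from dataclasses import dataclass


@dataclass
class ThrottleStats:
    throttled_periods: int  # periods that ended with work still pending
    throttled_ms: float     # wall-clock time spent waiting for the next period
    avg_cores: float        # average CPU usage over the whole window


def simulate_cfs(quota_ms, period_ms, threads, burst_cpu_ms, burst_every, n_periods):
    """Toy CFS quota model (illustrative only, not the kernel algorithm).

    Every `burst_every` periods a burst needing `burst_cpu_ms` of CPU time
    arrives. `threads` threads run fully in parallel, so quota is burned
    `threads` times faster than wall-clock time passes.
    """
    if quota_ms > threads * period_ms:
        raise ValueError("quota can never be exhausted by this many threads")

    backlog = 0.0
    throttled_periods = 0
    throttled_ms = 0.0
    used_ms = 0.0

    for period in range(n_periods):
        if period % burst_every == 0:
            backlog += burst_cpu_ms

        # Run until either the work or the quota for this period runs out.
        run = min(backlog, quota_ms)
        used_ms += run
        backlog -= run

        if backlog > 0:
            # Quota exhausted with work left: wait for the period to end.
            throttled_periods += 1
            throttled_ms += period_ms - run / threads

    avg_cores = used_ms / (n_periods * period_ms)
    return ThrottleStats(throttled_periods, throttled_ms, avg_cores)


if __name__ == "__main__":
    # Hypothetical workload: 16 threads, 300ms of CPU work once per second.
    for limit_cores in (2, 4):
        stats = simulate_cfs(
            quota_ms=limit_cores * 100, period_ms=100, threads=16,
            burst_cpu_ms=300, burst_every=10, n_periods=100,
        )
        print(limit_cores, stats)
```

In this model the limit of 2 CPUs is throttled in every burst period even though average usage is far below the limit, while raising the limit to 4 CPUs removes the throttling, which matches the observation in the log that raising the limit decreased it.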
[12:56:24] serviceops, Content-Transform-Team-WIP, Mobile-Content-Service, RESTbase Sunsetting, and 2 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (vadim-kovalenko) Hi there! I'm responsible for Kiwix migration to another API, but given the discussion above I'm curiou...
[13:20:20] <_joe_> James_F: how urgent is it to turn on memcached for wikifunctions?
[13:20:30] <_joe_> rephrasing: can it wait for tomorrow?
[13:20:50] <_joe_> if so I can avoid context-switching for today which I'd appreciate :)
[13:21:08] _joe_: Totally happy to wait for tomorrow!
[13:21:20] _joe_: Or even next week, as there's no train theoretically it should be quieter?
[13:21:41] <_joe_> James_F: as you prefer, I will finish the infra patches and apply them tomorrow anyways
[13:21:45] <3
[13:21:47] <_joe_> so that mcrouter is ready
[13:21:52] Thank you for all your help!
[14:05:52] serviceops, MW-on-K8s, Patch-For-Review: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cgoubert@cumin1001 for hosts: `mw[1497-1498].eqiad.wment` - mw1497.eqiad.wment (**FAIL**) - Downtimed host on Icinga/Ale...
[14:14:33] serviceops, collaboration-services, GitLab (CI & Job Runners): Standardize Debian package builds on GitLab CI - https://phabricator.wikimedia.org/T304491 (LSobanski) p:Low→High
[14:19:48] serviceops, MW-on-K8s, Patch-For-Review: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cgoubert@cumin1001 for hosts: `mw[1497-1498].eqiad.wmnet` - mw1497.eqiad.wmnet (**FAIL**) - //Unable to find/resolve the...
[14:20:59] serviceops, MW-on-K8s, Patch-For-Review: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 (Clement_Goubert) The failures above are due to the bad first run of the cookbook due to operator (me) error. The hosts will be wiped by the soon to follow reimage.
[14:33:44] serviceops, SRE: Nutcracker stats monitoring should only listen on localhost - https://phabricator.wikimedia.org/T111934 (lmata) Untagging observability, there doesn't seem anything for us to do; please re-tag if you need us to engage. Thanks!
[14:35:32] serviceops, SRE: Nutcracker stats monitoring should only listen on localhost - https://phabricator.wikimedia.org/T111934 (Joe) Open→Declined
[14:51:33] serviceops, collaboration-services, GitLab (CI & Job Runners): Standardize Debian package builds on GitLab CI - https://phabricator.wikimedia.org/T304491 (taavi) This is what we're currently using for Toolforge packages: https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/tree/main/debian-build...
[15:13:06] serviceops, collaboration-services, GitLab (CI & Job Runners): Standardize Debian package builds on GitLab CI - https://phabricator.wikimedia.org/T304491 (MatthewVernon) I'm looking at this as a KR for this quarter (for some packages we want to deploy), and I'm expecting to use a `dgit`-based approac...
[15:50:10] serviceops, MW-on-K8s, Patch-For-Review: Wikikube CPU capacity issue - https://phabricator.wikimedia.org/T343306 (Clement_Goubert) Network netbox changes done ` kubernetes1025 2013339101888 lsw1-f3-eqiad (WMF11409) ge-0/0/26 kubernetes1026 2013339101893 lsw1-f3-eqiad (WMF11409) ge-0/0/27 ` Wi...
[17:03:23] serviceops, Abstract Wikipedia team, Patch-For-Review, Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (Jdforrester-WMF) Open→In progress
[17:05:18] serviceops, DC-Ops, SRE, ops-codfw: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Jhancock.wm) a:Jhancock.wm
[21:14:49] serviceops, MediaWiki-Platform-Team: Migrate WMF production from PHP 7.4 to PHP 8.1 - https://phabricator.wikimedia.org/T319432 (Krinkle)
[21:53:51] serviceops, Abstract Wikipedia team, function-evaluator: Split the monolithic function-evaluator service up in production so we have differently-scalable pods for python vs. node - https://phabricator.wikimedia.org/T343388 (Jdforrester-WMF)
[21:54:16] serviceops, Abstract Wikipedia team, function-evaluator: Split the monolithic function-evaluator service up in production so we have differently-scalable pods for python 3.7 vs. python 3.8 vs. … - https://phabricator.wikimedia.org/T343389 (Jdforrester-WMF)
[22:01:59] serviceops, Abstract Wikipedia team, function-evaluator: Split the monolithic function-evaluator service up in production so we have differently-scalable pods for python 3.7 vs. python 3.8 vs. … - https://phabricator.wikimedia.org/T343389 (Jdforrester-WMF)
[22:02:03] serviceops, Abstract Wikipedia team, function-evaluator: Split the monolithic function-evaluator service up in production so we have differently-scalable pods for python vs. node - https://phabricator.wikimedia.org/T343388 (Jdforrester-WMF)
[22:19:31] serviceops, MW-on-K8s, SRE, Traffic, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Krinkle)
[22:19:39] serviceops, MW-on-K8s, SRE, Patch-For-Review, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Krinkle)
[22:28:42] serviceops, MW-on-K8s, SRE: MW-on-k8s traffic logs fewer errors than expected (increase in jsonTruncated) - https://phabricator.wikimedia.org/T343390 (Krinkle)
[22:52:03] serviceops, noc.wikimedia.org: Investigate using php-fpm for noc - https://phabricator.wikimedia.org/T337302 (Krinkle) Resolved→Open