[01:07:45] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye
[07:13:41] _joe_: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/941761 for the wikidiff image thing
[07:14:59] <_joe_> akosiaris: yeah sorry, I am currently trying to fix a wtf I did
[07:15:10] no rush
[07:15:18] <_joe_> Failed to pull image "docker-registry.discovery.wmnet/placeholder-for-mediawiki-image-name"
[07:15:20] <_joe_> *sigh*
[07:21:30] serviceops: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 (elukey) {F37150154} After the change in threads the brokers are out of the "danger" zone (below 30% of idle time, as suggested by upstream) but this task should proceed nonethele...
[07:22:02] left some notes to --^
[07:22:36] I think that the main eqiad brokers are ok now, I'd like to do a more complete rebalance in the near future (and package topicmappr etc..) but we can do it later on
[07:22:42] lemme know if it is ok!
[07:23:11] the next step that I'd like to do is to work with hnowlan to push the new node-rdkafka client for changeprop
[07:26:02] elukey: thanks for all that work!
[07:26:12] it's not just ok, it's pretty awesome
[07:30:52] akosiaris: thanks! I still feel that we need to do more work but surely at a slower pace
[07:38:19] <_joe_> akosiaris: now send him a bottle of retsina as appreciation D:
[07:44:21] lol
[07:44:27] I'll make sure to do so
[07:44:28] :P
[07:46:03] serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (akosiaris)
[07:54:03] _joe_ we should have an offsite in Athens to drink it in person (so I'll invite also Ilias :D)
[07:54:49] deal
[07:55:05] <_joe_> elukey: poor ilias why do you want him to drink retsina?
[07:55:10] <_joe_> what did he do to you?
[07:55:24] but let's be clear, a small offsite. I should be counting the participants on the fingers of one hand :P
[07:55:50] _joe_ we can bring some Tavernello with us
[07:56:02] <_joe_> elukey: knife to a gunfight :D
[07:56:13] I don't even need to google that to guess it's not good
[07:56:33] <_joe_> seriously, retsina can only be maybe matched by the worst of Romanella or some Calabrian/Sicilian wine done "the old way"
[07:56:53] <_joe_> the ones with 18 alcohol degrees and a deep sour taste
[07:57:07] if you want to make it serious btw, https://www.tirnavoswinery.gr/proionta/krasia/kokkineli/
[07:57:15] Κοκκινέλι is...
[07:57:42] <_joe_> that is ladybug in italian, more or less
[07:58:04] <_joe_> ahahah and it's done with the red dye that comes from "cocciniglia"
[07:58:10] <_joe_> omg
[07:58:30] <_joe_> akosiaris: 😱
[07:58:52] <_joe_> I am so sorry claime is not here to see their reaction too :D
[07:58:56] _joe_: do you know start to appreciate Retsina perhaps ?
[07:59:00] now*
[07:59:14] <_joe_> akosiaris: retsina seems great compared to that thing
[07:59:21] spot on
[07:59:25] <_joe_> I mean i can testify at least it's not toxic
[08:00:11] o/
[08:00:18] I heard something about retsina :)
[08:00:53] isaranto: o/. Giuseppe is an expert in Retsina. He has at least 4 different bottles of retsina at his house
[08:01:57] <_joe_> akosiaris: alice and her friends drank 2
[08:02:22] <_joe_> not even 20-yrs old italians can finish retsina, even with the promise of free alcohol
[08:02:24] <_joe_> :D
[08:02:48] no wonder ppl mix it with coke
[08:04:09] ahhh wow nice, retsina and coke
[08:04:21] you have a weird definition of nice
[08:07:07] serviceops, ChangeProp, WMF-JobQueue: Check if node-rdkafka's version on changeprop can be upgraded from 2.8.1 - https://phabricator.wikimedia.org/T341140 (elukey) @hnowlan I checked and the kafka client should be safe to be upgraded, it doesn't use zookeeper or any old thing, so I guess that we can...
[08:08:11] <_joe_> isaranto: the greek kalimotxo?
[08:08:13] <_joe_> TIL
[08:08:22] exactly!
[08:08:26] <_joe_> omg
[08:08:46] akosiaris: that was a sarcastic comment, I am not a sommelier but I have some minimal quality threshold that I respect :D :D
[08:08:59] elukey: 😉
[08:10:18] <_joe_> I have a can of coke and a bottle of retsina
[08:10:23] <_joe_> maybe I should try
[08:11:40] serviceops, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (akosiaris) Open→Resolved php7.4-fpm-multiversion-base rebuilt as well, should make it out to mw-on-k8s in the next deployments. I think we can r...
[08:13:50] serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (akosiaris) I 've gone ahead and created https:/...
[08:36:54] serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (akosiaris) I 've gone ahead and populated the S...
[08:43:01] serviceops, SRE: Request to block ActionApi client (based on a specific user agent header) - https://phabricator.wikimedia.org/T243858 (akosiaris) Open→Declined I am gonna close this as declined. While we do have the ability to block requests based on user-agent, we don't do that on request.
[08:51:22] serviceops, SRE, Abstract Wikipedia team (Phase λ – Launch), Patch-For-Review, Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (Jdforrester-WMF) >>! In T297314#9043540, @akosi...
[09:06:34] serviceops, MW-on-K8s, Patch-For-Review: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 (CodeReviewBot) oblivian opened https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/38 Include docroot/noc in the image we build
[09:08:08] _joe_: are you coming?
[09:08:30] <_joe_> duesen: sigh sorry
[09:08:36] <_joe_> I will blame gitlaab
[09:08:52] no worries
[09:21:20] akosiaris: The last step before being able to call the service from MW is adding it to hieradata/common/profile/services_proxy/envoy.yaml right? Should I push a patch to do that?
[09:27:23] (Pushed: https://gerrit.wikimedia.org/r/c/operations/puppet/+/941856)
[09:36:07] serviceops, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (TheresNoTime) Thanks for all your help @akosiaris!
[09:36:29] (oops, sorry for IRC ping)
[09:57:13] <_joe_> James_F: uhh call wikifunctions from mediawiki?
[09:57:30] <_joe_> just for wikifunctions.org?
[09:57:38] Yes.
[09:57:53] <_joe_> then yes :)
[09:58:01] Cool.
[09:58:17] Could you deploy that?
[10:04:03] serviceops, Prod-Kubernetes, Kubernetes: Alert triage - https://phabricator.wikimedia.org/T342250 (JMeybohm) a:JMeybohm The p99 latency for list_image calls is increasing since mid June on codfw and eqiad wikikube and is most likely a consequence of the sheer amount of images that we pull/force p...
[10:25:13] serviceops, Prod-Kubernetes, Kubernetes: Alert triage - https://phabricator.wikimedia.org/T342250 (akosiaris) > Increase the threshold of the alert from 1s to 2s (or 1.5) as I'm not aware of any issues arising from this Seems definitely the most promising and easy to do.
[10:25:27] James_F: done
[10:27:58] akosiaris: Thanks! And now I need to add an entry in deployment-chart's `.fixtures/service_proxy.yaml`?
[10:29:24] well: for the orchestrator to talk to MediaWiki ? yes.
[10:29:30] Yeah.
[10:30:16] in the same patch, enabled it in values.yaml section of the orchestrator in helmfile.d. See e.g. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/918407
[10:30:34] Ack.
[10:36:21] Like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/941865/ plus some config in… the MW charts for MW-on-k8s to see the service? Or is that implicit?
[10:42:42] no, it is not implicit. you need the listeners part
[10:43:25] like in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/918407/4/helmfile.d/services/cxserver/values.yaml
[10:45:17] ah wait, the mediawiki listeners part is on deploy1002, let me submit a patch for that
[10:49:30] Ah, thanks.
[10:56:20] James_F: https://gerrit.wikimedia.org/r/c/operations/puppet/+/941888
[11:00:27] Aha.
[11:32:31] serviceops, Data Engineering and Event Platform Team, MW-on-K8s, SRE: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (Clement_Goubert) We've been experiencing throttling on mw-api-int and raising the container's CPU limit has helped, but not fixe...
[11:43:59] serviceops, MW-on-K8s: mw-on-k8s php-fpm container CPU throttling at low average load - https://phabricator.wikimedia.org/T342748 (Clement_Goubert)
[11:44:54] serviceops, MW-on-K8s: mw-on-k8s php-fpm container CPU throttling at low average load - https://phabricator.wikimedia.org/T342748 (Clement_Goubert) p:Triage→High In my opinion, we need to fix this before moving forward with migrating more traffic to mw-on-k8s.
[11:46:29] serviceops, MW-on-K8s: mw-on-k8s app container CPU throttling at low average load - https://phabricator.wikimedia.org/T342748 (Clement_Goubert)
[11:59:24] Last thing to do is deploy one of https://gerrit.wikimedia.org/r/c/operations/puppet/+/941775 or https://gerrit.wikimedia.org/r/c/operations/puppet/+/941314 to put the service in the production state?
[12:01:54] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Alert triage - https://phabricator.wikimedia.org/T342250 (Clement_Goubert) Sorry for cookie licking, I was in the neighborhood.
[12:04:15] in this week's sprint I found `docker-registry.wikimedia.org/python3-build-jessie` which I guess can be phased out entirely? https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/941442/ :]
[12:05:33] and I don't know who can review my series of patches for the python build images, I notably need a Bullseye python 2.7 image in order to migrate Zuul to a Bullseye server ( https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/940161/ )
[12:06:18] and I have a couple more patches enhancing the build script
[12:19:11] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Alert triage - https://phabricator.wikimedia.org/T342250 (JMeybohm) >>! In T342250#9044170, @Clement_Goubert wrote: > Sorry for cookie licking, I was in the neighborhood. At least we did the same thing :-) - merged my tests into your CR.
[12:24:27] serviceops, Abstract Wikipedia team, Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (Jdforrester-WMF)
[12:25:44] serviceops, Abstract Wikipedia team, Service-deployment-requests: New Service Request memcached-wikifunctions - https://phabricator.wikimedia.org/T297815 (Jdforrester-WMF)
[12:34:16] serviceops: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342755 (LSobanski)
[12:38:00] serviceops: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342756 (LSobanski)
[12:53:35] serviceops, sre-alert-triage: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342755 (fgiunchedi)
[12:54:10] serviceops, sre-alert-triage: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342756 (fgiunchedi)
[12:54:30] serviceops, Prod-Kubernetes, Kubernetes, sre-alert-triage: Alert triage - https://phabricator.wikimedia.org/T342250 (JMeybohm)
[12:56:56] serviceops, Prod-Kubernetes, Kubernetes, sre-alert-triage: Alert triage: KubeletOperationalLatency - https://phabricator.wikimedia.org/T342250 (JMeybohm)
[12:57:21] serviceops, Prod-Kubernetes, Kubernetes, sre-alert-triage: Alert triage: KubeletOperationalLatency - https://phabricator.wikimedia.org/T342250 (JMeybohm) Open→Resolved I believe this is resolved now
[12:57:41] serviceops, sre-alert-triage: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342758 (LSobanski)
[13:07:42] serviceops, MW-on-K8s: mw-on-k8s app container CPU throttling at low average load - https://phabricator.wikimedia.org/T342748 (TK-999) We were consistently throttled until we set limits == FPM worker count. Per the description (and Dan Luu's insightful foray[1]) into the topic, I don't think there is muc...
[13:08:01] serviceops: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342760 (LSobanski)
[13:08:20] serviceops, sre-alert-triage: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342758 (JMeybohm) Open→Resolved a:JMeybohm >>! [[ https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Envoy | Runbook ]] said: > If you see an error about runtime variables being...
[13:08:50] serviceops, sre-alert-triage: Alert triage: EnvoyRuntimeAdminOverrides on restbase1027 - https://phabricator.wikimedia.org/T342758 (JMeybohm)
[13:09:49] serviceops, sre-alert-triage: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342761 (LSobanski)
[13:15:34] serviceops, sre-alert-triage: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342756 (JMeybohm) This looks like a fallout from {T341859}
[13:35:17] serviceops, MW-on-K8s: mw-on-k8s app container CPU throttling at low average load - https://phabricator.wikimedia.org/T342748 (Joe) Thanks @TK-999 indeed for a PHP application that doesn't shell out much like MediaWiki the number of workers is a hard limit on the amount of CPUs it can use, which is rough...
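(Editor's sketch, not from the channel.) The sizing rule TK-999 describes on T342748, a CPU limit equal to the php-fpm worker count, can be written down in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, and the two counters passed to `throttled_fraction` are assumed to be the cAdvisor values `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`, sampled over the same window. It is not the configuration actually used for mw-on-k8s.

```python
def cpu_limit_for_workers(fpm_workers: int) -> str:
    """Return a Kubernetes CPU limit giving one whole CPU per php-fpm worker.

    Hypothetical helper: each worker is assumed to be single-threaded, so the
    worker count is the upper bound on CPUs the container can keep busy.
    """
    if fpm_workers < 1:
        raise ValueError("need at least one php-fpm worker")
    return str(fpm_workers)


def throttled_fraction(throttled_periods: int, total_periods: int) -> float:
    """Fraction of CFS scheduling periods in which the container was throttled.

    Both arguments are counter deltas over the same time window (assumed to
    come from the cAdvisor CFS metrics named in the note above).
    """
    if total_periods <= 0:
        return 0.0
    return throttled_periods / total_periods


if __name__ == "__main__":
    # Example numbers are made up for illustration only.
    print(cpu_limit_for_workers(8))
    print(f"{throttled_fraction(5, 2000):.4%}")
```

Setting the limit from the worker count trades some unused headroom for predictability: the container can never be asked to run more workers than it has CPUs.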
[13:38:48] serviceops, MW-on-K8s: mw-on-k8s app container CPU throttling at low average load - https://phabricator.wikimedia.org/T342748 (Joe) How I evaluated the current seconds per worker: I used the following formula in promQL: `sum(rate(container_cpu_usage_seconds_total{cluster="$cluster", id=~"/system.slice/...
[13:41:09] serviceops, MW-on-K8s: mw-on-k8s app container CPU throttling at low average load - https://phabricator.wikimedia.org/T342748 (Clement_Goubert) Thanks for the insight @TK-999 When you say "limits == FPM worker count", do you mean one whole CPU per worker? Did you use pinning as well? As I understand it,...
[13:51:45] serviceops: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342760 (Clement_Goubert) a:Clement_Goubert
[13:52:34] serviceops, MW-on-K8s: mw-on-k8s app container CPU throttling at low average load - https://phabricator.wikimedia.org/T342748 (TK-999) @Clement_Goubert Yeah, we currently set a limit of 1 CPU per worker. We have not experimented with pinning. In practice, this keeps throttling at < 0.25% - likely becaus...
[14:25:15] hnowlan, kamila_ - o/ how do you prefer to proceed with the changeprop's deployment?
[14:36:23] serviceops, Patch-For-Review: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342760 (Clement_Goubert) Open→In progress p:Triage→Medium
[14:46:59] serviceops, Patch-For-Review: Alert triage: overdue warning alert - https://phabricator.wikimedia.org/T342760 (Clement_Goubert) In progress→Resolved Alerts cleared and underlying cause resolved.
[14:54:03] elukey: sorry, bit busy for the rest of the day with interview/meetings :( If you feel okay going solo be my guest but otherwise could do it tomorrow morning maybe?
[14:54:20] hnowlan: sure sure!
[14:54:31] tomorrow is fine :)
[15:27:37] serviceops, MW-on-K8s, Patch-For-Review: Move noc.wikimedia.org to kubernetes - https://phabricator.wikimedia.org/T341859 (CodeReviewBot) dancy merged https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/38 Include docroot/noc in the image we build
[15:31:45] akosiaris: o/ do you recall the tcp conn timeout for ores-legacy? I think I may have found a clue, namely that the envoy tls proxy gets cpu-throttled when a big query is made and there are not enough TCP connections in the pool already established
[15:32:30] when the http conn pool (envoy) is ready, no more errors
[15:33:00] I tried to bump the cpu limits/request etc.. and the "convergence" happens quicker every time
[15:33:10] it may be just a matter of having more pods etc..
[15:33:22] and tune the limits for the envoy proxy
[15:37:32] elukey: in a meeting, will be with you in 25 minutes
[15:57:42] no problem, we can discuss it tomorrow, it was just a fyi about those weird conns
[16:20:12] ok
[17:03:00] elukey: quick question: what would happen if I decided to read all the messages in, say, mediawiki.page-create (on kafka-main) at once?
[17:03:45] would anyone notice or is reading kafka cheap?
[17:45:06] To kafka? Nothing really. Done that a couple of times with kafkacat by mistake.
[17:45:29] To your program? I don't know, is it ready for that?
[17:45:41] ah, okay, cool, thank you akosiaris!
[17:46:24] I want to run a benthos smoke test, so... we'll see if it smokes :D
[17:47:13] I might want to think about the resources/limits though, that is a good point, thanks
[19:17:23] kamila_: sorry just seen it, yes definitely what Alex said, the kafka brokers wouldn't really be affected (they heavily use page cache etc..)
[19:17:45] but your client may, especially if you handle msgs read in memory etc..
[19:18:17] the client is what I want to test, I'll make sure to set reasonable limits so I don't eat all the memory
[19:18:24] thank you both!
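(Editor's sketch, not from the channel.) The client-side concern raised above, reading a whole topic without holding every message in memory, can be illustrated with a generator that caps how many bytes are buffered at once. This is a minimal sketch under stated assumptions: `bounded_batches` and `max_batch_bytes` are hypothetical names, and the message source is any plain iterable of `bytes` rather than a real Kafka or benthos client.

```python
from typing import Iterable, Iterator, List


def bounded_batches(messages: Iterable[bytes], max_batch_bytes: int) -> Iterator[List[bytes]]:
    """Group messages into batches whose payload stays within max_batch_bytes.

    Only one batch is held in memory at a time. A single message larger than
    the cap is still emitted, alone in its own batch.
    """
    batch: List[bytes] = []
    size = 0
    for msg in messages:
        # Flush the current batch before it would exceed the cap.
        if batch and size + len(msg) > max_batch_bytes:
            yield batch
            batch, size = [], 0
        batch.append(msg)
        size += len(msg)
    if batch:
        yield batch


if __name__ == "__main__":
    # Stand-in for a topic: a lazily generated stream of small payloads.
    stream = (f"event-{i}".encode() for i in range(10))
    for chunk in bounded_batches(stream, max_batch_bytes=32):
        print(len(chunk), sum(len(m) for m in chunk))
```

Because the input is consumed lazily, peak memory is bounded by the cap plus one message, whatever the size of the topic.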
[21:04:33] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye
[21:04:44] serviceops, DC-Ops, SRE, ops-eqiad: Q4:rack/setup/install rdb101[34] - https://phabricator.wikimedia.org/T326170 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host rdb1013.eqiad.wmnet with OS bullseye executed with errors: - rdb1013 (**FAIL**) - Rem...
[21:26:37] serviceops, MediaWiki-Internationalization, MediaWiki-extensions-General, WMF-General-or-Unknown, and 3 others: Update footer links to direct to proper locations on Foundation Governance Wiki - https://phabricator.wikimedia.org/T331680 (Pppery) Anything left to do here? It's been 6 weeks.