[06:52:26] hi folks
[06:53:06] the 10ms settings for eventgate's kafka batch improved the metrics a little, I think that we found a good compromise for the moment, I wouldn't go much further
[07:17:55] <_joe_> +1
[07:18:07] <_joe_> the improvement is pretty amazing if you compare to the original setting
[07:27:12] serviceops, Foundational Technology Requests, Prod-Kubernetes, Kubernetes: Kubernetes v1.23 use PKI for service-account signing (instead of cergen) - https://phabricator.wikimedia.org/T329826 (JMeybohm)
[07:37:43] super thanks :)
[07:37:56] I'll concentrate on the kafka partitions rebalance then
[07:42:28] I hoped for some improvements on the flame graphs too
[07:47:25] but they may come as we free some load from kafka-main100[1-3]
[08:07:40] https://grafana-rw.wikimedia.org/d/000000027/kafka?forceLogin&from=now-7d&orgId=1&to=now&viewPanel=71 seems to have improved as well
[08:15:49] serviceops, RESTbase Sunsetting, Epic, Platform Engineering Roadmap: Replace usage of RESTbase parsoid endpoints - https://phabricator.wikimedia.org/T328559 (DAlangi_WMF)
[08:28:17] <_joe_> elukey: "seems to have improved" for a 50% latency reduction seems selling yourself short :)
[08:29:54] _joe_ I need to verify if it was due to the changeprop + eventgate changes or not, still haven't verified all timings :)
[08:30:01] but surely something is due to it
[08:30:05] <_joe_> I mean
[08:53:32] hnowlan: o/
[08:54:12] I tried to check if changeprop exposes node-rdkafka metrics, but I don't think it does (so kafka specific producer metrics)
[08:54:35] I know that it uses statsd, so maybe https://github.com/wikimedia/node-rdkafka-statsd could work (seems very old, maybe we need to sync from upstream)
[08:55:06] eventgate uses https://github.com/Collaborne/node-rdkafka-prometheus
[08:55:21] (it is unmaintained but works nicely afaics)
[08:56:57] elukey: oh yeah, node-rdkafka-statsd looks promising. tbf changeprop itself is also very old :D
[08:57:10] :D
[09:00:32] I'll open a task
[09:02:37] serviceops, ChangeProp, WMF-JobQueue: Add node-rdkafka metrics for changeprop - https://phabricator.wikimedia.org/T341661 (elukey)
[09:02:40] there --^
[09:02:58] nice
[09:04:17] in theory it should be a one line change in the code
[09:05:00] we could try/piggyback-it with the new node-rdkafka version
[09:07:50] yeah, seems reasonable. Easy enough to verify offline or in staging too
[09:08:06] I need to test that buster version out today
[09:11:22] I have zero nodejs knowledge but I can try to check if I can add the new config
[09:29:46] serviceops, SRE, Thumbor, Thumbor Migration, and 2 others: Future of Thumbor's memcached backend - https://phabricator.wikimedia.org/T318695 (jijiki)
[09:37:51] ah wow node-rdkafka-statsd is not a mirror, we developed it
[10:11:17] serviceops: Allow for multiple confd instances in in pupper - https://phabricator.wikimedia.org/T341669 (JMeybohm)
[10:11:41] serviceops: Allow for multiple confd instances in pupper - https://phabricator.wikimedia.org/T341669 (JMeybohm)
[10:29:38] serviceops, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (MoritzMuehlenhoff) JFTR, since I'm away for two weeks: When the tests are complete, the 1.14.1 packages can be imported in reprepro from /home/jmm/impor...
[11:22:45] serviceops, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (TheresNoTime) Open→Stalled >>! In T340087#9008282, @MoritzMuehlenhoff wrote: > JFTR, since I'm away for two weeks: When the tests are complete,...
[11:35:34] serviceops, Beta-Cluster-Infrastructure, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Install wikidiff2 1.14.1 deb on deployment-prep & test - https://phabricator.wikimedia.org/T340542 (dom_walden) @GMikesell-WMF and I have finished our testing here. We have not found any m...
[11:42:49] serviceops, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (TheresNoTime) Stalled→Open [[ https://phabricator.wikimedia.org/T340542#9008492 | QA OK'd ]], we can proceed with the next steps of deployment (...
[11:43:07] serviceops, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (TheresNoTime)
[11:43:22] serviceops, Beta-Cluster-Infrastructure, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Install wikidiff2 1.14.1 deb on deployment-prep & test - https://phabricator.wikimedia.org/T340542 (TheresNoTime) In progress→Resolved Thank you! Work continues at {T340087}
[11:48:06] o/ ref T340087, and given that mor/itzm is OOO starting tomorrow, would anyone be interested in getting wikidiff2 deployed to some mw canaries?
[11:48:18] T340087
[11:48:30] (ok, https://phabricator.wikimedia.org/T340087)
[11:55:44] TheresNoTime: I see 1.13 deployed on mwdebug1002, I can take over and deploy to canaries and rest of mwdebugs
[11:56:08] everything else is on 1.11.0-1~bpo10+1 btw, only mwdebug is at 1.13.0-1
[11:56:13] akosiaris: it's `1.14.1` we're deploying fwiw
[11:56:29] ah just saw the last comment in the task
[11:57:10] and moritz left instructions, no need for me to run divination incantations
[11:57:15] I'll take it over
[11:58:12] TheresNoTime: I like how QA ok in just 20m :-)
[11:58:14] \o/
[11:58:20] ok'd*
[11:58:42] we have the best QA engineers, that's why :p
[11:58:48] * TheresNoTime hides
[12:06:15] ok, I was wrong, I am starting to sync and collect the correct information now
[12:08:55] <3
[12:09:00] (no rush!)
[12:29:58] serviceops, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (akosiaris) a:akosiaris Looking quickly at mw-canaries and mwdebug, they all have 1.13.0-1+wmf1+buster1 ` akosiaris@cumin1001:~$ sudo cumin 'A:mw-ca...
[12:40:20] akosiaris: is there a way to specifically use the canaries when browsing projects?
[12:40:46] TheresNoTime: they wouldn't be canaries if there was. They would be something else.
[12:40:59] but mwdebug is in that list and yes you can use mwdebug
[12:41:01] that's.... a good point
[12:41:09] ah, yeah! Thank you :)
[12:50:34] yw
[12:53:32] serviceops, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (MoritzMuehlenhoff) JFTR; I've also rebuilt/uploaded wikidiff 1.41.1 for component/icu67 (so that we don't regress when the ICU67 migration starts)
[12:56:25] serviceops, wikidiff2, Better-Diffs-2023, Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.1 - https://phabricator.wikimedia.org/T340087 (akosiaris) >>! In T340087#9008758, @MoritzMuehlenhoff wrote: > JFTR; I've also rebuilt/uploaded wikidiff 1.41.1 for component/icu67 (so that we don't re...
[13:28:29] serviceops, Observability-Metrics, observability: Scrape envoy runtime metrics in ops & k8s prometheus - https://phabricator.wikimedia.org/T341554 (JMeybohm) Open→Resolved
[13:29:46] serviceops, Prod-Kubernetes, Kubernetes: Write a wrapper function combining pki::get_cert and k8s::kubeconfig - https://phabricator.wikimedia.org/T337826 (JMeybohm) p:Triage→Low
[14:26:40] jayme per our conversation yesterday, here's a ticket requesting use of Kafka-main https://phabricator.wikimedia.org/T341625
[14:26:58] ^^ let us know if you need more info on that one
[15:16:01] serviceops: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 (elukey) From [[ https://thanos.wikimedia.org/graph?g0.expr=sort_desc(topk(1000%2C%20sum(irate(kafka_server_BrokerTopicMetrics_MessagesIn_total%7Bkafka_cluster%3D%22main-eqiad%22%2...
[15:16:17] claime, _joe_ --^
[15:24:03] dcausse: o/ - do you think that we could increase partitions on eqiad.mediawiki.job.cirrusSearchLinksUpdate ?
[15:24:26] any particular tool that would not like it? (we'll take care of the job queues)
[15:46:15] elukey: you mean partitioning by change prop rules, i.e. creating a new partitioned topic (a la eqiad.mediawiki.job.cirrusSearchElasticaWrite -> eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite)
[15:46:47] or partitioning the existing eqiad.mediawiki.job.cirrusSearchLinksUpdate topic?
[15:47:01] serviceops: Rebalance kafka partitions in main-{eqiad,codfw} clusters - 2023 edition - https://phabricator.wikimedia.org/T341558 (elukey) Proposed changes to both main eqiad and codfw: ` kafka topics --topic eqiad.mediawiki.job.cirrusSearchLinksUpdate --alter --partitions 3 kafka topics --topic codfw.mediaw...
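A minimal sketch of why increasing partitions on an existing topic, as in the `--alter --partitions 3` proposal above, can matter to consumers: with a hash-based partitioner, the same message key may map to a different partition once the count changes, so per-key ordering is not preserved across the change. The CRC32-modulo function, the `page-N` keys, and the old partition count of 1 are assumptions for illustration only; the log does not state the current partition count or which partitioner the producers use.

```python
import zlib


def pick_partition(key: str, num_partitions: int) -> int:
    """Hypothetical hash partitioner: CRC32 of the key modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_keys(keys, old_count: int, new_count: int):
    """Return the keys whose partition differs between the two partition counts."""
    return [
        k for k in keys
        if pick_partition(k, old_count) != pick_partition(k, new_count)
    ]


# Hypothetical keys and an assumed old partition count of 1.
keys = [f"page-{i}" for i in range(1000)]
moved = moved_keys(keys, old_count=1, new_count=3)
print(f"{len(moved)} of {len(keys)} keys map to a different partition")
```

Messages without a key are not affected in this way, since they are not tied to a partition by hash in the first place.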
[15:47:16] dcausse: the latter yes
[15:47:46] changeprop doesn't like when we do it, but after a roll restart everything flows back to normal (I think it is a bug in the old version of node-rdkafka)
[15:48:00] I was more worried if there was another consumer etc.. on your side that you'd need some heads up for
[15:48:03] elukey: no objection from me if this is supported by eventgate/changeprop
[15:48:09] super
[22:00:33] serviceops, Observability-Tracing: Helmchart for OpenTelemetry Collector - https://phabricator.wikimedia.org/T324117 (RLazarus) Open→Resolved
[22:00:35] serviceops, Observability-Tracing: OpenTelemetry Collector running as a DaemonSet on Wikikube - https://phabricator.wikimedia.org/T320564 (RLazarus)
[22:01:23] serviceops, Observability-Tracing: OpenTelemetry Collector running as a DaemonSet on Wikikube - https://phabricator.wikimedia.org/T320564 (RLazarus) Open→Resolved Up and running!
[22:01:25] serviceops, Observability-Tracing: Package OpenTelemetry Collector atop our own base Docker images - https://phabricator.wikimedia.org/T320552 (RLazarus)