[07:39:33] serviceops, MinT, Prod-Kubernetes, Kubernetes, and 2 others: Remove the use of :latest image tags in production - https://phabricator.wikimedia.org/T348856 (Jelto) Miscweb is no longer using the `latest` image for `prometheus-apache-exporter`
[07:43:41] serviceops, SRE: Memcached, mcrouter in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (Joe)
[07:43:49] serviceops, MW-on-K8s: Define the size of a pod for mediawiki in terms of resource usage - https://phabricator.wikimedia.org/T278220 (Joe) Open→Resolved a:Joe I would say this has been resolved for a long time?
[07:49:05] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (Joe)
[07:53:40] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (Joe) p:Triage→High
[08:11:48] serviceops, Patch-For-Review: Upgrade the MediaWiki servers to ICU 67 - https://phabricator.wikimedia.org/T345561 (JMeybohm)
[08:21:44] hello folks
[08:22:11] so I filed https://gerrit.wikimedia.org/r/c/mediawiki/services/change-propagation/+/968986 for change propagation, to validate my theory after profiling. Let me know your thoughts :)
[08:23:54] sounds good to me
[08:25:55] <3 going to test
[08:54:28] serviceops, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, SRE, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (Urbanecm_WMF) Still no errors. I've increased job concurrency to 10, enabled new Impact backend on all Wi...
[08:54:38] serviceops, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, SRE, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (Urbanecm_WMF) a:Urbanecm_WMF
[09:03:28] serviceops, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, SRE, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (Urbanecm_WMF)
[09:03:56] serviceops, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, SRE, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (Urbanecm_WMF)
[09:07:56] serviceops, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, SRE, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (Urbanecm_WMF)
[09:08:17] serviceops, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, SRE, and 2 others: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (Urbanecm_WMF)
[09:14:31] serviceops, Growth-Positive-Reinforcement, Growth-Team, GrowthExperiments-ImpactModule, PageViewInfo: Daily pageview/PageViewInfo errors on jobrunners - https://phabricator.wikimedia.org/T348517 (Urbanecm_WMF) The Growth team has a job that runs at 07:45 UTC and heavily relies on pageview dat...
[09:16:50] nope no joy, I still see the cpu usage
[09:25:07] dang
[09:33:40] or better, something improved, but I still see the rise in CPU
[09:33:55] I'll try to use the perf support for node
[09:33:59] maybe it is more clear
[09:44:52] <_joe_> yeah and tbh increasing the polling interval doesn't feel that great as a solution
[09:45:13] <_joe_> don't get me wrong, it was a good attempt to try and assess if that was the root cause
[09:48:20] yes yes :)
[09:48:36] I am still convinced it is related to timers, 10ms was really tight
[09:48:44] and indeed something improved, but not as I hoped
[10:12:29] ok now the metrics look better
[10:12:56] see for example:
[10:12:57] https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&var-dc=eqiad%20prometheus%2Fk8s-staging&from=now-3h&to=now&viewPanel=54
[10:13:26] one thing that I didn't notice is that user time indeed grows, but system decreases as well
[10:13:37] and perf shows most of the time spent in librdkafka basically
[10:14:00] so overall, with these settings, we are good
[10:14:24] I need to check previous graphs, but so far it seems that CPU increases a little right after the deploy
[10:14:27] then it stabilizes
[10:17:38] so, if everybody agrees, I'd keep the 100ms setting and move to test some rules in staging
[10:17:40] <_joe_> there's a huge throttling event after startup yes
[10:17:50] <_joe_> elukey: ++
[10:17:58] super thanks <3
[10:24:03] to complete - there will be some increase in cpu, but it doesn't seem too dramatic
[10:24:48] that looks pretty good - and a spike after deploy has more or less always happened afaik
[10:30:50] the thing that I need to verify is what was the original baseline, I thought I reverted to the original image to test, but it may have not
[10:31:20] if I look back 7d ago I see a lower baseline, but if we sum system+user the change is not dramatic
[10:32:30] the avg per pod seems to be ~80ms -> 110ms
[10:34:28] I'll test some rules and then wrap up, after that
we can decide if we are ready for prod
[10:49:49] serviceops, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, SRE, and 2 others: refreshUserImpactJob logs mysterious fatal errors - https://phabricator.wikimedia.org/T344428 (Urbanecm_WMF)
[10:50:49] serviceops, Beta-Cluster-Infrastructure: Unable to upload files on Beta Commons - https://phabricator.wikimedia.org/T340908 (jijiki) `redis::multidc` was a very complicated and not well maintained part of the infrastructure, so as soon as we moved `MainStash` out of redis T212129, we wanted it out of pup...
[10:50:55] serviceops, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, SRE, and 2 others: refreshUserImpactJob logs mysterious fatal errors - https://phabricator.wikimedia.org/T344428 (Urbanecm_WMF)
[11:00:30] serviceops, Growth-Positive-Reinforcement, Growth-Team, GrowthExperiments-ImpactModule: refreshUserImpactJob requires a high number of file descriptors - https://phabricator.wikimedia.org/T349809 (Urbanecm_WMF)
[11:04:16] serviceops, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, SRE, and 2 others: refreshUserImpactJob logs mysterious fatal errors - https://phabricator.wikimedia.org/T344428 (Urbanecm_WMF) Open→Resolved Let's be optimistic and call this resolved, since the errors disappeared. I...
[12:07:00] serviceops, Patch-For-Review: Upgrade the MediaWiki servers to ICU 67 - https://phabricator.wikimedia.org/T345561 (JMeybohm) libxml2 got a sec update after the packages have been built against icu67 so we need to include that before we can update deployment-prep. @MoritzMuehlenhoff will take care of that...
[13:36:19] serviceops, Patch-For-Review: Upgrade the MediaWiki servers to ICU 67 - https://phabricator.wikimedia.org/T345561 (JMeybohm)
[13:44:13] <_joe_> elukey: what did you do with changeprop in staging at 13:30?
[13:44:19] <_joe_> the cpu usage plummeted
[13:45:29] yes yes
[13:46:58] I reverted 10 mins ago to the prev image
[13:47:10] <_joe_> ah I see :)
[13:47:20] <_joe_> I hoped you found a magic solution
[13:47:26] :(
[13:56:17] the thing that I still don't understand is the variation in network usage
[13:56:41] the cpu dropped but we are talking ~110ms -> ~75ms
[13:56:55] (per pod)
[13:57:28] but I'll try to dig a bit more, my impression is that it is all librdkafka related
[13:57:33] (from what I can see from perf)
[13:59:58] serviceops, docker-pkg, Release Pipeline (Blubber): Fix how we keep docker-pkg based images up to date - https://phabricator.wikimedia.org/T344478 (Joe) If your desire is having deterministic builds, it would be enough to NOT remove the apt archives from the base layer image, and then only run apt-ge...
[14:05:54] serviceops, docker-pkg, Release Pipeline (Blubber): Fix how we keep docker-pkg based images up to date - https://phabricator.wikimedia.org/T344478 (Joe) For now it would be enough for us to just get a gerrit account that we can use to: * submit and merge a change per week that adds a new changelog en...
[14:12:09] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (JMeybohm)
[14:23:11] serviceops, Data Engineering and Event Platform Team, Data-Engineering, Event-Platform: A rolling restart of eventgate-main seems to cause many client failures - https://phabricator.wikimedia.org/T349823 (dcausse)
[14:23:35] serviceops, VisualEditor, MW-1.39-notes (1.39.0-wmf.21; 2022-07-18), Parsoid (Tracking): Preemptively warm caches for Parsoid output - https://phabricator.wikimedia.org/T301371 (MSantos)
[14:26:18] serviceops, Data Engineering and Event Platform Team, Data-Engineering, Event-Platform: A rolling restart of eventgate-main seems to cause many client failures - https://phabricator.wikimedia.org/T349823 (dcausse) {F40321464}
[14:46:39] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (MSantos) @Jdforrester-WMF here's a few questions that I have: **Recommendation API ownership** The former #product-infrastru...
[14:48:29] serviceops, Observability-Tracing: OpenTelemetry Collector running as a DaemonSet on Wikikube - https://phabricator.wikimedia.org/T320564 (fgiunchedi)
[15:06:12] serviceops, Data Engineering and Event Platform Team, Data-Engineering, Event-Platform: A rolling restart of eventgate-main seems to cause many client failures - https://phabricator.wikimedia.org/T349823 (Joe) The problem can also be that we have one component in front of the service (envoyproxy)...
[15:58:15] serviceops, Beta-Cluster-Infrastructure: Unable to upload files on Beta Commons - https://phabricator.wikimedia.org/T340908 (Etonkovidova)
[16:22:21] serviceops, MW-on-K8s: Handle sidecar containers in one-off Kubernetes jobs - https://phabricator.wikimedia.org/T348284 (JMeybohm) @jijiki pointed out we already have a CronJob running with a mesh sidecar in production. That is from the tegola chart and it uses a wrapper script to ensure envoy is up befo...
[16:27:13] serviceops, Beta-Cluster-Infrastructure: Unable to upload files on Beta Commons - https://phabricator.wikimedia.org/T340908 (Tgr) In production, rdb1 / rdb2 / rdb3 (which point to rdb1009 and rdb1011) use the [[https://gerrit.wikimedia.org/g/operations/puppet/+/ba56daa92828a4803cdee60b8fbe198106597976/hi...
[16:37:36] serviceops, Abstract Wikipedia team, function-evaluator: Explore providing a writable RAM disk / etc. for the function-evaluator instances in k8s so they can write cache and transient operational material there - https://phabricator.wikimedia.org/T349738 (Jdforrester-WMF) p:Triage→Medium
[16:53:48] serviceops, Abstract Wikipedia team, function-evaluator: Explore providing a writable RAM disk / etc. for the function-evaluator instances in k8s so they can write cache and transient operational material there - https://phabricator.wikimedia.org/T349738 (JMeybohm) I probably can't work on this until...
[17:14:30] I popped a ticket to discuss MWAPI resource usage for our new Search Pipeline. https://phabricator.wikimedia.org/T349848 . No action from ServiceOps required, just a heads-up
[18:49:23] serviceops, MW-on-K8s: Handle sidecar containers in one-off Kubernetes jobs - https://phabricator.wikimedia.org/T348284 (RLazarus) Yep, sounds like a similar fit. I guess I didn't hit Submit on the last update -- it's not labels, it's annotations (which makes more sense anyway -- I even remember double-...
[18:49:52] serviceops, MW-on-K8s: Handle sidecar containers in one-off Kubernetes jobs - https://phabricator.wikimedia.org/T348284 (RLazarus)
[18:50:13] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (gmodena) **Data Engineering ownership** `eventgate` is missing from your list, but has WIP to target node 18 https://phabri...
[18:51:49] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (gmodena) Also, `mediawiki/services/similar-users` is a Python service.
[19:39:57] serviceops, API Platform (RESTbase Deprecation Roadmap), Patch-For-Review: Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371 (Jdforrester-WMF)
[19:40:00] serviceops, SRE, API Platform (RESTbase Deprecation Roadmap), Patch-For-Review: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Jdforrester-WMF)
[19:40:16] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (Jdforrester-WMF)
[19:40:22] serviceops, ChangeProp, EventStreams, Image-Suggestion-API, and 5 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (Jdforrester-WMF)
[19:48:01] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (Jdforrester-WMF)
[19:48:14] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (Jdforrester-WMF) >>!
In T349118#9284438, @MSantos wrote: > @Jdforrester-WMF here's a few questions that I have: > > **Recomm...
[20:40:48] serviceops, Growth-Positive-Reinforcement, Growth-Team, GrowthExperiments-ImpactModule: refreshUserImpactJob requires a high number of file descriptors - https://phabricator.wikimedia.org/T349809 (KStoller-WMF) p:Triage→High
[22:20:07] serviceops, DC-Ops, ops-codfw: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (RobH)
[22:20:40] serviceops, DC-Ops, ops-codfw: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (RobH)
[22:21:30] serviceops, DC-Ops, ops-codfw: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (RobH) a:Clement_Goubert >>! In T348045#9281139, @Kappakayala wrote: > @Clement_Goubert / @Joe could one of you help with the racking details? Would one of you be so kind as to update...
[22:22:22] serviceops, DC-Ops, ops-eqiad: Q1:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (RobH)
[22:22:42] serviceops, DC-Ops, ops-eqiad: Q1:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (RobH)
[22:23:20] serviceops, DC-Ops, ops-eqiad: Q1:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (RobH) a:Clement_Goubert >>! In T348046#9281141, @Kappakayala wrote: > @Clement_Goubert / @Joe could one of you help with the racking details? Would one of you be so kind as to update...
[22:26:03] serviceops, DC-Ops, ops-eqiad: Q1:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349875 (RobH)
[22:26:24] serviceops, DC-Ops, ops-eqiad: Q1:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349875 (RobH)
[22:27:25] serviceops, DC-Ops, ops-eqiad: Q1:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349875 (RobH) a:Clement_Goubert >>! In T348021#9281147, @Kappakayala wrote: > @Clement_Goubert / @Joe could one of you help with the racking details? I've split the racking task onto it...
[22:28:00] serviceops, DC-Ops, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349876 (RobH)
[22:28:05] serviceops, DC-Ops, ops-eqiad: Q2:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349875 (RobH)
[22:28:15] serviceops, DC-Ops, ops-eqiad: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (RobH)
[22:29:31] serviceops, DC-Ops, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349876 (RobH) a:Clement_Goubert >>! In T348020#9281144, @Kappakayala wrote: > @Clement_Goubert / @Joe could one of you help with the racking details? I've split the racking task onto...
[22:29:39] serviceops, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts - https://phabricator.wikimedia.org/T349876 (RobH)