[06:54:18] serviceops, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (gmodena) >>! In T349118#9285852, @Jdforrester-WMF wrote: >>>! In T349118#9285601, @gmodena wrote: >> **Data Engineering owner...
[08:46:42] serviceops, Data-Engineering, Event-Platform: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (elukey) Found a way to generate Flame Graphs: * nodejs 10 version: {F40463091} * nodejs 18 version: {F40462916} Procedure: * Added the following to nodejs `--perf-basic...
[08:46:53] hello folks, added flamegraphs for changeprop --^
[08:50:22] serviceops, Patch-For-Review: Upgrade the MediaWiki servers to ICU 67 - https://phabricator.wikimedia.org/T345561 (MoritzMuehlenhoff)
[09:19:11] lots of little differences but not sure I can see anything immediately apparent. significant changes in how the different node versions use libuv/epoll which would easily have significant impacts on cpu use and polling (tbh it's pretty hard to get proper impressions from changelogs in how the differences in v8 versions between 16 and 18 would impact)
[09:20:40] yeah, one thing that I noticed is that the rd-kafka main function takes a lot more cpu now
[09:21:03] and I see some timer-related cpu usage too in the upper frames
[09:21:22] so I think the issue is between librdkafka and how nodejs handles timers etc..
[09:21:29] probably not much that we can do about it
[09:21:50] (unless there is a magic librdkafka setting that we are missing)
[09:22:47] I'd figure the timers tie back into the more low level changes in the libs etc
[09:23:07] I'm a little less bombastic about just seeing what happens in codfw for a little bit after seeing those eventgate numbers :(
[09:24:36] eventgate is very different though, I am not sure how they use noderdkafka etc..
[09:25:00] do we have canaries?
We could try codfw canary if we want to be more careful
[09:37:30] not atm but it would be a good idea
[09:42:08] ack will do it
[09:43:30] hnowlan: do we have any quick test for changeprop in staging? To verify that rules work etc..
[09:45:13] elukey: yep, there's some quick tests here https://wikitech.wikimedia.org/wiki/Changeprop#Testing
[09:47:28] ahhh right! Will test them as well
[10:36:44] serviceops, iPoid-Service, Trust and Safety Product Sprint: Implement proxy configuration for kubernetes deployment - https://phabricator.wikimedia.org/T349171 (Tchanders)
[10:41:15] serviceops, iPoid-Service, Patch-For-Review, Service-deployment-requests, Trust and Safety Product Sprint: New Service Request 'iPoid' - https://phabricator.wikimedia.org/T325147 (Tchanders)
[11:20:21] serviceops, Release-Engineering-Team, docker-pkg: Attach opencontainers image metadata to docker images - https://phabricator.wikimedia.org/T345070 (JMeybohm)
[11:21:02] serviceops, Infrastructure-Foundations, SRE, User-MoritzMuehlenhoff: Annotate images in our registry with OS (and OS version) - https://phabricator.wikimedia.org/T335337 (JMeybohm)
[11:21:08] serviceops, Release-Engineering-Team, docker-pkg: Attach opencontainers image metadata to docker images - https://phabricator.wikimedia.org/T345070 (JMeybohm)
[11:29:36] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (JMeybohm)
[11:29:54] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (JMeybohm)
[12:21:34] serviceops, MW-on-K8s: Handle sidecar containers in one-off Kubernetes jobs - https://phabricator.wikimedia.org/T348284 (JMeybohm) I still wonder why upstream chose to list all the sidecars instead of just declaring the "primary container".
Not sure if it's worth investigating in a patch though.
[12:23:35] serviceops, iPoid-Service, Trust and Safety Product Sprint: Implement proxy configuration for kubernetes deployment - https://phabricator.wikimedia.org/T349171 (Tchanders) Thanks @jijiki . I've put up a patch renaming HTTP_PROXY to HTTPS_PROXY throughout the repo. Re: the port, do you mean that the...
[12:24:22] serviceops, iPoid-Service, Patch-For-Review, Trust and Safety Product Sprint: Implement proxy configuration for kubernetes deployment - https://phabricator.wikimedia.org/T349171 (CodeReviewBot) tchanders opened https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/152 Rena...
[13:13:59] serviceops, Data-Engineering, Event-Platform: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (elukey) Tested event handling of the Lift Wing rules in staging, everything looks good afaics.
[13:30:58] serviceops, Data Engineering and Event Platform Team, Data-Engineering, Event-Platform: A rolling restart of eventgate-main seems to cause many client failures - https://phabricator.wikimedia.org/T349823 (Ottomata) This will be a problem not just for jobs, but for all events sent by EventBus....
[13:44:44] serviceops, Data Engineering and Event Platform Team, Data-Engineering, Data-Platform-SRE, and 2 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (Ottomata)
[13:47:37] serviceops, Data Engineering and Event Platform Team, Data-Engineering, Event-Platform: A rolling restart of eventgate-main seems to cause many client failures - https://phabricator.wikimedia.org/T349823 (JMeybohm) >>! In T349823#9287341, @Ottomata wrote: > Is there some way to remove ingress whi...
[13:53:28] serviceops, Data Engineering and Event Platform Team, Data-Engineering, Data-Platform-SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (Ottomata)
[13:53:48] serviceops, Data Engineering and Event Platform Team, Data-Engineering, Data-Platform-SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (Ottomata)
[14:08:45] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (JMeybohm)
[14:10:25] serviceops, Data Engineering and Event Platform Team, Data-Engineering, Event-Platform: A rolling restart of eventgate-main seems to cause many client failures - https://phabricator.wikimedia.org/T349823 (Ottomata) Ah, great. Okay so IIUC, we should - upgrade relevant vendor templates in eventg...
[14:11:54] serviceops, Data Engineering and Event Platform Team, Data-Engineering, Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (Ottomata)
[14:17:34] serviceops, Growth-Team, Growth-Team-Filtering, MW-on-K8s, Notifications: Broken (empty) cross-wiki notification when using $wgLocalHTTPProxy (e.g. on Kubernetes) - https://phabricator.wikimedia.org/T223413 (Macaddct1984) Having the same issue and it's occurring on the meta-wiki side as well....
[14:17:37] serviceops, Data Engineering and Event Platform Team, Data-Engineering, Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (JMeybohm) >>! In T349823#9287469, @Ottomata wrote: > @JMeybohm does that sound right? Yes...
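The T349823 comments above are truncated, so the actual chart change is not visible in this log. As a generic illustration only, graceful pod termination usually means the application stops admitting new requests once termination starts while letting in-flight ones finish. The Python sketch below shows that bookkeeping; `DrainState` and all of its method names are hypothetical and not taken from eventgate or its Helm chart.

```python
import threading


class DrainState:
    """Hypothetical helper: tracks in-flight requests so a worker can drain before exiting."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._terminating = False
        self._idle = threading.Event()
        self._idle.set()  # nothing in flight at start

    def begin_request(self) -> bool:
        """Admit a request; return False once termination has started."""
        with self._lock:
            if self._terminating:
                return False
            self._in_flight += 1
            self._idle.clear()
            return True

    def end_request(self) -> None:
        """Mark one admitted request as finished."""
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def is_ready(self) -> bool:
        """What a readiness endpoint could report."""
        return not self._terminating

    def start_termination(self) -> None:
        """Stop admitting new work; a plain flag write, so it takes no lock."""
        self._terminating = True

    def wait_drained(self, timeout: float) -> bool:
        """Block until in-flight requests finish; False if the timeout elapses first."""
        return self._idle.wait(timeout)
```

In a real service, a SIGTERM handler would call `start_termination()`, and the main loop would then call `wait_drained()` with a timeout shorter than the pod's termination grace period before exiting.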
[14:20:46] serviceops, MW-on-K8s, Patch-For-Review: Gracefully handle pod termination in mw-on-k8s - https://phabricator.wikimedia.org/T331609 (JMeybohm) Open→Resolved Thanks for the patch @Clement_Goubert , I just deployed it.
[14:28:48] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (JMeybohm)
[14:52:41] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 (JMeybohm)
[15:47:11] serviceops, Wikidata, Wikidata-Query-Service, [DEPRECATED] wdwb-tech, and 2 others: Improve concurrency limits configuration of the wdqs updater - https://phabricator.wikimedia.org/T346456 (bking)
[16:01:26] serviceops, Wikidata, Wikidata-Query-Service, [DEPRECATED] wdwb-tech, and 2 others: Improve concurrency limits configuration of the wdqs updater - https://phabricator.wikimedia.org/T346456 (bking)
[18:26:19] serviceops, Data Engineering and Event Platform Team, Data-Engineering, Event-Platform: [Event Platform] Gracefully handle pod termination in eventgate Helm chart - https://phabricator.wikimedia.org/T349823 (Ottomata) > Regardless of the above, this is still a valid question I'd say. Indeed!...
[21:25:40] serviceops, Beta-Cluster-Infrastructure: Unable to upload files on Beta Commons - https://phabricator.wikimedia.org/T340908 (Tgr) >>! In T340908#9285076, @Tgr wrote: > In production, rdb1 / rdb2 / rdb3 (which point to rdb1009 and rdb1011) use the [[https://gerrit.wikimedia.org/g/operations/puppet/+/ba56d...
[23:06:50] serviceops, Beta-Cluster-Infrastructure: Unable to upload files on Beta Commons - https://phabricator.wikimedia.org/T340908 (Tgr) It still fails with `Could not acquire locks on server rdb1.` :( I'm trying to figure out why but beta DOS-ing its own error log (T349944) doesn't help.
[23:31:37] serviceops, Beta-Cluster-Infrastructure: Unable to upload files on Beta Commons - https://phabricator.wikimedia.org/T340908 (Tgr) `PHP Warning: RedisException: Connection timed out`, a firewall issue I guess? Redis is listening for all IPs: ` tgr@deployment-rdb01:~$ sudo lsof -iTCP -sTCP:LISTEN -n -P |...
[23:40:08] serviceops, Beta-Cluster-Infrastructure: Unable to upload files on Beta Commons - https://phabricator.wikimedia.org/T340908 (Tgr) >>! In T340908#9288678, @Tgr wrote: > iptables has a reasonable-looking rule, but it's not getting any new packets: > ` > tgr@deployment-rdb01:~$ sudo iptables --list -n -v |...
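On the `Connection timed out` symptom above: a timeout against a port that the server is listening on often suggests packets are being dropped somewhere in between, whereas a closed port tends to produce an immediate refusal. A minimal Python sketch for telling the two apart from the client side follows; the function name `probe_tcp` is hypothetical, and the host and port to probe are left to the caller since the log does not give them.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error'."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but nothing accepts on that port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer within the timeout: consistent with packets being dropped.
        return "timeout"
    except OSError:
        # Anything else, e.g. name resolution failure or unreachable network.
        return "error"
```

Running it from the client host that sees the exception, rather than from the Redis host itself, is what makes the result comparable to the PHP warning.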