[07:38:26] 10serviceops, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10elukey) After reading the always enlightening page https://www.brendangregg.com/perf.html I tried to run perf with the LBR (`--call-graph lbr`), as a... [10:12:35] 10serviceops, 10iPoid-Service, 10Patch-For-Review, 10Trust and Safety Product Sprint: Implement proxy configuration for kubernetes deployment - https://phabricator.wikimedia.org/T349171 (10CodeReviewBot) kharlan merged https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/152 Rename... [11:08:29] hello folks [11:08:55] after some checks it seems that the majority of the time spent by changeprop on cpu is for librdkafka, so I filed https://gerrit.wikimedia.org/r/c/mediawiki/services/change-propagation/+/969339 [11:09:25] that enables librdkafka debug logs (when we set the value in the config), hopefully we'll have some clue on what's happening [11:16:19] <_joe_> looks good to me [11:19:01] I also discovered that we use an old lib (http://bluebirdjs.com/docs/api/promise.promisifyall.html) to make the kafka consumer async, I suspect that it may play a role in this [11:19:29] last release was 4y ago, I am not 100% sure if it is efficient on node 18 [13:44:09] elukey: this is a good point, would be interesting to see how it would change to use native promise instead? [13:45:16] ottomata: no idea how to do it though, not that expert in node :D [13:45:22] iirc it is also used by service-template-node+service-runner to issue http requests via the preq library https://github.com/wikimedia/preq [13:46:19] the main issue is what Joe keep saying, namely that if we want to keep using Nodejs we should also have a way to manage the common platform/packages/etc.. [13:46:27] it might be easier to experiment with the kafka promsifying https://gerrit.wikimedia.org/g/node-rdkafka-factory [13:46:36] I think a SIG (like the k8s one) could be good [13:46:38] i dont' think changeprop uses that right now, but it could (and I think it was intended for it to do so) [13:46:45] elukey: for sure [13:47:17] i think that just needs a leader...that can't be me right now, my baby schedule is too whacky until the spring [13:47:40] and also maybe some leadership buy-in [13:47:55] yep yep [13:48:06] ah nice the lib uses stuff like producer.setPollInterval(pollInterval); [13:48:15] changeprop uses some time interval for that [13:48:30] wait...does it use bluebird anymore? [13:48:38] maybe i removed that just recently hang on... [13:49:48] having a common package is nice, for example for stuff like https://gerrit.wikimedia.org/r/c/mediawiki/services/change-propagation/+/969339/4/lib/kafka_factory.js [13:50:01] hm maybe it never did! [13:50:31] yeah, this code was originally taken from there and factored out for use in other nodejs apps. i think petr etc. just never got around to using it in changeprop [13:50:48] yes, this lib does not use bluebird [13:51:11] so maybe we can discount that as an issue with new node-rdkafka then, since we also see the latency increase in eventgate. [13:51:14] or, wait [13:51:22] you explicitly see latency increase in node-rdkafka stuff, right? [13:51:52] nono so far only cpu increase [13:51:58] but changeprop in staging doesn't do much [13:52:21] right sorry, cpu increase. and you can tell this increase is in node-rdkafka libs? [13:52:26] via flamegraphs? [13:52:36] I am not an expert but it seems so yes [13:52:46] see https://phabricator.wikimedia.org/T348950#9290196 for example [13:53:38] yeah was following [13:54:04] too bad we can't build and run images with buster and old nodejs anymore [13:54:24] we could try only upgrading node-rdkafka in the old image (and other nodejs libs) and compare [13:54:44] ...perhaps we shouldn't remove old images until production services are upgraded? :p [13:55:12] I am 100% sure it is librdkafka, we had the same issue when we tried to upgrade to node 12 [13:55:36] well, i'm also curious as to what might be affecting the eventgate latencies [13:55:40] ottomata: or we could avoid to run nodejs versions that are deprecated since years ago in production :P [13:55:45] like, why are the GET requests so affected? [13:56:18] elukey: for sure, but to do that, we'd need to upgrade, and it would be nice to be able to isolate issues in that upgrade with the old images [13:56:20] annywayy [14:15:57] 10serviceops, 10Data-Engineering, 10Event-Platform: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10elukey) Deployed the new image to staging, and added the `debug:all` settings for consumer and producer. What I see is that librdkafka keeps trying to fetch data from vari... [14:24:35] 10serviceops, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10Ottomata) > last release of Bluebird is in fact ~4y old, and I don't think it natively supports nodejs 18 FWIW, swapping out any direct usages of Bl... [14:36:03] 10serviceops, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10elukey) >>! In T348950#9291443, @Ottomata wrote: >> last release of Bluebird is in fact ~4y old, and I don't think it natively supports nodejs 18 >... [14:37:28] ok so I filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/969758 to upgrade cp-eqiad (afaics the one handling less traffic) to the new docker image [14:38:51] my investigation points to librdkafka polling periodically all the topics to fetch data from, and they don't have any since we are in staging. I don't see anything that stands up, in my opinion the main issue is how we use node-rdkafka in cp [14:43:36] 10serviceops, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10Ottomata) https://www.npmjs.com/package/util-promisifyall ? Doesn't really do anything except iterate of object props and call [[ https://nodejs.or... [14:46:06] ottomata: "Published 5 years ago" :D --^ [14:46:22] it mentions node 8 :D [15:01:23] elukey: yeah, but it doesn't do anything other than foreach property as prop promisify(prop) [15:01:38] but, i don't know if you even need promisifyAll [15:22:59] 10serviceops, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (10Ottomata) > The promisifyAll(consumer) call IIUC creates some Async functions Looks like only: - `consumer.consumeAsync` - `consumer.commitMessageAs... [16:47:00] 10serviceops, 10iPoid-Service, 10Patch-For-Review, 10Trust and Safety Product Sprint (Sprint Bodhrán): Implement proxy configuration for kubernetes deployment - https://phabricator.wikimedia.org/T349171 (10kostajh) [17:53:39] 10serviceops, 10API Platform, 10MediaWiki-REST-API, 10SRE, and 2 others: Use relative URLs in redirects emitted by rest.php - https://phabricator.wikimedia.org/T349001 (10daniel) 05Open→03Resolved a:03daniel [19:40:26] 10serviceops, 10RESTBase Sunsetting, 10Code-Health-Objective, 10Data Products (Data Products (Sprint 03)), 10Patch-For-Review: Route to new AQS Knowledge Gaps endpoint - https://phabricator.wikimedia.org/T342213 (10VirginiaPoundstone) p:05High→03Low