[07:05:49] good morning :)
[07:33:30] ok folks I think I may have found the source of the connection errors from istio to the MW API
[07:33:41] tests still ongoing but they are looking better
[07:34:11] we have settings in the mw api's istio destination rule to limit the number of requests for each TCP conn
[07:34:41] the idea is to use a different tcp conn every now and then, since we are hitting a load balanced service
[07:34:59] (since it is a L4 load balancer, we may hit the same mw appserver behind it for a long time etc..)
[07:36:04] if I remove them, I don't see the connection errors anymore
[07:36:32] so my guess is that envoy doesn't really do a clean shutdown, or maybe aiohttp on the kserve side keeps using old conns for some reason
[07:37:37] I'll try to pinpoint the setting that causes this
[07:42:42] elukey: nice! that's good news :)
[07:42:52] good morning :)
[07:44:20] aiko: o/
[07:44:45] so I think the "blocking" issue that we discussed on friday is still relevant, but probably only for p9x latencies etc..
[07:45:08] so I am still kind of convinced that a process pool should be used
[07:48:10] so I still see the connection reset error, way less but it didn't really disappear
[07:49:10] I see. yeah we can try it and see how it goes
[08:02:26] the issue is still present sigh
[08:02:37] and I see latencies up to 20/30s for some models
[08:20:47] the next step that I have in mind is to dynamically turn on debug logging for envoy on some pods and see if I get some useful explanation
[08:21:18] at the moment it seems as if aiohttp uses a tcp conn for http that envoy has torn down before
[08:21:28] (without sending any Connection: close etc..)
[08:22:58] of course now I turned on logging and I don't see any occurrence of the issue for those pods
[08:23:01] classic
[08:24:05] \o Morning
[08:24:19] Might just take longer for the issue to surface because things are slower?
[08:32:30] there is definitely some issue/envoy setting that plays a role, if I change settings related to TCP pools etc.. the issue changes its frequency
[08:33:18] the model's cpu-bound code could play a role too
[08:59:41] so every time the error occurs, I see
[08:59:41] "response_flags": "UC",
[08:59:41] "response_code_details": "upstream_reset_before_response_started{connection_termination}",
[08:59:49] this in the envoy logs
[09:01:37] so it seems as if envoy gets its tcp conn to the mw api (another envoy) torn down
[09:01:45] or maybe it keeps reusing an old one
[09:01:48] already closed
[09:03:08] https://github.com/envoyproxy/envoy/issues/14981 suggests for example that it may happen if upstream (in our case the mw api) doesn't send a Connection: close header
[09:09:44] But still, the RST packet should be near immediate, and Envoy (on our side) should know it needs to make a new connection
[09:12:35] Hmm, reading that bug, we maybe should check if MWAPI is sending Connection: Close headers as Envoy expects them
[09:13:54] yeah what I wrote above :)
[09:14:03] but https://phabricator.wikimedia.org/T247484 looks similar to our issue too
[09:14:12] Ah, I'm always lagging behind 30m. Must be Monday
[09:14:34] Did you try the retry-on-reset thing mentioned in https://github.com/envoyproxy/envoy/issues/14981#issuecomment-1196858383 ?
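[Note: a minimal sketch of the kind of DestinationRule knob described above, i.e. capping requests per TCP connection so Envoy periodically opens a fresh connection to the L4-load-balanced MW API. The host, resource name and numeric value are placeholders, not the actual Lift Wing configuration:]

  apiVersion: networking.istio.io/v1beta1
  kind: DestinationRule
  metadata:
    name: mwapi-destination-rule        # placeholder name
  spec:
    host: mwapi.example.internal        # placeholder upstream host
    trafficPolicy:
      connectionPool:
        http:
          # forces connection rotation; suspected of interacting badly with
          # the upstream connection teardown ("UC" resets in the envoy logs)
          maxRequestsPerConnection: 100 # illustrative value, not quoted in the chat
          # idleTimeout also lives under these HTTP settings (see the
          # discussion further down)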
[09:16:36] I was checking if it is available
[09:16:37] https://github.com/istio/istio/pull/19635
[09:16:45] seems to suggest it is available from 1.4
[09:17:49] now finding how to set it is not easy :D
[09:18:49] ah it is a virtual service config https://istio.io/latest/docs/reference/config/networking/virtual-service/#HTTPRetry
[09:19:08] let's see
[09:21:38] I added
[09:21:38] retries:
[09:21:38] attempts: 1
[09:21:38] retryOn: connect-failure,refused-stream,503,reset
[09:21:55] it is very weird though that we have to do it
[09:22:14] but it may be our version of istio/envoy/whatever carrying a bug etc..
[09:22:34] If it is caused by upstream not sending C:c, maybe this could/should be fixed on the MWAPI side
[09:24:30] definitely, but in theory we connect to an Envoy proxy on the mw side too
[09:24:46] Just so many moving parts :-S
[09:24:47] and the issue should be more widespread, we see it really often now
[09:24:55] (on our side I mean)
[09:25:54] Do we know of other envoys talking to MWAPI?
[09:25:59] Or Istios
[09:27:34] in theory the serviceops team adds a sidecar to every pod with an envoy tlsproxy, that listens on a specific port and forwards requests to a specific backend server
[09:27:41] and there is one for mwapi too
[09:29:37] so with 3 'attempts' I don't see more errors on the benthos front, at least for now
[09:35:15] yep now zero errors
[09:36:04] So "Attempts" is Benthos re-doing the query?
[09:38:02] nono, attempts in the istio virtual service
[09:38:09] (so the envoy sidecar is redoing the cal)
[09:38:13] *call)
[09:38:22] Ah, right.
[09:38:23] https://gerrit.wikimedia.org/r/c/operations/puppet/+/580894/4/hieradata/common/profile/services_proxy/envoy.yaml seems similar to what we are seeing
[09:38:58] Hmmmm.
[09:41:10] that should be the idle timeout
[09:41:11] https://istio.io/latest/docs/reference/config/networking/destination-rule/#ConnectionPoolSettings-HTTPSettings
[09:41:27] let's try it
[09:44:38] It says the default is 1h. Are our 5s timeouts maybe too aggressively short?
[09:46:14] the connect timeout is 30s now (set manually), did you mean that one?
[09:46:45] or the idleTimeout?
[09:47:53] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/848245
[09:48:12] ^^^ I sent a patch to roll out outlink's new docker images, so I can test benthos config for the outlink model
[09:48:18] Nah, I got otherwise confused with the keepalive %-)
[09:48:32] aiko: takin' a look
[09:50:49] aiko: should the model be visible on s3/swift?
[09:51:55] I only see s3://wmf-ml-models/articletopic/outlink/20220727080723/model.bin
[09:52:11] Oh, nvm, me dumb
[09:52:56] klausman: yeah that's the one!
[09:53:31] My mind was thinking "model update" not "image update" :D
[09:53:50] +2d
[09:54:51] klausman: thank you :D
[10:22:36] klausman: so the magic setting that works seems to be
[10:22:44] retries:
[10:22:44] attempts: 3
[10:22:44] retryOn: connect-failure,refused-stream,reset
[10:26:22] and 0 errors
[10:29:07] Let's keep it running for a while and see if there are longer-term downsides
[10:29:15] (also: lunch :))
[10:37:17] * elukey lunch!
[12:18:23] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/848291
[12:18:51] --^ I got many errors reported from benthos, then I realized I forgot to add the eventgate settings :D
[12:23:13] elukey: is it okay if I also use "mediawiki.revision-score-test" for testing outlink?
[12:25:45] elukey: outlink follows the same event schema as revscoring articletopic
[12:28:04] +2'd :)
[13:02:46] aiko: yes yes definitely!
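[Note: the retry policy quoted above lives in the Istio VirtualService for the MW API upstream. In full context it would look roughly like the sketch below; host and route names are placeholders, and only the retries block mirrors what was pasted in the chat:]

  apiVersion: networking.istio.io/v1beta1
  kind: VirtualService
  metadata:
    name: mwapi-route                      # placeholder name
  spec:
    hosts:
      - mwapi.example.internal             # placeholder host
    http:
      - route:
          - destination:
              host: mwapi.example.internal # placeholder
        # have the sidecar retry when the upstream connection is reset,
        # instead of surfacing the UC error to the client
        retries:
          attempts: 3
          retryOn: connect-failure,refused-stream,reset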
[13:05:22] aiko: if you need any help with benthos lemme know
[13:07:54] aiko: I am testing new istio settings on ml-serve-codfw, this is the only ongoing thing (to see if I can reduce the connect breaks etc..)
[13:13:16] God
[13:13:23] Good morning all!
[13:16:22] ahahha after the first attempt I was worried
[13:16:23] :D
[13:16:25] morning :D
[13:36:22] Lol
[13:36:31] It’s so cold and hard when I get up now
[13:47:33] so https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/848344 seems to fix the connection breakages
[13:48:17] I am still pushing for https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/844923 to have the separate process pool though
[13:48:21] :)
[14:08:52] (CR) AikoChou: [C: +1] editquality: run the score method in a separate process [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/844923 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[14:10:25] wikibugs is back :D
[14:10:46] :)
[14:10:54] +2ed your change aiko
[14:11:37] thanks!
[14:35:02] (CR) Elukey: [C: +2] editquality: run the score method in a separate process [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/844923 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[14:43:17] (Merged) jenkins-bot: editquality: run the score method in a separate process [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/844923 (https://phabricator.wikimedia.org/T320374) (owner: Elukey)
[14:44:38] aiko: what does MODEL_VERSION represent for outlink?
[14:45:12] in my head the version is 20220727080723
[14:45:32] (trying to understand the code)
[14:45:47] elukey: it is based on the original file name from Isaac https://analytics.wikimedia.org/published/datasets/one-off/isaacj/articletopic/
[14:46:01] aiko: ah ok so it is more a MODEL_NAME thing right?
[14:46:02] elukey: model_alloutlinks_202012.bin
[14:46:36] we can go ahead with _VERSION now for test purposes but it may be confusing for people not aware of it
[14:47:29] elukey: I added this env because the revision-score-event schema requires model_version
[14:48:24] aiko: yes yes I understand, but we now have two "versionings", related to different things
[14:49:05] +2ed, let's think about better naming after testing
[14:49:14] just to avoid confusion when reading etc..
[14:49:21] what do you think?
[14:49:27] elukey: I can change it to 20220727080723 to avoid confusion
[14:49:56] for the moment let's test it like it is, and then we can see what's besdt
[14:49:57] *best
[14:50:19] i assume it's not relevant to this discussion, but FYI I intend to train an updated model before the model gets hooked up to actually make predictions per the revision-create stream (as the filename shows, the model is a bit old now)
[14:50:38] o/
[14:51:30] isaacj: it is definitely relevant - we store the model in a path like s3://wmf-ml-models/articletopic/outlink/20220727080723/model.bin
[14:51:39] so for us the "version" is 20220727080723
[14:51:40] hi Isaac!! :)
[14:52:44] o/ I think Aiko accidentally pinged me because my username is in the model URL above :) but always exciting to see these discussions
[14:53:13] lol
[14:53:23] we always think about Research folks <3
[14:53:25] lollllll
[14:55:28] elukey: what does the big long number mean in the name? is it just the upload time? aiko already let me know that i should give some heads up before expecting to upload the new model but is there anything else I should know before retraining w/ respect to versioning etc. (none of the inputs etc. would change, just new model weights)
[14:57:07] isaacj: it is a timestamp basically, so we know what model binary to load (like a unique id)
[14:57:29] all the model binaries are called 'model.bin'
[14:58:01] so rather than having a separate name/convention for each model type, we chose to have a uniform unique id
[14:58:15] simpler for us when deploying/checking/etc..
[15:00:59] ahh that makes sense. because of the time-of-day looking like a date too, i thought it might have some sort of expiration attached (good through august 7, 2023 or something like that). ok, then i'll just make sure the model card gets updated with the actual details on training time etc.
[15:02:05] <3
[15:12:22] isaacj: ores models have model.version like "0.9.2" as an attribute that comes with the model binary. I wonder if it is possible to add these kinds of metadata to outlink as well?
[15:19:17] aiko: I have deployed the multi-process patch to enwiki-goodfaith on ml-serve-codfw, so far latencies are better afaics from istio's metrics
[15:20:15] p90 and p99 are still high (seconds) but p75 is now less than a second
[15:20:25] I'll let it run to collect more data
[15:33:19] aiko: yeah, semantic versioning for models would be really nice. would happily work to figure out what our guidelines on versioning are, but that'd be minimal overhead and a nice piece of metadata to have
[15:41:00] isaacj: that would be a good topic to discuss in the ml:research meeting :)
[15:44:43] yay! I see revision-score events from the outlink isvc landed on the kafka test stream \o/
[15:52:39] nice!!
[15:52:48] * elukey going afk! o/
[16:23:55] aiko: oooh exciting about test stream! agreed good topic for ml:research meeting too. i'll try to get this new model trained in the next few weeks but let me know if the streams are going live before then and i'll try to move faster on that
[18:09:38] Lift-Wing, Machine-Learning-Team, Research (FY2022-23-Research-October-December): Create a language agnostic model to predict reverts on Wikipedia - https://phabricator.wikimedia.org/T314385 (diego) **Updates** * The [[https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/merge_requests/...
[19:22:43] aiko: isaacj elukey we are pretty close to merging https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/807565, which I hope you all can base your new revision score schema on! let me know if/when it would be helpful to work with you all on that
[20:33:02] * isaacj jumps for joy
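[Note: a minimal sketch, under stated assumptions, of how the timestamp-as-version path convention discussed above might be referenced from a KServe InferenceService. The real Lift Wing deployment-charts are likely structured differently, and every name, image, and env wiring below is a placeholder:]

  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: outlink-topic-model                  # illustrative name
  spec:
    predictor:
      containers:
        - name: kserve-container
          image: example.registry/outlink-predictor:latest  # placeholder image
          env:
            # the directory timestamp is the unique id ("version"); every
            # binary inside is named model.bin, per the convention above
            - name: STORAGE_URI                # hypothetical env wiring
              value: s3://wmf-ml-models/articletopic/outlink/20220727080723/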