[06:43:29] (PS1) Elukey: ores-legacy: set user agent header [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938969
[06:45:19] (CR) CI reject: [V: -1] ores-legacy: set user agent header [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938969 (owner: Elukey)
[06:48:34] (PS2) Elukey: ores-legacy: set user agent header [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938969
[06:52:21] Machine-Learning-Team: [ores-legacy] Clienterror is returned in some responses - https://phabricator.wikimedia.org/T341479 (elukey) Seems to come from the tlsproxy again: ` [2023-07-18T06:37:05.025Z] "POST /v1/models/enwiki-goodfaith:predict HTTP/1.1" 503 UF 47 91 274 - "-" "Python/3.9 aiohttp/3.8.3" "da1d1...
[07:18:08] (CR) Ilias Sarantopoulos: [C: +1] "Nice, totally forgot about this!" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938969 (owner: Elukey)
[07:20:38] (PS6) Ilias Sarantopoulos: langid: Provide wiki language code and name also in outputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/937575 (https://phabricator.wikimedia.org/T340507) (owner: Santhosh)
[07:39:24] (CR) Elukey: [C: +2] ores-legacy: set user agent header [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938969 (owner: Elukey)
[07:39:27] (CR) Ilias Sarantopoulos: [C: +2] langid: Provide wiki language code and name also in outputs [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/937575 (https://phabricator.wikimedia.org/T340507) (owner: Santhosh)
[07:40:16] (Merged) jenkins-bot: ores-legacy: set user agent header [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/938969 (owner: Elukey)
[07:47:33] (PS7) Ilias Sarantopoulos: langid: Provide wiki language code and name also in outputs
[machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/937575 (https://phabricator.wikimedia.org/T340507) (owner: Santhosh)
[07:48:49] kevinbazira: o/ I manually restarted the build process in CI: https://integration.wikimedia.org/ci/job/recommendation-api-ng-pipeline-publish/2/console
[07:49:11] it failed for a http 500, really weird, maybe it was transient
[07:49:31] (CR) Elukey: "Kicked off a new job: https://integration.wikimedia.org/ci/job/recommendation-api-ng-pipeline-publish/2/console" [research/recommendation-api] - https://gerrit.wikimedia.org/r/932810 (https://phabricator.wikimedia.org/T339890) (owner: Kevin Bazira)
[08:10:59] Machine-Learning-Team: Host the recommendation-api container on LiftWing - https://phabricator.wikimedia.org/T339890 (kevinbazira)
[08:14:27] elukey: o/ thank you for restarting the CI build. yep, the 500 error got me digging everywhere to understand what's going on.
[08:14:28] let's wait for this build to complete 🤞
[08:23:43] this build seems to take forever...35 minutes and counting (retrying while pushing image)
[08:30:56] yes something is off
[08:41:52] * elukey doctor appointment, bbl
[08:45:41] (PS3) Ilias Sarantopoulos: fix: set default lift wing url to null [extensions/ORES] - https://gerrit.wikimedia.org/r/937142 (https://phabricator.wikimedia.org/T319170)
[08:47:07] morning, planning to deploy this now
[08:47:08] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/937453/
[08:47:23] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (kevinbazira)
[08:47:26] any objections?
[08:47:28] (CR) Ilias Sarantopoulos: fix: set default lift wing url to null (1 comment) [extensions/ORES] - https://gerrit.wikimedia.org/r/937142 (https://phabricator.wikimedia.org/T319170) (owner: Ilias Sarantopoulos)
[08:47:55] Amir1: no objections, you can go ahead!
thanks
[08:48:13] awesome
[08:50:34] elukey, isaranto: I've reached out to RelEng regarding the failing CI build in case they know or see something that we don't: https://phabricator.wikimedia.org/T342084
[08:53:30] ack!
[08:55:23] Amir1: we are also ready to start deploying to other wikis (once we verify that envoy proxy works). There were some warnings regarding databases but I haven't managed to figure it out
[08:56:46] isaranto: by the warning, you mean the model ones?
[08:57:03] yes, these https://phabricator.wikimedia.org/P49556
[08:57:25] I don't think that's a blocker, it only shows up when the model version is not stored in ores_model which is quite rare
[08:57:41] my guess is that it has something to do with implicit transactions
[08:57:44] but meh
[08:59:40] cool! I'll submit a patch to enable it in some wikis. If you have specific suggestions from this list that we should do first let us know https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/ext-ORES.php#L12
[08:59:51] Dankeeee
[09:08:49] I'll be rolling out new inference-services chart and simplewiki model servers along the day as it affects all services
[09:08:57] (PS1) Ladsgroup: Fix model row upsert warning [extensions/ORES] - https://gerrit.wikimedia.org/r/939240 (https://phabricator.wikimedia.org/T319170)
[09:09:13] deployed and found the bug causing the warnings
[09:10:41] (CR) Ilias Sarantopoulos: [C: +1] "Thanks for spotting this!" [extensions/ORES] - https://gerrit.wikimedia.org/r/939240 (https://phabricator.wikimedia.org/T319170) (owner: Ladsgroup)
[09:11:49] kevinbazira: if you have the recommendation-api docker image built locally, can you tell me its size?
[09:11:54] this may be the issue
[09:12:17] I suspect that it is several gbs and the docker registry doesn't like it
[09:13:03] The embeddings are an issue.
I was thinking that perhaps we could download them on deployment when the pod starts in an init container or sth
[09:13:42] elukey: the image is ~4.6GB
[09:14:15] It can also be seen on ML sandbox:
[09:14:37] REPOSITORY TAG IMAGE ID CREATED SIZE
[09:14:37] recommendation-api-prod20230718 latest cff630cecacf 30 minutes ago 4.6GB
[09:15:01] very weird then
[09:15:03] it is not that big
[09:15:11] can you add the info to the task?
[09:15:16] they will likely ask
[09:15:51] the failure happens while trying to publish the docker image to the registry
[09:18:00] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (kevinbazira)
[09:18:28] elukey: sure, I have added the size details to the task.
[09:19:44] https://grafana.wikimedia.org/goto/EL-3rEC4k?orgId=1 In an effort to better understand the metrics (SLIs) that would go into our future SLO dashboard (made using templating), I crafted this sorta POC dashboard: https://grafana.wikimedia.org/goto/EL-3rEC4k?orgId=1 Note that the actual SLO dashboard will look different, this is just for me to understand the metrics (and see if they look
[09:19:45] credible when graphed). Comments/questions etc. very welcome
[09:20:14] oops, pasted the same URL twice :)
[09:22:55] klausman: this is nice!
[09:23:05] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (elukey) On registry2003 I can only see this error: ` level=error msg="response completed with error" err.code="name unknown" err.detail="map[name:...
[09:24:01] when I filter by namespace there is a namespace named `unknown`. Any idea about that?
[09:24:26] I will be deploying ALL lw model servers today as some of them haven't been updated since march
[09:24:32] I am unsure where that comes from. It's a proper label in the istio metrics.
It also shows up in some other labels
[09:25:02] ack, I just noticed it when I selected all of them together
[09:26:23] There is also HTTP response code 0, which I am unsure about
[09:27:12] Ah, I have a suspicion
[09:28:13] So Istio metrics have src and dst namespace labels. For many requests, Istio will not know the src namespace since the request comes from outside the cluster. I suspect the dst=unknown ones are outgoing requests. That can probably be easily filtered (and should be, for the final SLO metrics)
[09:28:32] Similar for service names
[09:33:14] isaranto: re deploy - please wait a bit, we need to test the concurrency settings
[09:33:19] they are only in staging
[09:33:33] I did it briefly this morning and all seems good
[09:33:38] but we'll need to proceed with care
[09:34:16] i already deployed drafttopic articletopic and articlequality to prod
[09:34:19] they seem fine
[09:34:27] ah lol ok :D
[09:34:46] staging is already done though, did you re-sync?
[09:36:03] yes yes
[09:36:06] staging is all set
[09:36:14] i'm running the httpbb tests as well
[10:19:44] elukey: in the SLO dash thingy I made, you can see the activator kicking in, btw: https://grafana.wikimedia.org/goto/g_UhgyC4k?orgId=1
[10:23:54] * klausman lunch
[10:35:13] nice!
[10:35:45] as fyi for everybody we also have https://grafana.wikimedia.org/d/c6GYmqdnz/knative-serving
[10:49:14] Machine-Learning-Team, Release Pipeline, ci-test-error: Post-merge build failed due to Internal Server Error - https://phabricator.wikimedia.org/T342084 (elukey) @hashar hi! Are we missing any config in the integration repo by any chance?
[10:58:54] I'm also preparing a patch for httpbb tests, as there have been changes that are not reflected in the tests (nsfw gone, revertrisk "graduation" from experimental)
[10:59:05] I'll submit it once I have deployed and tested everything
[10:59:44] ack
[11:05:57] Morning all!
[11:06:49] morning Chris!
[11:07:27] chrisalbon_: good late night/morning Chris!
:D
[11:07:48] Ha
[11:08:05] I was thinking of sth similar to say Luca :)
[11:08:10] super early!
[11:09:20] * isaranto goes for lunch!
[11:11:59] :)
[11:12:38] * elukey lunch!
[12:35:35] the patch is ready https://gerrit.wikimedia.org/r/c/operations/puppet/+/939293
[12:39:08] I have re-deployed all model servers (revscoring, revertrisk and articletopic)
[13:17:02] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.41-notes (1.41.0-wmf.17; 2023-07-11): Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 (isarantopoulos)
[13:19:15] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.41-notes (1.41.0-wmf.17; 2023-07-11): Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 (isarantopoulos)
[13:21:57] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.41-notes (1.41.0-wmf.17; 2023-07-11): Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 (isarantopoulos) a:isarantopoulos→None
[13:27:07] regarding deployments there is an issue with revertrisk namespace in eqiad. some of the old pods aren't terminated and some new ones get a CrashLoopBackOff so we end up having a mix of old and new pods
[13:27:40] codfw is perfectly fine
[13:31:34] Machine-Learning-Team: Deprecate mediawiki revision-score stream - https://phabricator.wikimedia.org/T342116 (isarantopoulos)
[13:33:10] isaranto: checking
[13:33:51] so the pods in crashloop are using the old knative revision
[13:34:21] and they report failures in the storage initializer
[13:34:37] I already saw these errors, it goes away deleting the old revision
[13:35:18] :~# kubectl delete revision revertrisk-language-agnostic-predictor-default-00006 -n revertrisk
[13:35:21] revision.serving.knative.dev "revertrisk-language-agnostic-predictor-default-00006" deleted
[13:35:24] fixed :)
[13:39:11] thanks!
I don't have permissions to delete revisions
[13:39:26] can u do the same for multilingual? there is an older revision there as well
[13:40:07] ah yes weird, I didn't notice the 2/3
[13:41:41] done
[13:42:50] Machine-Learning-Team: Add Ores UI component in ores-legacy - https://phabricator.wikimedia.org/T342118 (isarantopoulos)
[13:43:16] Grazie!
[14:21:04] Machine-Learning-Team: Add Ores UI component in ores-legacy - https://phabricator.wikimedia.org/T342118 (calbon) consider adding a deprecation flag to the UI
[14:26:39] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.41-notes (1.41.0-wmf.17; 2023-07-11): Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 (isarantopoulos)
[14:27:37] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.41-notes (1.41.0-wmf.17; 2023-07-11): Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 (isarantopoulos)
[14:45:27] Machine-Learning-Team, MediaWiki-extensions-ORES, MW-1.41-notes (1.41.0-wmf.17; 2023-07-11), Patch-For-Review: Move backend of ORES MediaWiki extension to Lift Wing - https://phabricator.wikimedia.org/T319170 (isarantopoulos) We have deployed the model servers on Lift Wing for simplewiki (using e...
[15:03:52] Machine-Learning-Team: Revert Risk multi-lingual model performance and reliability may need a review - https://phabricator.wikimedia.org/T340822 (klausman) In an effort to solve the practical problem (getting good RR inference without too many errors and timeouts), I'll do some testing on the other RR model...
[15:20:21] Machine-Learning-Team, Research: Add ML team as developers to research repos - https://phabricator.wikimedia.org/T341856 (KHernandez-WMF) p:Triage→Medium a:fkaelin
[15:34:44] isaranto: sooo the 503s/etc..
that are present in the ores-legacy's response are weird
[15:35:00] since I see a 503 UF (upstream failure) in the tls proxy logs
[15:35:13] but no 503s on the istio gateway front
[15:36:58] better: UF is Failed to connect to upstream
[15:37:13] I still have no idea...
[15:37:17] so it seems as if envoy (tls-proxy) stumbles on a broken connection
[15:37:38] i mean it seems like a load thing or some other proxy issue
[15:39:44] iiuc it has to do with async calls in ores-legacy. the reason I believe it is a proxy issue is that if I run ores-legacy on statbox I don't have this issue
[15:40:09] I am logging off for the day, more tomorrow! (feel free to add stuff, I'll check later)
[15:40:12] yes definitely, I think it is the combination of aiohttp + envoy
[15:40:14] o/
[15:49:22] Machine-Learning-Team, Research: Add ML team as developers to research repos - https://phabricator.wikimedia.org/T341856 (fkaelin) There is a 'Invite a group' button on the group members page https://gitlab.wikimedia.org/groups/repos/research/-/group_members, however list of groups that can be selected d...
[15:54:07] (PS1) Elukey: revscoring: fix exception handling in fetch_features [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/939321
[16:25:07] Machine-Learning-Team, Research: Add ML team as developers to research repos - https://phabricator.wikimedia.org/T341856 (brennen) Open→Resolved > In fact, it seems the list of options is populated by an outdated structure (e.g. `repos/{teamname}` instead of `groups/repos/{teamname}`) Despite th...
[16:28:04] * elukey lunch!
[16:28:07] ahahahah old one
[16:28:10] * elukey afk!
[16:28:20] have a nice rest of the day folks!
[16:34:17] Just like Hobbits have several breakfasts, Italians have several lunches ;)
[17:26:17] Machine-Learning-Team, MediaWiki-extensions-ORES: Deployment of Lift Wing usage to all wikis that use ores extension - https://phabricator.wikimedia.org/T342115 (Aklapper)