[05:26:50] Machine-Learning-Team: Deploy nllb-200 to production - https://phabricator.wikimedia.org/T349163 (isarantopoulos)
[06:25:01] elukey: rec api is using default blubber python version 3.9
[06:26:30] so we don't need to change anything. If, however, there are easy fixes we can do them now to be future-proof. I also suggested running pytest without suppressing the deprecation warnings (I don't see a reason to suppress them, as it can only lead to future errors)
[06:52:53] Machine-Learning-Team, Observability-Alerting, Patch-For-Review: Lift Wing alerting - https://phabricator.wikimedia.org/T346151 (isarantopoulos) For the kafka lag when I try the query `kafka_burrow_partition_lag{ group="cpjobqueue-ORESFetchScoreJob"}` in thanos I see 2 topics for codfw (on...
[07:00:10] Afk bbl
[07:02:55] morning!
[07:03:12] isaranto: yes yes I meant to add easy fixes if it means supporting python 3.11/bookworm
[07:20:28] so the import error with python 3.11 comes from flaskrestplus IIUC
[07:20:29] https://github.com/noirbizarre/flask-restplus/pull/768
[07:20:32] 2020
[07:23:10] weird though, the code should work for our use case
[07:29:50] the code does https://github.com/noirbizarre/flask-restplus/blob/master/flask_restplus/model.py#L9
[07:30:05] that is correct, but I see another one
[07:30:07] mmmm
[07:30:14] maybe they committed without releasing
[07:30:40] https://github.com/noirbizarre/flask-restplus/commit/136da9ddfa19b8eb30004553a017e84364c9ffd8
[07:31:07] yeah
[07:32:00] the project seems totally abandoned
[07:42:01] anyway, forcing python3.9 in tox
[07:43:56] but on Debian Bookworm distutils is missing, so tests are not running locally
[07:44:57] need to use a container with bullseye then
[07:47:33] thanks for the review elukey. now going to test on staging...
[07:47:54] ack super
[07:47:58] let us know how it goes
[08:03:09] deployment completed, going to query the api ...
[08:08:13] running into this error:
[08:08:13] ```
[08:08:13] upstream connect error or disconnect/reset before headers. reset reason: connection termination
[08:08:13] ```
[08:12:20] that is the external error, there must be something not working in the pod
[08:12:24] looked at the container logs and they have:
[08:12:24] ```
[08:12:24] 404 Client Error: Not Found for url: http://localhost:6500/
[08:12:24] ```
[08:12:24] not sure whether these requests are going through the envoy proxy
[08:12:45] good point. How to verify?
[08:15:33] I can't get into the container on staging, not sure how I can verify this
[08:16:19] we can verify the settings
[08:16:50] for example, the 404 that you are getting is related to a specific endpoint
[08:17:08] and we have set localhost:6500, namely the mediawiki api
[08:17:37] is it the right endpoint?
[08:20:02] namely what we set in https://gerrit.wikimedia.org/r/c/research/recommendation-api/+/965142/9/recommendation/data/recommendation_liftwing.ini
[08:20:40] yep, the languagepairs endpoint is failing and it uses cxserver.wikimedia.org
[08:21:04] it uses localhost:6500 at the moment
[08:21:34] (in [endpoints])
[08:22:13] yep, pushing a fix for this
[08:25:32] it actually uses the right host header: https://github.com/wikimedia/research-recommendation-api/blob/5e138ba4a0d5d448deca3d178d078c23ff4752ac/recommendation/data/recommendation_liftwing.ini#L10C1-L10C40
[08:26:03] this should work unless wikimedia.org uses a different listener other than 6500
[08:27:37] is cxserver.wikimedia.org served by the wikimedia php api?
[08:28:09] (same question for intake-analytics.wikimedia.org)
[08:31:20] the mw-api-int-async-ro envoy listener is only for wikimedia php api?
[08:31:30] yep
[08:32:32] we need to add the listeners that we need, and set the right localhost:port combination
[08:32:34] by any chance have you come across documentation that details this?
[08:33:02] looking at the envoy listeners listed here doesn't clarify that: https://gerrit.wikimedia.org/g/operations/puppet/+/refs/heads/production/hieradata/common/profile/services_proxy/envoy.yaml
[08:33:51] it doesn't clarify that the mw-api-int-async-ro envoy listener is only for wikimedia php api
[08:34:08] this is what caused my confusion
[08:34:22] sure sure it can happen
[08:34:46] there are separate entries for other endpoints, like cxserver etc..
[08:35:09] they are separate services, usually when people talk about the mw-api they assume the PHP one
[08:35:27] (at least in my experience)
[08:36:41] sure, so I see cxserver uses 6015
[08:39:44] intake-analytics.wikimedia.org is served by an eventgate instance
[08:39:54] should be port 6004
[08:40:11] but we need to add both listeners to the mesh config in deployment-charts
[08:42:49] sure sure
[08:56:08] interesting, this example shows that the 6500 port (the same as mw-api-int-async-ro) is also used for wikipedia.org: https://wikitech.wikimedia.org/wiki/Envoy#Example_(calling_mw-api)
[09:02:15] kevinbazira: yeah it makes sense, it is the mw php api
[09:02:19] what is your doubt?
[09:05:12] we have got the following so far:
[09:05:12] language_pairs = cxserver.wikimedia.org: cxserver - 6015
[09:05:12] pageviews = wikimedia.org: mw-api-int-async-ro - 6500
[09:05:12] wikipedia = {source}.wikipedia.org: mw-api-int-async-ro - 6500
[09:05:12] wikidata = www.wikidata.org: mw-api-int-async-ro - 6500
[09:05:12] event_logger = intake-analytics.wikimedia.org: eventgate-analytics - 6004
[09:05:13] one left:
[09:05:26] related_articles = recommend-related-articles.wmflabs.org: ???
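
The endpoint-to-listener mapping listed above can be sketched in Python. This is a minimal illustration, not code from the rec-api repo: the hosts and ports are the ones quoted in the chat, while `LISTENERS`, `mesh_target` and the `/w/api.php` path in the usage line are hypothetical names chosen for the example. The idea is that the request goes to the local envoy listener, and the `Host` header tells the proxy which upstream service is meant.

```python
from typing import Dict, Tuple

# Endpoint name -> (Host header template, local envoy listener port).
# Values are the ones listed in the chat; related_articles is left out
# because no listener was identified for it.
LISTENERS: Dict[str, Tuple[str, int]] = {
    "language_pairs": ("cxserver.wikimedia.org", 6015),
    "pageviews": ("wikimedia.org", 6500),
    "wikipedia": ("{source}.wikipedia.org", 6500),
    "wikidata": ("www.wikidata.org", 6500),
    "event_logger": ("intake-analytics.wikimedia.org", 6004),
}


def mesh_target(endpoint: str, path: str = "/", **fmt: str) -> Tuple[str, Dict[str, str]]:
    """Return the localhost URL and headers for a call through the mesh.

    `fmt` fills placeholders in the host template, e.g. source="en".
    This helper is hypothetical and only builds the values; it sends nothing.
    """
    host_template, port = LISTENERS[endpoint]
    host = host_template.format(**fmt)
    return f"http://localhost:{port}{path}", {"Host": host}


if __name__ == "__main__":
    url, headers = mesh_target("wikipedia", "/w/api.php", source="en")
    print(url, headers)
```

Keeping the host and the port in one table makes a mismatch like the one debugged here (cxserver's host header sent to port 6500) easier to spot in review.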
[09:08:09] it seems another api, not sure if it is used
[09:08:26] candidate_finders.py seems to use it
[09:16:20] kevinbazira: I tried with curl, it seems a dead end
[09:16:29] it points to the wmflabs proxy, but nothing is configured
[09:16:34] also checked in https://openstack-browser.toolforge.org/proxy/
[09:27:59] (PS2) Elukey: Add precommit support [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542
[09:28:01] (PS2) Elukey: Fix pre-commit errors and bump version [research/recommendation-api] - https://gerrit.wikimedia.org/r/966543
[09:29:36] (CR) Elukey: Add precommit support (2 comments) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542 (owner: Elukey)
[09:31:16] (CR) Ilias Sarantopoulos: Add precommit support (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542 (owner: Elukey)
[09:32:31] yep elukey I looked at the full url as configured by the original devs and it looks like it's down:
[09:32:31] https://github.com/wikimedia/research-recommendation-api/blob/5e138ba4a0d5d448deca3d178d078c23ff4752ac/recommendation/data/recommendation.ini#L7C20-L7C101
[09:32:31] changed it to point to the current host on wmflabs and still it's down:
[09:32:31] https://recommend.wmflabs.org/types/related_articles/v1/articles/
[09:32:31] in T339890#9179995 Isaac advised that we switch off related_articles, so it is likely not to affect the LiftWing instance
[09:32:31] I am going to push changes for the listeners we listed above
[09:34:12] ack
[09:35:56] isaranto: one qs - should I just use "tox" as entrypoint in blubber?
[09:37:49] elukey: yes - if you move the command under testenv
[09:38:00] yep yep it is a good point
[09:38:05] tox.ini would be sth like this
[09:38:12] https://www.irccloud.com/pastebin/3qTHdf5E/
[09:38:22] exactly yes
[09:38:29] and we shouldn't need a dedicated entrypoint.sh
[09:38:41] I see that in inference-services we initialize git
[09:39:25] a w8. we would need to do that probably.
[09:39:30] if needed I can change run-tests.sh in the rec-api repo like we do for inference-services
[09:40:06] pre-commit runs against a git repo and the way we handle CI is different: we just copy files into a container and don't check out the current repo/commit
[09:40:33] when we move to gitlab iiuc we could change this behavior
[09:41:26] (PS3) Elukey: Add precommit support [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542
[09:41:28] (PS3) Elukey: Fix pre-commit errors and bump version [research/recommendation-api] - https://gerrit.wikimedia.org/r/966543
[09:41:54] (CR) Elukey: Add precommit support (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542 (owner: Elukey)
[09:42:01] (CR) Elukey: Add precommit support (1 comment) [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542 (owner: Elukey)
[09:42:51] (CR) CI reject: [V: -1] Add precommit support [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542 (owner: Elukey)
[09:43:14] (CR) CI reject: [V: -1] Fix pre-commit errors and bump version [research/recommendation-api] - https://gerrit.wikimedia.org/r/966543 (owner: Elukey)
[09:43:26] ./run-test.sh: line 2: git: command not found
[09:43:29] right :D
[09:45:32] testing locally this time to see if it works
[09:46:25] (PS1) Kevin Bazira: Update external endpoint ports used on LiftWing [research/recommendation-api] - https://gerrit.wikimedia.org/r/966826 (https://phabricator.wikimedia.org/T348607)
[09:46:51] (CR) Elukey: [C: +1] Update external endpoint ports used on LiftWing [research/recommendation-api] - https://gerrit.wikimedia.org/r/966826 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[09:47:33] woah, thanks elukey. that was fast! :)
[09:47:59] (CR) Kevin Bazira: [C: +2] Update external endpoint ports used on LiftWing [research/recommendation-api] - https://gerrit.wikimedia.org/r/966826 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[09:48:13] it is a quick one, we already discussed the changes above :)
[09:48:31] (Merged) jenkins-bot: Update external endpoint ports used on LiftWing [research/recommendation-api] - https://gerrit.wikimedia.org/r/966826 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[09:50:55] (PS4) Elukey: Add precommit support [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542
[09:50:57] (PS4) Elukey: Fix pre-commit errors and bump version [research/recommendation-api] - https://gerrit.wikimedia.org/r/966543
[09:52:03] (CR) CI reject: [V: -1] Add precommit support [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542 (owner: Elukey)
[09:52:32] (CR) CI reject: [V: -1] Fix pre-commit errors and bump version [research/recommendation-api] - https://gerrit.wikimedia.org/r/966543 (owner: Elukey)
[09:53:09] I wanted to see if it was only local, but I see
[09:53:10] AttributeError: module 'virtualenv.create.via_global_ref.builtin.cpython.mac_os' has no attribute 'CPython3macOsBrew'
[09:53:48] aa w8 I witnessed this issue in another patch recently
[09:54:35] I recall something similarly weird
[09:54:51] maybe the virtualenv version or something?
[09:55:08] yes virtualenv incompatibility between libraries
[09:55:28] ah, found it
[09:55:32] we pin virtualenv==20.23.1
[09:55:49] and there is also https://github.com/tox-dev/tox/pull/1537
[09:55:52] in the langid image.
[09:55:52] I added `pip uninstall virtualenv -y` in the entrypoint.sh script
[09:56:21] so that we allow tox to do its thing
[09:56:24] yes now I recall, the "please don't ask" :D
[09:57:02] but what if I use a different virtualenv?
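
When chasing a clash like the one above (a virtualenv pinned in the image versus the one tox expects), it can help to print which versions are actually installed in the container before tox runs. A minimal sketch using only the standard library; `installed_version` is a hypothetical helper written for this example and is not part of the rec-api or inference-services repos.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Report the packages discussed in the chat; output depends on the environment.
    for name in ("virtualenv", "tox", "flask"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

Running this inside the CI image and inside a plain debian:bullseye container would show whether the two environments differ in the packages that tox relies on.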
[09:57:29] trying to build
[09:59:01] perhaps that could work too (using a different virtualenv)
[09:59:07] * isaranto afk early lunch!
[09:59:08] not pinning works, but then I see ERROR: No matching distribution found for flask==1.1.4
[09:59:56] and it seems that that version doesn't support py39
[10:24:12] it is all a mess with the debian image I think, trying to tweak some settings
[10:34:21] Lift-Wing, Machine-Learning-Team: Discuss caching strategies for Lift Wing - https://phabricator.wikimedia.org/T349180 (klausman)
[10:35:02] Machine-Learning-Team, Goal: Goal: Decide on an optional Lift Wing caching strategy for model servers - https://phabricator.wikimedia.org/T348155 (klausman)
[10:35:04] Lift-Wing, Machine-Learning-Team: Discuss caching strategies for Lift Wing - https://phabricator.wikimedia.org/T349180 (klausman)
[10:44:33] * elukey lunch!
[10:44:39] ditto
[10:46:55] Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (jijiki)
[10:48:05] Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (jijiki) Data has been flushed from both rdb1011 and rdb2009
[11:07:15] lemme know if I can help with the python images for rec-api
[12:14:11] isaranto: hopefully I should be able to fix it in a bit, very weird
[12:58:09] Morning
[12:58:46] morning!
[12:59:06] o/
[13:49:37] (PS5) Elukey: Add precommit support [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542
[13:49:39] (PS5) Elukey: Fix pre-commit errors and bump version [research/recommendation-api] - https://gerrit.wikimedia.org/r/966543
[13:49:49] testing to see if another problem is only local/mine --^
[13:50:35] (CR) CI reject: [V: -1] Add precommit support [research/recommendation-api] - https://gerrit.wikimedia.org/r/966542 (owner: Elukey)
[13:51:14] (CR) CI reject: [V: -1] Fix pre-commit errors and bump version [research/recommendation-api] - https://gerrit.wikimedia.org/r/966543 (owner: Elukey)
[13:52:06] without virtualenv I get
[13:52:07] ERROR: Could not find a version that satisfies the requirement flask==1.1.4 (from versions: none)
[13:53:22] it is definitely an issue between what debian installs and what pip installs
[13:53:30] since with a regular debian:bullseye container all works
[14:30:10] Lift-Wing, Machine-Learning-Team: Discuss caching strategies for Lift Wing - https://phabricator.wikimedia.org/T349180 (calbon)
[14:53:45] so flask 1.1.4 is really old. I wonder if it can be upgraded easily
[14:53:52] "easily" 😓
[14:57:53] I regret our decision not to move it to fastapi since we'll eventually need to do work to upgrade it anyway
[15:00:10] I tried to upgrade but it gives the same issue, it seems more an issue with virtualenvs/pip/etc..
[15:00:15] I'll figure it out
[15:00:19] there is no rush
[15:53:50] pff can't get the alert right
[15:54:02] starting fresh tomorrow!
[15:59:18] added a patch for nllb - https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/966891/
[15:59:47] logging off, have a nice evening/day folks!
[16:01:14] you too
[16:07:12] going afk too!
[16:07:19] have a nice rest of the day folks
[21:17:09] Machine-Learning-Team, Add-Link, Growth-Team (Sprint 1 (Growth Team)), User-notice: Deploy "add a link" to 15th round of wikis - https://phabricator.wikimedia.org/T308141 (KStoller-WMF)