[07:07:22] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 (kevinbazira) [07:26:30] good morning :) [07:37:00] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 (kevinbazira) @kostajh, we completed training models for the sixth round of wikis (listed in the task description) and shared the models' evaluation abo... [08:02:40] I am testing multiple revisions with editquality's good faith extractor and I see latencies varying a log [08:02:43] *lot [08:04:01] it doesn't always take 1s for example, but even less (400ms etc..) [08:04:08] and predict's timing changes too [08:10:51] prepping another code review :D [08:10:59] ilias: o/ [08:18:22] Machine-Learning-Team, Add-Link, Growth-Team, User-notice: Deploy "add a link" to 6th round of wikis - https://phabricator.wikimedia.org/T304550 (kostajh) @kevinbazira yes please publish the datasets! [08:42:15] for example, enwiki's 1097728152 revid seems slow to compute [08:42:26] on my laptop I mean, it takes more than a second to extract features [08:42:47] with other rev-ids it takes way less, like 200ms [08:42:56] so the cpu time in this case varies a lot [08:52:12] (PS1) Elukey: revscoring: Improve MP code and logging [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/852129 (https://phabricator.wikimedia.org/T320374) [09:33:12] Morning! [09:33:43] elukey: I *think* I got the Postgres thing done. https://labels.wmflabs.org/ui seems to work fine, and I see established connections from wikilabels-03 to wikilabels-databse-02 [09:34:49] morning! [09:35:16] does that mean I can shut down the old servers? 
[09:35:19] nice work :) I think that we can document and update the task accordingly (so taa*vi will be able to clean up) [09:35:35] ahahahah I didn't want to ping you, but you were quicker :D [09:35:59] :D [09:36:20] If we could shut them down but not delete them (yet), that would be a nice test that everything works. [09:36:29] sounds good, i'll do that [09:36:41] Maybe I'll also ask Kevin who I think knows more about how the Webui is supposed to work. [09:37:09] taavi: actually, wait a second [09:37:18] I want to get one final dump of the db, just in case [09:37:38] waiting [09:38:59] ok, I'm done [09:40:33] elukey: and yes, will write up the dump+restore steps in the ticket. Probably will also add them to the Wikitech article [09:40:54] both are shut down now, I'll set myself a reminder next week to delete them, please shout somewhere if you've missed something [09:41:07] klausman: yes please let's add it to Wikitech as well, at least a reference to the task [09:41:24] Machine-Learning-Team, Data-Services, Wikilabels, Cloud-VPS (Debian Stretch Deprecation), cloud-services-team (Kanban): Upgrade wikilabels databases to buster/bullseye - https://phabricator.wikimedia.org/T307389 (taavi) These are shut down now and can be deleted next week. [09:41:30] taavi: ack, thanks. 
Also thanks for your patience [09:57:28] klausman: I think you need to update or remove https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/wikilabels.pp#10 [09:59:32] yes, will do [10:14:50] klausman: Janis is rolling out a change to update kubelet's configs, that should be a no-op [10:22:45] ack, thanks [10:28:33] Lift-Wing, Machine-Learning-Team: No healthy upstream and upstream connect error in Lift Wing - https://phabricator.wikimedia.org/T322196 (achou) [10:39:06] Lift-Wing, Machine-Learning-Team: No healthy upstream and upstream connect error in Lift Wing - https://phabricator.wikimedia.org/T322196 (achou) If there is an upstream issue, we're not getting the right content type for JSON data, so it would raise a ValueError of Could not decode as JSON. (see line 88... [10:42:40] Machine-Learning-Team, Data-Services, Wikilabels, Cloud-VPS (Debian Stretch Deprecation), and 2 others: Upgrade wikilabels databases to buster/bullseye - https://phabricator.wikimedia.org/T307389 (klausman) Open→Resolved Created VM and Puppet stuff as detailed above, and migrated the data... [10:43:51] Machine-Learning-Team: Move Wikilabels Postgres Instances to VMs - https://phabricator.wikimedia.org/T312564 (klausman) As just added to [T307389](https://phabricator.wikimedia.org/T307389#8362341): DBs have been migrated and docs updated. Taavi has shut down the old clouddb instances and if we don't find we... [10:43:56] Machine-Learning-Team: Move Wikilabels Postgres Instances to VMs - https://phabricator.wikimedia.org/T312564 (klausman) Open→Resolved [10:44:22] elukey: when you have time, can you give the dump&restore instructions a quick read, see if I missed anything? [10:44:28] No rush etc. [10:45:54] klausman: sure! In the task? 
[10:52:24] I've put a copy in the task, but it's also on the Wikitech page for WL [10:55:52] Lift-Wing, Machine-Learning-Team: No healthy upstream and upstream connect error in Lift Wing - https://phabricator.wikimedia.org/T322196 (elukey) This is the same issue I am battling with in T320374. My understanding so far is that the Envoy proxy that Istio uses to fetch data from the MW API (mwapi con... [10:58:09] klausman: lgtm! [10:59:33] Oh, and btw, if you're like me and prefer writing docs like that in Markdown rather than Mediawiki syntax, you can use pandoc to convert Markdown (several subformats) to mediawiki. It still takes some manual twiddling, but is much easier to write, still (well, IMHO) [10:59:54] e.g. `pandoc -f markdown+simple_tables -t mediawiki input.md` [11:00:10] didn't know about pandoc [11:00:36] It's an amazing piece of software, it can convert between a bazillion different formats [11:00:54] Including stuff like "docx to LaTeX" [11:01:04] aiko: where did you test the RR model this morning? [11:01:20] I am checking metrics to see what happened [11:01:46] ah ok I see the ticket's description nevermind [11:02:36] weird, eqiad doesn't work, codfw works [11:03:28] Lift-Wing, Machine-Learning-Team: No healthy upstream and upstream connect error in Lift Wing - https://phabricator.wikimedia.org/T322196 (elukey) ` elukey@ml-serve-ctrl1001:~$ curl "https://inference.svc.codfw.wmnet:30443/v1/models/enwiki-articlequality:predict" -d @input.json -H "Host: enwiki-articlequ... [11:06:46] Is there maybe something going on with MWAPI in eqiad? [11:08:25] I suspect that it may be due to eqiad not being fully aligned with ml-serve-codfw, I haven't deployed to all clusters all the time.. 
but I am checking istio logs [11:11:39] elukey: these are the metrics when the error occurs in the RRR model https://grafana.wikimedia.org/d/zsdYRV7Vk/istio-sidecar?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=experimental&var-backend=All&var-response_code=All&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99&from=1667307600000&to=1667311199000 [11:13:01] aiko: I think that there are two issues 1) articlequality that has the api-ro cluster unreachable for some reason, and 2) RR that sometimes fails to contact the mw api right? [11:15:16] elukey: yeah I think they are different issues. [11:15:44] aiko: ack so for RR, see in "Backend traffic by response flags" that we have UC,URX ? [11:16:23] that means (from the point of view of the envoy/istio proxy) - Upstream closed (upstream is mwapi), but a retry was attempted (we do it 3 times before giving up) [11:18:05] elukey: I see [11:18:52] elukey: in articlequality it is UH [11:19:35] aiko: yeah no healthy upstream, for the moment use codfw (I'll try to fix it) [11:26:52] going afk for lunch, will check again this afternoon! [11:41:57] (CR) AikoChou: [C: +1] "LGTM" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/852129 (https://phabricator.wikimedia.org/T320374) (owner: Elukey) [11:57:17] Machine-Learning-Team: Retrain fawiki articlequality model - https://phabricator.wikimedia.org/T317531 (kevinbazira) Thank you for the suggestion @achou. I have tested the models using the full revision text and below is the workflow I used. Built the docker image using the steps mentioned in T322006#835886... [11:57:42] <- Lunch! [12:07:16] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Deploy revert-risk-model to production - https://phabricator.wikimedia.org/T321594 (achou) Some load test results: * 1 connection ` aikochou@deploy1002:~/rrr$ wrk -c 1 -t 1 --timeout 5s -s inference.lua https://inference.discovery.wmnet:30443/v1/m... 
[12:08:58] Machine-Learning-Team: Move Wikilabels Postgres Instances to VMs - https://phabricator.wikimedia.org/T312564 (klausman) Resolved→In progress [12:09:11] Machine-Learning-Team, Data-Services, Wikilabels, Cloud-VPS (Debian Stretch Deprecation), and 2 others: Upgrade wikilabels databases to buster/bullseye - https://phabricator.wikimedia.org/T307389 (klausman) Resolved→In progress [12:22:35] Machine-Learning-Team, ContentTranslation, Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (LSobanski) I don't see a specific ask for SRE so removing the tag. Please add it back when needed. [12:44:20] Morning all! [12:51:44] \o heyo Chris [14:22:19] (CR) Elukey: revscoring: Improve MP code and logging (1 comment) [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/852129 (https://phabricator.wikimedia.org/T320374) (owner: Elukey) [14:22:23] (CR) Elukey: [C: +2] revscoring: Improve MP code and logging [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/852129 (https://phabricator.wikimedia.org/T320374) (owner: Elukey) [14:33:40] (Merged) jenkins-bot: revscoring: Improve MP code and logging [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/852129 (https://phabricator.wikimedia.org/T320374) (owner: Elukey) [14:45:36] Lift-Wing, Machine-Learning-Team: No healthy upstream and upstream connect error in Lift Wing - https://phabricator.wikimedia.org/T322196 (elukey) We have two problems: 1) In eqiad most of the times the `no healthy upstream` is returned. I fixed articlequality simply updating the docker images, so it sh... [14:46:59] klausman: o/ do you have time during the next days to review/rollout https://gerrit.wikimedia.org/r/c/operations/puppet/+/852158/ ? [14:48:51] Proooobably :) [14:49:24] The actual disruption in prod should be a net zero, right? 
[14:49:28] Assuming no bugs etc [14:54:32] yep it is a no-op [15:39:01] Lift-Wing, Machine-Learning-Team, Patch-For-Review: Deploy revert-risk-model to production - https://phabricator.wikimedia.org/T321594 (MunizaA) Thanks a lot for sharing these results here, @achou! I do see that we're seeing more socket connect errors with increased connections. Is that something we... [16:06:56] rolled out the new docker images to ml-serve-codfw, I see the new logging bits [16:10:20] (sorry only the editquality goodfaith ones, I'll proceed with the rest) [16:10:29] (staging first) [16:10:45] in the meantime, I am running benthos :) [16:22:32] staging updated :) [16:26:13] latency is varying a lot, but now we should have more insight [16:26:14] for example [16:26:15] [I 221102 16:25:18 decorators:18] Function get_revscoring_extractor_cache took 1.2570 seconds to execute. [16:26:18] [I 221102 16:25:19 decorators:35] Function fetch_features took 0.9157 seconds to execute. [16:26:21] [I 221102 16:25:19 web:2243] 200 POST /v1/models/enwiki-goodfaith:predict (127.0.0.1) 2177.38ms [16:27:15] we do 3 mw api calls in get_revscoring_extractor_cache for editquality goodfaith, so some slow requests may come into play and a 1.2s execution is totally understandable (when it is not the avg of course) [16:27:29] but the 0.9s of fetch features is really not great [16:27:47] and it varies a lot, from tens of ms to a second [16:27:50] really weird [16:27:56] it depends on the rev-id afaics [16:46:59] rollout of the new docker images completed [16:50:49] I don't see any connection problem registered [16:50:55] with Benthos... 
[16:51:08] but it is probably something happening in certain use cases [16:51:09] sigh [16:55:45] Machine-Learning-Team, Analytics-Radar, serviceops: Using docker in WMF production network outside of kubernetes - https://phabricator.wikimedia.org/T275551 (jbond) [17:28:55] Lift-Wing, Machine-Learning-Team: No healthy upstream and upstream connect error in Lift Wing - https://phabricator.wikimedia.org/T322196 (elukey) @achou something like https://github.com/mediawiki-utilities/python-mwapi/compare/master...elukey:python-mwapi:master [17:28:58] folks I created https://github.com/mediawiki-utilities/python-mwapi/compare/master...elukey:python-mwapi:master [17:29:16] it may require different ranges of http responses, but 50x could be enough as a starter [17:29:43] created also https://github.com/mediawiki-utilities/python-mwapi/pull/48 [17:32:33] going afk, ttl folks!
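
Editor's note: the "retry on 50x" idea from the last few messages can be sketched roughly as below. This is only an illustration of the general technique and not the content of the linked python-mwapi branch or pull request. The names `call_with_retries`, `send` and `RETRYABLE_STATUSES` are hypothetical, and the default of 3 attempts and the 0.5s backoff are assumptions chosen for the example.

```python
import time
from typing import Callable, Tuple

# Assumption for this sketch: any 5xx status is treated as retryable.
RETRYABLE_STATUSES = range(500, 600)


def call_with_retries(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    backoff_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-5xx status or attempts run out.

    send is a hypothetical callable that performs one HTTP request and
    returns (status_code, body). The last response is returned as-is,
    even if it is still a 5xx, so the caller decides how to fail.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status not in RETRYABLE_STATUSES or attempt == max_attempts:
            return status, body
        # Exponential backoff between attempts: 0.5s, 1s, 2s, ...
        sleep(backoff_seconds * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In a real client `send` would wrap the actual request to the MW API. Retrying non-idempotent requests can be unsafe, so which HTTP methods and which status ranges to retry is a separate decision from the loop itself.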