[04:59:40] 10Machine-Learning-Team, 10Patch-For-Review: Remove traffic from old eswikibooks and eswikiquote deployments - https://phabricator.wikimedia.org/T345850 (10isarantopoulos) The SWViewer PR has been merged so the 5 aforementioned model servers can be removed. [07:12:10] o/ [07:12:51] elukey: o/ [07:13:42] preparing to deploy the rec-api on LW prod [07:13:50] ack [07:14:17] kevinbazira: keep in mind that we don't have any LB VIP configured, so you will not have a quick way to test the API etc.. [07:14:59] no problem, I plan to create tickets for that today [07:25:58] ο/ afk will be back in approx 30' [07:27:36] Both rec-api deployments on eqiad and codfw have been completed successfully. [07:27:36] NAME READY STATUS RESTARTS AGE [07:27:36] recommendation-api-ng-main-5c4f58c685-dc22j 2/2 Running 0 3m59s [07:27:36] recommendation-api-ng-main-5c4f58c685-mqm97 2/2 Running 0 3m59s [07:27:36] recommendation-api-ng-main-5c4f58c685-s47l8 2/2 Running 0 3m59s [07:27:37] recommendation-api-ng-main-5c4f58c685-w8s2p 2/2 Running 0 3m59s [07:27:37] recommendation-api-ng-main-5c4f58c685-zw9tx 2/2 Running 0 3m59s [07:29:39] very nice :) [07:33:11] 10Machine-Learning-Team: Deploy the recommendation-api-ng on LiftWing - https://phabricator.wikimedia.org/T347015 (10kevinbazira) The recommendation-api-ng has successfully been deployed to LiftWing production in both eqiad and codfw: ` kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recomme... 
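The `kubectl` readiness output pasted above can be sanity-checked mechanically. A minimal sketch — the pod names and columns are copied from the paste; the helper function itself is hypothetical, not part of the team's tooling:

```python
# Verify a `kubectl get pods` paste: every pod should report all containers
# ready (e.g. "2/2") and STATUS "Running".
KUBECTL_OUTPUT = """\
recommendation-api-ng-main-5c4f58c685-dc22j 2/2 Running 0 3m59s
recommendation-api-ng-main-5c4f58c685-mqm97 2/2 Running 0 3m59s
recommendation-api-ng-main-5c4f58c685-s47l8 2/2 Running 0 3m59s
recommendation-api-ng-main-5c4f58c685-w8s2p 2/2 Running 0 3m59s
recommendation-api-ng-main-5c4f58c685-zw9tx 2/2 Running 0 3m59s"""

def all_pods_healthy(output):
    for line in output.splitlines():
        name, ready, status, *_ = line.split()
        current, wanted = ready.split("/")
        if current != wanted or status != "Running":
            return False
    return True

print(all_pods_healthy(KUBECTL_OUTPUT))  # True
```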
[07:38:57] 10Machine-Learning-Team: Set SLO for the recommendation-api-ng service hosted on LiftWing - https://phabricator.wikimedia.org/T347262 (10kevinbazira) [07:46:30] 10Machine-Learning-Team: Create external endpoint for recommendation-api-ng hosted on LiftWing - https://phabricator.wikimedia.org/T347263 (10kevinbazira) [07:47:03] 10Machine-Learning-Team: Create external endpoint for recommendation-api-ng hosted on LiftWing - https://phabricator.wikimedia.org/T347263 (10kevinbazira) a:05kevinbazira→03None [07:50:12] elukey: tasks for the rec-api SLO definition and external endpoint setup have been created: https://phabricator.wikimedia.org/T347263 and https://phabricator.wikimedia.org/T347262. [07:50:12] I also tagged both SREs as you suggested, so that they'll help with this. [08:00:02] kevinbazira: ack thanks! For the SLO we decided that all new projects will initially have a low SLO bar, say 95% availability and 95% of successful requests below X seconds (or ms) [08:00:27] so the work to do is essentially to create the dashboard etc.. [08:00:37] for the endpoint, there are two steps [08:00:42] 1) internal VIP/LB [08:00:47] 2) API-gateway config [08:01:05] for 2), I am not 100% sure if we need to publish the API to the external users [08:01:25] in theory yes for Android apps, but we'd need to ask Isaac/Seddon what their plans are [08:01:48] and Content Translation what they need (I suspect only an internal endpoint for them) [08:02:05] and if research wants to move the GAP UI to the new API or not [08:02:29] I'd suggest following up with them to figure out what is best [08:02:41] ok, let me ask them in the task [08:07:04] kevinbazira: try also to follow up with them via chat or email, gathering consensus only on phab may take long [08:14:35] 10Machine-Learning-Team: Create external endpoint for recommendation-api-ng hosted on LiftWing - https://phabricator.wikimedia.org/T347263 (10kevinbazira) Hi @Isaac, @santhosh, and @Seddon.
The ML team was assigned the task of migrating the [[ https://gerrit.wikimedia.org/r/plugins/gitiles/research/recommendatio... [08:43:27] elukey: I was thinking of adding some more logs in ores-legacy, do u think we could do it now? [08:47:18] o/ [08:47:35] 15 mins before the switch? :D [08:48:12] I am +1 for later, we can observe logs in the gateway logstash dashboard for the moment [08:48:14] what do you think? [08:49:01] yes probably better after the switch, but we can't see user-agents at the moment right? [08:49:17] we can yes, ores-legacy goes through the istio ingress [08:49:26] same dashboard [08:49:40] we just need to filter for ores-legacy (now there is nothing since we don't have traffic) [08:49:44] lemme try one thing [08:50:39] isaranto: https://logstash.wikimedia.org/goto/69b030ca32432272c27ba31909388ff1 [08:50:52] aa nevermind, we can see user-agent, just checked [08:51:32] we should also add better logging on the pods etc.. [08:51:38] great \o/ [08:51:51] Amir1: o/ we are ready to go, whenever you are [08:51:56] https://gerrit.wikimedia.org/r/c/operations/puppet/+/959762 [08:52:28] awesome [08:52:51] quick q: Should we coordinate with traffic, maybe we need to disable puppet and some jazz like that :D [08:53:54] definitely, let's do it [08:59:10] migration started! [08:59:28] to watch traffic https://logstash.wikimedia.org/goto/69b030ca32432272c27ba31909388ff1 [09:05:32] Morning [09:05:47] the puppet patch is applied automatically/gradually over the next half hour / hour, so we'll see traffic gradually shifting [09:06:10] basically puppet runs at different times on the various cpXXXX nodes (the caching traffic nodes), and it updates the ATS config [09:06:17] klausman: morning!
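The SLO scheme elukey described earlier (95% availability, 95% of successful requests below some threshold) could be checked against a request log roughly like this. The latency threshold and sample data are made up for illustration; this is a sketch, not the team's actual dashboard logic:

```python
# Toy SLO check: availability = share of requests that did not hit a server
# error; latency compliance = share of successful (2xx) requests under a
# hypothetical 500 ms threshold.
requests = [
    (200, 0.120), (200, 0.310), (500, 0.050), (200, 0.480),
    (200, 0.090), (404, 0.030), (200, 0.700), (200, 0.210),
]

def slo_report(reqs, latency_threshold_s=0.5):
    total = len(reqs)
    server_errors = sum(1 for status, _ in reqs if status >= 500)
    availability = 1 - server_errors / total
    successes = [lat for status, lat in reqs if 200 <= status < 300]
    fast = sum(1 for lat in successes if lat <= latency_threshold_s)
    latency_compliance = fast / len(successes) if successes else 1.0
    return availability, latency_compliance

avail, lat = slo_report(requests)
print(f"availability: {avail:.1%}, latency compliance: {lat:.1%}")
```

With this sample, one server error out of eight requests gives 87.5% availability, below the 95% bar being discussed.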
[09:06:32] Sorry for being sorta late to the switch, but my work laptop is throwing ECC errors [09:06:46] we just started, Amir merged the puppet patch [09:06:52] ack [09:06:55] now we have the difficult part :) [09:10:44] another interesting graph: [09:10:45] https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&var-datasource=esams%20prometheus%2Fops&var-cluster=text&var-origin=ores-legacy.discovery.wmnet&from=now-3h&to=now [09:11:28] the traffic flows in this way: [09:12:06] 1) external client ---> LVS ---> HA Proxy (cpXXXX) TLS termination [09:12:19] 2) HA Proxy -> Varnish frontend (first layer of caching) [09:12:45] 3) Varnish frontend -> Apache Traffic Server (ATS) (cpXXXX nodes, both are co-located) [09:12:54] 4) ATS -> ores-legacy.discovery.wmnet [09:16:02] Eventually (once all the puppet runs have happened), the numbers for ores.d.w and ores-legacy.d.w should be the same (or slightly higher for o-l), right? [09:16:09] if we see 50x errors in the above dashboard (or any sign of trouble) it means that something is not all right [09:16:22] interesting, thanks for explaining the flow! [09:16:38] klausman: yeah with ores.d having zero, or just health checks etc.. [09:17:32] I'd expect ores.d might still be counted, depending on whether the rewrite is taken into account for the numbers or not, i.e. the traffic going to ores by name but being transformed to o-l might be counted twice.
[09:18:12] klausman: I think it shouldn't, ATS is instructed to just connect to ores-legacy.d [09:18:21] ack [09:18:25] so all traffic for ores.wikimedia.org should go through it [09:18:36] atm, both o and o-l are increasing, but tracking each other closely [09:21:54] klausman: check all dcs, drmrs for example shows a different view [09:22:25] right [09:26:18] the ores dashboard (logstash) without twisted page getter (health check) shows a gradual decrease [09:30:14] And all the user agents are language libs or browsers [09:30:45] One exception: @WikiPhotoFight by ragesoss [09:32:20] I am also seeing https://ores.wikimedia.org/ replying with ores-legacy now [09:32:33] Same [09:33:07] I go through the Marseille cache [09:33:33] At 0900 we were at ~200 r/m, now it's 17. [09:33:40] and I am starting to see zero traffic (excluding health checks) on the ores dashboard [09:33:50] I see ores-legacy as well [09:34:07] isaranto: do you go through esams? [09:34:23] (you can see an X-Cache response header) [09:34:40] yes as far as I remember. lemme check [09:36:16] huh. with curl, I still get the old ores [09:37:43] I see zero traffic from ATS -> ores.d.w [09:37:50] yes I go through esams. Through the browser I get ores-legacy, through curl the old one [09:38:07] I still get replies with this: server: ores2009.codfw.wmnet [09:38:23] I also tried with --http1.1, same result [09:38:41] can you check your X-Cache headers? [09:38:45] do they say "pass" ? [09:38:57] also, what command are you executing? [09:39:01] nah, hit-front. [09:39:04] I get a 307 to /docs [09:39:06] So we're getting cached results [09:39:16] klausman: ah right, and you go through esams [09:39:17] ? [09:39:18] curl --http1.1 -vv https://ores.wikimedia.org/ [09:39:42] No drmrs, cp6016 [09:39:57] x-cache: cp6016 miss, cp6010 hit/5 [09:40:06] klausman: what about now?
[09:40:50] Now I don't even get a page body anymore [09:40:55] super [09:40:58] ah, it's a 307 :) [09:41:00] cleared it from mwmaint :) [09:41:14] elukey@mwmaint1002:~$ echo 'https://ores.wikimedia.org' | mwscript purgeList.php [09:41:15] I get ores-legacy now from anywhere I check [09:41:18] Purging 1 urls [09:41:20] Done! [09:41:34] the above tool needs to be used with extreme carre [09:41:35] *care [09:41:39] but it works when needed [09:42:11] folks I think that we did it [09:42:11] yep getting o-l on both my test machines now [09:43:00] GAH. [09:43:17] the ats backends graphs are _stacked_. [09:43:36] "closely tracking" is what I thought. but in reality it was "one is zero" [09:44:55] this is a HUGE moment folks [09:45:06] we have been working on this for more than 2y [09:45:16] Aye! I am having celebratory choc croissants [09:46:05] woohoo! \o/ \o/ \o/ [09:46:19] * elukey dances [09:48:37] 🎉 [09:49:07] 🎉 [09:50:40] https://phabricator.wikimedia.org/F37755455 [09:53:05] bye bye :) [09:54:58] The pod resource numbers for o-l in codfw look very steady, besides a very mild (and expected) increase in network traffic [09:56:57] klausman: nice chart!! [09:57:45] cpu usage is up from 2ms/s to 5ms/s, aka: big factor, still negligible compared to allowance [10:15:55] going to grab a quick bite, bbiab [10:42:25] 10Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10elukey) [10:51:09] I'm a bit curious why half of the requests are being logged as redirects (307 response code). If I make the same request I get a 200 [10:52:06] I keep getting a 307 from curl [10:55:48] 10Machine-Learning-Team, 10Patch-For-Review: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10elukey) [10:57:41] ok, I am getting a 307 from curl, but got a 200 from postman.
probably something is different in postman, so never mind [10:58:34] gooood [10:58:39] I think that we are stable [10:58:56] I expected to see some HTTP 400 for the bots hitting us with 50 rev-ids at once though [10:59:18] let's wait till later :) [10:59:43] ah no silly me [10:59:45] https://logstash.wikimedia.org/goto/2d245f035e3cc736483d357ae7a6a35b [10:59:46] isaranto: --^ [10:59:48] of course [11:00:00] we also need to filter for "ores.wikimedia.org" now [11:00:33] and indeed the 400s [11:01:33] oh ok [11:01:39] I'm going through the requests [11:01:45] yeah just realized [11:01:48] Good morning [11:01:52] o/ [11:04:21] ο/ [11:04:38] they are requesting too many revids [11:05:57] yep yep [11:06:18] I expected that, we'll see if anybody complains [11:06:29] it is the only minor "issue", if we can call it that [11:06:37] the rest is perfect [11:09:01] 404s coming from this app https://phabricator.wikimedia.org/T342958 [11:17:00] 10Machine-Learning-Team, 10Data-Engineering, 10Wikimedia Enterprise, 10Data Engineering and Event Platform Team (Sprint 2), and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10achou) a:05achou... [11:46:24] 10Machine-Learning-Team, 10Data-Engineering, 10Wikimedia Enterprise, 10Data Engineering and Event Platform Team (Sprint 2), and 2 others: Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792 (10achou) After discu... [13:42:47] 10Machine-Learning-Team: Create external endpoint for recommendation-api-ng hosted on LiftWing - https://phabricator.wikimedia.org/T347263 (10Seddon) Just to be clear this ticket is about the translation recommendation-api. The mobile apps recommendation-api the apps rely on is https://gerrit.wikimedia.org/g/m...
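A hypothetical client-side workaround for the HTTP 400s discussed above would be to split large rev-id batches before calling the API. The limit of 50 is an assumption taken from the conversation, not a documented API value:

```python
# Split a list of rev-ids into batches the service will accept. The limit
# of 50 is assumed from the chat above, not from API documentation.
def chunk_rev_ids(rev_ids, limit=50):
    return [rev_ids[i:i + limit] for i in range(0, len(rev_ids), limit)]

batches = chunk_rev_ids(list(range(120)), limit=50)
print([len(b) for b in batches])  # [50, 50, 20]
```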
[13:54:47] PROBLEM - ORES worker production on ores.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2230 bytes in 0.217 second response time https://wikitech.wikimedia.org/wiki/ORES [13:57:33] RECOVERY - ORES worker production on ores.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 985 bytes in 0.409 second response time https://wikitech.wikimedia.org/wiki/ORES [14:04:12] lol? [14:06:06] 10Machine-Learning-Team: Create external endpoint for recommendation-api-ng hosted on LiftWing - https://phabricator.wikimedia.org/T347263 (10elukey) @Seddon my understanding is that this version of the recommendation API is the one that we want to progress from now on, deprecating the one that the apps are usin... [14:07:43] ah right so the alert is related to ores.wikimedia.org, it is an old nagios/icinga check [14:08:05] going to remove it via https://gerrit.wikimedia.org/r/c/operations/puppet/+/960567 [14:46:16] isaranto: https://github.com/kserve/kserve/releases/tag/v0.11.1 :P [14:47:13] aaa thanks! [14:47:36] will take a look, perhaps it won't be too much of a hassle to upgrade [14:47:58] also having a discussion about metadata here https://github.com/kserve/kserve/issues/3098 [14:48:07] will add that to the orresponding task [14:48:12] *corresponding [14:49:11] ack! [14:49:18] I think it is just a patch release, nothing major [14:51:31] aiko: the readability change for the API GW (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/959684) is ready, I will deploy it tomorrow morning.
We can then see about the docs [15:04:55] 10Machine-Learning-Team, 10Patch-For-Review: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10klausman) a:03klausman [15:06:56] klausman: +1, thanks :) [15:16:21] 10Machine-Learning-Team: Decommission ORES configurations and servers - https://phabricator.wikimedia.org/T347278 (10elukey) [15:17:32] 10Machine-Learning-Team: Create external endpoint for recommendation-api-ng hosted on LiftWing - https://phabricator.wikimedia.org/T347263 (10Seddon) I think that is my understanding of our goals here, it was just more that the repo that was linked is not the repo for the recommendation-api that the mobile apps... [15:20:18] 10Machine-Learning-Team, 10Patch-For-Review: use wikiID in inference name on LW for revscoring models - https://phabricator.wikimedia.org/T342266 (10elukey) 05Resolved→03Open [15:22:06] 10Machine-Learning-Team: Create external endpoint for recommendation-api-ng hosted on LiftWing - https://phabricator.wikimedia.org/T347263 (10elukey) >>! In T347263#9195825, @Seddon wrote: > I think that is my understanding of our goals here, it was just more that the repo that was linked is not the repo for the... [15:22:42] isaranto: o/ do we still need https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/940945 ? I don't recall if we already followed up or not [15:30:23] I was following up on this today since my PR for swviewer got merged [15:31:45] okok! [15:32:08] elukey: it was intended to reflect the changes for wikiID where a wiki doesn't necessarily have the `wiki` suffix.
But as I see now, it seems that both of the model servers work through the API gateway [15:32:36] https://www.irccloud.com/pastebin/CmJsPvMM/ [15:33:58] isaranto: yeah I'd remove the support from API gw first, then isvcs [15:34:06] anyway, https://grafana.wikimedia.org/d/slo-ORES_Legacy/ores-legacy-slo-s?orgId=1 [15:34:18] the latency SLO is already red :D :D :D [15:36:35] indeed our p95 is a horror [15:36:35] https://grafana.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&refresh=30s&var-cluster=codfw%20prometheus%2Fk8s-mlserve&var-namespace=ores-legacy&var-backend=All&var-response_code=200&var-response_code=307&var-response_code=400&var-response_code=404&var-response_code=405&var-response_code=422&var-quantile=0.5&var-quantile=0.95&var-quantile=0.99 [15:36:41] I'm not sure if this patch would remove access from the API gateway https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/940945 [15:36:41] klausman: would it? [15:37:10] ah right now I see what you are saying [15:37:11] yes yes [15:37:38] we can proceed with the isvcs then [15:37:40] +1ed [15:38:11] dat 50s latency peak :-/ [15:38:54] it is expected. Perhaps the latency SLO is over-optimistic [15:39:23] still some of the requests probably could be improved as I don't think they should take that much time [15:40:42] we should figure out a new threshold that we want to keep as reference [15:49:51] I find it hard to predict what kind of latency is acceptable from a user POV [15:54:09] it is not easy yes [16:06:52] 10Machine-Learning-Team, 10Section-Level-Image-Suggestions, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): [XL] Productionize section alignment model training - https://phabricator.wikimedia.org/T325316 (10CodeReviewBot) mfossati opened https://gitlab.wikimedia.org/repos/data-engineering/air... [16:14:11] (03CR) 10AikoChou: "Sorry I have a naive question.
With this implementation, does `https://ores-legacy.wikimedia.org/v3/scores/enwiki/12312342/damaging?featur" [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/960587 (https://phabricator.wikimedia.org/T347193) (owner: 10Ilias Sarantopoulos) [16:16:29] 10Machine-Learning-Team, 10Patch-For-Review: Remove traffic from old eswikibooks and eswikiquote deployments - https://phabricator.wikimedia.org/T345850 (10isarantopoulos) Summary: The following servers are exposed through the API-GW ` - /lw/inference/v1/models/enwiktionary-reverted:predict - /lw/inference/v1/... [16:16:36] * elukey afk! [16:18:35] 10Machine-Learning-Team: petscan expects javascript function callback from ORES - https://phabricator.wikimedia.org/T347317 (10taavi) [16:18:50] aiko: the only naive question is the one that hasn't been asked yet <3 [16:19:17] ^^ is a seemingly popular tool being incompatible with the new ORES legacy service [16:27:52] thanks taavi: we're taking a look [16:37:41] geeq. [16:37:46] oops, wrong window :) [16:45:41] going afk, will follow up on those issues early morning [16:50:27] is there any traffic at all coming from @WikiPhotoFight? i wasn't aware i had any instance of it up. [16:50:45] (maybe that's what the mystery raspberry pi in my storage room does?) [16:51:27] There was a little bit of traffic over 5-10m this morning around the switch time, haven't checked since. sec [16:52:02] okay. i'll figure out which server is running that and turn it off. i'm done with twitter bots for obvious reasons [16:53:13] Thanks klausman for looking at that
[17:03:02] Ah, found it [17:03:12] Source IP is 75.172.74.153 [17:03:57] Last request was Sep 25, 2023 @ 16:58:57.294 UTC [17:05:03] ragesoss: ^^^ [17:05:20] and what was the user agent? [17:06:38] "@WikiPhotoFight by ragesoss" [17:08:04] great. should be no more traffic now. that user agent was hard-coded somewhere it should not have been, so it was actually the FixmeBot twitter bot that I just turned off. [17:08:31] Ah, righto. Thanks! [17:08:36] thank you! [17:09:02] I'll keep an eye on logstash just in case [17:12:24] Thanks ragesoss [17:13:43] np. i'm just about ready to switch to LW for both production instances of the Dashboard, as well. the code is ready, just wanted to make sure i had plenty of time to keep an eye on it if it introduces major performance problems. [18:06:35] The past requests were at xx:58, and there was nothing just as this last hour rolled around, so I think we good. [18:06:46] * klausman signing off for today \o [19:28:01] 10Machine-Learning-Team, 10ORES: User-scripts running on Wikipedia can no longer use ORES (CORS issue) - https://phabricator.wikimedia.org/T347344 (10Halfak) [19:28:40] 10Machine-Learning-Team, 10ORES: User-scripts running on Wikipedia can no longer use ORES (CORS issue) - https://phabricator.wikimedia.org/T347344 (10Halfak) @Ciell reported the issue this weekend. All of my investigations lead to this error. [19:30:36] 10Machine-Learning-Team, 10ORES: User-scripts running on Wikipedia can no longer use ORES (CORS issue) - https://phabricator.wikimedia.org/T347344 (10Halfak) I confirmed that the issue persists when calling ores-legacy. E.g. > $.ajax({url: "https://ores-legacy.wikimedia.org/v3/scores/"}).done(function(respons... [19:54:28] 10Machine-Learning-Team, 10ORES, 10Wikimedia-Site-requests: ORES article quality is gone from euwiki in Mozilla Firefox 117.0.1 - https://phabricator.wikimedia.org/T347243 (10Aklapper) This might be the same as {T347344}.
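The user-agent hunt above amounts to filtering log entries and taking the latest match. A toy sketch: the UA string, IP, and timestamp are the ones found in the discussion, the second entry and the helper itself are made-up stand-ins for the real logstash documents:

```python
# Find the most recent request from a given user agent in a log sample.
logs = [
    {"ts": "2023-09-25T16:58:57Z", "ua": "@WikiPhotoFight by ragesoss", "ip": "75.172.74.153"},
    {"ts": "2023-09-25T16:40:00Z", "ua": "python-requests/2.31", "ip": "10.0.0.5"},
]

def last_request_from(entries, ua):
    matches = [e for e in entries if e["ua"] == ua]
    # ISO-8601 timestamps sort lexicographically, so max() finds the latest
    return max(matches, key=lambda e: e["ts"]) if matches else None

hit = last_request_from(logs, "@WikiPhotoFight by ragesoss")
print(hit["ip"])  # 75.172.74.153
```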
[21:49:24] 10Machine-Learning-Team, 10ORES: User-scripts running on Wikipedia can no longer use ORES (CORS issue) - https://phabricator.wikimedia.org/T347344 (10Novem_Linguae) I confirm this bug on enwiki in the user script https://en.wikipedia.org/wiki/User:Evad37/rater, which has 1200 installs. The CORS error is: ` Ac... [22:06:58] 10Machine-Learning-Team, 10ORES: User-scripts running on Wikipedia can no longer use ORES (CORS issue) - https://phabricator.wikimedia.org/T347344 (10Widefox) I confirm the Rater user-script can no longer use ORES for enwiki.
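The CORS failures reported in T347344 come down to the browser requiring an appropriate Access-Control-Allow-Origin header on cross-origin responses. A minimal illustrative check of that rule (not the actual ores-legacy code):

```python
# A cross-origin request from a wiki page only succeeds if the response
# allows the requesting origin, either via "*" or an exact origin match.
def cors_allows(response_headers, origin):
    allow = response_headers.get("Access-Control-Allow-Origin")
    return allow == "*" or allow == origin

# Header missing (the reported symptom): browser blocks the response.
print(cors_allows({"Content-Type": "application/json"}, "https://en.wikipedia.org"))  # False
# Wildcard allowance: request succeeds.
print(cors_allows({"Access-Control-Allow-Origin": "*"}, "https://en.wikipedia.org"))  # True
```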