[07:01:16] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (elukey)
[07:29:36] accraze: o/
[07:30:08] you are definitely right, the egress gateway wasn't configured to reach *.wiktionary.org or *.wikibooks.org
[07:30:21] I added the endpoints manually to ml-serve-eqiad and filed https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/763653
[07:30:38] I am testing the inference.svc.eqiad.wmnet endpoint, and I still get the 500
[07:30:58] but from the istio egress logs it seems that we are getting an answer from the api
[07:33:50] ahh nice
[07:33:50] revscoring.errors.RevisionNotFound: RevisionNotFound: Could not find revision ({revision}:132421)
[07:33:55] lemme try with a good one
[07:34:42] works!
[07:37:36] we should probably catch the RevisionNotFound error and return a different error code
[07:38:13] maybe an HTTP 400, or 404 (even if 404 may also be returned by istio if no domain is available)
[07:40:41] Lift-Wing, Machine-Learning-Team (Active Tasks): Return meaningful HTTP responses in Lift Wing's revscoring backends - https://phabricator.wikimedia.org/T300270 (elukey) Today, while testing the new wikibooks model, I got a HTTP 500 and the pod logs indicated this: ` revscoring.errors.RevisionNotFound:...
[07:43:12] accraze: wiktionary and wikibooks are allowed now on both eqiad and codfw. The only mildly weird thing is the name of the isvc, since it is like "enwiktionarywiki-goofaith" etc..
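A minimal sketch of the error handling proposed at 07:37:36, not the actual Lift Wing code. It assumes a local stand-in exception (the real one is `revscoring.errors.RevisionNotFound`) and two hypothetical helpers, `fetch_score` and `handle_request`. It picks HTTP 400 over 404 only because, as noted above, istio may also return 404 when no domain is available.

```python
from http import HTTPStatus


class RevisionNotFound(Exception):
    """Hypothetical stand-in for revscoring.errors.RevisionNotFound."""


def fetch_score(rev_id, known_revisions):
    """Hypothetical scorer: raises when the revision does not exist."""
    if rev_id not in known_revisions:
        raise RevisionNotFound(f"Could not find revision ({rev_id})")
    return {"rev_id": rev_id, "score": known_revisions[rev_id]}


def handle_request(rev_id, known_revisions):
    """Return (status, body); a missing revision becomes a 400, not a 500."""
    try:
        return HTTPStatus.OK, fetch_score(rev_id, known_revisions)
    except RevisionNotFound as exc:
        # Client-side problem (bad rev_id), so report it as such.
        return HTTPStatus.BAD_REQUEST, {"error": str(exc)}
```

With this shape, a request for a nonexistent revision such as 132421 would yield a 400 with the error message in the body, while any other exception would still surface as a 500.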
[07:43:26] if we want to change the isvc name we can try, lemme know, otherwise we leave this one
[07:43:46] (we'd need to take these cases into consideration when injecting the HTTP Host header in the api-gateway I think)
[07:44:52] ah snap I see only now that Kevin also needs wikiquote.org :D
[07:49:04] kevinbazira: o/
[07:49:35] elukey: o/
[07:51:32] kevinbazira: I am reading https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/763647/1/helmfile.d/ml-services/revscoring-editquality/values.yaml but the "wiki" variables or the "host" variables may be wrong
[07:51:41] I see "de" with "es.wiki.."
[07:55:17] thank you for catching that. i've fixed it: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/763647
[07:57:12] kevinbazira: lgtm! Will merge so you can deploy
[07:57:32] great. i'll be on standby.
[08:12:47] kevinbazira: you can deploy :)
[08:13:17] kevinbazira: not sure if you saw https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Deploy, I moved the page to a more canonical place
[08:13:29] it is surely a little stale, we can add new info etc.. as we go
[08:13:49] ok. thank you for moving the docs. starting deployment now ...
[08:15:55] Lift-Wing, artificial-intelligence, editquality-modeling, Machine-Learning-Team (Active Tasks), Patch-For-Review: Add editquality isvc configurations to ml-services helmfile - https://phabricator.wikimedia.org/T301415 (kevinbazira)
[08:22:57] both eqiad and codfw deployments have been completed successfully.
[08:23:08] checking pods now ...
[08:23:46] they are up and running
[08:23:47] NAME                                                              READY   STATUS    RESTARTS   AGE
[08:23:47] eswiki-damaging-predictor-default-24nlh-deployment-bfb5bc7sgwrm   2/2     Running   0          81s
[08:23:47] eswiki-goodfaith-predictor-default-f68qr-deployment-7b797fwphmr   2/2     Running   0          79s
[08:23:47] eswikiquotewiki-damaging-predictor-default-lk2lf-deploymenjm8ld   2/2     Running   0          78s
[08:23:47] eswikiquotewiki-goodfaith-predictor-default-jf2md-deployme5v5qk   2/2     Running   0          76s
[08:26:22] super
[08:31:25] I tested es wikiquote and it looks fine
[08:31:59] I am going to log off folks, today I am taking the rest of the day off
[08:32:05] have a good weekend :)
[08:32:07] o/
[08:34:29] thank you for everything elukey.
[08:34:41] enjoy your weekend o/
[14:16:16] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Jclark-ctr)
[14:17:33] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q3:(Need By: TBD) rack/setup/install ml-cache100[1-3] - https://phabricator.wikimedia.org/T299435 (Jclark-ctr) ml-cache1001 E1 U23 ml-cache1002 E2 U23 ml-cache1003 F1 U23
[14:52:35] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (Jclark-ctr) @elukey Could these be racked in 10g racks?
[14:56:13] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (elukey) >>! In T294949#7721267, @Jclark-ctr wrote: > @elukey Could these be racked in 10g racks? Hi John! These hosts don't need 10g, so they can be...
[14:57:54] Machine-Learning-Team, DC-Ops, SRE, ops-eqiad: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (Jclark-ctr) @elukey we are at our limit for power in our old cage and these have 10g cards in them and our new cage will be live any day now so it cou...
[15:13:56] Machine-Learning-Team, DC-Ops, SRE, ops-codfw: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (Papaul)
[15:36:46] o/
[15:37:08] glad to see our issues from yesterday were just egress config :P
[16:01:45] today I'm going to try to install feast on ml-sandbox so we can all play around with it next week
[16:06:33] a basic online feature store seems fairly lightweight, just redis and a feature-server for now
[16:28:31] Machine-Learning-Team, DC-Ops, SRE, ops-codfw: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (Papaul)
[18:52:54] morning all!
[18:53:47] Apparently there was some big vandalism attack on enwiki yesterday. All is good but I wonder if the list of revisions and reverts could make a training dataset
[19:53:03] ^^^ that is a v interesting idea
[19:55:48] I'm going to dig into it
[20:57:38] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-cache2002.codfw.wmnet with O...
[21:05:52] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (Papaul) @elukey can you please double check again the partman i am getting the error below ` Failed to retrieve the preconfig...
[21:08:50] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (Dzahn) @Papaul It's just missing the ".cfg" file ending.
https://apt.wikimedia.org/autoinstall/partman/
[21:14:28] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (Dzahn) @Papaul deployed fix and ran puppet on apt1001. try again now
[21:14:47] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (Papaul) thanks
[21:22:38] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-cache2002.codfw.wmnet with OS bu...
[21:23:16] ok got a slightly older version of feast (v.0.12.0) running on ml-sandbox for now, still need to figure out the best way to test, will try loading some revscoring features in next week
[21:24:35] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-cache2002.codfw.wmnet with O...
[21:56:52] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-cache2002.codfw.wmnet with OS bu...
[21:59:41] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-cache2003.codfw.wmnet with O...
[22:36:25] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-cache2003.codfw.wmnet with OS bu...
[22:47:00] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-cache2001.codfw.wmnet with O...
[22:53:13] awesome thanks accraze
[23:19:00] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-cache2001.codfw.wmnet with OS bu...
[23:23:50] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (Papaul)
[23:24:14] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (Papaul)
[23:25:07] Machine-Learning-Team, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q3:(Need By: TBD) rack/setup/install ml-cache200[1-3] - https://phabricator.wikimedia.org/T299433 (Papaul) Open→Resolved @elukey complete