[07:45:38] o/
[07:46:31] Machine-Learning-Team, ORES, GitLab (Project Migration): Migrate ORES/Revscoring/etc. repos to Gitlab or Gerrit - https://phabricator.wikimedia.org/T264651 (Aklapper)
[07:53:44] I created https://github.com/kserve/kserve/issues/2292 to ask upstream whether knative 1.0 is the only supported version for kserve 0.8
[07:54:10] https://kserve.github.io/website/0.8/admin/serverless/ is not very encouraging, from 0.8+ they support only k8s 1.20+
[08:12:33] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache2001.codfw.wmnet with OS buster
[08:13:07] ok so I am going to reimage ml-cache2* to buster
[08:13:22] so that I can hand over the clusters to Tobias
[08:15:37] (CR) Elukey: [V: +2 C: +2] scap: increase ores canary targets from 1 to 4 [services/ores/deploy] - https://gerrit.wikimedia.org/r/809617 (owner: Elukey)
[08:15:52] (CR) Elukey: [V: +2 C: +2] Update the ores submodule to deploy the last changes [services/ores/deploy] - https://gerrit.wikimedia.org/r/809597 (owner: Elukey)
[08:19:12] deploying ores to the 4 canaries
[08:21:24] elukey: hi, pinging again about https://phabricator.wikimedia.org/T307389 as requested
[08:23:37] taavi: sorry again, pinged Chris on slack
[08:23:57] thanks!
[08:27:20] thank you!
[08:27:22] ---
[08:27:32] all canaries look good, proceeding with the rest of ores nodes
[08:28:36] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache2002.codfw.wmnet with OS buster
[08:36:46] Machine-Learning-Team, ORES: ORES gives internal error on an invalid model_info parameter - https://phabricator.wikimedia.org/T279271 (elukey) In progress→Resolved @Gethan your change has been deployed today, thanks a lot for your contribution!
[08:37:46] ORES deployment completed :)
[08:40:52] Lift-Wing, SRE-swift-storage, Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (elukey) @MatthewVernon hi! Do you have any guidance about how to proceed?
[08:41:14] Lift-Wing, SRE-swift-storage, Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (elukey)
[08:41:28] Lift-Wing, SRE-swift-storage, Machine-Learning-Team (Active Tasks): Create Swift account for readonly access to ML models - https://phabricator.wikimedia.org/T311628 (elukey)
[08:42:54] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ml-cache2003.codfw.wmnet with OS buster
[08:54:53] all logs and metrics from ORES look good, so I think we are done
[08:54:59] 4 canaries are definitely better :)
[08:55:19] I kicked off the ml-cache2* reimages, will bootstrap the cluster once everything is done
[08:55:27] (afk for a bit)
[08:57:18] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache2001.codfw.wmnet with OS buster completed: - ml-...
[09:36:13] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache2002.codfw.wmnet with OS buster completed: - ml-...
[09:47:59] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ml-cache2003.codfw.wmnet with OS buster completed: - ml-...
[11:28:06] <- Lunch
[11:28:34] elukey: when we're both back, can we have a quick VC about Prometheus and k8s pods? Nothing fancy, just bouncing ideas around
[12:33:50] (PS1) Bluehill395: Add Korean special page alias [extensions/ORES] - https://gerrit.wikimedia.org/r/809974
[13:08:48] klausman: sure! Anything specific in mind?
[13:09:22] Mostly questions of reachability and discovery
[13:09:25] what I know is that we can add special annotations to resources to allow prometheus masters to scrape pods
[13:09:34] or better, some ports that they expose
[13:09:56] like prometheus.io/scrape: "true"
[13:09:59] The first question - Can Prometheus even scrape individual pods? - you've answered :)
[13:10:15] or prometheus.io/port: "9102"
[13:10:26] yeah it depends on those annotations
[13:10:37] The second one is how prometheus knows that they're there, scrapes them (and possibly scraping/transformation rules)
[13:11:11] IIUC the prometheus masters get the information about what resources have certain annotations using the k8s control plane api
[13:11:17] And then retention, dashboards etc, but I suspect the answer to half of that is "decent defaults" and "Grafana"
[13:11:27] it should all be configured in the specific prometheus k8s master instances
[13:11:38] So Prometheus would autodiscover scrape-able pods and auto-scrape them?
[13:11:44] in theory yes
[13:11:56] Ok, that is delightfully simpler than I had feared
[13:11:58] I have done the same for istio/knative/etc..
[13:12:17] maybe there are some subtle gotchas that I don't know about
[13:12:21] but it should work in that way
[13:13:21] The next question then will be how to make revscoring "hello world", as in: simplest possible setup that can be iterated on quickly
[13:13:53] i.e. not six miles of code reviews and container config/uploads/...
[13:15:06] in my home dir on stat1004 I have a simple revscoring_test.py that is the feature api extractor, basically the piece of code that causes so many mw api calls
[13:15:18] it needs to be used with a venv with revscoring deps etc..
[13:15:57] https://grafana.wikimedia.org/d/RLhtAw6mz/ores-redis?orgId=1&refresh=1m is also useful, forgot to tell you
[13:16:12] noted
[13:16:32] and https://grafana.wikimedia.org/d/vAN_bQemz/ores-advanced-metrics?orgId=1&refresh=1m
[13:16:45] but instrumenting revscoring (if this is the idea) may be difficult
[13:16:58] How so?
[13:18:34] first of all it is a library, so it doesn't really expose any prometheus HTTP server etc.. it is not instrumented to push any metric afaik, but ORES is (it pushes metrics to a local statsd endpoint, that ingests them and publishes them as prometheus metrics)
[13:20:38] in the revscoring py example mentioned above (on stat1004) there is a commented logging.getLogger().setLevel(logging.DEBUG), if you uncomment it you'll see what revscoring does
[13:20:48] (it prints all calls to mwapi etc..)
[13:21:26] when I asked Aaron if we were able to make our own HTTP calls and inject results into revscoring, he answered back with https://www.mediawiki.org/wiki/ORES/Feature_injection#Feature_injection:_playing_with_what_ORES_sees
[13:21:32] that is useful but not super straightforward
[13:21:58] in the ideal world we'd be able to make async http calls to the mw api to retrieve the features, to then pass them to revscoring
[13:21:58] hmm, I will have to do some reading, then
[13:26:04] ml-cache codfw cluster up and running on buster :)
[13:26:10] very nice
[13:26:42] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (elukey) codfw cluster up and running on Buster :)
[13:27:11] I am going to recap what we decided in the --^ task and then I can hand over the rest of the work to you :)
[13:32:16] :D
[13:51:25] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (elukey) @lbowmaker @Eevans I had a long chat with my team about the AQS cluster and our use cases, we reached some consensus about how to proceed, lem...
[13:51:37] I have summarized our chats (I hope in a good way) in https://phabricator.wikimedia.org/T302232#8040310
[13:51:40] klausman: --^
[13:58:54] * elukey bbl
[14:04:45] That summary sounds good to me. Succinct yet complete
[14:47:20] thanks for the review :)
[15:01:34] Morning all!
[15:02:01] Long day done! Now I get to live in my true form, in a hoodie hunched over a keyboard alone
[15:05:46] :)
[15:05:48] morning
[15:06:17] so this should be the first patch for the new mediawiki.revision-score-editquality stream:
[15:06:20] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/810007
[15:06:36] it needs a mw deploy, and an eventgate roll restart (minor, in k8s)
[15:06:50] after that, we should be able to send events to eventgate from liftwing
[15:07:19] I am trying to follow up with data platform to see if we can use something that they provide (airflow, etc..) instead of changeprop
[15:07:42] if so we'll be able to test generating events continuously from kafka
[15:07:48] (editquality only for the moment)
[15:08:44] chrisalbon: I summarized what we discussed in https://phabricator.wikimedia.org/T302232#8040310
[15:09:46] LGTM'd!
[15:10:38] thanks!
[15:11:05] I'll wait for Andrew's opinions about naming etc.., I am sure something will need to be tweaked
[15:11:15] but overall I feel good about the plan (famous last words)
[15:18:48] It'll be fine™
[15:37:20] all right going afk folks, o/
[15:39:12] looking! not sure where airflow fits in there tho?
[15:54:34] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (calbon) Thanks @elukey for the summary. The TL;DR is that we are trying to build the minimum thing to fix a very specific problem regarding the ORES m...
[19:27:39] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Send score to eventgate when requested - https://phabricator.wikimedia.org/T301878 (Ottomata) Looks good. My only worry is that these making these new streams now, and planning to refactor their data model them later based...
[21:07:58] Lift-Wing, Epic, Machine-Learning-Team (Active Tasks), Patch-For-Review: Set up the ml-cache clusters - https://phabricator.wikimedia.org/T302232 (Eevans) >>! In T302232#8040310, @elukey wrote: > @lbowmaker @Eevans I had a long chat with my team about the AQS cluster and our use cases, we reached...
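
Note on the Prometheus discussion between [13:09:25] and [13:11:44]: the annotation-based opt-in can be sketched roughly as below. This only models the selection logic (which pods get scraped, and on which port); it is not Prometheus code and does not talk to the k8s API. The pod dictionaries, the `scrape_targets` helper, and the pod names and IPs are hypothetical; only the `prometheus.io/scrape` and `prometheus.io/port` annotation keys and the port value 9102 come from the chat.

```python
from typing import Dict, List

# Annotation keys quoted in the chat.
SCRAPE_KEY = "prometheus.io/scrape"
PORT_KEY = "prometheus.io/port"


def scrape_targets(pods: List[Dict]) -> List[str]:
    """Return "ip:port" targets for pods that opt in via annotations."""
    targets = []
    for pod in pods:
        annotations = pod.get("annotations", {})
        # k8s annotation values are strings, hence the comparison with "true".
        if annotations.get(SCRAPE_KEY) != "true":
            continue
        port = annotations.get(PORT_KEY)
        if port is None:
            # Assumption of this sketch: without a port annotation, skip the pod.
            continue
        targets.append(f"{pod['ip']}:{port}")
    return targets


# Hypothetical pods, loosely shaped like pod metadata from the control plane API.
pods = [
    {"name": "revscoring-a", "ip": "10.0.0.5",
     "annotations": {SCRAPE_KEY: "true", PORT_KEY: "9102"}},
    {"name": "revscoring-b", "ip": "10.0.0.6", "annotations": {}},
]

print(scrape_targets(pods))
```

Only the annotated pod ends up in the target list, which matches the "autodiscover and auto-scrape" behaviour described in the chat; how a real Prometheus instance is configured to honour these annotations is left out here.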