[07:26:02] Machine-Learning-Team, Observability-Metrics, serviceops, Kubernetes: Don't scrape every containerPort for metrics - https://phabricator.wikimedia.org/T318707 (Joe) I think the current solution works well. Basically: * If your pod contains `prometheus.io/scrape: true` prometheus will pick up the...
[08:52:33] \o
[08:54:00] (CR) Elukey: "recheck" [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/922809 (https://phabricator.wikimedia.org/T337287) (owner: Elukey)
[09:03:25] (PS2) Elukey: ores-legacy: simplify test in test_liftwing.py [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/922809 (https://phabricator.wikimedia.org/T337287)
[09:03:27] (PS5) Elukey: WIP - Add read-only cache support to ores-legacy [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/922561 (https://phabricator.wikimedia.org/T337287)
[09:05:50] (CR) Klausman: [C: +1] ores-legacy: simplify test in test_liftwing.py [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/922809 (https://phabricator.wikimedia.org/T337287) (owner: Elukey)
[10:26:38] (CR) Kevin Bazira: [C: +1] ores-legacy: simplify test in test_liftwing.py [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/922809 (https://phabricator.wikimedia.org/T337287) (owner: Elukey)
[10:46:12] aiko: fyi, merging and deploying the regex fix now
[10:52:21] $ curl https://staging.svc.eqiad.wmnet:8087/service/lw/inference/v1/models/revertrisk-language-agnostic:predict -X POST -d '{"rev_id": 123456, "lang": "en"}'; echo
[10:52:23] {"lang":"en","rev_id":123456,"score":{"prediction":false,"probability":{"true":0.25512129068374634,"false":0.7448787093162537}}}
[10:52:25] Works in staging \o/
[10:52:39] Now deploying to apigw prod instances
[10:54:40] * elukey lunch
[10:55:23] and all done (regex)
[10:56:28] Machine-Learning-Team, Patch-For-Review: Fix Regular Expression in API GW config for revert risk - https://phabricator.wikimedia.org/T337378 (klausman) Changes have been merged and deployed, Both eqiad and codfw (and staging) sections of the API GW work fine (tested from within clusters), as well as remot...
[10:57:26] heading for lunch as well
[12:02:08] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (jnuche)
[12:12:29] Machine-Learning-Team, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (jnuche) In a project's `.gitlab-ci.yml`, it is now possible to publish documentation and test coverage results to doc.wikimedia.org...
[13:12:48] Machine-Learning-Team, ORES, artificial-intelligence, ML-Governance, Documentation: Create licenses transclusion template for ORES model cards - https://phabricator.wikimedia.org/T337479 (kevinbazira)
[13:21:54] (PS6) Elukey: WIP - Add read-only cache support to ores-legacy [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/922561 (https://phabricator.wikimedia.org/T337287)
[13:21:56] (PS1) Elukey: test_liftwing.py: simplify decorator test [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/923339
[13:23:30] (CR) CI reject: [V: -1] test_liftwing.py: simplify decorator test [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/923339 (owner: Elukey)
[13:29:48] (PS2) Elukey: test_liftwing.py: simplify decorator test [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/923339
[13:29:50] (PS7) Elukey: WIP - Add read-only cache support to ores-legacy [machinelearning/liftwing/inference-services] - https://gerrit.wikimedia.org/r/922561 (https://phabricator.wikimedia.org/T337287)
[14:06:16] Machine-Learning-Team, ORES, artificial-intelligence, ML-Governance, Documentation: Create licenses transclusion template for ORES model cards - https://phabricator.wikimedia.org/T337479 (kevinbazira) A licenses template has been created and can be found here: https://meta.wikimedia.org/wiki/...
[14:50:53] very weird, hitting revert risk from api gateway takes ~2/3s
[14:51:04] via inference.d.w takes 0.3/0.4
[14:53:19] the api servers are still suffering a bit
[14:53:42] ahhh no ok
[14:53:53] so if I hit the "eqiad" internal endpoint it takes the same as api gateway
[14:54:00] it is super fast on codfw
[14:54:10] so ok this is related to the ongoing outage, let's re-test it tomorrow
[14:57:57] morning all
[15:00:46] hello chris!
[15:01:06] also, another thing that we could try is to figure out a way to invalidate caches for the api-gateway
[15:01:25] because we could in theory cache results in varnish, and invalidate them when a new model is deployed
[15:01:28] or similar
[15:01:53] at the moment every request to api-gateway is a "pass", so goes straight to lift wing
[15:02:04] if we add simple HTTP caching we'd have varnish in front of us
[15:03:09] https://phabricator.wikimedia.org/T324200
[15:09:22] How would the invalidation mechanism work?
[15:13:18] ah, the task has a desc
[15:16:09] this is the main question mark, it is not super easy
[15:17:14] Ideally, it would happen as part of running the helm chart, or thereabouts
[15:17:56] not sure if it is possible, I think it is more along the lines of emitting an event that is then processed by benthos etc.. that in turn send cache purge requests
[15:18:22] That might be easier, yes
[15:19:05] and also we probably don't want to invalidate the cache for every deployment, but only the ones that change the model or similar
[15:20:49] Good point, it should remain a manual/optional step
[15:42:12] going afk folks!
[15:42:16] have a nice rest of the day!
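Note on the caching idea discussed between 15:01 and 15:20: the following is a minimal Python sketch of "cache responses, purge only when a deployment changes the model". It is a toy in-memory stand-in, not Varnish, benthos, or any real Lift Wing component. The class `ModelResponseCache`, the function `handle_deploy_event`, and the `model_changed` event field are all hypothetical names introduced for illustration; the log does not say what such an event would look like.

```python
import hashlib
import json


class ModelResponseCache:
    """Toy in-memory stand-in for an HTTP cache in front of a model server."""

    def __init__(self):
        # model name -> {request key -> cached response}
        self._entries = {}

    @staticmethod
    def _key(payload):
        # Serialize with sorted keys so equivalent payloads share one entry.
        raw = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def get_or_compute(self, model, payload, compute):
        """Return (response, was_hit); call compute(payload) only on a miss."""
        per_model = self._entries.setdefault(model, {})
        key = self._key(payload)
        if key in per_model:
            return per_model[key], True
        response = compute(payload)
        per_model[key] = response
        return response, False

    def purge_model(self, model):
        """Drop every cached response for one model; return how many were removed."""
        return len(self._entries.pop(model, {}))


def handle_deploy_event(cache, event):
    """Purge only when the (hypothetical) event says the model itself changed."""
    if event.get("model_changed"):
        return cache.purge_model(event["model"])
    return 0
```

Keeping entries grouped per model makes the purge a single dictionary removal, and gating it on an explicit flag mirrors the point raised at 15:19 that not every deployment should invalidate the cache.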
[15:42:19] elukey: o/
[15:42:33] elukey: need help
[15:42:46] elukey: a quick question - I sent an event to staging.liftwing.test-outlink-events, but I didn't see a new request in outlink's logs. How do I check if changeprop works as expected?
[15:43:02] are there any logs I can check?
[15:43:31] in theory on logstash, but lemme quickly check on the pods
[15:44:27] thank u!!
[15:45:04] the main issue is that both staging wikikube clusters connect to the eqiad main kafka cluster
[15:45:13] so either of them may process the events
[15:45:21] needs to be fixed of course, not great
[15:46:06] ah wow I think the pods are not health
[15:46:09] *healthy
[15:46:45] what's the problem?
[15:47:03] trying to get it in the horror js stacktrace
[15:47:47] it says "Error: Invalid match object given!" but not where, I suspect it may be the new code
[15:48:17] match object as in Python regex?
[15:48:38] Error: Invalid match object given!\n at Rule._processMatch (/srv/service/lib/rule.js:307:19)
[15:48:54] it seems to be the changeprop config's match
[15:49:33] I added two new match configs
[15:49:39] page.is_redirect: false and page.namespace_id: 0
[15:50:03] maybe they are not valid?
[15:51:28] in this patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/920282
[15:55:12] from https://integration.wikimedia.org/ci/job/helm-lint/10547/console I don't see anything weird in the rendered config
[15:56:42] ahh wait maybe it is the "."
[15:56:56] lemme try to file a change
[15:57:16] you mean yaml interpreting the . when it shouldn't?
[15:58:57] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/923376
[15:58:59] this is my idea
[15:59:20] LGTM
[16:00:25] but it doesn't render ok sigh
[16:03:37] testing one idea
[16:05:48] ah yeah now I recall the problems, with quoting etc..
[16:05:54] I cannot simply use toYaml
[16:10:13] aiko: I need to figure out a good way to handle the use case in the templates, will work on it tomorrow (need to go now)
[16:10:18] byeee
[16:10:27] \o
[16:10:45] elukey: no prob! I'll look into it as well
[16:10:56] elukey: bye :)
[17:58:57] Machine-Learning-Team, Data-Engineering-Planning, Event-Platform Value Stream: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (Ottomata) @achou Oh, you know, we should probably version this stream. https://wikitech.wikimedia.org/wiki/Event_Platform/Str...
[18:00:01] Machine-Learning-Team, Data-Engineering-Planning, Event-Platform Value Stream: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (Ottomata)
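Note on the changeprop match discussion (15:49 to 16:05): the log only suspects that the "." in `page.is_redirect` and `page.namespace_id` is the problem, and never confirms the root cause or what shape changeprop expects. The Python sketch below merely illustrates the difference between a flat mapping with dotted keys and the nested mapping those keys might be meant to describe. The helper `expand_dotted_keys` is a hypothetical name, not part of changeprop or Helm.

```python
def expand_dotted_keys(flat):
    """Turn {"a.b": 1} into {"a": {"b": 1}}; keys without dots are kept as-is."""
    nested = {}
    for dotted_key, value in flat.items():
        node = nested
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            # Walk down, creating intermediate mappings on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Flat form, as the two new match entries are written in the chat.
flat_match = {"page.is_redirect": False, "page.namespace_id": 0}

# Nested form: a single "page" mapping holding both fields.
nested_match = expand_dotted_keys(flat_match)
print(nested_match)  # {'page': {'is_redirect': False, 'namespace_id': 0}}
```

A matcher that walks nested objects would see one `page` key in the nested form but two unrelated top-level keys in the flat form, which is one way a dotted key could end up not matching as intended.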