[06:21:03] 10Machine-Learning-Team: revscoring model should be able to query an internal mw endpoint - https://phabricator.wikimedia.org/T289778 (10elukey) @ACraze wow nice! I checked and the endpoint to use should be `api-ro.discovery.wmnet` + `Host` header, the rest should be handled by the `mwapi` library in theory (so...
[06:36:54] 10Lift-Wing, 10Machine-Learning-Team: Add network policies to the ML k8s clusters - https://phabricator.wikimedia.org/T289834 (10elukey)
[06:41:46] 10Lift-Wing, 10Machine-Learning-Team: Create a LB service for inference.discovery.wmnet - https://phabricator.wikimedia.org/T289835 (10elukey)
[06:42:06] all right, created the remaining tasks for the infra --^
[06:42:37] 1) LB endpoint for inference.discovery.wmnet + TLS cert + secure port + etc..
[06:42:42] 2) PSP for k8s
[06:43:00] 3) GlobalNetworkPolicies for the ml clusters (to limit network traffic)
[07:18:08] 10Lift-Wing, 10artificial-intelligence, 10revscoring, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Implement model storage for enwiki-goodfaith inference service - https://phabricator.wikimedia.org/T282802 (10elukey) To keep archives happy: https://wikitech.wikimedia.org/wiki/User:Elukey/M...
[07:21:49] ah of course, I forgot something important
[07:21:57] 4) add prometheus metrics
[08:57:27] 10Lift-Wing, 10Machine-Learning-Team: Add prometheus metrics collection for Istio and Knative - https://phabricator.wikimedia.org/T289841 (10elukey)
[09:44:22] folks, I am going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/715209 in a bit to flip the redis celery settings to another rdb node
[09:44:28] since SRE needs to reboot the active one
[09:44:42] the procedure seems ok based on Alex's past experience, let's see how it goes
[09:45:11] +1'd
[09:45:20] <3
[09:54:55] I am currently running
[09:54:56] cumin -m async -b 1 -s 30 'A:ores-codfw' 'run-puppet-agent' 'depool' 'sleep 5' 'systemctl restart celery-ores-worker ; systemctl restart uwsgi-ores' 'sleep 5' 'pool'
[09:55:08] this should, in theory, move everything to the rdb replica
[09:55:31] after a chat with Alex we agreed that also flipping the rdb master/replica redis setting wasn't really needed
[09:55:53] the redis instance for the celery workers doesn't really need to be in sync
[09:56:12] Right.
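For readability, here is a commented restatement of the roll-restart one-liner above. The flag interpretations follow cumin's standard options (batch size, sleep between batches), so treat them as informed assumptions rather than authoritative documentation:

    # Run against the A:ores-codfw host alias, one host at a time (-b 1),
    # waiting 30s between hosts (-s 30), in async mode (-m async).
    cumin -m async -b 1 -s 30 'A:ores-codfw' \
        'run-puppet-agent' \
        'depool' \
        'sleep 5' \
        'systemctl restart celery-ores-worker ; systemctl restart uwsgi-ores' \
        'sleep 5' \
        'pool'
    # Per host: run puppet to pick up the merged redis change, depool it
    # from the load balancer, restart the celery worker and uwsgi so they
    # reconnect to the new rdb node, then repool it.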
[09:56:21] there is another one containing the cached scores, but if we lose some in the process it is not a big deal (I mean during failover/failback)
[09:56:25] now this is the theory
[09:56:35] in reality let's see how ORES behaves :D
[09:57:39] watching https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&refresh=1m&from=now-1h&to=now-1m
[10:06:51] roll restart finished, nothing exploding so far
[10:23:17] rdb2007 rebooted, doing the failback to it (so another ORES roll restart)
[10:37:07] failback completed
[10:37:11] nothing on fire afaics
[10:37:24] going afk for lunch in a bit, but I'll add docs afterwards
[10:38:33] elukey: https://i.imgur.com/e2qnRyd.png My lunch today :-P
[10:39:40] nice <3
[10:39:42] * elukey lunch
[13:31:56] expanded https://wikitech.wikimedia.org/wiki/ORES/Deployment#Restarting_Redis with what we did today
[14:10:26] https://thanos.wikimedia.org/graph?g0.expr=istio_requests_total&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[14:10:30] this is super nice
[14:10:43] istio ships with annotations for prometheus that are automatically picked up by our masters
[14:14:16] mmm the first request that hits a model after some inactivity (hours) usually takes ~30s
[14:14:43] then it's super fast, even with different revs
[14:28:00] 10Lift-Wing, 10Machine-Learning-Team: Add prometheus metrics collection for Istio and Knative - https://phabricator.wikimedia.org/T289841 (10elukey) As Joe pointed out in T287007#7224824, we are indeed already collecting Istio metrics!
[15:00:43] very first draft of the istio dashboard - https://grafana-rw.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-backend=All
[15:03:10] 10Lift-Wing, 10Machine-Learning-Team: Add prometheus metrics collection for Istio and Knative - https://phabricator.wikimedia.org/T289841 (10elukey) Started the Istio dashboard in https://grafana-rw.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-backend=enwiki-goodfait...
[15:10:37] Just catching up
[15:10:49] A dashboard!
[15:11:34] Nice lunch klausman
[15:11:48] I'm going to assume that was totally homemade
[15:11:58] Nah, pizzeria on the corner
[15:12:29] On the way back I noticed, to my dismay, that halfway between here and the pizzeria they'll soon open a donut shop
[15:12:43] Hahahaha
[15:13:22] I'm actually going to meet Olja and Leila in person today
[15:13:26] Big moment for me
[15:13:42] Ooh, do say hello from us :)
[15:14:11] Will do!
[15:15:39] How hard was the dashboard?
[15:15:48] I was wondering if that was going to be a big lift
[15:16:35] Grafana dashboards are typically not hard once you coax the data out of Prometheus.
[15:17:43] The query for that dashboard is sum(irate(istio_requests_total{destination_service_name=~"$backend"}[5m])) by (destination_service_name, response_code)
[15:17:50] which is fairly typical, complexity-wise
[15:18:35] chrisalbon: there are still a lot of metrics to display etc.. but the good part is that istio adds k8s annotations by default that work with our current infra (so metrics are automatically scraped etc..)
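The dashboard query quoted above can also be run ad hoc against Thanos, which exposes the Prometheus-compatible HTTP API. A hedged sketch: the /api/v1/query path is the stock Prometheus API endpoint, and the concrete backend regex standing in for Grafana's $backend variable is an assumption:

    # Per-backend request rate by response code over the last 5 minutes.
    curl -sG 'https://thanos.wikimedia.org/api/v1/query' \
        --data-urlencode 'query=sum(irate(istio_requests_total{destination_service_name=~"enwiki-goodfaith.*"}[5m])) by (destination_service_name, response_code)'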
[15:18:49] yeah, that is awesome
[15:18:52] knative seems different, so we'll have to dig into it
[15:19:06] no idea about kfserving
[15:19:26] but ideally we should have metrics from all the layers of the cake
[15:20:13] Yeah, half of me suspects KFServing will plug into analytics with one setting change, half of me thinks it will be a whole project to do it
[15:20:23] The early adopter curse
[15:21:11] one thing that I am still trying to figure out is how to get insights about performance at the various levels
[15:24:02] https://knative.dev/docs/admin/collecting-metrics/serving-metrics/metrics/ are very nice but we'll see what we have in 0.18
[15:24:27] and https://knative.dev/docs/admin/collecting-metrics looks more complicated than istio
[15:26:40] yeah, and it matters because there is a chance we just can't get performance to a point where we don't need a precache at the start
[15:27:17] which we'd only know if we could measure performance
[15:27:41] or I should say, easily monitor it
[15:32:09] for example, I am still getting the 30s request issue
[15:32:17] and it would be great to know where to look :D
[15:32:26] or better, get some hints
[15:50:46] yeah exactly
[16:18:04] The 30s issue smells as if knative is doing some queueing
[16:18:32] I thought it was the scale-to-zero functionality, but the kfserving pod never gets killed
[16:19:45] but then I can't explain this
[16:19:45] [I 210827 16:17:58 web:2243] 200 POST /v1/models/enwiki-goodfaith:predict (127.0.0.1) 31867.89ms
[16:19:49] [I 210827 16:18:35 web:2243] 200 POST /v1/models/enwiki-goodfaith:predict (127.0.0.1) 369.02ms
[16:19:52] [I 210827 16:18:44 web:2243] 200 POST /v1/models/enwiki-goodfaith:predict (127.0.0.1) 354.68ms
[16:20:01] this is the kfserving pod's log, which shows the 30+ seconds
[16:20:34] so it seems that the kfserving server itself takes the time
[16:40:13] I think I found it
[16:40:16] oh?
[16:40:32] when the 30s delay occurs, if I check netstat on the pod with nsenter I see
[16:40:35] tcp6 0 1 2620:0:861:300:df:59992 text-lb.eqiad.wik:https SYN_SENT 7630/python3
[16:40:58] so it is probably an ipv6 problem, and it takes 30s to time out
[16:41:16] oooohhh
[16:41:21] yeah
[16:43:27] https://phabricator.wikimedia.org/T289778 should be a good follow-up, no idea why ipv6 behaves in that way
[16:45:30] maybe the global network policies, atm we don't specify them and it is all open, wondering if ipv6 is not there for $reasons
[17:03:50] o/
[17:03:57] elukey: nice dashboard :)
[17:06:57] accraze: o/ it is horrible for the moment but thanks for the kindness :D
[17:09:03] TIL Python black formatter
[17:09:53] haha!
[17:10:07] cool, just saw your WIKI_HOST patch
[17:10:29] only an idea, feel free to dump it if not great
[17:11:50] it is definitely the right direction -- gonna take a closer look in a sec
[17:14:35] this approach should work for us with an internal endpoint and for community users using the external endpoints
[17:15:08] is there anything else we need to add in the header to use the internal endpoint?
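For reference, a sketch of the SYN_SENT check from the IPv6 diagnosis above; the exact flags and the placeholder PID are assumptions reconstructed from the "netstat on the pod with nsenter" description:

    # Enter the pod's network namespace via the host PID of one of its
    # processes, then look for TCP connections stuck in SYN_SENT.
    sudo nsenter -t <pod-pid> -n netstat -tnp | grep SYN_SENT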
[17:24:46] accraze: in theory only the Host header
[17:24:55] at least this is what I have to set for curl
[17:25:00] I can't think of any other ones
[17:26:09] I just realized that I can simplify the code, session=s could be either a requests session or None
[17:28:53] accraze: updated :)
[17:32:52] niiiice, +2'd
[17:33:58] accraze: ah, one thing is missing, the wmf-certificates package
[17:34:51] not sure if it needs to be passed to requests, sorry, I just got the error in a test
[17:37:03] yes, so the code works fine but I had to add
[17:37:06] export REQUESTS_CA_BUNDLE=/etc/ssl/certs/Puppet_Internal_CA.pem
[17:37:38] the wmf-certificates package should add the Puppet CA pem/crt in theory
[17:37:54] ohh i see, do we need to verify the cert in the session object?
[17:38:08] or should just setting that env var do it?
[17:39:14] in my code using REQUESTS_CA_BUNDLE worked, but it may not be super clean
[17:39:54] what we can do is add the ENV variable to the InferenceService specs
[17:40:00] what do you think?
[17:41:46] yeah, that sounds good!
[17:42:14] ok, sending another CR for wmf-certificates
[17:42:50] https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/715270
[17:59:07] ahhh shoot, looks like the editquality pipeline is having the same pip backtracking issues that draftquality and articlequality had
[17:59:55] will wait to see if the remaining builds finish, but we'll probably need to pin the deps for the editquality model-server
[18:00:12] ah snap :(
[18:05:29] accraze: anything that I can do to help? (otherwise I'll log off, but really sorry to leave with the dep mess)
[18:09:09] nah, nothing right now, i can fix it later today :)
[18:13:04] ack, thanks a lot, have a good weekend folks :)
[18:14:45] have a good one elukey!
[18:27:23] night elukey!
[20:40:17] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Fix articlequality production pipeline - https://phabricator.wikimedia.org/T289749 (10ACraze)
[20:40:21] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10ACraze)
[20:40:50] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Fix articlequality production pipeline - https://phabricator.wikimedia.org/T289749 (10ACraze) 05Open→03Resolved Pipeline is fixed, marking this as RESOLVED
[20:40:53] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10ACraze)
[21:08:47] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Fix editquality production pipeline - https://phabricator.wikimedia.org/T289886 (10ACraze)
[22:29:14] 10Lift-Wing, 10artificial-intelligence, 10revscoring, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Implement model storage for enwiki-goodfaith inference service - https://phabricator.wikimedia.org/T282802 (10ACraze) @elukey: I reviewed the docs and everything looks good re: rbac role/sec...
[22:46:31] 10Lift-Wing, 10artificial-intelligence, 10revscoring, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Implement model storage for enwiki-goodfaith inference service - https://phabricator.wikimedia.org/T282802 (10ACraze) 05Open→03Resolved
[22:46:34] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Find a way to store models for Kubeflow - https://phabricator.wikimedia.org/T280025 (10ACraze)
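Tying together the Host-header and CA-bundle discussion from 17:24-17:42 above, a hedged sketch of the manual test against the internal endpoint; the target wiki and query parameters are illustrative assumptions, while the Host header, endpoint, and CA path come from the log:

    # Query the internal read-only MW API; the Host header selects the wiki.
    curl --cacert /etc/ssl/certs/Puppet_Internal_CA.pem \
        -H 'Host: en.wikipedia.org' \
        'https://api-ro.discovery.wmnet/w/api.php?action=query&meta=siteinfo&format=json'

    # For the Python model server, the Puppet CA bundle was picked up via
    # the env var quoted verbatim in the log above:
    export REQUESTS_CA_BUNDLE=/etc/ssl/certs/Puppet_Internal_CA.pem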