[06:21:03] 10Machine-Learning-Team: revscoring model should be able to query an internal mw endpoint - https://phabricator.wikimedia.org/T289778 (10elukey) @ACraze wow nice! I checked and the endpoint to use should be `api-ro.discovery.wmnet` + `Host` header, the rest should be handled by the `mwapi` library in theory (so...
[06:36:54] 10Lift-Wing, 10Machine-Learning-Team: Add network policies to the ML k8s clusters - https://phabricator.wikimedia.org/T289834 (10elukey)
[06:41:46] 10Lift-Wing, 10Machine-Learning-Team: Create a LB service for inference.discovery.wmnet - https://phabricator.wikimedia.org/T289835 (10elukey)
[06:42:06] all right, created the remaining tasks for the infra --^
[06:42:37] 1) LB endpoint for inference.discovery.wmnet + TLS cert + secure port + etc..
[06:42:42] 2) PSP for k8s
[06:43:00] 3) GlobalNetworkPolicies for the ml clusters (to limit network traffic)
[07:18:08] 10Lift-Wing, 10artificial-intelligence, 10revscoring, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Implement model storage for enwiki-goodfaith inference service - https://phabricator.wikimedia.org/T282802 (10elukey) To keep archives happy: https://wikitech.wikimedia.org/wiki/User:Elukey/M...
[07:21:49] ah of course, I forgot something important
[07:21:57] 4) add prometheus metrics
[08:57:27] 10Lift-Wing, 10Machine-Learning-Team: Add prometheus metrics collection for Istio and Knative - https://phabricator.wikimedia.org/T289841 (10elukey)
[09:44:22] folks, I am going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/715209 in a bit to flip the redis celery settings to another rdb node
[09:44:28] since SRE needs to reboot the active one
[09:44:42] the procedure seems ok based on Alex's past experience, let's see how it goes
[09:45:11] +1'd
[09:45:20] <3
[09:54:55] I am currently running
[09:54:56] cumin -m async -b 1 -s 30 'A:ores-codfw' 'run-puppet-agent' 'depool' 'sleep 5' 'systemctl restart celery-ores-worker ; systemctl restart uwsgi-ores' 'sleep 5' 'pool'
[09:55:08] this should, in theory, move everything to the rdb replica
[09:55:31] after a chat with Alex we agreed that also flipping the rdb master/replica redis setting wasn't really needed
[09:55:53] the redis instance for the celery workers doesn't really need to be in sync
[09:56:12] Right.
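For readability, here is a commented restatement of the roll-restart one-liner above. The flag interpretations follow cumin's standard options (batch size, sleep between batches), so treat them as informed assumptions rather than authoritative documentation:

    # Run against the A:ores-codfw host alias, one host at a time (-b 1),
    # waiting 30s between hosts (-s 30), in async mode (-m async).
    cumin -m async -b 1 -s 30 'A:ores-codfw' \
        'run-puppet-agent' \
        'depool' \
        'sleep 5' \
        'systemctl restart celery-ores-worker ; systemctl restart uwsgi-ores' \
        'sleep 5' \
        'pool'
    # Per host: run puppet to pick up the merged redis change, depool it
    # from the load balancer, restart the celery worker and uwsgi so they
    # reconnect to the new rdb node, then repool it.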
[09:56:21] there is another one containing the cached scores, but if we lose some in the process it is not a big deal (I mean during failover/failback)
[09:56:25] now this is the theory
[09:56:35] in reality let's see how ORES behaves :D
[09:57:39] watching https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?orgId=1&refresh=1m&from=now-1h&to=now-1m
[10:06:51] roll restart finished, nothing exploding so far
[10:23:17] rdb2007 rebooted, doing the failback to it (so another ORES roll restart)
[10:37:07] failback completed
[10:37:11] nothing on fire afaics
[10:37:24] going afk for lunch in a bit, but I'll add docs afterwards
[10:38:33] elukey: https://i.imgur.com/e2qnRyd.png My lunch today :-P
[10:39:40] nice <3
[10:39:42] * elukey lunch
[13:31:56] expanded https://wikitech.wikimedia.org/wiki/ORES/Deployment#Restarting_Redis with what we did today
[14:10:26] https://thanos.wikimedia.org/graph?g0.expr=istio_requests_total&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[14:10:30] this is super nice
[14:10:43] istio ships with annotations for prometheus that are automatically picked up by our masters
[14:14:16] mmm the first request that hits a model after some inactivity (hours) usually takes ~30s
[14:14:43] then it's super fast, even with different revs
[14:28:00] 10Lift-Wing, 10Machine-Learning-Team: Add prometheus metrics collection for Istio and Knative - https://phabricator.wikimedia.org/T289841 (10elukey) As Joe pointed out in T287007#7224824, we are indeed already collecting Istio metrics!
[15:00:43] very first draft of the istio dashboard - https://grafana-rw.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-backend=All
[15:03:10] 10Lift-Wing, 10Machine-Learning-Team: Add prometheus metrics collection for Istio and Knative - https://phabricator.wikimedia.org/T289841 (10elukey) Started the Istio dashboard in https://grafana-rw.wikimedia.org/d/G7yj84Vnk/istio?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-backend=enwiki-goodfait...
[15:10:37] Just catching up
[15:10:49] A dashboard!
[15:11:34] Nice lunch klausman
[15:11:48] I'm going to assume that was totally homemade
[15:11:58] Nah, pizzeria on the corner
[15:12:29] On the way back I noticed, to my dismay, that halfway between here and the pizzeria they'll soon open a donut shop
[15:12:43] Hahahaha
[15:13:22] I'm actually going to meet Olja and Leila in person today
[15:13:26] Big moment for me
[15:13:42] Ooh, do say hello from us :)
[15:14:11] Will do!
[15:15:39] How hard was the dashboard?
[15:15:48] I was wondering if that was going to be a big lift
[15:16:35] Grafana dashboards are typically not hard once you coax the data out of Prometheus.
[15:17:43] The query for that dashboard is sum(irate(istio_requests_total{destination_service_name=~"$backend"}[5m])) by (destination_service_name, response_code)
[15:17:50] which is fairly typical, complexity-wise
[15:18:35] chrisalbon: there are still a lot of metrics to display etc.. but the good part is that istio adds k8s annotations by default that work with our current infra (so metrics are automatically scraped etc..)
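The dashboard query quoted above can also be run ad hoc against Thanos, which exposes the Prometheus-compatible HTTP API. A hedged sketch: the /api/v1/query path is the stock Prometheus API endpoint, and the concrete backend regex standing in for Grafana's $backend variable is an assumption:

    # Per-backend request rate by response code over the last 5 minutes.
    curl -sG 'https://thanos.wikimedia.org/api/v1/query' \
        --data-urlencode 'query=sum(irate(istio_requests_total{destination_service_name=~"enwiki-goodfaith.*"}[5m])) by (destination_service_name, response_code)'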
[15:18:49] yeah, that is awesome
[15:18:52] knative seems different, so we'll have to dig into it
[15:19:06] no idea about kfserving
[15:19:26] but ideally we should have metrics from all the layers of the cake
[15:20:13] Yeah, half of me suspects KFServing will plug into analytics with one setting change, half of me thinks it will be a whole project to do it
[15:20:23] The early adopter curse
[15:21:11] one thing that I am still trying to figure out is how to get insights about performance at the various levels
[15:24:02] https://knative.dev/docs/admin/collecting-metrics/serving-metrics/metrics/ are very nice but we'll see what we have in 0.18
[15:24:27] and https://knative.dev/docs/admin/collecting-metrics looks more complicated than istio
[15:26:40] yeah, and it matters because there is a chance we just can't get performance to a point where we don't need a precache at the start
[15:27:17] which we'd only know if we could measure performance
[15:27:41] or I should say, easily monitor it
[15:32:09] for example, I am still getting the 30s request issue
[15:32:17] and it would be great to know where to look :D
[15:32:26] or better, get some hints
[15:50:46] yeah exactly
[16:18:04] The 30s issue smells as if knative is doing some queueing
[16:18:32] I thought it was the scale-to-zero functionality, but the kfserving pod never gets killed
[16:19:45] but then I can't explain this
[16:19:45] [I 210827 16:17:58 web:2243] 200 POST /v1/models/enwiki-goodfaith:predict (127.0.0.1) 31867.89ms
[16:19:49] [I 210827 16:18:35 web:2243] 200 POST /v1/models/enwiki-goodfaith:predict (127.0.0.1) 369.02ms
[16:19:52] [I 210827 16:18:44 web:2243] 200 POST /v1/models/enwiki-goodfaith:predict (127.0.0.1) 354.68ms
[16:20:01] this is the kfserving pod's log, which shows the 30+ seconds
[16:20:34] so it seems that the kfserving server itself takes the time
[16:40:13] I think I found it
[16:40:16] oh?
[16:40:32] when the 30s delay occurs, if I check netstat on the pod with nsenter I see
[16:40:35] tcp6 0 1 2620:0:861:300:df:59992 text-lb.eqiad.wik:https SYN_SENT 7630/python3
[16:40:58] so it is probably an ipv6 problem, and it takes 30s to time out
[16:41:16] oooohhh
[16:41:21] yeah
[16:43:27] https://phabricator.wikimedia.org/T289778 should be a good follow-up, no idea why ipv6 behaves in that way
[16:45:30] maybe the global network policies, atm we don't specify them and it is all open, wondering if ipv6 is not there for $reasons
[17:03:50] o/
[17:03:57] elukey: nice dashboard :)
[17:06:57] accraze: o/ it is horrible for the moment but thanks for the kindness :D
[17:09:03] TIL Python black formatter
[17:09:53] haha!
[17:10:07] cool, just saw your WIKI_HOST patch
[17:10:29] only an idea, feel free to dump it if not great
[17:11:50] it is definitely the right direction -- gonna take a closer look in a sec
[17:14:35] this approach should work for us with an internal endpoint and for community users using the external endpoints
[17:15:08] is there anything else we need to add in the header to use the internal endpoint?
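For reference, a sketch of the SYN_SENT check from the IPv6 diagnosis above; the exact flags and the placeholder PID are assumptions reconstructed from the "netstat on the pod with nsenter" description:

    # Enter the pod's network namespace via the host PID of one of its
    # processes, then look for TCP connections stuck in SYN_SENT.
    sudo nsenter -t <pod-pid> -n netstat -tnp | grep SYN_SENT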
[17:24:46] accraze: in theory only the Host header
[17:24:55] at least this is what I have to set for curl
[17:25:00] I can't think of any other ones
[17:26:09] I just realized that I can simplify the code, session=s could be either a requests session or None
[17:28:53] accraze: updated :)
[17:32:52] niiiice, +2'd
[17:33:58] accraze: ah, one thing is missing, the wmf-certificates package
[17:34:51] not sure if it needs to be passed to requests, sorry, I just got the error in a test
[17:37:03] yes, so the code works fine but I had to add
[17:37:06] export REQUESTS_CA_BUNDLE=/etc/ssl/certs/Puppet_Internal_CA.pem
[17:37:38] the wmf-certificates package should add the Puppet CA pem/crt in theory
[17:37:54] ohh i see, do we need to verify the cert in the session object?
[17:38:08] or should just setting that env var do it?
[17:39:14] in my code using REQUESTS_CA_BUNDLE worked, but it may not be super clean
[17:39:54] what we can do is add the ENV variable to the InferenceService specs
[17:40:00] what do you think?
[17:41:46] yeah, that sounds good!
[17:42:14] ok, sending another CR for wmf-certificates
[17:42:50] https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/715270
[17:59:07] ahhh shoot, looks like the editquality pipeline is having the same pip backtracking issues that draftquality and articlequality had
[17:59:55] will wait to see if the remaining builds finish, but we'll probably need to pin the deps for the editquality model-server
[18:00:12] ah snap :(
[18:05:29] accraze: anything that I can do to help? (otherwise I'll log off, but really sorry to leave with the dep mess)
[18:09:09] nah, nothing right now, i can fix it later today :)
[18:13:04] ack, thanks a lot, have a good weekend folks :)
[18:14:45] have a good one elukey!
[18:27:23] night elukey!
[20:40:17] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Fix articlequality production pipeline - https://phabricator.wikimedia.org/T289749 (10ACraze)
[20:40:21] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10ACraze)
[20:40:50] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Fix articlequality production pipeline - https://phabricator.wikimedia.org/T289749 (10ACraze) 05Open→03Resolved Pipeline is fixed, marking this as RESOLVED
[20:40:53] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Production images for ORES/revscoring models - https://phabricator.wikimedia.org/T279004 (10ACraze)
[21:08:47] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Fix editquality production pipeline - https://phabricator.wikimedia.org/T289886 (10ACraze)
[22:29:14] 10Lift-Wing, 10artificial-intelligence, 10revscoring, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Implement model storage for enwiki-goodfaith inference service - https://phabricator.wikimedia.org/T282802 (10ACraze) @elukey: I reviewed the docs and everything looks good re: rbac role/sec...
[22:46:31] 10Lift-Wing, 10artificial-intelligence, 10revscoring, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: Implement model storage for enwiki-goodfaith inference service - https://phabricator.wikimedia.org/T282802 (10ACraze) 05Open→03Resolved
[22:46:34] 10Lift-Wing, 10Machine-Learning-Team (Active Tasks): Find a way to store models for Kubeflow - https://phabricator.wikimedia.org/T280025 (10ACraze)
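Tying together the Host-header and CA-bundle discussion from 17:24-17:42 above, a hedged sketch of the manual test against the internal endpoint; the target wiki and query parameters are illustrative assumptions, while the Host header, endpoint, and CA path come from the log:

    # Query the internal read-only MW API; the Host header selects the wiki.
    curl --cacert /etc/ssl/certs/Puppet_Internal_CA.pem \
        -H 'Host: en.wikipedia.org' \
        'https://api-ro.discovery.wmnet/w/api.php?action=query&meta=siteinfo&format=json'

    # For the Python model server, the Puppet CA bundle was picked up via
    # the env var quoted verbatim in the log above:
    export REQUESTS_CA_BUNDLE=/etc/ssl/certs/Puppet_Internal_CA.pem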