[09:36:01] Machine-Learning-Team: Support building and running of articletopic-outlink model-server via Makefile - https://phabricator.wikimedia.org/T360177#9656801 (kevinbazira) Open→Resolved Support for building and running the articletopic-outlink model-server using the `Makefile` was added and it...
[09:44:08] morning o/
[10:34:15] (ORESFetchScoreJobKafkaLag) firing: Kafka consumer lag for ORESFetchScoreJob over threshold for past 1h. ...
[10:34:20] - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#Kafka_Consumer_lag_-_ORESFetchScoreJobKafkaLag_alert - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&orgId=1&to=now&var-cluster=main-eqiad&var-consumer_group=cpjobqueue-ORESFetchScoreJob&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DORESFetchScoreJobKafkaLag
[10:53:09] Morning everyone
[11:01:48] Looking at that ORES alert. Not sure what's going on except a spike in latency
[11:03:35] Hmm, claime mentioned that mw staging and kafka are having an issue.
[11:39:15] (ORESFetchScoreJobKafkaLag) resolved: Kafka consumer lag for ORESFetchScoreJob over threshold for past 1h. ...
[11:39:15] - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#Kafka_Consumer_lag_-_ORESFetchScoreJobKafkaLag_alert - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&orgId=1&to=now&var-cluster=main-eqiad&var-consumer_group=cpjobqueue-ORESFetchScoreJob&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DORESFetchScoreJobKafkaLag
[11:40:33] \o/
[11:50:36] * klausman lunch
[12:59:39] * aiko afk ~30m
[13:06:51] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446#9657533 (Papaul) @klausman hello please see @Jhancock.wm comment above. Thank you.
[13:08:09] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446#9657536 (klausman) >>! In T360446#9649946, @Jhancock.wm wrote: > Found the drive as absent in iDRAC. Physically, the drive is there...
[13:25:09] hello folks!
[13:29:35] klausman: o/ let's try to investigate what happened :)
[13:29:48] yespls. I was completely lost
[13:30:15] I got to where I saw that the revscoring services for e.g. wikidata had high latency, but couldn't figure out why
[13:37:53] klausman: how did you increase workers on cp?
[13:38:19] Yes, 30->60, have since reverted it
[13:39:51] so afaics from https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-6h&orgId=1&to=now&var-cluster=main-eqiad&var-consumer_group=cpjobqueue-ORESFetchScoreJob&var-datasource=eqiad%20prometheus%2Fops&viewPanel=1
[13:40:37] there was an increase in lag for the main job, but not a corresponding increase (or, not that heavy) in retries
[13:40:54] in case cp gets a 50x or similar from Lift Wing the job is added to the retry topic
[13:41:12] so the fact that we didn't see a huge jump in there may be an indication that cp itself was slow
[13:41:54] There were a ton of requests for the wikidata services, so it might just be that a lot of revisions happened
[13:42:12] visible here for example: https://grafana.wikimedia.org/goto/ialwLCJSk?orgId=1
[13:43:33] The request volume is still high-ish, but not as high as it was at 09:25 UTC, and the latency is not high anymore
[13:44:25] from https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-damaging&var-component=All&var-model_name=wikidatawiki-damaging&from=now-6h&to=now wikidata-damaging preprocess went up
[13:46:28] and enwiki seems fine
[13:48:38] but in logstash (istio gw dashboard) I don't see a big and sustained jump in requests
[13:48:54] for the revscoring service?
[13:49:25] for all the traffic, nothing seems visible
[13:50:09] also for wikidata damaging we have maxReplicas 10
[13:50:26] The dashboard I linked shows it, using istio_requests_total{destination_workload_namespace=~"revscoring-editquality-damaging", destination_canonical_service=~"enwiki_damaging_predictor_default"}
[13:50:41] er, s/enwiki/wikidata/
[13:51:20] If the preprocessing step was slow, that probably means a mediawiki-api call was slow?
[13:51:41] I am not sure the revscoring services do anything in preprocess that isn't just a fetch from there.
[13:52:08] there is also some cpu-bound code that they run (sometimes) for feature extraction
[13:52:59] okok I see the bump, but I had never seen that dashboard
[13:53:52] It was an old draft I made before we started actual SLO work
[13:53:59] ah no ok the pods for wikidata went up
[13:53:59] https://grafana.wikimedia.org/d/c6GYmqdnz/knative-serving?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-knative_namespace=knative-serving&var-revisions_namespace=revscoring-editquality-damaging&viewPanel=25
[13:57:26] yeah, which makes sense, given the increase in rps
[13:59:09] hi Luca!
[13:59:20] o/
[14:02:46] klausman: so far I didn't see any sign that we failed requests, is that also consistent with what you are seeing?
[14:02:55] yes, agreed
[14:03:24] The # of non-200s seems unaffected in the relevant time period
[14:03:24] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446#9657681 (Jhancock.wm) reseating the drive did not fix the issue. server is in warranty. created a ticket with Dell to get it replace...
[14:09:06] ok so I checked the kserve logs in logstash
[14:09:10] and I see stuff like
[14:09:52] (sorry misread some logs, gimme a sec)
[14:10:33] preprocess_ms: 1303.630828857,
[14:10:51] Function get_revscoring_extractor_cache took 1.2865 seconds to execute.
[14:11:44] I think that we should open a task to investigate more
[14:12:26] I can do that
[14:12:30] super thanks
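As an aside on the "Function get_revscoring_extractor_cache took 1.2865 seconds to execute." line quoted above: that style of message usually comes from a small timing wrapper around the preprocess helpers. The sketch below only illustrates the pattern and is not the actual Lift Wing code; the decorator name (timed) and the sleep-based body are invented, and only the function name is taken from the log.

    import functools
    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    def timed(func):
        # Log how long the wrapped function took, in the same style as the
        # "Function X took N seconds to execute." lines seen in logstash.
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                logger.info("Function %s took %.4f seconds to execute.", func.__name__, elapsed)
        return wrapper

    @timed
    def get_revscoring_extractor_cache():
        # Placeholder body: the real preprocess step fetches revision data from the MediaWiki API.
        time.sleep(1.3)

    if __name__ == "__main__":
        get_revscoring_extractor_cache()

Per-function timings like this, next to the aggregate preprocess_ms, are the kind of data the new investigation task can use to tell a slow MediaWiki API call apart from slow feature extraction.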
[14:12:40] aiko: o/ any luck with the rr-ml image?
[14:19:56] elukey: o/ yeah I built the rr-ml image using the pytorch-amd base image
[14:20:10] here are the layer sizes https://phabricator.wikimedia.org/P58904
[14:21:35] aiko: looks good! To double check: pytorch was removed from requirements.txt, and the model server worked fine?
[14:22:42] Machine-Learning-Team: Investigate temporary high latency in revscoring service for wikidata - https://phabricator.wikimedia.org/T360894 (klausman) NEW
[14:25:11] I'm trying to run the image to see if the model server works fine
[14:25:18] ahh okok super
[14:25:22] lemme know if I can help
[14:25:29] yes pytorch was removed
[14:52:09] Machine-Learning-Team, serviceops, Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9657834 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=78701a88-bd13-4896-9ad1-88076e82347e) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and...
[14:52:30] Machine-Learning-Team, serviceops, Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9657836 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9cabb1e2-3230-40ba-8e89-bce14ddf9042) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and...
[15:13:13] elukey: hmm it's been 12 mins since I ran the 'docker run' but the model server is still not running and there are no logs
[15:13:43] aiko: looks promising :D
[15:14:19] why? lol
[15:14:29] I was joking :D
[15:14:40] hahaha
[15:14:57] it is weird, since if it doesn't work I'd expect some python failure
[15:15:09] is the container running though?
[15:15:16] IIRC with docker container ls you should see it
[15:15:17] I tried entering the image with docker run --rm -it --entrypoint /bin/bash and ran python3 model_server/model.py
[15:15:25] I got Illegal instruction
[15:15:34] yes the container is running
[15:17:31] do you have a stacktrace for the illegal instruction?
[15:18:17] no, only one line "Illegal instruction"
[15:18:57] I'll try to replicate the same setup on my laptop so I can check
[15:19:34] Mm the "Illegal instruction (core dumped)" error typically indicates that there's an instruction being used by the program that's not supported by your CPU
[15:20:03] ok!
[15:20:50] aiko: one qs - how did you remove pytorch from requirements?
[15:21:00] IIUC it is installed via KI?
[15:21:37] yes so I used this commit https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/commit/068887a4bfa8f77bbac4cda03d5110b8e43d5f38
[15:22:44] ack!
[15:28:45] aiko: did you get the following while building?
[15:28:45] #24 68.05 aiohttp/_websocket.c:198:12: fatal error: longintrepr.h: No such file or directory
[15:28:48] #24 68.05 198 | #include "longintrepr.h"
[15:28:59] this is due to the aiohttp version used, which is not compatible with python 3.11 IIRC
[15:29:14] (we are implicitly bumping to bookworm and py3.11 with the base image)
[15:32:53] ahhh yeah I got the same error
[15:33:24] I think it is a kserve -> fastapi -> aiohttp dep IIRC
[15:33:47] or something from KI?
[15:36:03] you added the python/requirements.txt to the blubber, right?
[15:38:34] I got that error after I added python/requirements.txt to blubber
[15:38:57] ahhh it comes from the python requirements!
[15:39:21] yes yes super
[15:39:24] I'll bump to 3.9
[15:40:15] klausman: the new prometheus job for istio is up!
[15:40:19] seems to be working fine atm
[15:40:50] excellent. how long do you think we'll have to wait for the new metrics to be testable?
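Back to the "Illegal instruction" error from the docker run above: it typically means the interpreter or a compiled wheel (torch and its dependencies are the usual suspects) was built for CPU features that the host, or the emulation layer Docker uses on an Apple-silicon Mac, does not expose. A quick check from a shell inside the running amd64 container (so /proc/cpuinfo is available) could look like the sketch below; the set of wanted flags is an assumption about what the wheels were built with, not something confirmed from the image.

    # Minimal sketch: list the CPU feature flags visible inside the container and
    # compare them with the SIMD extensions optimized wheels commonly assume.
    # Assumption: an amd64 Linux container, so /proc/cpuinfo exists and uses "flags".
    wanted = {"sse4_2", "avx", "avx2", "fma"}

    flags = set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())
                break

    print("missing:", sorted(wanted - flags) or "none")

If avx or avx2 turns out to be missing only on the machine where the crash happens, that would match the symptom of the same image starting fine in one place and dying with "Illegal instruction" in another.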
[15:43:08] Machine-Learning-Team, Wikimedia Enterprise, Data-Engineering (Sprint 9), Epic, Event-Platform: [Event Platform] Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792#9658089 (lbowmaker)
[15:43:12] aiko: so I am able to see logs for the model server's startup, but then I get an error since I don't have the model (will test it in a bit)
[15:43:36] klausman: I think that we may need to do another run of label pruning, check something like
[15:43:39] istio_requests_total{job="k8s-pods-istio", destination_service_namespace="revscoring-articlequality"}
[15:43:52] there are a lot of knative labels with uuids
[15:44:40] elukey: oooh nice
[15:45:52] Yeah, I think the UIDs we can probably drop, as well as e.g. the pod_template_hash
[15:46:14] why doesn't it work on my end :( is it a mac issue?
[15:53:47] aiko: very weird, not sure!
[15:54:02] I am fixing an issue but I'll test the whole thing with the model binary
[16:08:23] klausman: the main issue is that if I isolate a single UUID, there are multiple time series associated with it.. so not sure what happens if we drop all of them
[16:08:37] do we get a single time series with everything aggregated?
[16:10:06] ah, good point.
[16:11:39] I think serving_knative_dev_configurationUID can be dropped since the Generation label should distinguish them
[16:11:55] Similar for revision, I suspect.
[16:12:16] serviceUID, I don't know. Is that the only thing that would distinguish replicas?
[16:13:18] not sure, I've never looked that deeply into the knative metrics.. I am wondering if there is a way to disable them from the k8s pods
[16:17:18] thing is, it's a label in an istio metric. Not sure whose responsibility it would be
[16:19:48] knative configures istio basically, so in theory there should be a way to tune it from knative
[16:21:41] I've been rummaging in the knative docs, but I haven't found anything useful yet
[16:23:31] klausman: one good thing that I realized is that those uids are only in metrics that the sidecars publish
[16:23:43] meanwhile we use the istio gateway ones for the SLIs/rec-rules
[16:23:45] like
[16:23:46] istio_requests_total{kubernetes_namespace="istio-system", job="k8s-pods-istio", destination_service_namespace="revscoring-articlequality"}
[16:23:57] in here I don't see knative stuff
[16:24:02] that is good :)
[16:24:06] yes!
[16:24:11] I mean half good
[16:24:23] the sidecar metrics will be a problem
[16:32:06] I am a bit surprised that we seem to be the only ones with these problems
[16:32:38] everybody else is *cloud native*
[16:32:53] (it is a half joke :D)
[16:32:59] Hrm. :-/
[16:42:18] Machine-Learning-Team, observability, Patch-For-Review: Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390#9658443 (elukey) With https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014035/ we changed how labels are collected on the Prometheus nodes. We now have...
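On the question above of whether dropping the UID labels would just collapse everything into one aggregated series: before touching any relabel config, one way to get a feel for it is to ask Prometheus how many series exist with and without those labels aggregated away. A rough sketch against the standard Prometheus HTTP API follows; the URL is a placeholder, serving_knative_dev_configurationUID and pod_template_hash come from the discussion above, and the revisionUID/serviceUID label names are guesses that may need adjusting to the real label set.

    import requests

    PROM = "http://localhost:9090"  # placeholder, not the real Prometheus endpoint
    SELECTOR = '{job="k8s-pods-istio", destination_service_namespace="revscoring-articlequality"}'
    # Only serving_knative_dev_configurationUID and pod_template_hash are taken from
    # the discussion; the other two names are assumed and may differ.
    UID_LABELS = ("serving_knative_dev_configurationUID, serving_knative_dev_revisionUID, "
                  "serving_knative_dev_serviceUID, pod_template_hash")

    queries = {
        # number of raw series currently scraped from the sidecars
        "raw series": "count(istio_requests_total" + SELECTOR + ")",
        # number of series left if the UID-style labels were aggregated away
        "series without UID labels": "count(sum without(" + UID_LABELS + ") (istio_requests_total" + SELECTOR + "))",
    }

    for name, query in queries.items():
        resp = requests.get(PROM + "/api/v1/query", params={"query": query}, timeout=10)
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        print(name, "->", result[0]["value"][1] if result else "0")

A second count well below the first but still above one would mean the remaining labels (namespace, destination service, response code, and so on) still keep distinct series apart, i.e. dropping the UIDs would merge replicas rather than everything.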
[16:54:16] aiko: so finally back on testing :)
[16:54:22] I was wrong, the error that I see is
[16:54:23] ModuleNotFoundError: No module named 'knowledge_integrity.mediawiki'
[17:02:42] also one thing that I have realized is that we copy python's site-packages to /opt/lib/python/site-packages
[17:02:53] meanwhile in the base image I used /usr/lib/etc..
[17:05:29] anyway, I don't see mediawiki.py under /opt/lib/python/site-packages/knowledge_integrity/
[17:05:50] and it is not in https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/tree/rr-multilingual-gpu/knowledge_integrity?ref_type=heads
[17:09:46] ahh sorry I need to change the code
[17:09:59] 'knowledge_integrity.mediawiki' is from KI v0.6
[17:10:44] In that branch I hadn't rebased to the new KI version
[17:12:37] yes yes I figured :)
[17:12:42] let's restart tomorrow morning!
[17:14:23] ok :)
[17:15:21] see you tomorrow folks!
[17:16:06] have a nice evening Luca
[17:16:13] you too!
[17:18:32] \o
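For reference, on the ModuleNotFoundError above: since knowledge_integrity.mediawiki is only there from KI v0.6 (per the log), a quick sanity check inside the image is to print the installed version and whether the submodule is importable. A minimal sketch, assuming the distribution is installed under the name knowledge_integrity:

    import importlib.metadata
    import importlib.util

    # Which knowledge_integrity version actually ended up in the image?
    # (distribution name assumed to match the import name)
    print("installed version:", importlib.metadata.version("knowledge_integrity"))

    # Is the module the model server imports actually present?
    spec = importlib.util.find_spec("knowledge_integrity.mediawiki")
    print("knowledge_integrity.mediawiki importable:", spec is not None)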