[09:36:01] Machine-Learning-Team: Support building and running of articletopic-outlink model-server via Makefile - https://phabricator.wikimedia.org/T360177#9656801 (kevinbazira) Open→Resolved Support for building and running the articletopic-outlink model-server using the `Makefile` was added and it...
[09:44:08] morning o/
[10:34:15] (ORESFetchScoreJobKafkaLag) firing: Kafka consumer lag for ORESFetchScoreJob over threshold for past 1h. ...
[10:34:20] - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#Kafka_Consumer_lag_-_ORESFetchScoreJobKafkaLag_alert - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&orgId=1&to=now&var-cluster=main-eqiad&var-consumer_group=cpjobqueue-ORESFetchScoreJob&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DORESFetchScoreJobKafkaLag
[10:53:09] Morning everyone
[11:01:48] Looking at that ORES alert. Not sure what's going on except a spike in latency
[11:03:35] Hmm, claime mentioned that mw staging and kafka are having an issue.
[11:39:15] (ORESFetchScoreJobKafkaLag) resolved: Kafka consumer lag for ORESFetchScoreJob over threshold for past 1h. ...
[11:39:15] - https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/Alerts#Kafka_Consumer_lag_-_ORESFetchScoreJobKafkaLag_alert - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&orgId=1&to=now&var-cluster=main-eqiad&var-consumer_group=cpjobqueue-ORESFetchScoreJob&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DORESFetchScoreJobKafkaLag
[11:40:33] \o/
[11:50:36] * klausman lunch
[12:59:39] * aiko afk ~30m
[13:06:51] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446#9657533 (Papaul) @klausman hello please see @Jhancock.wm comment above. Thank you.
[13:08:09] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446#9657536 (klausman) >>! In T360446#9649946, @Jhancock.wm wrote: > Found the drive as absent in iDRAC. Physically, the drive is there...
[13:25:09] hello folks!
[13:29:35] klausman: o/ let's try to investigate what happened :)
[13:29:48] yespls. I was completely lost
[13:30:15] I got to where I saw that the revscoring services for e.g. wikidata had high latency, but couldn't figure out why
[13:37:53] klausman: how did you increase workers on cp?
[13:38:19] Yes, 30->60, have since reverted it
[13:39:51] so afaics from https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-6h&orgId=1&to=now&var-cluster=main-eqiad&var-consumer_group=cpjobqueue-ORESFetchScoreJob&var-datasource=eqiad%20prometheus%2Fops&viewPanel=1
[13:40:37] there was an increase in lag for the main job, but not a corresponding increase (or, not that heavy) in retries
[13:40:54] in case cp gets a 50x or similar from Lift Wing the job is added to the retry topic
[13:41:12] so the fact that we didn't see a huge jump in there may be an indication that cp itself was slow
[13:41:54] There were a ton of requests for the wikidata services, so it might just be that a lot of revisions happened
[13:42:12] visible here for example: https://grafana.wikimedia.org/goto/ialwLCJSk?orgId=1
[13:43:33] The request volume is still high-ish, but not as high as it was at 09:25 UTC, and the latency is not high anymore
[13:44:25] from https://grafana.wikimedia.org/d/n3LJdTGIk/kserve-inference-services?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-namespace=revscoring-editquality-damaging&var-component=All&var-model_name=wikidatawiki-damaging&from=now-6h&to=now wikidata-damaging preprocess went up
[13:46:28] and enwiki seems fine
[13:48:38] but in logstash (istio gw dashboard) I don't see a big and sustained jump in requests
[13:48:54] for the revscoring service?
[13:49:25] for all the traffic, nothing seems visible
[13:50:09] also for wikidata damaging we have maxReplicas 10
[13:50:26] The dashboard I linked shows it, using istio_requests_total{destination_workload_namespace=~"revscoring-editquality-damaging", destination_canonical_service=~"enwiki_damaging_predictor_default"}
[13:50:41] er, s/enwiki/wikidata/
[13:51:20] If the preprocessing step was slow, that probably means a mediawiki-api call was slow?
[13:51:41] I am not sure the revscoring services do anything in preprocess that isn't just a fetch from there.
[13:52:08] there is also some cpu-bound code that they run (sometimes) for feature extraction
[13:52:59] okok I see the bump, but I had never seen that dashboard
[13:53:52] It was an old draft I made before we started actual SLO work
[13:53:59] ah no ok the pods for wikidata went up
[13:53:59] https://grafana.wikimedia.org/d/c6GYmqdnz/knative-serving?orgId=1&var-cluster=eqiad%20prometheus%2Fk8s-mlserve&var-knative_namespace=knative-serving&var-revisions_namespace=revscoring-editquality-damaging&viewPanel=25
[13:57:26] yeah, which makes sense, given the increase in rps
[13:59:09] hi Luca!
[13:59:20] o/
[14:02:46] klausman: so far I didn't see any sign that we failed requests, is that also consistent with what you are seeing?
[14:02:55] yes, agreed
[14:03:24] The # of non-200s seems unaffected in the relevant time period
[14:03:24] Machine-Learning-Team, DC-Ops, ops-codfw, SRE: hw troubleshooting: failed disk for ml-serve2008.codfw.wmnet (not urgent) - https://phabricator.wikimedia.org/T360446#9657681 (Jhancock.wm) reseating the drive did not fix the issue. server is in warranty. created a ticket with Dell to get it replace...
[14:09:06] ok so I checked the kserve logs in logstash
[14:09:10] and I see stuff like
[14:09:52] (sorry misread some logs, gimme a sec)
[14:10:33] preprocess_ms: 1303.630828857,
[14:10:51] Function get_revscoring_extractor_cache took 1.2865 seconds to execute.
[14:11:44] I think that we should open a task to investigate more
[14:12:26] I can do that
[14:12:30] super thanks
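As an aside on the "Function get_revscoring_extractor_cache took 1.2865 seconds to execute." line quoted above: that style of message usually comes from a small timing wrapper around the preprocess helpers. The sketch below only illustrates the pattern and is not the actual Lift Wing code; the decorator name (timed) and the sleep-based body are invented, and only the function name is taken from the log.

    import functools
    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    def timed(func):
        # Log how long the wrapped function took, in the same style as the
        # "Function X took N seconds to execute." lines seen in logstash.
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                logger.info("Function %s took %.4f seconds to execute.", func.__name__, elapsed)
        return wrapper

    @timed
    def get_revscoring_extractor_cache():
        # Placeholder body: the real preprocess step fetches revision data from the MediaWiki API.
        time.sleep(1.3)

    if __name__ == "__main__":
        get_revscoring_extractor_cache()

Per-function timings like this, next to the aggregate preprocess_ms, are the kind of data the new investigation task can use to tell a slow MediaWiki API call apart from slow feature extraction.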
[14:12:40] aiko: o/ any luck with the rr-ml image?
[14:19:56] elukey: o/ yeah I built the rr-ml image using the pytorch-amd base image
[14:20:10] here are the layer sizes https://phabricator.wikimedia.org/P58904
[14:21:35] aiko: looks good! To double check: pytorch was removed from requirements.txt, and the model server worked fine?
[14:22:42] Machine-Learning-Team: Investigate temporary high latency in revscoring service for wikidata - https://phabricator.wikimedia.org/T360894 (klausman) NEW
[14:25:11] I'm trying to run the image to see if the model server works fine
[14:25:18] ahh okok super
[14:25:22] lemme know if I can help
[14:25:29] yes pytorch was removed
[14:52:09] Machine-Learning-Team, serviceops, Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9657834 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=78701a88-bd13-4896-9ad1-88076e82347e) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and...
[14:52:30] Machine-Learning-Team, serviceops, Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9657836 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9cabb1e2-3230-40ba-8e89-bce14ddf9042) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and...
[15:13:13] elukey: hmm it's been 12 mins since I ran the 'docker run' but the model server is still not running and there are no logs
[15:13:43] aiko: looks promising :D
[15:14:19] why? lol
[15:14:29] I was joking :D
[15:14:40] hahaha
[15:14:57] it is weird, since if it doesn't work I'd expect some python failure
[15:15:09] is the container running though?
[15:15:16] IIRC with docker container ls you should see it
[15:15:17] I tried entering the image with docker run --rm -it --entrypoint /bin/bash and ran python3 model_server/model.py
[15:15:25] I got Illegal instruction
[15:15:34] yes the container is running
[15:17:31] do you have a stacktrace for the illegal instruction?
[15:18:17] no, only one line "Illegal instruction"
[15:18:57] I'll try to replicate the same setup on my laptop so I can check
[15:19:34] Mm the "Illegal instruction (core dumped)" error typically indicates that there's an instruction being used by the program that's not supported by your CPU
[15:20:03] ok!
[15:20:50] aiko: one qs - how did you remove pytorch from requirements?
[15:21:00] IIUC it is installed via KI?
[15:21:37] yes so I used this commit https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/commit/068887a4bfa8f77bbac4cda03d5110b8e43d5f38
[15:22:44] ack!
[15:28:45] aiko: did you get the following while building?
[15:28:45] #24 68.05 aiohttp/_websocket.c:198:12: fatal error: longintrepr.h: No such file or directory
[15:28:48] #24 68.05 198 | #include "longintrepr.h"
[15:28:59] this is due to the aiohttp version used, which is not compatible with python 3.11 IIRC
[15:29:14] (we are implicitly bumping to bookworm and py3.11 with the base image)
[15:32:53] ahhh yeah I got the same error
[15:33:24] I think it is a kserve -> fastapi -> aiohttp dep IIRC
[15:33:47] or something from KI?
[15:36:03] you added the python/requirements.txt to the blubber, right?
[15:38:34] I got that error after I added python/requirements.txt to blubber
[15:38:57] ahhh it comes from the python requirements!
[15:39:21] yes yes super
[15:39:24] I'll bump to 3.9
[15:40:15] klausman: the new prometheus job for istio is up!
[15:40:19] seems to be working fine atm
[15:40:50] excellent. how long do you think we'll have to wait for the new metrics to be testable?
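Back to the "Illegal instruction" error from the docker run above: it typically means the interpreter or a compiled wheel (torch and its dependencies are the usual suspects) was built for CPU features that the host, or the emulation layer Docker uses on an Apple-silicon Mac, does not expose. A quick check from a shell inside the running amd64 container (so /proc/cpuinfo is available) could look like the sketch below; the set of wanted flags is an assumption about what the wheels were built with, not something confirmed from the image.

    # Minimal sketch: list the CPU feature flags visible inside the container and
    # compare them with the SIMD extensions optimized wheels commonly assume.
    # Assumption: an amd64 Linux container, so /proc/cpuinfo exists and uses "flags".
    wanted = {"sse4_2", "avx", "avx2", "fma"}

    flags = set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())
                break

    print("missing:", sorted(wanted - flags) or "none")

If avx or avx2 turns out to be missing only on the machine where the crash happens, that would match the symptom of the same image starting fine in one place and dying with "Illegal instruction" in another.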
[15:43:08] Machine-Learning-Team, Wikimedia Enterprise, Data-Engineering (Sprint 9), Epic, Event-Platform: [Event Platform] Implement PoC Event-Driven Data Pipeline for Revert Risk Model Scores using Event Platform Capabilities - https://phabricator.wikimedia.org/T338792#9658089 (lbowmaker)
[15:43:12] aiko: so I am able to see logs for the model server's startup, but then I get an error since I don't have the model (will test it in a bit)
[15:43:36] klausman: I think that we may need to do another run of label pruning, check something like
[15:43:39] istio_requests_total{job="k8s-pods-istio", destination_service_namespace="revscoring-articlequality"}
[15:43:52] there are a lot of knative labels with uuids
[15:44:40] elukey: oooh nice
[15:45:52] Yeah, I think the UIDs we can probably drop, as well as e.g. the pod_template_hash
[15:46:14] why doesn't it work on my end :( is it a mac issue?
[15:53:47] aiko: very weird, not sure!
[15:54:02] I am fixing an issue but I'll test the whole thing with the model binary
[16:08:23] klausman: the main issue is that if I isolate a single UUID, there are multiple time series associated with it.. so not sure what happens if we drop all of them
[16:08:37] do we get a single time series with everything aggregated?
[16:10:06] ah, good point.
[16:11:39] I think serving_knative_dev_configurationUID can be dropped since the Generation label should distinguish them
[16:11:55] Similar for revision, I suspect.
[16:12:16] serviceUID, I don't know. Is that the only thing that would distinguish replicas?
[16:13:18] not sure, I've never looked that deeply into the knative metrics.. I am wondering if there is a way to disable them from the k8s pods
[16:17:18] thing is, it's a label in an istio metric. Not sure whose responsibility it would be
[16:19:48] knative configures istio basically, so in theory there should be a way to tune it from knative
[16:21:41] I've been rummaging in the knative docs, but I haven't found anything useful yet
[16:23:31] klausman: one good thing that I realized is that those uids are only in metrics that the sidecars publish
[16:23:43] meanwhile we use the istio gateway ones for the SLIs/rec-rules
[16:23:45] like
[16:23:46] istio_requests_total{kubernetes_namespace="istio-system", job="k8s-pods-istio", destination_service_namespace="revscoring-articlequality"}
[16:23:57] in here I don't see knative stuff
[16:24:02] that is good :)
[16:24:06] yes!
[16:24:11] I mean half good
[16:24:23] the sidecar metrics will be a problem
[16:32:06] I am a bit surprised that we seem to be the only ones with these problems
[16:32:38] everybody else is *cloud native*
[16:32:53] (it is a half joke :D)
[16:32:59] Hrm. :-/
[16:42:18] Machine-Learning-Team, observability, Patch-For-Review: Istio recording rules for Pyrra and Grizzly - https://phabricator.wikimedia.org/T351390#9658443 (elukey) With https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014035/ we changed how labels are collected on the Prometheus nodes. We now have...
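On the question above of whether dropping the UID labels would just collapse everything into one aggregated series: before touching any relabel config, one way to get a feel for it is to ask Prometheus how many series exist with and without those labels aggregated away. A rough sketch against the standard Prometheus HTTP API follows; the URL is a placeholder, serving_knative_dev_configurationUID and pod_template_hash come from the discussion above, and the revisionUID/serviceUID label names are guesses that may need adjusting to the real label set.

    import requests

    PROM = "http://localhost:9090"  # placeholder, not the real Prometheus endpoint
    SELECTOR = '{job="k8s-pods-istio", destination_service_namespace="revscoring-articlequality"}'
    # Only serving_knative_dev_configurationUID and pod_template_hash are taken from
    # the discussion; the other two names are assumed and may differ.
    UID_LABELS = ("serving_knative_dev_configurationUID, serving_knative_dev_revisionUID, "
                  "serving_knative_dev_serviceUID, pod_template_hash")

    queries = {
        # number of raw series currently scraped from the sidecars
        "raw series": "count(istio_requests_total" + SELECTOR + ")",
        # number of series left if the UID-style labels were aggregated away
        "series without UID labels": "count(sum without(" + UID_LABELS + ") (istio_requests_total" + SELECTOR + "))",
    }

    for name, query in queries.items():
        resp = requests.get(PROM + "/api/v1/query", params={"query": query}, timeout=10)
        resp.raise_for_status()
        result = resp.json()["data"]["result"]
        print(name, "->", result[0]["value"][1] if result else "0")

A second count well below the first but still above one would mean the remaining labels (namespace, destination service, response code, and so on) still keep distinct series apart, i.e. dropping the UIDs would merge replicas rather than everything.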
[16:54:16] aiko: so finally back on testing :)
[16:54:22] I was wrong, the error that I see is
[16:54:23] ModuleNotFoundError: No module named 'knowledge_integrity.mediawiki'
[17:02:42] also one thing that I have realized is that we copy python's site-packages to /opt/lib/python/site-packages
[17:02:53] meanwhile in the base image I used /usr/lib/etc..
[17:05:29] anyway, I don't see mediawiki.py under /opt/lib/python/site-packages/knowledge_integrity/
[17:05:50] and it is not in https://gitlab.wikimedia.org/repos/research/knowledge_integrity/-/tree/rr-multilingual-gpu/knowledge_integrity?ref_type=heads
[17:09:46] ahh sorry I need to change the code
[17:09:59] 'knowledge_integrity.mediawiki' is from KI v0.6
[17:10:44] In that branch I hadn't rebased to the new KI version
[17:12:37] yes yes I figured :)
[17:12:42] let's restart tomorrow morning!
[17:14:23] ok :)
[17:15:21] see you tomorrow folks!
[17:16:06] have a nice evening Luca
[17:16:13] you too!
[17:18:32] \o
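For reference, on the ModuleNotFoundError above: since knowledge_integrity.mediawiki is only there from KI v0.6 (per the log), a quick sanity check inside the image is to print the installed version and whether the submodule is importable. A minimal sketch, assuming the distribution is installed under the name knowledge_integrity:

    import importlib.metadata
    import importlib.util

    # Which knowledge_integrity version actually ended up in the image?
    # (distribution name assumed to match the import name)
    print("installed version:", importlib.metadata.version("knowledge_integrity"))

    # Is the module the model server imports actually present?
    spec = importlib.util.find_spec("knowledge_integrity.mediawiki")
    print("knowledge_integrity.mediawiki importable:", spec is not None)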