[08:19:40] hello folks
[08:20:08] kevinbazira: o/ you can use deploy1002 :)
[08:20:25] hello elukey o/
[08:20:43] thanks for the confirmation, let me start the deployment.
[08:21:11] ack!
[08:21:25] In the meantime, the new storage initializer works!
[08:21:29] just deployed it in staging
[08:21:58] woohooo
[08:22:11] so kserve 0.8 seems to work fine
[08:22:15] at least in staging
[08:22:26] I'll do more tests
[08:30:58] both eqiad and codfw prod deployments have been completed successfully.
[08:30:58] checking pods now ...
[08:30:58] all new pods are up and running:
[08:30:58] NAME                                                              READY  STATUS   RESTARTS  AGE
[08:30:58] arwiki-articletopic-predictor-default-8w8g6-deployment-5ff98mw4  3/3    Running  0         3m
[08:30:58] cswiki-articletopic-predictor-default-nwbcb-deployment-85654hmx  3/3    Running  0         3m
[08:30:58] enwiki-articletopic-predictor-default-7k42x-deployment-858djscz  3/3    Running  0         3m
[08:34:35] super
[08:35:00] klausman: o/ I noticed that we don't have metrics for staging :(
[08:37:02] yeah, I know :-/
[08:37:32] so we didn't check this before closing the task
[08:37:33] I think we turned monitoring off originally to avoid alerts on staging, but that also disabled all prom scraping
[08:38:22] ???
[08:38:40] I don't recall disabling prom scraping :)
[08:38:42] There is a monitoring: false in the config
[08:39:11] in what config?
[08:39:38] The deployment charts
[08:39:53] Or do you mean metrics for the k8s/kserve stuff itself
[08:40:25] I mean metrics for the k8s control plane etc..
[08:40:58] https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&var-datasource=thanos&var-site=codfw&var-cluster=k8s-mlstaging
[08:41:05] I see metrics?
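The post-deployment pod check above can be scripted as a quick sanity gate. This is a sketch, not the team's actual tooling: it filters saved `kubectl get pods` output for anything not in the Running phase (the `/tmp/pods.txt` path is an assumption; the sample content mirrors the listing pasted in the log):

```shell
# Save a sample of `kubectl get pods` output (in real use this would be
# `kubectl get pods -n <namespace> > /tmp/pods.txt`).
cat > /tmp/pods.txt <<'EOF'
NAME                                                              READY  STATUS   RESTARTS  AGE
arwiki-articletopic-predictor-default-8w8g6-deployment-5ff98mw4  3/3    Running  0         3m
cswiki-articletopic-predictor-default-nwbcb-deployment-85654hmx  3/3    Running  0         3m
enwiki-articletopic-predictor-default-7k42x-deployment-858djscz  3/3    Running  0         3m
EOF

# Count pods whose STATUS column (field 3) is not "Running",
# skipping the header line. A healthy deployment prints 0.
awk 'NR > 1 && $3 != "Running"' /tmp/pods.txt | wc -l
```

The same one-liner can be dropped into a deployment script and turned into an exit code with `grep -q` or a `[ "$count" -eq 0 ]` test.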
[08:42:07] right, for some reason grafana didn't show them to me, weird
[08:42:10] now I can see them
[08:42:28] but we are missing some puppet config, at least reading from https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Prometheus
[08:42:30] @elukey o/
[08:42:42] Some of the dashboards don't list ml-staging but I dunno why
[08:42:56] I'll take a look
[08:43:12] @elukey I added a new section to the kserve documentation (I was thinking of adding it after your examples but it felt not right) https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing/KServe#New_service
[08:44:09] About the dependencies for KServe, those four dependencies are the minimum set that I've tested; other dependencies of dependencies will be automatically installed by pip, no problem
[08:44:37] Or do you think it's better to provide a full list of dependencies (all dependencies pinned) like https://github.com/wikimedia/machinelearning-liftwing-inference-services/blob/main/outlink-topic-model/model-server/requirements.txt?
[08:44:43] klausman: https://gerrit.wikimedia.org/r/c/operations/puppet/+/817201
[08:45:02] aiko: checking
[08:45:19] Luca, always ahead of me :D
[08:45:46] +1'd
[08:45:49] thanks!
[08:45:59] And sorry for missing it originally
[08:48:04] super fine, it happens!
[08:48:34] aiko: I'll read the whole doc and test it, looks great! The only weird thing in the main wikitech page is that we have "New Service" and "Run Kserve locally via Docker"
[08:54:39] klausman: for T312550 I think that we could use your snippet and simply add a "main-json.log" file and that's it
[08:54:53] elukey: yeah it is a bit weird because I was thinking that if I add it after your examples, people will see more complex examples first.
[08:55:04] the size of the logs is fine and there is a logrotate rule for *.log under /srv/ores/log
[08:55:08] what do you think?
[08:55:26] aiko: maybe we can rename the sections, just for clarity
[08:55:51] elukey: what about log rotation/expiry?
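Because the existing rule globs `*.log` under the ORES log directory, a new `main-json.log` would be rotated automatically with no extra config. A minimal sketch of what such a logrotate stanza typically looks like (the retention and frequency values here are assumptions, not copied from the real `/etc/logrotate.d/ores`):

```
/srv/ores/log/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
```

Any file matching the glob, current or future, is picked up on the next logrotate run, which is why naming the new file `main-json.log` is all that's needed.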
Or instead maybe sending to `/dev/null`? I am not sure the locally-saved log will be useful for much
[08:56:09] elukey: yep we can do that!
[08:57:06] klausman: /etc/logrotate.d/ores should take care of everything if we name the new log accordingly; we can have both main.log and main-json.log in my opinion
[08:57:33] ORES will hopefully go away during the next months
[08:57:47] Alright. I'll make a patch
[08:57:59] I am more worried about not having those logs than about having a few MBs on disk
[08:58:18] we have ~600G free on ores1001:/srv :D
[09:01:54] Hmm. It seems this config is used widely. Should we special-case it for ores or do it everywhere?
[09:03:01] klausman: let's do it for ORES only for the moment, if there is a way to override the general config with ours
[09:03:15] I'll see what I can do :)
[09:03:42] So the main config for uwsgi is modules/service/manifests/uwsgi.pp
[09:05:09] Would an override then live in modules/ores/manifests/?
[09:07:01] elukey: I renamed the section :)
[09:18:32] klausman: checking
[09:19:04] I think we'd have to if/else special-case ores in the former file, since it does complex generation of the config file with deep merges and stuff
[09:21:29] yeah it seems all entangled
[09:21:51] I'll try and figure it out, probably with `if $title == "ores"`
[09:21:56] aiko: way better, thanks!
[09:22:40] klausman: let's avoid it, I hoped there was a way to do it via hiera or similar.. one thing that we could do is to add a flag for a main-json.log or similar
[09:23:04] you mean a flag for the uwsgi class?
[09:23:05] so that log_config_local could have the extra json logger only for ores
[09:23:08] yes yes
[09:23:34] Hmm. Yeah, that could work. I am just unsure how to add two local loggers with the existing class.
[09:32:23] aaah, my brain hurts.
[09:45:45] elukey: https://puppet-compiler.wmflabs.org/pcc-worker1002/36378/ores1001.eqiad.wmnet/change.ores1001.eqiad.wmnet.err This is...
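The flag-based approach discussed above (a parameter on the uwsgi class instead of an `if $title == "ores"` special case) might look roughly like this in Puppet. This is a hypothetical sketch only: the parameter name, variable names, and structure are assumptions, not the real `modules/service/manifests/uwsgi.pp`:

```puppet
# Hypothetical sketch: a boolean flag on the shared uwsgi class that
# merges an extra JSON logger into the generated config only when the
# calling profile (e.g. ORES) opts in. Names are illustrative.
class service::uwsgi (
    String  $title,
    Hash    $core_config,
    Boolean $json_log = false,
) {
    $json_logger = $json_log ? {
        true    => { 'logger' => "json file:/srv/${title}/log/main-json.log" },
        default => {},
    }
    $config = deep_merge($core_config, $json_logger)
    # ... render the uwsgi ini file from $config as the class already does ...
}
```

The appeal is that every other user of the class is untouched (`json_log` defaults to `false`), and the ORES profile enables it via a single hiera-settable parameter instead of forking the class.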
puzzling
[09:46:32] this is related to another change, something happened recently probably
[09:52:06] yeah, HEAD is broken for ores, with that error
[10:15:33] going out for lunch! ttl
[10:23:23] \o
[11:26:12] elukey: I think I got the ores-uwsgi change ready now
[12:34:11] klausman: commented!
[12:38:24] Nice catch with log_encoder. I dunno if it's at all possible with the current puppet setup to have more than one of those :-/
[12:39:28] Isn't Puppet meant to make these things simpler :-/
[12:52:04] elukey: I have no clue how to add a second log encoder. I can overwrite the existing one (using `log-encoder => [...]`), but adding a second one seems impossible with this setup.
[12:52:32] See https://puppet-compiler.wmflabs.org/pcc-worker1001/36405/ores1001.eqiad.wmnet/index.html
[12:56:18] klausman: can we re-use the same log encoder?
[12:56:42] I don't think so, since its definition includes the target/sink name
[12:57:46] (or at least I think so. The config delightfully overloads sink names with encoder/logging types :-/)
[12:58:55] not great
[13:00:14] testing one thing on ores1001
[13:00:35] I am this close to forking uwsgi.pp entirely and making a new class just for ORES
[13:01:24] I think if the log-encoder tuple was actually a list of tuples, this might work.
[13:02:03] But we're sinking a lot of time and effort into this :-/
[13:02:33] we are, but it is important, we don't have logs for ORES
[13:02:40] nor aggregation about UAs etc..
[13:02:47] it is vital for the migration to liftwing
[13:03:31] Oh, I don't contest the need for a fix. But I am willing to do it in a radical, simple way.
[13:03:52] Mostly since the incurred inflexibility is unlikely to become a problem
[13:04:18] I'd prefer not to fork the uwsgi class if possible
[13:04:58] Well, since WMF does not use class inheritance, that's about the only thing we can do besides continuing with the current approach
[13:05:13] I'll see how much work it would be to turn the tuple into a list of tuples.
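For reference, uwsgi itself (independent of the Puppet templating problem above) does support attaching an encoder to a specific named logger with the `log-encoder = <encoder>:<logger-name> ...` form, so two loggers with different encodings look roughly like this. The file paths are assumptions:

```ini
[uwsgi]
; plain-text logger, untouched
logger = main file:/srv/ores/log/main.log

; second logger that only receives JSON-encoded lines
logger = mainjson file:/srv/ores/log/main-json.log
log-encoder = json:mainjson {"unix":${unix},"msg":"${msg}"}
log-encoder = nl:mainjson
```

The trailing `nl` encoder appends a newline after each JSON record, which file-based sinks need. The difficulty in the channel is purely that the shared Puppet class renders a single `log-encoder` key, not that uwsgi can't do it.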
[13:05:34] did we check why on netbox1001 everything works and on oresXXXX it doesn't?
[13:05:44] Not in-depth
[13:05:58] But I can have a closer look
[13:07:46] ah snap, on netbox we have bullseye
[13:07:56] Ah. I was about to ask.
[13:08:33] same for puppetboard
[13:08:34] so 2.0.19.1-7.1 vs 2.0.18-1
[13:09:52] Hmmm. Debmonitor is on Buster and has a similar config
[13:10:24] uwsgi is 2.0.18-1
[13:10:39] and the logging works (I see non-zero UDP datagrams in tshark)
[13:13:09] Things we do on ORES that debmonitor doesn't: adding headers, logging x-forwarded-for. Debmonitor sets buffer_size=8192, but I don't know what that buffer is for
[13:14:21] testing it on ores1001 now :D (depooled)
[13:15:36] klausman: it works :D
[13:15:47] Buffer size did it?
[13:15:54] yep!
[13:15:58] I see logs on logstash
[13:16:02] Ok, that'll vastly simplify things
[13:16:09] definitely
[13:16:27] so probably the json buffer is related and it gets truncated or similar
[13:16:58] Or the request body is free()d before the JSON is done
[13:17:16] "Set the max size of a request (request-body excluded); this generally maps to the size of request headers. By default it is 4k. If you receive a bigger request (for example with big cookies or query string) you may need to increase it. It is a security measure too, so adapt it to your app needs instead of maxing it out."
[13:17:37] let's bump it to 8k then
[13:17:42] klausman: can you file a code change?
[13:17:43] on it
[13:19:45] thanks
[13:20:30] I'll reuse the change I had, for posterity
[13:20:41] yep yep
[13:21:01] let's also update the upstream github issue etc.., it was a good report, maybe we'll save some other poor soul
[13:21:07] Aye
[13:21:24] (it is arguably still an open bug, but at least there is a known mitigation now)
[13:21:38] Ok, change updated, PCC looks as expected
[13:23:19] Updated upstream bug
[13:24:52] +1ed, there is a small typo in the commit msg but LGTM!
[13:25:23] commitmsg fixed.
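The mitigation the two arrived at is a one-line uwsgi setting. A sketch of the relevant fragment (the surrounding options are omitted; only `buffer-size` is the change discussed here):

```ini
[uwsgi]
; uwsgi's default request buffer (headers, cookies, query string) is 4k.
; Debmonitor runs with 8192 and its JSON logging works on the same uwsgi
; version, and bumping ORES to match fixed the truncated/dropped JSON
; log records observed here.
buffer-size = 8192
```

As noted in the channel, this is a mitigation rather than a root-cause fix: the interaction between the request buffer and the JSON log encoder remains an open upstream bug.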
[13:26:02] I presume with us not messing with the uwsgi class anymore, we don't need a review from Joe/Alex?
[13:26:18] yep yep
[13:26:23] we can merge and deploy
[13:26:41] will do
[13:28:14] I'll also force a puppet run on the first three ores machines in eqiad, see what happens
[13:28:34] sure
[13:32:29] where would one see the topic size/message rate relevant for this?
[13:33:07] if we see logstash being populated it should be enough
[13:33:27] I see logs in https://logstash.wikimedia.org/app/dashboards#/view/ORES
[13:34:02] yeah, that looks nice
[13:35:42] going afk a bit before meetings
[14:03:07] kevinbazira_: team meetinggggg :)
[14:04:54] joining now ..
[16:21:00] aiko: https://pypi.org/project/mwapi/0.6.1/ :)
[16:21:26] woohoo
[16:21:53] going afk! have a nice rest of the day folks!
[18:23:06] 10Machine-Learning-Team, 10SRE, 10ops-codfw: codfw: ml-serve2001 memmory issue DIMM A2 - https://phabricator.wikimedia.org/T313822 (10Volans) p:05Triage→03High
[18:48:08] (03PS8) 10AikoChou: outlink: use async HTTP calls to fetch data [machinelearning/liftwing/inference-services] - 10https://gerrit.wikimedia.org/r/807135 (https://phabricator.wikimedia.org/T311043)