[06:57:11] accraze: niceeee!
[06:57:14] great work
[06:58:12] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10elukey) >>! In T294946#7665517, @Papaul wrote: > Can someone please update this task with the Partitioning/Raid information? > > Thanks. Hi Papaul! Th...
[08:07:45] * elukey errand for ~1h
[10:32:39] I applied https://gerrit.wikimedia.org/r/c/operations/puppet/+/758779 manually to ores1001 and it seems to be working
[10:32:54] I can see external IPs in logstash when possible
[10:33:11] it should help us to track down issues with external clients
[10:33:27] I don't think that we have something like that for lift wing
[10:33:47] ideally we should have a logstash dashboard with kserve logs like the ORES one
[11:28:34] rolling out the change for the XFF header
[12:31:36] * elukey lunch
[14:38:10] very interesting data from ORES
[14:40:29] I believe that the Go-http-client/1.1 UA, that keeps requesting /v2/scores, is a health check
[14:41:05] the UA is weird, there is "Twisted PageGetter" doing the same, that should be pybal (part of our load balancer infra)
[14:41:24] with the new settings I don't see XFF external IP addresses
[14:42:04] then it remains mediawiki (where the extension is turned on)
[14:42:16] change-prop (doing the constant precache)
[14:42:27] and other UAs related to known ORES-related bots
[14:42:36] but making very few requests once in a while
[15:04:06] The Go UA is just the stdlib http client, so it could be anything using that.
[15:04:31] My best guess would be Prometheus
[15:15:24] elukey: updated the AGIPGW doc and reviewed your change, btw
[15:17:21] klausman: I don't think it is prometheus, we have a generic prometheus statsd exporter (so ores pushes to a local statsd endpoint, and then the prometheus masters poll from it)
[15:17:47] hmm. maybe a custom healthcheck used by icinga?
[15:18:09] do the client IPs of that UA indicate anything useful?
[15:23:43] there is no XFF so only local, with some tcpdump we may be able to figure out what it is, but probably not super important. The sad thing is that "Real" traffic, from bots, is a tiny tiny fraction of the whole machinery
[15:25:02] Yeah, only really warrants follow-up if there is a problem.
[15:25:22] I have also cleaned up a little https://grafana-rw.wikimedia.org/d/HIRrxQ6mk/ores
[15:25:41] and moved per-minute graphs (the prom settings were a little wrong) to per-second graphs
[15:25:48] irate[5] basically
[15:25:55] and it depicts a different picture
[15:25:58] That looks a lot better, nice!
[15:27:41] I wish Grafana had a better way to manage that "middle" set of graphs. I.e. not throwaways, but also not permanent either. The kind of stuff that will live maybe 2-3 weeks, and 5% or so of it eventually becomes a permanent part of some dashboard
[15:28:31] (that said, Grafana is _so_ much better than any graphing UI I have used before... I shudder when I think of the dreadful experience that was Cacti and its ilk)
[15:28:53] (checked the gdoc, all good afaics!)
[15:29:28] (excellent. I have some more unformed ideas/aspects in my head. Will ruminate on them some more and then put them in)
[15:33:53] morning all
[15:35:49] morning!
[15:37:51] \o 'lo Chris
[15:38:00] Is it possible to know how many requests (bots, etc) are getting a score from the precache?
Basically, how much is the API used by real users (not health checks, not precache, etc)
[15:43:10] Wouldn't it be tricky to distinguish a bot (automated query) from someone who uses an interactive tool that happens to use the same library?
[15:45:18] we can isolate in kibana/logstash the requests that are related to bots, and possibly check the response_time
[15:45:42] maybe we add a little breakdown about how much time a response takes
[15:46:10] but it will be something related to a specific time window, nothing aggregate afaics
[15:46:44] there is also the webrequest dataset that could help
[15:46:59] we could filter requests for certain UAs and have a breakdown of response time
[15:47:17] that would be more generic and possibly spanning a wider time window
[15:47:31] but my impression is that the traffic is really tiny
[15:47:59] Okay cool, it isn't important. I am just trying to figure out how much actual usage there is by external users (as opposed to WMF systems like pre-cache, etc)
[15:48:04] if I made the graphs right, eqiad sees 30 rps (precache + normal requests)
[15:48:14] if you remove precache it remains very little :D
[15:48:38] right but I didn't know if a ton of people were getting predictions from the precache
[15:50:43] IIUC all the URIs ending up in /precache are hit only by changeprop, so the rest should be people's traffic
[15:51:06] and the only way to get the answer cache/not-cache is the response time
[15:51:22] but I am probably wrong, this is unknown territory for me
[15:51:49] if this is the right understanding, sometimes I see a mozilla-related UA but very rarely
[15:58:39] ah interesting, I just noticed the draft quality pods crashlooping and causing a ton of messages to logstash :D
[15:59:38] FileNotFoundError: [Errno 2] No such file or directory: '/mnt/models/model.bz2'
[15:59:45] mmmm
[16:02:28] ^ uh oh
[16:02:30] o/
[16:02:43] Hey Andy
[16:06:52] o/
[16:06:55] so I see //wmf-ml-models/draftquality/enwiki/202107141649/model.bz2
[16:09:18] and in the values.yaml file I see
[16:09:20] value: "s3://wmf-ml-models/draftquality/enwiki/202107141649/"
[16:09:28] that seems ok
[16:10:00] ah wait, now I am starting to remember that maybe the storage initializer wants model.bin
[16:10:23] there was a task about it, totally out of brain cache now :D
[16:13:06] elukey: draftquality should be fine with model.bz2
[16:13:07] https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/revscoring/draftquality/model-server/model.py#15
[16:13:10] chrisalbon: btw, with the Tech Dept Updates/OH tomorrow coinciding with our two meetings, are we gonna skip them or bump them elsewhere?
[16:13:26] ^^ was wondering the same
[16:13:32] it does?
[16:13:49] It does!
[16:13:52] accraze: yeah I meant the storage-initializer part
[16:14:40] ohhh good point
[16:15:08] Alright let's skip it. I hate to skip it, but rescheduling is hard for us given timezones
[16:15:20] accraze: it is the only one with a bz2 model file right?
[16:15:22] skip the meeting, not the storage-initializer part
[16:15:49] elukey: yep only one with bz2 model file
[16:16:26] it was working though... only thing that changed is we added a transformer?
[16:16:51] do we know if the log is coming from transformer or predictor? (both need to load the model...)
[16:18:18] good point, the transformer
[16:18:23] the predictor is up
[16:18:49] does the transformer need the model?
[16:18:58] (I verified, the storage init doesn't care about model file names)
[16:19:37] yeah transformer needs the model loaded in order to extract features
[16:19:40] https://gerrit.wikimedia.org/r/plugins/gitiles/machinelearning/liftwing/inference-services/+/refs/heads/main/revscoring/draftquality/transformer/draftquality_transformer/draftquality_transformer.py#46
[16:19:43] :(
[16:20:50] i wonder if we need to include the `STORAGE_URI` for the transformer as well. i just assumed it would be able to access `/mnt/models/` from inside the cluster
[16:22:43] ahhhh right, now I see it!
[16:22:57] so the transformer is a different pod
[16:23:07] it doesn't share anything with the predictor one
[16:23:15] so we need to add the STORAGE_URI etc.. as well
[16:23:23] (for the transformer)
[16:23:31] accraze: --^
[16:23:49] aha!
[16:25:15] that makes sense, i forgot they are two separate pods
[16:31:06] accraze: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/758883/
[16:32:19] nice! +1'd
[16:32:52] I am going to make a logstash dashboard for liftwing
[16:33:00] so we can check logs etc.. coming from the pods
[16:35:20] that would be excellent
[16:52:06] lovely, now it complains about credentials
[16:52:44] swift credentials?
[16:53:03] yeah but I know why
[16:53:19] we create a service account, in the kserve chart, only for the predictor use case
[16:53:19] oh we might need to add the service account to the transformer too
[16:53:23] exactly yes
[16:53:27] ^^
[16:54:06] I am going to log off after the next meeting, I'll fix it tomorrow morning first thing :)
[16:54:44] cool that sounds good! thanks for catching that elukey!
[16:58:00] 10ORES, 10artificial-intelligence, 10articlequality-modeling, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (10Halfak) Thanks @ACraze! I've been testing t...
[17:12:53] accraze: I think that there may be more problems related to the transformer sa
[17:13:13] in theory the one that we have, called "kserve", is generic and not associated with the predictor in any way afaics
[17:16:41] will restart the debugging session tomorrow :)
[17:16:46] have a nice rest of the day folks!
[17:17:11] see ya elukey!
[17:24:04] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10Papaul)
[17:24:50] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (10Papaul)
[17:57:56] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-staging2001.codfw.wmnet with O...
[18:16:03] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-staging2001.codfw.wmnet with OS buster executed with err...
[18:25:08] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-staging2001.codfw.wmnet with OS buster
[18:44:14] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-staging2001.codfw.wmnet with OS buster completed: - ml-s...
[18:46:01] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-staging2002.codfw.wmnet with OS buster
[18:52:20] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10Papaul)
[19:19:51] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-staging2002.codfw.wmnet with OS buster completed: - ml-s...
[19:33:47] 10ORES, 10artificial-intelligence, 10articlequality-modeling, 10Machine-Learning-Team (Active Tasks), 10Patch-For-Review: ORES deployment - Winter 2022 - nlwiki articlequality/hiwiki editquality/ores observability - https://phabricator.wikimedia.org/T300195 (10Halfak) I have rebuilt the English Wikipedia...
[19:46:35] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10Papaul)
[19:47:39] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install ml-staging200[12] - https://phabricator.wikimedia.org/T294946 (10Papaul) @elukey all yours leaving the task open since i don't have the Packing Slip to receive the servers in Coupa
[20:28:18] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (10Papaul)
[22:22:09] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host ml-serve2005.codfw.wmnet with OS b...
[22:53:52] 10Machine-Learning-Team, 10DC-Ops, 10SRE, 10ops-codfw, 10Patch-For-Review: (Need By: TBD) rack/setup/install ml-serve200[5-8] - https://phabricator.wikimedia.org/T294945 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host ml-serve2005.codfw.wmnet with OS buste...
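(Footnote on the draftquality crashloop discussed earlier in the day: a minimal sketch of why the transformer pod needs the model fetched by the kserve storage-initializer, just like the predictor. This is illustrative only; the MODEL_PATH variable and the exact load code are assumptions, not the actual Lift Wing transformer code.)

```python
# Illustrative sketch, not the actual draftquality_transformer.py.
# Why the transformer crashlooped: it loads the model itself (feature
# extraction needs it), so /mnt/models/ must be populated for *this* pod
# by the kserve storage-initializer. That only happens when the transformer
# also gets STORAGE_URI (and the "kserve" service account for the swift/s3
# credentials); sharing an InferenceService with the predictor is not
# enough, since predictor and transformer run as separate pods.
import bz2
import os

from revscoring import Model  # revscoring backs the draftquality models

# Hypothetical env var; the real service may hardcode or configure the path.
MODEL_PATH = os.environ.get("MODEL_PATH", "/mnt/models/model.bz2")


def load_model():
    # Raises FileNotFoundError (the error seen in logstash) if the
    # storage-initializer never ran for the transformer pod.
    with bz2.open(MODEL_PATH) as f:
        return Model.load(f)


if __name__ == "__main__":
    load_model()
    print("model loaded from", MODEL_PATH)
```

Per the conversation above, the deployment-charts change (758883) adds the STORAGE_URI to the transformer; the remaining piece, given the swift credentials error, is attaching the same service account to the transformer as well.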