[07:30:21] hi folks, the 3-day retention for webrequest sampled live in Druid worked nicely, I bumped it to 8 days
[07:30:53] We are working with data engineering to possibly go to 30 days and drop the _128 datasource (batch from Hadoop, with 2 hours of delay)
[07:34:33] nice!
[07:34:53] that's great elukey
[10:05:52] godog: \o I have a question about the cadvisor rollout. It seems to correlate with one of our staging k8s worker nodes going offline. We've looked at other things, but there's nothing in the logs. Two days ago, ml-staging2001 just went AWOL from k8s' point of view.
[10:06:42] in the kubelet logs we see
[10:06:43] cadvisor[997437]: W0531 10:05:40.861998 997437 manager.go:159] Cannot detect current cgroup on cgroup v2
[10:07:56] mmm wait, systemctl cat cadvisor points to the kubelet + a puppet override
[10:08:07] klausman: ha! ok, taking a look too
[10:08:22] is it normal?
[10:08:27] totally ignorant about it
[10:08:49] We're not saying it's cadvisor's fault, but the timeline matches up, and it's a bit puzzling what else it could be
[10:09:34] yes yes we totally blame Filippo :D
[10:09:39] don't hide from it
[10:09:45] lolz
[10:10:20] I'm not even sure ml-staging2001 is part of the rollout (not sure if k8s nodes always have a cadvisor running)
[10:10:55] it is yeah, as far as I'm aware only cp* and mw* hosts already have cadvisor running explicitly
[10:11:40] interesting that kubelet.service has Conflicts=cadvisor.service
[10:11:45] godog: the thing that I don't understand is why I see a cadvisor override when I exec `systemctl cat kubelet`
[10:11:51] and Alias=cadvisor.service
[10:12:03] ok ok more info pointing to the same thing
[10:12:28] in fact we don't have a kubelet running on 2001, that explains why the node is dead
[10:12:32] klausman: --^
[10:12:59] So cadvisor is masking the kubelet?
[10:13:22] ExecStart=/usr/bin/cadvisor yep
[10:14:32] mhh also metrics from cadvisor have a whole bunch of extra labels, I'm assuming from the fact that the node runs k8s
[10:14:43] curl $(hostname):4194/metrics that is
[10:14:55] yeah, it probably exports all the container info from stuff that was already running there
[10:16:27] so on 2002 we have cadvisor as well, but the kubelet's unit is not overridden
[10:17:01] if I check the cadvisor's unit status I get the kubelet's one though (that's Alias=cadvisor.service ?)
[10:17:07] would you mind filing a task with this issue in the meantime for tracking? a child of T108027
[10:17:07] T108027: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027
[10:18:08] elukey: is that ml-staging2002 ? cadvisor isn't installed there so yes I believe that's an effect of Alias
[10:18:46] so 2002 isn't part of the rollout yet?
[10:20:12] godog: yes correct!
[10:20:23] same thing on other kubernetes nodes
[10:21:00] klausman: do you mind opening a task as Filippo suggested? To track the work
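A rough sketch of the unit layout being described above, for readers following along: the kubelet unit shipped by the kubernetes-node deb declares Conflicts=cadvisor.service and Alias=cadvisor.service, so once the unit is enabled, cadvisor.service is just another name for kubelet.service, and a puppet drop-in written for cadvisor ends up affecting kubelet. The fragments below are assembled from the chat for illustration only; file paths and exact contents are assumptions, not copies of the real units.

    # /lib/systemd/system/kubelet.service -- shipped by the kubernetes-node deb
    [Unit]
    Conflicts=cadvisor.service
    [Service]
    ExecStart=/usr/bin/kubelet ...    # flags omitted
    [Install]
    Alias=cadvisor.service

    # /etc/systemd/system/cadvisor.service.d/override.conf -- puppet-managed,
    # written for a standalone cadvisor unit (hypothetical path and contents)
    [Service]
    ExecStart=
    ExecStart=/usr/bin/cadvisor

    # Because cadvisor.service is only an alias (a symlink created when the unit
    # was enabled), the override written for cadvisor is applied to the kubelet
    # unit as well -- matching what `systemctl cat kubelet` showed above, and why
    # the node ends up running cadvisor instead of a kubelet.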
[10:21:05] otherwise I can do it
[10:21:15] on it
[10:21:22] super thanks
[10:22:09] T337836
[10:22:09] T337836: Cadvisor may be breaking Kubernetes worker nodes - https://phabricator.wikimedia.org/T337836
[10:22:19] cheers
[10:22:23] I probably missed some info, feel free to add/edit
[10:22:30] yeah 2002 isn't part of the rollout yet
[10:23:58] klausman: nono all good
[10:25:27] I tend to be a bit conservative with initial bug reports (dialing back speculation to not prime whoever helps debugging)
[10:25:40] as I understand it there are two different but related issues: having kubelet and cadvisor play nice together, and asking cadvisor to not bother with docker/k8s stats
[10:26:08] do you mind if I test out stuff on ml-staging2001 with puppet disabled?
[10:27:05] I have no objections
[10:27:20] even if we had lots of prod traffic, it's still a staging cluster
[10:27:34] godog: +1
[10:27:47] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/kubernetes/+/refs/heads/v1.23/debian/kubernetes-node.kubelet.service states the cadvisor alias for sure
[10:27:47] ok! thank you folks
[10:28:02] maybe our override.conf interferes?
[10:31:23] for cadvisor? the override.conf changes only the ExecStart
[10:33:04] I was searching for ways to stop cadvisor from talking to docker, https://github.com/google/cadvisor/issues/2848
[10:37:58] I mean --docker /dev/null "works"
[10:38:05] godog: what I mean is that the kubelet service unit has an alias to cadvisor, so IIUC if we apply an override to it the kubelet unit gets the same
[10:38:59] elukey: ah now I get what you mean, yeah that must be it
[10:39:15] I'll try removing the Alias
[10:39:41] it seems to be shipped with our kubelet deb though
[10:39:55] (better - kubernetes-node)
[10:42:48] mmmm
[10:43:29] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/kubernetes/+/477ac9742258bf26348b269befb06db828978b98%5E%21/debian/kubernetes-node.kubelet.service
[10:43:46] from a long time ago, maybe we should remove the Alias?
[10:43:48] ok so what I did: install an override for kubelet.service that removes the Conflicts= and Alias=, then remove the /etc/systemd/system/cadvisor.service symlink
[10:44:08] yeah I think we can remove Alias
[10:44:09] super
[10:44:26] I am wondering if there was a motivation or if upstream removed it at some point
[10:49:30] I don't know, I'll update the task
[10:50:27] godog: added some updates as well
[10:52:01] cheers
[10:53:04] that's basically what I wanted to write too, I'll skip mine
[10:53:55] I'll re-enable puppet on ml-staging2001 and revert my changes
[10:54:48] super thanks!
[10:54:53] ok all done
[10:55:42] <3
[11:00:30] going to lunch, will check back later
[11:01:40] thanks for your help!
[11:01:55] sure np! thanks for reaching out
[11:11:28] elukey: not sure if it was covered but the "Cannot detect current cgroup on cgroup v2"
[11:11:35] is just a warning and can be ignored
[11:11:50] * jbond sees the same thing locally
[11:12:54] Well, it got us pointed in the right direction :)
[11:13:10] :)
[12:10:30] effie: where are we at with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/923588?
[12:11:23] are we good to enable this for enwiki+frwiki+dewiki?
[12:58:31] jbond: yeah my concern was more about why cadvisor was writing logs as if it was the kubelet service
[13:43:39] duesen: Amir merged the patch mentioned there yesterday EU night
[13:44:29] duesen: do frwiki?
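To make the workaround described at 10:43:48 concrete, here is a minimal sketch of what it could look like. Only the directives to clear and the symlink path come from the chat; the drop-in file name, the empty-assignment resets, and the restart step are assumptions, not the exact commands that were run on ml-staging2001.

    # /etc/systemd/system/kubelet.service.d/no-cadvisor.conf  (hypothetical name)
    [Unit]
    # an empty assignment resets a list-type setting inherited from the unit file
    Conflicts=

    [Install]
    Alias=

    # then, on the host:
    rm /etc/systemd/system/cadvisor.service    # drop the alias symlink
    systemctl daemon-reload
    systemctl restart kubelet.service

    # separately, `--docker /dev/null` on cadvisor's command line is the option
    # mentioned at 10:37:58 that "works" for keeping cadvisor away from docker stats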
[13:54:54] Amir1, effie: heads up: main stash writes from VE are spiking, as expected after switching small+medium to direct mode: https://grafana.wikimedia.org/goto/u27-wXwVz?orgId=1
[13:55:18] effie: just frwiki? Ok, I'll adjust the patch. We can do it tomorrow morning.
[13:57:40] cool
[13:57:55] effie: we merged the patch but it got reverted
[13:58:05] it was merged by accident
[13:58:11] unless it got deployed again
[14:00:55] ?
[14:01:09] I am confused now
[14:01:23] it got deployed again https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/924358
[14:01:27] but earlier today
[14:03:58] ok that is another patch, not duesen's patch, not your jobrunner patch
[14:05:39] okay, then sorry
[14:05:45] but the jobrunner one was done last night
[14:06:15] https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=cirrusSearchElasticaWrite&from=1685452786547&to=1685539186547&viewPanel=74
[14:06:34] Amir1: yea I saw that, that's awesome!
[14:06:51] anyway...
[14:06:58] so now I can isolate it, feel free to do whatever you want with the jobrunners
[14:07:00] I scheduled this for 7:00 UTC tomorrow: Enable parser cache warming jobs for parsoid on frwiki
[14:07:06] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/923588
[14:08:25] sounds good to me
[14:08:34] but effie needs to monitor it
[14:14:21] effie: can you?
[14:30:23] yes
[15:22:32] hmmm
[15:22:46] jbond: we are having some issues with puppet on cp2035, it's getting stalled loading facts
[15:22:57] May 31 15:14:03 cp2035 puppet-agent[1751348]: Loading facts
[15:27:16] vgutierrez: ack give me 5 mins and i can take a look (i'd try `facter -d -p` if you haven't already)
[15:28:34] 2023-05-31 15:28:04.952025 DEBUG leatherman.execution:93 - executing command: /usr/bin/sh -c /usr/sbin/ipmi-oem dell get-system-info idrac-info
[15:28:47] that seems to be the culprit
[15:29:22] ipmi is timing out pretty slowly
[15:31:10] i'd try a racreset if it's safe to do so
[15:34:18] jbond: mgmt interface is unreachable
[15:36:00] ok cp2035 is known
[15:36:02] vgutierrez: probably best to get dc-ops to look at it
[15:36:10] https://phabricator.wikimedia.org/T323557
[15:36:17] i think the silence just expired
[15:36:40] yep
[15:36:59] We detected the issue while running a cookbook there
[15:37:16] Host is currently depooled so no major impact
[15:37:39] thanks vgutierrez
[15:37:59] ack i think there may even be a task for the ipmi fact
[15:38:01] is it worth adding a timeout to the ipmi-oem invocation?
[15:38:14] cdanis: i have a feeling it did have one but got removed
[15:38:21] * jbond looking
[15:39:32] sukhe: yep.. but it's degrading.. now puppet-agent is unable to run there
[15:41:29] stopped the port 80 cookbook and downtimed cp2035
[15:41:37] ok, makes sense
[15:45:01] and of course the host is depooled
[15:45:34] Is anyone not busy able to look up a trace in logstash?
[15:47:42] RhinosF1: what are we looking for? I can do it (but note SRE meeting in ~10 mins)
[15:48:56] RhinosF1: is it the thing in -tech?
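For anyone reproducing the cp2035 stall above: the two commands quoted in the chat can be combined into a quick manual check on the affected host. The grep filter and the 60-second bound are assumptions added here; `timeout` is the coreutils tool and the ipmi-oem invocation is the one facter was stuck on.

    # run facter the way the agent does and see which fact it stalls on
    facter -d -p 2>&1 | grep -i ipmi

    # confirm the BMC query itself is what hangs, without letting it block forever
    time timeout 60 /usr/sbin/ipmi-oem dell get-system-info idrac-info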
[15:50:03] sukhe: 16:39:12 [ec5527e2-100f-42e7-8c97-297b34f572ac] 2023-05-31 15:38:59: Fatal undtagelse af typen "Exception"
[15:50:08] If you can stick it in a phab paste that would be great
[15:50:15] Yes
[15:50:22] My signal was lost for a minute
[15:58:06] RhinosF1: it looks to be more T337700
[15:58:07] T337700: Exception: preg_match_all error 4: Malformed UTF-8 characters, possibly incorrectly encoded - https://phabricator.wikimedia.org/T337700
[15:58:45] vgutierrez: cdanis: fyi i took a look and currently the ipmi commands have a timeout of ~1m. i can't find evidence of the timeout being removed before, so we could look at adding one back; however it'd need to be done in a number of places and would probably benefit from some refactoring so everything that talks to ipmi is in the same file, then we would surround everything in `if ipmi ping; then ... fi`
[15:59:37] that said i'm not sure if that was the only issue. the ipmi commands were actually timing out, so there could be something else being funky. i think the box is being booted now so i couldn't test further
[15:59:48] feel free to raise a task and i can take a closer look
[16:00:02] yep... dcops is working on it right now
[16:00:08] ack
[16:01:15] RhinosF1: TheresNoTime: thanks folks, will keep an eye out on this and happy to follow up
[16:31:02] sukhe: we've made the task UBN. It looks like dawiki is completely uneditable for people with language set to dansk
[16:34:19] RhinosF1: thanks, tracking
[19:38:09] How can I get telemetry for the mainstash? Either from MediaWiki, or at the database level? I'm mainly interested in read/write rates. Though it would also be nice to know the available disk space.
[19:38:48] I am asking because today we switched VE on most wikis away from restbase, so now it's using mainstash for stashing edits.
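Circling back to jbond's note at 15:58:45: a minimal shell sketch of bounding the slow iDRAC query so a wedged BMC cannot stall fact collection. The reachability pre-check he calls "ipmi ping" was left open in the chat, so only the hard timeout is shown; the 60-second value mirrors the "~1m" mentioned and is otherwise an assumption.

    # wrap the vendor query in timeout(1) (coreutils); a reachability
    # pre-check ("ipmi ping") would go around this whole block
    if ! timeout 60 /usr/sbin/ipmi-oem dell get-system-info idrac-info; then
        echo "idrac info unavailable (BMC wedged or query timed out)" >&2
    fi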