[12:45:10] About how many gigabytes of logs does toolforge generate in a day?
[12:46:31] that's a reasonable question and I suspect we don't have a good way to answer it except very approximately...
[12:46:53] * andrewbogott ponders
[12:49:36] Very approximately is probably close enough, what's an order of magnitude among friends
[12:55:02] In closing https://toolsadmin.wikimedia.org/tools/membership/status/1834 and https://toolsadmin.wikimedia.org/tools/membership/status/1835 should I mention that they appear to be duplicates of 1831 or just decline without comment?
[12:55:22] I'm running a big 'find' command which will show the total accumulated logs on nfs, which I think will give us an upper bound... unless there's a bunch of k8s state that doesn't get written to files
[12:55:42] I think it's fine to note them as duplicates for future readers.
[12:56:40] topranks: are you in any cloud-vps projects? You should be able to reproduce the dns problem from basically any VM -- I can make you one if you like. I also emailed you a suggestion for testing on a bunch of VMs at once.
[12:57:01] yeah I am
[12:57:14] I tried to reproduce before but let me try again
[12:59:26] I'm thinking it might be sensitive to traffic level because when I run the test on 30+ vms at once I usually get failures right away.
[12:59:49] If you want I can use cumin to cause a flood of traffic while you run your test, we can see if that improves your chances :)
[13:00:12] I searched around for conntrack errors last night but couldn't find any. Doesn't mean it's not a conntrack issue though.
[13:00:30] andrewbogott: I thought that the current state of things was that the kubectl logs were not stored in nfs?
[13:01:02] I fear that it's some of each.
[13:01:23] And in any case the number I'm looking for would only be barely useful.
[13:02:06] * andrewbogott googles for how to show all logs from all pods at once
[13:02:08] Alrighty. I'll do a little looking around and see if I can find a little more data on whether we're dealing with more than tens of gigabytes in a day. This is for guiding on what particular kind of loki install to try
[13:02:33] I would be very surprised if it's more than 1GB/day.
[13:06:43] andrewbogott: on what systems do you suspect it's hitting conntrack limits?
[13:07:07] seems unlikely to me tbh, but certainly that would cause occasional connectivity problems if the limit was occasionally being exceeded
[13:07:11] I didn't have much of a theory. cloudnet or cloudlb or cloudservices
[13:07:36] I checked cloudnet and cloudservices, also cloudgw, max is about 10% of limit on those
[13:07:47] yeah, that's how it looked to me. Not close to filling.
[13:08:05] same for cloudlb just had a look
[13:08:07] I also checked the size of the recursor cache and we're only at 35% or so there too, so not overflowing the cache.
[13:08:27] even if you were it'd get the answer and expire something else you'd expect (though I'm only guessing that)
[13:09:43] yeah, I was imagining some scenario where the recursor gets VERY upset and restarts itself or similar.
[13:09:55] but the recursor logs don't show any distress or errors
[13:10:18] (well, I mean, they show errors for random bogus requests but not for the requests that I'm seeing fail)
[13:10:51] any serious problem on the recursor would completely tank all of cloud, you'd expect
[13:11:17] Rook: any idea how to ask kubectl 'show me all logs for all deployments in all namespaces?' I can write a script to loop on the namespaces but maybe there's a one-liner that'll do it.
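(A minimal sketch of the per-namespace loop being discussed here, assuming cluster-wide read access with kubectl; note that `kubectl logs deployment/NAME` only shows one pod per deployment, so a per-pod variant would be needed for an accurate total.)

  # loop over every namespace and dump logs for each deployment in it
  for ns in $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}'); do
    for deploy in $(kubectl get deployments -n "$ns" -o jsonpath='{.items[*].metadata.name}'); do
      echo "=== $ns/$deploy ==="
      kubectl logs "deployment/$deploy" -n "$ns" --all-containers --tail=-1
    done
  done

(Piping each kubectl logs call through `wc -c` instead of printing it would give a rough byte count for the accumulated-size question above.)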
[13:11:46] topranks: I could believe that all recursor calls are failing everywhere for 500ms and most clients are well-behaved enough that they don't complain.
[13:12:02] And that the issues that are getting reported are for clients that are overly sensitive.
[13:12:08] All speculation though at this point
[13:13:09] (If I were a betting man I would bet on + )
[13:16:55] andrewbogott: I'm getting the feeling that a loop is going to be the answer. I'm starting to sketch one on my end
[13:17:15] ok!
[13:23:22] Rook: I think the 'how much per day' question is a hard one but if you dig up 'how much total accumulated as of right now' that will at least give you an upper bound, and hopefully the upper bound will be 'not so big that this will be a problem'
[13:55:14] what's the context on the log volume thing?
[14:03:34] Rook: ^ ?
[14:04:47] I'm trying to figure out which flavor of Loki to test. Monolithic is documented as not working well with volumes greater than 20 gigabytes a day
[14:17:29] taavi: don't you already have a prototype for Rook to play with?
[14:21:31] I have the manifests I use to run loki at home, and I might have some prior experiments saved somewhere
[14:21:52] is Rook going to work on Loki for Toolforge? or for some other project?
[14:22:48] yep, toolforge
[14:23:19] ooooh
[14:23:59] I've read through https://wikitech.wikimedia.org/wiki/User:Taavi/Loki_notes though don't know enough about loki to make sense of them so far
[14:28:16] you should ignore most of what I wrote there, I'm pretty sure it predates some rather nice promtail features that mean we can run it as a daemonset with a single instance handling everything
[14:28:47] https://git.majava.org/config/k8s-deploy/tree/deployment/apps/loki is what I'm doing at home, but it's missing multi-tenancy stuff
[14:30:02] i can look later today if I can find some more recent drafts for toolforge, but pretty much it should be following the loki docs for a medium-sized deployment with helm
[14:30:40] i think the main challenge will be figuring out a nice querying interface
[14:34:30] Grafana wouldn't be the desired querying interface?
[14:35:04] I was assuming that the options were either kibana or 'something tbd in the to-be-written tools UI'
[14:38:17] topranks: any luck producing that failure? If not I can crank up the traffic and see if that helps
[14:57:55] Rook: grafana is one option, yes. then you need to figure out how to provision access there
[14:58:47] that's interesting, I've never seen grafana used for text, only for numeric things
[14:58:48] i've also been dreaming of writing something that'll be integrated in striker instead of having yet another thing just for it
[14:59:15] yeah, I think 'integrated in striker' is equivalent to what I called 'tools ui'
[15:01:07] hopefully yes :-)
[15:01:27] Provisioning grafana access would be a question of authentication?
[15:07:35] yeah. automatically syncing groups from any external system is only in the "contact us for pricing" version of grafana. we have an existing script for the existing grafana.wikimedia.org (and grafana.wmcloud.org) installs to sync group data from LDAP, but i have no clue how well that works for what we want for toolforge (10-20x? more users, plus
[15:07:36] much more real time updates)
[15:09:18] Oh the issue is in access to grafana from the community?
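(For context on the querying discussion: Loki's query language, LogQL, is what Grafana's Explore view wraps, and it can also be used directly via Loki's logcli client. A rough example follows; the namespace label, Loki address, and tool name are hypothetical, not an agreed design.)

  # fetch recent error lines for one tool's namespace (label name and address are placeholders)
  logcli --addr=http://loki.example:3100 query --limit=50 '{namespace="tool-example"} |= "error"'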
[15:11:53] at least in my mind tool maintainers being able to access logs for their own tools is a rather important feature
[15:12:29] andrewbogott: basically grafana's "explore" feature gives you a nice graphical interface around Loki's prometheus-like custom query language. looks like this: https://prod-misc-upload.public.object.majava.org/taavi/pdBv8ujLJi9v9.png
[15:14:26] huh, yeah
[15:14:33] Ah I didn't realize that access was not had. I'll start looking into kibana
[15:15:48] Rook: I wouldn't get too deep into presentation because I think david is hoping for a trivial/first-pass presentation within the custom web ui. Of course if it's trivial to show users the firehose in Kibana we would probably /also/ want that eventually, but when we discussed it last week we kind of agreed that 98% of use cases will be 'show me the error message in the tail of the log'
[15:16:46] * taavi would really appreciate if whatever discussions you're having would be documented somewhere
[15:17:55] I think there are design docs about the proposed UI but i sure don't have them :(
[15:18:22] Sarai showed a few during a meeting but we mostly tore that design down immediately
[15:23:31] taavi: everything I know about it is from https://phabricator.wikimedia.org/T127367 I'm only learning about how loki is installed which I don't believe makes for interesting documentation.
[15:25:06] Could I get second opinions on https://toolsadmin.wikimedia.org/tools/membership/status/1837 I'm probably being overly suspicious from dealing with recent paws abuse. Though the account was created minutes before making the toolforge request, has no edits, and I find it odd to have such an extent of research for seemingly limited engagement.
[15:26:59] I would probably approve it but I agree that the lack of edit activity is unusual
[15:29:52] Alrighty, I'll approve it
[15:36:38] Rook: anyway, if I was tasked to get a logging solution for toolforge up and running, I'd probably start from getting a Loki instance up and running on K8s, integrated with the toolforge-deploy scripts and backed by the ceph s3 thing
[15:37:24] Sounds good, thanks
[15:37:53] then coming up with the promtail config needed to ingest pod logs there "should" be relatively straightforward. there might be a few non-obvious things, like doing per-namespace quotas or something, and we probably want to exclude things like the ingress-nginx access logs at least for now
[16:50:10] Raymond_Ndibe: in theory you're on clinic this week, is that OK? I created a copy/pasted etherpad which might need some work before Thursday.
[17:08:09] andrewbogott: I think raymond is on PTO this week
[17:08:18] oh dang
[17:08:41] (and I'm off today and tomorrow, just lurking as one does)
[17:09:08] dhinus: want to be on clinic this week? Since you're already doing the work?
[17:11:47] I can do it until Thursday yep
[17:12:02] or until Friday if easier
[17:12:39] we only have 1 working day next week, then global holiday starting Tuesday
[17:25:05] I will probably take it next week since my family is time-shifting the holidays to the next week
[17:25:10] thanks
[17:31:18] I think we can also pause all clinic duty activities during global holidays... and resume on Jan 2nd?
[17:31:48] in that case, I can cover until global holidays, and we need someone who can cover Jan 2nd -> Jan 9th
[17:32:06] We could. Lots of volunteers catch up on projects over break so it's nice to keep them unblocked.
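(A rough sketch of the kind of promtail scrape config described at [15:37:53] above: promtail running as a daemonset, discovering pod logs and dropping the ingress-nginx access logs. The label and namespace names are assumptions for illustration, not an agreed design, and a real config would still need the usual path/pipeline settings.)

  scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        # drop ingress-nginx access logs, at least for now
        - source_labels: [__meta_kubernetes_namespace]
          regex: ingress-nginx
          action: drop
        # keep the pod's namespace as a queryable label (one namespace per tool)
        - source_labels: [__meta_kubernetes_namespace]
          target_label: namespace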
[17:32:46] true that, I guess it can be a more "have a look every couple days" kind of thing, if you want to do it
[17:35:39] * andrewbogott nods
[18:15:17] andrewbogott: I'm about to log off, but I just noticed there were some "interesting" kernel errors logged by cloudgw1002 a few hours ago T382220
[18:15:18] T382220: KernelError Server cloudgw1002 may have kernel errors - https://phabricator.wikimedia.org/T382220
[18:15:51] I saw those too, are those different from what we'd expect after a reboot?
[18:16:01] uptime is 61 days
[18:16:10] ok.
[18:16:17] I guess I'm trained to ignore kernel warnings now, that's not good
[18:16:28] I will see what I can see
[18:16:54] I think the alerts need some improvement, but so far they have found non-expected errors a few times
[18:17:27] so it's generally worth double checking the full logs when they fire
[18:17:47] ok
[18:21:46] this looks related: T382222
[18:21:47] T382222: ProbeDown Service wan.cloudgw.eqiad1.wikimediacloud.org:0 has failed probes (icmp_wan_cloudgw_eqiad1_wikimediacloud_org_ip4) - https://phabricator.wikimedia.org/T382222
[18:37:07] this graph doesn't look good https://phabricator.wikimedia.org/T382222#10407145
[18:39:38] is that graph still dipping, even though the cloudgw hosts seem to have recovered?
[18:39:52] topranks: if you're still working all of ^ may be of interest
[18:40:21] I'm still trying to make sense of it, but I _think_ it means the success rate for that probe is becoming worse
[18:40:31] but it's still around 99% success per day
[18:40:56] there were a few failures overnight, even later than the last kernel error was reported
[18:41:56] This would be a good time to reboot that cloudgw but I'm not 100% familiar with the failover scheme.
[18:42:10] me neither
[18:42:30] it doesn't seem critical, but it seems suspicious...
[18:45:30] cloudgws use keepalived, and should(TM) failover fully automatically
[18:47:38] Thanks taavi! That's what i thought but I still don't want to try it without a network engineer watching
[18:51:42] I think we can wait until tomorrow (hopefully)
[18:51:46] * dhinus checks who is on call :P
[18:54:41] Rook is on call until 3 UTC, I am on call later. I'm 99% confident no one will be paged and we can wait until t.opranks or a.rturo are around.
[18:55:30] I've added a 90-day graph to the phab task, it's definitely going down, but it might be unrelated to the kernel panic from last night
[18:56:15] * dhinus vanishes in offline-land
[21:09:28] No particular reason to think it's anything to do with the reported build issues anyway
[21:12:02] A ping probe with 99% success seems ok to me
[21:12:30] the drop a few weeks back I'd almost certainly ascribe to the cloudgw resource exhaustion due to the DOS attacks that were being run from the paws boxes
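(For the failover question at [18:41:56]/[18:45:30]: a minimal sketch of how one might check the keepalived state on each cloudgw before rebooting, assuming a VRRP-managed virtual IP; the VIP shown is a placeholder.)

  # does this host currently hold the VRRP virtual IP? (replace VIP_ADDRESS with the real cloudgw VIP)
  ip -brief address show | grep -F 'VIP_ADDRESS'
  # keepalived service health and recent MASTER/BACKUP transitions
  systemctl status keepalived --no-pager
  journalctl -u keepalived --since '1 hour ago' --no-pager | grep -iE 'master|backup'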