[06:06:48] hello folks
[06:07:07] the kafka logging cluster seems to have some warnings about disk space used for /srv
[06:07:39] there are a couple of topics with big partitions on disk
[06:07:57] udp_localhost-warning and rsyslog-notice
[06:09:17] the traffic pattern changed yesterday afaics
[06:09:18] https://grafana-rw.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=logging-eqiad&var-kafka_broker=All&var-topic=rsyslog-notice&from=now-7d&to=now
[06:10:03] it was showing more spikes before, but we may need to apply some log deletion/retention policies to reclaim space
[06:11:17] udp_localhost-warning seems to be more stable over the past days
[06:14:21] (maybe there is also some log source that is too spammy, I am checking the topics)
[06:25:04] there are some logs that keep repeating, maybe there is a task about it
[06:37:57] <_joe_> elukey: I'm confused. I was under the impression that kafka has infinite storage and we can just add new disks or servers seamlessly
[06:38:17] <_joe_> you know, given some people want to move every transaction in our databases forever into kafka
[06:38:29] <_joe_> with retention time: infinite
[06:39:21] <_joe_> basically, confluent created a real-life turing tape and so on
[06:41:16] _joe_ you always forget that we are not in the "cloud" so our elasticity is limited by our on-premise capabilities that don't allow us to scale properly
[06:41:43] and we are also not serverless
[06:42:00] you're making it really really hard for me no not just troll everyone in here
[06:42:05] Not Helpful
[06:42:14] *not to just
[06:42:45] side topic: how many kafka clusters do we have atm if anyone knows?
[06:43:06] <_joe_> 2x main, 2x logging, 1x jumbo
[06:43:31] and one testing :)
[06:43:45] 6 total and likely to expand. gotcha
[06:59:54] I commented in https://phabricator.wikimedia.org/T279342 about the current issue, it seems like steady growth, but nothing really horrible in the immediate future
[07:52:07] PSA today is the last day of https://sigops.org/s/conferences/hotos/2021/
[07:52:25] I read https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf yesterday and I liked it
[08:01:30] nice! thanks for sharing!
[08:02:37] sure no problem!
[09:59:35] phab now has an "Other Assignee" field
[10:10:15] <_joe_> XioNoX: oh that's great
[10:39:47] we'll be testing librenms paging through AM shortly, there will be a VO page that can be ignored
[10:42:40] that would be it I guess :-)
[10:42:42] godog: got it
[10:43:12] hehe yep! thanks folks, shouldn't be too confusing
[10:44:23] godog: given we're on topic, can I ask you if there is a solution for splunk oncall to keep being logged in on the android app?
[10:46:02] volans: mmhh I recall us raising this issue with their support and it was supposedly fixed, how long are you able to stay logged in? are you logging in via google SSO?
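Tying back to the retention note at [06:10:03]: a minimal sketch of what a per-topic retention override could look like using the kafka-python admin client. The broker address and the seven-day value are made up for illustration; this is not the tooling or policy actually applied to the logging cluster.

```python
# Sketch only: lower retention.ms on a spammy topic so older log segments
# become eligible for deletion. Broker address and retention value are
# hypothetical; note alter_configs replaces the topic's dynamic config wholesale.
from kafka.admin import ConfigResource, ConfigResourceType, KafkaAdminClient

admin = KafkaAdminClient(bootstrap_servers="kafka-logging.example.org:9092")

seven_days_ms = 7 * 24 * 60 * 60 * 1000
resource = ConfigResource(
    ConfigResourceType.TOPIC,
    "rsyslog-notice",
    configs={"retention.ms": str(seven_days_ms)},
)
admin.alter_configs([resource])
admin.close()
```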
[10:46:37] yes to google SSO, I don't recall when the last time I logged in was, I opened the page and got redirected to login
[10:46:59] that might be an issue during an incident, I can just relogin and let you know if it re-happens :)
[10:47:11] I had to relogin again too
[10:47:47] mmhh I haven't in a while, but it definitely happened to me before
[10:51:51] but yes please report whenever you have to relogin and approximately how long the session lasts
[10:51:59] I'm opening a task for that
[10:58:15] ok filed as T284215 feel free to edit the description and report your experience
[10:58:15] T284215: Splunk oncall / victorops mobile app logout tracking - https://phabricator.wikimedia.org/T284215
[15:22:18] XioNoX: Wondering if you'd any thoughts on this.
[15:22:42] There is some discrepancy in Netbox for doh3002.
[15:23:09] topranks: ah?
[15:23:14] Or more to the point an IP element is there, not attached to any device, with the DNS name "doh3002.wikimedia.org"
[15:23:35] The IP attached to the VM primary interface also has the same DNS name
[15:23:36] https://netbox.wikimedia.org/ipam/prefixes/10/ip-addresses/
[15:23:40] I can explain
[15:23:44] 91.198.174.8 ?
[15:23:47] yeah
[15:24:06] So I added the peering on the CRs to .9, but logging on to the VM it is configured with .8
[15:24:09] i love how volans quickly put in the 'i can explain' to head off spiraling into nerdsnipes ;D
[15:24:19] haha
[15:24:28] robh: and you know the fun part? this is doh3002 and I can't explain it :)
[15:24:40] I thought it was the others I've looked at in the past days
[15:24:48] and I have no specific context on this specific one, sorry :)
[15:25:07] bug in the provisioning script?
[15:25:12] i can hear the sound of topranks's heart breaking from here volans ; D
[15:25:14] it still shows its interface as ##PRIMARY## too
[15:25:17] hopes dashed.
[15:25:57] actually I can..
[15:26:08] robh: absolutely, that moment of clarity I was hoping for just never arrived :)
[15:26:10] volans: https://netbox.wikimedia.org/extras/scripts/results/873119/ (dry run)
[15:26:18] so XioNoX that's T263768 that I need to piack up since Cas has left
[15:26:19] T263768: Ganeti -> Netbox sync: run PuppetDB import on new VMs - https://phabricator.wikimedia.org/T263768
[15:26:24] srsly i could feel it!
[15:26:25] *pick
[15:26:27] heh
[15:26:34] volans is an emotional rollercoaster
[15:26:38] I know
[15:26:55] lol
[15:27:04] we can run the import from puppetdb script manually for this if that's creating confusion
[15:27:07] let me do it
[15:27:26] {done}
[15:27:30] that's my link above, lgtm to run/commit it
[15:27:57] XioNoX: The requested page does not exist.
[15:28:16] I think the dry runs are not saved as they are deleted right after
[15:28:35] I think they're implemented as a DB transaction so that it just doesn't get committed and is trashed at the end
[15:28:39] but not sure
[15:28:55] volans: deleting 91.198.174.9/25 then?
[15:28:57] and matching v6
[15:29:01] I clicked it and it worked.
[15:29:09] But if I re-click now I get 404
[15:29:21] XioNoX: why deleting?
[15:29:27] https://netbox.wikimedia.org/ipam/ip-addresses/8703/
[15:29:48] volans: because .9 doesn't exist :)
[15:29:58] ?
[15:29:59] (it's not on the host)
[15:30:30] Cool.... I see what you did changed the Netbox IP to match what is on the VM in production XioNoX
[15:30:37] So deleting .9 makes sense to me.
[15:30:48] I'll change the CR config to peer with .8
[15:30:50] wait
[15:30:59] the script should do that
[15:31:48] delete .9 ?
[15:32:04] IIRC yes, it should
[15:32:19] if not in puppetdb, but only in some cases
[15:32:59] checking code
[15:33:12] no .9 in https://puppetboard.wikimedia.org/node/doh3002.wikimedia.org
[15:33:35] yeah I checked
[15:36:07] no, apparently we're deleting only the interfaces to cover the case of iface renaming on OS upgrade
[15:36:54] how did we end up with this? is this something that will happen again with normal usage?
[15:37:00] or was it a one-time rename/renumbering
[15:37:11] volans: was going to ask you :)
[15:37:34] was the VM created twice?
[15:37:42] good question
[15:37:49] (sorry was just observing)
[15:37:57] doh3002 was indeed created twice
[15:38:11] but let me check something quickly
[15:38:44] ok checked
[15:38:57] https://phabricator.wikimedia.org/T283852#7130568
[15:39:07] seems like it was doh3001 which was created twice
[15:39:13] not 3002, which has the IP in question
[15:39:56] sukhe: according to SAL also 3002
[15:40:29] and the makevm cookbook doesn't yet properly roll back the pre-assigned IP when it fails between the IP being assigned and the VM being created
[15:40:30] ah 20:27 dzahn@cumin1001: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh3002.wikimedia.org
[15:40:33] there is a task for that
[15:40:33] yeah
[15:40:39] we hit a disk limit
[15:40:43] so both will need one IP cleaned up
[15:40:46] probably
[15:41:43] topranks: and yeah you can update the CR to peer with .8 instead of .9
[15:41:44] XioNoX, topranks, sukhe: so yeah we need to clean up the additional IPs, if any
[15:42:13] XioNoX: if you wanna give a +1 https://gerrit.wikimedia.org/r/c/operations/homer/public/+/697993/
[15:42:22] doh3001 doesn't seem to have any additional IPs, but yeah
[15:43:02] doh3001 has that same interface name "##PRIMARY##"
[15:43:08] Although it has the correct IP.
[15:43:18] volans: all good now
[15:43:29] The .9 address is gone now, did you update that XioNoX?
[15:43:34] yep
[15:43:44] and ran the script that renamed ##PRIMARY## to the proper interface
[15:43:50] Cool, I see doh3001 interface name is now correct.
[15:43:52] Cool.
[15:43:58] You deleted .9 manually?
[15:44:00] nice
[15:44:05] topranks: yep
[15:44:24] topranks: +1 to your cr CR :)
[15:44:41] ok thanks for all the help guys :)
[15:44:57] indeed, thank you topranks, XioNoX, volans :)
[15:52:33] np, anytime
[15:56:12] sukhe: also, doh! there was an issue :-P
[15:56:40] * volans couldn't resist
[15:56:50] haha
[20:36:25] guessing j.bond's off duty - anybody have time for https://gerrit.wikimedia.org/r/c/operations/puppet/+/696024 ?
[20:37:08] (not super urgent, but would be nice.)
[20:40:14] the syntax lgtm but given the content, I'd rather get a review from an SRE more confident wrt the security implications
[20:40:49] obviously we want that port open, I'm just not the right person to double-check all the prerequisites
[20:44:08] rzl: fair enough, thanks.
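For the doh3002 cleanup above ([15:23:14] onwards): a hedged sketch of how an orphaned IP record like the stray .9 could be spotted through the Netbox API with pynetbox, rather than via the import-from-PuppetDB script discussed in the log. The token is a placeholder, and the assigned_object check assumes a NetBox version that exposes that field on IP addresses.

```python
# Sketch only: list IP address objects with a given DNS name and flag any
# that are not attached to an interface, which is roughly how the stray .9
# record for doh3002 looked. Token is a placeholder.
import pynetbox

nb = pynetbox.api("https://netbox.wikimedia.org", token="REDACTED")

for ip in nb.ipam.ip_addresses.filter(dns_name="doh3002.wikimedia.org"):
    orphaned = ip.assigned_object is None
    print(f"{ip.address} orphaned={orphaned}")
    # In the log above the orphaned record was deleted by hand in the UI;
    # ip.delete() would do the same through the API.
```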