[06:44:29] netops, Infrastructure-Foundations, Patch-For-Review: Adjust routing policy to increase SSH session speed from East Asia to toolforge - https://phabricator.wikimedia.org/T334530 (ayounsi) a:ayounsi
[06:52:31] netops, Infrastructure-Foundations, SRE, Puppet, good first task: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (ayounsi)
[07:29:01] netops, Infrastructure-Foundations, SRE, Patch-For-Review, and 2 others: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (ayounsi) a:ayounsi @MoritzMuehlenhoff is it ok to bump the RAM from 4G to 6G on the rpki* VMs? https://netbox.wikimedia.org/virtualization/virtual-machin...
[07:37:50] netops, Infrastructure-Foundations, SRE, Patch-For-Review, and 2 others: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (MoritzMuehlenhoff) >>! In T300955#9086089, @ayounsi wrote: > @MoritzMuehlenhoff is it ok to bump the RAM from 4G to 6G on the rpki* VMs? https://netbox.wikim...
[07:39:28] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Implement better filter on BGP_Customer_out - https://phabricator.wikimedia.org/T340448 (ayounsi) a:ayounsi
[07:48:26] netops, Infrastructure-Foundations, SRE, Patch-For-Review, and 2 others: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (ops-monitoring-bot) VM rpki2002.codfw.wmnet rebooted by ayounsi@cumin1001 with reason: None
[07:54:39] netops, Infrastructure-Foundations, SRE, Patch-For-Review, and 2 others: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (ops-monitoring-bot) VM rpki1001.eqiad.wmnet rebooted by ayounsi@cumin1001 with reason: bump ram to 6g
[09:51:42] I've withdrawn the authdns Anycast prefix from BGP announcements in esams now
[09:52:06] Only ~60 req/sec now coming in to dns3001/3002
[09:52:25] And it's not 24 hours since we changed ns2 A record in our own DNS yet, so some may still have that cached
[09:52:39] overall looking very good though, no massive amount of "bad caching" observed
[10:03:33] yeah, `dns3001:~$ sudo tcpdump host 91.198.174.239` still shows some traffic
[10:06:52] similarly there is the long tail of not respecting the TTL toward esams text/upload https://w.wiki/7ELy
[12:04:06] netops, Infrastructure-Foundations, SRE, Puppet, good first task: Routinator: use tmpfs - https://phabricator.wikimedia.org/T300955 (ayounsi) Open→Resolved All done!
[13:51:56] topranks: hi! I am going to withdraw dns300x from the authdns pool
[13:52:06] there is also the static routes on cr*-esams for ns2 that point to dns3001 and 2
[13:52:39] we should also remove those, preferably before removing the servers from the authdns pool because once removed, they don't get any updates
[13:52:53] I think it best to wait until 16:15 UTC as that will be 24 hours since we changed our own A record for ns2.wikimedia.org
[13:53:25] yes that's fine, we can definitely wait. but the order seems OK to you?
[13:53:33] that could legitimately be cached until then so removing before doesn't seem wise
[13:53:47] yeah withdraw statics first, then remove from authdns pool
[13:53:50] and do you want to remove the static routes or should I do it and then remove it from the pool?
[13:53:57] either is fine by me, but wanted to ask
[13:54:26] I don't think we are missing anything else
[13:54:29] I'll take care of it as there is a manual change on the esams CRs that a homer run would try to remove
[13:54:39] ok thank you then
[13:54:49] and once you are done, let me know and I will merge the authdns change
[13:55:13] there is also authdns_addrs:
[13:55:16] ok will do :)
[13:55:24] in hieradata/common.yaml, line 1225
[13:55:40] theoretically, we can and should remove this as well, given that ns2 is out of the picture now, or will be
[13:55:44] but leave that to me, I will check
[13:55:51] [old ns2 IP]
[13:56:28] seems to be used for the monitoring and setting the loopback IPs
[13:56:36] which for the anycast one is already taken care of
[13:56:47] I am pretty sure of this one, just need to see if it's being used in other places
[13:57:23] makes sense to me but yep good to double check
[14:04:00] I am going to make this a separate patch
[14:04:02] just in case
[14:20:24] in case someone wants to give a review https://puppet-compiler.wmflabs.org/output/948142/42858/dns1005.wikimedia.org/index.html [dns300x removal]
[14:20:28] changes look fine to me
[14:20:57] CR is https://gerrit.wikimedia.org/r/c/operations/puppet/+/948142/
[15:30:40] yep +1 those changes look ok I think
[15:40:59] thanks!
[16:09:17] sukhe: getting close to that 24h mark
[16:09:30] I was thinking maybe does it make more sense to merge the puppet change first?
[16:09:49] as in do any other systems try to reach that IP, which will stop after puppet merge?
[16:10:03] thinking
[16:10:11] might be safer in that case to leave it reachable until then, then remove the statics
[16:10:25] ok
[16:10:28] the danger is someone makes a change in our dns in the meantime, dns300x doesn't update, and someone gets an old record
[16:10:47] but we don't have dns changes very often, and there is virtually nobody using ns2 anyway now
[16:10:54] (down to less than 4 req/sec)
[16:11:01] yeah no anycast traffic as well
[16:11:08] ok I am merging both patches
[16:11:09] one by one
[16:11:15] ok cool
[16:11:20] need to check a few things after merge and then you can remove the routes
[16:11:24] I guess I will give it ~30 mins then and remove the statics
[16:11:48] ok
[16:16:00] topranks: starting now
[16:25:41] netops, Infrastructure-Foundations, SRE: Add per-output queue graphing for Juniper network devices in LibreNMS - https://phabricator.wikimedia.org/T326322 (ayounsi) Next steps here: * Decide which hosts will run gnmic, I can think of 4 options: ** netflowXXXX (my preferred option, as already monitori...
[16:26:57] netops, Infrastructure-Foundations, SRE: Add per-output queue monitoring for Juniper network devices - https://phabricator.wikimedia.org/T326322 (ayounsi)
[16:37:46] netops, Infrastructure-Foundations, Observability-Metrics, SRE, observability: Prometheus: ingest SONiC metrics - https://phabricator.wikimedia.org/T335027 (ayounsi) After more investigation, I'm going to roll out gNMIc for more real life testing. As it's multi-platform and should export the...
[16:41:36] sukhe: all ok your end? I'll remove the statics now unless any reason not to
[16:42:27] topranks: please wait for my yes
[16:42:33] I am trying to remove this alert for ns2-v4 paging
[16:42:36] don't want it to page again
[16:42:43] and I want to be sure it's removed from everywhere
[16:44:36] no rush at all :)
[17:05:33] topranks: mind reviewing https://gerrit.wikimedia.org/r/c/operations/puppet/+/948167 please?
[17:05:36] should be a quick one
[17:05:36] thank you
[17:06:23] np +1
[17:06:25] thanks
[17:07:15] I think you can remove the statics
[17:07:26] the hosts are gone out of authdns_servers anyway
[17:07:38] I will figure out why this monitoring exists but I think the static routes can go
[17:07:41] thoughts?
[17:11:03] should be ok I expect
[17:11:09] go for it :)
[17:11:15] I've to dash out for a few mins will take a look when I get back
[17:11:18] ok all good
[17:40:22] ok back
[17:40:34] did I miss anything exciting?
[17:40:41] sukhe: good to proceed and remove those routes still?
[17:42:11] topranks: yeah go for it
[17:42:25] cool will do
[17:42:28] I am still looking for the source of the Icinga alert but it doesn't matter as the hosts are not in the pool
[18:17:02] topranks: that's about it for today?
[18:17:09] guess we are not missing anything for what we needed to prepare
[18:17:27] yeah pretty much
[18:17:31] I've a few more IPs to add
[18:17:48] after which I'll have a dns patch for authdns with new 'includes' for the ipv6 linknets I'm adding
[18:17:52] that's about it
[18:18:04] cool, happy to review if need be
[18:18:08] thanks! gl
[18:18:34] yeah if you're around - but no probs either way
[19:10:23] Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[19:10:25] Traffic: Package and deploy ATS 9.2.1 - https://phabricator.wikimedia.org/T339134 (BCornwall)
[19:10:28] Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall) trafficserver is having difficulty building 9.1.4 on Bookworm. Considering there are some security fixes 9.2.1 brings and it builds fine on Bookworm we'll work on upgrading trafficserver a...
[20:18:23] Traffic, SRE, Incident Tooling: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 (BCornwall) Thank you for the clarification, @Vgutierrez. Do you have a suggestion on how to reconcile this? My instinct is to remove the abstraction entirely and ma...
[20:18:36] Traffic, SRE, Incident Tooling: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 (BCornwall) Open→Stalled
[20:23:00] Traffic, WMF-Legal, Patch-For-Review, Performance-Team (Radar), Privacy: Add no-transform to Cache-Control header - https://phabricator.wikimedia.org/T218618 (BCornwall) @Isaac Sorry for the horrendous response timing. I think that this would be best created in a new ticket. Thanks for bringi...
[20:33:17] Traffic, DNS: additional DNS changes for WikiLearn - https://phabricator.wikimedia.org/T344073 (Asaf)
[20:45:54] Traffic, Observability-Metrics, Patch-For-Review: Add prometheus-https load balancer - https://phabricator.wikimedia.org/T326657 (BCornwall) @fgiunchedi is that a matter of just updating https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/...