[06:49:33] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375#10810282 (10fgiunchedi) >>! In T371375#10808026, @cmooney wrote: >>>! In T371375#10807881, @cmooney wrote: >> Let me double check and report back. > > So i...
[09:12:19] hello! fyi, I'm looking at upgrading cr3-eqsin tomorrow and the esams routers on Wednesday, I'll depool the sites for the work
[09:14:16] XioNoX: I won't be around on Wednesday morning, hopefully fabfur will
[09:14:25] sounds good though
[09:14:48] if you have a preference on the timing (EU timezone) let me know, I'm flexible
[09:46:22] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10810721 (10cmooney)
[10:11:51] 10netops, 06Infrastructure-Foundations, 10netbox, 13Patch-For-Review: Netbox: librenms report errors - https://phabricator.wikimedia.org/T379907#10810797 (10ayounsi) 05Open→03Resolved a:03Volans Fixed by @Volans in https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/1135381
[12:10:40] FIRING: [2x] VarnishHighThreadCount: Varnish's thread count on cp5027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[12:11:09] FIRING: [8x] LVSHighCPU: The host lvs5005:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5005 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:14:57] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10811131 (10cmooney)
[12:15:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[12:16:09] RESOLVED: [8x] LVSHighCPU: The host lvs5005:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5005 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:30:26] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10811243 (10cmooney)
[12:32:21] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10811259 (10cmooney)
[12:35:00] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10811287 (10cmooney)
[13:00:27] \o So I have to reinstall (and IP-move) most of our k8s workers in eqiad. Since the IP will change for at least eight of them, pybal restarts would be needed. Naturally, I don't want to pester you eight times in a short timeframe to restart pybal. But after a reimage, the worker is not usable until a pybal restart. I'd do them all in one go, but that is not an option since prod
[13:00:29] traffic depends on them. Even half-half would be pushing it, as we don't have that kind of spare capacity. Any ideas on how we could do this? I am planning on doing no more than 1-2 machines a day.
[13:05:40] FIRING: [16x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:23:36] Guest1362: reinstall and IP move sounds like you're going to decom + install a new server?
[13:24:15] oops. lemme fix my nick
[13:25:06] maybe I'm missing something.. but why do you need a pybal restart?
[13:25:33] because pybal needs to pick up the changed IP, AIUI
[13:25:37] before decomm/reimage you remove the conftool entry for that server, that removes the server from pybal
[13:25:40] RESOLVED: [8x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:26:11] after the server is up & running you re-add it to conftool and pybal will pick it up on its own AFAIK
[13:26:41] Mh. maybe we/I forgot to do that last time, and that's what necessitated a pybal restart. I will give the remove/add approach a try on the first one I reimage.
[13:27:33] that way you avoid pybal obsessing over a realserver that's not there anymore
[13:27:58] and the depool threshold is enforced against the new reality of the impacted clusters
[13:28:31] given it could take a while for that realserver to come back (let's say you decomm it and for some reason you cannot reinstall it till the day after)
[13:30:30] Yeah, makes sense.
[13:31:02] XioNoX, topranks do we have the MAC address of the routers/L4 switches in puppet?
[13:31:26] no
[13:31:34] and we probably shouldn't :) why?
[13:31:40] context: I need to provide the MAC address of the default gateway to the liberica control plane to configure katran
[13:31:51] ha yeah that was my question, it's not really in any of our systems
[13:32:07] can you do ARP/ND from the kernel to get it?
[13:32:47] I can implement that in userland and "auto configure" the XDP program
[13:33:57] I know it's probably a lot of trouble but I think it may be easier than trying to keep tabs on the MAC addresses overall
[13:35:01] it should be as easy as fetching the default gateway of the system and performing the ARP lookup
[13:37:50] (famous last words)
[13:40:29] oh.. it looks like I can query the RIB fairly easily
[13:40:35] https://pkg.go.dev/golang.org/x/net/route
[13:41:50] or not...
[13:41:54] that only supports *BSD :)
[13:45:17] it looks like netlink is the way to go on Linux
[13:45:19] let's see :)
[13:51:44] 06Traffic: Benchmark different options - https://phabricator.wikimedia.org/T393671#10811765 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=42dc30b3-0a6e-47db-af62-f310b24aee4d) set by fabfur@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Testing in progress ` cp700...
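[Editor's note: to illustrate the approach discussed above — query the kernel routing table over netlink for the default route, then resolve the gateway's MAC address from the neighbour table — here is a minimal Go sketch. It assumes the third-party github.com/vishvananda/netlink package, which is not necessarily what the liberica control plane uses; it only mirrors the behaviour of the `./gateway` tool whose output appears later in the log.]

```go
// gateway.go — hypothetical sketch, not the actual liberica code.
package main

import (
	"fmt"
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// List the main IPv4 routing table via netlink and pick the default
	// route (represented by a nil destination or a /0 prefix).
	routes, err := netlink.RouteList(nil, netlink.FAMILY_V4)
	if err != nil {
		log.Fatalf("listing routes: %v", err)
	}
	var gw *netlink.Route
	for i, r := range routes {
		if r.Gw == nil {
			continue
		}
		if r.Dst == nil {
			gw = &routes[i]
			break
		}
		if ones, _ := r.Dst.Mask.Size(); ones == 0 {
			gw = &routes[i]
			break
		}
	}
	if gw == nil {
		log.Fatal("no IPv4 default route found")
	}

	// Look the gateway IP up in the neighbour (ARP) table of the
	// outgoing interface to obtain its MAC address.
	neighs, err := netlink.NeighList(gw.LinkIndex, netlink.FAMILY_V4)
	if err != nil {
		log.Fatalf("listing neighbours: %v", err)
	}
	for _, n := range neighs {
		if n.IP.Equal(gw.Gw) {
			fmt.Printf("IP = %s, MAC address = %s\n", gw.Gw, n.HardwareAddr)
			return
		}
	}
	log.Fatalf("gateway %s not found in the neighbour table", gw.Gw)
}
```

[A STALE neighbour entry still carries the last known MAC, so a lookup like this should normally succeed; as discussed below, a production implementation might still want to refresh or probe the entry rather than assume it is always present.]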
[14:01:05] vgutierrez: yeah that would be ideal
[14:09:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.224:443 @ cp7001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[14:11:35] ^^ me
[14:14:42] $ ./gateway
[14:14:42] route = {Family:2 DstLength:0 SrcLength:0 Tos:0 Table:254 Protocol:4 Scope:0 Type:1 Flags:0 Attributes:{Dst: Src: Gateway:192.168.88.1 OutIface:2 Priority:100 Table:254 Mark:0 Pref: Expires: Metrics: Multipath:[]}}
[14:14:59] so fetching the default route was fairly easy :)
[14:15:45] XioNoX, topranks is it safe to assume that the default gateway should always be in the neighbour table?
[14:15:59] I was thinking about this
[14:16:10] I think it is safe (it might be flagged STALE)
[14:16:29] But the prometheus scrapes are more frequent than the ARP timeout so I don't think it should ever disappear
[14:39:40] $ ./gateway
[14:39:40] IP = 192.168.88.1, MAC address = 84:aa:9c:af:08:03
[14:39:40] $ ip neighbor |grep 88.1
[14:39:40] 192.168.88.1 dev enp86s0 lladdr 84:aa:9c:af:08:03 REACHABLE
[14:40:01] a small amount of netlink magic
[14:44:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.224:443 @ cp7001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[14:52:28] 06Traffic, 10Liberica: control plane should fetch default gateway MAC address dynamically - https://phabricator.wikimedia.org/T393903 (10Vgutierrez) 03NEW
[14:52:34] 06Traffic, 10Liberica: control plane should fetch default gateway MAC address dynamically - https://phabricator.wikimedia.org/T393903#10812004 (10Vgutierrez) p:05Triage→03Medium
[15:40:11] 06Traffic: Benchmark different options - https://phabricator.wikimedia.org/T393671#10812455 (10Fabfur) 05Open→03Resolved Latest benchmarks done: https://wikitech.wikimedia.org/wiki/User:FFurnari-WMF/HaproxyGeoIPTest#Using_benchmark-curl_script_targeting_cp7001 We can proceed with tests on single hosts t...
[16:15:24] 06Traffic: Deploy geoip lookup script on 2 hosts - https://phabricator.wikimedia.org/T393927 (10Fabfur) 03NEW
[17:03:02] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Move connections on ssw1-f1-codfw to match normal pattern - https://phabricator.wikimedia.org/T393936 (10cmooney) 03NEW p:05Triage→03Medium
[17:03:53] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Move connections on ssw1-f1-codfw to match normal pattern - https://phabricator.wikimedia.org/T393936#10812996 (10cmooney)
[18:42:33] 06Traffic: Update libvmod-netmapper to 1.10 - https://phabricator.wikimedia.org/T392533#10813446 (10BCornwall)
[19:22:24] 06Traffic, 06SRE: Long-running throttling/timeouts during batch uploads of images to Commons - https://phabricator.wikimedia.org/T393938#10813651 (10Aklapper)
[19:52:39] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Move connections on ssw1-f1-codfw to match normal pattern - https://phabricator.wikimedia.org/T393936#10813758 (10cmooney)
[19:53:21] hey Traffic!
We merged https://gerrit.wikimedia.org/r/c/operations/dns/+/1143891/4/templates/wmnet + ran authdns-update and I'm still getting NXDOMAIN for `search-chi.svc.eqiad.wmnet` and other domains I'd expect, even after wiping cache. Any ideas?
[19:58:36] sukhe looks like `sudo -i authdns-update` failed on dns1004, should I roll back the changes?
[19:59:04] `error: plugin_geoip: Invalid resource name 'disc-search-chi-https' detected from zonefile lookup
[19:59:04] error: Name 'search-chi.discovery.wmnet.': resolver plugin 'geoip' rejected resource name 'disc-search-chi-https'`
[19:59:25] checking
[20:00:08] cool, I have a tmux up on dns1004 if you want to see the full output
[20:02:40] maybe we need a DYNC record instead of DYNA?
[20:03:01] I am trying to follow what the intent was
[20:04:32] The general idea is to implement DNS discovery for the search services. Right now, they are all using the same hostname per DC (search.svc.${dc}.wmnet) but with different ports
[20:06:06] inflatador:
[20:06:06] disc-search-chi-https => { map => mock, dcmap => { mock => 192.0.2.1 } }
[20:06:10] disc-search-psi-https => { map => mock, dcmap => { mock => 192.0.2.1 } }
[20:06:13] disc-search-omega-https => { map => mock, dcmap => { mock => 192.0.2.1 } }
[20:06:16] but in wmnet:
[20:06:44] hm nevermind
[20:09:51] it's hard to reconstruct the full context from this
[20:09:59] let's revert, and then can you add me to the reviews again?
[20:10:09] i suspect the problem is that in puppet we have search-https, search-psi-https, and search-omega-https, but here we have search-chi-https, search-psi-https, and search-omega-https. Sorry about that :(
[20:10:10] and that way we can unblock authdns-update, which is currently broken
[20:10:19] sukhe sure np, will revert ASAP
[20:11:15] ebernhardson: but there is search-psi-https?
[20:11:25] ah, not search-*chi*-https though
[20:11:42] sorry, still wrapping my head around these different names and how they interact :P
[20:12:34] yeah, it's not great. This is (hopefully) one of the steps to isolate them better
[20:12:44] so you have search-omega-https as well, so that's fine
[20:12:51] 06Traffic, 06DC-Ops, 10ops-eqiad, 06SRE: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10813835 (10BCornwall) @VRiley-WMF Yes, A7 as detailed in T387145#10720903. It's idling at the moment and can be serviced. Thanks!
[20:13:06] OK, change is reverted
[20:13:10] thank you
[20:13:25] running authdns-update
[20:13:50] inflatador: ebernhardson: can you add me to the reviews for this please? I will go over it tomorrow and we can start again
[20:14:06] if you have the full context and we think it's only search-*chi* missing, that makes sense and we can try doing that right now
[20:14:43] sukhe nah, we can wait until tomorrow. We'll CC you on the patches
[20:14:48] thanks
[20:15:01] all clean now, thanks for taking care of it folks
[20:36:48] 06Traffic, 06DC-Ops, 10ops-esams, 06SRE: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10813893 (10RobH) They finally answered back first asking simple questions like if the network port or cable are bad (they aren't) and then after another 48 hours requesting firmwar...
[21:22:46] 06Traffic, 06DC-Ops, 10ops-codfw: hw troubleshooting: Memory failure for cp2029.codfw.wmnet - https://phabricator.wikimedia.org/T393968 (10BCornwall) 03NEW
[21:42:35] 06Traffic, 06DC-Ops, 10ops-esams, 06SRE: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10814103 (10RobH) idrac updated, applying bios now
[22:03:05] 06Traffic, 06DC-Ops, 10ops-esams, 06SRE: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10814233 (10RobH) bios updated, applying nic firmware update now
[22:14:53] 06Traffic, 06DC-Ops, 10ops-esams, 06SRE: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10814251 (10RobH) NIC updated. @ssingh: I'll let this sit idle for a day or so and we can see if it errors, if not can we then return to service and check for errors this week whil...
[22:17:10] 06Traffic, 10RESTBase, 10RESTBase Sunsetting, 06serviceops-radar, 10Content-Transform-Team (Work In Progress): Block traffic to RESTBase /page/related endpoint and sunset it - https://phabricator.wikimedia.org/T376297#10814281 (10DDFoster96) The API documentation at https://en.wikipedia.org/api/rest_...
[22:33:07] anyone from traffic online?
[22:44:59] \o
[22:45:01] topranks: What's up?
[22:45:20] hey brett
[22:45:30] I was wondering if maybe you could take a look at this patch:
[22:45:30] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1144666
[22:45:48] seems we missed adding these vlans to the lvs hosts in eqiad when we brought racks e8/f8 online
[22:46:00] okay, looking
[22:46:41] thanks <3
[22:49:42] +1. Thanks for doing that!
[22:54:45] np brett thanks for looking :)
[22:55:16] can we go ahead and merge it do you think?
[22:55:25] or should we roll out gradually?
[23:04:12] this again?! can't wait for it to go away
[23:05:15] topranks: how are you rolling it out
[23:05:37] I disabled puppet on lvs1017-19, and I'm doing a run on lvs1020 now to add it there
[23:05:41] +1
[23:05:45] +1 on the going away :)
[23:05:46] it's late for you dude
[23:05:53] let us take care of it
[23:06:40] looks good there fwiw
[23:07:23] ah it's no bother
[23:07:27] looks good where?
[23:07:40] I have not verified the IPs because I'm not near a computer, so you and brett have
[23:07:43] 1020
[23:07:48] https://puppetboard.wikimedia.org/report/lvs1020.eqiad.wmnet/67d548752b7f2a137f0f12e1ea3ca417b075b3c0
[23:08:01] should be the link
[23:08:20] i mean, it won't break anything on puppet anyway
[23:08:55] the IPs are ok yeah
[23:09:10] I guess you need to just verify the connectivity once merged but also presumably, it has to be merged everywhere before a real test?
[23:09:11] I'm just not seeing any config added to e/n/i on lvs1020
[23:09:26] there is no vlan1061 interface for instance
[23:09:39] though vlan1061 is mentioned in /etc/systemd/system/multi-user.target.wants/ipip-multiqueue-optimizer.service
[23:09:50] ipip host
[23:10:52] no I don't think so, or at least I think the suspicion on the search team was that the L2 was broken to that vlan on the lvs
[23:11:06] I assumed they wouldn't have said so if it was using IPIP
[23:12:39] do we know which service?
[23:12:54] search.svc.eqiad.wmnet
[23:13:44] talking to cirrussearch1124.eqiad.wmnet on 10.64.166.2
[23:16:58] sorry. getting to a computer but I am 10 mins away
[23:18:14] ah don't stress it
[23:18:19] also I thought we fixed this one??
[23:18:48] https://phabricator.wikimedia.org/P75947
[23:21:39] we probably did not reboot it?
[23:21:42] :)
[23:21:51] also this is bullseye
[23:22:03] so not rebooted recently for the bookworm stuff either
[23:22:26] probably yeah
[23:22:37] what is the uptime?
[23:22:51] 292 days
[23:23:03] yeah
[23:23:07] yep that is it
[23:23:26] do a reboot on a single host, try again?
[23:23:44] but then that also means the other ones need to be rebooted
[23:26:04] or just remove manually
[23:26:13] though that does not explain /e/n/i missing
[23:26:20] not sure on that
[23:38:09] yeah that is the issue, reboot won't fix it
[23:46:34] happy to look when I'm home
[23:46:39] what's the state of the search thing?
[23:55:06] ok looking. I don't recall the puppetization of it
[23:55:09] it's been a while