[06:49:33] 10netops, 06Infrastructure-Foundations, 06SRE, 13Patch-For-Review: Do we need prometheus-ethtool-exporter? - https://phabricator.wikimedia.org/T371375#10810282 (10fgiunchedi) >>! In T371375#10808026, @cmooney wrote: >>>! In T371375#10807881, @cmooney wrote: >> Let me double check and report back. > > So i...
[09:12:19] hello! fyi, I'm looking at upgrading cr3-eqsin tomorrow and the esams routers on Wednesday, I'll depool the sites for the work
[09:14:16] XioNoX: I won't be around on Wednesday morning, hopefully fabfur will
[09:14:25] sounds good though
[09:14:48] if you have a preference on the timing (EU timezone) let me know, I'm flexible
[09:46:22] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10810721 (10cmooney)
[10:11:51] 10netops, 06Infrastructure-Foundations, 10netbox, 13Patch-For-Review: Netbox: librenms report errors - https://phabricator.wikimedia.org/T379907#10810797 (10ayounsi) 05Open→03Resolved a:03Volans Fixed by @Volans in https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/1135381
[12:10:40] FIRING: [2x] VarnishHighThreadCount: Varnish's thread count on cp5027:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[12:11:09] FIRING: [8x] LVSHighCPU: The host lvs5005:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5005 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:14:57] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10811131 (10cmooney)
[12:15:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[12:16:09] RESOLVED: [8x] LVSHighCPU: The host lvs5005:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5005 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[12:30:26] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10811243 (10cmooney)
[12:32:21] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10811259 (10cmooney)
[12:35:00] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: codfw: setup MPC10E-10C and SCBE3 - https://phabricator.wikimedia.org/T393552#10811287 (10cmooney)
[13:00:27] \o So I have to reinstall (and IP-move) most of our k8s workers in eqiad. Since the IP will change for at least eight of them, pybal restarts would be needed. Naturally, I don't want to pester you eight times in a short timeframe to restart pybal. But after a reimage, the worker is not usable until a pybal restart. I'd do them all in one go, but that is not an option since prod
[13:00:29] traffic depends on them. Even half-half would be pushing it, as we don't have that kind of spare capacity. Any ideas on how we could do this? I am planning on doing no more than 1-2 machines a day.
[13:05:40] FIRING: [16x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:23:36] Guest1362: reinstall and IP move sounds like you're going to decom + install a new server?
[13:24:15] oops. lemme fix my nick
[13:25:06] maybe I'm missing something.. but why do you need a pybal restart?
[13:25:33] because pybal needs to pick up the changed IP, AIUI
[13:25:37] before decomm/reimage you remove the conftool entry for that server, that removes the server from pybal
[13:25:40] RESOLVED: [8x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[13:26:11] after the server is up & running you re-add it to conftool and pybal will pick it up on its own AFAIK
[13:26:41] Mh. maybe we/I forgot to do that last time, and that's what necessitated a pybal restart. I will give the remove/add approach a try on the first one I reimage.
[13:27:33] that way you avoid pybal obsessing over a realserver that's not there anymore
[13:27:58] and the depool threshold is enforced against the new reality of the impacted clusters
[13:28:31] given it could take a while for that realserver to come back (let's say you decomm it and for some reason you cannot reinstall it till the day after)
[13:30:30] Yeah, makes sense.
[13:31:02] XioNoX, topranks do we have the MAC address of the routers/L4 switches in puppet?
[13:31:26] no
[13:31:34] and we probably shouldn't :) why?
[13:31:40] context: I need to provide the MAC address of the default gateway to the liberica control plane to configure katran
[13:31:51] ha yeah that was my question, it's not really in any of our systems
[13:32:07] can you do ARP/ND from the kernel to get it?
[13:32:47] I can implement that in userland and "auto configure" the XDP program
[13:33:57] I know it's probably a lot of trouble but I think it may be easier than trying to keep tabs on the MAC addresses overall
[13:35:01] it should be as easy as fetching the default gateway of the system and performing the ARP lookup
[13:37:50] (famous last words)
[13:40:29] oh.. it looks like I can query the RIB fairly easily
[13:40:35] https://pkg.go.dev/golang.org/x/net/route
[13:41:50] or not...
[13:41:54] that only supports *BSD :)
[13:45:17] it looks like netlink is the way to go on Linux
[13:45:19] let's see :)
[13:51:44] 06Traffic: Benchmark different options - https://phabricator.wikimedia.org/T393671#10811765 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=42dc30b3-0a6e-47db-af62-f310b24aee4d) set by fabfur@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: Testing in progress ` cp700...
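[Editor's note: to illustrate the approach discussed above — query the kernel routing table over netlink for the default route, then resolve the gateway's MAC address from the neighbour table — here is a minimal Go sketch. It assumes the third-party github.com/vishvananda/netlink package, which is not necessarily what the liberica control plane uses; it only mirrors the behaviour of the `./gateway` tool whose output appears later in the log.]

```go
// gateway.go — hypothetical sketch, not the actual liberica code.
package main

import (
	"fmt"
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// List the main IPv4 routing table via netlink and pick the default
	// route (represented by a nil destination or a /0 prefix).
	routes, err := netlink.RouteList(nil, netlink.FAMILY_V4)
	if err != nil {
		log.Fatalf("listing routes: %v", err)
	}
	var gw *netlink.Route
	for i, r := range routes {
		if r.Gw == nil {
			continue
		}
		if r.Dst == nil {
			gw = &routes[i]
			break
		}
		if ones, _ := r.Dst.Mask.Size(); ones == 0 {
			gw = &routes[i]
			break
		}
	}
	if gw == nil {
		log.Fatal("no IPv4 default route found")
	}

	// Look the gateway IP up in the neighbour (ARP) table of the
	// outgoing interface to obtain its MAC address.
	neighs, err := netlink.NeighList(gw.LinkIndex, netlink.FAMILY_V4)
	if err != nil {
		log.Fatalf("listing neighbours: %v", err)
	}
	for _, n := range neighs {
		if n.IP.Equal(gw.Gw) {
			fmt.Printf("IP = %s, MAC address = %s\n", gw.Gw, n.HardwareAddr)
			return
		}
	}
	log.Fatalf("gateway %s not found in the neighbour table", gw.Gw)
}
```

[A STALE neighbour entry still carries the last known MAC, so a lookup like this should normally succeed; as discussed below, a production implementation might still want to refresh or probe the entry rather than assume it is always present.]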
[14:01:05] vgutierrez: yeah that would be ideal
[14:09:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.224:443 @ cp7001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[14:11:35] ^^ me
[14:14:42] $ ./gateway
[14:14:42] route = {Family:2 DstLength:0 SrcLength:0 Tos:0 Table:254 Protocol:4 Scope:0 Type:1 Flags:0 Attributes:{Dst: Src: Gateway:192.168.88.1 OutIface:2 Priority:100 Table:254 Mark:0 Pref: Expires: Metrics: Multipath:[]}}
[14:14:59] so fetching the default route was fairly easy :)
[14:15:45] XioNoX, topranks is it safe to assume that the default gateway should always be in the neighbour table?
[14:15:59] I was thinking about this
[14:16:10] I think it is safe (it might be flagged STALE)
[14:16:29] But the prometheus scrapes are more frequent than the ARP timeout so I don't think it should ever disappear
[14:39:40] $ ./gateway
[14:39:40] IP = 192.168.88.1, MAC address = 84:aa:9c:af:08:03
[14:39:40] $ ip neighbor |grep 88.1
[14:39:40] 192.168.88.1 dev enp86s0 lladdr 84:aa:9c:af:08:03 REACHABLE
[14:40:01] a small amount of netlink magic
[14:44:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.224:443 @ cp7001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[14:52:28] 06Traffic, 10Liberica: control plane should fetch default gateway MAC address dynamically - https://phabricator.wikimedia.org/T393903 (10Vgutierrez) 03NEW
[14:52:34] 06Traffic, 10Liberica: control plane should fetch default gateway MAC address dynamically - https://phabricator.wikimedia.org/T393903#10812004 (10Vgutierrez) p:05Triage→03Medium
[15:40:11] 06Traffic: Benchmark different options - https://phabricator.wikimedia.org/T393671#10812455 (10Fabfur) 05Open→03Resolved Latest benchmarks done: https://wikitech.wikimedia.org/wiki/User:FFurnari-WMF/HaproxyGeoIPTest#Using_benchmark-curl_script_targeting_cp7001 We can proceed with tests on single hosts t...
[16:15:24] 06Traffic: Deploy geoip lookup script on 2 hosts - https://phabricator.wikimedia.org/T393927 (10Fabfur) 03NEW
[17:03:02] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Move connections on ssw1-f1-codfw to match normal pattern - https://phabricator.wikimedia.org/T393936 (10cmooney) 03NEW p:05Triage→03Medium
[17:03:53] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Move connections on ssw1-f1-codfw to match normal pattern - https://phabricator.wikimedia.org/T393936#10812996 (10cmooney)
[18:42:33] 06Traffic: Update libvmod-netmapper to 1.10 - https://phabricator.wikimedia.org/T392533#10813446 (10BCornwall)
[19:22:24] 06Traffic, 06SRE: Long-running throttling/timeouts during batch uploads of images to Commons - https://phabricator.wikimedia.org/T393938#10813651 (10Aklapper)
[19:52:39] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Move connections on ssw1-f1-codfw to match normal pattern - https://phabricator.wikimedia.org/T393936#10813758 (10cmooney)
[19:53:21] hey Traffic!
We merged https://gerrit.wikimedia.org/r/c/operations/dns/+/1143891/4/templates/wmnet + ran authdns-update and I'm still getting NXDOMAIN for `search-chi.svc.eqiad.wmnet` and other domains I'd expect, even after wiping cache. Any ideas?
[19:58:36] sukhe looks like `sudo -i authdns-update` failed on dns1004, should I roll back the changes?
[19:59:04] `error: plugin_geoip: Invalid resource name 'disc-search-chi-https' detected from zonefile lookup
[19:59:04] error: Name 'search-chi.discovery.wmnet.': resolver plugin 'geoip' rejected resource name 'disc-search-chi-https'`
[19:59:25] checking
[20:00:08] cool, I have a tmux up on dns1004 if you want to see the full output
[20:02:40] maybe we need a DYNC record instead of DYNA?
[20:03:01] I am trying to follow what the intent was
[20:04:32] The general idea is to implement DNS discovery for the search services. Right now, they are all using the same hostname per DC (search.svc.${dc}.wmnet) but with different ports
[20:06:06] inflatador:
[20:06:06] disc-search-chi-https => { map => mock, dcmap => { mock => 192.0.2.1 } }
[20:06:10] disc-search-psi-https => { map => mock, dcmap => { mock => 192.0.2.1 } }
[20:06:13] disc-search-omega-https => { map => mock, dcmap => { mock => 192.0.2.1 } }
[20:06:16] but in wmnet:
[20:06:44] hm nevermind
[20:09:51] it's hard to reconstruct the full context from this
[20:09:59] let's revert, and then can you add me to the reviews again?
[20:10:09] i suspect the problem is that in puppet we have search-https, search-psi-https, and search-omega-https, but here we have search-chi-https, search-psi-https, and search-omega-https. Sorry about that :(
[20:10:10] and that way we can unblock authdns-update, which is currently broken
[20:10:19] sukhe sure np, will revert ASAP
[20:11:15] ebernhardson: but there is search-psi-https?
[20:11:25] ah, not search-*chi*-https though
[20:11:42] sorry, still wrapping my head around these different names and how they interact :P
[20:12:34] yeah, it's not great. This is (hopefully) one of the steps to isolate them better
[20:12:44] so you have search-omega-https as well, so that's fine
[20:12:51] 06Traffic, 06DC-Ops, 10ops-eqiad, 06SRE: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10813835 (10BCornwall) @VRiley-WMF Yes, A7 as detailed in T387145#10720903. It's idling at the moment and can be serviced. Thanks!
[20:13:06] OK, change is reverted
[20:13:10] thank you
[20:13:25] running authdns-update
[20:13:50] inflatador: ebernhardson: can you add me to the reviews for this please? I will go over it tomorrow and we can start again
[20:14:06] if you have the full context and we think it's only search-*chi* missing, that makes sense and we can try doing that right now
[20:14:43] sukhe nah, we can wait until tomorrow. We'll CC you on the patches
[20:14:48] thanks
[20:15:01] all clean now, thanks for taking care of it folks
[20:36:48] 06Traffic, 06DC-Ops, 10ops-esams, 06SRE: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10813893 (10RobH) They finally answered back first asking simple questions like if the network port or cable are bad (they aren't) and then after another 48 hours requesting firmwar...
[21:22:46] 06Traffic, 06DC-Ops, 10ops-codfw: hw troubleshooting: Memory failure for cp2029.codfw.wmnet - https://phabricator.wikimedia.org/T393968 (10BCornwall) 03NEW
[21:42:35] 06Traffic, 06DC-Ops, 10ops-esams, 06SRE: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10814103 (10RobH) idrac updated, applying bios now
[22:03:05] 06Traffic, 06DC-Ops, 10ops-esams, 06SRE: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10814233 (10RobH) bios updated, applying nic firmware update now
[22:14:53] 06Traffic, 06DC-Ops, 10ops-esams, 06SRE: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10814251 (10RobH) NIC updated. @ssingh: I'll let this sit idle for a day or so and we can see if it errors, if not can we then return to service and check for errors this week whil...
[22:17:10] 06Traffic, 10RESTBase, 10RESTBase Sunsetting, 06serviceops-radar, 10Content-Transform-Team (Work In Progress): Block traffic to RESTBase /page/related endpoint and sunset it - https://phabricator.wikimedia.org/T376297#10814281 (10DDFoster96) The API documentation at https://en.wikipedia.org/api/rest_...
[22:33:07] anyone from traffic online?
[22:44:59] \o
[22:45:01] topranks: What's up?
[22:45:20] hey brett
[22:45:30] I was wondering if maybe you could take a look at this patch:
[22:45:30] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1144666
[22:45:48] seems we missed adding these vlans to the lvs hosts in eqiad when we brought racks e8/f8 online
[22:46:00] okay, looking
[22:46:41] thanks <3
[22:49:42] +1. Thanks for doing that!
[22:54:45] np brett thanks for looking :)
[22:55:16] can we go ahead and merge it do you think?
[22:55:25] or should we roll out gradually?
[23:04:12] this again?! can't wait for it to go away
[23:05:15] topranks: how are you rolling it out
[23:05:37] I disabled puppet on lvs1017-19, and I'm doing a run on lvs1020 now to add it there
[23:05:41] +1
[23:05:45] +1 on the going away :)
[23:05:46] it's late for you dude
[23:05:53] let us take care of it
[23:06:40] looks good there fwiw
[23:07:23] ah it's no bother
[23:07:27] looks good where?
[23:07:40] I have not verified the IPs because I'm not near a computer, so you and brett have
[23:07:43] 1020
[23:07:48] https://puppetboard.wikimedia.org/report/lvs1020.eqiad.wmnet/67d548752b7f2a137f0f12e1ea3ca417b075b3c0
[23:08:01] should be the link
[23:08:20] i mean, it won't break anything on puppet anyway
[23:08:55] the IPs are ok yeah
[23:09:10] I guess you need to just verify the connectivity once merged but also presumably, it has to be merged everywhere before a real test?
[23:09:11] I'm just not seeing any config added to e/n/i on lvs1020
[23:09:26] there is no vlan1061 interface for instance
[23:09:39] though vlan1061 is mentioned in /etc/systemd/system/multi-user.target.wants/ipip-multiqueue-optimizer.service
[23:09:50] ipip host
[23:10:52] no I don't think so, or at least I think the suspicion on the search team was that the L2 was broken to that vlan on the lvs
[23:11:06] I assumed they wouldn't have said so if it was using IPIP
[23:12:39] do we know which service?
[23:12:54] search.svc.eqiad.wmnet
[23:13:44] talking to cirrussearch1124.eqiad.wmnet on 10.64.166.2
[23:16:58] sorry. getting to a computer but I am 10 mins away
[23:18:14] ah don't stress it
[23:18:19] also I thought we fixed this one??
[23:18:48] https://phabricator.wikimedia.org/P75947
[23:21:39] we probably did not reboot it?
[23:21:42] :)
[23:21:51] also this is bullseye
[23:22:03] so not rebooted recently for the bookworm stuff either
[23:22:26] probably yeah
[23:22:37] what is the uptime?
[23:22:51] 292 days
[23:23:03] yeah
[23:23:07] yep that is it
[23:23:26] do a reboot on a single host, try again?
[23:23:44] but then that also means the other ones need to be rebooted
[23:26:04] or just remove manually
[23:26:13] though that does not explain /e/n/i missing
[23:26:20] not sure on that
[23:38:09] yeah that is the issue, reboot won't fix it
[23:46:34] happy to look when I'm home
[23:46:39] what's the state of the search thing?
[23:55:06] ok looking. I don't recall the puppetization of it
[23:55:09] it's been a while