[00:38:56] (HAProxyEdgeTrafficDrop) firing: 62% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:43:56] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:16:56] (HAProxyEdgeTrafficDrop) firing: 66% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:21:56] (HAProxyEdgeTrafficDrop) resolved: 66% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:22:25] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE, 10Patch-For-Review: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10JMeybohm)
[08:50:03] 10Traffic, 10DNS, 10Infrastructure-Foundations, 10SRE, and 2 others: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (10Volans) The DNS Name field in Netbox is an FQDN; the Netbox UI help message for the field is: `Hostname or FQDN (not case...
[09:13:10] 10Traffic, 10DNS, 10Infrastructure-Foundations, 10SRE, and 2 others: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (10jbond) I'm not sure I understand this response. The value entered which caused an error was `ns-recursor0.openstack.codfw1de...
[09:16:11] 10Traffic, 10SRE, 10Developer Productivity, 10Performance-Team (Radar): Let X-Analytics response header pass through with WikimediaDebug - https://phabricator.wikimedia.org/T305794 (10fgiunchedi) p:05Triage→03Medium
[09:31:56] (HAProxyEdgeTrafficDrop) firing: 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:33:11] 10Traffic, 10DNS, 10Infrastructure-Foundations, 10SRE, and 2 others: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (10Volans) Sure, but they could cause various unwanted issues in different contexts, like not matching the fingerprint in the kno...
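A side note on T306809: the breakage comes down to the gap between a presentation-format DNS name, where a single trailing dot denotes the root, and the dot-less form most downstream consumers (SSH known_hosts fingerprint matching among them, per the comment above) expect. A minimal Python sketch of the kind of normalization a cookbook could apply before comparing names; the function name and sample FQDN are illustrative, not the actual sre.dns.netbox code:

    def normalize_fqdn(name: str) -> str:
        # DNS names are case-insensitive, and a single trailing dot
        # only marks the root zone; strip both differences.
        name = name.lower()
        return name[:-1] if name.endswith(".") else name

    # 'Ns-Recursor0.Example.Org.' and 'ns-recursor0.example.org' now compare equal
    assert normalize_fqdn("Ns-Recursor0.Example.Org.") == normalize_fqdn("ns-recursor0.example.org")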
[09:36:56] (HAProxyEdgeTrafficDrop) resolved: 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:56:56] (HAProxyEdgeTrafficDrop) firing: 58% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:58:58] 10Traffic, 10DNS, 10Infrastructure-Foundations, 10SRE, and 2 others: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (10jbond) 05Open→03Stalled As per an offline conversation with @Volans, newer versions of netbox allow us to perform [[ http...
[10:01:56] (HAProxyEdgeTrafficDrop) resolved: 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:10:56] (HAProxyEdgeTrafficDrop) firing: 63% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:15:56] (HAProxyEdgeTrafficDrop) resolved: 66% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:34:56] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE, 10Patch-For-Review: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10elukey) Created https://gerrit.wikimedia.org/r/786264 to kick off the discussion about the next steps, let...
[10:38:00] 10Traffic, 10DNS, 10SRE, 10Wikimedia Enterprise: 301 redirect setup for wikimediaenterprise - https://phabricator.wikimedia.org/T302756 (10Protsack.stephan)
[12:18:10] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE, 10Patch-For-Review: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) @elukey thanks for the patch, certainly looks ok to me, if indeed it works in terms of the Calico...
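On the Stalled note at 09:58:58 above: the link jbond cites is truncated in the log, but newer NetBox releases do ship a custom-validator hook that could reject period-terminated names at entry time. Purely a hypothetical sketch of that approach, assuming the NetBox 3.x CustomValidator API (the validate signature varies across releases); this is not the validator, if any, actually deployed:

    # configuration.py (NetBox) -- hypothetical sketch, assuming NetBox 3.x
    from extras.validators import CustomValidator

    class NoTrailingDot(CustomValidator):
        """Reject DNS names entered with a trailing root dot."""
        def validate(self, instance):
            dns_name = getattr(instance, "dns_name", "") or ""
            if dns_name.endswith("."):
                self.fail("DNS name must not end with '.'", field="dns_name")

    CUSTOM_VALIDATORS = {
        "ipam.ipaddress": [NoTrailingDot()],
    }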
[14:06:07] G'day. I am finishing setup on the ML staging k8s, and would like to do some configging and pybal restarting :)
[14:06:24] https://gerrit.wikimedia.org/r/c/operations/puppet/+/786319/ <- the change to switch the cluster to lvs_setup
[14:25:39] klausman: looking good from traffic's perspective
[14:26:01] Thank you
[14:26:14] running authdns-update atm
[15:11:16] vgutierrez: submitted the lvs_setup change, we'd now run puppet-merge and restart pybal on 2010, and then 2009 (unless stopped by fireworks)
[15:12:21] ack
[15:12:54] puppet-merge done
[15:14:19] agent run done.
[15:14:43] about to restart pybal on 2010, last chance to stop me :)
[15:14:47] go ahead ;P
[15:14:59] and restart done
[15:16:43] no appservers pooled for the cluster?
[15:16:50] yep I was about to say that
[15:16:55] I think they are pooled=false
[15:17:09] Yes, that was me last week, in an attempt to reduce alerts
[15:17:24] we should be able to just pool them (make sure the apiservers are running)
[15:17:36] klausman: yep let's do it
[15:17:40] pool them or lvs isn't gonna be happy
[15:19:06] yep it is not happy
[15:19:16] `You cannot pool a node where weight is equal to 0`
[15:19:26] I have no memory of setting a weight to 0
[15:19:40] it is the default IIRC, let's add weight 10 or 1 to both
[15:19:43] indeed
[15:19:47] the default is 0
[15:19:55] where is the weight set?
[15:20:32] in the same way as you enable it, it is another parameter
[15:20:37] set it at the same time... pooled=yes,weight=10
[15:21:11] yep --^
[15:22:31] I am puzzled.
[15:22:42] confctl commandline>
[15:22:44] ?
[15:23:27] sudo -i confctl select name=$hostname set/pooled=yes,weight=10
[15:23:30] klausman: --^
[15:24:07] from puppetmaster
[15:24:16] in this case we need ml-staging-ctrl2001 and 2002
[15:24:50] Done for both, with a warning logged
[15:25:00] (why there was a warning is unclear)
[15:26:03] klausman: you missed .codfw.wmnet
[15:26:37] now I see
[15:26:38] {"ml-staging-ctrl2001.codfw.wmnet": {"weight": 1, "pooled": "no"}, "tags": "dc=codfw,cluster=ml_staging,service=kubemaster"}
[15:27:16] with the fqdn I get an error
[15:27:23] $ sudo -i confctl select name=ml-staging-ctrl2001.codfw.wmnet set/pooled=yes,weight=10
[15:27:25] ERROR:conftool:Invalid action, reason: Could not parse set instructions: pooled=yes,weight=10
[15:27:52] ah ok, it is not ',' but ':'
[15:28:07] There we go
[15:28:24] lvs2010 ~ $ sudo ipvsadm -L |grep ml-st
[15:28:25] "The syntax for the set action is: set/key1=value1:key2=value2. "
[15:28:26] -> ml-staging-ctrl2001.codfw.wm Route 10 4 1
[15:28:28] -> ml-staging-ctrl2002.codfw.wm Route 10 0 0
[15:28:30] super
[15:29:28] Apr 26 15:28:02 lvs2010 pybal[15788]: [ml-staging-ctrl_6443] INFO: Merged enabled server ml-staging-ctrl2001.codfw.wmnet, weight 10
[15:29:28] Apr 26 15:28:02 lvs2010 pybal[15788]: [ml-staging-ctrl_6443] INFO: Merged enabled server ml-staging-ctrl2002.codfw.wmnet, weight 10
[15:29:47] looking good now on lvs2010
[15:29:56] +1, let's do 2009
[15:30:16] Alertmanager has this: https://alerts.wikimedia.org/?q=alertname%3DPyBal%20IPVS%20diff%20check&q=%40receiver%3Dirc-spam
[15:30:44] ah, that's >16m old
[15:30:54] forced a re-check on icinga right now
[15:31:11] thanks vgutierrez
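To consolidate the conftool detour above: set instructions use ':' between key=value pairs (not ','), node selection wants the FQDN, and a node cannot be pooled while its weight is 0, so pooled and weight go in one invocation. The commands that worked, as run from the puppetmaster and then checked from lvs2010:

    sudo -i confctl select name=ml-staging-ctrl2001.codfw.wmnet set/pooled=yes:weight=10
    sudo -i confctl select name=ml-staging-ctrl2002.codfw.wmnet set/pooled=yes:weight=10
    # then, on the LVS host, confirm both realservers appear with weight 10
    sudo ipvsadm -L | grep ml-st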
[15:32:41] Ok, ready for 2009? Should be a lot less exciting than 2010 :)
[15:32:52] well.. lvs2009 handles some traffic, not like lvs2010 ;P
[15:33:21] Psh, details :)
[15:33:37] And I think 2010 should at least see monitoring traffic, right?
[15:33:46] nope
[15:33:52] lvs2010 generates monitoring traffic
[15:33:56] I do see ActiveConns!=0, tho
[15:33:56] to the backend servers
[15:34:10] but lvs2010 isn't getting any kind of traffic
[15:34:11] Ah
[15:34:28] Alright, restarting pybal on 2009 in 5s or so
[15:34:51] and done
[15:35:23] ah damn. Forgot to do the puppet agent run
[15:35:44] klausman: let's check ipvsadm first
[15:35:54] err :)
[15:36:06] No mention of ml-st in ipvsadm
[15:36:12] yeah, "The last Puppet run was at Tue Apr 26 15:10:04 UTC 2022 (25 minutes ago)"
[15:36:14] vgutierrez@lvs2009:~$ cat /etc/pybal/pybal.conf |grep 10.2.1.72
[15:36:14] vgutierrez@lvs2009:~$
[15:36:23] doing the agent run now
[15:36:43] I only see this in ipvsadm
[15:36:43] TCP 10.2.1.72:6443 wrr
[15:36:48] so yeah, incomplete
[15:37:14] agent run and another pybal run fixed it
[15:37:20] pybal restart*
[15:38:02] I can reach https://ml-staging-ctrl.svc.codfw.wmnet:6443 from the ml nodes
[15:38:11] Ditto
[15:38:15] lovely
[15:38:23] thanks vgutierrez!
[15:38:28] np
[15:39:20] Yup, thanks for overseeing my continued stumbling-about :)
[15:39:48] 🍿
[15:40:08] Well, at least I provide entertainment
[17:57:37] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) The above patch is working, however I'm not 100% sure the resulting config is what we need. Looking, for instance, at ml-se...
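The lvs2009 detour earlier reduces to ordering: pybal.conf is rendered by the Puppet agent, so restarting pybal before the agent run leaves the new service half-configured in IPVS (the VIP present, no realservers). A rough checklist distilled from the transcript; the systemd unit name and curl flags are assumptions, not verified against the hosts:

    sudo puppet agent --test              # render the new pybal.conf first
    grep 10.2.1.72 /etc/pybal/pybal.conf  # the service must be in the config before restarting
    sudo systemctl restart pybal          # assumption: pybal runs as a systemd unit
    sudo ipvsadm -L | grep ml-st          # realservers should now be listed under the VIP
    # from a client node, confirm the VIP answers; -k is an assumption, since an
    # anonymous probe of the apiserver is a reachability check only
    curl -k https://ml-staging-ctrl.svc.codfw.wmnet:6443/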