[00:38:56] (HAProxyEdgeTrafficDrop) firing: 62% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:43:56] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:16:56] (HAProxyEdgeTrafficDrop) firing: 66% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:21:56] (HAProxyEdgeTrafficDrop) resolved: 66% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:22:25] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE, 10Patch-For-Review: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10JMeybohm)
[08:50:03] 10Traffic, 10DNS, 10Infrastructure-Foundations, 10SRE, and 2 others: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (10Volans) The DNS Name field in Netbox is an FQDN; the Netbox UI help message for the field is: `Hostname or FQDN (not case...
[09:13:10] 10Traffic, 10DNS, 10Infrastructure-Foundations, 10SRE, and 2 others: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (10jbond) I'm not sure I understand this response. The value entered which caused an error was `ns-recursor0.openstack.codfw1de...
[09:16:11] 10Traffic, 10SRE, 10Developer Productivity, 10Performance-Team (Radar): Let X-Analytics response header pass through with WikimediaDebug - https://phabricator.wikimedia.org/T305794 (10fgiunchedi) p:05Triage→03Medium
[09:31:56] (HAProxyEdgeTrafficDrop) firing: 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:33:11] 10Traffic, 10DNS, 10Infrastructure-Foundations, 10SRE, and 2 others: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (10Volans) Sure, but they could cause various unwanted issues in different contexts, like not matching the fingerprint in the kno...
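A side note on T306809: the breakage comes down to the gap between a presentation-format DNS name, where a single trailing dot denotes the root, and the dot-less form most downstream consumers (SSH known_hosts fingerprint matching among them, per the comment above) expect. A minimal Python sketch of the kind of normalization a cookbook could apply before comparing names; the function name and sample FQDN are illustrative, not the actual sre.dns.netbox code:

    def normalize_fqdn(name: str) -> str:
        # DNS names are case-insensitive, and a single trailing dot
        # only marks the root zone; strip both differences.
        name = name.lower()
        return name[:-1] if name.endswith(".") else name

    # 'Ns-Recursor0.Example.Org.' and 'ns-recursor0.example.org' now compare equal
    assert normalize_fqdn("Ns-Recursor0.Example.Org.") == normalize_fqdn("ns-recursor0.example.org")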
[09:36:56] (HAProxyEdgeTrafficDrop) resolved: 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:56:56] (HAProxyEdgeTrafficDrop) firing: 58% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:58:58] 10Traffic, 10DNS, 10Infrastructure-Foundations, 10SRE, and 2 others: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (10jbond) 05Open→03Stalled As per an offline conversation with @Volans, newer versions of netbox allow us to perform [[ http...
[10:01:56] (HAProxyEdgeTrafficDrop) resolved: 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:10:56] (HAProxyEdgeTrafficDrop) firing: 63% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:15:56] (HAProxyEdgeTrafficDrop) resolved: 66% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:34:56] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE, 10Patch-For-Review: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10elukey) Created https://gerrit.wikimedia.org/r/786264 to kick off the discussion about the next steps, let...
[10:38:00] 10Traffic, 10DNS, 10SRE, 10Wikimedia Enterprise: 301 redirect setup for wikimediaenterprise - https://phabricator.wikimedia.org/T302756 (10Protsack.stephan)
[12:18:10] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE, 10Patch-For-Review: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) @elukey thanks for the patch, certainly looks ok to me, if indeed it works in terms of the Calico...
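On the Stalled note at 09:58:58 above: the link jbond cites is truncated in the log, but newer NetBox releases do ship a custom-validator hook that could reject period-terminated names at entry time. Purely a hypothetical sketch of that approach, assuming the NetBox 3.x CustomValidator API (the validate signature varies across releases); this is not the validator, if any, actually deployed:

    # configuration.py (NetBox) -- hypothetical sketch, assuming NetBox 3.x
    from extras.validators import CustomValidator

    class NoTrailingDot(CustomValidator):
        """Reject DNS names entered with a trailing root dot."""
        def validate(self, instance):
            dns_name = getattr(instance, "dns_name", "") or ""
            if dns_name.endswith("."):
                self.fail("DNS name must not end with '.'", field="dns_name")

    CUSTOM_VALIDATORS = {
        "ipam.ipaddress": [NoTrailingDot()],
    }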
[14:06:07] G'day. I am finishing setup on the ML staging k8s, and would like to do some configging and pybal restarting :)
[14:06:24] https://gerrit.wikimedia.org/r/c/operations/puppet/+/786319/ <- the change to switch the cluster to lvs_setup
[14:25:39] klausman: looking good from traffic's perspective
[14:26:01] Thank you
[14:26:14] running authdns-update atm
[15:11:16] vgutierrez: submitted the lvs_setup change, we'd now run puppet-merge and restart pybal on 2010, and then 2009 (unless stopped by fireworks)
[15:12:21] ack
[15:12:54] puppet-merge done
[15:14:19] agent run done.
[15:14:43] about to restart pybal on 2010, last chance to stop me :)
[15:14:47] go ahead ;P
[15:14:59] and restart done
[15:16:43] no appservers pooled for the cluster?
[15:16:50] yep I was about to say that
[15:16:55] I think they are pooled=false
[15:17:09] Yes, that was me last week, in an attempt to reduce alerts
[15:17:24] we should be able to just pool them (make sure the apiservers are running)
[15:17:36] klausman: yep let's do it
[15:17:40] pool them or lvs isn't gonna be happy
[15:19:06] yep it is not happy
[15:19:16] `You cannot pool a node where weight is equal to 0`
[15:19:26] I have no memory of setting a weight to 0
[15:19:40] it is the default IIRC, let's add weight 10 or 1 to both
[15:19:43] indeed
[15:19:47] the default is 0
[15:19:55] where is the weight set?
[15:20:32] in the same way as you enable it, it is another parameter
[15:20:37] set it at the same time... pooled=yes,weight=10
[15:21:11] yep --^
[15:22:31] I am puzzled.
[15:22:42] confctl commandline>
[15:22:44] ?
[15:23:27] sudo -i confctl select name=$hostname set/pooled=yes,weight=10
[15:23:30] klausman: --^
[15:24:07] from puppetmaster
[15:24:16] in this case we need ml-staging-ctrl2001 and 2002
[15:24:50] Done for both, with a warning logged
[15:25:00] (why there was a warning is unclear)
[15:26:03] klausman: you missed .codfw.wmnet
[15:26:37] now I see
[15:26:38] {"ml-staging-ctrl2001.codfw.wmnet": {"weight": 1, "pooled": "no"}, "tags": "dc=codfw,cluster=ml_staging,service=kubemaster"}
[15:27:16] with the fqdn I get an error
[15:27:23] $ sudo -i confctl select name=ml-staging-ctrl2001.codfw.wmnet set/pooled=yes,weight=10
[15:27:25] ERROR:conftool:Invalid action, reason: Could not parse set instructions: pooled=yes,weight=10
[15:27:52] ah ok, it is not ',' but ':'
[15:28:07] There we go
[15:28:24] lvs2010 ~ $ sudo ipvsadm -L |grep ml-st
[15:28:25] "The syntax for the set action is: set/key1=value1:key2=value2. "
[15:28:26] -> ml-staging-ctrl2001.codfw.wm Route 10 4 1
[15:28:28] -> ml-staging-ctrl2002.codfw.wm Route 10 0 0
[15:28:30] super
[15:29:28] Apr 26 15:28:02 lvs2010 pybal[15788]: [ml-staging-ctrl_6443] INFO: Merged enabled server ml-staging-ctrl2001.codfw.wmnet, weight 10
[15:29:28] Apr 26 15:28:02 lvs2010 pybal[15788]: [ml-staging-ctrl_6443] INFO: Merged enabled server ml-staging-ctrl2002.codfw.wmnet, weight 10
[15:29:47] looking good now on lvs2010
[15:29:56] +1, let's do 2009
[15:30:16] Alertmanager has this: https://alerts.wikimedia.org/?q=alertname%3DPyBal%20IPVS%20diff%20check&q=%40receiver%3Dirc-spam
[15:30:44] ah, that's >16m old
[15:30:54] forced a re-check on icinga right now
[15:31:11] thanks vgutierrez
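To consolidate the conftool detour above: set instructions use ':' between key=value pairs (not ','), node selection wants the FQDN, and a node cannot be pooled while its weight is 0, so pooled and weight go in one invocation. The commands that worked, as run from the puppetmaster and then checked from lvs2010:

    sudo -i confctl select name=ml-staging-ctrl2001.codfw.wmnet set/pooled=yes:weight=10
    sudo -i confctl select name=ml-staging-ctrl2002.codfw.wmnet set/pooled=yes:weight=10
    # then, on the LVS host, confirm both realservers appear with weight 10
    sudo ipvsadm -L | grep ml-st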
[15:32:41] Ok, ready for 2009? Should be a lot less exciting than 2010 :)
[15:32:52] well.. lvs2009 handles some traffic, not like lvs2010 ;P
[15:33:21] Psh, details :)
[15:33:37] And I think 2010 should at least see monitoring traffic, right?
[15:33:46] nope
[15:33:52] lvs2010 generates monitoring traffic
[15:33:56] I do see ActiveConns!=0, tho
[15:33:56] to the backend servers
[15:34:10] but lvs2010 isn't getting any kind of traffic
[15:34:11] Ah
[15:34:28] Alright, restarting pybal on 2009 in 5s or so
[15:34:51] and done
[15:35:23] ah damn. Forgot to do the puppet agent run
[15:35:44] klausman: let's check ipvsadm first
[15:35:54] err :)
[15:36:06] No mention of ml-st in ipvsadm
[15:36:12] yeah, "The last Puppet run was at Tue Apr 26 15:10:04 UTC 2022 (25 minutes ago)"
[15:36:14] vgutierrez@lvs2009:~$ cat /etc/pybal/pybal.conf |grep 10.2.1.72
[15:36:14] vgutierrez@lvs2009:~$
[15:36:23] doing the agent run now
[15:36:43] I only see this in ipvsadm
[15:36:43] TCP 10.2.1.72:6443 wrr
[15:36:48] so yeah, incomplete
[15:37:14] agent run and another pybal run fixed it
[15:37:20] pybal restart*
[15:38:02] I can reach https://ml-staging-ctrl.svc.codfw.wmnet:6443 from the ml nodes
[15:38:11] Ditto
[15:38:15] lovely
[15:38:23] thanks vgutierrez!
[15:38:28] np
[15:39:20] Yup, thanks for overseeing my continued stumbling-about :)
[15:39:48] 🍿
[15:40:08] Well, at least I provide entertainment
[17:57:37] 10netops, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10SRE: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) The above patch is working, however I'm not 100% sure the resulting config is what we need. Looking, for instance, at ml-se...
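The lvs2009 detour earlier reduces to ordering: pybal.conf is rendered by the Puppet agent, so restarting pybal before the agent run leaves the new service half-configured in IPVS (the VIP present, no realservers). A rough checklist distilled from the transcript; the systemd unit name and curl flags are assumptions, not verified against the hosts:

    sudo puppet agent --test              # render the new pybal.conf first
    grep 10.2.1.72 /etc/pybal/pybal.conf  # the service must be in the config before restarting
    sudo systemctl restart pybal          # assumption: pybal runs as a systemd unit
    sudo ipvsadm -L | grep ml-st          # realservers should now be listed under the VIP
    # from a client node, confirm the VIP answers; -k is an assumption, since an
    # anonymous probe of the apiserver is a reachability check only
    curl -k https://ml-staging-ctrl.svc.codfw.wmnet:6443/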