[06:01:56] (HAProxyEdgeTrafficDrop) firing: 42% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:06:56] (HAProxyEdgeTrafficDrop) resolved: (6) 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:04:47] Traffic, SRE, Patch-For-Review, Performance-Team (Radar), Upstream: ATS cache read p999 metrics shows up requests taking up to 1 second on cache read operations - https://phabricator.wikimedia.org/T317748 (Vgutierrez) It looks like I've found a mitigation for this issue, tested in [[ https://...
[10:43:11] akosiaris, _joe_ correct me if I'm wrong but it's my understanding that pybal should talk to any of the conftool servers available and not hit just one of them
[10:43:55] <_joe_> vgutierrez: correct, but we never implemented any form of failover
[10:43:59] so.. ideally instead of talking just to conf1009 it should hit conf100[7-9]
[10:44:21] right.. on the replacement I'm using an etcd client and not just an HTTP library
[10:44:34] so I can point it to an array of endpoints
[10:44:43] I am gonna say that it doesn't. It has one conf host in the config and if that one goes down, it does not talk to anyone else
[10:45:06] akosiaris: yep, that's the current scenario
[10:45:06] e.g. I see in the config
[10:45:08] config = etcd://conf1007.eqiad.wmnet:4001/conftool/v1/pools/eqiad/kubernetes/kubesvc/
[10:45:26] ah, by "should" you meant the ideal scenario?
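
Editor's note: the failover idea discussed above (try each of conf100[7-9] in turn instead of depending on one configured host) could be sketched roughly as below. This is only an illustration, not pybal's code or the replacement's. The names `first_reachable`, `fake_probe` and `ENDPOINTS` are hypothetical, and the domain and port for conf1008 and conf1009 are assumed to match the conf1007 config line quoted above.

```python
from typing import Callable, Iterable, Optional


def first_reachable(endpoints: Iterable[str],
                    probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint for which probe() succeeds, else None.

    probe is a hypothetical callback: a real client would attempt an etcd
    request here and treat connection errors as a failed probe.
    """
    for endpoint in endpoints:
        try:
            if probe(endpoint):
                return endpoint
        except OSError:
            # Connection refused, timeout, DNS failure: try the next endpoint.
            continue
    return None


# Hostnames expanded from "conf100[7-9]"; the domain and port are assumed
# to match the conf1007 config line and may differ in reality.
ENDPOINTS = [f"conf100{n}.eqiad.wmnet:4001" for n in (7, 8, 9)]


def fake_probe(endpoint: str) -> bool:
    """Simulate conf1007 being down and every other host being healthy."""
    if endpoint.startswith("conf1007"):
        raise ConnectionRefusedError(endpoint)
    return True


print(first_reachable(ENDPOINTS, fake_probe))
```

With a single configured host the loop has nothing to fall back to, which is the "current scenario" described in the conversation.
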
[10:45:58] in the ideal scenario it should use the DNS client srv record and pick any server, failing over to any other if it fails to connect
[10:46:24] could it read the _etcd._tcp.conftool dns records?
[10:47:09] yeah, what alex said :)
[10:47:44] indeed
[10:47:50] I have a NewSRVDiscover available
[10:48:04] so I could point it to eqiad.wmnet and let it discover the available endpoints
[10:52:37] vgutierrez: if it's RO then you can probably use _etcd._tcp instead, that will read from the local cluster in codfw, but I can see pro/cons of doing that
[10:52:51] the conftool records are the ones considered RW and hence point to the current active etcd cluster
[11:00:51] netops, Infrastructure-Foundations, cloud-services-team (Kanban): Prometheus & librenms differences in traffic graphs - https://phabricator.wikimedia.org/T320395 (aborrero)
[11:01:24] pybal is definitely R/O. _etcd._tcp should be fine. conftool uses _etcd._tcp.conftool anyway
[11:02:03] netops, Infrastructure-Foundations, cloud-services-team (Kanban): Prometheus & librenms differences in traffic graphs - https://phabricator.wikimedia.org/T320395 (aborrero)
[11:02:31] netops, Infrastructure-Foundations, cloud-services-team (Kanban): Prometheus & librenms differences in traffic graphs - https://phabricator.wikimedia.org/T320395 (aborrero) p:Triage→Lowest
[11:05:47] which happens because the configured SRV domain is "conftool.%{::site}.wmnet"
[11:05:55] for conftool specifically that is ^
[11:17:05] hmm
[11:17:08] vgutierrez@lvs6001:~$ ./l4lb etcd --domain eqiad.wmnet
[11:17:08] 2022/10/10 11:16:02 dns lookup errors: lookup _etcd-client-ssl-conftool._tcp.eqiad.wmnet on 10.3.0.1:53: no such host and lookup _etcd-client-conftool._tcp.eqiad.wmnet on 10.3.0.1:53: no such host
[11:17:24] it seems like we are missing the -ssl variant for the conftool cluster?
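
Editor's note: the two lookups in the error output above suggest how the discovery names are built: a TLS variant (`_etcd-client-ssl`) is tried first, then the plain one (`_etcd-client`), each with an optional service-name suffix. A rough sketch of that naming follows. The pattern is inferred from the error message only, `srv_candidates` is a hypothetical helper rather than any part of the etcd client API, and no DNS query is performed.

```python
from typing import List


def srv_candidates(domain: str, service_name: str = "") -> List[str]:
    """Build the SRV record names a discovery client might look up.

    The naming pattern is inferred from the lookup errors in the log,
    so treat it as an assumption rather than a specification.
    """
    suffix = f"-{service_name}" if service_name else ""
    return [
        f"_etcd-client-ssl{suffix}._tcp.{domain}",  # TLS variant, tried first
        f"_etcd-client{suffix}._tcp.{domain}",      # plaintext fallback
    ]


# Same inputs as the failing run in the log: domain eqiad.wmnet, service
# name "conftool" (the service name is inferred from the record names).
for name in srv_candidates("eqiad.wmnet", "conftool"):
    print(name)
```

Neither of these names shares a prefix with `_etcd._tcp`, the record mentioned earlier in the conversation, which is consistent with both lookups returning "no such host".
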
[11:18:17] actually we don't have _etcd-client SRV records at all
[11:18:35] just _etcd-server
[11:20:15] per https://etcd.io/docs/v3.3/op-guide/clustering/#dns-discovery it seems that those should be added
[11:20:44] I'll open a task
[11:20:54] (after lunch)
[11:26:21] Traffic, SRE, serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez)
[11:26:31] ^^ reported there
[11:26:32] Traffic, SRE, serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) p:Triage→Medium
[12:10:31] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Prometheus & librenms differences in traffic graphs - https://phabricator.wikimedia.org/T320395 (ayounsi) LibreNMS has a 5min resolution, while Prometheus is much more fine grained.
[12:12:24] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Prometheus & librenms differences in traffic graphs - https://phabricator.wikimedia.org/T320395 (cmooney) The difference is just to do with sampling / how it's graphed. The Prometheus query there is using irate([5m]), which I...
[12:21:04] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Prometheus & librenms differences in traffic graphs - https://phabricator.wikimedia.org/T320395 (cmooney) > at least one of them is not accurate Neither of them is accurate. It's almost impossible to have an accurate represe...
[12:49:16] Traffic, SRE, serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) Checking the client implementation for `go.etcd.io/etcd/client/v2 v2.305.4` it looks like the SRV discoverer shares code with v3: https://github.com/etcd-io/etcd/blob...
[12:52:36] Traffic, SRE, serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Joe) The correct domain to test for read-only clients is `conftool.eqiad.wmnet`, see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/templates/...
[12:57:28] Traffic, SRE, serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) you're right in that regard: ` vgutierrez@lvs6001:~$ ./l4lb etcd --domain conftool.eqiad.wmnet 2022/10/10 12:55:44 dns lookup errors: lookup _etcd-client-ssl._tcp.co...
[12:59:28] Traffic, SRE, serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Joe) yeah this changed with v3. The problem is that AIUI confd uses an older version of the library and expects the simpler form we have now. We can either add a new set of rec...
[13:36:37] Traffic, SRE, serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Volans) >>! In T320397#8304869, @Joe wrote: > The correct domain to test for read-only clients is `conftool.eqiad.wmnet`, see https://gerrit.wikimedia.org/r/plugins/gitiles/oper...
[13:57:56] Traffic, SRE, serviceops: _etcd-client SRV record missing for conftool cluster - https://phabricator.wikimedia.org/T320397 (Vgutierrez) hmm from the mentioned documentation on the task description: ` If etcd is using TLS, the discovery SRV record (e.g. example.com) must be included in the SSL certifi...
[14:25:52] netops, Infrastructure-Foundations, SRE: Default allowed SSH parameters on upgraded Juniper mgmt routers prevent some connections - https://phabricator.wikimedia.org/T320272 (ayounsi) For context: {T170369}
[15:19:01] netops, Infrastructure-Foundations, SRE: Default allowed SSH parameters on upgraded Juniper mgmt routers prevent some connections - https://phabricator.wikimedia.org/T320272 (cmooney) To clarify, I guess the question I was interested to know if people had opinions on was whether it would be a bad ide...
[17:35:04] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH)
[17:53:28] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH)
[17:56:30] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4008.ulsfo.wmnet with OS bullseye
[18:28:28] Traffic, decommission-hardware, ops-ulsfo: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (RobH) a:RobH
[18:38:40] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4008.ulsfo.wmnet with OS bullseye completed: - ganeti4...
[18:39:28] Traffic, SRE, decommission-hardware, ops-ulsfo: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin2002 for hosts: `dns4002.wikimedia.org` - dns4002.wikimedia.org (**PASS**) - Downtimed host on Icinga/Aler...
[18:44:09] Traffic, SRE, decommission-hardware, ops-ulsfo: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (RobH) a:RobH→ssingh This is ready for full decom from puppet repo and resolution.
[18:44:20] Traffic, SRE, decommission-hardware, ops-ulsfo: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (RobH)
[18:44:37] Traffic, SRE, decommission-hardware, ops-ulsfo: decommission dns4002 - https://phabricator.wikimedia.org/T320440 (RobH)
[18:44:43] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH)
[19:13:20] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH)
[19:18:19] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH)
[19:21:45] Traffic, SRE: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (BCornwall) In progress→Resolved
[20:42:03] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host dns4004.wikimedia.org with OS bullseye
[21:19:21] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host dns4004.wikimedia.org with OS bullseye completed: - dns4004...
[21:19:56] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH)
[21:21:16] Traffic, DC-Ops, SRE, ops-ulsfo, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH) @Ssingh: dns4004 installed fine, so it's ready for role and reimage as needed by #traffic. I also kicked the dns4002 decom task over to you for pu...