[07:20:32] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (dcaro) Related to T314847
[09:39:38] (LVSHighCPU) firing: (2) The host lvs5002:9100 has at least its CPU 14 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[09:44:38] (LVSHighCPU) resolved: (8) The host lvs5002:9100 has at least its CPU 0 saturated - https://bit.ly/wmf-lvscpu - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs5002 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighCPU
[09:49:45] Traffic, SRE, observability: mtail histograms don't work as expected - https://phabricator.wikimedia.org/T314922 (Vgutierrez) Adding Traffic as it's affecting to several traffic metrics
[10:06:25] Traffic, SRE, observability, Patch-For-Review: mtail histograms don't work as expected - https://phabricator.wikimedia.org/T314922 (fgiunchedi) Reported upstream as https://github.com/google/mtail/issues/675
[10:08:47] Traffic, SRE, observability, Patch-For-Review, Upstream: mtail histograms don't work as expected - https://phabricator.wikimedia.org/T314922 (Vgutierrez)
[10:13:07] netops, Cloud Services Proposals, Infrastructure-Foundations, SRE: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847 (ayounsi) Thanks for this task and the clear write-up. I agree with the overall problem statement and ideas to solve it. Adding some th...
[13:31:22] netops, Infrastructure-Foundations, SRE: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (cmooney) Open→In progress p:Triage→Medium
[13:47:31] netops, Infrastructure-Foundations, SRE: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (fgiunchedi) I suspect this being related to {T309074} cc @andrea.denisse
[13:55:59] netops, Infrastructure-Foundations, SRE: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (cmooney) a:cmooney→None
[14:03:16] netops, Infrastructure-Foundations, SRE: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (andrea.denisse) a:andrea.denisse
[14:03:45] netops, Infrastructure-Foundations, SRE: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (andrea.denisse) Thanks @cmooney and @fgiunchedi , I'll work on this today.
[14:33:56] (HAProxyEdgeTrafficDrop) firing: 54% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:18:56] (HAProxyEdgeTrafficDrop) resolved: 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:48:04] (HAProxyEdgeTrafficDrop) firing: 56% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:49:16] (VarnishTrafficDrop) firing: Varnish traffic in eqiad has dropped 66.36983952309849% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[15:52:56] (HAProxyEdgeTrafficDrop) resolved: (2) 64% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:54:16] (VarnishTrafficDrop) resolved: Varnish traffic in eqiad has dropped 66.7044721750061% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[16:33:26] Traffic, MediaWiki-General, SRE, Patch-For-Review: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (ori)
[16:35:56] (HAProxyEdgeTrafficDrop) firing: 55% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:40:56] (HAProxyEdgeTrafficDrop) resolved: 59% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:45:56] (HAProxyEdgeTrafficDrop) firing: 58% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:50:56] (HAProxyEdgeTrafficDrop) resolved: 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:30:09] (HAProxyEdgeTrafficDrop) firing: 62% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:30:35] (PurgedHighEventLag) firing: (3) High event process lag with purged on cp2033:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[17:34:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:35:35] (PurgedHighEventLag) resolved: (54) High event process lag with purged on cp2027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[17:41:58] netops, Infrastructure-Foundations, SRE, netbox, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (cmooney) For a bit of context the above patch will augment the existing vars under network.interfaces, potentially ad...
[18:00:35] Traffic, SRE, observability, Patch-For-Review, Upstream: mtail histograms don't work as expected - https://phabricator.wikimedia.org/T314922 (BCornwall) p:Triage→Medium
[18:00:49] Traffic, SRE, ops-eqiad: SSH on cp1089.mgmt is flapping - https://phabricator.wikimedia.org/T314951 (ssingh)
[18:02:11] Traffic, SRE, ops-eqiad: SSH on cp1089.mgmt is flapping - https://phabricator.wikimedia.org/T314951 (ssingh) p:Triage→Medium
[18:35:56] (HAProxyEdgeTrafficDrop) firing: 61% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[18:39:23] purged is still lagging; Does this require manual intervention?
[18:40:56] (HAProxyEdgeTrafficDrop) resolved: 63% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[18:49:31] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2042:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[18:54:10] Traffic, SRE, ops-eqiad: SSH on cp1089.mgmt is flapping - https://phabricator.wikimedia.org/T314951 (wiki_willy) a:Cmjohnson
[20:53:38] Traffic, MediaWiki-General, SRE: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (ori)
[21:05:21] bblack, vgutierrez: https://gerrit.wikimedia.org/r/c/operations/puppet/+/819677/ is good to go now, and is low-risk (testwiki only). ptal and sync when you have availability.
[21:21:24] netops, Infrastructure-Foundations, SRE: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (andrea.denisse) @cmooney Found an interesting behavior regarding the 'rancid' user: `topranks The systemd file for rancid exports it as an environment var I think topranks...
[21:39:09] netops, Infrastructure-Foundations, SRE: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (cmooney) Definitely an odd issue. For comparisons sake we can see that netmon1002 was also trying to save the host key, but it continued after the failure, whereas netmon10...
[21:59:05] ori: ack
[21:59:34] thanks
[22:20:36] netops, Infrastructure-Foundations, SRE: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (cmooney) There was some discussion on irc and interesting observations from Daniel about changes to OpenSSH between buster and bullseye which might account for the different...
[22:30:12] netops, Infrastructure-Foundations, SRE: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (Dzahn) Confirmed this. The behaviour changed in the newer openssh version in bullseye it seems. On buster we have 7.9, on bullseye we have 8.4 In buster we have in `ssh.c`...
[22:51:14] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2042:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown