[01:19:14] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (andrea.denisse) Hello team, after further testing it the least disruptive and simplest approach is to create the `.ssh` directory using Puppet. It nee...
[03:13:07] netops, Infrastructure-Foundations, Observability-Metrics, SRE Observability (FY2022/2023-Q1): LibreNMS seemingly not scraping many devices after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (CDanis)
[03:13:14] netops, Infrastructure-Foundations, Observability-Metrics, SRE Observability (FY2022/2023-Q1): LibreNMS seemingly not scraping many devices after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (CDanis) p:Triage→High
[03:14:23] netops, Infrastructure-Foundations, Observability-Metrics, SRE Observability (FY2022/2023-Q1): LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (CDanis)
[03:26:16] (VarnishTrafficDrop) firing: Varnish traffic in eqiad has dropped 67.70246608342956% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[03:26:56] (HAProxyEdgeTrafficDrop) firing: (5) 45% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[03:31:16] (VarnishTrafficDrop) firing: (3) Varnish traffic in eqiad has dropped 59.13671236125279% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[03:31:56] (HAProxyEdgeTrafficDrop) firing: (6) 47% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[03:33:08] Traffic, Performance-Team, SRE, SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling) I did a pywikibot edit on testwiki from my Dallas test instance. The time between the completion of the last codfw sessionstore write and the eqia...
[03:34:36] Traffic, Performance-Team, SRE, SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (tstarling)
[03:36:16] (VarnishTrafficDrop) resolved: (4) Varnish traffic in eqiad has dropped 59.106982879207024% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[03:36:56] (HAProxyEdgeTrafficDrop) resolved: (6) 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[03:47:29] netops, Infrastructure-Foundations, Observability-Metrics, SRE, SRE Observability (FY2022/2023-Q1): LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (ayounsi) Looks like permission issues: `name=netmon1003 ayounsi@...
[04:19:00] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (ayounsi) > @ayounsi do you anticipate any fallout from this? I agree that it's better to check host keys, so +1 as long as: * there is some kind of al...
[04:53:50] netops, Infrastructure-Foundations, Observability-Metrics, SRE, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (andrea.denisse) I think that the owner is override to '`deploy-librenms`' during the [[ ht...
[07:57:24] netops, Infrastructure-Foundations: Telia ulsfo-eqord transport link down - https://phabricator.wikimedia.org/T314978 (ayounsi) p:Triage→High
[07:58:03] netops, Infrastructure-Foundations: Telia ulsfo-eqord transport link down - https://phabricator.wikimedia.org/T314978 (ops-monitoring-bot) ===== Automated diagnostic for Netbox interface ID cr3-ulsfo:xe-0/1/1 --- **Interface cr3-ulsfo:xe-0/1/1** - admin-status: up - ⚠️ oper-status: down - interface-fl...
[08:05:36] netops, Infrastructure-Foundations: Telia ulsfo-eqord transport link down - https://phabricator.wikimedia.org/T314978 (ayounsi) Email sent to Telia's NOC - https://netbox.wikimedia.org/tenancy/contacts/21/
[08:06:25] netops, Infrastructure-Foundations: Telia ulsfo-eqord transport link down - https://phabricator.wikimedia.org/T314978 (ops-monitoring-bot) ===== Automated diagnostic for Netbox interface ID cr3-ulsfo:xe-0/1/1 --- **Interface cr3-ulsfo:xe-0/1/1** - admin-status: up - oper-status: up - interface-flapped...
[08:07:46] netops, Infrastructure-Foundations: Telia ulsfo-eqord transport link down - https://phabricator.wikimedia.org/T314978 (ayounsi) And of course it went back up as I'm sending the email. Also got a quick reply from Telia: > Please be informed that your circuit is affected by a Major Disturbance being track...
[08:34:03] netops, Infrastructure-Foundations, Observability-Metrics, SRE, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (fgiunchedi) My apologies! I ran the quickdatacopy the other day ahead of the failover and...
[09:12:45] netops, Infrastructure-Foundations, Observability-Metrics, SRE, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (fgiunchedi) I looked into why quickdatacopy didn't do the right thing: * the rsync server...
[09:14:43] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (fgiunchedi) Agreed on the short term fix to create the `.ssh` directory. However if we were not checking host keys to begin with I think we should keep...
[09:41:05] netops, Infrastructure-Foundations, SRE: Telia ulsfo-eqord transport link down - https://phabricator.wikimedia.org/T314978 (ayounsi) Open→Resolved a:ayounsi
[09:42:56] (HAProxyEdgeTrafficDrop) firing: 66% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:47:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[12:13:15] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (cmooney) @fgiunchedi yeah that may be an option. I'm not sure how easy it is to change Rancid to add that to the command when running ssh, but I'm sur...
[12:17:38] Traffic, SRE, observability, Patch-For-Review, Upstream: mtail histograms don't work as expected - https://phabricator.wikimedia.org/T314922 (fgiunchedi) Open→Resolved a:fgiunchedi Me and @Vgutierrez have fixed the existing histograms and I've added a test for `buckets -1` so we d...
[12:23:09] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (cmooney) Another oddity here with rancid from netmon1003. The permission change has removed the problem for most of our estate (all the Juniper device...
[12:32:40] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (cmooney) Logs suggest a timeout: ` scs-oe16-esams.mgmt.esams.wmnet oglogin error: Error: TIMEOUT reached scs-oe16-esams.mgmt.esams.wmnet: missed cmd(s...
[13:01:51] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Rancid unable to login to network devices - https://phabricator.wikimedia.org/T314936 (cmooney) I believe the issue is that the expect script Rancid is running for these is not saying "yes" to accept the host key. This did not happen in...
[14:57:56] (HAProxyEdgeTrafficDrop) firing: 68% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:02:56] (HAProxyEdgeTrafficDrop) resolved: 67% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:22:11] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Rancid on netmon1003 unable to login to network devices - https://phabricator.wikimedia.org/T314936 (cmooney)
[17:28:56] (HAProxyEdgeTrafficDrop) firing: 68% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:33:56] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:57:56] Traffic, Data-Engineering, Event-Platform Value Stream, SRE, and 2 others: Incident: 2022-03-4 Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (Ottomata) Are there actionables on this task? I'm considering removing the Event Pla...
[18:19:21] Traffic, Data-Engineering, Event-Platform Value Stream, SRE, and 2 others: Incident: 2022-03-4 Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (jcrespo) @Ottomata: The actionables of the task pending is to understand what the act...
[18:50:19] Traffic, SRE, ops-codfw: cp2042 is down: can't SSH; management interface works (no errors); ipmitool doesn't work - https://phabricator.wikimedia.org/T315041 (ssingh)
[18:50:49] Traffic, SRE, ops-codfw: cp2042 is down: can't SSH; management interface works (no errors); ipmitool doesn't work - https://phabricator.wikimedia.org/T315041 (ssingh) p:Triage→Medium
[19:03:56] (HAProxyEdgeTrafficDrop) firing: 59% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[19:06:40] Traffic, SRE, Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (Gehel)
[19:06:51] Traffic, SRE, Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (Gehel) We check the ferm rules, which seem to open those ports as expected. I suspect there is something going on at a lower networ...
[19:08:56] (HAProxyEdgeTrafficDrop) resolved: 60% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[19:09:01] kwakuofori: we have what looks like networking issues between racks F2 and F3 for elasticsearch hosts and we'll need some help. See T315038. Could you make sure this is on your team's radar?
[19:09:01] T315038: Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038
[19:10:17] gehel: hey! taking a look at the task
[19:12:05] kwakuofori: I'm not entirely sure this is really network related, but we'll need your help to diagnose!
[19:13:25] gehel: looks very much like an issue for netops to look at but will see how we can assist as well
[19:20:57] kwakuofori: Oh, sorry, I still thought that traffic was also in charge of networking. So that should be Infrastructure Foundations instead?
[19:21:06] ryankemper: ^
[19:21:27] gehel: correct
[19:21:38] kwakuofori: gehel: ack, thanks for the pointer
[19:21:46] I'll check with jobo
[19:21:58] all the best!
[19:22:29] (removed traffic tag and added infrastructure-foundations)
[19:31:01] Traffic, SRE, ops-codfw: cp2042 is down: can't SSH; management interface works (no errors); ipmitool doesn't work - https://phabricator.wikimedia.org/T315041 (ssingh) Open→Resolved a:ssingh Thanks to recommendations by @Dzahn, I did the following: ` racadm serveraction powercycle ` This...
[19:44:43] gehel: I'm having a look, the easiest for such tasks is to tag "netops" :)
[19:44:44] netops, Infrastructure-Foundations, SRE, Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (ayounsi)
[19:44:52] netops, Infrastructure-Foundations, SRE, Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (ayounsi) p:Triage→High
[19:45:23] XioNoX: thanks! We don't get those kinds of tasks very often, never really sure how to tag them.
[19:45:57] gehel: that's a good thing :)
[19:46:12] he, he, he...
[19:56:07] XioNoX: isn't it late for you? This is important, but not urgent enough that you need to lose sleep over it!
[19:57:34] thanks :) "just a quick look" I can't find anything obvious though so far
[20:02:59] netops, Infrastructure-Foundations, SRE, netbox, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (cmooney) The above patch uses the new puppet facts to define vlan sub-interface and bridge relations as described in...
[20:21:21] netops, Data-Engineering, Infrastructure-Foundations, Product-Analytics, and 3 others: Maybe restrict domains accessible by webproxy - https://phabricator.wikimedia.org/T300977 (herron)
[20:47:23] netops, Infrastructure-Foundations, SRE, Discovery-Search (Current work): Possible problem communicating between eqiad elastic hosts in racks F2 and F3 - https://phabricator.wikimedia.org/T315038 (ayounsi) I had a quick look and can't find any smoking gun so far. The issue seems to be related to...
[21:24:45] netops, Infrastructure-Foundations, SRE, netbox, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (ayounsi) > My goal is to proceed and update the automation to set switch interface access/trunk and allowed vlans onc...
[21:27:16] (VarnishTrafficDrop) firing: (2) Varnish traffic in eqiad has dropped 55.89565966416183% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[21:27:56] (HAProxyEdgeTrafficDrop) firing: (4) 40% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[21:32:16] (VarnishTrafficDrop) resolved: (8) Varnish traffic in drmrs has dropped 54.731145771467254% - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DVarnishTrafficDrop
[21:32:56] (HAProxyEdgeTrafficDrop) resolved: (5) 40% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop