[00:26:56] (HAProxyEdgeTrafficDrop) firing: (3) 68% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:31:56] (HAProxyEdgeTrafficDrop) resolved: (3) 67% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[01:19:10] Traffic, SRE: Create Ganeti VMs for Wikidough in drmrs - https://phabricator.wikimedia.org/T300156 (ssingh) Open→Resolved
[01:19:45] Traffic, SRE, Patch-For-Review: Create Ganeti VMs for durum in drmrs - https://phabricator.wikimedia.org/T300158 (ssingh) Open→Resolved Resolved for quite a while now.
[09:55:56] (HAProxyEdgeTrafficDrop) firing: 60% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:00:56] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[10:16:39] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (cmooney) Open→Resolved Ok thanks @nskaggs. I'm going to close this task now as I believe everything is confi...
[10:34:09] netops, Infrastructure-Foundations, SRE: Packet Drops on Eqiad ASW -> CR uplinks - https://phabricator.wikimedia.org/T291627 (cmooney) Good suggestion. The discrepancy isn't ideal but I think a little asymmetry is worth it if we can improve performance. +1
[10:37:18] netops, Infrastructure-Foundations, SRE, netbox: Improve Netbox import script to avoid port-number collisions in JunOS - https://phabricator.wikimedia.org/T301392 (cmooney) Open→Resolved
[10:41:26] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Configuration of New Switches Eqiad Rows E-F - https://phabricator.wikimedia.org/T299758 (cmooney) Open→Resolved
[10:48:54] netops, Infrastructure-Foundations: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635 (cmooney) p:Triage→Low
[10:49:09] netops, Infrastructure-Foundations: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635 (cmooney)
[10:49:17] netops, Infrastructure-Foundations, SRE: Move interface VRF assignment to Netbox - https://phabricator.wikimedia.org/T310715 (cmooney)
[11:05:42] netops, Infrastructure-Foundations, SRE, netbox, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (ayounsi) I took a naive approach with the above patch (only checks for the usual "dot" delimiter. Similarly I tested...
[12:35:31] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul) @ayounsi yes i can complete cr1 but of course with your help. Thanks
[13:35:45] netops, Infrastructure-Foundations, SRE, netbox, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (cmooney) Super work!
I'll maybe try to dig into the puppet custom facts stuff, be a chance to learn some Ruby I gues...
[15:26:21] netops, Infrastructure-Foundations, SRE, netbox, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (ayounsi) Depending on the depth of this rabbit hole, it might be better to focus on DHCP option 97 (which solves the...
[17:12:39] netops, Infrastructure-Foundations, SRE, netbox, Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (cmooney) I agree it's not worth massive effort, Option 97 is the better way to resolve the initial problem for sure....
[19:13:44] Anyone know where I can verify that all the alertmanager rules I've written are actually running? I can't actually find where alertmanager is running ._.
[19:24:46] brett: alert1001
[19:24:59] Traffic, Observability-Alerting, SRE, Patch-For-Review, User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (BCornwall) @fgiunchedi Looks like the rules mentioned in the ticket have all either been ported or confirmed as...
[19:25:24] see modules/role/manifests/alerting_host.pp
[19:25:44] and the role which is applied to the hosts, in manifests/site.pp
[19:25:50] node /^alert[12]001\.wikimedia\.org$/ { role(alerting_host)
[19:25:50] }
[19:27:17] sukhe: Thanks!
[19:31:22] hm, but I don't actually see the AM rules anywhere, just the Karma dashboard stuff on alert. I *do* see that operations/alerts is cloned for *prometheus* servers in /srv/alerts though
[19:33:10] but the actual AM binary is running on alert.
A wee bit confused about that :S
[19:35:48] brett: prometheus (or thanos-rule) is the component responsible for evaluating the alert rules and sending information about firing ones to AM
[19:36:06] AM just keeps track of which alerts are firing and is responsible for sending out any notifications
[19:37:33] which alert are you looking for? I can probably give some pointers where to look
[19:40:39] taavi: Thanks for the explanation. An example alert would be VarnishHighMmapCount
[19:44:59] brett: so that's coming from operations/alerts.git/team-traffic/varnish.yaml, it doesn't have a comment restricting where it's deployed so it should get deployed to all prod prometheus instances
[19:45:41] taavi: Makes sense, and I see it in /srv/alerts. I'm just not seeing where it's configured to look into /srv/alerts.
[19:45:45] executing parts of the query on thanos.wikimedia.org reveals that those metrics are on the 'ops' prometheus instance (that's where most stuff is), so I'd look at /srv/alerts and /srv/alerts/ops on any prometheus* box
[19:46:07] rule_files in /srv/prometheus/ops/prometheus.yml
[19:46:27] ah!
[19:46:42] I was looking into /etc/prometheus
[19:46:53] that makes more sense now
[19:47:10] Thanks a bunch for the help, taavi
[19:49:29] ooookay, that clears a bunch up, all configuration of prometheus lies in a dedicated LVM volume
[20:16:40] Traffic, Observability-Alerting, SRE, Patch-For-Review, User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (BCornwall) Ah, I've since learned where to look and verify where the rules are. Are we comfortable enough with th...
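
Editor's sketch, not part of the log: since Prometheus (not Alertmanager) evaluates the rules, one way to check that a rule such as VarnishHighMmapCount is actually loaded is to ask the Prometheus HTTP API, whose `/api/v1/rules` endpoint lists rule groups along with the file each group came from. The base URL passed to `fetch_rules`, the sample response, and the rule file path in it are assumptions made up for illustration; nothing here is taken from the Wikimedia setup beyond the alert name mentioned in the conversation.

```python
import json
import urllib.request


def loaded_alert_rules(rules_response):
    """Map each alerting rule name to the rule file its group was loaded from."""
    found = {}
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            # Recording rules share the same endpoint; keep alerting rules only.
            if rule.get("type") == "alerting":
                found[rule["name"]] = group.get("file", "")
    return found


def fetch_rules(base_url):
    """Fetch /api/v1/rules from a Prometheus server; base_url is an assumption."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/api/v1/rules") as resp:
        return json.load(resp)


# Hand-written sample in the shape of a /api/v1/rules response.
# The file path and the recording rule are hypothetical.
sample = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "varnish",
                "file": "/srv/alerts/ops/hypothetical_varnish.yaml",
                "rules": [
                    {"name": "VarnishHighMmapCount", "type": "alerting"},
                    {"name": "example:recorded_metric", "type": "recording"},
                ],
            }
        ]
    },
}

rules = loaded_alert_rules(sample)
print("VarnishHighMmapCount" in rules)
```

Against a live server, `loaded_alert_rules(fetch_rules(base_url))` would give the same mapping for whichever Prometheus instance `base_url` points at.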