[00:02:57] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[00:07:57] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[05:55:56] (HAProxyEdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[06:00:56] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:25:20] netops, Infrastructure-Foundations, cloud-services-team (Kanban): connect 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (Marostegui)
[08:31:27] Traffic, DBA, MediaWiki-General, Pybal, and 2 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change - https://phabricator.wikimedia.org/T284981 (Marostegui)
[08:31:36] Traffic, DBA, MediaWiki-General, Pybal, and 2 others: SELECT query arriving to wikidatawiki db codfw hosts causing pile ups during schema change - https://phabricator.wikimedia.org/T284981 (Marostegui) Open→Resolved a:Marostegui I am going to consider this fixed as it never happened a...
[08:32:41] Traffic, netops, Infrastructure-Foundations, SRE: drmrs: primary software task - https://phabricator.wikimedia.org/T282788 (Marostegui) @BBlack anything left here or can this be closed?
[08:56:23] netops, Cloud-Services, DBA, Infrastructure-Foundations, and 2 others: DBA plan to mitigate asw-c2-eqiad reboots - https://phabricator.wikimedia.org/T155999 (Marostegui)
[09:12:33] Traffic, DC-Ops, SRE, ops-ulsfo: ganeti4002 dimm error - https://phabricator.wikimedia.org/T303318 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→RobH @RobH I've drained all primary instances away from ganeti4002. Before you swap the DIMM simply set downtime and power the server down. And when t...
[09:44:59] netops, Infrastructure-Foundations, SRE: 2021-08-26 Primary inbound port utilisation over 80% page for mr1-esams.wikimedia.org - https://phabricator.wikimedia.org/T289820 (Marostegui) Open→Resolved a:ayounsi Fixed per the above comment
[10:36:58] Traffic, SRE, Patch-For-Review: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (Marostegui) @BBlack anything else pending?
[11:20:25] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE, Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) Thanks @AlexisJazz for that suggestion. I think it might well help in terms of findi...
[14:45:52] bblack: working on T155761 I made a series of patches for the dns repo's zone_validator.py script and a couple for the script that generates the DNS data from Netbox. Very kindly jbond has already reviewed them. I'd like to know if you plan/want to have a look, so that I know if I should wait for your review too or not :)
[14:45:52] T155761: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761
[14:46:10] FYI the 2 series starts at https://gerrit.wikimedia.org/r/c/operations/dns/+/793056 and https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/793483
[15:19:15] volans: yeah I can try to take a look around the next hour mark
[15:19:37] I'm pretty time-boxed today, but at least a pass over them all for anything basic I can think of :)
[15:20:00] ack, thanks a lot! No hurry at all, just to know if I should wait or not. It's ok if you have a look and say hey I want to look at them more in depth next week ;)
[15:41:30] netops, Infrastructure-Foundations: codfw: Provision a server script can not run without a cable ID" - https://phabricator.wikimedia.org/T308768 (Papaul)
[16:18:07] netops, Infrastructure-Foundations: codfw: Provision a server script can not run without a cable ID" - https://phabricator.wikimedia.org/T308768 (Volans) That's probably the change in https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/789089 I'll have a look.
[16:24:57] (HAProxyEdgeTrafficDrop) firing: 59% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:29:57] (HAProxyEdgeTrafficDrop) resolved: (2) 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:31:47] netops, Infrastructure-Foundations, SRE: codfw: Provision a server script can not run without a cable ID" - https://phabricator.wikimedia.org/T308768 (Volans) @Papaul the above patch was merged and deployed. I think it should fix the issue. Please resolve the task if that's the case or let me know wh...
[17:31:02] volans: one thing that puzzles me: the patch that adds basic support for checking the netbox fragments.. does this work on our laptops anymore? some method of checking out the other repo separately and linking it in or something?
[17:32:52] volans: other than that, it all seems pretty straightforward!
[18:46:24] Traffic, DC-Ops, SRE, ops-ulsfo: ganeti4002 dimm error - https://phabricator.wikimedia.org/T303318 (RobH) Open→Resolved @MoritzMuehlenhoff this host is now ready to return to service, its memory has been replaced.
[22:13:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp3060:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[22:18:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp3060:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[23:05:43] Traffic: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (BBlack) p:Triage→High