[06:48:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:58:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:01:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:06:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[09:16:09] Traffic, SRE, ops-codfw: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (Vgutierrez) System Event Log shows a failure on DIMM A1: ` ------------------------------------------------------------------------------- Record: 49 Date/Time: 12/24/2021 03:32:47 Sourc...
[15:00:05] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Deprecate interface-range external - https://phabricator.wikimedia.org/T296935 (ayounsi) Open→Resolved a:ayounsi Deployed!
[15:06:13] netops, Infrastructure-Foundations, SRE, SRE-tools, and 3 others: Investigate Capirca - https://phabricator.wikimedia.org/T273865 (ayounsi)
[15:30:44] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): connect 2nd cloudcontrol200x-dev NIC to vlan 2105 - https://phabricator.wikimedia.org/T297588 (Papaul)
[15:37:17] Traffic, SRE, ops-codfw: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (Papaul) @Vgutierrez Happy new year can I power this server off so I can swap DIMM A1 with DIMM B1?
[15:38:30] Traffic, SRE, ops-codfw: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (Vgutierrez) @Papaul yes, go ahead please. Happy new year :)
[15:42:30] Traffic, SRE, ops-codfw: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 0:30:00 1 host(s) and their services with reason: Swapping faulty DIMM with B1 ` cp2029.codfw.wmnet `
[15:46:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2029:9331 is unreachable - https://alerts.wikimedia.org
[15:47:43] ^^ expected :)
[15:51:55] Traffic, SRE, ops-codfw: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (Papaul) a:Papaul
[15:52:58] Traffic, SRE, ops-codfw: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (Papaul) I swapped DIMM A1 with DIMM B1 to see if the error shows on B1. I am leaving the task open for now.
[15:53:34] Traffic, SRE, ops-codfw: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (Papaul) p:Triage→Medium
[15:56:39] Traffic, SRE, ops-codfw: cp2029 crashed, hardware memory error - https://phabricator.wikimedia.org/T298293 (Vgutierrez) @Papaul cool, I'll repool the server then
[15:56:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp2029:9331 is unreachable - https://alerts.wikimedia.org