[00:07:17] Traffic, DC-Ops, SRE, ops-eqsin: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (RobH) a:RobH→ssingh This host has had its ram replaced and booted into the OS successfully, detecting all memory without errors. When we were replacing the memory, it forgot its...
[00:48:30] Traffic, DC-Ops, SRE, ops-eqsin: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (ssingh) Open→Resolved >>! In T305423#7901638, @RobH wrote: > This host has had its ram replaced and booted into the OS successfully, detecting all memory without errors. > > When...
[00:53:38] Traffic, DC-Ops, SRE, ops-eqsin: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (Papaul) @ssingh the host is still has failed as status in netbox https://netbox.wikimedia.org/dcim/devices/1611/
[01:05:48] Traffic, DC-Ops, SRE, ops-eqsin: cp5002 memory errors on DIMM A4 - https://phabricator.wikimedia.org/T305423 (ssingh) >>! In T305423#7901674, @Papaul wrote: > @ssingh the host is still has failed as status in netbox > https://netbox.wikimedia.org/dcim/devices/1611/ Thanks for letting me know @PP...
[01:16:56] (HAProxyEdgeTrafficDrop) firing: (2) 32% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[01:21:56] (HAProxyEdgeTrafficDrop) firing: (5) 55% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[01:26:56] (HAProxyEdgeTrafficDrop) firing: (5) 64% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[01:36:56] (HAProxyEdgeTrafficDrop) resolved: (5) 64% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[01:58:48] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster
[02:33:05] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster execut...
[02:41:49] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster
[03:11:13] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster execut...
[08:02:25] (HAProxyEdgeTrafficDrop) firing: 61% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:07:25] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:43:33] godog: I'm not super happy with https://gerrit.wikimedia.org/r/c/operations/puppet/+/789102/1/modules/mtail/files/programs/cache_haproxy.mtail, are you aware of a neater approach that would potentially save me from repeating all of that?
[10:46:55] vgutierrez: mmhh I'll take a deeper look this afternoon but does setting $cache_status = "none" or sth similar work for the len($cache_status) == 0? then the rest is common AFAICS
[10:48:06] IIRC mtail doesn't allow that
[10:52:13] hmmm but I guess I can define a hidden text and do that
[12:20:11] godog: indeed, https://gerrit.wikimedia.org/r/c/operations/puppet/+/789102/3 looks good
[12:36:00] vgutierrez: neat! LGTM
[12:36:11] vgutierrez: when you get a chance, https://gerrit.wikimedia.org/r/c/operations/alerts/+/789094
[12:45:16] godog: you mention on the commit message that the alert doesn't page but (hashtag)page is used on the summary of the alert
[12:49:24] vgutierrez: the CPU alert doesn't page, the RX one does
[12:49:48] also neat trick with (hashtag)page :)
[12:50:14] oh right
[12:51:12] yeah.. I get "paged" randomly when somebody discusses that kind of thing so...
[12:53:07] heheh same here, thanks for the review! I'll let bblack comment/vote too and merge tomorrow, will be following up with the puppet.git change to remove the alerts from there too
[13:57:21] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster
[14:17:27] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster execut...
[14:19:00] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) I had the beginnings of a theory, based on some reading around varnish, but now I don't think that it's va...
[14:24:45] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (CDanis) Hi, haven't deeply read or understood this issue (sorry!) but I wanted to point out T264021 as potentially...
[14:44:32] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster
[14:45:12] netops, Infrastructure-Foundations, SRE: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (cmooney) @dcaro not just yet. I believe the one change we will need to test here is adding a route on the cloud-storage interfaces. What...
[15:20:34] Traffic, Data-Engineering, Data-Engineering-Kanban, SRE: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (BTullis) Thanks @CDanis - Yes that looks very likely. Also I think that the latency ticket {T294911} is also probab...
[16:03:22] Traffic, SRE, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5002.eqsin.wmnet with OS buster comple...
[18:12:09] Traffic, RESTBase-API, SRE: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Dzahn)