[09:28:28] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2036.codfw.wmnet with OS buster [09:34:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2036:9331 is unreachable - https://alerts.wikimedia.org [09:44:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp2036:9331 is unreachable - https://alerts.wikimedia.org [09:46:26] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2036:9331 is unreachable - https://alerts.wikimedia.org [09:47:04] 10Traffic, 10Data-Engineering, 10Event-Platform, 10SRE, and 2 others: Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (10Ladsgroup) [09:51:26] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp2036:9331 is unreachable - https://alerts.wikimedia.org [09:51:41] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2036:9331 is unreachable - https://alerts.wikimedia.org [09:55:44] godog: puppet's been disabled since Friday on cp6011 with the message "bblack - filippo", what's going on there? ;P [09:58:51] 10Traffic, 10DNS, 10SRE, 10Wikimedia Enterprise: 301 redirect setup for wikimediaenterprise - https://phabricator.wikimedia.org/T302756 (10Vgutierrez) 05Open→03Stalled we cannot perform that redirect cause we don't handle the DNS for that domain: `$ host -t ns wikimediaenterprise.org wikimediaenterpris... 
[10:01:23] vgutierrez: it was disabled by bblack, but we had to re-enable it temporarily to update the puppet facts (LLDP neighbor) [10:01:47] thx, I'll sync with bblack then [10:02:26] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp2036:9331 is unreachable - https://alerts.wikimedia.org [10:10:29] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2036.codfw.wmnet with OS buster c... [10:15:07] vgutierrez: yep what XioNoX said :) should have been more explicit in the message! [10:17:59] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1084.eqiad.wmnet with OS buster [10:24:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1084:9331 is unreachable - https://alerts.wikimedia.org [10:27:27] 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10dom_walden) >>! In T302699#7740763, @dom_walden wrote: > ` > AH00288: scoreboard is full, not at MaxRequestWorker... 
[10:29:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1084:9331 is unreachable - https://alerts.wikimedia.org [10:30:26] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1084:9331 is unreachable - https://alerts.wikimedia.org [10:35:26] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1084:9331 is unreachable - https://alerts.wikimedia.org [10:35:43] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4036.ulsfo.wmnet with OS buster [10:36:26] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1084:9331 is unreachable - https://alerts.wikimedia.org [10:40:11] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1084:9331 is unreachable - https://alerts.wikimedia.org [10:42:26] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1084:9331 is unreachable - https://alerts.wikimedia.org [10:45:11] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp4036:9331 is unreachable - https://alerts.wikimedia.org [10:47:26] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp4036:9331 is unreachable - https://alerts.wikimedia.org [10:55:11] (VarnishPrometheusExporterDown) resolved: (2) Varnish Exporter on instance cp4036:9331 is unreachable - https://alerts.wikimedia.org [11:00:54] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1084.eqiad.wmnet with OS buster c... 
[11:03:34] 10Traffic, 10MediaWiki-API: api.php not working (503, Backend fetch failed) - https://phabricator.wikimedia.org/T303165 (10AlexisJazz) [11:03:57] 10Traffic, 10Beta-Cluster-Infrastructure, 10MediaWiki-API, 10Beta-Cluster-reproducible: api.php not working (503, Backend fetch failed) - https://phabricator.wikimedia.org/T303165 (10AlexisJazz) [11:10:19] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5016.eqsin.wmnet with OS buster [11:12:40] 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10AlexisJazz) >>! In T302699#7756154, @dom_walden wrote: >>>! In T302699#7740763, @dom_walden wrote: >> ` >> AH0028... [11:14:00] 10Traffic, 10Beta-Cluster-Infrastructure, 10MediaWiki-API, 10Beta-Cluster-reproducible: api.php not working (503, Backend fetch failed) - https://phabricator.wikimedia.org/T303165 (10Majavah) [11:14:19] 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10Majavah) [11:16:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp5016:9331 is unreachable - https://alerts.wikimedia.org [11:17:37] 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10AlexisJazz) @Majavah are you sure T303165 is a dupe? That task is about api.php (and nothing else!) **consistentl... 
[11:18:08] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4036.ulsfo.wmnet with OS buster c... [11:19:55] 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10Majavah) >>! In T302699#7756393, @AlexisJazz wrote: > @Majavah are you sure T303165 is a dupe? That task is about... [11:20:51] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3060.esams.wmnet with OS buster [11:31:57] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp3060:9331 is unreachable - https://alerts.wikimedia.org [11:35:21] 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10AlexisJazz) >>! In T302699#7756404, @Majavah wrote: >>>! In T302699#7756393, @AlexisJazz wrote: >> @Majavah are y... 
[11:35:31] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [11:36:19] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [11:41:52] 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10Vgutierrez) in this case a 502 is emitted by ats-backend cause it isn't able to reach its backend server. The 503... [11:49:18] godog: I've updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/768057 to include the thanos rules [11:49:29] and fixed the issue detected by mmandere [11:51:57] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp3060:9331 is unreachable - https://alerts.wikimedia.org [12:01:57] (VarnishPrometheusExporterDown) resolved: (2) Varnish Exporter on instance cp3060:9331 is unreachable - https://alerts.wikimedia.org [12:04:02] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5016.eqsin.wmnet with OS buster c... [12:08:04] vgutierrez: you're working on cp3060? [12:08:12] yep [12:08:14] it's being reimaged [12:08:19] cool! [12:08:53] volans: shouldn't the re-image script update the netbox status as well? so the report doesn't alert [12:10:25] XioNoX: and what state should it be put on? 
the report excludes only inventory/offline/planned/decommissioning [12:10:59] and failed [12:11:13] the logical one would be staged [12:11:50] it's so temporary that dunno what it should be [12:12:54] 10Traffic, 10Observability-Metrics, 10SRE: Port Traffic dashboards to Thanos - https://phabricator.wikimedia.org/T302266 (10MMandere) [12:14:47] XioNoX: I think the only state that would not trigger other reports to fail is failed, but feels really wrong [12:15:01] another approach could be to change something else on netbox [12:15:41] a bit more hacky though (something like the comments or a tag) [12:15:56] (EdgeTrafficDrop) firing: 56% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org [12:18:24] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3060.esams.wmnet with OS buster c... 
[12:18:27] volans: failed doesn't feel so wrong to me, as it matches the temporary state (server not able to do its duty) [12:18:42] but yeah if there is nothing obvious might be better to keep it as is [12:20:25] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp2037.codfw.wmnet with OS buster [12:20:56] (EdgeTrafficDrop) resolved: 63% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org [12:23:37] we could do failed... the other question XioNoX is to decide what to do on failure of the cookbook, should it reset the netbox status to its previous one or not? [12:24:27] volans: what's the server's status if the cookbook fails? [12:26:22] in netbox or in general? 
[12:26:32] in netbox currently untouched, in general depends [12:26:38] the host might be fixed manually [12:26:41] or re-reimaged [12:27:28] so I guess failed would make sense [12:27:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2037:9331 is unreachable - https://alerts.wikimedia.org [12:42:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp2037:9331 is unreachable - https://alerts.wikimedia.org [12:45:56] (EdgeTrafficDrop) firing: 64% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org [12:50:56] (EdgeTrafficDrop) resolved: 67% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org [12:58:23] bblack: GTT MTU issue fixed, and routers upgraded again to clear a minor bug (noisy logs). drmrs is ready for Traffic! [13:05:08] woho! [13:21:33] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Krinkle) [13:47:48] vgutierrez: ack [14:03:16] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp2037.codfw.wmnet with OS buster c... 
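[editor's note on the 12:08-12:27 netbox thread] The idea floated by XioNoX and volans (put the host in a status the report ignores while it is being reimaged, and leave it as failed if the cookbook fails) could be sketched roughly as below. This is only an illustration under stated assumptions: FakeNetbox, reimage_status and the "active" starting status are hypothetical names, not the real Netbox API or the real reimage cookbook, and the thread did not settle whether the previous status should be restored on failure.

```python
from contextlib import contextmanager

# Statuses the report skips, as listed in the discussion above.
EXCLUDED_FROM_REPORT = {"inventory", "offline", "planned", "decommissioning", "failed"}


class FakeNetbox:
    """Hypothetical in-memory stand-in for Netbox (not the real API)."""

    def __init__(self):
        self.status = {}

    def get_status(self, host):
        return self.status[host]

    def set_status(self, host, value):
        self.status[host] = value


@contextmanager
def reimage_status(netbox, host, temporary="failed"):
    """Hold a temporary status on the host while the reimage runs.

    On success the previous status is restored. If the body raises,
    the temporary status is left in place, so the host stays excluded
    from the report until someone fixes or re-reimages it.
    """
    previous = netbox.get_status(host)
    netbox.set_status(host, temporary)
    yield
    # Only reached when the body did not raise.
    netbox.set_status(host, previous)


if __name__ == "__main__":
    nb = FakeNetbox()
    nb.set_status("cp3060", "active")  # assumed starting status
    with reimage_status(nb, "cp3060"):
        print("during reimage:", nb.get_status("cp3060"))
    print("after success:", nb.get_status("cp3060"))
```

Restoring the status only on success matches the "failed would make sense" conclusion; resetting it on failure as well would need a try/finally around the yield instead.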
[14:09:20] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1085.eqiad.wmnet with OS buster [14:15:29] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp4030.ulsfo.wmnet with OS buster [14:15:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1085:9331 is unreachable - https://alerts.wikimedia.org [14:25:56] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp1085:9331 is unreachable - https://alerts.wikimedia.org [14:45:56] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp1085:9331 is unreachable - https://alerts.wikimedia.org [14:50:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp4030:9331 is unreachable - https://alerts.wikimedia.org [14:52:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1085:9331 is unreachable - https://alerts.wikimedia.org [14:56:11] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp1085:9331 is unreachable - https://alerts.wikimedia.org [15:01:53] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10Vgutierrez) [15:02:20] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10Vgutierrez) p:05Triage→03Medium [15:02:43] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp4030.ulsfo.wmnet with OS buster c... [15:38:48] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1085.eqiad.wmnet with OS buster e... [15:40:06] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 30 days, 0:00:00 1 host(s) and their services with reason: HW issues see T303183 ` cp1085.eqiad.wmnet ` [15:50:07] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp5010.eqsin.wmnet with OS buster [15:57:56] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp1085:9331 is unreachable - https://alerts.wikimedia.org [16:10:46] 10Traffic, 10SRE, 10ops-eqsin: SMART error (CurrentPendingSector) detected on host: cp5004 - https://phabricator.wikimedia.org/T303043 (10Vgutierrez) p:05Triage→03Medium @wiki_willy how should we handle this HW issue on eqsin? 
[16:11:11] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp1085:9331 is unreachable - https://alerts.wikimedia.org [16:18:25] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp3058.esams.wmnet with OS buster [16:26:11] (VarnishPrometheusExporterDown) firing: (3) Varnish Exporter on instance cp1085:9331 is unreachable - https://alerts.wikimedia.org [16:30:53] I've silenced that alert [16:30:59] as cp1085 has some HW issues (T303183) [16:31:00] T303183: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 [16:41:54] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp5010.eqsin.wmnet with OS buster c... [16:47:56] (EdgeTrafficDrop) firing: (2) 44% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org [16:52:56] (EdgeTrafficDrop) resolved: (2) 64% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org [16:56:38] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10Cmjohnson) @Vgutierrez @wiki_willy This server is out of warranty. 
Expired June 2021 [17:03:22] 10Traffic, 10SRE, 10ops-eqsin: SMART error (CurrentPendingSector) detected on host: cp5004 - https://phabricator.wikimedia.org/T303043 (10wiki_willy) a:03RobH Hi @Vgutierrez - it's due to be refreshed towards the end of this calendar year (and will be on next FY's budget). Would you be able to go that lon... [17:06:11] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp3058:9331 is unreachable - https://alerts.wikimedia.org [17:07:57] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) [17:08:03] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp3058.esams.wmnet with OS buster c... [17:09:42] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10Vgutierrez) could we replace the faulty DIMM somehow? missing one server on text@eqiad is far from an ideal scenario [17:14:24] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10wiki_willy) [17:15:56] 10Traffic, 10DC-Ops, 10SRE, 10ops-eqiad: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10wiki_willy) No problem @Vgutierrez. 
I just created T303203 with @RobH to procure a replacement DIMM Thanks, Willy [17:20:56] 10Traffic, 10SRE, 10ops-eqsin: SMART error (CurrentPendingSector) detected on host: cp5004 - https://phabricator.wikimedia.org/T303043 (10ops-monitoring-bot) Icinga downtime set by vgutierrez@cumin1001 for 30 days, 0:00:00 1 host(s) and their services with reason: HW issues see T303043 ` cp5004.eqsin.wmnet ` [17:31:11] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1085:9331 is unreachable - https://alerts.wikimedia.org [17:36:37] uh.. I've silenced that one ¬¬ [17:43:50] vgutierrez: I think that one's from alertmanager, which has separate silencing from icinga [17:44:52] hmm I silenced it on alertmanager [17:46:24] it isn't there anymore.. weird [18:00:39] vgutierrez: fyi https://gerrit.wikimedia.org/r/c/operations/puppet/+/768766/4 is an effort to move the wikimedia domain list to hiera. i also did 768739 and 768762 which are both effectively noop and just a bit of modernising like type validation and documentation etc. in relation to the documentation i have taken my best guess so expect some of it to be wrong [18:01:17] so... I was going to add a comment there saying that's not the list of WM domains but the canonical domains [18:01:44] at the same time I've realised that we should be including wikiworkshop.org in that list [18:01:53] and of course it isn't a canonical domain :) [18:02:43] right [18:02:57] wikiworkshop.org really should be considered part of our canonical domain set [18:03:05] we just haven't updated various lists (e.g. 
wikitech) [18:03:20] then we should be including it in the unified cert [18:03:33] yeah that too, but it's a little bit tricky [18:04:30] because it would be added to the LE unified and the Digicert unified async from each other, yet we have a shared config across all the sites, some of which are using one or the other cert, and the TLS config doing SNI-based selection for the wikiworkshop.org cert [18:04:48] it's a generic problem we'd have with any new LE-only canonical becoming part of the canonical set and the unified certs eventually [18:05:05] but we've never put in the time to come up with an elegant solution for getting through those transitions [18:06:34] so for example, assume we first add it to the LE unified, but it's not yet in digicert (waiting on next renewal or whatever) [18:07:06] now the LE-unified sites need to have just the unified cert, but the digicert sites need to continue using the digicert unified + separate wikiworkshop.org LE (and continue refreshing it in acmechief too) [18:07:07] vgutierrez: expect this to take a few iterations, add anything to the cr and we can update, add or amend :) [18:30:05] continuing on the "add a new canonical" thread - it's entirely possible that some or all of our TLS terminator softwares will make this easy for us (that they're ok with seeing the same SAN in multiple certs and deal with it some reasonable way, like picking the first cert in the config which contains it) [18:30:14] but we haven't tested that AFAIK [22:53:45] 10Traffic, 10SRE, 10envoy, 10serviceops: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (10RLazarus) [22:55:29] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10RLazarus) [22:55:35] 10Traffic, 10SRE, 10envoy, 10serviceops: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (10RLazarus) 05Open→03Stalled 
p:05Triage→03Low [23:02:30] 10Traffic, 10SRE, 10envoy, 10serviceops: Refactor envoy access_log_path to access loggers - https://phabricator.wikimedia.org/T303231 (10RLazarus) [23:05:04] 10Traffic, 10SRE, 10envoy, 10serviceops: Refactor envoy access_log_path to access loggers - https://phabricator.wikimedia.org/T303231 (10RLazarus) p:05Triage→03Medium
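[editor's note on the 18:03-18:30 "add a new canonical" thread] The behaviour bblack hopes the TLS terminators might have (the same SAN appearing in several certs, with the first cert in the config that contains it winning) can be illustrated with the toy model below. It is a sketch only: the cert names and SAN lists are made up for the example, pick_cert and san_matches are hypothetical helpers, and nothing here reflects how haproxy or any other terminator actually selects certificates, which the thread says has not been tested.

```python
def san_matches(san, hostname):
    """True if a SAN entry covers hostname (exact match or one-label wildcard)."""
    san, hostname = san.lower(), hostname.lower()
    if san.startswith("*."):
        suffix = san[1:]  # e.g. ".wikipedia.org"
        if not hostname.endswith(suffix):
            return False
        prefix = hostname[: -len(suffix)]
        # The wildcard stands for exactly one non-empty left-most label.
        return bool(prefix) and "." not in prefix
    return san == hostname


def pick_cert(certs, sni):
    """Return the name of the first configured cert whose SANs cover sni.

    certs is an ordered list of (name, san_list) pairs, mirroring the
    order certs would appear in a shared config. Returns None if no
    cert covers the requested name.
    """
    for name, sans in certs:
        if any(san_matches(san, sni) for san in sans):
            return name
    return None


# Hypothetical cert inventories for the transition described above.
LE_SITE = [
    ("le-unified", ["*.wikipedia.org", "wikiworkshop.org"]),
    ("wikiworkshop-le", ["wikiworkshop.org"]),
]
DIGICERT_SITE = [
    ("digicert-unified", ["*.wikipedia.org"]),  # wikiworkshop.org not added yet
    ("wikiworkshop-le", ["wikiworkshop.org"]),
]

if __name__ == "__main__":
    for label, certs in (("LE site", LE_SITE), ("Digicert site", DIGICERT_SITE)):
        print(label, "->", pick_cert(certs, "wikiworkshop.org"))
```

Under this first-match rule the same ordered config (unified cert first, standalone wikiworkshop.org cert second) would serve the unified cert wherever it already carries the SAN and fall back to the standalone cert elsewhere, which is the transition property the thread is after.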