[06:50:30] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-drmrs: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10wiki_willy) Hi @ayounsi - I'm not sure if you're copied on the Interxion ticket, so just forwarding the info along that they completed th...
[08:13:05] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-drmrs: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10ayounsi) I can confirm that (1), (2) and (4) are done. However cr2-drmrs is currently fully down (console is dead as well). My guess is...
[08:35:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6014:9331 is unreachable - https://alerts.wikimedia.org
[08:40:57] (VarnishPrometheusExporterDown) firing: (3) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[09:07:31] (VarnishPrometheusExporterDown) firing: (6) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[09:10:36] (VarnishPrometheusExporterDown) firing: (6) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[09:14:25] ^checking
[09:16:06] mmandere: cr2-esams is fully down, so nothing to do on your side there ;)
[09:16:27] netops are aware and trying to get it fixed
[09:16:49] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-drmrs: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10ayounsi) I gave a call to Tarek: the power cord on cr2 was faulty, but he was able to find 2 spare ones which he will bill on the ticket....
[09:16:53] volans: ack
[09:18:15] esams?
[09:18:28] volans: that's drmrs
[09:18:44] volans: it's up
[09:19:31] sorry, I mistyped :) I meant cr2-drmrs
[09:19:51] sorry for the confusion, was thinking about 2 different things at once, and my fingers decided to type the wrong one
[09:24:18] XioNoX: great, since when?
[09:24:25] volans: np
[09:24:39] volans: 30min/1h ago?
[09:25:13] also I fixed the BGP sessions, so even if cr2 goes down it won't impact anything
[09:25:25] cr1 have a link to asw1-b13 and the other way around
[09:25:53] great, thx, so mmandere if they didn't recover by themselves by now you probably should have a look to see why they are still failing
[09:27:07] volans: checked there's some mask change and running puppet agent complains of memory
[09:27:38] I'm currently on cp6010
[09:28:33] sigh.. that typo from volans scared the shit out of me
[09:28:53] :D :D
[09:29:10] is it me or every time that we deploy two brand new routers one burns in hell after a few days?
[09:30:05] vgutierrez: was to check if you were paying attention :-P
[09:31:00] my coffee levels aren't optimal yet.. hence the lag
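The VarnishPrometheusExporterDown alerts above fire when an exporter's metrics endpoint stops answering; the alert text identifies each instance as host:9331. A minimal reachability probe in that spirit might look like the sketch below. It assumes the port from the alert text and that the cp60xx hostnames resolve from wherever it runs; it is not the actual alerting rule.

```python
# Sketch only: probe TCP 9331 (the port named in the alerts) on a few cache hosts.
import socket


def exporter_reachable(host: str, port: int = 9331, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hostnames taken from the alerts above; adjust for wherever this is run.
    for host in ("cp6010", "cp6014"):
        status = "reachable" if exporter_reachable(host) else "unreachable"
        print(f"{host}:9331 {status}")
```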
[09:33:52] very soon it will be less scary to lose cr2-esams
[09:35:42] XioNoX: that's right :)
[09:47:33] All good now :)
[09:57:56] (EdgeTrafficDrop) firing: 62% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[10:00:54] Initially was checking the wrong service name `prometheus-varnish-exporter` instead of `prometheus-varnish-exporter@frontend` and it ended up crashing because it ran out of memory
[10:01:22] That's why it did not recover by itself
[10:03:23] probably worth opening a task to see if it's fixable, it's not great that it runs out of memory just because it lost connectivity
[10:05:10] volans: got it
[10:07:56] (EdgeTrafficDrop) resolved: 58% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[10:32:54] 10Traffic: Prometheus Varnish Exporter fails to start on some instances in DRMRS with Out of Memory Error - https://phabricator.wikimedia.org/T302206 (10MMandere)
[10:34:27] 10Traffic, 10SRE: Prometheus Varnish Exporter fails to start on some instances in DRMRS with Out of Memory Error - https://phabricator.wikimedia.org/T302206 (10MMandere)
[11:07:10] mmandere: ^^ whatever is eating memory on cp6014 already got the 400G again
[11:07:59] right.. purged is quite angry after being offline for a few days
[11:08:45] see https://grafana.wikimedia.org/d/RvscY1CZk/purged?orgId=1&var-datasource=drmrs%20prometheus%2Fops&var-cluster=cache_text&var-instance=cp6014&viewPanel=26
[11:12:54] vgutierrez: That's right... saw that on running top for top mem consumer on the instance...
[11:13:03] 10Traffic, 10MediaWiki-Uploading: ATS 502 on uploading non-small files - https://phabricator.wikimedia.org/T299160 (10MatthewVernon)
[11:17:26] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6016:9331 is unreachable - https://alerts.wikimedia.org
[11:40:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6016:9331 is unreachable - https://alerts.wikimedia.org
[12:15:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6016:9331 is unreachable - https://alerts.wikimedia.org
[14:02:52] 10Traffic, 10DNS: Central and South American countries in geo-maps - https://phabricator.wikimedia.org/T301605 (10MatthewVernon)
[14:06:18] 10HTTPS, 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE: The certificate for upload.wikimedia.beta.wmflabs.org expired on February 16, 2022. - https://phabricator.wikimedia.org/T301995 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon [I think this task can be closed, since the issue was resolve...
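The 10:00:54 message above turns on a systemd naming detail: the exporter runs as a templated unit, so `prometheus-varnish-exporter` and `prometheus-varnish-exporter@frontend` are distinct units, and only the latter is the instance that crashed. A hedged sketch of inspecting both names side by side, assuming hosts where `systemctl show` is available and reports MemoryCurrent:

```python
# Sketch only: query a few systemd properties for each unit name via `systemctl show`.
import subprocess


def unit_state(unit: str) -> dict:
    """Return selected systemd properties for a unit as a dict."""
    out = subprocess.run(
        ["systemctl", "show", unit,
         "--property=LoadState,ActiveState,SubState,Result,MemoryCurrent"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split("=", 1) for line in out.splitlines() if "=" in line)


if __name__ == "__main__":
    # The bare name and the templated instance are different units; checking the
    # former says nothing about the frontend exporter that actually ran out of memory.
    for unit in ("prometheus-varnish-exporter", "prometheus-varnish-exporter@frontend"):
        print(unit, unit_state(unit))
```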
[14:06:24] 10HTTPS, 10Traffic, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10MatthewVernon)
[14:08:03] 10Traffic: Prometheus Varnish Exporter fails to start on some instances in DRMRS with Out of Memory Error - https://phabricator.wikimedia.org/T302206 (10MatthewVernon)
[14:17:56] (EdgeTrafficDrop) firing: 57% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
[14:22:56] (EdgeTrafficDrop) resolved: 57% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
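For context on the EdgeTrafficDrop alerts quoted above, the percentage describes how far the current request rate sits below a baseline over the 30-minute window; the exact query the alert uses is not visible in this log, so the snippet below is only an arithmetic illustration with invented numbers.

```python
# Illustration only: the baseline and current rates here are made up to
# reproduce a "62% request drop" figure; they are not measured values.
def drop_percentage(baseline_rps: float, current_rps: float) -> float:
    """Percentage drop of the current request rate relative to a baseline rate."""
    if baseline_rps <= 0:
        return 0.0
    return max(0.0, (baseline_rps - current_rps) / baseline_rps * 100)


if __name__ == "__main__":
    print(f"{drop_percentage(8000, 3040):.0f}% request drop")  # -> 62% request drop
```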