[06:50:30] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-drmrs: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10wiki_willy) Hi @ayounsi - I'm not sure if you're copied on the Interxion ticket, so just forwarding the info along that they completed th...
[08:13:05] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-drmrs: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10ayounsi) I can confirm that (1), (2) and (4) are done. However cr2-drmrs is currently fully down (console is dead as well). My guess is...
[08:35:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6014:9331 is unreachable - https://alerts.wikimedia.org
[08:40:57] (VarnishPrometheusExporterDown) firing: (3) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[09:07:31] (VarnishPrometheusExporterDown) firing: (6) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[09:10:36] (VarnishPrometheusExporterDown) firing: (6) Varnish Exporter on instance cp6010:9331 is unreachable - https://alerts.wikimedia.org
[09:14:25] ^checking
[09:16:06] mmandere: cr2-esams is fully down, so nothing to do on your side there ;)
[09:16:27] netops are aware and trying to get it fixed
[09:16:49] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-drmrs: Q3:(Need By: ASAP) rack/setup/install cr[12]-drmrs - https://phabricator.wikimedia.org/T300277 (10ayounsi) I gave a call to Tarek: the power cord on cr2 was faulty, but he was able to find 2 spare ones which he will bill on the ticket....
[09:16:53] volans: ack
[09:18:15] esams?
[09:18:28] volans: that's drmrs
[09:18:44] volans: it's up
[09:19:31] sorry, I mistyped :) I meant cr2-drmrs
[09:19:51] sorry for the confusion, was thinking about 2 different things at once, and my fingers decided to type the wrong one
[09:24:18] XioNoX: great, since when?
[09:24:25] volans: np
[09:24:39] volans: 30min/1h ago?
[09:25:13] also I fixed the BGP sessions, so even if cr2 goes down it won't impact anything
[09:25:25] cr1 have a link to asw1-b13 and the other way around
[09:25:53] great, thx, so mmandere if they didn't recover by themselves by now you probably should have a look to see why they are still failing
[09:27:07] volans: checked there's some mask change and running puppet agent complains of memory
[09:27:38] I'm currently on cp6010
[09:28:33] sigh.. that typo from volans scared the shit out of me
[09:28:53] :D :D
[09:29:10] is it me or every time that we deploy two brand new routers one burns in hell after a few days?
[09:30:05] vgutierrez: was to check if you were paying attention :-P
[09:31:00] my coffee levels aren't optimal yet.. hence the lag
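The VarnishPrometheusExporterDown alerts above fire when an exporter's metrics endpoint stops answering; the alert text identifies each instance as host:9331. A minimal reachability probe in that spirit might look like the sketch below. It assumes the port from the alert text and that the cp60xx hostnames resolve from wherever it runs; it is not the actual alerting rule.

```python
# Sketch only: probe TCP 9331 (the port named in the alerts) on a few cache hosts.
import socket


def exporter_reachable(host: str, port: int = 9331, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hostnames taken from the alerts above; adjust for wherever this is run.
    for host in ("cp6010", "cp6014"):
        status = "reachable" if exporter_reachable(host) else "unreachable"
        print(f"{host}:9331 {status}")
```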
[09:33:52] very soon it will be less scary to lose cr2-esams
[09:35:42] XioNoX: that's right :)
[09:47:33] All good now :)
[09:57:56] (EdgeTrafficDrop) firing: 62% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[10:00:54] Initially was checking the wrong service name `prometheus-varnish-exporter` instead of `prometheus-varnish-exporter@frontend` and it ended up crashing because it ran out of memory
[10:01:22] That's why it did not recover by itself
[10:03:23] probably worth opening a task to see if it's fixable, it's not great that it runs out of memory just because it lost connectivity
[10:05:10] volans: got it
[10:07:56] (EdgeTrafficDrop) resolved: 58% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[10:32:54] 10Traffic: Prometheus Varnish Exporter fails to start on some instances in DRMRS with Out of Memory Error - https://phabricator.wikimedia.org/T302206 (10MMandere)
[10:34:27] 10Traffic, 10SRE: Prometheus Varnish Exporter fails to start on some instances in DRMRS with Out of Memory Error - https://phabricator.wikimedia.org/T302206 (10MMandere)
[11:07:10] mmandere: ^^ whatever is eating memory on cp6014 already got the 400G again
[11:07:59] right.. purged is quite angry after being offline for a few days
[11:08:45] see https://grafana.wikimedia.org/d/RvscY1CZk/purged?orgId=1&var-datasource=drmrs%20prometheus%2Fops&var-cluster=cache_text&var-instance=cp6014&viewPanel=26
[11:12:54] vgutierrez: That's right... saw that on running top for top mem consumer on the instance...
[11:13:03] 10Traffic, 10MediaWiki-Uploading: ATS 502 on uploading non-small files - https://phabricator.wikimedia.org/T299160 (10MatthewVernon)
[11:17:26] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6016:9331 is unreachable - https://alerts.wikimedia.org
[11:40:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6016:9331 is unreachable - https://alerts.wikimedia.org
[12:15:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6016:9331 is unreachable - https://alerts.wikimedia.org
[14:02:52] 10Traffic, 10DNS: Central and South American countries in geo-maps - https://phabricator.wikimedia.org/T301605 (10MatthewVernon)
[14:06:18] 10HTTPS, 10Traffic, 10Beta-Cluster-Infrastructure, 10SRE: The certificate for upload.wikimedia.beta.wmflabs.org expired on February 16, 2022. - https://phabricator.wikimedia.org/T301995 (10MatthewVernon) 05Open→03Resolved a:03MatthewVernon [I think this task can be closed, since the issue was resolve...
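The 10:00:54 message above turns on a systemd naming detail: the exporter runs as a templated unit, so `prometheus-varnish-exporter` and `prometheus-varnish-exporter@frontend` are distinct units, and only the latter is the instance that crashed. A hedged sketch of inspecting both names side by side, assuming hosts where `systemctl show` is available and reports MemoryCurrent:

```python
# Sketch only: query a few systemd properties for each unit name via `systemctl show`.
import subprocess


def unit_state(unit: str) -> dict:
    """Return selected systemd properties for a unit as a dict."""
    out = subprocess.run(
        ["systemctl", "show", unit,
         "--property=LoadState,ActiveState,SubState,Result,MemoryCurrent"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split("=", 1) for line in out.splitlines() if "=" in line)


if __name__ == "__main__":
    # The bare name and the templated instance are different units; checking the
    # former says nothing about the frontend exporter that actually ran out of memory.
    for unit in ("prometheus-varnish-exporter", "prometheus-varnish-exporter@frontend"):
        print(unit, unit_state(unit))
```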
[14:06:24] 10HTTPS, 10Traffic, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10MatthewVernon)
[14:08:03] 10Traffic: Prometheus Varnish Exporter fails to start on some instances in DRMRS with Out of Memory Error - https://phabricator.wikimedia.org/T302206 (10MatthewVernon)
[14:17:56] (EdgeTrafficDrop) firing: 57% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
[14:22:56] (EdgeTrafficDrop) resolved: 57% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
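For context on the EdgeTrafficDrop alerts quoted above, the percentage describes how far the current request rate sits below a baseline over the 30-minute window; the exact query the alert uses is not visible in this log, so the snippet below is only an arithmetic illustration with invented numbers.

```python
# Illustration only: the baseline and current rates here are made up to
# reproduce a "62% request drop" figure; they are not measured values.
def drop_percentage(baseline_rps: float, current_rps: float) -> float:
    """Percentage drop of the current request rate relative to a baseline rate."""
    if baseline_rps <= 0:
        return 0.0
    return max(0.0, (baseline_rps - current_rps) / baseline_rps * 100)


if __name__ == "__main__":
    print(f"{drop_percentage(8000, 3040):.0f}% request drop")  # -> 62% request drop
```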