[03:07:42] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Papaul)
[03:34:32] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Papaul)
[06:54:13] netops, Infrastructure-Foundations, SRE: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (elukey) Hello folks! Not sure if already scheduled but it seems that the current icinga checks for the codfw ripe atlas are getting a 410 gone, do we need to update the `ripeatlas_measuremen...
[07:16:57] (VarnishTrafficDrop) firing: 60% GET drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=esams - https://alerts.wikimedia.org
[07:21:56] (VarnishTrafficDrop) resolved: 63% GET drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=esams - https://alerts.wikimedia.org
[07:29:56] (VarnishTrafficDrop) firing: 60% GET drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=codfw - https://alerts.wikimedia.org
[07:34:56] (VarnishTrafficDrop) firing: (2) 65% GET drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://alerts.wikimedia.org
[07:59:56] (VarnishTrafficDrop) resolved: 65% GET drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=codfw - https://alerts.wikimedia.org
[08:52:56] (VarnishTrafficDrop) firing: 64% GET drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=codfw - https://alerts.wikimedia.org
[08:57:56] (VarnishTrafficDrop) resolved: 67% GET drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=codfw - https://alerts.wikimedia.org
[11:41:01] Traffic, SRE, SRE Observability, User-ema: varnishmtail metric loss due to performance issues - https://phabricator.wikimedia.org/T293879 (ema)
[11:41:08] Traffic, SRE, SRE Observability, User-ema: varnishmtail metric loss due to performance issues - https://phabricator.wikimedia.org/T293879 (ema) p:Triage→High
[11:45:26] fun of the day ^
[11:47:33] Traffic, SRE, SRE Observability, User-ema: varnishmtail metric loss due to performance issues - https://phabricator.wikimedia.org/T293879 (ema)
[11:48:07] fun indeed!
[11:52:40] Traffic, SRE, SRE Observability, User-ema: varnishmtail metric loss due to performance issues - https://phabricator.wikimedia.org/T293879 (ema)
[12:10:59] Traffic, SRE, SRE Observability, User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (ema)
[12:11:23] even writing the task requires quite some iterations :)
[12:14:37] Traffic, SRE, Performance-Team (Radar), User-ema: Package and deploy Varnish 6.0.8 - https://phabricator.wikimedia.org/T292290 (ema) Open→Resolved a:ema All hosts upgraded.
[12:23:10] netops, Infrastructure-Foundations, SRE, Patch-For-Review: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (ayounsi) >>! In T267714#7443286, @elukey wrote: > Hello folks! Not sure if already scheduled but it seems that the current icinga checks for the codfw ripe atlas are ge...
[12:26:07] netops, Infrastructure-Foundations, SRE, Patch-For-Review: ripe-atlas-codfw is down - https://phabricator.wikimedia.org/T267714 (cmooney) In progress→Resolved Cool, thanks @ayounsi. Good insight into how those alerts are configured. I'll know for the next time to update them too :)
[12:58:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp3062:9331 is unreachable - https://alerts.wikimedia.org
[13:03:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp3062:9331 is unreachable - https://alerts.wikimedia.org
[13:10:47] Traffic, SRE, SRE Observability, User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (ema) Trying the lowest possible hanging fruit first, namely rising vsl_space. I've first tried setting it to 512M as mentioned in the SAL...
[13:14:35] Traffic, SRE, SRE Observability, User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (ema)
[14:01:55] netops, Infrastructure-Foundations, SRE: Eqiad Expansion - LVS Connectivity Options - https://phabricator.wikimedia.org/T292630 (cmooney) IRC update from Brandon. Traffic are checking if option 2B is viable with management. > Brandon Black > topranks: question_mark is going to talk with f...
[14:35:35] Traffic, Observability-Logging, SRE, User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (fgiunchedi)
[15:26:51] bblack: I'm gonna leave puppet disabled on cp3062, I've given more space to /var/lib/varnish/ and passed -p vsl_space=3072M to varnishd
[15:27:36] not anticipating any issues, but feel free to depool the node and roll back the changes in case anything breaks
[15:28:33] ema: that's a text node?
[15:28:48] vgutierrez: it is
[15:29:08] hmm be sure to check OCSP then for wikiworkshop.org
[15:29:23] disabling puppet stops OCSP response refresh for acme-chief certs
[15:30:17] Next Update: Oct 24 05:59:58 2021 GMT
[15:30:29] you should be OK as long as you re-enable puppet before than that ;)
[15:30:42] sure, will do tomorrow morning :)
[15:30:47] icinga is going to complain 2 days before that
[15:30:48] thanks for checking!
[15:30:57] but even with that you should be OK
[15:38:48] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (aborrero) a:aborrero→ayounsi
[16:30:28] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Papaul)
[17:00:11] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Papaul)
[18:19:02] Traffic, Discovery-Search, SRE, observability: cloudelastic icinga TLS cert alerts - https://phabricator.wikimedia.org/T293826 (Dzahn) same thing happened today for lists.wikimedia.org, it alerted and then recovered 2 minutes later. In general we have renewal = 7 days and alerting = 7 days. we...
[18:19:53] Traffic, Discovery-Search, SRE, observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (Dzahn)
[18:45:10] Traffic, Discovery-Search, SRE, observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (Legoktm) this sounds like https://etbe.coker.com.au/2021/10/20/strange-apache-reload-issue/ which I read yesterday via Planet Debian
[19:08:35] Traffic, SRE, Wikimedia-Incident: 2021-09-18 Wikimedia sites down - https://phabricator.wikimedia.org/T291311 (Krinkle)
[19:08:51] Traffic, SRE, Wikimedia-Incident: 2021-09-18 Wikimedia sites down - https://phabricator.wikimedia.org/T291311 (Krinkle)
[19:09:02] Traffic, SRE, Wikimedia-Incident: 2021-09-18 Wikimedia sites down - https://phabricator.wikimedia.org/T291311 (Krinkle)
[19:12:59] Traffic, SRE, Wikimedia-Incident: 2021-09-18 Wikimedia sites down - https://phabricator.wikimedia.org/T291311 (Krinkle)
[19:23:44] Traffic, SRE, Wikimedia-Incident: 2021-09-26 (UTC) Wikimedia sites down - https://phabricator.wikimedia.org/T291765 (Krinkle)
[19:25:52] Traffic, Discovery-Search, SRE, observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (herron) >>! In T293826#7445653, @Legoktm wrote: > this sounds like https://etbe.coker.com.au/2021/10/20/strange-apache-reload-issue/ which I...
[20:21:57] (VarnishTrafficDrop) firing: 65% GET drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=eqsin - https://alerts.wikimedia.org
[20:26:57] (VarnishTrafficDrop) resolved: 65% GET drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=eqsin - https://alerts.wikimedia.org
[21:09:38] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) db2078.mgmt mw2253.mgmt
[21:57:57] (VarnishTrafficDrop) firing: 47% GET drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=eqsin - https://alerts.wikimedia.org
[22:02:16] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn)
[22:17:57] (VarnishTrafficDrop) firing: (2) 67% GET drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://alerts.wikimedia.org
[22:27:57] (VarnishTrafficDrop) resolved: (2) 67% GET drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://alerts.wikimedia.org
[23:11:32] netops, Infrastructure-Foundations, SRE: Eqiad Expansion - LVS Connectivity Options - https://phabricator.wikimedia.org/T292630 (RobH)