[02:00:57] (VarnishTrafficDrop) firing: 58% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[02:05:57] (VarnishTrafficDrop) resolved: 63% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[05:16:52] Traffic, MW-on-K8s, Performance-Team, SRE, serviceops: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (jijiki)
[06:57:09] elukey: re: VarnishTrafficDrop firing often lately, the alert is indeed spammy and annoying. The reason is that codfw depooled, for historical context see T201630
[06:57:10] T201630: False alarms on varnish-http-requests 70% GET drop in 30 min alert - https://phabricator.wikimedia.org/T201630
[06:58:45] I also see that the $labels.site information is gone from the alert now, weird
[06:58:58] ema: thanks!
[06:59:00] ie: drop in text@ during [...] instead of drop in text@codfw during
[06:59:35] I'll open a task
[06:59:41] ty elukey
[07:09:49] Traffic, SRE, SRE Observability: VarnishTrafficDrop alert false positives due to DCs depooled - https://phabricator.wikimedia.org/T291148 (ema)
[07:26:58] Traffic, SRE, SRE Observability: VarnishTrafficDrop IRC alert does not include DC name anymore - https://phabricator.wikimedia.org/T291149 (ema)
[09:48:12] Traffic, DNS, SRE: One more DNS request for Wikilearn - https://phabricator.wikimedia.org/T291090 (Vgutierrez) a:Vgutierrez
[09:56:12] Traffic, DNS, SRE, Patch-For-Review: One more DNS request for Wikilearn - https://phabricator.wikimedia.org/T291090 (Vgutierrez) Open→Resolved ` $ host -t A learn.wiki learn.wiki has address 76.223.57.52 learn.wiki has address 13.248.190.88 `
[12:28:03] Traffic, SRE, SRE Observability: VarnishTrafficDrop alert false positives due to DCs depooled - https://phabricator.wikimedia.org/T291148 (BBlack) The solution to this in the icinga version of this check was to include an additional term in the prometheus query that would cause a null result if the a...
[13:04:24] Traffic, SRE, SRE Observability: VarnishTrafficDrop alert false positives due to DCs depooled - https://phabricator.wikimedia.org/T291148 (ema) p:Triage→Medium
[13:07:08] Traffic, SRE, SRE Observability: VarnishTrafficDrop alert false positives due to DCs depooled - https://phabricator.wikimedia.org/T291148 (ema)
[13:07:46] Traffic, SRE, SRE Observability: VarnishTrafficDrop alert false positives due to DCs depooled - https://phabricator.wikimedia.org/T291148 (ema) >>! In T291148#7358739, @BBlack wrote: > The solution to this in the icinga version of this check was to include an additional term in the prometheus query t...
[13:25:50] Traffic, SRE, SRE Observability, Patch-For-Review: VarnishTrafficDrop IRC alert does not include DC name anymore - https://phabricator.wikimedia.org/T291149 (ema) p:Triage→Medium
[14:32:38] Traffic, MW-on-K8s, Performance-Team, Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (dancy)
[14:44:37] Traffic, SRE, Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (ssingh)
[14:44:41] Traffic, SRE, Patch-For-Review: Deploy durum: check service for Wikidough - https://phabricator.wikimedia.org/T289536 (ssingh) Open→Resolved a:ssingh durum has been deployed and is now running on all our PoPs. Marking this as closed. Thanks to @Dzahn for helping create all the VMs!
[15:44:49] Traffic, MW-on-K8s, Performance-Team, Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (dancy)
[15:46:40] Traffic, MW-on-K8s, Performance-Team, Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (dancy)
[16:00:57] (VarnishTrafficDrop) firing: 62% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[16:05:57] (VarnishTrafficDrop) resolved: 65% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[16:34:06] ^ this is still the codfw false alarms, still working out a more-robust solution
[17:12:25] Traffic, Analytics, Analytics-Kanban: Review use of realloc in varnishkafka - https://phabricator.wikimedia.org/T287561 (Ottomata) a:odimitrijevic
[17:18:57] (VarnishTrafficDrop) firing: 66% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[17:23:57] (VarnishTrafficDrop) resolved: 65% GET drop in text@ during the past 30 minutes - https://grafana.wikimedia.org/d/000000180/varnish-http-requests?viewPanel=6 - https://alerts.wikimedia.org
[18:53:29] Traffic, DC-Ops, SRE, decommission-hardware, ops-ulsfo: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (RobH)
[18:53:49] Traffic, DC-Ops, SRE, ops-ulsfo: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (RobH) Open→In progress
[20:52:43] Traffic, DC-Ops, SRE, ops-ulsfo: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (RobH)
[21:04:33] Traffic, DC-Ops, SRE, ops-ulsfo: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ganeti4004.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-rei...
[21:04:38] Traffic, DC-Ops, SRE, ops-ulsfo: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti4004.ulsfo.wmnet'] ` Of which those **FAILED**: ` ['ganeti4004.ulsfo.wmnet'] `
[21:04:59] Traffic, DC-Ops, SRE, ops-ulsfo: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ganeti4004.ulsfo.wmnet ` The log can be found in `/var/log/wmf-auto-rei...
[21:31:41] Traffic, DC-Ops, SRE, ops-ulsfo: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti4004.ulsfo.wmnet'] ` and were **ALL** successful.
[21:34:20] Traffic, DC-Ops, SRE, decommission-hardware, ops-ulsfo: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (RobH)
[21:34:40] Traffic, DC-Ops, SRE, ops-ulsfo: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (RobH) In progress→Resolved