[15:08:11] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: (Need By: TBD) setup/config PDU in drmrs ( ps1-b12 and ps1-b13) - https://phabricator.wikimedia.org/T294597 (Papaul)
[15:08:56] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: (Need By: TBD) setup/config PDU in drmrs ( ps1-b12 and ps1-b13) - https://phabricator.wikimedia.org/T294597 (Papaul) Open→Resolved Complete
[18:50:56] (VarnishPrometheusExporterDown) firing: (2) Varnish Exporter on instance cp4036:9331 is unreachable - https://alerts.wikimedia.org
[18:55:56] (VarnishPrometheusExporterDown) resolved: (2) Varnish Exporter on instance cp4036:9331 is unreachable - https://alerts.wikimedia.org
[18:57:10] sukhe: 4036 wasn't one that alarmed in -ops
[18:57:16] yeah
[18:57:52] It just went off there
[19:15:43] RhinosF1: prometheus exporter down = no more data / graphs about varnish but not the same as varnish itself down. still matters though of course
[19:15:51] yeah
[19:17:25] mutante: some were giving alerts in ops about pooled status
[19:17:33] But yes graphs less important
[19:18:44] Icinga is adding new checks that are switching from PENDING to actually alerting
[19:18:49] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=1
[19:19:01] Makes sense
[19:19:02] happens when the role is applied to new hosts
[19:19:14] you can downtime them but only once they exist
[19:19:43] Ah
[19:20:03] Host wide wouldn't copy across to new?
[19:20:55] they are adding mw4033 through mw4036 in https://phabricator.wikimedia.org/T290694
[19:21:04] at some point the puppet role gets applied
[19:21:12] s/mw/cp
[19:21:19] eh, yea, cp
[19:21:21] But ye
[19:21:32] you can only downtime hosts in icinga once they have been created
[19:21:36] so after the first puppet run
[19:21:59] I assume puppet on alert* is running before they're setup
[19:22:02] Fully
[19:22:02] to avoid this kind of thing you'd have to sit there and watch the run and downtime them individually
[19:22:43] Icinga picked up that they exist, but the hosts not setup yet?
[19:22:46] when puppet runs the first time on the new host that will create puppet resources
[19:22:53] Ah!
[19:23:04] and then when it runs the next time on alert* that will see the exported resources
[19:23:17] and create the new Icinga config snippet
[19:24:27] you can't tell icinga "hey, don't alert for these host regexes I will create in a minute". it will say I have not heard of this host yet.. so yea. that's why this is kind of common problem with Icinga in puppet and new hosts
[19:28:04] Traffic, DC-Ops, SRE, ops-ulsfo: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (Dzahn) Icinga alerts that were added by puppet started firing and raised some questions but confirmed it was just about these new hosts and they just switched...
[19:29:56] https://wikitech.wikimedia.org/wiki/Icinga#Avoid_Icinga_spam_on_new_server_installs
[19:30:51] ^ that's a hassle but yea
[19:34:32] Traffic, DC-Ops, SRE, ops-ulsfo: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (BBlack) Yeah sorry for the noise - we weren't anticipating the hosts re-puppeting themselves into the productions roles (incorrectly!) and should've just puppe...
[19:35:16] mutante: the main problem is we decided to take a break between doing the puppet patch for them and doing the actual reimage, and I told marc he could go ahead and merge the patch and we'll reimage them tomorrow.
[19:35:50] but I failed to think about the fact that we should puppet-disable them in the meantime, etc (so they don't half-configure themselves like they are now, because they're already running in the setup role)
[19:37:02] bblack: gotcha:) yea, no problem. I know it's a balance somewhere also between how much effort we spend on avoiding alerts vs just saying "that's us" when they happen and you did warn us in the meeting about drmrs, yet still made people nervous when cp* hosts alerted in ulsfo
[19:37:36] but then we could see they are matching the ones in the ticket
[19:39:38] it's kind of tricky to actually avoid the alerts
[19:40:21] basically just works in that small time window where icinga checks exist in the PENDING state
[21:31:57] Traffic, Wikimedia Enterprise: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (RBrounley_WMF)
[23:01:09] Traffic, SRE, serviceops: Reconcile MediaWiki POST timeout and Varnish/ATS timeouts - https://phabricator.wikimedia.org/T294800 (Legoktm)
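
Note on the 19:21 to 19:24 explanation: the ordering problem described there (the new host exports its resources on its first puppet run, the next run on alert* collects them, and only then can Icinga accept a downtime for the host) can be sketched as a toy model. Everything below is hypothetical and for illustration only: the classes `ToyPuppetDB` and `ToyIcinga` and both `puppet_run_*` functions are invented names, not Puppet or Icinga APIs, and the hostname is just taken from the log.

```python
"""Toy model of the ordering described in the log. Not real Puppet/Icinga APIs."""


class ToyPuppetDB:
    """Hypothetical stand-in for the store of exported resources."""

    def __init__(self):
        self.exported_hosts = set()


class ToyIcinga:
    """Hypothetical stand-in for the monitoring server on the alert* host."""

    def __init__(self):
        self.known_hosts = set()
        self.downtimed = set()

    def schedule_downtime(self, host):
        # A downtime can only be set for a host Icinga already has config for.
        if host not in self.known_hosts:
            raise LookupError(f"have not heard of host {host!r} yet")
        self.downtimed.add(host)

    def would_alert(self, host):
        # In this model a known host alerts unless it has been downtimed.
        return host in self.known_hosts and host not in self.downtimed


def puppet_run_on_new_host(db, host):
    # First run on the new host: its monitoring resources get exported.
    db.exported_hosts.add(host)


def puppet_run_on_alert_host(db, icinga):
    # Next run on the alert host: exported resources become Icinga config.
    icinga.known_hosts |= db.exported_hosts


db, icinga = ToyPuppetDB(), ToyIcinga()

try:
    icinga.schedule_downtime("cp4036")  # too early: Icinga has no such host
    early_downtime_worked = True
except LookupError:
    early_downtime_worked = False

puppet_run_on_new_host(db, "cp4036")
puppet_run_on_alert_host(db, icinga)
alerts_before_downtime = icinga.would_alert("cp4036")

icinga.schedule_downtime("cp4036")  # possible only now that the host exists
alerts_after_downtime = icinga.would_alert("cp4036")

print(early_downtime_worked, alerts_before_downtime, alerts_after_downtime)
```

The model leaves out the PENDING state mentioned at 19:40; in the log, that state is the short window in which the downtime has to be scheduled before the new checks start alerting.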