[00:25:28] serviceops, GitLab (Infrastructure), Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for hosts: `gitlab1001.wikimedia.org` - gitlab1001.wikimedia.org (**PA...
[00:30:28] serviceops, GitLab (Infrastructure), Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Dzahn)
[01:34:02] serviceops, GitLab (Infrastructure), Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Dzahn)
[02:37:59] serviceops, SRE, Shellbox, Sustainability (Incident Followup): Limit Lilypond shellouts from VisualEditor - https://phabricator.wikimedia.org/T312319 (RLazarus)
[03:20:20] (ProbeDown) firing: (6) Service gitlab1003:443 has failed probes (http_gitlab_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:04:34] serviceops, Patch-For-Review, Performance-Team (Radar), Wikimedia-Hackathon-2022: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (Ladsgroup) \o/ https://noc.wikimedia.org/wiki.php?wiki=dewiki#A
[07:20:20] (ProbeDown) firing: (6) Service gitlab1003:443 has failed probes (http_gitlab_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:45:11] The url in the alert above for gitlab1003 doesn't make it possible for me to see what the issue is
[07:45:44] the first panel says 0% unavailable probes, the 2nd panel does not contain gitlab1003 at all
[07:47:13] oh wait, it's firing right on X:20:20 mark? I see 23:20:20, 03:20:20, 07:20:20 UTC in my IRC logs
[07:48:27] akosiaris: logstash reports issues though
[07:48:48] hmmm it's also marked as 1d ago (great granularity!) in alertmanager
[07:49:03] vgutierrez: ah interesting, got a link handy?
[07:49:42] https://logstash.wikimedia.org/goto/86a69daafa2c849b0cbc663b6240752b
[07:52:13] akosiaris: BTW, a quick test from bast6001 shows that port 443 on gitlab1003 isn't happy
[07:52:20] blackbox checks for gitlab (and otrs) are quite new. At least for gitlab I think they are not configured properly and I'm about to upload a change to fix that hopefully.
[07:53:16] And thanks for the link. It seems prometheus is using the wrong IP to check gitlab
[07:54:08] yep...
[07:54:17] it's using the main iface of the VM
[07:54:37] and apparently is accepting traffic on 208.80.154.15
[07:54:52] that's gitlab-replica.wm.o
[07:54:57] correct. And it's using the wrong "host" for the replica. I'll try to put that in a change and get a review
[07:58:18] cool, thanks for investigating this more
[08:50:05] (ProbeDown) firing: (6) Service gitlab1003:443 has failed probes (http_gitlab_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:54:03] ^ I merged a fix for that alert. let's wait some more time for puppet runs on prometheus hosts. In logstash I can see a "Probe succeeded" and there is some change in the grafana dashboard
[09:16:11] FYI, I've disabled DRBD for kubetcd2004 again
[09:17:43] serviceops, serviceops-collab: ProbeDown - https://phabricator.wikimedia.org/T312194 (fgiunchedi) >>! In T312194#8056198, @Dzahn wrote: >> description: gitlab1004:443 failed > > We configured the checks to test gitlab.wikimedia.org, not gitlab1004:443. I have clarified a bit the wording at https://wiki...
[09:27:06] Blackbox check for GitLab looks fine for me now. No more errors and unavailable probes for GitLab. I'm not sure if there is a resolve message for probe down.
[09:27:48] I'm going to test the blackbox probe by disabling gitlab on the replica (gitlab1003) for a short time
[09:32:14] (ProbeDown) firing: (2) Service gitlab1003:443 has failed probes (http_gitlab_replica_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:33:44] Alert is visible in Prometheus alerts dashboard, email and irc message present. I'm going to start gitlab on the replica again. I'm a bit confused about the hashtag. But I guess that comes from the alert and not the severity
[09:36:02] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Detect and alert on helm releases in unclean state - https://phabricator.wikimedia.org/T310714 (JMeybohm) Open→Resolved
[09:37:05] jelto: https://github.com/wikimedia/puppet/blob/production/modules/prometheus/manifests/blackbox/check/http.pp#L159
[09:37:14] (ProbeDown) resolved: (2) Service gitlab1003:443 has failed probes (http_gitlab_replica_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:37:28] that probably needs a guard for if $severity = page
[09:37:33] godog: ^
[09:38:19] ah yes indeed, good call
[09:38:23] RhinosF1: Thanks for the hint! Yes hardcoding the hashtag may not be the best option.
[09:38:46] godog: I would send a patch but I don't know what would be. Is it just 'critical' or?
[09:39:00] RhinosF1: thank you! literally severity=page
[09:39:11] godog: incoming then
[09:39:19] cheers! brb
[09:43:30] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/811891
[09:44:37] RhinosF1: that doesn't seem to actually update the summary annotation of the alert?
[09:45:06] also, it'd be nice if it didn't duplicate the rest of the alert string
[09:45:07] taavi: yeah hang on
[09:47:55] updated
[09:51:03] aye, thanks RhinosF1
[09:51:29] np godog
[09:52:14] looking at jenkins
[09:53:59] fixed
[09:55:52] Hello folks, I'd need to roll restart eventgate main's pods (following https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate/Administration#Roll_restart_all_pods) to pick up a new event stream
[09:56:13] is it ok to proceed or is there a different/other procedure?
[10:11:47] jelto: I haven't touched the line it moans about
[10:12:23] modules/prometheus/manifests/blackbox/check/http.pp:166 WARNING variable not enclosed in {} (variables_not_enclosed)
[10:12:54] RhinosF1: apologies in advance, there will be likely a merge conflict with https://gerrit.wikimedia.org/r/c/operations/puppet/+/811715 too btw
[10:13:03] I'm about to merge that change
[10:13:15] godog: ok, i can fix
[10:28:15] <_joe_> elukey: lemme check
[10:28:33] <_joe_> elukey: that's the correct procedure
[10:32:35] godog: thanks for the rebase but no idea still why jenkins -1
[10:32:59] _joe_ can it be done anytime? (if so I'll do it this afternoon)
[10:33:26] <_joe_> elukey: only with a full moon, if you don't have the Special Permit.
[10:33:40] <_joe_> yeah, can be done at any time, like you'd do for restarting php
[10:33:41] <_joe_> :)
[10:33:55] super thanks
[10:34:28] <_joe_> the restart, if the chart has good readiness probes, should be zero-impact for users.
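The guard discussed at 09:37-09:39 (only tag the alert summary with `#page` when the severity is literally `page`, and avoid duplicating the rest of the alert string) can be sketched roughly as below. This is an illustration in Python, not the Puppet change from the Gerrit patch; `build_summary` and its parameters are hypothetical names.

```python
def build_summary(base: str, severity: str) -> str:
    """Return the alert summary, appending '#page' only for paging alerts.

    'base' is the shared alert text; it is written once and reused for
    both cases so the string is not duplicated.
    """
    if severity == "page":
        return f"{base} #page"
    return base


base = "Service gitlab1003:443 has failed probes"
print(build_summary(base, "page"))
print(build_summary(base, "critical"))
```

Keeping the shared text in one variable addresses the comment at 09:45:06 about not duplicating the rest of the alert string.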
[11:35:24] serviceops, serviceops-collab: ProbeDown - https://phabricator.wikimedia.org/T312194 (phaultfinder)
[12:32:06] serviceops, Analytics, Data-Engineering, Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Ottomata) Yes.
[12:34:29] serviceops, Patch-For-Review, Performance-Team (Radar), User-notice, Wikimedia-Hackathon-2022: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (Ladsgroup) >>! In T308932#8058396, @Ladsgroup wrote: > \o/ http...
[12:43:40] RhinosF1: I think it's a bit tricky to mix prometheus templating and puppet variables. We could either have some duplicate code/strings or try to escape the prometheus templating. I tried to use {{ \$labels.instance }} instead of {{ $labels.instance }} locally and puppet linting is happy. But I'm not sure if this generates valid prometheus alert config
[12:45:46] jelto: *nod* and unfortunately impossible to test ATM with puppet compiler because exported resources
[12:46:17] jelto: if you'd like to send a new PS with your fixes I'll give it a spin in o11y pontoon stack
[12:46:31] otherwise we can # lint:ignore
[12:50:20] (ProbeDown) firing: (2) Service otrs1001:443 has failed probes (http_ticket_wikimedia_org_ip4) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:54:25] Looks like there's no corresponding failure on the Grafana dashboard, I wonder why this alert is firing
[12:55:30] godog: I uploaded a change which is escaping the prometheus variables: https://gerrit.wikimedia.org/r/c/operations/puppet/+/811984/1 Is this what you need to test this in the pontoon stack?
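The escaping idea from 12:43:40 can be illustrated with a toy model. The sketch below is Python, not Puppet, and does not reproduce Puppet's real parser: it only assumes that a double-quoted string replaces `$name` with a variable's value (empty if undefined) and turns a backslash-escaped dollar into a literal `$`. `interpolate` and `scope` are hypothetical names.

```python
import re


def interpolate(source: str, scope: dict[str, str]) -> str:
    """Toy double-quoted-string interpolation.

    A backslash-escaped dollar becomes a literal dollar sign; an
    unescaped $name is replaced by scope[name], or by an empty string
    when the name is undefined (an assumption of this sketch).
    """
    def replace(match: re.Match) -> str:
        if match.group(1):  # escaped dollar
            return "$"
        return scope.get(match.group(2), "")

    # Alternative 1: backslash + dollar. Alternative 2: dollar + identifier.
    return re.sub(r"(\\\$)|\$(\w+)", replace, source)


unescaped = "Service {{ $labels.instance }} has failed probes"
escaped = r"Service {{ \$labels.instance }} has failed probes"

print(interpolate(unescaped, {}))  # Service {{ .instance }} has failed probes
print(interpolate(escaped, {}))    # Service {{ $labels.instance }} has failed probes
```

In this model the unescaped form loses `$labels` before Prometheus ever sees the template, while the escaped form passes `{{ $labels.instance }}` through untouched.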
[12:57:18] https://gerrit.wikimedia.org/r/c/operations/puppet/+/811985 fixes the actual vrts probe itself
[12:57:33] jelto: I meant a new PS for https://gerrit.wikimedia.org/r/c/operations/puppet/+/811891 but a new review works too
[12:58:02] sobanski: the https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fcustom&var-module=All&orgId=1 link you mean? shows otrs at 0% availability
[12:58:15] it could be a re-notification too, not necessarily a new alert firing
[12:59:38] I can confirm the blackbox check for otrs is at 0% (the blue line for me). But only because it's probing port 443, which is not open on the machine. We should either probe 80, 1443 (envoy), or the service behind lvs(?)
[13:01:19] godog: I thought the percentage was the percentage of failed probes
[13:01:35] As in "0% of unavailable probes"
[13:02:26] sobanski: ah, yeah I see what you mean, I'll change the panel title
[13:02:39] jelto: taavi posted a patch to do just that (check 1443) a few lines above
[13:02:47] oh, I had the same impression too!
[13:02:49] it's the inverse!
[13:04:13] {{done}} let me know if this reads better!
[13:05:35] godog: less misleading now :)
[13:05:50] sobanski: I saw that change, about to add a review there
[13:06:43] jelto: testing the change now
[13:07:00] the escaping one, not vrts, to be clear
[13:07:10] godog: thanks a lot!
[13:08:48] Panel title is much clearer now :)
[13:08:56] 👍
[13:10:34] neat! thanks for the feedback
[13:13:19] I just got back
[13:17:58] jelto RhinosF1 change with escaping LGTM https://phabricator.wikimedia.org/P30947
[13:18:50] godog: do you want me to add escaping to mine
[13:19:10] RhinosF1: *nod* probably easier
[13:19:30] RhinosF1: feel free to do so, I can abandon my test change
[13:19:32] One sec
[13:20:34] jelto, godog: done
[13:20:44] Thanks for the suggestion too jelto
[13:30:38] godog: Jenkins finally likes me
[13:31:33] thanks for testing the escaping, jenkins is happy now :)
[13:31:47] neat, I'll merge
[14:03:01] hmmm did I miss any work on conf1004?
[14:03:34] <_joe_> vgutierrez: what's wrong?
[14:03:47] <_joe_> I think akosiaris is doing something with etcd
[14:03:52] <_joe_> but shouldn't touch conf1004
[14:03:59] lvs1019 got some 400 from conf1004
[14:04:02] me me
[14:04:10] and it's currently alerting
[14:04:24] akosiaris: let me know when you're finished please
[14:04:28] will do
[14:04:33] thx <3
[14:04:53] <_joe_> ahh the tls cert?
[14:06:40] yes
[14:08:11] <_joe_> yeah pybal is picky with etcd restarts
[14:26:59] serviceops, Dumps-Generation, Infrastructure-Foundations, SRE-tools, IPv6: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (Volans) [resuming this task, let me know if you instead prefer a separate one] Some clusters managed by the S...
[15:40:30] serviceops, serviceops-collab: ProbeDown - https://phabricator.wikimedia.org/T312194 (phaultfinder)
[16:50:20] (ProbeDown) firing: (2) Service otrs1001:443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:51:29] serviceops, serviceops-collab, GitLab (CI & Job Runners): DNS/networking not working on Trusted Runners - https://phabricator.wikimedia.org/T311241 (dduvall) Thanks for explaining that, @Jelto ! I was pulling my hair out the other day trying to troubleshoot. Would it be possible to move that script...
[16:59:58] serviceops, serviceops-collab, GitLab (CI & Job Runners): DNS/networking not working on Trusted Runners - https://phabricator.wikimedia.org/T311241 (dduvall) >>! In T311241#8062475, @dduvall wrote: > (The only outlier at that point would be the default docker image which doesn't seem re-configurable....
[19:19:22] serviceops, Patch-For-Review, Performance-Team (Radar), User-notice, Wikimedia-Hackathon-2022: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (Quiddity) Re: Tech News - What wording would you suggest as the...
[19:23:28] serviceops, Patch-For-Review, Performance-Team (Radar), User-notice, Wikimedia-Hackathon-2022: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (Ladsgroup) >>! In T308932#8063144, @Quiddity wrote: > Re: Tech...
[20:10:35] serviceops, GitLab (Infrastructure), Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Dzahn)
[20:11:32] serviceops, GitLab (Infrastructure), Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Dzahn) old VMs completely gone now. all decom boxes checked.
[20:11:49] serviceops, Continuous-Integration-Infrastructure, SRE, serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for h...
[20:13:17] serviceops, Continuous-Integration-Infrastructure, SRE, serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (Dzahn) :) yw doc1001.eqiad.wmnet has now been destroyed (via decom cookbook).
[20:17:31] serviceops, Continuous-Integration-Infrastructure, SRE, serviceops-collab, and 2 others: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653 (Dzahn) In progress→Resolved the original ticket is resolved. doc1001 is gone a...
[20:20:05] (ProbeDown) firing: (4) Service otrs1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:20:09] serviceops, serviceops-collab: ProbeDown - https://phabricator.wikimedia.org/T312194 (phaultfinder)
[20:28:38] serviceops, Patch-For-Review, Performance-Team (Radar), User-notice, Wikimedia-Hackathon-2022: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932 (Quiddity) Ok, added like so https://meta.wikimedia.org/wiki/Tec...
[21:23:51] serviceops, Infrastructure-Foundations, SRE-tools, Wikimedia-Mailing-lists: Support services VIPs with not marked as VIP in Netbox - https://phabricator.wikimedia.org/T295793 (Dzahn) >>! In T295793#8050412, @Jelto wrote: > `gitlab1001` and `gitlab2001` will be decommissioned soon in T307142. So r...
[21:29:58] serviceops, serviceops-collab: monitoring / VRTS - new blackbox check reports 'ProbeDown' - https://phabricator.wikimedia.org/T312194 (Dzahn)
[21:54:11] serviceops, Patch-For-Review: Put conf100[789] in production - https://phabricator.wikimedia.org/T311407 (akosiaris) Took a while but: ` etcdctl --endpoints https://conf1004.eqiad.wmnet:2379 cluster-health member 4cdd4cdde64b18d3 is healthy: got healthy result from https://conf1004.eqiad.wmnet:4001 memb...