[03:06:44] win 8
[07:01:04] mmm splunk thinks I'm oncall
[07:02:24] that's wrong
[07:03:11] fabfur: According to the spreadsheet you are, along with godog
[07:03:23] I set an override for him
[07:03:24] But not with jayme
[07:03:36] I guess we have yet another week with issues on the oncall schedule
[07:04:17] I overrode Jaime for the 14/04 week
[07:05:27] The thing is, "Business Hours EMEA" shows it right
[07:05:34] so maybe it is only a bot issue?
[07:07:22] So, if it was me, I fixed it, but I think now fabfur you cannot receive p*ges
[07:07:44] Either the bot or the workflow is wrong
[07:12:54] and godog has the same issue
[07:13:59] thanks jynus
[07:14:27] well, it is not yet fixed
[07:14:55] as in, I think it is wrong but the bot likes it
[07:15:44] What is the difference between SRE Business Hours EMEA and SRE Business Hours (EMEA)?
[07:18:32] I would be tempted to remove it and recreate it, but it doesn't let me create an override for today. So not touching it more
[08:01:14] <_joe_> are the oncall issues fixed?
[08:01:28] <_joe_> go.dog is off btw
[08:02:11] I'm not oncall anymore, according to splunk
[08:02:20] and that's correct
[08:02:27] seems all fine now
[08:02:39] I'm still oncall - which is also correct :D
[08:02:47] <_joe_> ok
[08:08:28] jynus: you typically have to set both if you're overriding someone's on-call (the docs do at least try to say this); I think your override for fab.fur is correct (mod the clinic duty override, which I think is unnecessary)
[08:09:04] The problem is- my override only has 3 inputs, not 4
[08:09:46] yes that's weird
[08:10:12] See, something was weird- I did it right but something else is weird
[08:10:13] <_joe_> indeed
[08:10:23] it is not like my first override
[08:10:29] makes me wonder if fab.fur is currently not on the batphone list
[08:11:34] indeed that is my suspicion, as I said before
[08:11:52] yeah, looking at batphone, fab.fur doesn't appear there, which would explain why there's no option to override them for that
[08:12:00] :-/
[08:12:15] I wouldn't mind a test message jayme, just to be 100% sure things are ok
[08:12:24] if you are ok with that
[08:12:29] sure
[08:15:07] I am going to start a second mysql process in db1204
[08:15:24] that will be an intentional alert- please ignore it
[08:26:36] So I got the alert, which is enough for me, but are we not printing it to -operations anymore?
[08:27:31] jynus: Probably not related, but wikibugs has been down for hours
[08:27:51] let me see which bot used to do that
[08:29:07] in theory it is jinxer-wm, which was working
[08:34:42] I don't think there is anyone awake atm that handles alertmanager, so be aware of this issue
[08:41:35] we could try restarting the alertmanager-irc-relay service maybe? In its logs I don't see much, just some self-ratelimiting up to 6:41 this morning
[08:42:55] It is weird, because it seems to be producing the other outputs; it just missed the victorops alerts
[08:43:32] volans: if it is a simple restart command, please go ahead, but I suspect it may be something more subtle
[08:44:30] do you have IRC alerts in any channel since your test?
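A minimal sketch of the checks being discussed here, assuming shell access to the alerting host (alert1002 comes up later in the log) and that the relay runs as the systemd unit named in the conversation; these are standard systemd commands, not the exact ones run during this incident.

```
# Check the relay before reaching for a restart; the unit name is taken from
# the conversation, the time window is illustrative.
sudo journalctl -u alertmanager-irc-relay --since "2 hours ago"  # look for errors or self-ratelimiting
systemctl status alertmanager-irc-relay                         # confirm the unit is still active
# Only if the logs point at a stuck process:
# sudo systemctl restart alertmanager-irc-relay
```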
[08:44:50] I've never restarted it and I don't see anything in wikitech to suggest "how to do it", but it's a systemd unit :)
[08:45:49] last one (for not p*ging) was for me at 10:01:58 CEST
[08:46:07] 45 minutes ago
[08:46:59] let me do another test, a regular alert
[08:47:35] ack
[08:47:36] actually, there it is (it wasn't me)
[08:48:02] jinxer-wm working normally, at least for regular alertmanager output
[08:48:02] ok then the irc relay works
[08:48:31] I'm curious why it has so many logs about flooding if there were no alerts.. but not for now :)
[08:48:36] it was just the victorops one, which is why I think I need someone from obs for further debugging
[08:49:16] but emails and my notifications worked, so no worries for now
[08:50:02] I will ping andrea later
[08:50:35] ack, thx
[08:51:31] volans: just for my knowledge, which host did you check?
[08:51:54] the one about the floods?
[08:52:33] alert1002: journalctl -f -u alertmanager-irc-relay
[08:52:43] thanks
[08:56:18] jynus: volans I'm going to take a look
[08:58:14] oh, thank you!
[09:05:24] jynus: jayme volans I'm going to push a test p.age via amtool to check if the whole pipeline is working as expected. Then I'll review the result to confirm everything's functioning properly. Does that sound good to you?
[09:06:05] ok
[09:06:39] sure
[09:06:56] for the record I did something similar, but through icinga (which is what didn't work)
[09:07:23] (the irc, the rest did)
[09:08:45] (no print either on start or resolution)
[09:10:42] Ah, ok ok, jynus ... sorry, I missed that message in the IRC log. So just to confirm, you got the alert on VictorOps, but nothing showed up on IRC?
[09:11:24] yes, both the app and mail received it
[09:11:35] no print by jinx
[09:13:31] ok, I'll check ...
[09:23:58] jynus: jayme volans Looks like ircecho was stuck, I've just restarted the unit..
[09:25:15] if you have a summary later of how you noticed that, I would be happy to hear it, so next time I can do it myself
[09:47:42] jynus: The last message from icinga-wm was on 2025-05-15 at 13:54:57. I remembered a similar situation happening back in March, where ircecho was stuck, and I found the same error in the logs as described in https://phabricator.wikimedia.org/T389937
[09:48:16] Impact of thumbnail steps of cdns, thumbor and swift https://phabricator.wikimedia.org/T360589
[09:48:27] https://phabricator.wikimedia.org/T360589#10832485
[09:48:34] (correct link, I need coffee)
[09:48:55] jynus: If you'd like, we can redo the test you ran this morning
[09:54:55] Amir1: nice!
[09:59:02] <_joe_> Amir1: I have some questions about your methodology, but also
[09:59:31] <_joe_> the hit-local graph, the only one where there's an actual difference (assuming the grey areas are the confidence intervals)
[09:59:41] <_joe_> I see "data" is lower than "regression"
[09:59:52] <_joe_> meaning we have a worse hit-local ratio than expected?
[10:00:35] <_joe_> I don't know if the units on the y axis are bananas per second squared or something else though, given there's no UNITS
[10:01:01] <_joe_> sorry, you showed a graph to a recovering astrophysicist, you should know better :D
[10:01:01] _joe_: the unit of the y axis is hit counts, not percentage
[10:01:06] hits per day
[10:01:09] <_joe_> ok still
[10:01:19] <_joe_> it means fewer hits per day now than expected?
[10:01:29] <_joe_> that would not be good, right?
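The actual amtool command behind the 09:05 test is not shown in the log; the sketch below illustrates what pushing a short-lived test alert with amtool can look like. The Alertmanager URL, label names, and annotation are assumptions, not the command that was actually run.

```
# Hypothetical test alert via amtool; URL and labels are illustrative only.
amtool alert add \
  --alertmanager.url=http://localhost:9093 \
  alertname=PipelineTest severity=page team=sre \
  --annotation=summary='Test alert to verify the paging/IRC pipeline' \
  --end="$(date -u -d '+15 minutes' +%Y-%m-%dT%H:%M:%SZ)"
# The --end timestamp lets the test alert auto-resolve after 15 minutes.
```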
[10:01:56] <_joe_> sorry, not trying to take you down, trying to understand what I'm missing :)
[10:02:21] it depends on the context: if miss and hit-local are both visibly going down, it means hit-front is going up
[10:02:51] <_joe_> yeah but using the same analysis you don't see it
[10:03:10] the problem is that hit-front is 70% of all requests, so if you add them, it gets lost in the noise
[10:03:19] <_joe_> tbh, I don't think using cache-text trends to infer cache-upload trends is going to give you significant results anyways
[10:03:23] <_joe_> and yes, that :)
[10:03:55] the regression gives some decent results :D
[10:03:59] <_joe_> what I mean is - do you see any variation in the relative percentages of hit-front, hit-local and miss?
[10:04:07] <_joe_> yeah well :P
[10:04:28] <_joe_> like, what does the data look like before massaging :)
[10:06:16] let me show you
[10:06:21] tappof: thanks, should I do it again?
[10:06:45] https://grafana.wikimedia.org/d/000000500/varnish-caching?orgId=1&refresh=1d&var-cluster=cache_upload&var-site=$__all&var-status=1&var-status=2&var-status=3&var-status=4&var-status=5&from=now-90d&to=now&timezone=utc&viewPanel=panel-8
[10:06:48] click on backend
[10:13:57] <_joe_> if it ever loads
[10:13:58] <_joe_> :D
[10:14:15] jynus: I still don't have proof that was the only thing broken, since no alerts have been triggered by Icinga after I restarted the service. So if you'd like to run the test again, that's totally fine with me; otherwise, I'll keep an eye on the Icinga dashboard and wait for the first successful notification
[10:16:17] ok, doing
[10:16:27] jynus: ack, thank you
[10:17:01] I am going to start a second mysql process in db1204; that will be an intentional alert- please ignore it
[10:22:17] tappof: I think the only thing left would be to file a ticket to improve reliability- either an alert, or detection and an automatic restart
[10:22:26] but that doesn't have to be now
[10:23:01] I will comment on T389937
[10:23:01] T389937: ircecho (icinga-wm) was stuck on alert1002 - https://phabricator.wikimedia.org/T389937
[10:23:31] yeah jynus the task is still open
[13:28:04] jayme Thanks for bringing up T394640. I've just downtimed all of the EQIAD cirrus/elastic hosts since they are not pooled. Sorry for the noise and please let me know if you're still seeing alerts
[13:28:05] T394640: Reduce noise from Elasticsearch / OpenSearch alerts to make triaging easier for on-call - https://phabricator.wikimedia.org/T394640
[13:28:30] oops, that was for jynus
[13:28:37] * jayme nods :)
[13:31:55] arturo:
[13:32:19] eh fat fingers :D if you come across my patches (2, in mw-cron) in your puppet merge, feel free to merge
[13:36:35] arturo: can you finish up your puppet-merge?
[13:45:54] claime: sorry, doing now
[13:46:33] claime: done now
[13:46:39] cc Raine
[13:46:42] ty
[13:47:58] Raine: merging your patches
[13:48:17] ty
[13:48:47] Raine: you can run puppet on deploy and do your helmfile apply
[13:48:51] it's done merging
[13:49:03] ok, thanks <3
[13:57:14] Is there a way to tell at a glance which hosts are failing in a specific pybal pool?
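The answers below point at the PybalBackendDown alert on alerts.wikimedia.org and at a PyBal Grafana dashboard; a rough CLI equivalent of that Alertmanager query is sketched here, assuming amtool can reach the production Alertmanager (the URL below is a placeholder).

```
# Rough CLI equivalent of the alerts.wikimedia.org query linked below;
# the Alertmanager URL is a placeholder, not a real endpoint.
amtool alert query alertname=PybalBackendDown \
  --alertmanager.url=http://alertmanager.example.org:9093
# Narrowing to one pool would need a matcher on whatever service/pool label
# the PybalBackendDown alert actually carries.
```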
[13:57:19] Failing health checks, that is
[14:02:10] inflatador: one possible way, looking at the backend down alert: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DPybalBackendDown
[14:02:38] if you remove the other filters you might see the silenced ones too
[14:02:42] volans ACK, thanks for the advice
[14:14:58] Created T394676 to kick this around when I have time, LMK if you are interested in working together on this
[14:14:58] T394676: Create tool that displays real-time load balancer health status per pool/node - https://phabricator.wikimedia.org/T394676
[14:44:06] re ^^, looks like there is a PyBal dashboard that captures this info: https://grafana.wikimedia.org/goto/5trVHs-HR?orgId=1
[16:00:33] dhinus we are receiving alerts about puppet `PuppetZeroResources` on cloudvirtXXXX hosts on -traffic
[16:00:52] don't know if you're the right person to ask, but is there something we can do about this?
[16:01:03] (looks like puppet is really stuck on these hosts)
[16:03:59] fabfur: not at my laptop right now, can you try pinging in #wikimedia-cloud-admin ?
[16:04:12] sure!
[16:04:43] somehow they're coming in as team=traffic on alertmanager?
[16:05:08] yep
[16:05:44] ah
[16:05:51] they're installed as role(insetup_noferm)
[16:05:56] which has profile::contacts::role_contacts: ['Traffic']
[16:07:27] and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1147772 moved them to a role which fails to compile
[16:07:29] andrewbogott: ^
[16:27:29] :)
[16:28:50] taavi: thanks!
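While the broken role is being sorted out, one way to quiet the mis-routed PuppetZeroResources alerts could be a temporary Alertmanager silence. A minimal sketch follows, assuming amtool access; the instance regex, duration, and Alertmanager URL are illustrative assumptions, not something done in this log.

```
# Temporary silence for the mis-routed alerts; regex, duration, and URL are
# illustrative assumptions.
amtool silence add \
  --alertmanager.url=http://alertmanager.example.org:9093 \
  --author="$USER" \
  --duration=4h \
  --comment="cloudvirt hosts stuck on a role that fails to compile, see gerrit 1147772" \
  alertname=PuppetZeroResources 'instance=~cloudvirt.*'
```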