[08:09:29] looking
[08:11:13] arnaudb: I am looking at grafana
[08:11:51] checking ATS then
[08:12:51] I am getting errors in grafana
[08:13:39] effie: that might be because of thanos query being down? Idk if there is a workaround for grafana queries
[08:15:31] I am not sure, what is certain is that we do not have a visual
[08:17:36] graphs are back on grafana-rw
[08:21:03] GitLab needs a short maintenance restart at 9:30 UTC (in one hour)
[09:41:32] GitLab maintenance done
[09:41:48] 👍
[10:01:39] o/
[10:01:56] as FYI I just merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1224091 to add a new docker distribution backend for ML to the registry
[10:02:05] I am rolling it out now, nginx will reload etc.
[10:02:10] but nothing should change
[10:02:28] (namely no impact should be noticed by existing clients)
[10:02:36] if you see any weird report, lemme know
[10:02:37] :)
[10:18:52] I'll be disabling Puppet fleet-wide for 5-10 minutes to roll out a Puppet change related to the agent, starting in 5m (unless someone has objections)
[10:20:12] thanks for the heads up. Do we have a way to "lock" puppet deploys, like for mw? (not asking to do it, I just don't know)
[10:22:22] no, there's no such thing ATM
[10:22:47] what's a puppet deploy? if puppet is disabled it's a noop :D
[10:24:24] if it's for puppet-merge you can just run it and keep it at the prompt holding the lock (I think it needs a commit to show the prompt, but it could easily be modified to show it without a commit via a flag)
[10:24:25] According to Wikipedia, a puppet deploy is...
[10:25:05] puppet-merge could also warn you, as a heads-up, that your merge won't take effect for now
[10:25:40] don't spend time on this, this was pure bikeshedding by me, I just wanted to know if it existed :-D
[10:25:47] but given that the set of people merging changes is much smaller compared to mediawiki, we've mostly been fine simply by syncing up on IRC
[10:26:07] 100% agree
[10:46:43] Puppet is re-enabled
[14:59:39] denisse, tappof, is https://status.wikimedia.org/ a part of the meta-monitoring project discussed on T393625?
[15:01:17] andrewbogott: no
[15:01:22] andrewbogott: no, it's not related to the meta-monitoring
[15:01:25] oh dang
[15:01:28] ok
[15:04:52] for visibility in case people do not pay attention to -operations: [16:03:52] On site at eqiad, just noticed a lot of orange warning lights in Rack C3. Looks like a tripped breaker on L3-L1, investigating right now
[15:11:18] dual power is restored to all devices except kafka-main1008
[15:11:26] Failed PSU