[10:20:03] great work arturo andrewbogott on the cloudgw migration yesterday!
[10:39:33] thanks!
[10:48:44] FIRING: InstanceDown: Project tools instance tools-k8s-worker-nfs-7 is down
[10:50:54] unrelated, there is also CRITICAL slave_sql_lag Replication lag: 75742.21 seconds on clouddb1013
[10:51:04] I'm looking at the clouddb alert, it looks like the lag is going down
[11:07:42] data-persistence folks say it was caused by index rebuilds
[11:07:48] tools-k8s-worker-nfs-7 is back up
[11:11:51] ...and down again :/
[11:25:18] and back up. I'm looking at the logs, and I see some OOM errors in journalctl
[11:27:43] those could simply be k8s terminating some pods exceeding their memory limit
[11:28:12] I will ignore it for now, unless it goes down again
[11:42:31] ack
[13:57:11] dhinus: I have nudged dc people about that recurring heat alert again. So many emails!
[13:57:41] ooh, there's another one
[14:25:36] andrewbogott: thanks for the nudge :) if it keeps flapping, we can silence it for a week or so
[15:24:24] ^ I added a 7-day silence for that clouddump temperature alert
[15:45:29] thank you!
[16:03:40] arturo: you have more or less hijacked T320750 from its original purpose of enabling more features for https://wikitech.wikimedia.org/wiki/Help:Using_OpenTofu_on_Cloud_VPS users ...
[16:03:41] T320750: Support managing Cloud VPS project membership via OpenTofu - https://phabricator.wikimedia.org/T320750
[16:19:46] taavi: ok, I see
[16:21:58] taavi: feel free to revert, I'm in a meeting, then going to go offline
[17:14:26] taavi: I agree, I would de-couple that task from the tofu-infra task
[17:22:38] taavi: I removed the parent link and rephrased again the task description
[17:32:13] andrewbogott: I found how to modify the routing of wikitech-static alerts https://gerrit.wikimedia.org/r/c/operations/puppet/+/1117575
[17:32:50] but I'm in two minds if it makes sense to change that if we don't know who's going to maintain it :)
[17:33:08] if you remove us does it become a global alert of some sort? Or a no-one alert?
[17:33:48] oh, I see, SRE has *
[17:33:51] yep
[17:34:05] it could be easier to ignore though...
[17:34:24] So does that mean it's /already/ alerting SREs?
[17:34:28] Or is it a first match thing
[17:34:34] first match
[17:34:49] so no way to make it appear on both dashboards?
[17:35:30] I'm happy to continue to be mostly responsible, I just want it to be visible to other SREs in case I'm not around
[17:36:08] I don't think so unfortunately, it's one or the other
[17:36:19] it's visible in icinga as well, but I'm not sure many people look there
[17:36:43] I guess we can write a runbook mentioning who to escalate to when you're not around :)
[17:37:13] there are 5 alerts defined in icinga for that host
[17:38:27] yeah, I guess we need to leave it as is then for now.
[17:40:50] thank you for looking
[17:40:56] I can write the runbook if you know the answer to "who should we escalate to" :P
[17:41:40] turns out we have a runbook, but I'm not sure it's up to date: https://wikitech.wikimedia.org/wiki/Wikitech-static
[17:42:14] I think the other people who have worked on it are: chris danis, clément, Daniel. And the docs there are pretty accurate.
[17:43:00] will they have access to the new AWS environment?
[17:43:35] the AWS setup is a ways off, will need different monitoring anyway
[17:43:44] Or, at least, some of the monitoring won't apply
[17:43:46] ah ok, I think it was initially a 1:1 replacement
[17:43:54] *I thought
[17:44:02] I started off thinking it would be but got carried away :)
[17:44:07] haha fair enough
[17:44:28] since it won't run mediawiki there's no need to check sw versions, for one thing
[17:46:15] makes sense yep
[17:54:56] andrewbogott: I added a note at the top of https://wikitech.wikimedia.org/wiki/Wikitech-static, let me know if you think it's not accurate
[17:55:29] yep, seems good. thank you
[19:10:16] sort of related to T380095: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1114111
[19:10:17] T380095: Do not create DNS zones for projects outside default domain - https://phabricator.wikimedia.org/T380095
[19:14:50] looks good, thanks taavi