[00:00:53] ok, detaching and reattaching and now lsblk shows the right size...
[00:00:58] now have to resize the filesystem...
[00:02:54] ok, resized, let's see if postgres can start now...
[00:04:41] bd808: can you log in now?
[00:06:35] andrewbogott: I'm in! Let me see if I can build an image now too
[00:07:32] andrewbogott: I think we are back in business!
[00:10:25] great! I'm extremely late to family dinner, do you mind updating https://phabricator.wikimedia.org/T354714 and closing it out?
[00:10:39] not at all. have a good dinner andrewbogott
[00:10:41] thx
[00:10:41] https://phabricator.wikimedia.org/T354714
[00:10:48] * andrewbogott rushes out the door
[01:31:07] * bd808 off
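The grow-and-resize sequence at 00:00:53–00:02:54 above would look roughly like the sketch below. This is only a sketch: the device path /dev/sdb, the mount point, and the postgresql unit name are assumptions, and an XFS filesystem would need xfs_growfs instead of resize2fs.

```bash
# Confirm the block device reports the new size after detach/reattach
lsblk /dev/sdb

# Grow the ext4 filesystem to fill the enlarged device (online grow is fine for ext4)
sudo resize2fs /dev/sdb

# Check the extra space is visible, then bring PostgreSQL back up
# (adjust the path and unit name to however postgres is managed on that instance)
df -h /var/lib/postgresql
sudo systemctl start postgresql
```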
[09:44:31] I filed some as subtasks of T354714
[09:44:32] T354714: [harbor,trove] Trove DB filled disk and caused toolforge-build to fail as a result - https://phabricator.wikimedia.org/T354714
[09:44:40] some follow-ups*
[09:46:01] I think that a clear one is using `harbor_up` to detect in an alert if all the components are up (currently we only use pingthing for the UI, but the UI was up)
[09:46:30] I'll create a few tasks too
[09:48:46] I was thinking of adding statistics pushing from the clis themselves, it's something I have done in the past in hosted environments and it helped enormously to debug/detect issues
[09:49:17] but it means that the client has to push stats somewhere (so when we start allowing clis on users' laptops, we might have to add some opt-out dialog or similar)
[11:33:36] the "next up" column on https://phabricator.wikimedia.org/project/view/6918/ is getting quite big - are folks realistically planning to work on all of them during this iteration?
[11:36:43] maybe we could move the unassigned ones back to the backlog? unless there are some unassigned tasks that should be done asap
[11:37:45] yes there are, I normally don't assign tasks to myself when they are in `next up`, only when I move them to `in progress`
[11:37:52] (so anyone can pick them up if they want)
[11:38:41] I agree that should be the rule (not to assign tasks in the backlog), I was suggesting it just for this specific instance to move those back, as it seems we have enough workload for everyone (maybe)
[11:39:30] that's ok, there are a few there that I will work on if nobody takes them, so we can't just remove all of the unassigned ones
[11:39:39] I can assign them to me if that helps
[11:40:04] I'm fine with them remaining unassigned, but 24 tasks in the backlog seems like a lot, plus the ones already in progress
[11:40:25] or just remove some of the lower-priority ones
[11:40:31] I was just trying to find an easy "filter" to reduce that number, but maybe the filter should be the priority and not whether they're assigned or not
[11:41:51] * dhinus wonders if we could experiment with priority triaging in the next toolforge checkin
[11:42:15] we can bring it back, yes
[11:43:12] imo part of the issue is that people add lots of things to the next up column, and then they get carried over iteration after iteration. so maybe an option is that when starting a new iteration, unassigned things go back to the backlog, and people can re-add them if they're planning to work on them
[11:45:20] yep, we were doing that before too
[11:45:33] until we were re-adding everything we removed xd
[11:45:56] I just removed a few tasks that had nobody assigned and seemed lower priority
[11:46:17] LOL, maybe we can simply go through the backlog during the checkin meeting, and make sure there are not more than X tasks, discussing together which ones should stay and which ones can be postponed
[11:47:20] sounds ok, feel free to propose it in the next meeting
[11:48:32] we're also just out of the holidays, I guess things will become smoother if we go back to frequent checkins every 2 weeks
[11:48:37] we also did some ticket grooming sessions for the backlog itself at some point, we can start doing that again (we ended up relying on me + karen doing it, but now it's just me and I've been absent/busy + now it's the whole toolforge)
[11:49:15] that might be something better for a dedicated session though, it can get tedious and long, so timeboxed + <1h + periodic seems appropriate
[11:52:24] some food for thought: https://www.atlassian.com/agile/scrum/backlog-refinement
[11:52:34] btw. note that the 'next up' should try to have more tasks than we are going to finish, so it's ok if there are more there than we can do in an iteration. I agree though that there has to be a limit or it becomes useless (compared to just having the long backlog)
[11:53:52] dcaro: agreed, it's fine to have some extra tasks in the "next up", if they're not too many (hard to define how many is too many :P)
[11:53:53] dhinus: yep, we were doing our own version of that (except we don't have sprint plans, a scrum master or a product owner xd)
[11:54:25] dcaro: LOL I was also thinking that having more people with time to spend on that process would make a big difference :D
[11:56:40] I think that it was useful, especially when we started the build service workgroup + beta, we fleshed out a bunch of tasks + user stories and got it off the ground, was that your impression too?
[11:58:02] yep. "sprint planning" is not too far from what we're doing in the meeting at the start of the iteration, and maybe we could have an additional "backlog refinement" meeting (every month?) that could be optional and not require everyone to attend.
[11:58:10] * dhinus hates proposing to add a meeting :/
[11:58:15] hahahaha
[11:58:17] been there xd
[11:58:42] I wonder if we could do some backlog refinement in the "toolforge workgroup meeting" open to the community
[11:58:51] though it might conflate too many things
[11:59:22] I was thinking the same, though it seems a bit off-topic for others; I think doing the grooming soon after discussing priorities/issues makes a lot of sense
[11:59:38] so if not in the same meeting (which I'm not very convinced about), maybe soon after?
[11:59:50] or do 30min/30min
[12:00:04] yeah or 45+30, something along those lines
[12:01:07] taavi raises a simple question -> dhinus and dcaro discuss process optimization for an hour :D
[12:01:18] let's propose it next tuesday, when we have the monthly
[12:01:21] xd
[13:14:44] OK, I have one more prometheus question... I see the metric working https://grafana.wikimedia.org/d/ebJoA6VWz/wmcs-openstack-eqiad-nova-fullstack?orgId=1&viewPanel=33 but my alert isn't firing https://gerrit.wikimedia.org/r/c/operations/alerts/+/989255/1/team-wmcs/designate.yaml
[13:14:53] what did I mess up with that alert definition?
[13:18:10] andrewbogott: data loaded via node-exporter ends up in the 'ops' prometheus instance and not in 'cloud', so `# deploy-tag: cloud` needs to be `# deploy-tag: ops`
[13:18:38] that metric is in ops, no?
[13:18:43] yep, what taavi said :)
[13:18:58] Same fix from both of you, that's a good sign!
[13:19:27] you can check by using the datasource 'eqiad/labs' -> this means wmcs, or 'eqiad/ops' -> this means ops, when graphing in grafana
[13:20:10] I'm a little surprised that things starting with # do things
[13:20:10] note that the alert will not fire if there's no data at all (as you are seeing)
[13:20:30] yep, it's a wikimedia extra in the comments xd
[13:21:12] dcaro: there's an alert for that :) https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DAlertLintProblem&q=name%3DCloudVPSDesignateLeaks
[13:21:20] In this case the data source is thanos, is that always 'ops'?
[13:21:25] no
[13:21:39] thanos is a frontend that queries all the different prometheus instances around here
[13:22:01] if you look at the metric from thanos it has prometheus="ops" as a label
[13:22:06] ok, I'm trying to understand dcaro's comment about datasource then...
[13:22:18] you can use `cloudvps_designateleaks OR on() vector(-1) != 0` to get '-1' when there are no metrics
[13:22:22] oh, I see it now
[13:22:53] dcaro: that's unnecessary here, pint has already alerted on that :-P
[13:23:01] where is it?
[13:23:06] I linked that above
[13:23:12] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DAlertLintProblem&q=name%3DCloudVPSDesignateLeaks
[13:23:26] it's flagging it as team=sre :/
[13:23:40] NovafullstackSustainedLeakedVMs is there too
[13:23:49] yeah, that linter alert is very useful but I wouldn't have found it without hunting
[13:24:45] hmm.... I like the lint rule, but we should find a way to flag it with the right team so it shows up in our dashboard, otherwise we might want to continue adding the expression
[13:24:58] (until we sort out the team flag or something similar)
[13:25:12] yeah, maybe file a task to see if filippo has any ideas?
[13:25:19] yep, sounds good
[13:25:28] I think they've been triaging those and filing tasks about individual alerts, I remember one about NovafullstackSustainedLeakedVMs
[13:27:35] I'll open a task, that would simplify a bunch of expressions we have around xd
[13:27:51] (it was tricky to find a working one btw.)
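Both points from the exchange above can be sanity-checked against the standard Prometheus HTTP API: which instance a metric lives in (the prometheus="ops" label when queried through Thanos), and how the 'no data' guard behaves. A rough sketch; the endpoint URL is a placeholder, not a real hostname.

```bash
# Placeholder endpoint -- point this at whatever Thanos/Prometheus query API you use
THANOS=https://thanos.example.org

# Which instance exports the metric? Through Thanos the result should carry prometheus="ops".
curl -sG "$THANOS/api/v1/query" \
  --data-urlencode 'query=cloudvps_designateleaks'

# The "no data" guard dcaro mentions: falls back to -1 when the metric is absent,
# so the expression still returns a series instead of silently matching nothing.
curl -sG "$THANOS/api/v1/query" \
  --data-urlencode 'query=cloudvps_designateleaks OR on() vector(-1) != 0'
```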
[13:31:49] several nova-fullstack alerts are showing up with that lint alert, will be good practice for me to sort that out
[13:32:06] T354762
[13:32:07] T354762: [pint,karma] Find a way to forward AlertLintProblem to the right team (ex. using the team=wmcs label) - https://phabricator.wikimedia.org/T354762
[13:33:23] btw. the terraform alert has been failing for a very long time, is that something someone is looking into? (or has looked into)
[13:33:50] andrewbogott: nova-fullstack is T351698, feel free to take that over
[13:33:51] T351698: Linting problems found for NovafullstackSustainedFailures - https://phabricator.wikimedia.org/T351698
[13:33:59] thx
[13:34:13] dcaro: https://gerrit.wikimedia.org/r/c/operations/puppet/+/989496/
[13:34:39] taavi: nice
[13:34:52] running PCC just now
[13:35:27] taavi: does it require any config/software installed on the host? (iirc pint is a separate binary)
[13:35:42] prometheus::pint::source sneakily pulls it in
[13:35:59] ack, +1 from me when you finish testing
[13:44:02] ok, that change removed the linter alert, now I'm in suspense waiting to see the actual alert fire...
[13:52:35] "Detected 5 stray dns records on " https://alerts.wikimedia.org/?q=alertname%3DCloudVPSDesignateLeaks&q=%40receiver%21%3Ddefault :)
[13:54:07] yeah!
[13:54:12] Now I guess I should fix them
[14:00:19] \o/
[14:00:45] with a nice runbook and everything 🎉
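On the pint thread above: the same class of problems that shows up as AlertLintProblem can be caught locally before pushing a rules change. A sketch, assuming the prometheus tooling and the pint binary are installed and you are in a checkout of the alerts repo; the exact invocation and any config file are not taken from the log.

```bash
# Basic syntax check with promtool (ships with the standard prometheus package)
promtool check rules team-wmcs/designate.yaml

# Run Cloudflare's pint linter against the same file; point it at the repo's
# pint config if one exists (this invocation is a sketch, not the repo's CI setup)
pint lint team-wmcs/designate.yaml
```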
[14:55:44] dcaro: re: the terraform alert, I looked into it before the holidays, and got a couple of successful runs, but it still fails 50% or more of the time; usually it's the postgres trove creation. one step forward would be to create two separate alerts: one for postgres trove, and one for everything else.
[14:56:23] now I see the "destroy" alert is triggering, rather than the "create" alert, so it might be a different problem
[14:56:27] * dhinus checks
[15:00:40] hmm the magnum cluster failed to destroy, I think magnum and trove-postgres are the slightly less stable things
[15:54:40] We think magnum is fragile just because it exercises every single api in quick succession. I don't know why postgres is more fragile than other trove things though
[15:59:31] I did re-run the tf tests, and the "destroy" alert has gone, but the "apply" alert has appeared :D
[16:00:54] and this time, it was the web-proxy that failed to create
[16:10:55] taavi: dhinus and I lined up a meeting for 10:00 UTC friday to discuss the hiring panel for arturo's replacement
[16:11:25] I forwarded the invite to you as we both agreed you might have good input. If the time doesn't suit, let me know
[16:53:28] topranks: thanks, that's fine
[17:45:32] * dcaro off
[17:45:36] cya tomorrow
[19:29:15] * bd808 lunch
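When one of the terraform test runs discussed above (14:55–16:00) leaves a magnum cluster or trove instance behind, the leftover resources and their error states can be inspected with the standard OpenStack CLI plugins. A rough sketch, assuming the magnum (coe) and trove (database) client plugins are installed and the right project is selected; the resource names are placeholders.

```bash
# List magnum clusters and their status (e.g. DELETE_FAILED after a bad destroy)
openstack coe cluster list

# Show why a specific cluster is stuck -- the name is a placeholder
openstack coe cluster show tf-test-cluster -f yaml

# Same idea for trove: list database instances and inspect the failed one
openstack database instance list
openstack database instance show tf-test-postgres
```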