[09:00:13] question about command line UI: would it be OK to annoy the user with an explicit (e.g.) --execute flag, and do --dry-run by default if the operation is dangerous and irreversible (deleting backups)?
[09:00:35] jynus: +1 for dry-run by default
[09:00:46] basically, what I mean is dry-run by default, or would it actually be counterproductive?
[09:01:24] I am checking for examples and percona-toolkit does it, but I haven't seen many tools that do that
[09:01:41] as long as the output is clear that it's a dry-run and people don't get confused thinking they actually did the operation
[09:01:53] yeah, that is another fear
[09:02:00] I'd vote for consistency with our other tools
[09:02:32] XioNoX: which is the normal behaviour - execute by default?
[09:05:24] so my intention here is for the user to be aware of the dangerous and irreversible operation
[09:05:36] not sure how to achieve that
[09:06:50] jynus: warning prompt with a yes/no
[09:06:52] jynus: I would still prefer to go for a dry-run or a double confirmation after the initial run
[09:06:53] and that it shouldn't be run casually, but also make it simple to avoid mistakes, given how critical the operation is
[09:07:09] Yeah, what XioNoX said: "are you sure you want to DELETE this backup?"
[09:07:11] Or something like that
[09:54:53] I think I found some ways to minimize user error:
[09:55:47] first, have a separate script for querying data (which may have to be used as input for the deletion one)
[09:56:54] second, abort deletion of backups if the file is available on production - I don't see a use case for an "easy" deletion if the file is still available publicly on the wikis (only for deleted or non-existent files)
[09:58:33] so if you ask to delete "Commons-logo.svg" it fails
[09:58:51] smart
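[A minimal Python sketch of the safeguards discussed above: dry-run by default with an explicit --execute flag, a yes/no confirmation before the irreversible step, and aborting when the file is still live on the wikis. Everything here - the script shape, the flag names, delete_backup, is_public_on_wikis - is a hypothetical illustration, not the actual WMF tooling.]

    #!/usr/bin/env python3
    """Hypothetical backup-deletion CLI sketch: dry-run by default."""
    import argparse
    import sys


    def is_public_on_wikis(title: str) -> bool:
        """Placeholder check; a real tool would query production here."""
        return title == "Commons-logo.svg"  # simulate a file still live on the wikis


    def delete_backup(title: str) -> None:
        """Placeholder for the irreversible deletion."""
        print(f"DELETED backup of {title}")


    def main() -> int:
        parser = argparse.ArgumentParser(
            description="Delete a file backup (dry-run by default)")
        parser.add_argument("title", help="file page title, e.g. Example.svg")
        parser.add_argument(
            "--execute", action="store_true",
            help="actually perform the deletion; without this flag only a dry-run happens")
        args = parser.parse_args()

        # Abort if the file is still available on production: no use case for
        # an "easy" deletion of a backup whose file is still public on the wikis.
        if is_public_on_wikis(args.title):
            print(f"ABORT: {args.title} is still available on the wikis", file=sys.stderr)
            return 1

        if not args.execute:
            # Make it unambiguous that nothing happened, so nobody mistakes
            # a dry-run for the real operation.
            print(f"DRY-RUN: would delete backup of {args.title} (pass --execute to do it)")
            return 0

        # Double confirmation before the dangerous, irreversible step.
        answer = input(f"Are you sure you want to DELETE the backup of {args.title}? [y/N] ")
        if answer.strip().lower() != "y":
            print("Aborted, nothing deleted.")
            return 1

        delete_backup(args.title)
        return 0


    if __name__ == "__main__":
        sys.exit(main())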
[11:18:51] good afternoon, may I get a `puppet-merge` for a change made to the beta cluster please? It was made to let us deploy scap with scap and is already deployed on the local puppet master :) https://gerrit.wikimedia.org/r/c/operations/puppet/+/804568/
[11:19:54] hashar: done
[11:19:57] thanks jbond! :]
[11:20:07] I too easily forget puppet patches :D
[11:20:12] :)
[14:31:23] herron: may I trouble you with some sync on Prometheus for our (ML) staging cluster?
[14:31:55] (feel free to redirect me to one of your compatriots :))
[14:34:16] klausman: hey, sure I can have a look. what do you mean by sync?
[14:34:32] puppet broken on VMs - it's me fixing it, sorry about that
[14:35:25] herron: https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters/New#Prometheus has a section about "volume creation", I just wanted to make sure I'm okay to go ahead with that bit (and restart prometheus@k8s-mlstaging.service)
[14:35:44] As for size, we'd probably do the same as the other VGs, i.e. 80G
[14:36:07] fixed
[14:36:41] almost..
[14:41:34] puppet is fixed, I'll fix the motd later on
[14:41:46] klausman: LGTM, this doc is a little out of date (the hosts have been refreshed since then) so I'll update that now too
[14:42:04] thanks!
[14:44:11] Oh, and seeing as there is only vg0, I presume the whole ssd vs hdd thing is obsolete, too?
[14:46:23] klausman: yes indeed, the doc should reflect that now too
[14:46:28] just submitted
[14:46:34] thanks again!
[14:46:37] np!
[14:54:22] all done
[14:57:20] moritzm: quick question. if package X depends on package Y, and package Y gets removed, should package X be affected as well?
[14:58:13] what happened here was that anycast-healthchecker depends on bird, but we removed bird and installed bird2 - did that cause anycast-healthchecker to be removed as well?
[15:00:37] relevant puppetboard changes: https://puppetboard.wikimedia.org/report/durum1001.eqiad.wmnet/f890be5a48c6198d5e3aff2fa1339596103215d6 (bird2 installed)
[15:00:42] and then failure: https://puppetboard.wikimedia.org/report/durum1001.eqiad.wmnet/dffd3ec32ca38ff833b066daf71e2d5495832d23
[15:00:58] anycast-healthchecker itself:
[15:00:59] Depends: python3:any (>= 3.2~), bird, python3-anycast-healthchecker (= 0.8.2-1)
[15:06:47] depends on how the removal takes place, if you try to remove Y with e.g. dpkg, it will complain and abort, but higher level package managers like apt have resolution strategies which may lead to X also being removed
[15:07:21] if anycast-healthchecker supports both bird and bird2 we can simply make it declare an alternative depends on X | Y
[15:08:02] that should prevent removals
[15:10:14] A few codfw servers have started alerting to say they've lost power redundancy. I assume that's not expected.
[15:10:34] RhinosF1: Yep, I pinged papaul on -operations
[15:10:44] Ty
[15:17:29] there was a PDU maintenance scheduled for today happening in A3 IIRC
[15:18:15] https://phabricator.wikimedia.org/T309957
[15:19:34] vgutierrez: Ah, good point then
[15:19:36] I just created https://phabricator.wikimedia.org/T311245
[15:19:38] So closing that
[15:20:43] BTW he logged it on SAL: https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:47] see 14:34 :)
[15:20:56] moritzm: thanks! so I will just do bird | bird2 instead of doing just bird2
[15:25:20] ack, that will fix it
[16:48:48] <^demon> #wikimedia-serviceops
[18:18:04] godog: envoy on phab now listening on IPv6. we deployed your change
[18:22:22] sukhe: re flapping mgmt. there is a good chance that is fixed if you can get local hands to do a DRAC firmware upgrade in eqsin https://phabricator.wikimedia.org/T311264#8023977
[18:23:00] thanks! just commented on the task as well
[18:24:08] ack, linked ticket from codfw
[18:24:26] that's a _detailed_ task :)
[18:26:31] yea, and this should be the same thing as well. <+icinga-wm> PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:27:57] yeah...
[18:29:01] I guess in eqsin we use smart hands?
[18:29:17] I think so
[18:35:10] sukhe: I asked. dcops can do them remotely and it already has the right tag
[19:58:24] just caused a "widespread puppet failure" which has been fixed. just noticed though that when you search Icinga for it there are 2 checks for that, and one is OK while the other is CRIT https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=widespread
[19:58:57] there used to be host names there instead
[19:59:44] it would tell you which hosts actually have the issue
[20:52:04] the makevm cookbook failed at the DNS sync step: "(0/14) success ratio (< 100.0% threshold) for command: 'cd /srv/authdns/...nippets --deploy'. Aborting." I was given the options of 'retry', 'skip' and 'abort' after that and picked abort. then when it tried to remove the DNS record again the same issue happened and I picked abort again
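[Returning to the anycast-healthchecker dependency fix discussed at 15:00-15:25: a sketch of how the amended debian/control stanza might look. Only the original Depends line and the bird | bird2 idea come from the log; the assumption is that the replacement package is literally named bird2 and that anycast-healthchecker works with either daemon.]

    Package: anycast-healthchecker
    Depends: python3:any (>= 3.2~), bird | bird2, python3-anycast-healthchecker (= 0.8.2-1)

[With an alternative dependency like this, dpkg and apt consider the relation satisfied as long as either bird or bird2 is installed, so replacing bird with bird2 no longer drags anycast-healthchecker out with it.]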