[00:03:39] denisse: look in netbox what the status of the host is there. look in icinga if it's gone.. hmm
[00:04:10] but generally just repeating the cookbook is best you can do and should work
[00:04:32] it might show some errors on second run because some things are already gone
[00:05:55] The netbox status is decommissioning but I think it finished successfully after the 2nd run.
[00:06:27] denisse: I think that's normal
[00:06:29] BTW, netmon1002's status is also shown as 'decommissioning' but the cookbook passed everything with it.
[00:06:34] it's still decom for dcops
[00:06:51] yea, i think there is no problem
[00:06:54] Yeah, it looks like the decommission worked, thanks!!
[00:09:05] Is sirenbot up??
[00:09:09] !alerts
[00:09:20] sirenbot: help
[00:09:30] !incidents
[00:09:31] 3192 (RESOLVED) [FIRING:1] PHPFPMTooBusy appserver (ops php7.4-fpm.service page codfw prometheus sre)
[00:09:31] 3191 (RESOLVED) [FIRING:9] ProbeDown (probes/service ops page prometheus sre)
[00:09:31] 3193 (RESOLVED) [FIRING:1] FrontendUnavailable cache_text (page thanos sre)
[00:09:31] 3190 (RESOLVED) [FIRING:1] ProbeDown (10.2.2.44 ip4 docker-registry:443 probes/service http_docker-registry_ip4 ops page eqiad prometheus sre)
[00:10:00] It's up, but It's not voiced. Maybe that's why it didn't update its status.
[00:10:14] hmmm
[00:37:29] !issync
[01:32:43] !issync
[01:32:43] Error: You don't have permission to update channel settings
[01:32:55] grum
[02:16:28] !issync
[02:16:28] Syncing #wikimedia-sre (requested by legoktm)
[02:16:29] No updates for #wikimedia-sre
[02:21:34] !refresh-topic
[02:21:47] :|
[02:24:55] \o/
[08:07:31] FYI, bast1003/bast2002 reboots incoming in a few minutes
[08:27:13] both done
[09:38:37] !incidents
[09:38:38] 3192 (RESOLVED) [FIRING:1] PHPFPMTooBusy appserver (ops php7.4-fpm.service page codfw prometheus sre)
[09:38:38] 3191 (RESOLVED) [FIRING:9] ProbeDown (probes/service ops page prometheus sre)
[09:38:38] 3193 (RESOLVED) [FIRING:1] FrontendUnavailable cache_text (page thanos sre)
[10:09:15] sirenbot needs a header to out.
[10:09:20] output*
[10:09:29] something like "Incidents in the last 24H"
[10:18:50] nah, it just needs timestamps
[10:20:47] +1 on timestamp
[10:22:02] remember- don't ask what sirenbot can do for you, but just submit your patch reviews on gerrit or gitlab!
[10:25:24] timestamps are fine too. But an aggregation hint (so I don't have to mentally do hourly math) would be useful too
[10:26:05] even more awesome would be to be able to do !incidents since=202X-YY-ZZTHH:MM
[10:26:28] up to the expected limits ofc
[12:10:15] Is it possible to look up codfw hieradata in eqiad and vice versa? Currently swiftrepl deploys credentials by hand(!), and in replacing it I'd like to do better. both eqiad and codfw have profile::swift::accounts_keys: which I need a particular value out of (mw_media). I could presumably duplicate the two values into something under hieradata/common/profile/ (like is done for swift::replication_keys: ) but that seems a bit sad
[12:47:48] vgutierrez: available for a quick review of https://gerrit.wikimedia.org/r/868661 ?
[12:48:18] Emperor: what's the role used?
[12:49:13] if the credentials are the same for both DCs, you can use "role/common/%{::_role}.yaml"
[12:49:20] and avoid the duplication
[12:50:47] if the credentials are not the same, the role hierarchy will probably not help you much and you'll have to fall back to hieradata/common/...
[12:51:24] arturo: happy to take a look after lunch
[12:52:44] vgutierrez: nevermind then. I was hoping for something more immediate. The server is undergoing reimage and failing to run puppet. Thanks anyway!
[12:53:18] Looks good assuming that the fqdn is the right one
[12:55:58] akosiaris: the eqiad and codfw credentials are different, but the thing I'm writing (swiftrepl replacement) will need _both_ credentials in _each_ DC
[12:57:44] Emperor: ah, then put them under the role/common/{::_role} hierarchy in a hash keys eqiad, codfw and use them according to the $::site variable.
[12:57:56] keyed*
[12:59:57] Thanks, I'll try that (but I think it will be harder than that because of how this is currently used)
[13:00:48] also, being private makes it all about 3 times more confusing :-/
[13:05:31] I guess I'll have to move all of profile::swift::accounts_keys: into common
[13:06:18] Emperor: one thing I do to reduce confusion is to put a comment placeholder on the public repo (in addition to the public private one)
[13:06:49] something like # the password for this goes in the private repo ::mypass::very_private
[13:07:31] (I do it because it simplifies later grepping in the public one)
[15:36:38] effie: do you know if this monitoring script is still used? https://gerrit.wikimedia.org/r/c/operations/puppet/+/868528/3/modules/nagios_common/files/check_commands/check_all_memcached.php
[15:38:13] I am not sure I have seen that before
[15:39:44] wait
[15:45:19] heh
[15:45:31] Krinkle trying to make my hard work redundant
[15:46:16] I think he is making me work on a friday afternoon right before I was about to shut my laptop lid
[15:46:38] Krinkle: I can't say for sure, a quick look does not yield something that is active
[15:46:50] and I do not remember such alert going off ever either
[15:46:57] I think it's been a while
[15:47:12] so I would recommend we remove it, I can merge it on Monday, and any leftovers
[15:47:58] if you go to icinga and search for memcached, there are sane checks there already
[15:48:35] so this one is either redundant, or not useful, or not active
[15:49:37] it's been around since before 2014
[16:22:10] akosiaris: YM something like https://gerrit.wikimedia.org/r/c/operations/puppet/+/868721 and https://gerrit.wikimedia.org/r/c/labs/private/+/868718 ?
[17:00:01] denisse, mutante: all yours, enjoy your Friday <3
[17:00:12] wrong channel.. proper message :)
[17:00:31] vgutierrez: Thanks, happy weekend!!
[17:15:14] jbond: I merged your spdx patch
[17:24:09] andrewbogott: ahh thanks
[18:20:26] bd808: I made that bug report, but really.. don't worry about it too much https://phabricator.wikimedia.org/T325381
[18:23:47] Thanks mutante. _j.oe_'s idea to look at the capture length and just ignore it if the task id is very short feels like it would be a simple and valid fix. Basically if it's less than T2001 ignore the capture.
[18:23:48] T2001: [DO NOT USE] Documentation is out of date, incomplete (tracking) [superseded by #Documentation] - https://phabricator.wikimedia.org/T2001
[18:25:11] bd808: or just less than T23 I think
[18:25:11] T23: Identify features Gerrit users would miss in Phabricator - https://phabricator.wikimedia.org/T23
[18:25:17] :p
[18:26:15] because it would just be the hour field in a date
[18:27:17] bonus points for not linking to Differential when talking about data center racks in row D, for example D4
[18:27:17] D4: iscap 'project level' commands - https://phabricator.wikimedia.org/D4
[18:27:30] agreed :)
[18:28:24] I think at this point the Differential stuff should just come out of Stashbot, but agreed that this is another annoyance in its regex approach.
[18:31:00] I think most of the annoying D[1-9] linking happens inside Phabricator though right?
[18:31:44] +1 for just removing differential stuff
[18:39:33] In practice pholio mocks (M\d+) are really not used enough that Stashbot should care either.
[18:40:26] M147 is still cool though ;)
[18:40:27] M147: A badge of honorable shame - https://phabricator.wikimedia.org/M147
[18:52:31] after adding a new keyholder key in private repo, where do we add them so that they show up in /etc/keyholder.d on deployment server
[18:54:27] ah, keyholder::agent I guess
[18:55:38] found it I think, first not because I did not look in Hiera. hieradata/role/common/deployment_server/kubernetes.yaml
[19:24:47] we are creating a new deployment key for keyholder to deploy jenkins with scap. but we have 2 sets of jenkins, one on contint* and one on releases*. Now wondering if one "deploy_jenkins" key is fine to use on both servers or if I am expected to have different keys for different sets of target servers. assuming the group of people who do them is the same.
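
Editor's sketch of the fix floated at 18:23 to 18:26 (ignore a captured task ID when it is too small to be a real task, since it is probably the hour field of a timestamp). This is not Stashbot's actual code: the pattern `NAIVE_TASK_RE`, the function `find_task_ids` and the `min_id` parameter are hypothetical names, and the default threshold of 2001 is simply the value mentioned in the discussion (23 was the suggested alternative).

```python
import re

# Hypothetical naive pattern: any "T" followed by digits, which also
# matches the "T18" inside an ISO timestamp such as 2022-12-16T18:23.
NAIVE_TASK_RE = re.compile(r"T(\d+)")


def find_task_ids(message, min_id=2001):
    """Return task IDs found in message, dropping captures below min_id.

    min_id is an assumed cut-off taken from the chat ("less than T2001,
    ignore the capture"); it is not a documented Stashbot setting.
    """
    task_ids = []
    for match in NAIVE_TASK_RE.finditer(message):
        # Compare numerically so that "T0018" is treated like "T18".
        if int(match.group(1)) >= min_id:
            task_ids.append("T" + match.group(1))
    return task_ids


if __name__ == "__main__":
    print(find_task_ids("!incidents since=2022-12-16T18:23 see T325381"))
```

With the lower threshold suggested at 18:25, `find_task_ids(text, min_id=23)` would still link T23 itself, including an hour field of exactly 23, so the higher cut-off trades a few very old tasks for fewer false links.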