[08:37:38] inflatador: Ah right, that's my bad, I forgot to run the sync-netbox cookbook after handing it over to dcops
[10:19:44] Maintenance_bot added the SRE tag to a bunch of frtech-related procurement tasks (T369947) [I don't know why they've just shown up on the clinic duty dashboard today]. Should it? ISTM that frtech things aren't necessarily anything to do with us, but I don't want to just have a tag-war with the bot
[10:19:44] T369947: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947
[10:21:03] [I have experimentally tried removing the SRE tag from just that task, let's see if it comes back]
[10:28:40] Emperor: you can ask Rob, he is the one most likely to have the answer. There is also https://phabricator.wikimedia.org/T385208 that seems relevant but no recent changes
[10:35:32] <_joe_> XioNoX: I spent 10 seconds wondering what Special:RecentChanges had to do with all the above :D
[10:37:22] you should take a nap :)
[10:37:38] <_joe_> or, this place has broken me
[10:40:42] Emperor: it's because of the ops- tags, and only happens once they become public and the bot can see them
[10:40:42] https://gitlab.wikimedia.org/ladsgroup/Phabricator-maintenance-bot/-/blob/master/project_grouper.py?ref_type=heads#L37
[10:42:04] argh, now I must fix the typo in the comment
[10:42:18] (or next action if the bot didn't do it at first)
[10:44:28] Emperor: Might be easier just to rename the phab project ;)
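For context on the tagging behaviour Emperor was seeing, the grouping lives in the bot's project_grouper.py (linked above). The sketch below is a rough, hypothetical illustration only: the real bot talks to Phabricator's Conduit API and keeps its own mapping, and the names GROUP_PREFIXES and umbrella_tags are invented here. The idea is simply that a task carrying a project whose name starts with a known prefix such as "ops-" also gets the matching umbrella tag, e.g. SRE.

    # Hypothetical sketch of prefix-based project grouping; not the bot's actual code.
    GROUP_PREFIXES = {
        "ops-": "SRE",  # assumption: ops-* rack/setup projects imply the SRE tag
    }

    def umbrella_tags(project_names):
        """Return the umbrella tags implied by a task's current project tags."""
        tags = set()
        for name in project_names:
            for prefix, umbrella in GROUP_PREFIXES.items():
                if name.startswith(prefix):
                    tags.add(umbrella)
        return tags

    # A public task tagged "ops-eqiad" (plus anything else) would pick up "SRE":
    print(umbrella_tags(["ops-eqiad", "procurement"]))  # {'SRE'}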
[13:32:59] claime: NBD, I figured it wouldn't affect anything ;)
[13:36:56] Emperor: thanks for sharing the link about the patches on Gitlab and the related debugging
[13:37:25] it broke my mental model a bit of how I have usually done gbp, hence the confusion. Still have many questions, but I just merged the changes, left patches/ alone, and am moving on for now
[13:38:26] in my mind, putting changes against a pristine tar in patches/ and seeing them applied helps :)
[13:44:26] I wonder if it might be worth me doing another talk about how the process works at some point, given it's been a while and we now have the staging repo in place.
[16:36:47] sukhe, herron, marostegui, akosiaris: moving here so it doesn't get lost in alert noise, I suspect https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135385 is the trigger
[16:36:55] Amir1, _joe_
[16:37:29] Let's revert then?
[16:37:32] Just to be on the safe side
[16:37:44] https://gerrit.wikimedia.org/r/1135471
[16:38:49] <_joe_> rzl: this is a monitoring issue right?
[16:38:53] <_joe_> the site is up
[16:39:07] <_joe_> sorry, I was heading out, I'm in my 13th hour officially now
[16:39:08] it smells like it to me yeah
[16:39:09] Yes, everything is up
[16:39:11] the hosts are up
[16:39:19] And the read only is fine
[16:39:21] how could that even be the cause?
[16:39:30] <_joe_> yeah I was looking at all the metrics that would explode if this was real
[16:39:39] I don't even have any reason to suspect that patch except for the timing
[16:40:34] it could be the alerting infra having hw issues?
[16:40:37] where should I be running PCC and/or should I stagger it?
[16:40:40] Could not connect to localhost:3306 though?
[16:40:48] Amir1: doesn't seem to be anything else I can see
[16:41:09] does icinga also use that hiera variable to decide what ports to monitor? and do non-ms DB hosts not listen on port 3306?
[16:41:59] sukhe: is it deployed?
[16:42:15] I'd assume it'll take half an hour
[16:42:15] marostegui: yeah, as in merged, but I am not really sure where to run PCC to quickly effect a change?
[16:42:26] sukhe: Let me run it on db1180 and see if it recovers
[16:42:40] herron:
[16:42:41] tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN 1575/mysqld
[16:42:41] on db2207 a mysql is listening on 3306. localhost is ::1
[16:42:42] >Host '::1' is not allowed to connect to this MariaDB serverConnection closed by foreign host.
[16:42:44] tcp6 0 0 :::3306 :::* LISTEN 1575/mysqld
[16:42:55] [18:42:45] <+icinga-wm> RECOVERY - MariaDB read only s6 on db1180 is OK: Version 10.6.20-MariaDB-log, Uptime 3655943s, read_only: True, event_scheduler: True, 5125.54 QPS, connection latency: 0.032303s, query latency: 0.000575s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[16:42:59] So it was the patch
[16:43:42] yeah I mean, +3330 vs -3306 where Icinga expects it to
[16:44:00] I have no idea beyond that but at least correlation causation
[16:44:17] https://www.irccloud.com/pastebin/yBLATuBj/
[16:44:28] just ran on db1169 to be double sure
[16:44:31] what mutante said, just in paste form
[16:44:33] and the recovery is there
[16:45:09] Notice: /Stage[main]/Wmfmariadbpy/File[/etc/wmfmariadbpy/section_ports.csv]/content: content changed '{sha256}672d9b0df558dec61a74a317dd11da44d3602985974315eee932a309bd37c634' to '{sha256}921fdeb2a8618f03fb7554fad9c3e614948e362389848ace8553ac18681efd69'
[16:45:15] 12:45:10 <+icinga-wm> RECOVERY - MariaDB read only s6 #page on db2229 is OK: Version 10.6.20-MariaDB-log, Uptime 3664405s, read_only:
[16:45:51] so I am guessing, an agent run on O:mariadb::core?
[16:46:03] +1
[16:46:07] yeah
[16:46:11] staggering it a bit with -b11 and running
[16:46:36] running, 193 hosts
[16:47:31] sukhe: FYI you can safely use larger -b for puppet if in a hurry ;)
[16:48:12] I think it's non-urgent, a monitoring problem only, and it's not getting worse now that the revert is merged
[16:48:27] ack
[16:48:29] volans: yeah I do that, but nothing was really broken so I didn't want to do it on 193 together
[16:48:40] I could have done a higher batch I guess
[16:48:53] I don't get why the patch broke that though
[16:49:07] sure sure, T280622 still applies :)
[16:49:07] T280622: Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622
[16:49:08] <_joe_> that's something we can figure out tomorrow
[16:49:24] glad it was monitoring only
[16:50:56] marostegui: my guess is that the ports are mutually exclusive, the monitoring doesn't do 3306 on s4, it goes with 3314
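To make the port-mismatch theory above concrete: the reverted patch changed /etc/wmfmariadbpy/section_ports.csv (see the puppet notice at 16:45:09), and the monitoring derives the port it probes for each section from that mapping. If the mapping says a section lives on 3314 while mysqld on a given host only listens on 3306, the check fails even though the database is healthy. Below is a minimal sketch of that lookup, assuming a simple two-column section,port CSV; the real file format and the actual check code may differ.

    import csv

    def section_port(csv_path, section, default=3306):
        """Return the port configured for a DB section, falling back to 3306."""
        with open(csv_path, newline="") as fh:
            for row in csv.reader(fh):
                if row and row[0] == section:
                    return int(row[1])
        return default

    # If the mapping says s4 -> 3314 but mysqld only listens on 3306,
    # a TCP/MySQL check against this port fails even though the host is up.
    print(section_port("/etc/wmfmariadbpy/section_ports.csv", "s4"))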
[16:55:48] resolves are coming in. Thanks to rzl for spotting that patch.
[16:58:35] so <3
[16:58:39] so much <3
[16:58:42] to everyone
[16:58:52] \o/
[16:59:07] thanks sukhe for getting the fix out quickly
[17:14:38] jhathaway: I have two questions! 1) Is there anything of yours on cloudcontrol1005 that I should salvage? (See email for context) 2) When my puppet run pauses for a long time while saying 'Loading facts', is it really spending all that time loading facts, or is it doing a bunch of other things that just don't write anything to the console?
[18:28:06] andrewbogott: let me check cloudcontrol...
[18:28:40] nothing to keep for me
[18:29:09] great
[18:29:14] on the second question, possibly
[18:29:17] * andrewbogott has discovered facter --timing
[18:29:19] have you tried running with debug
[18:29:28] ah yes and that
[18:29:28] it looks like out of the 40 seconds my puppet runs are taking, 30 seconds are facter
[18:29:37] oooh, not great, which facts?
[18:30:20] fact 'cloud.provider', took: (26.028736) seconds
[18:30:44] So... going to figure out what that's doing! I bet it's a timeout
[18:31:48] the docs say 'This is currently only populated on nodes running in Microsoft Azure.' which is clearly not true
[18:32:03] :)
[18:32:30] "cloud.provider" also turns out to be impossible to google
[18:42:50] which box is this on, andrewbogott?
[18:43:11] basically any cloud-vps server. But, right now: abogott-perftests.testlabs.eqiad1.wikimedia.cloud
[18:43:35] I'm trying to just switch off that fact entirely in facter.conf but no luck so far
[18:56:34] blocking is wonky, because you have to block by fact group:
[18:56:37] facts : {
[18:56:39] blocklist : [ "virtualization" ],
[18:56:41] }
[18:57:00] but then you don't have the virtual and is_virtual facts, which your puppet code probably uses
[19:00:18] this seems to work
[19:00:23] https://www.irccloud.com/pastebin/n0EPX7di/
[19:00:51] not sure if that breaks other things, hopefully not
[19:02:32] jhathaway: do you prefer if I do that for just cloud-vps hosts or for everything everywhere? That fact isn't especially slow in prod but it does evaluate.
[19:06:07] andrewbogott: I think everywhere is fine
[19:06:33] I don't understand how adding cloud.provider fixes it, but I'm glad it does
[19:06:45] since that isn't one of the block groups
[19:09:08] me neither, but I have to block /both/ those entries or nothing changes
[19:10:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135482 -- going to apply just to the cloud-vps puppet server first and make sure it doesn't break creation of new VMs.
[19:10:09] yeah, this bug confirms that, https://github.com/puppetlabs/facter/issues/2690
[19:12:27] it's weird!
[19:13:31] * andrewbogott adds that bug to the commit message
[19:15:30] looks like the facter docs are just a bit unclear, you can block single facts as well, so it is an array of block groups or single facts
[19:19:12] My tests look good, can I get a +1 on that? And, ok if I merge now while you're around, just in case of prod catastrophe?
[19:19:36] (it will have a delayed effect, since it has to apply on the first run and then takes effect on the second run)
[19:19:47] looking,
[19:20:08] right, which has the nasty effect of it being impossible to roll back with a revert
[19:20:21] if the catalog compilation is failing
[19:20:27] yeah :(
[19:22:13] ec2_metadata is used in modules/openstack/templates/nova/vendordata.txt.erb, does that matter?
[19:23:26] Heh, didn't expect my mariadb patch to backfire that way. Sorry about that, once I saw the 2 +1s I assumed it was going to be fine.
[19:23:44] jhathaway: let me double check...
[19:25:11] oh, that's not coming from facter so it shouldn't matter (it's confusing: puppet compiles that .erb on the nova metadata server, but the actual lookup is done by cloud-init)
[19:26:06] ok
[19:31:24] comment added
[19:32:50] +1'd
[19:38:32] I did multiple runs on a prod host, no ill effects.
[19:38:54] great, I'll keep an eye on puppetboard
[19:38:56] Thank you for your help with this! I still want puppet to be faster, but this is going to make everything in cloud-vps much, much better.
[19:39:12] yup, glad it was an easy improvement
[19:39:27] before:
[19:39:31] real 0m46.458s
[19:39:31] user 0m10.197s
[19:39:31] sys 0m3.893s
[19:39:37] after:
[19:39:39] https://www.irccloud.com/pastebin/N0hoP2Pe/
[19:43:40] very nice
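For anyone chasing a similar slowdown later: facter --timing is what surfaced the 26-second cloud.provider fact above, and a single suspect fact can also be cross-checked by resolving it on its own and timing the call. The sketch below assumes facter is on the PATH; resolving one fact by name is not exactly the same code path as a full puppet run, so treat the number as a rough indication only.

    import subprocess
    import time

    def time_fact(name):
        """Resolve a single facter fact by name and report its wall-clock cost."""
        start = time.monotonic()
        result = subprocess.run(["facter", name], capture_output=True, text=True)
        return result.stdout.strip(), time.monotonic() - start

    value, seconds = time_fact("cloud.provider")
    print(f"cloud.provider = {value!r} resolved in {seconds:.1f}s")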