[07:06:35] Krinkle: thanks a lot!
[07:16:07] Krinkle, topranks: by looking at HTTP requests served by ats-tls, it seems to me that it took ~5 minutes for full recovery actually
[07:16:10] https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&tab=alert&orgId=1&var-site=eqsin&var-cache_type=text&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&from=1623695447237&to=1623788020893
[07:17:30] eqsin was serving again about 11.5K rps at 2021-06-15T19:05:00, compared to 11.4K 24h earlier (2021-06-14T19:05:00)
[07:19:43] ah, but that's the eqsin repool and you were wondering about the *de*pool
[07:34:43] anyways yeah, add maybe one minute or two for the time it took me to ssh onto authdns1001 to run authdns-update after merging the depool patch, plus the time it takes the command to do its thing, plus DNS TTLs, and I think we get in the 15-minute ballpark
[07:41:59] it's not that easy to tell from those grafana dashboards though, I have to say; looking at turnilo data for the countries affected (e.g. Japan) you can see pretty much a full recovery between 09:41 - 09:42, which is precisely at the 10-minute mark -> https://w.wiki/3dM7
[10:43:33] godog: fyi https://gerrit.wikimedia.org/r/c/operations/puppet/+/704307/1
[10:46:51] jbond: neat, checking
[10:49:44] godog: merged and puppet run lgtm on thanos-fe1001
[10:51:21] awesomesauce
[12:56:20] effie: joe around? I'm planning to push the apcu fix now-ish https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/704176
[12:56:33] nope, but I am
[12:57:12] Amir1: what is your new TTL?
[12:57:18] YOLO
[12:57:35] technically it's the actual value
[12:57:43] can be minutes in some cases to hours
[12:58:06] ok cool
[13:04:45] effie: so I looked, there are three use cases I can find in wikibase: two have a TTL of one hour and one has one minute
[13:06:00] let's see how it goes, so far what we monitor closely is APCu fragmentation
[13:06:13] I will check what else we are exporting
[13:08:20] we should also check memcached as lots of these are basically wrappers around a memcached cache
[13:10:04] ok, I will check memcached traffic too
[13:13:09] made this in case of emergency https://gerrit.wikimedia.org/r/704331
[13:23:32] k
[14:18:55] XioNoX, topranks: hi, is it okay if you check this? https://gerrit.wikimedia.org/r/c/operations/puppet/+/703909
[14:30:06] * topranks looking
[14:31:12] Lucky jbon.d was so kind in running me through a bunch of puppet stuff yesterday or I'd be lost here :)
[14:34:43] Thanks
[14:37:10] looks good to me.
[14:37:25] Amir1: why do you disable monitoring and logging?
[14:38:15] Out of interest, what's the normal process here, once it's been run by puppet the cronjobs are gone? Do you then go back to that file and remove the "absent" resources for them?
[14:38:32] topranks: yes
[14:38:41] cool, thanks.
[14:39:24] jbond: because the command removes all logging
[14:40:08] jbond: the command in cron has `>/dev/null 2>&1`
[14:43:00] Amir1: ahh ok, yes, your change is to keep the current behaviour, makes sense. although it seems like monitoring and/or logging could be useful; perhaps it was disabled to stop cron spamming email, but on-disk logs would be useful. either way, not for this patch, thanks
[14:44:14] yeah, my plan is one change at a time to reduce issues and surprises
[14:44:23] Amir1: yep, makes sense
[14:44:26] thanks
[14:45:41] LibreNMS recommends the 2>&1 in their setup guides, although I'm not sure why. Perhaps it's quite chatty.
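For context on the `>/dev/null 2>&1` idiom discussed just above: it discards both stdout and stderr, so the cron entry generates neither mail nor on-disk logs. Below is a minimal Python sketch of the two behaviours mentioned in the conversation; the command and log path are placeholders, not the actual LibreNMS poller invocation.

```python
import subprocess

# Placeholder command standing in for whatever the cron entry actually runs.
cmd = ["echo", "polling devices"]

# Rough equivalent of the cron entry's `>/dev/null 2>&1`: stdout and stderr
# are both discarded, so cron has nothing to mail and nothing hits the disk.
subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)

# The alternative raised in the discussion: keep on-disk logs instead.
# stderr is merged into stdout and both are appended to a log file.
with open("/tmp/poller.log", "a") as logfile:
    subprocess.run(cmd, stdout=logfile, stderr=subprocess.STDOUT, check=False)
```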
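Similarly, the APCu/TTL exchange earlier in the log (around 13:04-13:10) refers to caches that keep a short-lived in-process copy in front of a shared memcached layer. The following is a hypothetical Python sketch of that two-layer pattern; the class, helper names, and TTL defaults are invented for illustration and do not reflect the actual Wikibase/MediaWiki cache code.

```python
import time
from typing import Any, Callable, Optional


class TwoLayerCache:
    """Short-TTL in-process layer (APCu-style) in front of a shared backend.

    Purely illustrative: the real code goes through MediaWiki's own cache
    abstractions; the names and TTLs here are made up for the sketch.
    """

    def __init__(self,
                 backend_get: Callable[[str], Optional[Any]],
                 backend_set: Callable[[str, Any, int], None],
                 local_ttl: int = 60):
        self._backend_get = backend_get      # e.g. a memcached client's get()
        self._backend_set = backend_set      # e.g. a memcached client's set()
        self._local_ttl = local_ttl          # keep the per-server copy short-lived
        self._local: dict[str, tuple[float, Any]] = {}

    def get_with_set(self, key: str, compute: Callable[[], Any],
                     backend_ttl: int = 3600) -> Any:
        now = time.monotonic()
        hit = self._local.get(key)
        if hit is not None and hit[0] > now:   # fresh in the in-process layer
            return hit[1]
        value = self._backend_get(key)         # fall back to the shared layer
        if value is None:
            value = compute()                  # miss everywhere: recompute
            self._backend_set(key, value, backend_ttl)
        self._local[key] = (now + self._local_ttl, value)
        return value


# Example wiring, with a plain dict standing in for a memcached client.
shared: dict = {}
cache = TwoLayerCache(shared.get, lambda k, v, ttl: shared.__setitem__(k, v))
print(cache.get_with_set("site-stats", lambda: {"edits": 42}))
```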
[14:50:26] jbond: topranks: sooo since I'm not SRE, I can't +2 that patch, is it okay if one of you does it?
[14:50:44] no rush of course
[14:51:25] no probs, let me do it now. John +1'd it
[14:53:47] see, it's done :) thanks Amir1
[14:55:07] Thank you! now around 100 left \o/
[14:55:25] it should have been part of 2030 strategy :D
[14:55:41] haha
[14:55:42] lol :)
[14:56:37] sorry to bother, but I think puppet needs the force merge thingy
[14:56:45] or it needs a rebase?
[14:57:01] at least it doesn't need a manual rebase, that's for sure
[14:57:41] Amir1: done noe
[14:57:45] *now
[14:58:04] Thanks
[14:58:51] np
[14:59:12] tomorrow I'll write the drop-cron patch
[14:59:47] ack, thanks; I'm off tomorrow (and the rest of the week) but sure, topranks can +2
[15:00:33] enjoy your time off
[15:00:39] thanks :)
[15:01:14] yep no problem... might even remember to submit and merge on the puppetmaster next time :)
[15:01:48] Actually I've a bit of an open question, if anyone can advise.
[15:01:54] It's not going to be possible to get the new switches, on which we'd hoped to test the impact of buffer changes on live traffic, installed before we plan to make those changes in eqiad starting next week.
[15:02:10] We wanted to do that to have an estimate of what to expect in eqiad next week, specifically how long traffic would be affected.
[15:03:01] Given that is the case, I'm considering making the change on the two ASWs in ulsfo this week, after de-pooling the site in geo-dns.
[15:03:06] seems we might want to postpone the maintenance windows a bit
[15:03:09] Does that sound like a sensible way to go? Or is it risky / too disruptive?
[15:03:29] do we have any idea when we could have that setup?
[15:03:58] even if it's the same day but before the maintenance, it might already be enough to have enough info
[15:04:43] it's hard for me to guess how risky this is
[15:04:54] I suspect it might be a bit longer, DC-ops weren't able to get to it for the past few weeks, and given a lot of them are off this week they'll probably have a backlog to work through.
[15:06:35] My guess is it's not risky at all and really not something to worry much about; I already feel it has a higher-than-deserved profile. But I am naturally cautious about such things, and sometimes there are bugs etc., so it would be good to test.
[15:07:23] The advantage to doing ulsfo is that, even though the models are different (same range), they are in a virtual-chassis, which our prior test setup (with a single switch) wouldn't emulate.
[17:21:27] haven't had a chance to look properly at this yet, however it looks promising https://github.com/puppetlabs/pdkgo/ (latest iteration of modulesync/pdk)
[20:11:47] jbond: still working? I have a puzzle related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/703740
[20:12:19] specifically, why does CI hate https://gerrit.wikimedia.org/r/c/operations/puppet/+/704417
[20:16:46] andrewbogott: yes, one sec, it's related to rspec, let me send an update
[20:17:12] oh, it's not really compiling, it's doing a spec test?
[20:17:16] * andrewbogott confused by the error message
[20:18:31] andrewbogott: it is compiling, but in the spec test prometheus::class_config is redefined in the pre_condition with a mandatory site param; we can probably just drop the entire pre_condition testing now
[20:18:56] ok
[20:28:05] I got briefly distracted by a phone call; thanks for fixing my patch :)
[20:28:25] andrewbogott: looks like the pre-condition is needed as prometheus::class_config uses puppetdb. I have updated the patch to just drop the site param from the spec pre_condition
[20:28:45] that's what I would do :)
[20:30:15] mostly speaking, if the spec file includes require_relative '../../../../rake_modules/spec_helper' you can drop pre_condition statements. however, for things that use puppetdb it's a bit more complicated
[20:30:41] it's all green now :)
[20:32:20] thank you jbond !
[20:34:13] ~.
[20:34:35] no probs
[20:34:47] * jbond steps away again
[22:42:22] for those that didn't see it in #-staff (h/t kormat): https://www.youtube.com/watch?v=rK_7ozvm53o
[22:42:25] <3