[04:39:11] <wikibugs>	 serviceops, SRE, Datacenter-Switchover, Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (wkandek)
[09:57:55] <mutante>	 jelto: in your timezone now. your mail re: backups makes a lot of sense to me. a single, daily full backup > full + incremental, +1
[10:00:50] <mutante>	 all: hi, i'm in Germany now. I will be working 50% of the time for now
[10:07:22] <jayme>	 mutante: welcome to CEST o/
[10:07:47] <mutante>	 jayme: thank you:) and also for the review that I just saw and amending
[10:23:21] <jelto>	 mutante: thanks for the feedback and welcome to Germany
[10:29:29] <mutante>	 thanks jelto :)
[11:03:12] <joe>	 I will be skipping today's meeting as I have a conflicting non-recurring one.
[11:08:17] <moritzm>	 mutante: welcome to CEST; just in time for the four remaining days of Spargelzeit, can't be a coincidence :-)
[11:09:03] <joe>	 jayme: if I merge a change and I don't bump the chart value, will the chart be updated in chartmuseum?
[11:14:32] <mutante>	 moritzm: I was wondering if it's too late or not :)
[11:25:14] <jayme>	 joe: no
[11:26:26] <joe>	 ERRORS: 106 requests attempted to staging.svc.eqiad.wmnet. Errors connecting to 1 host. 5 requests with failed assertions.
[11:26:34] <joe>	 this is pretty great
[11:26:47] <joe>	 I need to convert the http-only tests to https though
[11:26:50] <joe>	 or skip them
[11:27:11] <joe>	 but this means we have just 3 urls giving unexpected results (we have 2 failures in the tests currently)
[11:27:25] <joe>	 (that's an httpbb test run against mwdebug/staging)
[11:27:52] <joe>	 the only big issue is response times are 3x what we get in production currently
[11:28:07] <joe>	 I'm sure there is some basic resource starvation/misconfiguration causing this
[11:28:29] <joe>	 ok, bbl
[11:44:01] <wikibugs>	 serviceops, Wikidata, Wikidata-Query-Service, wdwb-tech, and 2 others: Limit query parallelism from Flink based WDQS updater to Wikidata - https://phabricator.wikimedia.org/T275133 (Gehel) Open→Resolved
[12:05:28] <wikibugs>	 serviceops, CX-cxserver: cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (KartikMistry)
[12:55:28] <wikibugs>	 serviceops, CX-cxserver: cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (JMeybohm) The message comes from mw-api, `http://localhost:6500/w/api.php` is the local address of the https://wik...
[13:48:38] <wikibugs>	 serviceops, CX-cxserver, Language-Team (Language-2021-April-June): cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (Nikerabbit)
[14:02:58] <rzl>	 joe: hm, it may be worse -- httpbb bails out as soon as it hits a connection error
[14:03:25] <rzl>	 so if https is working but http isn't, it went until it hit the first http test and then stopped, it didn't run any https tests after that
[14:04:02] <rzl>	 that behavior isn't great in that situation, I had been picturing something like an incorrect hostname, where it's impossible to connect at all, so why bother trying over and over
[14:06:45] <rzl>	 I can at least print the number of skipped tests so that message is at least clear about what's going on, will think about other ways to improve the logic there
[14:09:24] <rzl>	 oh, it does at least print SKIPPED for each test, so hopefully it's obvious if there are a lot, it's only the summary that doesn't say
[14:10:53] <joe>	 yes it is, don't worry
[14:11:35] <rzl>	 👍
[14:11:49] <rzl>	 I might still add that but at least it's clearer than I thought
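The bail-out behavior rzl describes, plus the proposed skipped count in the summary, can be sketched roughly as follows. This is not httpbb's actual code: `ConnectionFailed`, `Summary`, and `run_checks` are hypothetical names chosen for illustration, and each check is assumed to be a callable returning True or False.

```python
from dataclasses import dataclass


class ConnectionFailed(Exception):
    """Hypothetical error raised when the target host cannot be reached."""


@dataclass
class Summary:
    attempted: int = 0
    conn_errors: int = 0
    failed: int = 0
    skipped: int = 0


def run_checks(checks):
    """Run (name, check) pairs in order; after the first connection
    error, mark every remaining check as skipped instead of running it."""
    summary = Summary()
    bailed = False
    for name, check in checks:
        if bailed:
            print(f"SKIPPED {name}")
            summary.skipped += 1
            continue
        summary.attempted += 1
        try:
            passed = check()
        except ConnectionFailed:
            summary.conn_errors += 1
            bailed = True
            continue
        if not passed:
            summary.failed += 1
    return summary
```

With the skipped count carried in the summary, a run where http is down but https works would show at a glance how many https tests never ran.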
[14:46:33] <wikibugs>	 serviceops, SRE, Services, Wikibase-Quality-Constraints, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (jbond) p:Triage→Medium
[15:00:07] <elukey>	 hello folks
[15:00:32] <elukey>	 is there something to read about how to add TLS private keys to pods via deployment charts?
[15:00:44] <elukey>	 (other than checking the code :D)
[15:12:10] <jayme>	 elukey: we're in a meeting currently
[15:14:56] <elukey>	 yep yep anytime :)
[15:51:48] <joe>	 elukey: there is the dedicated page on wikitech
[15:52:08] <joe>	 https://wikitech.wikimedia.org/wiki/Enable_TLS_for_Kubernetes_deployments
[15:52:27] <joe>	 you only need a fraction of that stuff as you're not converting a service to tls
[15:53:12] <joe>	 https://wikitech.wikimedia.org/wiki/Enable_TLS_for_Kubernetes_deployments#Create_and_place_certificates is what you need I think, unless you're trying to do something completely different
[15:58:58] <elukey>	 thank youuuu
[15:59:20] <elukey>	 I need to add certs to a couple of places (istio ingress + kfserving's webhook)
[18:47:40] <legoktm>	 rzl: is https://phabricator.wikimedia.org/T266717 something we need to resolve before the switchover?
[18:48:33] <rzl>	 that was the original plan, but I'm not sure how critical it is
[18:49:42] <rzl>	 if we don't do anything, we'll do what we did last time -- manually kill maintenance scripts in eqiad, then expect them to start automatically in codfw when we change the active DC
[18:50:28] <rzl>	 (which isn't perfect, because they'll start automatically in eqiad before the switch, if we take a moment to start the switch and they trigger again in between)
[18:52:39] <rzl>	 legoktm: I had meant to assign you that ticket -- my recollection after we'd talked a few months ago is that I was passing that work over to you, but I'm not 100% positive in retrospect
[18:53:22] <legoktm>	 that's my recollection too, but I had forgotten about it until just now
[18:54:16] <rzl>	 hmm, the quick-and-dirty approach would be to just `systemctl stop` all the maintenance jobs in both dcs, then `systemctl start` em afterwards
[18:54:39] <rzl>	 you'd also want to pause puppet, which is why that's a little ugly
[18:54:51] <rzl>	 (since I'm pretty sure puppet would helpfully start them for you)
[18:54:52] <legoktm>	 we already do pause puppet
[18:54:56] <rzl>	 oh, so we do
[18:55:20] <legoktm>	 we could mask the units?
[18:56:05] <rzl>	 oops yeah, I meant disable not stop -- between disable and mask I'm not sure which is right but I trust your judgment
[18:56:16] <legoktm>	 https://fedoramagazine.org/systemd-masking-units/ "If you boot with a unit masked, it will not run even to satisfy dependencies. Masking is powerful for this reason."
[18:56:25] <legoktm>	 I think masking is like super disabling
[18:56:39] <rzl>	 nod
[18:57:44] <rzl>	 I think a proper fix to T266717 should actually be pretty doable, but with a week to go, maybe the manual fix is the way to go this time
[18:57:56] <rzl>	 I'd be interested in what j.oe thinks too, just in case I'm missing anything
[18:58:40] <rzl>	 I also forget offhand if systemctl can understand a glob like `mediawiki_job_*` to catch all the maintenance scripts -- if not you'll probably want to prepare the commands ahead of time
[18:59:04] <rzl>	 (you'll probably want to prepare them *anyway*, I don't think I typed anything into my terminal on the day, just pasted)
[18:59:22] <rzl>	 (you, or whoever)
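Preparing the commands ahead of time could look something like the sketch below. It assumes an explicit list of unit names rather than a glob, and uses mask/unmask as floated above (swap in disable/enable if that turns out to be the better fit). The unit name in the usage note is made up for illustration, and `prepare_commands` is a hypothetical helper, not part of any existing cookbook.

```python
import shlex


def prepare_commands(units):
    """Return (before, after) lists of shell commands for the given units.

    'before' stops and masks each unit ahead of the switch;
    'after' unmasks and starts each unit again once it is done.
    """
    before = []
    after = []
    for unit in units:
        quoted = shlex.quote(unit)
        before.append(f"systemctl stop {quoted}")
        before.append(f"systemctl mask {quoted}")
        after.append(f"systemctl unmask {quoted}")
        after.append(f"systemctl start {quoted}")
    return before, after
```

Calling `prepare_commands(["mediawiki_job_example.timer"])` yields two lists that can be reviewed beforehand and pasted on the day.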
[19:00:23] <wikibugs>	 serviceops, SRE, conftool, Datacenter-Switchover: Disable maintenance scripts via conftool - https://phabricator.wikimedia.org/T266717 (RLazarus) a:RLazarus→Legoktm
[19:01:37] * legoktm nods
[19:08:20] <wikibugs>	 serviceops, SRE, Datacenter-Switchover: Updates to warmup script - https://phabricator.wikimedia.org/T269179 (Legoktm) > As written, the warmup script only covers the app servers, and doesn't warm up the API servers at all. It's less harmful to have a spike of latency in the API servers than the app...
[19:16:57] <rzl>	 legoktm: re T269179 it's not just the URLs but it's also which hosts we warm up -- I haven't checked recently but I think the warmup is run directly against the appserver cluster and not the api_appservers
[19:17:09] <legoktm>	 oof
[19:18:07] <legoktm>	     appserver_warmup = "nodejs {dir}/warmup.js {dir}/urls-server.txt clone appserver {dc}".format(
[19:18:07] <legoktm>	         dir=warmup_dir, dc=datacenter)
[19:18:12] <rzl>	 it takes the destination as a command-line arg so it shouldn't be hard to send them something
[19:18:14] <rzl>	 yeah
[19:18:22] <legoktm>	 I guess I just c&p to add one for api_appserver
[19:18:36] <rzl>	 right exactly -- maybe with a different urls list
[19:19:22] <legoktm>	 ok, after lunch :D
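The copy-and-paste idea above could be generalized along these lines. This is only a sketch: it assumes warmup.js accepts `api_appserver` as a cluster name in the same argument position, and `urls-api.txt` is a hypothetical file name standing in for the "different urls list".

```python
def warmup_commands(warmup_dir, datacenter):
    """Build one warmup command per cluster, reusing the pasted template."""
    # cluster name -> URL list file; the api_appserver entry is an assumption
    targets = {
        "appserver": "urls-server.txt",
        "api_appserver": "urls-api.txt",  # hypothetical file name
    }
    return [
        "nodejs {dir}/warmup.js {dir}/{urls} clone {cluster} {dc}".format(
            dir=warmup_dir, urls=urls, cluster=cluster, dc=datacenter)
        for cluster, urls in targets.items()
    ]
```

Keeping the cluster-to-file mapping in one dict means a further cluster only needs one more entry rather than another pasted format string.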
[19:34:57] <joe>	 I think the work to add a confctl variable is pretty doable this week
[19:35:05] <joe>	 I can take a look tomorrow morning
[19:35:38] <joe>	 as an alternative, yes, we'll need to disable all scripts in the cookbook "manually"
[19:41:15] <rzl>	 oh yeah, of course we could just add the systemctl call to the cookbook
[20:00:19] <bd808>	 do y'all want stashbot in this channel to do phabricator Txxxx mention expansions?
[20:01:32] <rzl>	 no strong feeling here, I used the bare number in those cases cause the full link had just been mentioned :)
[20:01:54] <rzl>	 it never hurts but I don't particularly feel like it's missing either, I guess
[21:23:41] <wikibugs>	 serviceops, SRE, Datacenter-Switchover, Patch-For-Review: Updates to warmup script - https://phabricator.wikimedia.org/T269179 (Legoktm) >>! In T269179#7167263, @Legoktm wrote: >> As written, the warmup script only covers the app servers, and doesn't warm up the API servers at all. It's less harm...
[22:06:05] <legoktm>	 fun, systemctl can reliably stop units with a glob/wildcard but not necessarily restart them: https://github.com/systemd/systemd/issues/6379
[22:09:51] <rzl>	 ... huh! fair enough
[22:23:16] <wikibugs>	 serviceops, SRE, Datacenter-Switchover, Patch-For-Review: Updates to warmup script (2020-2021) - https://phabricator.wikimedia.org/T269179 (Krinkle)