[04:39:11] serviceops, SRE, Datacenter-Switchover, Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (wkandek)
[09:57:55] jelto: in your timezone now. your mail re: backups makes a lot of sense to me. a single, daily full backup > full + incremental, +1
[10:00:50] all: hi, i'm in Germany now. I will be working 50% of the time for now
[10:07:22] mutante: welcome to CEST o/
[10:07:47] jayme: thank you :) and also for the review that I just saw and amending
[10:23:21] mutante: thanks for the feedback and welcome to Germany
[10:29:29] thanks jelto :)
[11:03:12] I will be skipping today's meeting as I have a conflicting non-recurring one.
[11:08:17] mutante: welcome to CEST; just in time for the four remaining days of Spargelzeit, can't be a coincidence :-)
[11:09:03] jayme: if I merge a change and I don't bump the chart value, will the chart be updated in chartmuseum?
[11:14:32] moritzm: I was wondering if it's too late or not :)
[11:25:14] joe: no
[11:26:26] ERRORS: 106 requests attempted to staging.svc.eqiad.wmnet. Errors connecting to 1 host. 5 requests with failed assertions.
[11:26:34] this is pretty great
[11:26:47] I need to convert the http-only tests to https though
[11:26:50] or skip them
[11:27:11] but this means we have just 3 urls giving unexpected results (we have 2 failures in the tests currently)
[11:27:25] (that's an httpbb test run against mwdebug/staging)
[11:27:52] the only big issue is response times are 3x what we get in production currently
[11:28:07] I'm sure there is some basic resource starvation/misconfiguration causing this
[11:28:29] ok, bbl
[11:44:01] serviceops, Wikidata, Wikidata-Query-Service, wdwb-tech, and 2 others: Limit query parallelism from Flink based WDQS updater to Wikidata - https://phabricator.wikimedia.org/T275133 (Gehel) Open→Resolved
[12:05:28] serviceops, CX-cxserver: cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (KartikMistry)
[12:55:28] serviceops, CX-cxserver: cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (JMeybohm) The message comes from mw-api, `http://localhost:6500/w/api.php` is the local address of the https://wik...
[13:48:38] serviceops, CX-cxserver, Language-Team (Language-2021-April-June): cxserver: https://cxserver.wikimedia.org/v2/suggest/source/Paneer/ca?sourcelanguages=en occasionally fails with HTTP 503 - https://phabricator.wikimedia.org/T285219 (Nikerabbit)
[14:02:58] joe: hm, it may be worse -- httpbb bails out as soon as it hits a connection error
[14:03:25] so if https is working but http isn't, it went until it hit the first http test and then stopped, it didn't run any https tests after that
[14:04:02] that behavior isn't great in that situation, I had been picturing something like an incorrect hostname, where it's impossible to connect at all, so why bother trying over and over
[14:06:45] I can at least print the number of skipped tests so that message is at least clear about what's going on, will think about other ways to improve the logic there
[14:09:24] oh, it does at least print SKIPPED for each test, so hopefully it's obvious if there are a lot, it's only the summary that doesn't say
[14:10:53] yes it is don't worry
[14:11:35] 👍
[14:11:49] I might still add that but at least it's clearer than I thought
[14:46:33] serviceops, SRE, Services, Wikibase-Quality-Constraints, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (jbond) p:Triage→Medium
[15:00:07] hello folks
[15:00:32] is there something to read about how to add TLS private keys to pods via deployment charts?
[15:00:44] (other than checking the code :D)
[15:12:10] elukey: we're in a meeting currently
[15:14:56] yep yep anytime :)
[15:51:48] elukey: there is the dedicated page on wikitech
[15:52:08] https://wikitech.wikimedia.org/wiki/Enable_TLS_for_Kubernetes_deployments
[15:52:27] you only need a fraction of that stuff as you're not converting a service to tls
[15:53:12] https://wikitech.wikimedia.org/wiki/Enable_TLS_for_Kubernetes_deployments#Create_and_place_certificates is what you need I think, unless you're trying to do something completely different
[15:58:58] thank youuuu
[15:59:20] I need to add certs to a couple of places (istio ingress + kfserving's webhook)
[18:47:40] rzl: is https://phabricator.wikimedia.org/T266717 something we need to resolve before the switchover?
[18:48:33] that was the original plan, but I'm not sure how critical it is
[18:49:42] if we don't do anything, we'll do what we did last time -- manually kill maintenance scripts in eqiad, then expect them to start automatically in codfw when we change the active DC
[18:50:28] (which isn't perfect, because they'll start automatically in eqiad before the switch, if we take a moment to start the switch and they trigger again in between)
[18:52:39] legoktm: I had meant to assign you that ticket -- my recollection after we'd talked a few months ago is that I was passing that work over to you, but I'm not 100% positive in retrospect
[18:53:22] that's my recollection too, but I had forgotten about it until just now
[18:54:16] hmm, the quick-and-dirty approach would be to just `systemctl stop` all the maintenance jobs in both dcs, then `systemctl start` em afterwards
[18:54:39] you'd also want to pause puppet, which is why that's a little ugly
[18:54:51] (since I'm pretty sure puppet would helpfully start them for you)
[18:54:52] we already do pause puppet
[18:54:56] oh, so we do
[18:55:20] we could mask the units?
[18:56:05] oops yeah, I meant disable not stop -- between disable and mask I'm not sure which is right but I trust your judgment
[18:56:16] https://fedoramagazine.org/systemd-masking-units/ "If you boot with a unit masked, it will not run even to satisfy dependencies. Masking is powerful for this reason."
[18:56:25] I think masking is like super disabling
[18:56:39] nod
[18:57:44] I think a proper fix to T266717 should actually be pretty doable, but with a week to go, maybe the manual fix is the way to go this time
[18:57:56] I'd be interested in what j.oe thinks too, just in case I'm missing anything
[18:58:40] I also forget offhand if systemctl can understand a glob like `mediawiki_job_*` to catch all the maintenance scripts -- if not you'll probably want to prepare the commands ahead of time
[18:59:04] (you'll probably want to prepare them *anyway*, I don't think I typed anything into my terminal on the day, just pasted)
[18:59:22] (you, or whoever)
[19:00:23] serviceops, SRE, conftool, Datacenter-Switchover: Disable maintenance scripts via conftool - https://phabricator.wikimedia.org/T266717 (RLazarus) a:RLazarus→Legoktm
[19:01:37] * legoktm nods
[19:08:20] serviceops, SRE, Datacenter-Switchover: Updates to warmup script - https://phabricator.wikimedia.org/T269179 (Legoktm) > As written, the warmup script only covers the app servers, and doesn't warm up the API servers at all. It's less harmful to have a spike of latency in the API servers than the app...
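The "prepare the commands ahead of time" idea above could look roughly like the sketch below: expand the `mediawiki_job_*` pattern against a known list of unit names and emit one explicit `systemctl mask` (or `unmask`) command per unit, ready to paste. This is only an illustration, not the procedure actually used; `build_commands` and the unit names are hypothetical, and in practice the unit list would have to come from the hosts themselves.

```python
import fnmatch
import shlex


def build_commands(all_units, pattern="mediawiki_job_*", action="mask"):
    """Return one explicit systemctl command per unit name matching pattern."""
    matched = sorted(u for u in all_units if fnmatch.fnmatch(u, pattern))
    return [f"systemctl {action} {shlex.quote(u)}" for u in matched]


# Hypothetical unit names, for illustration only.
units = [
    "mediawiki_job_example-b.timer",
    "mediawiki_job_example-a.timer",
    "ssh.service",
]

# Commands to paste before the switchover, then the matching undo list.
for cmd in build_commands(units, action="mask"):
    print(cmd)
for cmd in build_commands(units, action="unmask"):
    print(cmd)
```

Listing units explicitly avoids depending on how systemctl itself treats globs, and the same list drives both the mask and the unmask step.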
[19:16:57] legoktm: re T269179 it's not just the URLs but it's also which hosts we warm up -- I haven't checked recently but I think the warmup is run directly against the appserver cluster and not the api_appservers
[19:17:09] oof
[19:18:07] appserver_warmup = "nodejs {dir}/warmup.js {dir}/urls-server.txt clone appserver {dc}".format(
[19:18:07] dir=warmup_dir, dc=datacenter)
[19:18:12] it takes the destination as a command-line arg so it shouldn't be hard to send them something
[19:18:14] yeah
[19:18:22] I guess I just c&p to add one for api_appserver
[19:18:36] right exactly -- maybe with a different urls list
[19:19:22] ok, after lunch :D
[19:34:57] I think the work to add a confctl variable is pretty doable this week
[19:35:05] I can take a look tomorrow morning
[19:35:38] as an alternative, yes, we'll need to disable all scripts in the cookbook "manually"
[19:41:15] oh yeah, of course we could just add the systemctl call to the cookbook
[20:00:19] do y'all want stashbot in this channel to do phabricator Txxxx mention expansions?
[20:01:32] no strong feeling here, I used the bare number in those cases cause the full link had just been mentioned :)
[20:01:54] it never hurts but I don't particularly feel like it's missing either, I guess
[21:23:41] serviceops, SRE, Datacenter-Switchover, Patch-For-Review: Updates to warmup script - https://phabricator.wikimedia.org/T269179 (Legoktm) >>! In T269179#7167263, @Legoktm wrote: >> As written, the warmup script only covers the app servers, and doesn't warm up the API servers at all. It's less harm...
[22:06:05] fun, systemctl can reliably stop units with a glob/wildcard but not necessarily restart them: https://github.com/systemd/systemd/issues/6379
[22:09:51] ... huh! fair enough
[22:23:16] serviceops, SRE, Datacenter-Switchover, Patch-For-Review: Updates to warmup script (2020-2021) - https://phabricator.wikimedia.org/T269179 (Krinkle)
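The "c&p to add one for api_appserver" idea from the warmup discussion above could be sketched as a small helper that builds the same command string for each cluster. This is a guess at the shape, not the actual patch: `warmup_command`, the `urls-api.txt` file name, and the `/path/to/warmup` directory are hypothetical, and it assumes `warmup.js` accepts `api_appserver` as a cluster argument the same way it accepts `appserver`.

```python
def warmup_command(warmup_dir, urls_file, cluster, datacenter):
    """Build the warmup.js invocation for one cluster, mirroring the quoted format string."""
    return "nodejs {dir}/warmup.js {dir}/{urls} clone {cluster} {dc}".format(
        dir=warmup_dir, urls=urls_file, cluster=cluster, dc=datacenter
    )


# (urls file, cluster) pairs; the api_appserver entry and its URL list are assumptions.
TARGETS = [
    ("urls-server.txt", "appserver"),
    ("urls-api.txt", "api_appserver"),
]

commands = [
    warmup_command("/path/to/warmup", urls, cluster, "codfw")
    for urls, cluster in TARGETS
]
for cmd in commands:
    print(cmd)
```

Keeping the URL list as a per-cluster parameter leaves room for the "different urls list" suggested for the API servers.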