[07:40:04] good morning folks
[07:40:31] as announced last week, I am going to start rebalancing some kafka main codfw topics today for https://phabricator.wikimedia.org/T288825
[07:42:45] <_joe_> elukey: lmk if you need my help
[07:44:51] _joe_ ack thanks! There may be some alerts related to kafka partitions unreplicated blabla, if any I'll try to ack/silence
[07:45:20] <_joe_> I'm slightly worried about the jobqueue
[07:45:31] <_joe_> but today you plan to rebalance codfw right?
[07:46:04] <_joe_> if so, we might want to be sure to direct traffic to the codfw eventgate-main for the duration
[07:48:51] yes yes only codfw, but it may take a couple of days, I'll do it slowly
[07:49:19] in theory there shouldn't be any impact for live traffic, Kafka should increase the partition count and shrink it when completed
[07:49:37] Good morning, as announced last week, I'm going to switch to the new reimage process ( https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage ). Please hold off on any reimages for a bit, I'll update here on completion.
[07:49:48] there are only some topics that are relatively big
[07:49:52] (like resource-purge)
[08:02:42] <_joe_> elukey: that's also quite mission-critical in all DCs
[08:04:13] _joe_ yes yes I'll keep the most trafficked topics for last
[08:05:58] <_joe_> I would have suggested doing it while the DCs that connect to the codfw kafkas for purged consumption are least trafficked
[08:06:10] <_joe_> but sure, start with some small ones
[09:06:08] The switch to the new reimage process for physical hosts has been completed. You can resume reimages following https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage
[09:57:30] _joe_ I moved a lot of low-traffic topics and something is moving in Kafka metrics (related to traffic/data rebalancing), but to see the real effect I'll need to move the bigger ones
[09:57:46] if you and others are ok I can start this afternoon
[09:58:56] <_joe_> +1
[10:00:02] I have added all the info of what I have done to https://phabricator.wikimedia.org/T288825, and https://gitlab.wikimedia.org/Elukey/kafka_main_rebalance/-/tree/main/main-codfw/topicmappr_json is up to date too
[10:00:45] (better https://gitlab.wikimedia.org/elukey/kafka_main_rebalance/-/tree/main/main-codfw)
[13:01:22] Ready to restart the kafka main codfw topics move :)
[13:01:59] elukey: i wish you exactly the amount of luck that you deserve. :)
[13:02:26] kormat: I know that I can count on you :)
[13:02:37] :D
[14:17:22] the last/biggest kafka partitions take quite some time to replicate to the new nodes, but so far I don't see anything exploding (ping me in case). It will probably take 2-3 days to finish the work, and then there will be main-eqiad to do :D
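A rough way to watch for the "kafka partitions unreplicated" situation mentioned above while a reassignment is in flight might look like the sketch below. This is not the tooling actually used for the rebalance (the task links topicmappr plans); the broker address is illustrative and the confluent-kafka Python client is assumed.

```python
#!/usr/bin/env python3
"""Sketch: list partitions whose ISR is smaller than their replica set.
Broker address is illustrative; assumes the confluent-kafka client."""
from confluent_kafka.admin import AdminClient

BOOTSTRAP = "kafka-main2001.codfw.wmnet:9092"  # illustrative broker

admin = AdminClient({"bootstrap.servers": BOOTSTRAP})
md = admin.list_topics(timeout=15)  # full cluster metadata

for topic in sorted(md.topics.values(), key=lambda t: t.topic):
    for p in topic.partitions.values():
        # During a reassignment the replica list temporarily contains both the
        # old and the new replicas and shrinks back once the new copies are in
        # sync, so ISR < replicas is expected for a while; it just should not
        # stay that way after the move completes.
        if len(p.isrs) < len(p.replicas):
            print(f"{topic.topic}[{p.id}] replicas={p.replicas} isr={p.isrs}")
```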
[14:18:31] (I am throttling the replication to max 50MB/s, but in theory we can raise the threshold a bit more)
[14:19:20] <_joe_> let's make a call tomorrow morning
[14:21:38] * elukey is scared
[14:21:43] :D
[14:31:28] <_joe_> rightfully so
[14:32:44] https://grafana.wikimedia.org/d/000000027/kafka?viewPanel=6&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All
[14:32:49] this one is a good sign :)
[15:40:54] https://www.reddit.com/r/RedditEng/comments/q5vmf2/reddits_move_to_grpc/
[15:44:01] I will leave the dewiki eqiad media backups running during the night and tomorrow with very very low concurrency (3 MB/s), as I won't be back until Wednesday
[15:45:23] <_joe_> godog: I recently defined grpc "the SOAP of the 2020s", which is partially unfair but also not completely unfair
[15:50:23] hehe
[17:14:10] I am stopping the kafka main codfw topic moves, summary of what's left in https://phabricator.wikimedia.org/T288825#7417879
[17:14:50] from the cluster metrics everything looks good, lemme know if you see any issue
[17:15:06] (the traffic is way more spread out now, tomorrow I should be able to finish)
[17:38:01] _joe_: I was going to get rid of /static/current in favour of /w/••/?static or something like that, but we can do the rewrite approach first to emulate what we have
[17:38:26] it would require maybe a handful of changes and a month of rollover
[17:38:48] <_joe_> Krinkle: have you seen my patch to /w/static.php?
[17:38:50] yes
[17:39:32] <_joe_> I think it does the trick; however reading through /w/static.php has made me think of what we do for expired assets
[17:39:52] <_joe_> we keep static files around for some time for old cached content
[17:39:59] I mean to make /w/foo/bar.png?static the actual URL as used in our configuration, and treat that with long caching (1 year) unconditionally, unlike for the ?hash URLs.
[17:40:09] so that we can remove /static/current entirely
[17:40:19] <_joe_> nod
[17:41:18] we have 1) /w/foo with short caching, expanded based on the current wiki hostname and mwversion, 2) /w/foo?hash which picks the right version of the file from the available mw versions and gives it long-term caching, changing immediately when a new hash is given; the hash is validated such that if it isn't found yet during deployment races it'll cache for a very short time only, to avoid stale/poison, and 3) "static/current"-like, which means I don't care which version and still cache for long.
[17:41:54] this is that third category, and it would be nice to have that use the same /w/ path for easier debugging and so that it works by default on fresh MW installs (without the cache optimisation), whereas now the path itself is a wmf-specific hack and thus difficult to configure at times
[17:42:19] we probably also don't need img vs non-img to differ. I'm curious where that came from
[17:42:23] I'll look into that
[17:43:00] <_joe_> it just came from a decade-old mod_expire configuration
[17:47:07] <_joe_> which, if I had to bet, came off of copy-pasting from comp.*
[17:47:18] <_joe_> (the stackoverflow of the old times)
[17:47:39] <_joe_> it used to be a pretty common configuration
[17:47:46] _joe_: where is the month config? I see the year config in expires.conf
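The three caching tiers described above could be sketched roughly like this. The TTL values, parameter names, and helper function are illustrative only; this is not the actual /w/static.php logic or the Apache expires.conf configuration being discussed.

```python
"""Sketch of the three URL flavours: plain /w/foo, /w/foo?hash, and /w/foo?static."""

LONG = 365 * 24 * 3600   # "cache for a year"
SHORT = 5 * 60           # "cache for a short time" - the exact value is a guess

def max_age(query: dict, hash_is_valid: bool) -> int:
    if "hash" in query:
        # Versioned URL: long-lived if the hash matches a known file version,
        # very short otherwise, so a deployment race can't poison the cache
        # with a stale response for long.
        return LONG if hash_is_valid else SHORT
    if "static" in query:
        # "any version will do" URLs (the /static/current replacement):
        # long-lived unconditionally.
        return LONG
    # Plain /w/foo: resolved against the current wiki hostname and MW version,
    # so keep the TTL short.
    return SHORT
```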
[17:48:00] <_joe_> it's the default for mod_expire
[17:48:16] <_joe_> so for file types that aren't handled by the 1y expiry, 1 month applies
[17:48:44] right, but apart from perhaps /COPYING, I can't think of anything that doesn't match this
[17:48:56] certainly all things that we'd reference with /static/current match this
[17:49:04] incl. images and css/js
[17:49:13] <_joe_> all images? no webm?
[17:49:21] we don't have webm files on disk
[17:49:29] we don't serve css/js this way either
[17:49:40] <_joe_> yeah I noticed looking at apache logs
[17:49:42] only images and fonts currently
[17:50:06] I think we can ignore the fact that there's a default fallback and go with 1Y for this endpoint
[17:50:13] <_joe_> sure
[17:50:24] <_joe_> I was for now reproducing the state of the art :)
[17:50:32] <_joe_> I think we can drop the Etag too, tbh
[17:50:35] what's the timeline? Should I fork yours and implement ?static = 1Y and update urls, or do you want to roll out the rewrite first?
[17:50:41] yes
[17:50:51] matching what we do for /w/ will help
[17:51:24] and the client-side maxage means it won't cause much of an influx of traffic other than internal from varnish
[17:51:28] <_joe_> I think we should roll out the rewrite first
[17:51:44] <_joe_> because it works immediately, and we can close the bug
[17:52:00] <_joe_> also we ensure we don't break old URLs in the meantime
[17:52:17] <_joe_> sorry, dinner time
[17:52:54] ok. I'll update the patch to use 1Y always, drop the Etag, and also add forward-support for ?static; then we can both handle our side (you rewrite, me update URLs)
[17:53:12] It'll then also move from a wmf-specific convention to a core recommendation, and I'll update docs for that
[19:56:35] Hi. I was wondering if it's known to SREs that gerrit.wikimedia.org is advertising the invalid/expired ISRG X1 intermediate in its SSL configuration? I've been tracing an issue for a couple of hours and realized it's server-side, and confirmed such at https://www.ssllabs.com/analyze.html?d=gerrit.wikimedia.org section Certification Paths #2, item #3 "sent by server".
[20:24:40] atol: most browsers should work round it
[20:25:08] Are you having trouble with access?
[20:59:39] While triaging some ancient code, I found a travis-ci build that's doing checkouts from Gerrit. (I don't know why, and I'll probably stop those.) I do know that browsers are patching/patched now to specifically block the expired root so that the invalid intermediate from Gerrit doesn't cause accessibility issues, but that won't make any difference to Travis-ci's default Ubuntu builder (which is just trying to do a 'git checkout'), so I figured I owed the courtesy of pointing out the flaw.
[21:00:22] (I'll also report it to Travis-ci, but I ~90% expect them to assign fault to the misconfigured intermediate.)
[21:03:09] It sounds like I've delivered the concern report successfully, so beyond that I don't seek technical support or anything, just wanted to FYI.
[21:14:06] atol: It's fairly late in the day in both Europe and on the US west coast. I imagine they'll read this tomorrow; otherwise, perhaps file it in Phabricator to be sure.
[21:16:30] Noted! I'll try to leave the tab open in case anyone has questions, but if I drop at some point, I trust their judgment on whether to repair. Appreciate the guidance.
[21:27:20] Which Ubuntu version would be useful?
[21:29:10] I tested Ubuntu 16 LTS and 18 LTS so far, and will test 20 next. Travis offers the full range of 12/14/16/18/20 and defaults to 16 unless you specifically pin one.
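A quick way to see what chain a server actually sends, and whether any entry in it is expired (the ISRG/DST cross-sign situation reported above), might look like the sketch below. The host is taken from the report; it assumes the openssl CLI is available and that an unpatched client would reject the expired entry.

```python
#!/usr/bin/env python3
"""Sketch: print the certificate chain a TLS server presents and flag expired entries."""
import re
import subprocess
from datetime import datetime, timezone

HOST = "gerrit.wikimedia.org"  # the endpoint under discussion; any TLS host works

def served_chain(host: str, port: int = 443) -> list[str]:
    # -showcerts makes s_client print every certificate the server presents,
    # not just the leaf; closing stdin makes it exit right after the handshake.
    out = subprocess.run(
        ["openssl", "s_client", "-connect", f"{host}:{port}",
         "-servername", host, "-showcerts"],
        input="", capture_output=True, text=True, timeout=30,
    ).stdout
    return re.findall(
        r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", out, re.S)

def describe(pem: str) -> tuple[str, datetime]:
    # Subject and notAfter of one PEM certificate, via the openssl CLI
    out = subprocess.run(
        ["openssl", "x509", "-noout", "-subject", "-enddate"],
        input=pem, capture_output=True, text=True,
    ).stdout
    subject = re.search(r"subject=(.*)", out).group(1).strip()
    enddate = re.search(r"notAfter=(.*)", out).group(1).strip()
    expires = datetime.strptime(enddate, "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
    return subject, expires

for pem in served_chain(HOST):
    subject, expires = describe(pem)
    status = "EXPIRED" if expires < datetime.now(timezone.utc) else "ok"
    print(f"{status:7} {expires:%Y-%m-%d}  {subject}")
```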
[21:30:10] Ah ok
[21:31:59] After 16.04 was supposed to be on the ok list
[21:32:40] Anyway, sleep time for me
[21:33:03] https://meta.wikimedia.org/wiki/HTTPS/2021_Let%27s_Encrypt_root_expiry is the big help guide
[21:33:09] Scott's blogs are interesting
[21:36:47] It's okay _unless_ you include the third intermediate shown in the ssllabs link above, which directs clients to ignore the trusted root and instead follow the trust chain to the expired root.
[21:39:11] This seems to happen mostly when Let's Encrypt is deployed in a scenario where either the server is configured to use a static copy of intermediate.pem rather than the one that LE writes out each time it renews the cert, _or_ the server is one of the various webservers that has to be restarted/signaled/whatever'd to pick up changes to the intermediate chain written on-disk by LE.
[21:39:34] (I've only debugged a few, though, so I can't be certain those two scenarios cover it.)
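The two misconfiguration scenarios described above (a stale static intermediate, or a webserver that was never reloaded after renewal) could be checked for roughly like this. The host and the certbot-style chain path are illustrative, not known locations on the server in question, and the openssl CLI is assumed.

```python
#!/usr/bin/env python3
"""Sketch: compare the intermediates a server serves with the chain file the ACME client wrote."""
import re
import subprocess

HOST = "gerrit.wikimedia.org"                                  # illustrative
CHAIN_ON_DISK = "/etc/letsencrypt/live/example.org/chain.pem"  # hypothetical certbot path

PEM_RE = re.compile(r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.S)

def fingerprint(pem: str) -> str:
    # SHA-256 fingerprint of one PEM certificate, via the openssl CLI
    out = subprocess.run(["openssl", "x509", "-noout", "-fingerprint", "-sha256"],
                         input=pem, capture_output=True, text=True).stdout
    return out.split("=", 1)[1].strip()

# Everything the server presents: [0] is the leaf, the rest are intermediates
served = PEM_RE.findall(subprocess.run(
    ["openssl", "s_client", "-connect", f"{HOST}:443", "-servername", HOST, "-showcerts"],
    input="", capture_output=True, text=True, timeout=30).stdout)

with open(CHAIN_ON_DISK) as f:
    on_disk = PEM_RE.findall(f.read())

if [fingerprint(p) for p in served[1:]] != [fingerprint(p) for p in on_disk]:
    print("served intermediates differ from the chain the ACME client wrote:")
    print("  either a static intermediate.pem is configured, or the webserver")
    print("  has not been reloaded since the last renewal")
else:
    print("server is serving the chain file that is on disk")
```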