[07:40:04] good morning folks
[07:40:31] as announced last week, I am going to start rebalancing some kafka main codfw topics today for https://phabricator.wikimedia.org/T288825
[07:42:45] <_joe_> elukey: lmk if you need my help
[07:44:51] _joe_ ack thanks! There may be some alerts related to kafka partitions unreplicated blabla, if any I'll try to ack/silence
[07:45:20] <_joe_> I'm slightly worried about the jobqueue
[07:45:31] <_joe_> but today you plan to rebalance codfw right?
[07:46:04] <_joe_> if so, we might want to be sure to direct traffic to the codfw eventgate-main for the duration
[07:48:51] yes yes only codfw, but it may take a couple of days, I'll do it slowly
[07:49:19] in theory there shouldn't be any impact for live traffic, Kafka should increase the partition count and shrink it when completed
[07:49:37] Good morning, as announced last week, I'm going to switch to the new reimage process ( https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage ). Please hold off on any reimages for a bit, I'll update here on completion.
[07:49:48] there are only some topics that are relatively big
[07:49:52] (like resource-purge)
[08:02:42] <_joe_> elukey: that's also quite mission-critical in all DCs
[08:04:13] _joe_ yes yes I'll keep the most trafficked topics for last
[08:05:58] <_joe_> I would have suggested doing it while the DCs that connect to the codfw kafkas for purged consumption are least trafficked
[08:06:10] <_joe_> but sure, start with some small ones
[09:06:08] The switch to the new reimage process for physical hosts has been completed. You can resume reimages following https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage
[09:57:30] _joe_ I moved a lot of low-traffic topics and something is moving in Kafka metrics (related to traffic/data rebalancing), but to see the real effect I'll need to move the bigger ones
[09:57:46] if you and others are ok I can start this afternoon
[09:58:56] <_joe_> +1
[10:00:02] I have added all the info of what I have done to https://phabricator.wikimedia.org/T288825, and https://gitlab.wikimedia.org/Elukey/kafka_main_rebalance/-/tree/main/main-codfw/topicmappr_json is up to date too
[10:00:45] (better https://gitlab.wikimedia.org/elukey/kafka_main_rebalance/-/tree/main/main-codfw)
[13:01:22] Ready to restart the kafka main codfw topics move :)
[13:01:59] elukey: i wish you exactly the amount of luck that you deserve. :)
[13:02:26] kormat: I know that I can count on you :)
[13:02:37] :D
[14:17:22] the last/biggest kafka partitions take quite some time to replicate to the new nodes, but so far I don't see anything exploding (ping me in case). It will probably take 2-3 days to finish the work, and then there will be main-eqiad to do :D
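A rough way to watch for the "kafka partitions unreplicated" situation mentioned above while a reassignment is in flight might look like the sketch below. This is not the tooling actually used for the rebalance (the task links topicmappr plans); the broker address is illustrative and the confluent-kafka Python client is assumed.

```python
#!/usr/bin/env python3
"""Sketch: list partitions whose ISR is smaller than their replica set.
Broker address is illustrative; assumes the confluent-kafka client."""
from confluent_kafka.admin import AdminClient

BOOTSTRAP = "kafka-main2001.codfw.wmnet:9092"  # illustrative broker

admin = AdminClient({"bootstrap.servers": BOOTSTRAP})
md = admin.list_topics(timeout=15)  # full cluster metadata

for topic in sorted(md.topics.values(), key=lambda t: t.topic):
    for p in topic.partitions.values():
        # During a reassignment the replica list temporarily contains both the
        # old and the new replicas and shrinks back once the new copies are in
        # sync, so ISR < replicas is expected for a while; it just should not
        # stay that way after the move completes.
        if len(p.isrs) < len(p.replicas):
            print(f"{topic.topic}[{p.id}] replicas={p.replicas} isr={p.isrs}")
```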
[14:18:31] (I am throttling the replication to max 50MB/s, but in theory we can raise the threshold a bit more)
[14:19:20] <_joe_> let's make a call tomorrow morning
[14:21:38] * elukey is scared
[14:21:43] :D
[14:31:28] <_joe_> rightfully so
[14:32:44] https://grafana.wikimedia.org/d/000000027/kafka?viewPanel=6&orgId=1&from=now-3h&to=now&var-datasource=codfw%20prometheus%2Fops&var-kafka_cluster=main-codfw&var-cluster=kafka_main&var-kafka_broker=All&var-disk_device=All
[14:32:49] this one is a good sign :)
[15:40:54] https://www.reddit.com/r/RedditEng/comments/q5vmf2/reddits_move_to_grpc/
[15:44:01] I will leave the dewiki eqiad media backups running during the night and tomorrow with very very low concurrency (3 MB/s), as I won't be back until Wednesday
[15:45:23] <_joe_> godog: I recently defined grpc "the SOAP of the 2020s", which is partially unfair but also not completely unfair
[15:50:23] hehe
[17:14:10] I am stopping the kafka main codfw topic moves, summary of what's left in https://phabricator.wikimedia.org/T288825#7417879
[17:14:50] from the cluster metrics everything looks good, lemme know if you see any issue
[17:15:06] (the traffic is way more spread out now, tomorrow I should be able to finish)
[17:38:01] _joe_: I was going to get rid of /static/current in favour of /w/••/?static or something like that, but we can do the rewrite approach first to emulate what we have
[17:38:26] it would require maybe a handful of changes and a month of rollover
[17:38:48] <_joe_> Krinkle: have you seen my patch to /w/static.php?
[17:38:50] yes
[17:39:32] <_joe_> I think it does the trick; however reading through /w/static.php has made me think of what we do for expired assets
[17:39:52] <_joe_> we keep static files around for some time for old cached content
[17:39:59] I mean to make /w/foo/bar.png?static the actual URL as used in our configuration, and treat that with long caching (1 year) unconditionally, unlike for the ?hash URLs.
[17:40:09] so that we can remove /static/current entirely
[17:40:19] <_joe_> nod
[17:41:18] we have 1) /w/foo with short caching, expanded based on the current wiki hostname and mwversion, 2) /w/foo?hash which picks the right version of the file from the available mw versions and gives it long-term caching, changing immediately when a new hash is given; the hash is validated such that if it isn't found yet during deployment races it'll cache for a very short time only, to avoid stale/poison, and 3) "static/current"-like, which means I don't care which version and still cache for long.
[17:41:54] this is that third category, and it would be nice to have that use the same /w/ path for easier debugging and so that it works by default on fresh MW installs (without the cache optimisation), whereas now the path itself is a wmf-specific hack and thus difficult to configure at times
[17:42:19] we probably also don't need img vs non-img to differ. I'm curious where that came from
[17:42:23] I'll look into that
[17:43:00] <_joe_> it just came from a decade-old mod_expire configuration
[17:47:07] <_joe_> which, if I had to bet, came off of copy-pasting from comp.*
[17:47:18] <_joe_> (the stackoverflow of the old times)
[17:47:39] <_joe_> it used to be a pretty common configuration
[17:47:46] _joe_: where is the month config? I see the year config in expires.conf
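The three caching tiers described above could be sketched roughly like this. The TTL values, parameter names, and helper function are illustrative only; this is not the actual /w/static.php logic or the Apache expires.conf configuration being discussed.

```python
"""Sketch of the three URL flavours: plain /w/foo, /w/foo?hash, and /w/foo?static."""

LONG = 365 * 24 * 3600   # "cache for a year"
SHORT = 5 * 60           # "cache for a short time" - the exact value is a guess

def max_age(query: dict, hash_is_valid: bool) -> int:
    if "hash" in query:
        # Versioned URL: long-lived if the hash matches a known file version,
        # very short otherwise, so a deployment race can't poison the cache
        # with a stale response for long.
        return LONG if hash_is_valid else SHORT
    if "static" in query:
        # "any version will do" URLs (the /static/current replacement):
        # long-lived unconditionally.
        return LONG
    # Plain /w/foo: resolved against the current wiki hostname and MW version,
    # so keep the TTL short.
    return SHORT
```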
[17:48:00] <_joe_> it's the default for mod_expire
[17:48:16] <_joe_> so for file types that aren't handled by the 1y expiry, 1 month applies
[17:48:44] right, but apart from perhaps /COPYING, I can't think of anything that doesn't match this
[17:48:56] certainly all things that we'd reference with /static/current match this
[17:49:04] incl. images and css/js
[17:49:13] <_joe_> all images? no webm?
[17:49:21] we don't have webm files on disk
[17:49:29] we don't serve css/js this way either
[17:49:40] <_joe_> yeah I noticed looking at apache logs
[17:49:42] only images and fonts currently
[17:50:06] I think we can ignore the fact that there's a default fallback and go with 1Y for this endpoint
[17:50:13] <_joe_> sure
[17:50:24] <_joe_> I was for now reproducing the state of the art :)
[17:50:32] <_joe_> I think we can drop the Etag too, tbh
[17:50:35] what's the timeline? Should I fork yours and implement ?static = 1Y and update urls, or do you want to roll out the rewrite first?
[17:50:41] yes
[17:50:51] matching what we do for /w/ will help
[17:51:24] and the client-side maxage means it won't cause much of an influx of traffic other than internal from varnish
[17:51:28] <_joe_> I think we should roll out the rewrite first
[17:51:44] <_joe_> because it works immediately, and we can close the bug
[17:52:00] <_joe_> also we ensure we don't break old URLs in the meantime
[17:52:17] <_joe_> sorry, dinner time
[17:52:54] ok. I'll update the patch to use 1Y always, drop the Etag, and also add forward-support for ?static; then we can both handle our side (you rewrite, me update URLs)
[17:53:12] It'll then also move from a wmf-specific convention to a core recommendation, and I'll update docs for that
[19:56:35] Hi. I was wondering if it's known to SREs that gerrit.wikimedia.org is advertising the invalid/expired ISRG X1 intermediate in its SSL configuration? I've been tracing an issue for a couple of hours and realized it's server-side, and confirmed such at https://www.ssllabs.com/analyze.html?d=gerrit.wikimedia.org section Certification Paths #2, item #3 "sent by server".
[20:24:40] atol: most browsers should work round it
[20:25:08] Are you having trouble with access?
[20:59:39] While triaging some ancient code, I found a travis-ci build that's doing checkouts from Gerrit. (I don't know why, and I'll probably stop those.) I do know that browsers are patching/patched now to specifically block the expired root so that the invalid intermediate from Gerrit doesn't cause accessibility issues, but that won't make any difference to Travis-ci's default Ubuntu builder (which is just trying to do a 'git checkout'), so I figured I owed the courtesy of pointing out the flaw.
[21:00:22] (I'll also report it to Travis-ci, but I ~90% expect them to assign fault to the misconfigured intermediate.)
[21:03:09] It sounds like I've delivered the concern report successfully, so beyond that I don't seek technical support or anything, just wanted to FYI.
[21:14:06] atol: It's fairly late in the day in both Europe and on the US west coast. I imagine they'll read this tomorrow; otherwise, perhaps file it in Phabricator to be sure.
[21:16:30] Noted! I'll try to leave the tab open in case anyone has questions, but if I drop at some point, I trust their judgment on whether to repair. Appreciate the guidance.
[21:27:20] Which Ubuntu version would be useful?
[21:29:10] I tested Ubuntu 16 LTS and 18 LTS so far, and will test 20 next. Travis offers the full range of 12/14/16/18/20 and defaults to 16 unless you specifically pin one.
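A quick way to see what chain a server actually sends, and whether any entry in it is expired (the ISRG/DST cross-sign situation reported above), might look like the sketch below. The host is taken from the report; it assumes the openssl CLI is available and that an unpatched client would reject the expired entry.

```python
#!/usr/bin/env python3
"""Sketch: print the certificate chain a TLS server presents and flag expired entries."""
import re
import subprocess
from datetime import datetime, timezone

HOST = "gerrit.wikimedia.org"  # the endpoint under discussion; any TLS host works

def served_chain(host: str, port: int = 443) -> list[str]:
    # -showcerts makes s_client print every certificate the server presents,
    # not just the leaf; closing stdin makes it exit right after the handshake.
    out = subprocess.run(
        ["openssl", "s_client", "-connect", f"{host}:{port}",
         "-servername", host, "-showcerts"],
        input="", capture_output=True, text=True, timeout=30,
    ).stdout
    return re.findall(
        r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", out, re.S)

def describe(pem: str) -> tuple[str, datetime]:
    # Subject and notAfter of one PEM certificate, via the openssl CLI
    out = subprocess.run(
        ["openssl", "x509", "-noout", "-subject", "-enddate"],
        input=pem, capture_output=True, text=True,
    ).stdout
    subject = re.search(r"subject=(.*)", out).group(1).strip()
    enddate = re.search(r"notAfter=(.*)", out).group(1).strip()
    expires = datetime.strptime(enddate, "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
    return subject, expires

for pem in served_chain(HOST):
    subject, expires = describe(pem)
    status = "EXPIRED" if expires < datetime.now(timezone.utc) else "ok"
    print(f"{status:7} {expires:%Y-%m-%d}  {subject}")
```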
[21:30:10] Ah ok
[21:31:59] After 16.04 was supposed to be on the ok list
[21:32:40] Anyway, sleep time for me
[21:33:03] https://meta.wikimedia.org/wiki/HTTPS/2021_Let%27s_Encrypt_root_expiry is the big help guide
[21:33:09] Scott's blogs are interesting
[21:36:47] It's okay _unless_ you include the third intermediate shown in the ssllabs link above, which directs clients to ignore the trusted root and instead follow the trust chain to the expired root.
[21:39:11] This seems to happen mostly when Let's Encrypt is deployed in a scenario where either the server is configured to use a static copy of intermediate.pem rather than the one that LE writes out each time it renews the cert, _or_ the server is one of the various webservers that has to be restarted/signaled/whatever'd to pick up changes to the intermediate chain written on-disk by LE.
[21:39:34] (I've only debugged a few, though, so I can't be certain those two scenarios cover it.)
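The two misconfiguration scenarios described above (a stale static intermediate, or a webserver that was never reloaded after renewal) could be checked for roughly like this. The host and the certbot-style chain path are illustrative, not known locations on the server in question, and the openssl CLI is assumed.

```python
#!/usr/bin/env python3
"""Sketch: compare the intermediates a server serves with the chain file the ACME client wrote."""
import re
import subprocess

HOST = "gerrit.wikimedia.org"                                  # illustrative
CHAIN_ON_DISK = "/etc/letsencrypt/live/example.org/chain.pem"  # hypothetical certbot path

PEM_RE = re.compile(r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.S)

def fingerprint(pem: str) -> str:
    # SHA-256 fingerprint of one PEM certificate, via the openssl CLI
    out = subprocess.run(["openssl", "x509", "-noout", "-fingerprint", "-sha256"],
                         input=pem, capture_output=True, text=True).stdout
    return out.split("=", 1)[1].strip()

# Everything the server presents: [0] is the leaf, the rest are intermediates
served = PEM_RE.findall(subprocess.run(
    ["openssl", "s_client", "-connect", f"{HOST}:443", "-servername", HOST, "-showcerts"],
    input="", capture_output=True, text=True, timeout=30).stdout)

with open(CHAIN_ON_DISK) as f:
    on_disk = PEM_RE.findall(f.read())

if [fingerprint(p) for p in served[1:]] != [fingerprint(p) for p in on_disk]:
    print("served intermediates differ from the chain the ACME client wrote:")
    print("  either a static intermediate.pem is configured, or the webserver")
    print("  has not been reloaded since the last renewal")
else:
    print("server is serving the chain file that is on disk")
```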