[00:47:03] we switched gerrit-replica.wikimedia.org from gerrit2001 to gerrit2002, after gerrit1001 had successfully replicated to both old and new. once replicas were the same as on master, we removed the old replica from gerrit config, firewall rules, known_hosts etc
[00:48:00] everything is fine, we can git clone, gerrit had firewall change but is up and normal. now the old machine will sit there until Tuesday and all it needs is the decom cookbook and site.pp. all else is gone. this was old hardware out of warranty
[00:49:19] I will be out tomorrow and on Monday and Tuesday is holiday for all, so until Wednesday
[00:50:29] so gerrit2001 is not prod anymore and gerrit2002 now is
[00:51:20] downtiming 2001 to avoid false alerts.. if anything with 2002.. you can contact me via number on office page or talk to d.ancy
[09:08:51] ryankemper: many of the k8s nodes in codfw are cordoned as a result of the PDU maintenance afaik and so there'll be fewer resources available
[09:09:33] the memory request is pretty high but this has been to compensate for the increased load that num_workers might bring, I'm going to try upping that again next week so I'll leave it in place for now but after that I might reduce it
[09:29:42] hnowlan: actually, looking at the paste, this is exceeded namespace resource quota
[09:30:06] looks like we need to increase that to account for the increased memory requests
[09:30:34] ahh, oops
[09:38:03] <_joe_> hnowlan: ah sigh I wanted to mention to check the other day
[09:38:13] <_joe_> I was too busy and i dropped the ball on that
[09:38:39] <_joe_> let's fix that before we repool codfw
[09:38:42] tbh we can drop the memory requirements already rather than increase quotas
[09:38:49] <_joe_> ah hah
[09:44:49] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/820663 if anyone has a sec
[09:46:19] <_joe_> hnowlan: lgtm
[09:46:27] <_joe_> but let me check one thing
[09:49:58] <_joe_> I'm not sure how we hit the 100G limit if we were requesting just 75G with that pod
[09:50:21] <_joe_> well it's just the changeprop container, let me actually look at the containers in production
[09:55:01] is it a matter of the maxSurge as ryan mentioned? it won't quite hit 100 with the extra 25% but it's getting close
[09:55:14] <_joe_> ah yes
[09:55:16] <_joe_> btw
[09:55:37] <_joe_> we do have 30 pods of changeprop-jobqueue running in codfw without issues rn
[09:57:13] <_joe_> ah but they're running with the old settings it seems
[09:57:28] <_joe_> so memory: 1500Mi
[09:57:35] <_joe_> let's try with your change
[11:37:03] serviceops, serviceops-collab, GitLab (Infrastructure), Patch-For-Review: Document and test failover for GitLab and GitLab Replica - https://phabricator.wikimedia.org/T296713 (LSobanski) @Arnoldokoth For the sake of completeness, could you mention what the SSL issue and fix were?
[14:04:31] serviceops, SRE, ops-codfw: decommission mw2251-mw2255, mw2257-mw2258 - https://phabricator.wikimedia.org/T313730 (Papaul)
[16:05:25] I've done a blubberfile for thumbor, would love a review if someone feels like it (on Monday is fine) https://gerrit.wikimedia.org/r/c/operations/software/thumbor-plugins/+/813613
[16:10:15] serviceops, Infrastructure-Foundations, netbox: Netbox and Redis - https://phabricator.wikimedia.org/T311385 (ayounsi) >> Thanks for this task. So, from what I gathered, netbox uses Redis for caching and task queuing purposes, and support using different databases per function (queuing vs caching). I...
[19:39:08] serviceops, Parsoid, Patch-For-Review, Performance-Team (Radar): Parsoid migration to php 7.4 - https://phabricator.wikimedia.org/T312638 (ssastry) Quick summary: After an initial hiccup during which roundtrip testing was broken for about a week, we got a test run in y'day. Based on rough estima...
[22:17:40] serviceops, Parsoid, Patch-For-Review, Performance-Team (Radar): Parsoid migration to php 7.4 - https://phabricator.wikimedia.org/T312638 (ssastry) The new test run completed and the perf is roughly similar. As for the logstash warning, looking at a 3-month window, I see identical warnings from...
[22:30:46] serviceops, Observability-Logging: Increase of ~50 million access logs per day from mobileapps-production-tls-proxy - https://phabricator.wikimedia.org/T313099 (colewhite) In T314381 we routed k8s logs into a new partition. Preliminary analysis indicates that k8s logs occupied ~95% of the syslog partiti...
[22:32:02] serviceops, Observability-Logging: Increase of ~50 million access logs per day from mobileapps-production-tls-proxy - https://phabricator.wikimedia.org/T313099 (colewhite) p:Triage→High
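
Note on the 09:49-09:55 quota discussion earlier in this log: the arithmetic behind "it won't quite hit 100 with the extra 25%" can be sketched roughly as below. This is only an illustration, not the actual deployment config. The 100G quota, the 75G steady-state request, and the 25% maxSurge come from the chat; the split into 30 replicas of 2.5G each is a hypothetical choice made so the numbers add up to 75G, and the function names are made up for this sketch. It relies on the documented Kubernetes behaviour that a percentage maxSurge is rounded up to a whole number of pods, and it ignores any other containers in the pod, which would also count against the namespace quota.

```python
import math


def surge_pods(replicas: int, max_surge: float) -> int:
    """Extra pods allowed during a rolling update.

    Kubernetes rounds a percentage maxSurge up to a whole pod.
    """
    return math.ceil(replicas * max_surge)


def peak_request_g(replicas: int, per_pod_g: float, max_surge: float) -> float:
    """Memory requested while old and surge pods briefly coexist."""
    return (replicas + surge_pods(replicas, max_surge)) * per_pod_g


# Hypothetical split: 30 replicas x 2.5G = 75G steady state (75G is from the chat).
REPLICAS = 30
PER_POD_G = 2.5
MAX_SURGE = 0.25      # "the extra 25%" from the chat
QUOTA_G = 100         # "the 100G limit" from the chat

steady = REPLICAS * PER_POD_G
peak = peak_request_g(REPLICAS, PER_POD_G, MAX_SURGE)
print(f"steady={steady}G peak={peak}G quota={QUOTA_G}G headroom={QUOTA_G - peak}G")
```

Under these assumed numbers the rollout peaks a few G short of the quota, which matches "getting close": a modest increase in per-pod requests, or requests from sidecar containers, would be enough to push a rollout over the limit.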