[09:45:29] Traffic: Provide a TCP MSS clamping mechanism for real servers - https://phabricator.wikimedia.org/T350462 (CodeReviewBot) vgutierrez merged https://gitlab.wikimedia.org/vgutierrez/tcp-mss-clamper/-/merge_requests/1 Provide basic functionality
[09:45:59] fabfur: ^^ at least CodeReviewBot updates the phab task
[09:58:05] Traffic, SRE: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (Vgutierrez)
[09:58:51] Traffic, SRE: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (Vgutierrez) p:Triage→Medium
[10:38:41] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (Joe) p:Triage→High
[11:02:45] (VarnishHighThreadCount) firing: (8) Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:03:42] Traffic, GitLab (Project Migration), Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (hashar)
[11:07:45] (VarnishHighThreadCount) firing: (11) Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:12:45] (VarnishHighThreadCount) firing: (12) Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:16:19] Traffic, GitLab (Project Migration), Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (hashar)
[11:17:45] (VarnishHighThreadCount) firing: (14) Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:22:45] (VarnishHighThreadCount) firing: (14) Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:27:46] (VarnishHighThreadCount) firing: (11) Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:32:45] (VarnishHighThreadCount) firing: (10) Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[11:37:45] (VarnishHighThreadCount) resolved: (8) Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[12:18:41] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (Joe) I decided we should move about 10% of the mobileapps traffic at a time; that means about 300 rps, which I think we should be able to serve moving over about 2-3 api serve...
[12:32:58] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye
[12:42:44] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye...
[12:43:27] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2003.codfw.wmnet with OS bullseye
[13:31:40] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: BGP status (instance cr2-eqdfw) - https://phabricator.wikimedia.org/T351083 (LSobanski)
[13:32:31] Traffic, sre-alert-triage: Alert in need of triage: PuppetConstantChange (instance pybal-test2003:9100) - https://phabricator.wikimedia.org/T351084 (LSobanski)
[13:53:56] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2003.codfw.wmnet with OS bullseye...
[14:26:16] hello! I'd like to restore traffic flows to the editor-analytics services; developers have since patched the issues that caused us to roll back. Given that this was successfully in place before, I don't think we need a slow rollout. https://gerrit.wikimedia.org/r/c/operations/puppet/+/973758
[14:26:54] hi hnowlan I can check it
[14:27:54] thanks!
[14:29:49] hnowlan: that was already tested, correct?
[14:30:52] yes, the service was responding correctly for (most) requests after
[14:31:00] ok
[14:31:05] just some logic errors in the service made us roll back
[14:31:07] then I think it should be ok
[14:31:30] great, thank you!
[14:31:39] yw!
[15:17:29] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur)
[15:19:42] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur)
[15:20:34] Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ssingh)
[15:20:47] Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ssingh)
[15:55:50] Traffic, Infrastructure-Foundations, Puppet-Core, SRE, and 3 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (MoritzMuehlenhoff)
[16:15:34] Traffic, Infrastructure-Foundations, Puppet-Infrastructure, SRE, and 2 others: find solution for acmechief in puppet7 - https://phabricator.wikimedia.org/T349915 (jbond) In progress→Resolved a:jbond This is in place now use hiera key during migration
[16:48:56] Traffic, Patch-For-Review: Alert in need of triage: PuppetConstantChange (instance pybal-test2003:9100) - https://phabricator.wikimedia.org/T351084 (BCornwall) Open→Resolved a:BCornwall
[17:17:42] Traffic, GitLab (Project Migration), Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (BCornwall) >>! In T347623#9322202, @LSobanski wrote: > @BCornwall For operations/software/varnish, looks like it should just be archived and not migra...
[17:42:58] Traffic, MW-on-K8s, SRE, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Krinkle)
[18:17:07] Acme-chief, Traffic, SRE: acme-chief should support debian bookworm - https://phabricator.wikimedia.org/T344330 (BCornwall) In progress→Resolved
[18:17:12] Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[20:49:54] Traffic: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (Fabfur)
[21:14:16] sukhe: Do you have any experience with rolling out the unified cp stuff? Trying to get traffic-cache-bookworm up and running in horizon
[21:15:29] Currently diffing prod and cloud hieradata but it's quite tedious
[21:23:29] brett: yeah we should have similar commits in the history
[21:23:32] I remember doing it once
[21:24:09] realized I misnamed the instance so I'm starting over again with better names :)
[21:24:41] traffic-cache should work fine though? that's the prefix?
[21:24:53] basically: These puppet settings will affect all VMs in the traffic project whose names begin with 'traffic-cache'.
[21:24:58] yeah but we need a separate text and upload, yeah?
[21:25:11] we usually have had just one and that's fine
[21:25:16] oh
[21:25:28] traffic-cache-bullseye is cache::text for example
[21:25:30] How does that work though? hieradata keys like 'cache::cluster: upload'
[21:25:58] it's all just merged/shoved together?
[21:26:17] brett: just the cache::text role
[21:26:26] ohhhh upload is ignored
[21:26:28] so basically specify the role under "Puppet classes"
[21:26:29] gotcha
[21:26:44] okay, I'll kill the upload instance
[21:26:48] the traffic-cache-bullseye config should work verbatim
[21:27:11] 16:23:29 < sukhe> brett: yeah we should have similar commits in the history
[21:27:18] I misread unifying above :)
[21:27:26] but yeah just cache::text is fine for horizon
[21:27:55] sorry, I had just assumed that a horizon instance hadn't been launched since the unified cp work!
[21:29:22] the unified cp part at least in prod is that text and upload have the same hardware config underneath them now
[21:29:38] so there is no difference in the per-cluster hieradata we applied
[21:29:51] at least, for eqiad, since it's the first where text and upload both have two disks!
[21:30:07] esams will follow next, so we will unify the configs there
[21:30:21] ohhh my bad, I misunderstood
[21:30:55] all good, my mind was thinking about unifying the configs, which is what we do once we finish reimages
[21:31:15] dangit, so should I relaunch with the more generic traffic-cache-bookworm? :(
[21:32:36] I'll do that. Well, good practice
[21:33:08] yeah I think that's the desired name. not sure if we can rename
[21:33:45] I don't think that'd be easy, too many things to change. I'll just relaunch :)
[21:34:41] :P
[21:50:24] nd
[22:10:21] Okay, now how do I use the cache instance? :P
[22:17:08] Hm, haproxy is refusing to start with '/etc/update-ocsp.d/*.conf': No such file or directory
[22:17:36] Looks like the bullseye one fails to restart for the same reason
[22:18:20] do we have a fake key in the private repo?
[22:18:28] (well, the fakely-private repo)
[22:19:06] hmmm, appears we do
[22:19:32] in any case, though, I'm not sure how the OCSP stuff would've ever worked there. There must have been a way to disable/fake that also
[22:20:07] modules/profile/manifests/cache/haproxy.pp: Boolean $do_ocsp = lookup('profile::cache::haproxy::do_ocsp'),
[22:20:10] ?
[22:20:19] hieradata/cloud/eqiad1/traffic/hosts/traffic-cache-atstext-buster.yaml:profile::cache::haproxy::do_ocsp: true
[22:20:22] hieradata/common/profile/cache/haproxy.yaml:profile::cache::haproxy::do_ocsp: true
[22:20:31] ^ seems like this flag might have existed just for this case?
[22:21:17] I donno
[22:21:40] but basically: you're dealing with fake TLS certs that don't even have the correct format
[22:21:47] not sure how that's meant to work in a VM
[22:23:30] * brett bangs his head on the table
[22:24:13] ugh, sorry for not doing this on -private, I *just* realized
[22:24:15] glad it wasn't -sre or something
[22:24:30] I really need to alter my input bar on weechat to put the current chan