[08:33:14] vgutierrez: I thought it would be better to chat about the cookbook with volans too
[08:33:26] makes sense :)
[08:33:33] * volans here
[08:34:16] so the first step (change varnish conf to not listen on port 80) could be done relatively safely (providing that no-one restarts varnish in the meantime)
[08:34:44] no one should restart a varnish instance under normal circumstances
[08:34:58] considering that before that you're gonna depool the host, no reason at all :)
[08:35:41] while the second step (reconfigure / restart haproxy) should be done host by host, and in this case either we disable puppet on all impacted hosts or we find another way
[08:35:56] (no need to restart haproxy)
[08:36:07] *reload
[08:36:27] everything should be done host by host
[08:36:39] we cannot stop listening on port 80 globally
[08:36:42] vgutierrez: what's the shortest sleep you'd consider ok between hosts?
[08:36:54] between hosts? 0s
[08:37:18] just to be safe, wait ~30s between hosts
[08:37:38] but I don't see any specific reason to introduce a longer wait than that
[08:38:30] because one option I was thinking of is: disable puppet on all remaining cp hosts, merge the patch to change both settings, then have a cookbook that does: depool, stop varnish, enable+run puppet, repool, one host at a time
[08:38:56] (add a start varnish if puppet doesn't do it ofc)
[08:39:12] I'd restart varnish rather than stopping it
[08:39:13] or is stop+start more impactful than a restart?
[08:39:30] any reason to stop it instead of restarting it?
[08:39:43] oh right
[08:39:46] yes, stop before puppet so port 80 is free
[08:39:51] hmmm
[08:40:10] not sure if puppet is gonna start varnish
[08:40:23] but yeah, sounds good and we can skip the netbox->hiera stuff
[08:40:44] because if you do it without sleep I guess in ~1 day it's all done
[08:40:53] and puppet can be disabled for ~1d for this one-time migration
[08:40:55] IMHO
[08:40:59] can we do this on a per-dc basis?
[08:41:05] yes
[08:41:05] that too
[08:41:18] hiera can be per dc and then centralized afterwards with a noop change
[08:41:24] disabling puppet CDN-wide for 1 day isn't ideal
[08:41:43] it can also be done per cluster per dc
[08:41:43] acme-chief relies on puppet to refresh OCSP stapled responses
[08:41:53] so 8 hosts at a time basically
[08:41:54] yep... per DC should be enough
[08:42:25] ok, so to recap (then I'll update T323557)
[08:42:26] T323557: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557
[08:42:41] - commit both changes to hiera on a per-dc basis
[08:43:04] - disable puppet on the dc
[08:43:22] (invert the two above)
[08:43:28] ok
[08:43:34] *merge after disable
[08:44:25] - depool
[08:44:25] - run puppet to change varnish and haproxy configuration
[08:44:25] - restart varnish
[08:44:25] - reload haproxy
[08:44:25] - test test test
[08:44:25] - repool
[08:44:41] (re-enable puppet)
[08:44:52] fabfur: nope, that won't work
[08:45:03] depool + stop varnish + run-puppet
[08:45:17] otherwise if you run puppet without stopping varnish, HAProxy will fail to reload
[08:45:21] (the enable is part of the run, you can't run it if disabled)
[08:45:35] puppet will automatically reload HAProxy
[08:46:10] ok
[08:46:55] any other considerations before I update the ticket and get on with writing the cookbook?
[08:47:16] go for it :)
[08:48:01] SGTM!
[08:48:06] oh, as you can imagine from my previous comment there is no need to explicitly reload haproxy after starting varnish
[08:48:24] ok
[08:55:19] 10Traffic: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (10Fabfur) The action plan is slightly changed thanks to the contribution of @Vgutierrez and @Volans . Now the checklist is more like (**on a per-DC basis**): 1. Disable puppet on all impacted hosts 2. Merge the changes on hiera...
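[Editor's note] The per-host sequence agreed on above (depool, stop varnish to free port 80, then an enabling puppet run that reconfigures and reloads haproxy, then repool) can be sketched roughly as below. This is a minimal illustration only, not the actual WMF spicerack cookbook; the host names, command strings, and `plan_for_dc` helper are all hypothetical:

```python
# Hypothetical sketch of the per-DC, host-by-host migration loop discussed
# above. Commands are illustrative placeholders, not the real cookbook calls.

MIGRATION_STEPS = [
    "depool",                     # take the host out of rotation first
    "systemctl stop varnish",     # free port 80 before puppet reconfigures haproxy
    "run-puppet-agent --enable",  # the run re-enables puppet and reloads haproxy
    "systemctl start varnish",    # only needed if puppet doesn't start it itself
    "test",                       # "test test test" before putting it back
    "pool",                       # repool once the new setup checks out
]

def plan_for_dc(hosts):
    """Return the ordered (host, command) pairs for migrating one DC."""
    return [(host, step) for host in hosts for step in MIGRATION_STEPS]
```

The key ordering constraint from the discussion is that varnish must be stopped *before* the puppet run, otherwise HAProxy fails to reload because port 80 is still taken.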
[08:55:48] so, let's check how many errors are in a single comment :)
[08:57:19] you start talking about disabling puppet on all affected hosts and then on step 3 you talk about a single host
[08:57:45] yeah, maybe not too clear
[08:57:50] [E_ANAL] you should mention that you're gonna loop through all the hosts on that DC/cluster
[08:58:27] done
[08:58:46] cool
[08:59:11] start writing code then :)
[08:59:38] damn...
[09:00:59] :)
[09:09:31] vgutierrez: just to be on the safe side, do you see any performance drawback between a varnish restart and a stop+start? Is the behaviour any different?
[09:09:49] nope AFAIK
[09:10:14] it loses the cached content anyway, right?
[09:10:19] indeed
[09:10:35] hence my previous question about the sleep in between :D
[09:10:46] duh
[09:10:51] I obviously need more coffee
[09:10:58] to allow one to recover before doing the next one
[09:10:59] rather than 30s... 20 minutes :)
[09:11:18] ok, now it makes sense, I was a bit surprised earlier :)
[09:34:23] hi folks, I'm going through the prometheus "global" dashboards as part of T288196 and wanted to know if https://grafana.wikimedia.org/d/000000257/tcp-fast-open?orgId=1 is still used or if we can ditch it? some panels have ... wait for it ... graphite as a data source
[09:34:23] T288196: Retire Prometheus 'global' instance - https://phabricator.wikimedia.org/T288196
[09:38:23] godog: not used at all AFAIK
[09:39:20] ack, thank you vgutierrez !
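[Editor's note] The spacing settled on above — roughly 20 minutes between hosts rather than 30s, because every restart empties that host's varnish cache and each cache should recover before the next one is wiped — could be sketched like this. The `start_offsets` helper is a hypothetical illustration, not part of any real cookbook:

```python
# Hypothetical sketch of the inter-host spacing discussed above: each varnish
# restart wipes that host's cache, so waiting ~20 minutes between hosts lets
# each cache warm back up before the next one is emptied.

INTER_HOST_SLEEP = 20 * 60  # seconds between hosts, per the discussion

def start_offsets(hosts, sleep=INTER_HOST_SLEEP):
    """Offset in seconds at which each host's migration would begin."""
    return {host: i * sleep for i, host in enumerate(hosts)}
```

With the ~8 hosts per cluster per DC mentioned earlier, this spacing adds a bit over two hours per cluster, which still fits comfortably within a one-day puppet-disable window.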
I'll save a copy locally just in case and nuke it
[09:44:57] in the same spirit: https://gerrit.wikimedia.org/r/c/operations/puppet/+/922477
[10:37:05] 10Traffic, 10serviceops: Either include X-Varnish in MediaWiki logs and include the X-Varnish in Varnish 5xx logs; or, include the beresp X-Request-Id in Varnish 5xx logs - https://phabricator.wikimedia.org/T274595 (10jbond)
[11:44:16] 10netops, 10Infrastructure-Foundations, 10SRE: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney)
[11:44:24] 10netops, 10Infrastructure-Foundations, 10SRE: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (10cmooney) 05Open→03Resolved Merged patch based on option 5, but using hostname rather than any other var to determine device class....
[12:45:55] 10Traffic, 10Commons, 10Wikimedia-Site-requests, 10serviceops, 10Patch-For-Review: Enforce upload rate limits for bots on commons - https://phabricator.wikimedia.org/T248177 (10jbond)
[14:56:37] 10Traffic, 10Infrastructure-Foundations, 10SRE, 10Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis)
[14:58:41] 10Traffic, 10Infrastructure-Foundations, 10SRE, 10Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis)
[15:03:09] 10Traffic, 10Infrastructure-Foundations, 10SRE, 10Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis)
[16:41:38] 10netops, 10DC-Ops, 10Infrastructure-Foundations: add contract end dates to the ops maint & contract gcal - https://phabricator.wikimedia.org/T84585 (10jbond)
[19:09:51] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (10cmooney)
[19:10:12] 10netops, 10Infrastructure-Foundations, 10SRE: Expose sub-rated circuit speeds to Homer templates - https://phabricator.wikimedia.org/T328313 (10cmooney) 05Open→03Resolved This is now modeled in Netbox in the 'upstream_speed' field of the z-end of a circuit termination. The one service we have where it...
[19:12:53] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr) Completed today 1 E1 lvs1018 lsw1-e1-eqiad xe-0/0/47 ssw1-e1-eqiad xe-0/0/33
[19:19:27] 10netops, 10Infrastructure-Foundations, 10SRE: Expose sub-rated circuit speeds to Homer templates - https://phabricator.wikimedia.org/T328313 (10cmooney) 05Resolved→03Open
[19:19:47] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (10cmooney)
[19:39:18] 10Traffic, 10conftool, 10Sustainability (Incident Followup): requestctl should handle (or reject) quotation marks in the resp_status - https://phabricator.wikimedia.org/T337336 (10RLazarus)
[19:40:10] sukhe: ^ that's the bug from earlier in case you'd like to follow along, but no obligation - I forgot to say, but thanks for being nearby :)
[19:47:20] 10netops, 10Infrastructure-Foundations, 10SRE: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (10ayounsi) We had a chat about this. The first iteration will be a manual cookbook that takes a host as parameter. The cookbook will connect to the device and see if there is alre...
[19:49:05] rzl: thanks!
[19:56:42] 10Traffic, 10conftool, 10Sustainability (Incident Followup): requestctl sync should print an error on invalid YAML syntax - https://phabricator.wikimedia.org/T337339 (10RLazarus)
[20:01:41] 10Traffic, 10conftool, 10Sustainability: requestctl log should print a better error message on incorrect action name - https://phabricator.wikimedia.org/T337341 (10RLazarus)
[20:10:25] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) Thanks @Jclark-ctr I think we're good to do the other two lvs moves whenever you are ready. Please ping me on irc and we can arran...
[20:36:09] 10netops, 10Infrastructure-Foundations, 10SRE-tools: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (10Volans) p:05Triage→03High
[21:22:16] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-tools: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (10Volans) With @ayounsi we've checked a bunch of things and so far we didn't find anything wrong. The traffic seems to exit from `mr1` but doesn't make it to the...