[08:33:14] vgutierrez: I thought it would be better to chat about the cookbook with volans too
[08:33:26] makes sense :)
[08:33:33] * volans here
[08:34:16] so the first step (change varnish conf to not listen on port 80) could be done relatively safely (providing that no-one restarts varnish in the meantime)
[08:34:44] no one should restart a varnish instance under normal circumstances
[08:34:58] considering that before that you're gonna depool the host, no reason at all :)
[08:35:41] while the second step (reconfigure / restart haproxy) should be done host by host, and in this case either we disable puppet on all impacted hosts or we find another way
[08:35:56] (no need to restart haproxy)
[08:36:07] *reload
[08:36:27] everything should be done host by host
[08:36:39] we cannot stop listening on port 80 globally
[08:36:42] vgutierrez: what's the shortest sleep you'd consider ok between hosts?
[08:36:54] between hosts? 0s
[08:37:18] just to be safe, wait ~30s between hosts
[08:37:38] but I don't see any specific reason to introduce a longer wait than that
[08:38:30] because one option I was thinking of is: disable puppet on all remaining cp hosts, merge the patch to change both settings, then have a cookbook that does: depool, stop varnish, enable+run puppet, repool, one host at a time
[08:38:56] (add a start varnish if puppet doesn't do it ofc)
[08:39:12] I'd restart varnish rather than stopping it
[08:39:13] or is stop+start more impactful than a restart?
[08:39:30] any reason to stop it instead of restarting it?
[08:39:43] oh right
[08:39:46] yes, stop before puppet so port 80 is free
[08:39:51] hmmm
[08:40:10] not sure if puppet is gonna start varnish
[08:40:23] but yeah, sounds good and we can skip the netbox->hiera stuff
[08:40:44] because if you do it without sleep I guess in ~1 day it's all done
[08:40:53] and puppet can be disabled for ~1d for this one-time migration
[08:40:55] IMHO
[08:40:59] can we do this on a per-dc basis?
[08:41:05] yes
[08:41:05] that too
[08:41:18] hiera can be per dc and then centralized afterwards with a noop change
[08:41:24] disabling puppet CDN-wide for 1 day isn't ideal
[08:41:43] it can also be done per cluster per dc
[08:41:43] acme-chief relies on puppet to refresh OCSP stapled responses
[08:41:53] so 8 hosts at a time basically
[08:41:54] yep... per DC should be enough
[08:42:25] ok, so to recap (then I'll update T323557)
[08:42:26] T323557: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557
[08:42:41] - commit both changes to hiera on a per-dc basis
[08:43:04] - disable puppet on the dc
[08:43:22] (invert the two above)
[08:43:28] ok
[08:43:34] *merge after disable
[08:44:25] - depool
[08:44:25] - run puppet to change varnish and haproxy configuration
[08:44:25] - restart varnish
[08:44:25] - reload haproxy
[08:44:25] - test test test
[08:44:25] - repool
[08:44:41] (re-enable puppet)
[08:44:52] fabfur: nope, that won't work
[08:45:03] depool + stop varnish + run-puppet
[08:45:17] otherwise if you run puppet without stopping varnish, HAProxy will fail to reload
[08:45:21] (the enable is part of the run, you can't run it if disabled)
[08:45:35] puppet will automatically reload HAProxy
[08:46:10] ok
[08:46:55] any other considerations before I update the ticket and get on with writing the cookbook?
[08:47:16] go for it :)
[08:48:01] SGTM!
[08:48:06] oh, as you can imagine from my previous comment there is no need to explicitly reload haproxy after starting varnish
[08:48:24] ok
[08:55:19] 10Traffic: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (10Fabfur) The action plan is slightly changed thanks to the contribution of @Vgutierrez and @Volans . Now the checklist is more like (**on a per-DC basis**): 1. Disable puppet on all impacted hosts 2. Merge the changes on hiera...
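[Editor's note] The per-host sequence agreed on above (depool, stop varnish to free port 80, then an enabling puppet run that reconfigures and reloads haproxy, then repool) can be sketched roughly as below. This is a minimal illustration only, not the actual WMF spicerack cookbook; the host names, command strings, and `plan_for_dc` helper are all hypothetical:

```python
# Hypothetical sketch of the per-DC, host-by-host migration loop discussed
# above. Commands are illustrative placeholders, not the real cookbook calls.

MIGRATION_STEPS = [
    "depool",                     # take the host out of rotation first
    "systemctl stop varnish",     # free port 80 before puppet reconfigures haproxy
    "run-puppet-agent --enable",  # the run re-enables puppet and reloads haproxy
    "systemctl start varnish",    # only needed if puppet doesn't start it itself
    "test",                       # "test test test" before putting it back
    "pool",                       # repool once the new setup checks out
]

def plan_for_dc(hosts):
    """Return the ordered (host, command) pairs for migrating one DC."""
    return [(host, step) for host in hosts for step in MIGRATION_STEPS]
```

The key ordering constraint from the discussion is that varnish must be stopped *before* the puppet run, otherwise HAProxy fails to reload because port 80 is still taken.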
[08:55:48] so, let's check how many errors are in a single comment :)
[08:57:19] you start talking about disabling puppet on all affected hosts and then on step 3 you talk about a single host
[08:57:45] yeah, maybe not too clear
[08:57:50] [E_ANAL] you should mention that you're gonna loop through all the hosts on that DC/cluster
[08:58:27] done
[08:58:46] cool
[08:59:11] start writing code then :)
[08:59:38] damn...
[09:00:59] :)
[09:09:31] vgutierrez: just to be on the safe side, do you see any performance drawback between a varnish restart and a stop+start? Is the behaviour any different?
[09:09:49] nope AFAIK
[09:10:14] it loses the cached content anyway, right?
[09:10:19] indeed
[09:10:35] hence my previous question about the sleep in between :D
[09:10:46] duh
[09:10:51] I obviously need more coffee
[09:10:58] to allow one to recover before doing the next one
[09:10:59] rather than 30s... 20 minutes :)
[09:11:18] ok, now it makes sense, I was a bit surprised earlier :)
[09:34:23] hi folks, I'm going through the prometheus "global" dashboards as part of T288196 and wanted to know if https://grafana.wikimedia.org/d/000000257/tcp-fast-open?orgId=1 is still used or if we can ditch it? some panels have ... wait for it ... graphite as a data source
[09:34:23] T288196: Retire Prometheus 'global' instance - https://phabricator.wikimedia.org/T288196
[09:38:23] godog: not used at all AFAIK
[09:39:20] ack, thank you vgutierrez !
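[Editor's note] The spacing settled on above — roughly 20 minutes between hosts rather than 30s, because every restart empties that host's varnish cache and each cache should recover before the next one is wiped — could be sketched like this. The `start_offsets` helper is a hypothetical illustration, not part of any real cookbook:

```python
# Hypothetical sketch of the inter-host spacing discussed above: each varnish
# restart wipes that host's cache, so waiting ~20 minutes between hosts lets
# each cache warm back up before the next one is emptied.

INTER_HOST_SLEEP = 20 * 60  # seconds between hosts, per the discussion

def start_offsets(hosts, sleep=INTER_HOST_SLEEP):
    """Offset in seconds at which each host's migration would begin."""
    return {host: i * sleep for i, host in enumerate(hosts)}
```

With the ~8 hosts per cluster per DC mentioned earlier, this spacing adds a bit over two hours per cluster, which still fits comfortably within a one-day puppet-disable window.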
I'll save a copy locally just in case and nuke it
[09:44:57] in the same spirit: https://gerrit.wikimedia.org/r/c/operations/puppet/+/922477
[10:37:05] 10Traffic, 10serviceops: Either include X-Varnish in MediaWiki logs and include the X-Varnish in Varnish 5xx logs; or, include the beresp X-Request-Id in Varnish 5xx logs - https://phabricator.wikimedia.org/T274595 (10jbond)
[11:44:16] 10netops, 10Infrastructure-Foundations, 10SRE: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney)
[11:44:24] 10netops, 10Infrastructure-Foundations, 10SRE: Create mechanism to disable IPv6 RA generation on irb interfaces when required - https://phabricator.wikimedia.org/T337057 (10cmooney) 05Open→03Resolved Merged patch based on option 5, but using hostname rather than any other var to determine device class....
[12:45:55] 10Traffic, 10Commons, 10Wikimedia-Site-requests, 10serviceops, 10Patch-For-Review: Enforce upload rate limits for bots on commons - https://phabricator.wikimedia.org/T248177 (10jbond)
[14:56:37] 10Traffic, 10Infrastructure-Foundations, 10SRE, 10Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis)
[14:58:41] 10Traffic, 10Infrastructure-Foundations, 10SRE, 10Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis)
[15:03:09] 10Traffic, 10Infrastructure-Foundations, 10SRE, 10Performance-Team (Radar): GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10CDanis)
[16:41:38] 10netops, 10DC-Ops, 10Infrastructure-Foundations: add contract end dates to the ops maint & contract gcal - https://phabricator.wikimedia.org/T84585 (10jbond)
[19:09:51] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (10cmooney)
[19:10:12] 10netops, 10Infrastructure-Foundations, 10SRE: Expose sub-rated circuit speeds to Homer templates - https://phabricator.wikimedia.org/T328313 (10cmooney) 05Open→03Resolved This is now modeled in Netbox in the 'upstream_speed' field of the z-end of a circuit termination. The one service we have where it...
[19:12:53] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr) Completed today 1 E1 lvs1018 lsw1-e1-eqiad xe-0/0/47 ssw1-e1-eqiad xe-0/0/33
[19:19:27] 10netops, 10Infrastructure-Foundations, 10SRE: Expose sub-rated circuit speeds to Homer templates - https://phabricator.wikimedia.org/T328313 (10cmooney) 05Resolved→03Open
[19:19:47] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Create Quality of Service design for WMF internal networks - https://phabricator.wikimedia.org/T316358 (10cmooney)
[19:39:18] 10Traffic, 10conftool, 10Sustainability (Incident Followup): requestctl should handle (or reject) quotation marks in the resp_status - https://phabricator.wikimedia.org/T337336 (10RLazarus)
[19:40:10] sukhe: ^ that's the bug from earlier in case you'd like to follow along, but no obligation - I forgot to say, but thanks for being nearby :)
[19:47:20] 10netops, 10Infrastructure-Foundations, 10SRE: TLS certificates for network devices - https://phabricator.wikimedia.org/T334594 (10ayounsi) We had a chat about this. The first iteration will be a manual cookbook that takes a host as parameter. The cookbook will connect to the device and see if there is alre...
[19:49:05] rzl: thanks!
[19:56:42] 10Traffic, 10conftool, 10Sustainability (Incident Followup): requestctl sync should print an error on invalid YAML syntax - https://phabricator.wikimedia.org/T337339 (10RLazarus)
[20:01:41] 10Traffic, 10conftool, 10Sustainability: requestctl log should print a better error message on incorrect action name - https://phabricator.wikimedia.org/T337341 (10RLazarus)
[20:10:25] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) Thanks @Jclark-ctr I think we're good to do the other two lvs moves whenever you are ready. Please ping me on irc and we can arran...
[20:36:09] 10netops, 10Infrastructure-Foundations, 10SRE-tools: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (10Volans) p:05Triage→03High
[21:22:16] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-tools: DHCP traffic to install server is missing - https://phabricator.wikimedia.org/T337345 (10Volans) With @ayounsi we've checked a bunch of things and so far we didn't find anything wrong. The traffic seems to exit from `mr1` but doesn't make it to the...