[03:00:45] (HAProxyRestarted) firing: HAProxy server restarted on cp1085:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1085&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[06:53:17] HTTPS, Traffic, SRE, serviceops, Abstract Wikipedia team (Phase λ – Launch): Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (Joe) we also need to add wikifunctions to our internal certs
[07:00:45] (HAProxyRestarted) firing: HAProxy server restarted on cp1085:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1085&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[07:13:07] HTTPS, Traffic, SRE, serviceops, Abstract Wikipedia team (Phase λ – Launch): Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (Joe)
[07:34:31] Traffic, netops, DBA, Data-Engineering, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Jelto)
[07:41:27] Traffic, SRE: Varnish SLI is impacted by external components performance|behavior - https://phabricator.wikimedia.org/T317051 (Vgutierrez) In progress→Resolved can be closed, cheers!
[07:45:29] Traffic, netops, DBA, Data-Engineering, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (MoritzMuehlenhoff)
[07:52:59] Traffic, netops, DBA, Data-Engineering, and 9 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (fgiunchedi)
[08:28:58] Traffic, netops, DBA, Data-Engineering, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Gehel)
[08:29:16] Traffic, netops, DBA, Data-Engineering, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Gehel)
[08:40:45] (HAProxyRestarted) resolved: HAProxy server restarted on cp1085:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=eqiad%20prometheus/ops&var-instance=cp1085&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[08:42:09] Traffic, netops, DBA, Data-Engineering, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi)
[08:55:32] Traffic, Infrastructure-Foundations: Set cookie in Varnish to start a probe - https://phabricator.wikimedia.org/T335637 (ayounsi) Not answering the question directly (as I don't know varnish enough), but as first iteration we could sample equally all clients (but keeping the `PreventProbe` cookie). Then...
[09:34:08] Traffic, netops, DBA, Data-Engineering, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (BTullis)
[09:55:23] Traffic, netops, DBA, Data-Engineering, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (BTullis)
[09:59:53] Traffic, netops, DBA, Data-Engineering, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row C switches upgrade - T334049...
[10:26:29] Traffic, Upstream: HAProxy 2.6.12 segfaults - https://phabricator.wikimedia.org/T334448 (Vgutierrez) p:High→Medium Decreasing the priority as we are already testing a fixed version (fix proposed by upstream and that should be released as part of HAProxy 2.6.13 at some point) and we aren't seeing...
[10:44:50] netops, Infrastructure-Foundations, SRE: Add network-layer protections to avoid inadvertently lowering IRB MTU - https://phabricator.wikimedia.org/T329799 (cmooney) Open→Resolved
[10:53:06] Traffic, netops, DBA, Data-Engineering, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in codfw: codfw row C switches upgrade - T334049...
[11:06:04] netops, Cloud-Services, Infrastructure-Foundations, SRE: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (cmooney) p:Triage→Medium
[11:06:25] netops, Cloud-Services, Infrastructure-Foundations, SRE: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (cmooney)
[11:10:34] netops, Cloud-Services, Infrastructure-Foundations, SRE: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (aborrero) I plan to do {T335759} then we can specify the FQDN to use for the bird config. Otherwise I think we would need to hardcod...
[11:25:20] netops, Cloud-Services, Infrastructure-Foundations, SRE: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (cmooney) @aborrero hey. Yeah I can understand why having to hardcode the IPs in the puppet tree is not a great option. Unfortunate...
[11:30:26] netops, Cloud-Services, Infrastructure-Foundations, SRE: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (aborrero) yeah I'm thinking about doing something like `resolve_ipv4(whateverserver.codfw.hw.wikimedia.cloud)`, so basically let pup...
[11:52:05] netops, Cloud-Services, Infrastructure-Foundations, SRE: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (cmooney) @aborrero yep that should work. Potentially a race condition there if we drive the DNS from Netbox, which will only get...
[12:11:25] Traffic, netops, DBA, Data-Engineering, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (MoritzMuehlenhoff)
[12:20:42] Traffic, netops, DBA, Data-Engineering, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[12:26:02] XioNoX: codfw DNS depooled as well
[12:26:03] Traffic, netops, DBA, Data-Engineering, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ssingh)
[12:26:10] great!
[12:27:09] Traffic, netops, DBA, Data-Engineering, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[12:42:14] Traffic, netops, DBA, Data-Engineering, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[12:46:04] Traffic, netops, DBA, Data-Engineering, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Eevans)
[13:03:33] Traffic, netops, DBA, Data-Engineering, and 11 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=21224f03-d3c2-4431-accb-64fcadd01a0f) set by ayounsi@cumin1001 for 2:00:00 on 185 host(s) and their ser...
[13:18:47] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp2039:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[13:21:48] (VarnishPrometheusExporterDown) resolved: (12) Varnish Exporter on instance cp2027:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[13:24:42] Traffic, netops, DBA, Data-Engineering, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (klausman)
[13:25:32] Traffic, netops, DBA, Data-Engineering, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (MoritzMuehlenhoff)
[13:25:42] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[13:37:23] HTTPS, Traffic, SRE, serviceops, Abstract Wikipedia team (Phase λ – Launch): Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (Clement_Goubert) a:Clement_Goubert
[13:47:59] Traffic, netops, DBA, Data-Engineering, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (Andrew)
[13:54:21] Traffic, ops-codfw: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (ssingh)
[13:55:12] Traffic, ops-codfw: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (ssingh) p:Triage→Low
[14:01:33] netops, Infrastructure-Foundations, SRE: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1 - https://phabricator.wikimedia.org/T254013 (ayounsi)
[14:02:23] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[14:02:39] Traffic, netops, DBA, Data-Engineering, and 10 others: codfw row D switches upgrade - https://phabricator.wikimedia.org/T335042 (ayounsi)
[14:02:58] Traffic, netops, DBA, Data-Engineering, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi)
[14:10:53] HTTPS, Traffic, SRE, serviceops, and 2 others: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org - https://phabricator.wikimedia.org/T313227 (Clement_Goubert) Open→In progress
[14:57:16] Traffic, netops, DBA, Data-Engineering, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (Jelto)
[15:00:10] Traffic, netops, DBA, Data-Engineering, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row C switches upgrade - T334049 started.
[15:05:43] netops, Infrastructure-Foundations, SRE: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (ayounsi)
[15:07:08] Traffic, netops, DBA, Data-Engineering, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ayounsi) Open→Resolved a:ayounsi Upgrade went fine! Thanks everybody.
[15:17:12] Traffic, netops, DBA, Data-Engineering, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (ops-monitoring-bot) jiji@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in codfw: codfw row C switches upgrade - T334049 comple...
[17:39:25] Traffic, SRE: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. - https://phabricator.wikimedia.org/T330670 (ssingh) Summarizing for posterity the current state: On `cr*-eqiad`: ` /* ns0 */ route 208.80.154.238/32 { next-hop [ 208....
[17:40:08] Traffic, SRE: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]00[123]. - https://phabricator.wikimedia.org/T330670 (ssingh)
[17:44:20] Traffic, SRE: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]00[123]. - https://phabricator.wikimedia.org/T330670 (ssingh) Open→Resolved a:ssingh As per the last comment, we have moved over authdns[12]001 to dns[12]00[123] and marking this as resolved.
[17:45:07] Traffic, SRE: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]00[123]. - https://phabricator.wikimedia.org/T330670 (ssingh)
[18:29:36] Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: Consider confirming the hostname by user input when running the reimaging cookbook - https://phabricator.wikimedia.org/T332202 (Ladsgroup) Yeah, emphasizing on what host the operator is about to reimage sounds better to me. Maybe we can...
[18:49:46] Traffic, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (Papaul)
[18:56:45] (HAProxyRestarted) firing: HAProxy server restarted on cp2031:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=codfw%20prometheus/ops&var-instance=cp2031&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted
[19:21:47] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (wiki_willy) a:Papaul
[19:22:50] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (nskaggs)
[19:23:40] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (nskaggs)
[19:51:31] Traffic, PyBal: Clean up debs/pybal branches - https://phabricator.wikimedia.org/T335455 (BCornwall) In progress→Resolved After consulting with the team this morning, I got more context: Deep design improvements were attempted some time ago and merged into the master branch (explaining its signif...
[19:56:10] Traffic, PyBal: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715 (BCornwall) Stalled→Declined Hello! PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze development of non-trivial...
[20:19:13] Traffic, netops, DBA, Data-Engineering, and 10 others: codfw row C switches upgrade - https://phabricator.wikimedia.org/T334049 (colewhite)
[20:23:13] Traffic: Let HAProxy handle port 80 - https://phabricator.wikimedia.org/T323557 (BCornwall)
[20:27:06] Traffic: Write a cookbook to handle restarts of Wikimedia DNS - https://phabricator.wikimedia.org/T335533 (BCornwall)
[20:27:33] Traffic: Write a cookbook to handle restarts of Wikimedia DNS - https://phabricator.wikimedia.org/T335533 (BCornwall)
[20:41:45] (HAProxyRestarted) firing: (2) HAProxy server restarted on cp2031:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted