[00:31:49] 10Traffic, 10Performance-Team, 10SRE, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling)
[01:24:45] 10Traffic, 10Performance-Team, 10SRE, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) Initial ab run: {P32126}
[05:51:37] 10Traffic, 10Performance-Team, 10SRE, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) A possible reason for the slightly slower times on codfw is cross-DC connections for LoadBalancer::isPrimaryRunningReadOnly(). While running ab -...
[05:52:19] 10Traffic, 10Performance-Team, 10SRE, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling)
[07:08:20] 10Traffic, 10Performance-Team, 10SRE, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10jcrespo) > I looked more closely at one of them with tcpdump
So are cross-DC connections happening in plain text?
[07:41:43] 10Traffic, 10Performance-Team, 10SRE, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) @jcrespo No, cross-DC DB connections are encrypted but you can figure out what's going on by looking at surrounding (DC-local) memcached traffic.
[07:45:11] 10Traffic, 10Performance-Team, 10SRE, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) All cross-DC connections except the first had an associated statsd metric `MediaWiki.wanobjectcache.rdbms_server_readonly.hit.refresh`, which imp...
[07:46:33] 10Traffic, 10Performance-Team, 10SRE, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10jcrespo) >>! 
In T279664#8122493, @tstarling wrote: > @jcrespo No, cross-DC DB connections are encrypted but you can figure out what's going on by looking at...
[09:18:57] 10Traffic, 10Performance-Team, 10SRE, 10serviceops, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10Joe) >>! In T279664#8122378, @tstarling wrote: > A possible reason for the slightly slower times on codfw is cross-DC connections for LoadBalancer::isPrimar...
[10:40:19] 10netops, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10cmooney) Hi Andrew, I'm unable to find any issue here. Looking at the cloud-in acl/filter on the CR routers, there is no rule that will block tra...
[10:55:26] 10Traffic, 10Performance-Team, 10SRE, 10SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10MatthewVernon)
[10:55:56] 10Traffic, 10Performance-Team, 10SRE, 10SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10MatthewVernon) Are you proposing to do away with the concept of "active" DC, then? e.g. currently `swiftrepl` runs from the active DC to fix up where MW fail...
[13:36:49] 10netops, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10rook) I've been looking for how we see the routing of a subnet in openstack, but thus far have come up with little. How did you identify that there is...
[13:37:29] 10netops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) on cr1-eqiad, we have all the interfaces set up for the asw2-c and asw2-d move ` papaul@re0.cr1-eqiad> show interfaces terse | match xe-1/1 xe-1/1/0:0... 
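The `rdbms_server_readonly.hit.refresh` metric discussed above points at pre-emptive refresh on cache hit: as a cached value approaches its TTL, a growing fraction of cache hits also regenerate the value, which in this case meant an extra cross-DC primary check per refresh. A minimal sketch of that pattern follows; the function, the `low_ttl` window, and the 4th-power ramp are illustrative assumptions, not MediaWiki's actual WANObjectCache implementation:

```python
import random

def should_refresh(age, ttl, low_ttl=30, exponent=4, rng=random.random):
    """Probabilistic pre-emptive refresh (illustrative sketch).

    Far from expiry a hit is just a hit; within the final `low_ttl`
    seconds before expiry, the chance that a hit also triggers a
    regeneration ramps up from 0 to 1, so the value is usually
    renewed before it expires and readers rarely stampede together.
    """
    remaining = ttl - age
    if remaining <= 0:
        return True          # already expired: must regenerate
    if remaining >= low_ttl:
        return False         # far from expiry: serve the cached value
    # Ramp the refresh probability toward 1 as remaining time shrinks.
    chance = (1 - remaining / low_ttl) ** exponent
    return rng() < chance
```

Each `True` return corresponds to a "hit.refresh" event: the caller serves the cached value but also re-runs the (here, cross-DC) read to repopulate the cache.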
[13:41:39] 10netops, 10Infrastructure-Foundations, 10SRE, 10cloud-services-team (Kanban): More public IPs for codfw1dev - https://phabricator.wikimedia.org/T313977 (10cmooney) Hi @rook, you probably need to confirm within the cloud team, but as far as I am aware the cloudgw nodes are external to OpenStack completely,...
[13:53:56] (HAProxyEdgeTrafficDrop) firing: 59% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[13:54:15] ^ expected
[14:43:56] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:03:16] (VarnishPrometheusExporterDown) firing: (6) Varnish Exporter on instance cp2029:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[16:03:56] (HAProxyEdgeTrafficDrop) firing: 64% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:08:56] (HAProxyEdgeTrafficDrop) resolved: 64% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - 
https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:28:56] (HAProxyEdgeTrafficDrop) firing: 66% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:33:56] (HAProxyEdgeTrafficDrop) resolved: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[16:58:16] (VarnishPrometheusExporterDown) firing: (6) Varnish Exporter on instance cp2029:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[17:33:57] (HAProxyEdgeTrafficDrop) firing: 58% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:38:56] (HAProxyEdgeTrafficDrop) resolved: 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:39:56] (HAProxyEdgeTrafficDrop) firing: 51% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - 
https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[17:44:56] (HAProxyEdgeTrafficDrop) resolved: 55% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[18:17:56] (HAProxyEdgeTrafficDrop) firing: 21% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[18:39:44] (VarnishPrometheusExporterDown) firing: (4) Varnish Exporter on instance cp2031:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[18:44:52] (PurgedHighBacklogQueue) firing: Large backlog queue for purged on cp2031:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[18:47:56] (HAProxyEdgeTrafficDrop) resolved: 42% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[18:53:35] (PurgedHighBacklogQueue) resolved: (2) Large backlog queue for purged on cp2031:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - 
https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2031 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[19:38:42] (VarnishPrometheusExporterDown) resolved: (2) Varnish Exporter on instance cp2033:9331 is unreachable - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[19:41:24] (PurgedHighEventLag) firing: High event process lag with purged on cp2033:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2033 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[19:41:29] (PurgedHighBacklogQueue) firing: Large backlog queue for purged on cp2033:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2033 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[19:45:35] (PurgedHighEventLag) resolved: (2) High event process lag with purged on cp2033:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2033 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[19:55:35] (PurgedHighBacklogQueue) resolved: (2) Large backlog queue for purged on cp2033:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=codfw%20prometheus/ops&var-instance=cp2033 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue