[06:48:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:53:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:46:04] hello folks, I've acked all the cp6* unknowns in icinga to clear out the unhandled problems board
[08:06:33] elukey: Thank you for handling that :) XioNoX and I have also acknowledged the other pending drmrs Icinga-Unknown
[08:09:45] np! :)
[08:09:57] (EdgeTrafficDrop) firing: 63% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
[08:36:31] that's the eqiad depool ^
[08:42:07] ema: thanks yes that is it
[08:48:23] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1090.eqiad.wmnet with OS buster
[08:54:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1090:9331 is unreachable - https://alerts.wikimedia.org
[08:54:57] (EdgeTrafficDrop) resolved: 67% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
[09:04:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1090:9331 is unreachable - https://alerts.wikimedia.org
[09:05:26] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1090:9331 is unreachable - https://alerts.wikimedia.org
[09:08:26] (EdgeTrafficDrop) firing: 60% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
[09:10:11] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1090:9331 is unreachable - https://alerts.wikimedia.org
[09:16:26] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1090:9331 is unreachable - https://alerts.wikimedia.org
[09:20:17] ^^ cp1090 is being reimaged :)
[09:31:26] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1090:9331 is unreachable - https://alerts.wikimedia.org
[09:33:01] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1090.eqiad.wmnet with OS buster c...
[09:33:32] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez)
[09:34:58] bblack: ack, ok :-)
[09:36:31] might have been dependent on the order of puppet runs in the initial setup
[09:36:41] (EdgeTrafficDrop) resolved: 69% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
[09:37:59] mid-term I plan to look into moving the ganeti-internal SSH key handling into Puppet instead of relying on the Ganeti-internal mechanism, this also makes workflows like reimages/reinstalls simpler
[09:55:54] 10Traffic, 10SRE, 10Patch-For-Review, 10User-ema: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (10ema) After setting `cache::single_backend_fqdn: cp4021.ulsfo.wmnet` in hiera, cp4021 is now gone from the list of cache backends on all upload@ulsfo nodes, see for insta...
[10:17:48] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host lvs6001.drmrs.wmnet with OS buster
[10:22:56] (EdgeTrafficDrop) firing: 57% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[10:35:13] ^^^ this is due to eqiad being re-pooled. All good.
[10:46:09] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10cmooney) Change completed successfully in eqiad. ##### Before ` cmooney@re0.cr1-eqiad> show route receive-protocol bgp 208.80.154.197 inet.0:...
[10:56:33] 10netops, 10Infrastructure-Foundations, 10SRE: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 (10cmooney) 05Open→03Resolved The next-hop self policy has been applied on cr1-eqiad and cr2-eqiad, in the Confed_eqiad group, to address this issue. cr2-eqiad is...
[10:57:17] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host lvs6001.drmrs.wmnet with OS buster completed: - lvs6001 (**WARN**...
[11:02:56] (EdgeTrafficDrop) resolved: 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[11:04:39] ^^ again this due to eqiad re-pool.
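A quick, hypothetical way to verify the effect described in the T288106 note at 09:55:54 from one of the other upload@ulsfo cache hosts; the assumption (not stated in the ticket excerpt) is that the peer backends show up in varnishadm's backend list on those nodes:

    # on any upload@ulsfo cache node other than cp4021:
    sudo varnishadm backend.list | grep cp4021 || echo 'cp4021 no longer listed as a backend'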
[11:05:51] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host lvs6002.drmrs.wmnet with OS buster [11:27:24] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host lvs6003.drmrs.wmnet with OS buster [11:45:14] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host lvs6002.drmrs.wmnet with OS buster completed: - lvs6002 (**WARN**... [12:06:45] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host lvs6003.drmrs.wmnet with OS buster completed: - lvs6003 (**WARN**... [14:29:50] bblack: we should rename the ganeti groups for the two Marseille clusters, they're currently both using default, which is confusing. I'd suggest "sudo gnt-group rename default B12" on ganeti6001 and "sudo gnt-group rename default B13" on ganeti6002? [15:08:44] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-swift-storage, 10ops-codfw: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10Ottomata) furud does not run any active services; it can be restarted anytime. [15:38:24] moritzm: ok, sounds good to me [15:38:41] I'll do that in a bit [15:39:20] ok, thanks! [15:39:51] I think, judging from the outside anyways, that some of the cookbook stuff for VM creation and so-on might need updates for this case too [15:40:25] the argument that looks like codfw_A (when the cluster is codfw and the group is Row_A), seems like it probably has to have some assumptions about these namings/layouts, or site-per-cluster, or both [15:40:30] I haven't poked at it yet [15:40:47] fyi, I depooled lvs2007 for the switch replacement [15:41:15] yeah, I'll have a patch ready for this soon (this is where I noticed the "default" names for the groups) [15:41:17] and 2010 is taking over nicely [15:43:19] in the POPs the ganeti groups dont have the "row" part, just the name of the DC, right? [15:43:28] afair [15:43:59] it's https://doc.wikimedia.org/spicerack/master/api/spicerack.ganeti.html#spicerack.ganeti.CLUSTERS_AND_ROWS [15:44:53] although we could get those dynamically for netbox IMHO [15:45:29] https://gerrit.wikimedia.org/r/739855 [15:52:20] ah, nice link, volans, I should use doc.wm more [15:56:19] curious, why does drmrs get 2 new clusters and not just 2 rows in one cluster? [15:57:32] mutante: instead of having 1 big virtual switch across the two racks, we now have independent switches [15:57:47] that way each rack is its own failure domain [15:58:01] aha! gotcha, thanks [15:58:12] so they each have a public/private vlan, etc.. 
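For reference, a minimal shell sketch of the renames proposed at 14:29:50, one per per-rack drmrs cluster, run on the respective master nodes; the listing step is just a sanity check and the rack/group names are the ones suggested above:

    # on ganeti6001 (rack B12 cluster)
    sudo gnt-group rename default B12
    # on ganeti6002 (rack B13 cluster)
    sudo gnt-group rename default B13
    # confirm the new group name on each cluster
    sudo gnt-group list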
[15:58:45] cool, that sounds good, more separation
[15:59:23] right, it's a step towards more resiliency
[15:59:51] each rack in drmrs is more like a row in the legacy core site setup
[16:00:05] [but with L3 at the ToR too]
[16:00:27] but this whole thing about multiple ganeti rows vs multiple ganeti clusters also applies to the core sites, to some lesser degree.
[16:01:01] what we're doing there now works, but it does have some design resiliency holes that should maybe eventually be addressed
[16:01:09] [sorry, "there" being the core sites]
[16:01:42] the core sites' layout is one ganeti Cluster for the whole site, and one ganeti Group per row
[16:02:14] one Ganeti cluster has one master node, and you need the master node to be working to do any significant ganeti operations.
[16:02:23] I guess one question is how much fate shares the groups within one cluster
[16:02:42] ganeti has master failover by floating the IP of ganeti01.svc.eqiad.wmnet from the current master node to
[16:02:46] that answers the questions :)
[16:03:10] but the address used for that floating master name/IP, is from the per-row vlan of the primary master node
[16:03:42] so the only possible backup nodes are also in that row. If we lose that row's network or power at a whole-row level, we cease to be able to manage ganeti instances in the other rows (although they'll keep running as they are)
[16:04:45] so if there's some singular ganeti instance which is important and which also lived only in that row, it would be hard to reprovision it while the row's dead. I guess you'd have to reconfigure the ganeti cluster and move the master node floating IP in DNS, etc, etc.
[16:05:18] if you look at the rough probabilities of different failure scenarios, I don't think this is really a high priority problem
[16:05:24] but it is something to think about in the future of the design
[16:06:07] but I think that floating IP is not used much, mostly for monitoring and identifying which host is the master
[16:06:18] if you ssh to any other node it will tell you which host is the master
[16:06:22] right
[16:06:23] and you can ssh to that host directly
[16:06:39] but if no host can become the master, because no remaining live host can take on the master IP
[16:06:58] then what? or will it still elect an other-row master even though that IP isn't usable there?
[16:07:50] I'd hope but I don't know
[16:07:51] I mean, it's possible that works, but I wouldn't really expect it to be without some impact. Might be interesting to test sometime! :)
[16:08:28] I don't think "gnt-cluster verify" would be happy in any case, for whatever that's worth at that moment. So you're kind of in supported territory in terms of expectations.
[16:08:42] s/supported/unsupported/
[16:09:36] maybe something we could test with the new ganeti test cluster
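The open question above (what happens when no remaining node can take over the floating master IP) could be exercised on the test cluster just mentioned. A minimal sketch using standard Ganeti commands, with nothing WMF-specific assumed beyond having sudo on the cluster nodes:

    # which node currently holds the master role
    sudo gnt-cluster getmaster
    # overall cluster consistency checks
    sudo gnt-cluster verify
    # on a master-candidate node, attempt to take over the master role
    # (this is where the per-row floating-IP constraint would bite)
    sudo gnt-cluster master-failover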
[16:11:49] the downside of splitting clusters across sub-site failure domains (like drmrs is now, or a hypothetical core site with cluster-per-row or something)
[16:12:04] is that you can't easily rebalance/migrate with a simple operation between independent clusters.
[16:12:20] so you have to treat them as two separate things (kinda like AWS availability zones or whatever)
[16:12:56] yeah, there will be some kind of shared fate if live migration is possible
[16:13:22] so it becomes another layer/dimension of redundancy in every way. A service in such a setup which needs very high resiliency might want to have twice as many nodes as it had before (2 in each row-cluster, rather than 1 per row in one big cluster), and so that duplication can double up the VM resources needed.
[16:13:47] it's not just an easy-win, there's tradeoffs
[16:13:55] yeah exactly
[16:14:58] and probably for many existing ganeti-VM cases, we only have one VM in the site, and assume we can migrate it to another row-group if necessary because of a row-level issue.
[16:15:13] splitting the clusters per-row means that singular VM can't leave the row now
[16:15:33] (easily, through standard ganeti mechanisms, I mean. I guess you could reprovision it, at a new IP address)
[16:15:37] we can't just migrate it to a different row as its IP won't fit
[16:16:02] but I think the idea is to be able to easily spun a new one
[16:16:13] the situation's simpler when the VMs are part of a redundant cluster of N VMs and you can make arbitrary choices about layout/redundancy, but for single-host services that can't be made redundant, it's tricky
[16:17:06] it all goes down to the application :)
[16:17:34] and while row-level failure in a core DC is possible, and I think we've had it happen before in various scenarios, it's not very common or likely compared to many of the other dimensions of availability we're trying to protect, either.
[16:17:39] thanks a lot for this incredibly detailed answer to my question, heh :)
[16:17:51] sorry, it's my disease :)
[16:17:57] it's great :)
[16:18:23] +1 just stumbled in, very interesting reading :)
[16:27:13] bblack: I think vgutierrez and ema are out for the day. https://gerrit.wikimedia.org/r/c/operations/puppet/+/738422 may be contributing to a 15000% increase in log traffic from haproxy: https://logstash.wikimedia.org/goto/ca0c3e167e0b6c09ac6cc80f31b15104
[16:29:11] well haproxy is "new", and we're expanding its testing footprint
[16:29:16] is it enough that it's harming something?
[16:29:38] chwite: ^
[16:30:38] Yeah, it manifests as a 300% increase in logstash load. We can't sustain that for long.
[16:30:46] (I mean, we kind of expect a big increase. But it should also be offset by the decrease in logs from the ats-tls daemons it replaced?)
[16:31:18] or maybe those didn't end up through logstash at all
[16:32:53] ouch
[16:32:56] ^^ I'm pretty sure that
[16:33:23] cwhite: https://gerrit.wikimedia.org/r/c/operations/puppet/+/738422/15/modules/profile/templates/cache/haproxy.rsyslog.conf.erb that doesn't prevent haproxy from hitting logstash?
[16:33:55] gerrit seems super-slow today btw, had a couple of connection timeouts even, to https for it
[16:34:01] not sure what's going on there
[16:34:07] vgutierrez: looking
[16:35:59] bblack: via esams is as slow as usual for my slow connection
[16:37:39] vgutierrez: that filter appears to be installed after the kafka output
[16:39:37] brrr sorry
[16:39:50] irccloud seems to be suffering their own issues
[16:40:18] cwhite: I was saying that maybe the haproxy rsyslog snippet needs a higher priority than the 30-remote-syslog one?
[16:41:49] vgutierr1z: yeah, that filter appears installed after the kafka output (30)
[16:41:55] should probably be priority 20 or so
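The fix linked a few lines below (change 739862) hinges on rsyslog's include ordering: snippets under /etc/rsyslog.d/ are applied in lexical filename order, so a filter meant to stop haproxy messages before they reach the kafka/logstash output in the 30-remote-syslog snippet has to carry a lower number. A sketch of the check on a cache node; the haproxy filename shown is illustrative, not necessarily the puppet-managed one:

    # list the snippets in the order rsyslog will apply them
    ls /etc/rsyslog.d/
    # desired ordering (illustrative names):
    #   20-haproxy.conf        <- filter/route haproxy traffic here...
    #   30-remote-syslog.conf  <- ...before the kafka output defined here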
[16:43:10] hah, what a time for a netsplit
[16:43:39] or, an irccloud outage?
[16:44:08] cwhite: https://gerrit.wikimedia.org/r/c/operations/puppet/+/739862
[16:44:09] yep
[16:44:21] no, there's something bigger happening internet-wide
[16:44:29] I think the IRC issues and others are just secondary fallout
[16:44:44] vgutierr1z: +1
[16:44:45] it's actually an issue between libera.chat and irccloud
[16:45:18] merging... sorry about that :)
[16:46:07] 10HTTPS, 10Traffic, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10Reedy)
[16:48:17] thanks for jumping on and finding a solution :)
[16:49:51] cwhite: running puppet on the affected nodes.. messages should be already decreasing
[16:50:14] confirmed, squiggly line goes down
[16:50:31] I manually hit cp3065 that should be the one holding more rps right now
[16:50:50] cumin is taking care of the others
[16:51:13] Awesome, thank you!
[16:51:40] (done)
[16:51:51] you shouldn't get more TTFB metrics on logstash from those nodes
[16:52:00] \o/
[16:53:01] Yep, looks back to normal now. Thanks again, and have a good rest of your day :)
[16:57:10] that seems to save a 0,5% of CPU in cp3065 BTW
[17:23:14] hello folks, as FYI I am configuring varnishkafka-webrequest on cp3050 to use the ca bundle /etc/ssl/localcerts/wmf_trusted_root_CAs.pem
[18:06:56] (EdgeTrafficDrop) firing: 59% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[18:10:10] expected
[19:11:56] (EdgeTrafficDrop) resolved: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
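On the varnishkafka change noted at 17:23:14: varnishkafka forwards kafka.* settings to librdkafka, whose CA trust-store option is ssl.ca.location, so the relevant line should look roughly like the one sketched below. The config file path is an assumption, not necessarily the one used on cp3050:

    # hypothetical check on cp3050 after the change
    grep 'ssl.ca.location' /etc/varnishkafka/*.conf
    # expected to show something like:
    #   kafka.ssl.ca.location = /etc/ssl/localcerts/wmf_trusted_root_CAs.pem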