[06:48:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[06:53:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:46:04] hello folks, I've acked all the cp6* unknowns in icinga to clear out the unhandled problems board
[08:06:33] elukey: Thank you for handling that :) XioNoX and I have also acknowledged the other pending drmrs Icinga-Unknown
[08:09:45] np! :)
[08:09:57] (EdgeTrafficDrop) firing: 63% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
[08:36:31] that's the eqiad depool ^
[08:42:07] ema: thanks yes that is it
[08:48:23] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vgutierrez@cumin1001 for host cp1090.eqiad.wmnet with OS buster
[08:54:56] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1090:9331 is unreachable - https://alerts.wikimedia.org
[08:54:57] (EdgeTrafficDrop) resolved: 67% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
[09:04:56] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1090:9331 is unreachable - https://alerts.wikimedia.org
[09:05:26] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1090:9331 is unreachable - https://alerts.wikimedia.org
[09:08:26] (EdgeTrafficDrop) firing: 60% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
[09:10:11] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1090:9331 is unreachable - https://alerts.wikimedia.org
[09:16:26] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp1090:9331 is unreachable - https://alerts.wikimedia.org
[09:20:17] ^^ cp1090 is being reimaged :)
[09:31:26] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp1090:9331 is unreachable - https://alerts.wikimedia.org
[09:33:01] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vgutierrez@cumin1001 for host cp1090.eqiad.wmnet with OS buster c...
[09:33:32] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez)
[09:34:58] bblack: ack, ok :-)
[09:36:31] might have been dependent on the order of puppet runs in the initial setup
[09:36:41] (EdgeTrafficDrop) resolved: 69% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org
[09:37:59] mid-term I plan to look into moving the ganeti-internal SSH key handling into Puppet instead of relying on the Ganeti-internal mechanism, this also makes workflows like reimages/reinstalls simpler
[09:55:54] 10Traffic, 10SRE, 10Patch-For-Review, 10User-ema: Experiment with single backend CDN nodes - https://phabricator.wikimedia.org/T288106 (10ema) After setting `cache::single_backend_fqdn: cp4021.ulsfo.wmnet` in hiera, cp4021 is now gone from the list of cache backends on all upload@ulsfo nodes, see for insta...
[10:17:48] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host lvs6001.drmrs.wmnet with OS buster
[10:22:56] (EdgeTrafficDrop) firing: 57% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[10:35:13] ^^^ this is due to eqiad being re-pooled. All good.
[10:46:09] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Use next-hop-self for iBGP sessions - https://phabricator.wikimedia.org/T295672 (10cmooney) Change completed successfully in eqiad. ##### Before ` cmooney@re0.cr1-eqiad> show route receive-protocol bgp 208.80.154.197 inet.0:...
[10:56:33] 10netops, 10Infrastructure-Foundations, 10SRE: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 (10cmooney) 05Open→03Resolved The next-hop self policy has been applied on cr1-eqiad and cr2-eqiad, in the Confed_eqiad group, to address this issue. cr2-eqiad is...
[10:57:17] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host lvs6001.drmrs.wmnet with OS buster completed: - lvs6001 (**WARN**...
[11:02:56] (EdgeTrafficDrop) resolved: 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[11:04:39] ^^ again this due to eqiad re-pool.
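A quick, hypothetical way to verify the effect described in the T288106 note at 09:55:54 from one of the other upload@ulsfo cache hosts; the assumption (not stated in the ticket excerpt) is that the peer backends show up in varnishadm's backend list on those nodes:

    # on any upload@ulsfo cache node other than cp4021:
    sudo varnishadm backend.list | grep cp4021 || echo 'cp4021 no longer listed as a backend'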
[11:05:51] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host lvs6002.drmrs.wmnet with OS buster [11:27:24] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host lvs6003.drmrs.wmnet with OS buster [11:45:14] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host lvs6002.drmrs.wmnet with OS buster completed: - lvs6002 (**WARN**... [12:06:45] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host lvs6003.drmrs.wmnet with OS buster completed: - lvs6003 (**WARN**... [14:29:50] bblack: we should rename the ganeti groups for the two Marseille clusters, they're currently both using default, which is confusing. I'd suggest "sudo gnt-group rename default B12" on ganeti6001 and "sudo gnt-group rename default B13" on ganeti6002? [15:08:44] 10netops, 10Infrastructure-Foundations, 10SRE, 10SRE-swift-storage, 10ops-codfw: Can't commit on asw-b-codfw - https://phabricator.wikimedia.org/T295118 (10Ottomata) furud does not run any active services; it can be restarted anytime. [15:38:24] moritzm: ok, sounds good to me [15:38:41] I'll do that in a bit [15:39:20] ok, thanks! [15:39:51] I think, judging from the outside anyways, that some of the cookbook stuff for VM creation and so-on might need updates for this case too [15:40:25] the argument that looks like codfw_A (when the cluster is codfw and the group is Row_A), seems like it probably has to have some assumptions about these namings/layouts, or site-per-cluster, or both [15:40:30] I haven't poked at it yet [15:40:47] fyi, I depooled lvs2007 for the switch replacement [15:41:15] yeah, I'll have a patch ready for this soon (this is where I noticed the "default" names for the groups) [15:41:17] and 2010 is taking over nicely [15:43:19] in the POPs the ganeti groups dont have the "row" part, just the name of the DC, right? [15:43:28] afair [15:43:59] it's https://doc.wikimedia.org/spicerack/master/api/spicerack.ganeti.html#spicerack.ganeti.CLUSTERS_AND_ROWS [15:44:53] although we could get those dynamically for netbox IMHO [15:45:29] https://gerrit.wikimedia.org/r/739855 [15:52:20] ah, nice link, volans, I should use doc.wm more [15:56:19] curious, why does drmrs get 2 new clusters and not just 2 rows in one cluster? [15:57:32] mutante: instead of having 1 big virtual switch across the two racks, we now have independent switches [15:57:47] that way each rack is its own failure domain [15:58:01] aha! gotcha, thanks [15:58:12] so they each have a public/private vlan, etc.. 
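For reference, a minimal shell sketch of the renames proposed at 14:29:50, one per per-rack drmrs cluster, run on the respective master nodes; the listing step is just a sanity check and the rack/group names are the ones suggested above:

    # on ganeti6001 (rack B12 cluster)
    sudo gnt-group rename default B12
    # on ganeti6002 (rack B13 cluster)
    sudo gnt-group rename default B13
    # confirm the new group name on each cluster
    sudo gnt-group list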
[15:58:45] cool, that sounds good, more separation
[15:59:23] right, it's a step towards more resiliency
[15:59:51] each rack in drmrs is more like a row in the legacy core site setup
[16:00:05] [but with L3 at the ToR too]
[16:00:27] but this whole thing about multiple ganeti rows vs multiple ganeti clusters also applies to the core sites, to some lesser degree.
[16:01:01] what we're doing there now works, but it does have some design resiliency holes that should maybe eventually be addressed
[16:01:09] [sorry, "there" being the core sites]
[16:01:42] the core sites' layout is one ganeti Cluster for the whole site, and one ganeti Group per row
[16:02:14] one Ganeti cluster has one master node, and you need the master node to be working to do any significant ganeti operations.
[16:02:23] I guess one question is how much fate shares the groups within one cluster
[16:02:42] ganeti has master failover by floating the IP of ganeti01.svc.eqiad.wmnet from the current master node to
[16:02:46] that answers the questions :)
[16:03:10] but the address used for that floating master name/IP, is from the per-row vlan of the primary master node
[16:03:42] so the only possible backup nodes are also in that row. If we lose that row's network or power at a whole-row level, we cease to be able to manage ganeti instances in the other rows (although they'll keep running as they are)
[16:04:45] so if there's some singular ganeti instance which is important and which also lived only in that row, it would be hard to reprovision it while the row's dead. I guess you'd have to reconfigure the ganeti cluster and move the master node floating IP in DNS, etc, etc.
[16:05:18] if you look at the rough probabilities of different failure scenarios, I don't think this is really a high priority problem
[16:05:24] but it is something to think about in the future of the design
[16:06:07] but I think that floating IP is not used much, mostly for monitoring and identifying which host is the master
[16:06:18] if you ssh to any other node it will tell you which host is the master
[16:06:22] right
[16:06:23] and you can ssh to that host directly
[16:06:39] but if no host can become the master, because no remaining live host can take on the master IP
[16:06:58] then what? or will it still elect an other-row master even though that IP isn't usable there?
[16:07:50] I'd hope but I don't know
[16:07:51] I mean, it's possible that works, but I wouldn't really expect it to be without some impact. Might be interesting to test sometime! :)
[16:08:28] I don't think "gnt-cluster verify" would be happy in any case, for whatever that's worth at that moment. So you're kind of in supported territory in terms of expectations.
[16:08:42] s/supported/unsupported/
[16:09:36] maybe something we could test with the new ganeti test cluster
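The open question above (what happens when no remaining node can take over the floating master IP) could be exercised on the test cluster just mentioned. A minimal sketch using standard Ganeti commands, with nothing WMF-specific assumed beyond having sudo on the cluster nodes:

    # which node currently holds the master role
    sudo gnt-cluster getmaster
    # overall cluster consistency checks
    sudo gnt-cluster verify
    # on a master-candidate node, attempt to take over the master role
    # (this is where the per-row floating-IP constraint would bite)
    sudo gnt-cluster master-failover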
[16:11:49] the downside of splitting clusters across sub-site failure domains (like drmrs is now, or a hypothetical core site with cluster-per-row or something)
[16:12:04] is that you can't easily rebalance/migrate with a simple operation between independent clusters.
[16:12:20] so you have to treat them as two separate things (kinda like AWS availability zones or whatever)
[16:12:56] yeah, there will be some kind of shared fate if live migration is possible
[16:13:22] so it becomes another layer/dimension of redundancy in every way. A service in such a setup which needs very high resiliency might want to have twice as many nodes as it had before (2 in each row-cluster, rather than 1 per row in one big cluster), and so that duplication can double up the VM resources needed.
[16:13:47] it's not just an easy-win, there's tradeoffs
[16:13:55] yeah exactly
[16:14:58] and probably for many existing ganeti-VM cases, we only have one VM in the site, and assume we can migrate it to another row-group if necessary because of a row-level issue.
[16:15:13] splitting the clusters per-row means that singular VM can't leave the row now
[16:15:33] (easily, through standard ganeti mechanisms, I mean. I guess you could reprovision it, at a new IP address)
[16:15:37] we can't just migrate it to a different row as its IP won't fit
[16:16:02] but I think the idea is to be able to easily spun a new one
[16:16:13] the situation's simpler when the VMs are part of a redundant cluster of N VMs and you can make arbitrary choices about layout/redundancy, but for single-host services that can't be made redundant, it's tricky
[16:17:06] it all goes down to the application :)
[16:17:34] and while row-level failure in a core DC is possible, and I think we've had it happen before in various scenarios, it's not very common or likely compared to many of the other dimensions of availability we're trying to protect, either.
[16:17:39] thanks a lot for this incredibly detailed answer to my question, heh :)
[16:17:51] sorry, it's my disease :)
[16:17:57] it's great :)
[16:18:23] +1 just stumbled in, very interesting reading :)
[16:27:13] bblack: I think vgutierrez and ema are out for the day. https://gerrit.wikimedia.org/r/c/operations/puppet/+/738422 may be contributing to a 15000% increase in log traffic from haproxy: https://logstash.wikimedia.org/goto/ca0c3e167e0b6c09ac6cc80f31b15104
[16:29:11] well haproxy is "new", and we're expanding its testing footprint
[16:29:16] is it enough that it's harming something?
[16:29:38] chwite: ^
[16:30:38] Yeah, it manifests as a 300% increase in logstash load. We can't sustain that for long.
[16:30:46] (I mean, we kind of expect a big increase. But it should also be offset by the decrease in logs from the ats-tls daemons it replaced?)
[16:31:18] or maybe those didn't end up through logstash at all
[16:32:53] ouch
[16:32:56] ^^ I'm pretty sure that
[16:33:23] cwhite: https://gerrit.wikimedia.org/r/c/operations/puppet/+/738422/15/modules/profile/templates/cache/haproxy.rsyslog.conf.erb that doesn't prevent haproxy from hitting logstash?
[16:33:55] gerrit seems super-slow today btw, had a couple of connection timeouts even, to https for it
[16:34:01] not sure what's going on there
[16:34:07] vgutierrez: looking
[16:35:59] bblack: via esams is as slow as usual for my slow connection
[16:37:39] vgutierrez: that filter appears to be installed after the kafka output
[16:39:37] brrr sorry
[16:39:50] irccloud seems to be suffering their own issues
[16:40:18] cwhite: I was saying that maybe the haproxy rsyslog snippet needs a higher priority than the 30-remote-syslog one?
[16:41:49] vgutierr1z: yeah, that filter appears installed after the kafka output (30)
[16:41:55] should probably be priority 20 or so
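The fix linked a few lines below (change 739862) hinges on rsyslog's include ordering: snippets under /etc/rsyslog.d/ are applied in lexical filename order, so a filter meant to stop haproxy messages before they reach the kafka/logstash output in the 30-remote-syslog snippet has to carry a lower number. A sketch of the check on a cache node; the haproxy filename shown is illustrative, not necessarily the puppet-managed one:

    # list the snippets in the order rsyslog will apply them
    ls /etc/rsyslog.d/
    # desired ordering (illustrative names):
    #   20-haproxy.conf        <- filter/route haproxy traffic here...
    #   30-remote-syslog.conf  <- ...before the kafka output defined here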
[16:43:10] hah, what a time for a netsplit
[16:43:39] or, an irccloud outage?
[16:44:08] cwhite: https://gerrit.wikimedia.org/r/c/operations/puppet/+/739862
[16:44:09] yep
[16:44:21] no, there's something bigger happening internet-wide
[16:44:29] I think the IRC issues and others are just secondary fallout
[16:44:44] vgutierr1z: +1
[16:44:45] it's actually an issue between libera.chat and irccloud
[16:45:18] merging... sorry about that :)
[16:46:07] 10HTTPS, 10Traffic, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), and 2 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10Reedy)
[16:48:17] thanks for jumping on and finding a solution :)
[16:49:51] cwhite: running puppet on the affected nodes.. messages should be already decreasing
[16:50:14] confirmed, squiggly line goes down
[16:50:31] I manually hit cp3065 that should be the one holding more rps right now
[16:50:50] cumin is taking care of the others
[16:51:13] Awesome, thank you!
[16:51:40] (done)
[16:51:51] you shouldn't get more TTFB metrics on logstash from those nodes
[16:52:00] \o/
[16:53:01] Yep, looks back to normal now. Thanks again, and have a good rest of your day :)
[16:57:10] that seems to save a 0,5% of CPU in cp3065 BTW
[17:23:14] hello folks, as FYI I am configuring varnishkafka-webrequest on cp3050 to use the ca bundle /etc/ssl/localcerts/wmf_trusted_root_CAs.pem
[18:06:56] (EdgeTrafficDrop) firing: 59% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[18:10:10] expected
[19:11:56] (EdgeTrafficDrop) resolved: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
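On the varnishkafka change noted at 17:23:14: varnishkafka forwards kafka.* settings to librdkafka, whose CA trust-store option is ssl.ca.location, so the relevant line should look roughly like the one sketched below. The config file path is an assumption, not necessarily the one used on cp3050:

    # hypothetical check on cp3050 after the change
    grep 'ssl.ca.location' /etc/varnishkafka/*.conf
    # expected to show something like:
    #   kafka.ssl.ca.location = /etc/ssl/localcerts/wmf_trusted_root_CAs.pem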