[00:01:56] (EdgeTrafficDrop) firing: (3) 50% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org
[00:03:28] 10Traffic: Everything was down - https://phabricator.wikimedia.org/T303903 (10AlexisJazz)
[00:03:57] 10Traffic: Wikimedia domains unreachable (Mar 2022) - https://phabricator.wikimedia.org/T303903 (10AlexisJazz)
[00:06:56] (EdgeTrafficDrop) resolved: (4) 60% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org
[00:08:11] 10Traffic: Wikimedia domains unreachable (Mar 2022) - https://phabricator.wikimedia.org/T303903 (10AlexisJazz)
[00:09:36] 10Traffic: Wikimedia domains unreachable (Mar 2022) - https://phabricator.wikimedia.org/T303903 (10AlexisJazz)
[00:12:13] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6011.drmrs.wmnet with OS buster
[00:25:15] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6011:9331 is unreachable - https://alerts.wikimedia.org
[01:28:32] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6011.drmrs.wmnet with OS buster completed: - cp6011 (**WARN**) -...
[06:48:22] 10Traffic, 10SRE: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (10RhinosF1)
[08:48:44] 10Traffic, 10SRE: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (10Peachey88) Did you keep a full copy of one of the tracerts that you could provide to the SRE team via [[ https://phabricator.wikimedia.org/paste/edit/form/36/ | private paste ]]? For more information...
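For context on the EdgeTrafficDrop alerts above: the check fires when a site's edge request rate falls well below its recent baseline. A minimal PromQL sketch of the idea, assuming a hypothetical recording rule name and a week-ago baseline (the production rule lives in the alerts repo and may differ):

    # Fire when more than 50% of baseline traffic has dropped at a site.
    # varnish_requests:rate5m is a hypothetical recording rule name.
    (
      sum by (site) (varnish_requests:rate5m offset 1w)
      - sum by (site) (varnish_requests:rate5m)
    )
      / sum by (site) (varnish_requests:rate5m offset 1w)
    > 0.5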
[08:51:56] (EdgeTrafficDrop) firing: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[09:02:26] sorry for the spam about the VarnishPrometheusExporterDown, those should not happen anymore once we get https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/770456/3/cookbooks/sre/hosts/reimage.py merged
[09:02:32] and that should happen probably today
[09:21:56] (EdgeTrafficDrop) resolved: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[10:36:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[10:46:56] (EdgeTrafficDrop) resolved: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[11:09:17] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6012.drmrs.wmnet with OS buster
[11:20:15] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6012:9331 is unreachable - https://alerts.wikimedia.org
[11:20:56] (EdgeTrafficDrop) firing: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[11:45:56] (EdgeTrafficDrop) resolved: 67% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[12:14:12] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6012.drmrs.wmnet with OS buster completed: - cp6012 (**WARN**) -...
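The VarnishPrometheusExporterDown spam during reimages, and the cookbook fix referenced above, come down to keeping the host's alerts downtimed while it is mid-reinstall. A rough Python sketch of that approach, assuming Spicerack-style APIs (the actual change is the linked Gerrit patch and differs in detail; reimage_host() is a hypothetical stand-in):

    from datetime import timedelta

    # Inside a cookbook run, with `spicerack` and `remote_host` already set up.
    # Keeping the downtime open across the reinstall avoids exporter-down noise.
    reason = spicerack.admin_reason("host reimage")
    with spicerack.icinga_hosts(remote_host.hosts).downtimed(reason, duration=timedelta(hours=2)):
        reimage_host()  # hypothetical helper standing in for the actual reimage steps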
[12:27:52] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6013.drmrs.wmnet with OS buster
[12:40:15] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6013:9331 is unreachable - https://alerts.wikimedia.org
[12:45:15] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6013:9331 is unreachable - https://alerts.wikimedia.org
[12:52:19] ^ expected
[13:22:34] sukhe: I'm about to merge the change that should fix ^^, so next reimage should be fine :)
[13:26:01] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6013.drmrs.wmnet with OS buster completed: - cp6013 (**WARN**) -...
[13:29:36] sukhe: merged and deployed on cumin2002, lmk if it works as expected
[13:29:46] (puppet still running sorry, hit enter too soon)
[13:32:01] {done} now
[13:38:54] volans: <3
[13:57:51] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6014.drmrs.wmnet with OS buster
[14:44:01] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6014.drmrs.wmnet with OS buster completed: - cp6014 (**WARN**) -...
[14:47:02] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6015.drmrs.wmnet with OS buster
[14:51:48] yay, reimage without spam :D
[14:52:10] \o/ \o/ \o/ nice, thank you volans for following up
[14:52:17] volans: woho!
[14:52:41] where is the volans meme about automation?
[14:52:50] sukhe: I see anycast BGP alerts from routers now in drmrs, I imagine arzhel just fixed his side and ours isn't advertising yet or something?
[14:53:12] e.g. 14:50 <+icinga-wm> PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: CRIT: Down: 1
[14:53:23] 14:51 <+icinga-wm> PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast
[14:53:28] bblack: yes, we have this weird error in durum6001 with bird6 that we are trying to figure out
[14:53:31] Mar 16 14:46:41 durum6001 bird6[6674]: KRT: Received route ::/0 with strange next-hop fe80::cafe:6a02:6d2d:3800
[14:53:33] ok!
[14:53:47] all the other BGP sessions are fine, so not a blocker
[14:54:00] yeah I wonder what's up with that
[14:54:05] it's a weird one
[14:54:15] it thinks the switch is sending a default route?
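On that bird6 message: bird's kernel protocol periodically rescans the kernel routing table, so an RA-installed default route gets read back even though nothing is learned over BGP (as the discussion below works out). A minimal bird 1.x stanza showing the moving parts, illustrative only, not the actual anycast config, which lives in Puppet:

    # bird6.conf fragment (illustrative). The periodic kernel scan is what
    # encounters the RA-installed "default via fe80::..." route and logs
    # "KRT: Received route ::/0 with strange next-hop ...".
    protocol kernel {
        scan time 20;   # re-read the kernel table every 20s
        import none;    # don't learn kernel routes into bird
        export all;     # push bird's routes (the anycast prefixes) to the kernel
    }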
[14:59:43] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney)
[15:00:52] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney)
[15:34:18] bblack: durum6001 has no IPv6 link-local address configured on ens13 for some reason.
[15:34:28] unsure why that might be, but it would explain that Bird error
[15:34:42] durum6002 does have a link-local, presumably why the same issue isn't seen there.
[15:34:50] topranks: nice catch!
[15:34:54] sukhe: ^
[15:41:27] stumbled over it really. I am confused though, asw1-b12 says it is not sending any routes (v4 or v6) to durum6001, so unsure why it'd show that message.
[15:41:55] hello, back
[15:42:17] ah ok, let me see, though I wonder why there would be any discrepancy between durum6001 and durum6002 at all
[15:42:32] yeah it's odd.
[15:44:06] The switch is sending RAs, and durum6001 has a v6 default via the link local address on it
[15:44:46] Which is going into its routing table
[15:44:47] default via fe80::cafe:6a02:6d2d:3800 dev ens13 proto ra metric 1024 expires 590sec hoplimit 64 pref medium
[15:44:54] And is pingable:
[15:44:59] https://www.irccloud.com/pastebin/WKc3f3Fl/
[15:46:09] I think what is happening here is that the bird6 service is crashing on start, as it's not parsing the default v6 route properly.
[15:46:33] And I expect one way or the other the reason for that is the lack of a link local address configured on the interface.
[15:46:46] Unfortunately I still can't explain why that is missing though
[15:47:03] yeah it's pretty confusing
[15:47:14] so to be clear it's not that it's learning a default from the switch via BGP.
[15:47:32] It's parsing the local kernel routing table and tripping up on the installed default route in it.
[15:49:11] you can snoop the RAs on the host, too
[15:49:15] maybe they're really not showing up there
[15:49:40] they appear in a tcpdump, and the default route in the kernel table suggests they are parsed correctly by Linux
[15:52:19] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6015.drmrs.wmnet with OS buster completed: - cp6015 (**WARN**) -...
[15:58:39] yeah but RA should set the link-local as well, not just the route, right?
[15:59:06] (at the kernel level when it parses it, I mean)
[15:59:12] yes it should
[15:59:52] But I think, based on some googling I was doing, if the link local is removed from the interface, subsequent RAs received won't re-add it
[16:00:25] err sorry I get that backwards
[16:00:30] Purely based on this which could be wrong: https://medium.com/opsops/how-to-restore-link-local-ipv6-address-in-linux-737666a505f3
[16:00:36] RA isn't setting the link-local one
[16:00:41] that comes at boot
[16:00:53] yeah ofc sry.
[16:00:54] hmmm
[16:00:56] getting confused myself.
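The checks described above can be reproduced from a shell on the host; a short diagnostic sketch, using the interface name from the discussion:

    # Is there a link-local (fe80::/10) address on the interface?
    ip -6 addr show dev ens13 scope link

    # What default route did the RA install?
    ip -6 route show default

    # Snoop router advertisements as they arrive (ICMPv6 type 134):
    tcpdump -vni ens13 'icmp6 and ip6[40] == 134'

    # Or actively solicit one (from the ndisc6 package):
    rdisc6 ens13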
[16:02:02] Possibly it's worth just a reboot to see if it properly configures itself on a cold boot
[16:02:31] :) I did some quick grepping but couldn't see any reason why this would be different
[16:02:40] (this host in particular, even for drmrs)
[16:02:56] Yeah I find no diffs in config, sysctl settings etc between durum6001 and durum6002
[16:03:16] maybe it was something in the setup -- I don't recall though
[16:03:19] It *should* configure itself with a v6 link local from what I can see.
[16:03:47] ok let's try a restart to see if it alleviates it, unless there is something else we want to try
[16:04:44] well I guess it's: do we really want to understand what has happened? Or if it works as expected on a reboot are we happy to ignore this?
[16:05:08] my vote is reboot
[16:05:16] if it works, assume it was a fluke, I dunno
[16:05:32] (until/unless we see this again somewhere sometime)
[16:05:36] I'd also lean that way
[16:05:41] the bird6 config for drmrs was kinda our testbed for the IPv6 changes we have been making, not sure if that is any consolation
[16:05:54] specifically in how it differs from the rest of our anycast setup
[16:06:06] that still doesn't explain why this host though
[16:06:18] ok rebooting then
[16:06:21] if reboot works, most likely explanation is one of us screwed it up with some CLI command messing with ipv6 at some point and didn't realize it :)
[16:07:28] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6016.drmrs.wmnet with OS buster
[16:07:40] link local is there after the reboot
[16:08:07] yep
[16:08:08] it's back
[16:08:11] I hate it when that works
[16:08:13] lol
[16:08:20] makes me feel like a windows admin. reboot to fix anything :P
[16:08:42] FYI sre.hosts.reboot-single is an option :)
[16:08:50] haha
[16:09:05] volans: now all we need is for the cookbook to become sentient and explain to us why this worked
[16:09:18] when are we getting that feature? :)
[16:09:26] 12:08:26 <+icinga-wm> RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:09:46] sukhe: it's called the IT Crowd solution
[16:09:53] haha
[16:09:55] it just works, you don't need to know why
[16:09:58] I think they were on Windows though :P
[16:10:39] 10Traffic, 10SRE, 10User-Ladsgroup: Rework education.wikimedia.org redirects - https://phabricator.wikimedia.org/T303397 (10Ladsgroup) 05Open→03Resolved
[16:10:53] So I think to make ourselves feel less weird, it probably is because of all the IPv6 tuning we have been trying to do
[16:11:03] maybe we ran some command somewhere to mess things up on durum6001
[16:11:06] I don't recall it but well
[16:11:08] yeah I think that explanation probably makes sense.
[16:11:33] or triggered some odd race condition whereby it didn't properly re-add the link local.
[16:11:48] unless it happens again let's forget any of this ever happened
[16:13:03] :D
[17:12:06] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6016.drmrs.wmnet with OS buster completed: - cp6016 (**WARN**) -...
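For the record, a full reboot isn't the only way out: per the article topranks linked above, a lost link-local can usually be restored by bouncing the interface or re-adding the address by hand. A sketch (the address value is illustrative; note that bouncing an interface on a production host briefly drops its traffic):

    # Option 1: bounce the interface so the kernel regenerates fe80::/64
    ip link set ens13 down && ip link set ens13 up

    # Option 2: re-add a link-local address explicitly (EUI-64 form, illustrative value)
    ip -6 addr add fe80::be97:e1ff:fe2a:ab01/64 dev ens13 scope link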
[17:42:43] re: drmrs network stuff, just scanning icinga for outstanding things
[17:43:11] we still have a BFD Status alert with "CRIT: Down: 2" on both drmrs switches
[17:43:24] and cr2-drmrs has "BGP WARNING - AS65001/IPv6: Active (for 9d5h)"
[17:43:30] all known? can we ack them or something?
[17:44:03] on non-network stuff:
[17:44:14] there's two UNKNOWNs right now for alert1001 vk delivery alerts:
[17:44:16] cache_text: Varnishkafka eventlogging Delivery Errors per second -drmrs-
[17:44:24] cache_text: Varnishkafka statsv Delivery Errors per second -drmrs-
[17:44:40] it's quite possible these are just for lack of any appreciable client load of various kinds
[17:44:49] not sure though
[17:48:45] afaict that eventlogging vk should be doing about 170ish reqs/second
[17:48:48] https://grafana.wikimedia.org/goto/SIXUY5E7k?orgId=1
[17:49:19] statsv around 50
[17:49:41] yeah currently they're just reporting NaN, so not sure what's up there
[17:49:55] the site isn't "live" yet, so I wasn't sure if some level of real traffic is necessary before something or other kicks in
[17:50:27] could be some config error, too
[17:59:32] I've run through all the host-level icinga status on all the hosts in drmrs, they all look good
[17:59:56] there's a few recently-reimaged cps still in auto-downtime, but they'll clear on their own soon enough and are all green anyways
[18:40:48] ;; NSID: 646F6836303032 "doh6002"
[18:40:56] ^ Wikidough, drmrs
[18:41:49] hmmm, since I re-enabled puppet on cp6011, we've got:
[18:41:50] 18:22 <+icinga-wm> PROBLEM - Confd vcl based reload on cp6011 is CRITICAL: reload-vcl failed to run since 0h, 6 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:42:03] oh sigh
[18:42:07] not this again
[18:42:29] just on cp6011 correct?
[18:42:47] or did you mean Puppet was only enabled there?
[18:43:03] sorry I was talking about this in another channel earlier
[18:43:25] cp6011 was puppet-disabled since ~10h ago, due to some unrelated work. I think it was meant to be re-enabled but got missed.
[18:43:33] oh
[18:43:42] the netmapper stuff
[18:43:47] anyways, looking into it
[18:47:45] was just an artifact of extended puppet disablement, etc
[18:47:54] re-ran confd's reload action manually and it cleared up
[18:49:47] ok that's good, let's hope it stays this way, because I think this was the first event before the dominoes started falling last time, so that got me worried :P
[18:52:45] that time around, it was probably just a side-effect of the filled disk or the oom condition, one of the two
[19:04:54] circling back to the network icinga issues:
[19:05:02] the BFD ones are doh/durum IPv6 sessions
[19:05:10] bblack@asw1-b12-drmrs> show bfd session
[19:05:16] [...]
[19:05:17] 2a02:ec80:600:1:185:15:58:11 Down 0.000 2.000 3
[19:05:20] 2a02:ec80:600:101:10:136:0:21 Down 0.000 2.000 3
[19:05:54] (not the ipv6 advert, but the ipv6 bfd/bgp session)
[19:06:41] o_O
[19:08:27] and the BGP one on cr2-drmrs is complaining about the session in AS 65001 which is confed-eqiad
[19:09:01] specifically to 2a02:ec80:600:fe04::21
[19:09:04] whatever that is!
[19:10:03] looks like possibly a config typo
[19:11:11] can you please check bfd session again to see if it resolved? thanks :)
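On the Confd vcl reload note above: confd watches the etcd config store and runs a reload command whenever the rendered VCL changes, so once the failed run's cause is gone, the stale CRITICAL can be cleared by re-triggering template processing. A hedged sketch of the general mechanism (flags and endpoint are illustrative; the WMF unit's exact invocation differs):

    # Render all templates once, run their reload commands, and exit.
    confd -onetime -backend etcd -node https://conf1001.eqiad.wmnet:2379

    # Or restart the daemon so it re-renders and re-runs the reload.
    systemctl restart confd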
[19:14:38] not sure what permissions I need but I can't seem to ssh
[19:14:41] still down on both switches
[19:14:43] weird
[19:15:07] might need some manual clearing or something, but I'm not gonna mess with that
[19:15:34] the cr2-drmrs confed_eqiad thing, the config has:
[19:15:47] (for ipv6 for that confed group):
[19:15:49] neighbor 2a02:ec80:600:fe04::2 {
[19:15:49] description cr1-eqiad;
[19:15:49] }
[19:15:49] neighbor 2a02:ec80:600:fe04::21 {
[19:15:51] family inet6 {
[19:15:54] unicast;
[19:15:56] }
[19:15:59] }
[19:16:01] and it's the ::21 causing the icinga alert
[19:16:14] I'm guessing ::21 wasn't even supposed to exist, and the family inet6 part was meant to be underneath ::2
[19:16:32] but who knows, certainly not me :)
[19:17:21] anyways, we can pick this up with netops in the AM. The three alerts are easy to find in icinga, just search for string "drmrs" and scan down for non-green things.
[19:17:53] I also still don't have a clue about the varnishkafka eventlogging and statsv alerts
[19:18:05] well, not alerts, but the check is reporting NaN -> UNKNOWN
[19:18:20] Arzhel did ping me about this but we got busy with other stuff and I kinda assumed this would be resolved by the restart of durum and the bird error that was fixed
[19:18:24] so yeah let's ask them tomorrow
[19:19:04] 2a02:ec80:600:1:185:15:58:11 is doh6001
[19:19:12] maybe that's why I was only hitting doh6002 hmmm
[19:19:55] and 2a02:ec80:600:101:10:136:0:21 is durum6001, fwiw
[19:20:24] it's different in each switch
[19:20:45] b12 switch is reporting about doh/durum01, and b13 about doh/durum02, as per the rack layout stuff
[19:20:51] yep
[19:23:05] the vk NaNs have only been there for about a day
[19:23:23] so I'm guessing it's the reimaging of text that triggered it. Perhaps there's some manual post-reimage setup to do there that we don't remember.
[19:23:34] (e.g. some kind of cert/keyholder stuff or whatever)
[19:24:03] 10netops, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (10Ottomata)
[19:25:39] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka#Delivery_errors
[19:25:47] > This error means that Varnishkafka failed to send messages to Kafka Jumbo, and hence data has been lost.
[19:26:06] 10netops, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (10Ottomata)
[19:26:43] https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-cp_cluster=cache_text&var-datasource=drmrs%20prometheus%2Fops&var-instance=All&var-source=eventlogging&viewPanel=20 is a NaN
[19:26:49] oh nevermind about the 24h thing, that's just when it was reimaged
[19:26:55] it has no data going back forever
[19:27:13] maybe it's an analytics vlan firewall rule thing for the drmrs networks or something
[19:27:15] yeah I think that's it from the dashboard
[19:28:02] drmrs is missing from https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/definitions/static.net#38
[19:30:45] ah the ipv6 private
[19:30:49] great catch!
[19:34:43] indeed! so 2a02:ec80:600:100::/56?
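If bblack's guess above is right, the intended stanza would hang family inet6 off the cr1-eqiad neighbor and drop the stray ::21 entirely; a sketch of the presumed-correct config (this is only the guess from the log, not a verified fix):

    /* presumed intent: family inet6 under the cr1-eqiad neighbor */
    neighbor 2a02:ec80:600:fe04::2 {
        description cr1-eqiad;
        family inet6 {
            unicast;
        }
    }
    /* ...with the stray ::21 neighbor removed entirely */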
[19:34:46] I can prep the patch
[19:34:47] yes
[19:34:56] ok on it
[19:35:01] send it by the netopsen for review of course :)
[19:35:09] yep
[19:35:26] but hopefully, that indirectly impacts some network firewall rule and lets vk deliver stuff to analytics
[19:35:54] or something
[19:38:00] taavi: would you prefer to be credited as taavi?
[19:40:28] I put you as taavi because I have to step out for a bit and wanted to get this done; let me know and I can revise
[19:47:40] yeah that's fine
[19:57:38] 10Traffic, 10SRE: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (10AlexisJazz) >>! In T303903#7780748, @Peachey88 wrote: > Did you keep a full copy of one of the tracerts that you could provide to the SRE team via [[ https://phabricator.wikimedia.org/paste/edit/form/36...
[20:14:38] 10netops, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (10Ottomata)
[20:16:01] 10Traffic, 10SRE: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (10Aklapper) > You do not have permission to view this object. Sorry, should work now.
[22:24:18] 10netops, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (10cmooney) Worth noting that we are planning in the short term to adjus...
[23:15:58] 10Traffic, 10SRE: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (10AlexisJazz) >>! In T303903#7783673, @Aklapper wrote: >> You do not have permission to view this object. > Sorry, should work now. Thanks, https://phabricator.wikimedia.org/P22736
[23:49:56] (EdgeTrafficDrop) firing: 60% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org
[23:54:56] (EdgeTrafficDrop) resolved: 65% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org
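Closing the loop on the static.net finding from earlier: the fix amounts to adding the drmrs private IPv6 range alongside the other per-site private networks in homer's capirca-style definitions file. A sketch of the shape of the change (the token name and surrounding entry are illustrative; the real patch went to netops for review as discussed above):

    # operations/homer/public: definitions/static.net (illustrative excerpt)
    PRIVATE = 10.0.0.0/8                # RFC1918, all sites (illustrative)
              2a02:ec80:600:100::/56    # drmrs private IPv6 -- the missing entry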