[00:01:56] (EdgeTrafficDrop) firing: (3) 50% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org
[00:03:28] 10Traffic: Everything was down - https://phabricator.wikimedia.org/T303903 (10AlexisJazz)
[00:03:57] 10Traffic: Wikimedia domains unreachable (Mar 2022) - https://phabricator.wikimedia.org/T303903 (10AlexisJazz)
[00:06:56] (EdgeTrafficDrop) resolved: (4) 60% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org
[00:08:11] 10Traffic: Wikimedia domains unreachable (Mar 2022) - https://phabricator.wikimedia.org/T303903 (10AlexisJazz)
[00:09:36] 10Traffic: Wikimedia domains unreachable (Mar 2022) - https://phabricator.wikimedia.org/T303903 (10AlexisJazz)
[00:12:13] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6011.drmrs.wmnet with OS buster
[00:25:15] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6011:9331 is unreachable - https://alerts.wikimedia.org
[01:28:32] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6011.drmrs.wmnet with OS buster completed: - cp6011 (**WARN**) -...
[06:48:22] 10Traffic, 10SRE: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (10RhinosF1)
[08:48:44] 10Traffic, 10SRE: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (10Peachey88) Did you keep a full copy of one of the tracerts that you could provide to the SRE team via [[ https://phabricator.wikimedia.org/paste/edit/form/36/ | private paste ]]? For more information...
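For context on the EdgeTrafficDrop alerts above: the check fires when a site's edge request rate falls well below its recent baseline. A minimal PromQL sketch of the idea, assuming a hypothetical recording rule name and a week-ago baseline (the production rule lives in the alerts repo and may differ):

    # Fire when more than 50% of baseline traffic has dropped at a site.
    # varnish_requests:rate5m is a hypothetical recording rule name.
    (
      sum by (site) (varnish_requests:rate5m offset 1w)
      - sum by (site) (varnish_requests:rate5m)
    )
      / sum by (site) (varnish_requests:rate5m offset 1w)
    > 0.5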
[08:51:56] (EdgeTrafficDrop) firing: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[09:02:26] sorry for the spam about the VarnishPrometheusExporterDown, those should not happen anymore once we get https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/770456/3/cookbooks/sre/hosts/reimage.py merged
[09:02:32] and that should happen probably today
[09:21:56] (EdgeTrafficDrop) resolved: 69% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[10:36:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[10:46:56] (EdgeTrafficDrop) resolved: 68% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[11:09:17] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6012.drmrs.wmnet with OS buster
[11:20:15] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6012:9331 is unreachable - https://alerts.wikimedia.org
[11:20:56] (EdgeTrafficDrop) firing: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[11:45:56] (EdgeTrafficDrop) resolved: 67% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[12:14:12] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6012.drmrs.wmnet with OS buster completed: - cp6012 (**WARN**) -...
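The VarnishPrometheusExporterDown spam during reimages, and the cookbook fix referenced above, come down to keeping the host's alerts downtimed while it is mid-reinstall. A rough Python sketch of that approach, assuming Spicerack-style APIs (the actual change is the linked Gerrit patch and differs in detail; reimage_host() is a hypothetical stand-in):

    from datetime import timedelta

    # Inside a cookbook run, with `spicerack` and `remote_host` already set up.
    # Keeping the downtime open across the reinstall avoids exporter-down noise.
    reason = spicerack.admin_reason("host reimage")
    with spicerack.icinga_hosts(remote_host.hosts).downtimed(reason, duration=timedelta(hours=2)):
        reimage_host()  # hypothetical helper standing in for the actual reimage steps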
[12:27:52] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6013.drmrs.wmnet with OS buster
[12:40:15] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp6013:9331 is unreachable - https://alerts.wikimedia.org
[12:45:15] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp6013:9331 is unreachable - https://alerts.wikimedia.org
[12:52:19] ^ expected
[13:22:34] sukhe: I'm about to merge the change that should fix ^^, so next reimage should be fine :)
[13:26:01] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6013.drmrs.wmnet with OS buster completed: - cp6013 (**WARN**) -...
[13:29:36] sukhe: merged and deployed on cumin2002, lmk if it works as expected
[13:29:46] (puppet still running sorry, hit enter too soon)
[13:32:01] {done} now
[13:38:54] volans: <3
[13:57:51] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6014.drmrs.wmnet with OS buster
[14:44:01] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6014.drmrs.wmnet with OS buster completed: - cp6014 (**WARN**) -...
[14:47:02] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6015.drmrs.wmnet with OS buster
[14:51:48] yay, reimage without spam :D
[14:52:10] \o/ \o/ \o/ nice, thank you volans for following up
[14:52:17] volans: woho!
[14:52:41] where is the volans meme about automation?
[14:52:50] sukhe: I see anycast BGP alerts from routers now in drmrs, I imagine arzhel just fixed his side and ours isn't advertising yet or something?
[14:53:12] e.g. 14:50 <+icinga-wm> PROBLEM - BFD status on asw1-b12-drmrs.mgmt is CRITICAL: CRIT: Down: 1
[14:53:23] 14:51 <+icinga-wm> PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast
[14:53:28] bblack: yes, we have this weird error in durum6001 with bird6 that we are trying to figure out
[14:53:31] Mar 16 14:46:41 durum6001 bird6[6674]: KRT: Received route ::/0 with strange next-hop fe80::cafe:6a02:6d2d:3800
[14:53:33] ok!
[14:53:47] all the other BGP sessions are fine, so not a blocker
[14:54:00] yeah I wonder what's up with that
[14:54:05] it's a weird one
[14:54:15] it thinks the switch is sending a default route?
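On that bird6 message: bird's kernel protocol periodically rescans the kernel routing table, so an RA-installed default route gets read back even though nothing is learned over BGP (as the discussion below works out). A minimal bird 1.x stanza showing the moving parts, illustrative only, not the actual anycast config, which lives in Puppet:

    # bird6.conf fragment (illustrative). The periodic kernel scan is what
    # encounters the RA-installed "default via fe80::..." route and logs
    # "KRT: Received route ::/0 with strange next-hop ...".
    protocol kernel {
        scan time 20;   # re-read the kernel table every 20s
        import none;    # don't learn kernel routes into bird
        export all;     # push bird's routes (the anycast prefixes) to the kernel
    }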
[14:59:43] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney)
[15:00:52] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney)
[15:34:18] bblack: durum6001 has no IPv6 link-local address configured on ens13 for some reason.
[15:34:28] unsure why that might be, but it would explain that Bird error
[15:34:42] durum6002 does have a link-local, presumably why the same issue isn't seen there.
[15:34:50] topranks: nice catch!
[15:34:54] sukhe: ^
[15:41:27] stumbled over it really. I am confused though, asw1-b12 says it is not sending any routes (v4 or v6) to durum6001, so unsure why it'd show that message.
[15:41:55] hello, back
[15:42:17] ah ok, let me see, though I wonder why there would be any discrepancy between durum6001 and durum6002 at all
[15:42:32] yeah it's odd.
[15:44:06] The switch is sending RAs, and durum6001 has a v6 default via the link local address on it
[15:44:46] Which is going into its routing table
[15:44:47] default via fe80::cafe:6a02:6d2d:3800 dev ens13 proto ra metric 1024 expires 590sec hoplimit 64 pref medium
[15:44:54] And is pingable:
[15:44:59] https://www.irccloud.com/pastebin/WKc3f3Fl/
[15:46:09] I think what is happening here is that the bird6 service is crashing on start, as it's not parsing the default v6 route properly.
[15:46:33] And I expect one way or the other the reason for that is the lack of a link local address configured on the interface.
[15:46:46] Unfortunately I still can't explain why that is missing though
[15:47:03] yeah it's pretty confusing
[15:47:14] so to be clear it's not that it's learning a default from the switch via BGP.
[15:47:32] It's parsing the local kernel routing table and tripping up on the installed default route in it.
[15:49:11] you can snoop the RAs on the host, too
[15:49:15] maybe they're really not showing up there
[15:49:40] they appear in a tcpdump, and the default route in the kernel table suggests they are parsed correctly by Linux
[15:52:19] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6015.drmrs.wmnet with OS buster completed: - cp6015 (**WARN**) -...
[15:58:39] yeah but RA should set the link-local as well, not just the route, right?
[15:59:06] (at the kernel level when it parses it, I mean)
[15:59:12] yes it should
[15:59:52] But I think, based on some googling I was doing, if the link local is removed from the interface, subsequent RAs received won't re-add it
[16:00:25] err sorry I get that backwards
[16:00:30] Purely based on this which could be wrong: https://medium.com/opsops/how-to-restore-link-local-ipv6-address-in-linux-737666a505f3
[16:00:36] RA isn't setting the link-local one
[16:00:41] that comes at boot
[16:00:53] yeah ofc sry.
[16:00:54] hmmm
[16:00:56] getting confused myself.
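The checks described above can be reproduced from a shell on the host; a short diagnostic sketch, using the interface name from the discussion:

    # Is there a link-local (fe80::/10) address on the interface?
    ip -6 addr show dev ens13 scope link

    # What default route did the RA install?
    ip -6 route show default

    # Snoop router advertisements as they arrive (ICMPv6 type 134):
    tcpdump -vni ens13 'icmp6 and ip6[40] == 134'

    # Or actively solicit one (from the ndisc6 package):
    rdisc6 ens13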
[16:02:02] Possibly it's worth just a reboot to see if it properly configures itself on a cold boot
[16:02:31] :) I did some quick grepping but couldn't see any reason why this would be different
[16:02:40] (this host in particular, even for drmrs)
[16:02:56] Yeah I find no diffs in config, sysctl settings etc between durum6001 and durum6002
[16:03:16] maybe it was something in the setup -- I don't recall though
[16:03:19] It *should* configure itself with a v6 link local from what I can see.
[16:03:47] ok let's try a restart to see if it alleviates it, unless there is something else we want to try
[16:04:44] well I guess it's: do we really want to understand what has happened? Or if it works as expected on a reboot are we happy to ignore this?
[16:05:08] my vote is reboot
[16:05:16] if it works, assume it was a fluke, I dunno
[16:05:32] (until/unless we see this again somewhere sometime)
[16:05:36] I'd also lean that way
[16:05:41] the bird6 config for drmrs was kinda our testbed for the IPv6 changes we have been making, not sure if that is any consolation
[16:05:54] specifically in how it differs from the rest of our anycast setup
[16:06:06] that still doesn't explain why this host though
[16:06:18] ok rebooting then
[16:06:21] if reboot works, most likely explanation is one of us screwed it up with some CLI command messing with ipv6 at some point and didn't realize it :)
[16:07:28] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp6016.drmrs.wmnet with OS buster
[16:07:40] link local is there after the reboot
[16:08:07] yep
[16:08:08] it's back
[16:08:11] I hate it when that works
[16:08:13] lol
[16:08:20] makes me feel like a windows admin. reboot to fix anything :P
[16:08:42] FYI sre.hosts.reboot-single is an option :)
[16:08:50] haha
[16:09:05] volans: now all we need is for the cookbook to become sentient and explain to us why this worked
[16:09:18] when are we getting that feature? :)
[16:09:26] 12:08:26 <+icinga-wm> RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:09:46] sukhe: it's called the IT Crowd solution
[16:09:53] haha
[16:09:55] it just works, you don't need to know why
[16:09:58] I think they were on Windows though :P
[16:10:39] 10Traffic, 10SRE, 10User-Ladsgroup: Rework education.wikimedia.org redirects - https://phabricator.wikimedia.org/T303397 (10Ladsgroup) 05Open→03Resolved
[16:10:53] So I think to make ourselves feel less weird, it probably is because of all the IPv6 tuning we have been trying to do
[16:11:03] maybe we ran some command somewhere to mess things up on durum6001
[16:11:06] I don't recall it but well
[16:11:08] yeah I think that explanation probably makes sense.
[16:11:33] or triggered some odd race condition whereby it didn't properly re-add the link local.
[16:11:48] unless it happens again let's forget any of this ever happened
[16:13:03] :D
[17:12:06] 10Traffic, 10SRE, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp6016.drmrs.wmnet with OS buster completed: - cp6016 (**WARN**) -...
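For the record, a full reboot isn't the only way out: per the article topranks linked above, a lost link-local can usually be restored by bouncing the interface or re-adding the address by hand. A sketch (the address value is illustrative; note that bouncing an interface on a production host briefly drops its traffic):

    # Option 1: bounce the interface so the kernel regenerates fe80::/64
    ip link set ens13 down && ip link set ens13 up

    # Option 2: re-add a link-local address explicitly (EUI-64 form, illustrative value)
    ip -6 addr add fe80::be97:e1ff:fe2a:ab01/64 dev ens13 scope link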
[17:42:43] re: drmrs network stuff, just scanning icinga for outstanding things
[17:43:11] we still have a BFD Status alert with "CRIT: Down: 2" on both drmrs switches
[17:43:24] and cr2-drmrs has "BGP WARNING - AS65001/IPv6: Active (for 9d5h)"
[17:43:30] all known? can we ack them or something?
[17:44:03] on non-network stuff:
[17:44:14] there's two UNKNOWNs right now for alert1001 vk delivery alerts:
[17:44:16] cache_text: Varnishkafka eventlogging Delivery Errors per second -drmrs-
[17:44:24] cache_text: Varnishkafka statsv Delivery Errors per second -drmrs-
[17:44:40] it's quite possible these are just for lack of any appreciable client load of various kinds
[17:44:49] not sure though
[17:48:45] afaict that eventlogging vk should be doing about 170ish reqs/second
[17:48:48] https://grafana.wikimedia.org/goto/SIXUY5E7k?orgId=1
[17:49:19] statsv around 50
[17:49:41] yeah currently they're just reporting NaN, so not sure what's up there
[17:49:55] the site isn't "live" yet, so I wasn't sure if some level of real traffic is necessary before something or other kicks in
[17:50:27] could be some config error, too
[17:59:32] I've run through all the host-level icinga status on all the hosts in drmrs, they all look good
[17:59:56] there's a few recently-reimaged cps still in auto-downtime, but they'll clear on their own soon enough and are all green anyways
[18:40:48] ;; NSID: 646F6836303032 "doh6002"
[18:40:56] ^ Wikidough, drmrs
[18:41:49] hmmm, since I re-enabled puppet on cp6011, we've got:
[18:41:50] 18:22 <+icinga-wm> PROBLEM - Confd vcl based reload on cp6011 is CRITICAL: reload-vcl failed to run since 0h, 6 minutes. https://wikitech.wikimedia.org/wiki/Varnish
[18:42:03] oh sigh
[18:42:07] not this again
[18:42:29] just on cp6011 correct?
[18:42:47] or did you mean Puppet was only enabled there?
[18:43:03] sorry I was talking about this in another channel earlier
[18:43:25] cp6011 was puppet-disabled since ~10h ago, due to some unrelated work. I think it was meant to be re-enabled but got missed.
[18:43:33] oh
[18:43:42] the netmapper stuff
[18:43:47] anyways, looking into it
[18:47:45] was just an artifact of extended puppet disablement, etc
[18:47:54] re-ran confd's reload action manually and it cleared up
[18:49:47] ok that's good, let's hope it stays this way, because I think this was the first event before the dominoes started falling last time, so that got me worried :P
[18:52:45] that time around, it was probably just a side-effect of the filled disk or the oom condition, one of the two
[19:04:54] circling back to the network icinga issues:
[19:05:02] the BFD ones are doh/durum IPv6 sessions
[19:05:10] bblack@asw1-b12-drmrs> show bfd session
[19:05:16] [...]
[19:05:17] 2a02:ec80:600:1:185:15:58:11 Down 0.000 2.000 3
[19:05:20] 2a02:ec80:600:101:10:136:0:21 Down 0.000 2.000 3
[19:05:54] (not the ipv6 advert, but the ipv6 bfd/bgp session)
[19:06:41] o_O
[19:08:27] and the BGP one on cr2-drmrs is complaining about the session in AS 65001 which is confed-eqiad
[19:09:01] specifically to 2a02:ec80:600:fe04::21
[19:09:04] whatever that is!
[19:10:03] looks like possibly a config typo
[19:11:11] can you please check bfd session again to see if it resolved? thanks :)
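On the Confd vcl reload note above: confd watches the etcd config store and runs a reload command whenever the rendered VCL changes, so once the failed run's cause is gone, the stale CRITICAL can be cleared by re-triggering template processing. A hedged sketch of the general mechanism (flags and endpoint are illustrative; the WMF unit's exact invocation differs):

    # Render all templates once, run their reload commands, and exit.
    confd -onetime -backend etcd -node https://conf1001.eqiad.wmnet:2379

    # Or restart the daemon so it re-renders and re-runs the reload.
    systemctl restart confd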
[19:14:38] not sure what permissions I need but I can't seem to ssh
[19:14:41] still down on both switches
[19:14:43] weird
[19:15:07] might need some manual clearing or something, but I'm not gonna mess with that
[19:15:34] the cr2-drmrs confed_eqiad thing, the config has:
[19:15:47] (for ipv6 for that confed group):
[19:15:49] neighbor 2a02:ec80:600:fe04::2 {
[19:15:49] description cr1-eqiad;
[19:15:49] }
[19:15:49] neighbor 2a02:ec80:600:fe04::21 {
[19:15:51] family inet6 {
[19:15:54] unicast;
[19:15:56] }
[19:15:59] }
[19:16:01] and it's the ::21 causing the icinga alert
[19:16:14] I'm guessing ::21 wasn't even supposed to exist, and the family inet6 part was meant to be underneath ::2
[19:16:32] but who knows, certainly not me :)
[19:17:21] anyways, we can pick this up with netops in the AM. The three alerts are easy to find in icinga, just search for string "drmrs" and scan down for non-green things.
[19:17:53] I also still don't have a clue about the varnishkafka eventlogging and statsv alerts
[19:18:05] well, not alerts, but the check is reporting NaN -> UNKNOWN
[19:18:20] Arzhel did ping me about this but we got busy with other stuff and I kinda assumed this would be resolved by the restart of durum and the bird error that was fixed
[19:18:24] so yeah let's ask them tomorrow
[19:19:04] 2a02:ec80:600:1:185:15:58:11 is doh6001
[19:19:12] maybe that's why I was only hitting doh6002 hmmm
[19:19:55] and 2a02:ec80:600:101:10:136:0:21 is durum6001, fwiw
[19:20:24] it's different in each switch
[19:20:45] b12 switch is reporting about doh/durum01, and b13 about doh/durum02, as per the rack layout stuff
[19:20:51] yep
[19:23:05] the vk NaNs have only been there for about a day
[19:23:23] so I'm guessing it's the reimaging of text that triggered it. Perhaps there's some manual post-reimage setup to do there that we don't remember.
[19:23:34] (e.g. some kind of cert/keyholder stuff or whatever)
[19:24:03] 10netops, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (10Ottomata)
[19:25:39] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka#Delivery_errors
[19:25:47] > This error means that Varnishkafka failed to send messages to Kafka Jumbo, and hence data has been lost.
[19:26:06] 10netops, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (10Ottomata)
[19:26:43] https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-cp_cluster=cache_text&var-datasource=drmrs%20prometheus%2Fops&var-instance=All&var-source=eventlogging&viewPanel=20 is a NaN
[19:26:49] oh nevermind about the 24h thing, that's just when it was reimaged
[19:26:55] it has no data going back forever
[19:27:13] maybe it's an analytics vlan firewall rule thing for the drmrs networks or something
[19:27:15] yeah I think that's it from the dashboard
[19:28:02] drmrs is missing from https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/definitions/static.net#38
[19:30:45] ah the ipv6 private
[19:30:49] great catch!
[19:34:43] indeed! so 2a02:ec80:600:100::/56?
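If bblack's guess above is right, the intended stanza would hang family inet6 off the cr1-eqiad neighbor and drop the stray ::21 entirely; a sketch of the presumed-correct config (this is only the guess from the log, not a verified fix):

    /* presumed intent: family inet6 under the cr1-eqiad neighbor */
    neighbor 2a02:ec80:600:fe04::2 {
        description cr1-eqiad;
        family inet6 {
            unicast;
        }
    }
    /* ...with the stray ::21 neighbor removed entirely */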
[19:34:46] I can prep the patch
[19:34:47] yes
[19:34:56] ok on it
[19:35:01] send it by the netopsen for review of course :)
[19:35:09] yep
[19:35:26] but hopefully, that indirectly impacts some network firewall rule and lets vk deliver stuff to analytics
[19:35:54] or something
[19:38:00] taavi: would you prefer to be credited as taavi?
[19:40:28] I put you as taavi because I have to step out for a bit and wanted to get this done; let me know and I can revise
[19:47:40] yeah that's fine
[19:57:38] 10Traffic, 10SRE: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (10AlexisJazz) >>! In T303903#7780748, @Peachey88 wrote: > Did you keep a full copy of one of the tracerts that you could provide to the SRE team via [[ https://phabricator.wikimedia.org/paste/edit/form/36...
[20:14:38] 10netops, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (10Ottomata)
[20:16:01] 10Traffic, 10SRE: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (10Aklapper) > You do not have permission to view this object. Sorry, should work now.
[22:24:18] 10netops, 10Data-Engineering, 10Data-Engineering-Kanban, 10Infrastructure-Foundations, and 2 others: Allow access to prometheus-pushgateway.discovery.wmnet port 80 from within Analytics VLAN - https://phabricator.wikimedia.org/T304001 (10cmooney) Worth noting that we are planning in the short term to adjus...
[23:15:58] 10Traffic, 10SRE: Wikimedia domains unreachable (16 Mar 2022) - https://phabricator.wikimedia.org/T303903 (10AlexisJazz) >>! In T303903#7783673, @Aklapper wrote: >> You do not have permission to view this object. > Sorry, should work now. Thanks, https://phabricator.wikimedia.org/P22736
[23:49:56] (EdgeTrafficDrop) firing: 60% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org
[23:54:56] (EdgeTrafficDrop) resolved: 65% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org
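Closing the loop on the static.net finding from earlier: the fix amounts to adding the drmrs private IPv6 range alongside the other per-site private networks in homer's capirca-style definitions file. A sketch of the shape of the change (the token name and surrounding entry are illustrative; the real patch went to netops for review as discussed above):

    # operations/homer/public: definitions/static.net (illustrative excerpt)
    PRIVATE = 10.0.0.0/8                # RFC1918, all sites (illustrative)
              2a02:ec80:600:100::/56    # drmrs private IPv6 -- the missing entry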