[22:12:21] !incidents
[22:12:21] 4486 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org)
[22:12:21] 4485 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org)
[22:12:22] 4484 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org)
[22:12:22] 4483 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org)
[22:14:03] https://librenms.wikimedia.org/alerts/ doesn't show anything unusual right now
[22:14:43] Kingsoft cloud corporation maybe?
[22:15:05] https://superset.wikimedia.org/superset/dashboard/p/58XBDkYvz2d/
[22:15:14] https://librenms.wikimedia.org/device/device=93/alerts clear
[22:15:28] It keeps resolving/paging
[22:15:51] https://librenms.wikimedia.org/graphs/device=93/type=device_bits/from=1708812898/legend=yes/popup_title=Device+Traffic/ looks organic rather than unusual
[22:16:53] brett: I'm tempted to silence that and leave it for tomorrow, since our key metrics are all good ?
[22:17:31] brett: cf https://wikitech.wikimedia.org/wiki/Network_monitoring#LibreNMS_alerts
[22:17:42] I'll make a task
[22:18:03] Yeah, I looked there and it seemed fine. Superset and https://grafana.wikimedia.org/d/oMIu2XI4z/cdn-data-transfer-rates?orgId=1&var-site=codfw&var-min_step=2m&var-cluster=All&from=now-3h&to=now were where I was finding anomalies
[22:18:23] Seems like silencing is a good idea
[22:18:40] although that is a ton of traffic increase
[22:20:30] Mmm, that is quite a bit more outbound traffic
[22:24:05] Honestly not sure if this is worth further action now (I was on my way to bed)
[22:24:44] I'm having a hard time figuring out where the issue is but that's likely due to incompetence :(
[22:27:17] oh, I see. It's facebook
[22:27:28] It's been going on for a while, so I didn't zoom out enough
[22:27:36] as32934
[22:28:17] where are you getting that from?
[22:28:43] https://superset.wikimedia.org/superset/dashboard/p/z0Gr4gNBxyP/
[22:29:57] Hm, yes
[22:31:25] 2a03:2880:25ff::/36
[22:31:47] oh, an even bigger prefix, really
[22:31:49] and those requests from as32934 are largely to upload.wm.org which is what you'd expect
[22:32:52] So.... is the expectation to rate limit the AS?
[22:33:44] largely to load.php
[22:36:18] If we thought this was going to cause a wider problem, then we could rate-limit from that as to upload.wikimedia.org
[22:36:55] Do you think it will?
[22:37:55] I'm playing with the requestctl generator page and it ignores as_number as a filter
[22:38:59] which leads me to think (possibly incorrectly) that using requestctl to block from just as32934 wouldn't be straightforward
[22:39:16] Oh yeah, I think I remember that being an issue
[22:39:23] It has to be IP ranges
[22:41:12] that as has a bunch of IP ranges (https://whois.ipip.net/AS32934)
[22:42:13] Yeah, that's a huge range
[22:42:42] I think I'm not confident enough of working out the appropriate range(s) to rate-limit
[22:42:52] Yeah, same
[22:43:02] Sooo.... leave it for tomorrow and you get your sleep?
[22:43:22] + silence
[22:44:09] I think so, on the basis that we'd expect further pages if it starts to be a wider problem?
[22:44:25] That is my thought, yes
[22:45:30] OK, let me see if i can figure out how to silence that alert
[22:47:05] done
[22:47:14] You're the man now, dog
[22:47:28] I'll make a ticket too
[22:47:28] I don't know why I felt the need to say that
[22:47:44] Thanks so much for the help. Sleep well :)
[22:48:21] Just got a page :(
[22:51:24] I just set a silence on alerts.wm.o
[22:51:34] for 24 hours. That should do it, I think?
[22:52:12] T358455 is the task
[22:52:13] T358455: Primary outbound port utilisation over 80% alert muted - https://phabricator.wikimedia.org/T358455
[22:52:21] !incidents
[22:52:22] 4487 (UNACKED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org)
[22:52:22] 4486 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org)
[22:52:22] 4485 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org)
[22:52:22] 4484 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org)
[22:52:23] 4483 (RESOLVED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org)
[22:52:31] !ack 4487
[22:52:31] 4487 (ACKED) Primary outbound port utilisation over 80% (paged) global noc (cr2-codfw.wikimedia.org)
[22:53:31] We'll definitely want to revisit on Monday morning
[22:53:54] Hopefully that was the right thing to do tonight though :-/
[22:54:19] Thanks again for your help