[11:42:13] <_joe_> effie, eoghan: we've disabled a requestctl rule (see SAL) which should have run its course. If you get paged for parsoid and / or restbase outages, re-enable it first, ask questions later
[11:42:36] Good to know, thanks! I'll make sure our US friends know as well at the end of the day.
[11:43:02] <_joe_> the rule is cache-text/wikifeeds_featured
[11:43:30] <_joe_> to enable it just sudo requestctl enable cache-text/wikifeeds_featured && sudo requestctl commit
[11:58:18] ack, thanks _joe_
[13:20:00] any network maint ongoing?
[13:27:42] there was some quick network disturbance that resulted in a page about phab1004 (but fixed itself quickly)
[13:28:17] looking at librenms, around the same time there were some "inbound interface errors" alerts on cr2-eqiad
[13:28:54] looks like they were on transport links for codfw<->eqiad
[13:30:48] indeed https://grafana.wikimedia.org/d/m1LYjVjnz/network-icmp-probes?orgId=1&var-site=codfw&var-site=eqiad&var-target_site=codfw&var-target_site=eqiad&var-role=cr&var-family=All&from=now-30m&to=now
[13:31:15] although, digging around librenms more, I see those cr2-eqiad "inbound interface errors" seem pretty frequent for quite a while now, so maybe that's unrelated
[13:31:46] commonly on the lumen transport
[13:32:32] I checked other hosts on the same rack and I didn't see the same errors- but if it is the router it may not affect all hosts equally
[13:32:59] e.g. if it was only eqiad-codfw links
[13:33:05] that started about a week ago (but again, may be unrelated): https://librenms.wikimedia.org/graphs/to=1699536600/id=11592/type=port_errors/from=1696858200/
[13:35:30] jelto: see also ^
[13:35:41] can't quite connect all the dots, but maybe-related
[13:38:35] those failed network probes were dc-local though, at least seem to be
[13:38:53] ah thanks! however the phab pa.ge looks a bit more related to php-fpm if I look the probe reports https://logstash.wikimedia.org/goto/c38d5b9a0e170c93f59f08bbc9ab298f and syslog on phab1004.
[13:38:54] prometheus1006 -> phab1004
[13:38:54] Get \"https://10.64.16.101:443/\": context deadline exceeded" and
[13:38:54] [WARNING] [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are
[13:38:54] 0 idle, and 23 total children
[13:40:49] the interface errors were tracked in https://phabricator.wikimedia.org/T342502 until "phaultfinder" changed the task description
[13:42:25] that same ticket used to have xe-3/2/2 on it?
[13:42:30] yeah
[13:42:48] yeah that doesn't seem like ideal behavior :)
[13:43:59] bblack: I think it's an edgecase, where at the same time the cr2 errors decreased, and the ssw1 errors started happening
[13:44:00] https://phabricator.wikimedia.org/T342502#9232643
[13:44:18] in theory it would have closed it and then re-opened a new one
[13:44:51] but I agree the librenms/prometheus/phab integration is not ideal, I don't think it's easy to have the hostname/interface in the task description for example
[13:52:49] should I make a manual ticket for it?
[13:54:49] made one, can close if it's duplicate or irrelevant :) https://phabricator.wikimedia.org/T350869
[13:58:24] bblack: the issue is that if the error rate goes up, Prometheus will open a new one :)
[14:01:30] re: having librenms open a task per host+interface should be doable, pretty sure we can do that right away for hostname not sure for interface, would you mind filing a task XioNoX ?
[14:01:45] godog: oh nice!
[14:01:47] yeah
[14:02:07] cheers
[14:05:51] godog: https://phabricator.wikimedia.org/T350872
[14:08:28] LGTM, I'll followup
[15:34:21] bblack,arnoldokoth: Hello! Relatively quiet oncall day today, except for the phabricator scare earlier. The other thing to flag is what _joe_ said earlier this morning, the `cache-text/wikifeeds_featured` requestctl rule was removed. If you get paged for parsoid or restbase, re-enable that rule is probably a good first step.
[15:40:08] eoghan: ack, thanks
[15:46:16] eoghan: Thanks.
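For whoever picks up the handoff, the re-enable step quoted at 11:43:30 can be sketched as a small wrapper. This is only an illustration: the `reenable_rule` function name and the `DRY_RUN` switch are hypothetical conveniences, not part of requestctl. The two `requestctl` invocations are copied from the log as-is. By default the sketch only prints the commands, so nothing touches production unless `DRY_RUN=0` is set on a host where requestctl is available.

```shell
#!/bin/sh
# Sketch only. reenable_rule and DRY_RUN are hypothetical helpers;
# the two requestctl commands are the ones quoted in the log at 11:43:30.
DRY_RUN="${DRY_RUN:-1}"

reenable_rule() {
    rule="$1"
    if [ "$DRY_RUN" = "1" ]; then
        # Dry run: print what would be executed instead of running it.
        echo "sudo requestctl enable $rule"
        echo "sudo requestctl commit"
    else
        # Real run: enable the rule, and commit only if enabling succeeded.
        sudo requestctl enable "$rule" && sudo requestctl commit
    fi
}

reenable_rule "cache-text/wikifeeds_featured"
```

Keeping the dry run as the default means the snippet can be pasted and read safely. The real path keeps the `&&` from the original message so a failed enable is never followed by a commit.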