[06:17:29] <_joe_> !incidents
[06:17:30] 4540 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[06:17:30] 4542 (UNACKED) [4x] ProbeDown sre (text-https:443 probes/service)
[06:17:30] 4544 (UNACKED) [2x] HaproxyUnavailable cache_text global sre ()
[06:17:30] 4545 (RESOLVED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad)
[06:17:31] 4543 (RESOLVED) [2x] VarnishUnavailable global sre (varnish-text)
[06:17:31] 4541 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_ip4 eqiad)
[06:17:31] 4539 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[06:17:31] 4538 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[06:17:50] <_joe_> !ack 4544 4542
[06:17:51] Could not ack the alert. Please check the parameters.
[06:17:54] <_joe_> !ack 4544
[06:17:54] 4544 (ACKED) [2x] HaproxyUnavailable cache_text global sre ()
[06:17:59] <_joe_> !ack 4542
[06:17:59] 4542 (ACKED) [4x] ProbeDown sre (text-https:443 probes/service)
[06:26:41] <_joe_> !incidents
[06:26:41] 4540 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[06:26:41] 4546 (UNACKED) [3x] ProbeDown sre ()
[06:26:42] 4544 (RESOLVED) [2x] HaproxyUnavailable cache_text global sre ()
[06:26:42] 4542 (RESOLVED) [4x] ProbeDown sre (text-https:443 probes/service)
[06:26:42] 4545 (RESOLVED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad)
[06:26:42] 4543 (RESOLVED) [2x] VarnishUnavailable global sre (varnish-text)
[06:26:42] 4541 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_ip4 eqiad)
[06:26:43] 4539 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[06:26:43] 4538 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[06:32:26] <_joe_> !incidents
[06:32:26] 4540 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[06:32:26] 4546 (UNACKED) [3x] ProbeDown sre ()
[06:32:27] 4544 (RESOLVED) [2x] HaproxyUnavailable cache_text global sre ()
[06:32:27] 4542 (RESOLVED) [4x] ProbeDown sre (text-https:443 probes/service)
[06:32:27] 4545 (RESOLVED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad)
[06:32:27] 4543 (RESOLVED) [2x] VarnishUnavailable global sre (varnish-text)
[06:32:27] 4541 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_ip4 eqiad)
[06:32:28] 4539 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[06:32:28] 4538 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[06:33:38] <_joe_> !ack 4546
[06:33:38] 4546 (ACKED) [3x] ProbeDown sre ()
[10:34:55] nemo-yiannis, urandom: confirming that change fixed the commons redirects
[10:34:58] thanks topranks!
[10:39:11] <_joe_> hnowlan: is wikifeeds back to healthy?
[10:40:46] _joe_: it's been largely stable since we depooled the newer restbase hosts around 1800 yesterday, but hopefully this change will mean it's ~fixed when we repool
[10:41:05] not 100% certain this will have fixed the latency concerns but we've eliminated the 5xx problem
[10:41:11] <_joe_> hnowlan: looks like the problem was ipv6
[10:41:34] as with all things restbase, "yes, but"
[10:41:46] <_joe_> I did try to resist dual-stacking everything at the applayer, I expected it to create more and more problems, but I failed...
[10:41:52] hnowlan: glad to hear!
[10:42:08] would it work with v6 if we allowed the address range in that regex?
[10:42:12] I think so
[10:42:26] I'm writing a little change to allow us to configure that list of ranges rather than having it hardcoded
[10:42:54] cool... that's the best longer term fix
[10:43:11] we can use aggregates so the number of ranges to include won't be huge or change much
[10:44:18] Tell me if I should hold off on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1005756
[10:46:37] thanks hnowlan
[10:49:50] latency is also down after depooling yesterday (i think it was heavily correlated to the 500s)
[10:50:24] is it worth filing a ticket to properly fix the rule on how to tag internal/external traffic with ipv6 support?
[10:52:35] *fix the rule on restbase
[10:52:38] I'm a bit out of context, but to fully understand, was this another case of hosts provisioned with/without AAAA records for their IPv6s but were supposed to not have/have them?
[11:02:56] hnowlan: all good for me to start moving traffic to mw-api-int then?
[11:13:25] I'll proceed then, I'll stop puppet on P:restbase, then run it on one, check logs and re-enable
[11:16:03] claime: sgtm
[11:57:18] hnowlan: everything has been repooled already
[11:57:34] ah cool
[11:58:51] btw, currently stalled for the rollout of the mw-api-int migration because we didn't actually implement the split listener for envoy in puppet
[11:59:09] I'll do a full switchover to mw-api-int and roll it out progressively on the nodes
[11:59:21] volans: I think so, though I'm not sure anyone knew that ipv6 dns records would cause problems here
[11:59:40] I didn't
[12:04:23] urandom: ack, thx
[12:28:58] !incidents
[12:28:58] 4540 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[12:28:58] 4548 (ACKED) [2x] ProbeDown sre (phab1004:443 probes/custom eqiad)
[12:28:59] 4547 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_ip4 eqiad)
[12:28:59] 4546 (RESOLVED) [3x] ProbeDown sre ()
[12:28:59] 4544 (RESOLVED) [2x] HaproxyUnavailable cache_text global sre ()
[12:28:59] 4542 (RESOLVED) [4x] ProbeDown sre (text-https:443 probes/service)
[12:28:59] 4545 (RESOLVED) PHPFPMTooBusy appserver sre (php7.4-fpm.service eqiad)
[12:29:00] 4543 (RESOLVED) [2x] VarnishUnavailable global sre (varnish-text)
[12:29:00] 4541 (RESOLVED) ProbeDown sre (10.64.16.101 ip4 phab1004:443 probes/custom http_phabricator_wikimedia_org_ip4 eqiad)
[12:29:00] 4539 (RESOLVED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[12:51:25] <_joe_> denisse: as you can see, we had fun this morning
[12:55:18] <_joe_> !resolve 4540
[12:55:18] 4540 (ACKED) GatewayBackendErrorsHigh sre (wikifeeds_cluster rest-gateway eqiad)
[12:55:37] <_joe_> sigh, the api is broken since time immemorial...
[16:15:22] Dear SREs, we will be pooling codfw back tomorrow at 14:00 UTC
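
The 10:42 exchange above proposes replacing a hardcoded address regex with a configurable list of ranges so that IPv6 clients are also tagged as internal. A minimal sketch of that idea, not the actual restbase change: the function names and the two ranges below are hypothetical placeholders, and the real aggregates would come from configuration.

```python
import ipaddress

# Hypothetical placeholder ranges, NOT the real production aggregates.
# A few aggregates per address family keep the list short and stable.
INTERNAL_RANGES = ["10.0.0.0/8", "fd00::/8"]


def parse_ranges(ranges):
    """Parse CIDR strings into network objects (raises ValueError on bad input)."""
    return [ipaddress.ip_network(r) for r in ranges]


def is_internal(addr, networks):
    """Return True if addr (IPv4 or IPv6 string) falls inside any configured network."""
    ip = ipaddress.ip_address(addr)
    # Treat IPv4-mapped IPv6 addresses (::ffff:a.b.c.d) as their IPv4 form.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    # Only compare addresses against networks of the same family.
    return any(ip.version == net.version and ip in net for net in networks)


networks = parse_ranges(INTERNAL_RANGES)
```

Matching on parsed networks rather than on a textual regex means adding IPv6 support is a configuration change (one more aggregate in the list) rather than a pattern rewrite.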