[02:50:57] (HAProxyEdgeTrafficDrop) firing: 68% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[03:00:57] (HAProxyEdgeTrafficDrop) resolved: 69% request drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=esams&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[07:25:14] HTTPS, Traffic, Beta-Cluster-Infrastructure, Quality-and-Test-Engineering-Team (QTE), and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (AlexisJazz) Can someone do the thing again? It expired today.
[07:55:26] HTTPS, Traffic, Beta-Cluster-Infrastructure, Quality-and-Test-Engineering-Team (QTE), and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (Joe) p:Medium→High Please #traffic take a look at this prob...
[07:56:19] <_joe_> sukhe vgutierrez bblack brett ^^ can please someone fix this issue permanently?
[08:31:46] HTTPS, Traffic, Beta-Cluster-Infrastructure, Quality-and-Test-Engineering-Team (QTE), and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (Vgutierrez) The current deployment-prep instances are pretty far fr...
[08:34:38] Traffic, Beta-Cluster-Infrastructure: deployment-cache instances are missing several major features available in production - https://phabricator.wikimedia.org/T320930 (Vgutierrez)
[08:35:49] HTTPS, Traffic, Beta-Cluster-Infrastructure, Quality-and-Test-Engineering-Team (QTE), and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (Vgutierrez) I've created T320930 to track this
[08:35:50] _joe_: so who decides when is a good time to create a new deployment-cache instance?
[08:36:22] <_joe_> vgutierrez: you ask me? look in the mirror :P
[08:36:33] <_joe_> when I upgrade php in prod, I do it in beta first
[08:36:58] yeah, I test non-destructive changes in beta
[08:36:59] <_joe_> which means I spin up a new VM with the new version, and point hiera to it
[08:37:19] <_joe_> AIUI those cache instances still have ats-tls
[08:37:42] <_joe_> I would install a new instance with haproxy and all the stack you use now, but again, not my responsibility
[08:37:57] not sure that's mine either TBH
[08:38:17] anyways.. I'll do it instead of starting a useless ownership discussion
[08:39:26] <_joe_> yeah the ownership discussion regarding all of beta and how to make in maintainable needs to start with the ownership part
[08:39:55] <_joe_> we can just ensure it doesn't collapse or, more importantly, that it doesn't require us to constantly intervene to unbreak it
[08:40:18] <_joe_> I try to do just that with the mediawiki part, and sometimes fail at that too
[08:44:54] Traffic, Beta-Cluster-Infrastructure: deployment-cache instances are missing several major features available in production - https://phabricator.wikimedia.org/T320930 (Vgutierrez) p:Triage→High
[08:47:21] Hey. I'm fairly certain that to shorten the TTL of a DNS discovery service I just need to change it in etcd. However, I'd still like to know what the 300/10 TTL means in the dns/templates/wmnet file, if anyone has an answer. :)
[08:57:50] claime: correct, the TTL is managed by etcd, see also https://wikitech.wikimedia.org/wiki/DNS/Discovery#How_to_manage_a_DNS_Discovery_service and that syntax is gdnsd specific and it's basically max/min TTL, see https://github.com/gdnsd/gdnsd/wiki/GdnsdZonefile#dynadync-ttls
[08:57:56] (HAProxyEdgeTrafficDrop) firing: 59% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[08:58:22] volans: Thanks!
[09:01:54] claime: also, based on what you need to do, either the sre.discovery.service-route or the TTL-only bits of the sre.switchdc.services.* cookbooks might be useful
[09:02:14] for reference:
[09:02:14] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/discovery/service-route.py
[09:02:17] and
[09:02:19] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/switchdc/services/00-reduce-ttl-and-sleep.py
[09:04:36] For the TTL I actually already did it manually with conftool, hope it wasn't a mistake
[09:04:56] But the switchd.services could be useful rather than set/pooled=yes|no
[09:05:10] switchdc*
[09:05:39] The service-route actually, sorry, got mixed up in my tabs
[09:06:36] that one does also the TTL mangling
[09:06:42] *that one too
[09:06:52] Yeah, I can see that
[09:17:56] (HAProxyEdgeTrafficDrop) resolved: 67% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[09:24:14] Traffic, Beta-Cluster-Infrastructure, SRE, Cloud-VPS (Quota-requests): Request increased quota for deployment-prep Cloud VPS project - https://phabricator.wikimedia.org/T320932 (Vgutierrez)
[09:24:47] Traffic, Beta-Cluster-Infrastructure, SRE, Cloud-VPS (Quota-requests): Request increased quota for deployment-prep Cloud VPS project - https://phabricator.wikimedia.org/T320932 (dcaro) a:aborrero +1
[09:42:42] Traffic, Beta-Cluster-Infrastructure, SRE: deployment-cache instances are missing several major features available in production - https://phabricator.wikimedia.org/T320930 (aborrero)
[09:42:56] Traffic, Beta-Cluster-Infrastructure, SRE, Cloud-VPS (Quota-requests): Request increased quota for deployment-prep Cloud VPS project - https://phabricator.wikimedia.org/T320932 (aborrero) Open→Resolved
[09:46:02] Traffic, Beta-Cluster-Infrastructure, SRE, Cloud-VPS (Quota-requests): Request increased quota for deployment-prep Cloud VPS project - https://phabricator.wikimedia.org/T320932 (Vgutierrez) Thanks @aborrero && @dcaro
[10:49:05] Traffic, Beta-Cluster-Infrastructure, SRE: deployment-cache instances are missing several major features available in production - https://phabricator.wikimedia.org/T320930 (Vgutierrez) Stalled→In progress Spawning deployment-cache-text07 && deployment-cache-upload07...
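Editor's note on the 300/10 TTL question from 08:47:21: the 08:57:50 reply describes the syntax as "basically max/min TTL". A minimal Python sketch of that reading follows. The names `DynaTTL`, `parse_dyna_ttl` and `clamp_ttl` are hypothetical, and the clamping step is an assumption made for illustration, not gdnsd's actual implementation; the gdnsd zonefile documentation linked in the reply is the authoritative source.

```python
from typing import NamedTuple


class DynaTTL(NamedTuple):
    """Upper and lower TTL bounds, in seconds (hypothetical helper type)."""
    max_ttl: int
    min_ttl: int


def parse_dyna_ttl(field: str) -> DynaTTL:
    """Split a 'max/min' TTL field such as '300/10' into two integers."""
    max_part, sep, min_part = field.partition("/")
    if not sep:
        # This sketch only handles the explicit 'max/min' form seen in the log.
        raise ValueError(f"expected 'max/min', got {field!r}")
    bounds = DynaTTL(int(max_part), int(min_part))
    if not 0 < bounds.min_ttl <= bounds.max_ttl:
        raise ValueError(f"invalid TTL bounds: {field!r}")
    return bounds


def clamp_ttl(candidate: int, bounds: DynaTTL) -> int:
    """Assumption: a computed TTL is kept within [min_ttl, max_ttl]."""
    return max(bounds.min_ttl, min(candidate, bounds.max_ttl))


bounds = parse_dyna_ttl("300/10")
print(bounds.max_ttl, bounds.min_ttl)
print(clamp_ttl(600, bounds), clamp_ttl(5, bounds), clamp_ttl(60, bounds))
```

Under this reading, a record with 300/10 would never be served with a TTL above 300 seconds or below 10 seconds.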
[11:43:53] Traffic, SRE, SRE-swift-storage: Revisit CDN<-->Swift communication - https://phabricator.wikimedia.org/T317616 (LSobanski)
[13:18:19] HTTPS, Traffic, Beta-Cluster-Infrastructure, Quality-and-Test-Engineering-Team (QTE), and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (taavi)
[13:19:00] Traffic, Beta-Cluster-Infrastructure, SRE, Patch-For-Review: deployment-cache instances are missing several major features available in production - https://phabricator.wikimedia.org/T320930 (Vgutierrez) deployment-cache-text07 is up & running: ` vgutierrez@deployment-cache-text07:~$ curl --conne...
[13:38:51] Traffic, Beta-Cluster-Infrastructure, SRE: deployment-cache instances are missing several major features available in production - https://phabricator.wikimedia.org/T320930 (Vgutierrez) deployment-cache-upload07 is up & running as well: ` vgutierrez@deployment-cache-upload07:~$ curl -I --connect-to u...
[14:02:40] netops, Infrastructure-Foundations, ops-eqiad: Decommission eqiad cage WiFi - https://phabricator.wikimedia.org/T320962 (ayounsi) p:Triage→Medium
[14:45:39] Traffic, DC-Ops, SRE, ops-ulsfo: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (ssingh)
[14:56:43] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul)
[15:06:43] netops, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet - https://phabricator.wikimedia.org/T296411 (aborrero) Stalled→In progress p:Medium→High a:...
[15:10:58] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul)
[15:30:53] netops, Infrastructure-Foundations, SRE, ops-eqiad: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (Papaul)
[16:17:18] Traffic, SRE, Patch-For-Review: per-backend-service concurrency limits in ATS-BE - https://phabricator.wikimedia.org/T306223 (Ladsgroup) Stalled→Open >>! In T306223#8111941, @CDanis wrote: > Awaiting {T309651} to continue testing boldly unstalling it.
[18:11:56] (HAProxyEdgeTrafficDrop) firing: 30% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[18:16:56] (HAProxyEdgeTrafficDrop) resolved: 54% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[19:16:46] bblack: could you possibly help me with pybal/LVS one more time
[20:24:56] (HAProxyEdgeTrafficDrop) firing: 27% request drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqsin&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[20:29:56] (HAProxyEdgeTrafficDrop) resolved: (4) 65% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[21:21:37] mutante: pong :)
[21:22:28] unfortunately I can't now because meanwhile we got unrelated pages :(
[21:22:46] ok
[21:23:37] I reverted another attempt at properly removing git-ssh
[21:23:48] most alerts are gone but not all..