[06:57:56] (EdgeTrafficDrop) firing: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[07:02:56] (EdgeTrafficDrop) resolved: 69% request drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=codfw&var-cache_type=text - https://alerts.wikimedia.org
[09:59:57] (EdgeTrafficDrop) firing: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[10:24:57] (EdgeTrafficDrop) resolved: 68% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org
[10:55:31] Acme-chief, cloud-services-team (Kanban): toolsbeta acme-chief certtificate has expired - https://phabricator.wikimedia.org/T301117 (aborrero)
[10:56:23] Acme-chief, cloud-services-team (Kanban): toolsbeta acme-chief certtificate has expired - https://phabricator.wikimedia.org/T301117 (aborrero) p:Triage→High
[10:56:47] Acme-chief, cloud-services-team (Kanban): toolsbeta acme-chief certtificate has expired - https://phabricator.wikimedia.org/T301117 (Vgutierrez) this is a current limitation of acme-chief, wipe the old/expired cert and restart acme-chief
[11:07:51] Traffic, netops, Infrastructure-Foundations, SRE: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (cmooney)
[11:08:11] Acme-chief, cloud-services-team (Kanban): toolsbeta acme-chief certtificate has expired - https://phabricator.wikimedia.org/T301117 (aborrero) Open→Resolved a:aborrero OK, thanks!! This is what I did: `lang=shell-session aborrero@toolsbeta-acme-chief-01:~$ sudo su root@toolsbeta-acme-chief-...
[13:43:41] if anyone's up for some bikeshedding: https://gerrit.wikimedia.org/r/c/operations/puppet/+/742686/ adds a new Cumin alias to target all our edge sites, current naming choices are "edges" or "pops", if anyone favours some name, please comment on the patch :-)
[13:46:21] I'll bikeshed!
[13:48:18] feelsgood.png
[13:49:10] my brain is all fucked up... I was going to mention "in the edge of glory" and I've realised that's a Lady Gaga song
[13:49:19] let me cry in my corner
[13:49:24] edges++
[13:49:38] yeah edges makes more sense to me
[13:49:48] technically chicago is also a pop, but we don't have any servers there
[13:50:01] pops remind me of "pop pop ret"
[13:52:11] Traffic, SRE, ops-ulsfo: SMART errors on cp4031 - https://phabricator.wikimedia.org/T300493 (Vgutierrez) this a gentle 7 days reminder :) @RobH.
[13:54:32] interesting, to me pop is less confusing than edge
[13:55:30] I've expanded a little on the bikeshe^Wcomment
[13:56:25] you can all blame me for starting it!
[13:57:48] lolz
[13:58:02] to be clear, I'm fine with whichever
[13:59:15] I vote for "pedgep", to not get confused :-P
[14:01:10] I mean if you think of the network as a graph, then the edges are the network links and these are technically nodes :P
[14:03:40] anyways, my line of thinking is that "network pop" means anywhere we have a router, really (incl eqdfw and eqord, which have no servers), and nobody uses "edge" for that term.
[14:03:59] so if it's cdn server sites, edge fits better than pop to me.
[14:05:19] got it yeah makes sense to me
[14:06:37] in the spirit of https://getyarn.io/yarn-clip/90a12f05-29bc-42d9-8fa7-38b0d86aa207 I could also add "edges" as an alias and make "pops" an alias to "edges"...
[14:08:26] hahaha indeed
[14:08:47] lol
[14:15:16] technically, our routers have ssh
[14:15:43] you could have a pops definition that's [edges + eqord + eqdfw] and include routers in cumin targets.
[14:18:17] bblack: we don't have those in puppetdb, hence cumin
[14:18:22] via the puppetdb backend
[15:04:42] so all we need to do to fix that, is find a way to run puppet agents on the BSD side of the junipers?
[15:04:45] sounds like a win/win! :)
[15:05:46] i think they have or had puppet support officially at some point
[15:05:55] there was something anyway
[15:06:19] yes you can run puppet client on junos
[15:06:27] can and want though are two very different things ;)
[16:53:32] Traffic, netops, Infrastructure-Foundations, SRE: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (ssingh) We have discussed this in the Traffic team and decided to go with `2001:67c:930::1/128`, mostly because we feel it's easy to memorize/copy (for cases where people want...
[17:00:55] Traffic, netops, Infrastructure-Foundations, SRE: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (cmooney) > We have discussed this in the Traffic team and decided to go with 2001:67c:930::1/128, mostly because we feel it's easy to memorize/copy (for cases where people want...
[17:01:15] Traffic, netops, Infrastructure-Foundations, SRE: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (cmooney)
[17:32:15] Traffic, SRE, ops-ulsfo: SMART errors on cp4031 - https://phabricator.wikimedia.org/T300493 (RobH) So checking on this and we have a few things: * a bad sda that is failing (SSD) * dimm b3 memory errors in dell service event log * system is out of warranty, and will be 5 years old on 2022-04-07 ** i...
[17:51:49] Traffic, decommission-hardware, ops-ulsfo: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (RobH)
[17:52:01] Traffic, decommission-hardware, ops-ulsfo: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (RobH)
[17:52:05] Traffic, SRE, ops-ulsfo: SMART errors on cp4031 - https://phabricator.wikimedia.org/T300493 (RobH)
[17:54:41] Traffic, decommission-hardware, ops-ulsfo: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (RobH)
[17:54:54] Traffic, decommission-hardware, ops-ulsfo: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (RobH)
[17:56:58] Traffic, SRE, ops-ulsfo: SMART errors on cp4031 - https://phabricator.wikimedia.org/T300493 (RobH) Open→Resolved IRC Update: Discussed this with @bblack in IRC and the call on this is to decom cp4031, and ensure the planned refresh for the ulsfo batch of everything (except the 4 new cp hosts...
[18:04:16] Traffic, decommission-hardware, ops-ulsfo, Patch-For-Review: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (RobH) https://gerrit.wikimedia.org/r/c/operations/puppet/+/761012 has the puppet bits that should happen before true decom of its existence it will fail its own puppetizat...
[18:05:01] Traffic, decommission-hardware, ops-ulsfo, Patch-For-Review: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (RobH) a:BBlack→RobH
[18:15:23] Traffic, netops, Infrastructure-Foundations, SRE: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (ssingh) >>! In T301165#7693858, @cmooney wrote: >> We have discussed this in the Traffic team and decided to go with 2001:67c:930::1/128, mostly because we feel it's easy to me...
[18:30:06] Traffic, netops, Infrastructure-Foundations, SRE: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (cmooney) ssingh: thanks! Yeah I'm not aware of any reason not to just match what was done with the IPv4, even if there are other options in this case. I've gone and added 3 I...
[18:32:21] Traffic, netops, Infrastructure-Foundations, SRE: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (ssingh) >>! In T301165#7694515, @cmooney wrote: > ssingh: thanks! Yeah I'm not aware of any reason not to just match what was done with the IPv4, even if there are other optio...