[02:20:09] Puppet, Infrastructure-Foundations, SRE Observability (FY2021/2022-Q1), User-fgiunchedi: Puppet failing on the alert hosts should alert - https://phabricator.wikimedia.org/T283151 (lmata)
[02:21:25] CAS-SSO, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q1), User-jbond: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (lmata)
[02:21:43] netops, Infrastructure-Foundations, SRE, SRE Observability (FY2021/2022-Q1): Ingest Cron and Root Alerts Into Logstash - https://phabricator.wikimedia.org/T274377 (lmata)
[02:22:07] netops, DC-Ops, Infrastructure-Foundations, SRE, SRE Observability: SCS CPU monitoring issue - https://phabricator.wikimedia.org/T285229 (lmata)
[02:23:33] Mail, Infrastructure-Foundations, SRE, SRE Observability, Sustainability (Incident Followup): Add exim queue size to grafana graph - https://phabricator.wikimedia.org/T275867 (lmata)
[02:23:52] Mail, Icinga, Infrastructure-Foundations, SRE, SRE Observability: fix/streamline mail routing off of neon - https://phabricator.wikimedia.org/T80890 (lmata)
[02:24:48] SRE-tools, Infrastructure-Foundations, SRE Observability, IPv6, User-crusnov: Some Observability clusters apparently do not support IPv6. - https://phabricator.wikimedia.org/T271138 (lmata)
[02:26:34] CAS-SSO, Infrastructure-Foundations, SRE, SRE Observability, and 3 others: Sign-in links from Grafana dashboards don't work when not signed into SSO - https://phabricator.wikimedia.org/T269272 (lmata)
[02:27:08] Puppet, Infrastructure-Foundations, SRE, SRE Observability: Notification spam from "last puppet run" upon re-enabling puppet - https://phabricator.wikimedia.org/T263720 (lmata)
[02:30:07] SRE-tools, Icinga, Infrastructure-Foundations, SRE, SRE Observability: ops-monitoring-bot creating dupes - https://phabricator.wikimedia.org/T226908 (lmata)
[02:30:41] Puppet, Icinga, Infrastructure-Foundations, SRE, SRE Observability: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784 (lmata)
[02:32:02] Puppet, Infrastructure-Foundations, SRE, SRE Observability: Icinga alert for hosts with no Puppet roles - https://phabricator.wikimedia.org/T238006 (lmata)
[02:33:36] netops, Infrastructure-Foundations, SRE, SRE Observability: replace check_ripe_atlas Python script with a check_prometheus backed by atlasexporter data - https://phabricator.wikimedia.org/T251155 (lmata)
[02:33:44] netops, Infrastructure-Foundations, SRE, SRE Observability: add traceroute measurements to RIPE Atlas prometheus data - https://phabricator.wikimedia.org/T251156 (lmata)
[02:39:22] Mail, Infrastructure-Foundations, SRE, SRE Observability, User-MoritzMuehlenhoff: Fix paniclog alert to only sent mails once - https://phabricator.wikimedia.org/T257016 (lmata)
[02:39:44] netops, Infrastructure-Foundations, SRE, SRE Observability: Provision plaintext syslog collectors in esams/ulsfo/eqsin - https://phabricator.wikimedia.org/T243065 (lmata)
[07:13:02] are we going to have the IF meeting later or shall we do async updates via the pad? Only John and myself were around last week anyway and Chris/Joanna are off
[07:32:12] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (elukey) >>! In T286032#7197078, @MoritzMuehlenhoff wrote: > Looking at Ganeti VMs, they broadly fall under three/four categories: >...
[07:39:51] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (elukey)
[08:03:59] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[08:04:11] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[08:04:19] netops, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[08:04:29] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[08:04:41] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[08:05:11] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (cmooney)
[08:13:57] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (ArielGlenn) With the new schedule I think I can swap one dumpsdata host and one snapshot host and avoid any impact whatsoever on XMl/SQL dumps....
[08:26:58] moritzm: I'm easy enough on the meeting. As you say not many updates probably. Arzhel is also off this week.
[08:27:20] btw thanks for the detailed updates regarding VMs on those tickets, appreciate it.
[08:30:54] ack, let's hear what John and Riccardo prefer when they're around
[08:31:25] yep.
[08:32:56] btw on the rpki VMs my sense is we don't need to worry about the CRs getting disconnected from rpki1001, they are peered to rpki2001 in codfw so should still have a live session (and indeed the blip may not be long enough to interrupt the first).
[08:33:37] But I am not very familiar with the setup, so open to correction (I can confirm with Arzhel when he is back)
[08:40:46] ah, seems so, yes. In fact Arzhel already marked it as not needing any action: https://phabricator.wikimedia.org/T286065#7194729
[08:45:08] I think it makes sense to complement the tables for the switch updates with another (table) row? We have "Action required", but it might make sense to add another one like "Action taken" or so? some of the prep tasks can happen way beforehand (e.g. I'd failover irc.wikimedia.org in the next days) and then there's a clearer picture what's still blocking
[08:54:43] Ack, that makes a lot of sense. Maybe something like "Action Status" would be a good heading? And we could put in "complete", "pending" or just some free-form notes?
[08:55:12] yeah, that makes sense
[09:00:01] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (aborrero) >>! In T286065#7194569, @Bstorm wrote: > @aborrero does cloudgw require manual failover? it doesn't require manual failover, but we could...
[09:02:21] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Kormat)
[09:07:11] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[09:07:57] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[09:11:49] also easy either way in relation to the meeting
[09:12:10] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[09:13:29] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[09:15:23] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[09:15:59] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[09:29:34] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (MoritzMuehlenhoff)
[09:38:58] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[09:51:10] SRE-tools, Infrastructure-Foundations, Spicerack: switchdc SAL log entries are getting cut off because long lines are being split over IRC - https://phabricator.wikimedia.org/T285709 (dcaro) Could we avoid IRC from the chain to store the SAL message? Maybe make tcpircbot/logmsgbot go directly to sta...
[09:51:14] * volans too for the meeting, I don't have much updates from the week off
[09:51:40] we could use it to chat about okrs details if needed, but no strong opinion from my side
[09:51:53] let's do async then, we're just four people and most were off?
[09:52:09] syncing up on OKRs works better next week anyway when we're complete again
[09:52:18] wfm
[09:52:23] netops, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[09:53:41] sgtm
[09:53:57] I've updated the pad to mark it as async
[09:59:41] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[10:02:10] Puppet, Icinga, Infrastructure-Foundations, SRE, and 2 others: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784 (jbond) Open→Resolved a:jbond being bold and closing this based on last comment
[10:26:40] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[10:29:17] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney) @BStorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently including those in the row connected to //cl...
[10:33:16] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[10:33:49] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney) @Bstorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently including those in the row connected to...
[14:26:31] netbox, Infrastructure-Foundations: Netbox: define strategy to track standard server configurations - https://phabricator.wikimedia.org/T284614 (Volans) >>! In T284614#7188879, @wiki_willy wrote: > Hope this helps, but let me know and I can hop on during your next office hours as well. I've sent the inv...
[14:55:02] SRE-tools, Infrastructure-Foundations, Spicerack: switchdc SAL log entries are getting cut off because long lines are being split over IRC - https://phabricator.wikimedia.org/T285709 (RLazarus) >>! In T285709#7204976, @dcaro wrote: > Could we avoid IRC from the chain to store the SAL message? In pr...
[14:57:32] I want to add something about the switch buffer reconfig to the agenda for the SRE meeting today.
[14:58:19] Anyone got any advice on whether it's better to go under "service interruptions (other maintenance?)" or just leave in the Netops section of the agenda?
[14:58:24] topranks: just add it to the gdoc and bold it, then it'll be raised when going through topics
[14:58:38] service interruptions is the best fit
[14:58:47] ok thanks :)
[14:58:53] although that's for past interruptions
[14:59:57] just add it under "Any other maintenance and expansions"
[15:00:14] cool
[15:16:10] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Adjust egress buffer allocations on ToR switches - https://phabricator.wikimedia.org/T284592 (jijiki) @cmooney should we sent out an email about this to ops@ and possibly add those times/dates to the maintenance calendar? Thank you!
[15:27:17] netops, DBA, Infrastructure-Foundations, SRE, cloud-services-team (Kanban): Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (hnowlan)
[15:45:37] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (MoritzMuehlenhoff)
[15:45:54] netops, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (MoritzMuehlenhoff)
[18:55:46] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Dwisehaupt)
[18:57:27] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Dwisehaupt) Still need to confirm the window with Advancement, but it is looking ok right now. There will be some work on the FR-Tech side to ensure d...
[21:28:16] netops, Analytics, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Bstorm) >>! In T286065#7205088, @cmooney wrote: > @BStorm / @aborrero as mentioned on IRC I messed up with the list of servers here, inadvertently inc...
[21:31:13] netops, DBA, Infrastructure-Foundations, SRE: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Bstorm) @cmooney Do the cloudsw switches get impacted by row B updates?