[03:27:35] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review, User-jbond: Cookbook for centralised logouts and session status queries - https://phabricator.wikimedia.org/T283242 (Legoktm) Should services like Gerrit, Mailman, etc. be added to this?
[08:24:46] netops, Infrastructure-Foundations, Traffic: lvs2007, lvs2009 and lvs2010 connected to the same row A switch - https://phabricator.wikimedia.org/T286879 (Vgutierrez)
[08:29:11] FYI the pad is up
[08:29:43] topranks, XioNoX: does the replacement of the codfw/A2 switch have any impact on tomorrow's row D maintenance? (in the sense that the former requires your time in a way that makes it necessary to postpone the latter) otherwise I'd send a mail wrt bast1003 now
[08:53:01] moritzm: I don't expect so, but I need to confer with XioNoX about the replacement procedure and would defer to his opinion having done it before.
[08:53:36] ack
[08:54:17] Hello, if anybody needs me I'm back from holidays :)
[08:55:49] welcome back :)
[08:57:48] indeed welcome back Joanna... hope you managed to enjoy some of that sunshine :)
[09:00:51] welcome back
[09:01:55] Thanks! topranks I think I'd rather have Irish weather - 30 degrees is way too hot for me. Did I miss out on something?
[09:03:29] work-wise? or weather-wise?
[09:03:54] We had a switch in eqiad die on Friday just out of the blue which was fun :)
[09:04:32] Still down, DC-Ops couldn't replace on the day and we decided it was a little tricky for remote hands. Hopefully be able to get it replaced with spare by Papaul later on.
[09:04:58] and I suggest watching the recording of the staff meeting (maybe the tech dept one too)
[09:05:42] work-wise :D
[09:07:30] Thank you!
[09:08:51] the switch failure was in codfw, not eqiad?
[09:13:14] or topranks is prophetic about this week's switch maintenances :-)
[09:13:50] no no, definitely not moritz let's not jinx it :D
[09:15:12] majavah: yes the failed device is in codfw. If we can get it swapped with spare this evening by Papaul we should be able to draw a line under that incident and be comfortable things in codfw are fully stable before tomorrow's work in eqiad.
[09:16:07] moritzm, topranks, yeah it's fine to keep the current schedule
[09:16:28] ack, I'll send a mail for bast1003 in a bit, then
[09:22:59] CAS-SSO, Infrastructure-Foundations: IDP/CAS doesn't load CSS/JS on error pages for disabled accounts - https://phabricator.wikimedia.org/T286885 (Volans) p:Triage→Medium
[10:29:20] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (MoritzMuehlenhoff)
[10:32:15] done
[11:35:13] CAS-SSO, Infrastructure-Foundations, User-jbond: IDP/CAS doesn't load CSS/JS on error pages for disabled accounts - https://phabricator.wikimedia.org/T286885 (jbond)
[12:18:41] CAS-SSO, Infrastructure-Foundations, SRE, User-jbond: Auf logout.d script for Phabricator - https://phabricator.wikimedia.org/T286904 (MoritzMuehlenhoff)
[12:19:29] CAS-SSO, Infrastructure-Foundations, SRE, User-jbond: Add logout.d script for Gerrit - https://phabricator.wikimedia.org/T286905 (MoritzMuehlenhoff)
[12:20:52] CAS-SSO, Infrastructure-Foundations, SRE, User-jbond: Add logout.d script for lists.wikimedia.org - https://phabricator.wikimedia.org/T286906 (MoritzMuehlenhoff)
[12:21:13] CAS-SSO, Infrastructure-Foundations, SRE, Patch-For-Review, User-jbond: Cookbook for centralised logouts and session status queries - https://phabricator.wikimedia.org/T283242 (MoritzMuehlenhoff) >>! In T283242#7220035, @Legoktm wrote: > Should services like Gerrit, Mailman, etc. be added to...
[12:21:33] CAS-SSO, Infrastructure-Foundations, SRE, User-jbond: Add logout.d script for Gerrit - https://phabricator.wikimedia.org/T286905 (MoritzMuehlenhoff) p:Triage→Medium
[12:21:52] CAS-SSO, Infrastructure-Foundations, SRE, User-jbond: Add logout.d script for lists.wikimedia.org - https://phabricator.wikimedia.org/T286906 (MoritzMuehlenhoff) p:Triage→Medium
[12:22:05] CAS-SSO, Infrastructure-Foundations, SRE, User-jbond: Auf logout.d script for Phabricator - https://phabricator.wikimedia.org/T286904 (MoritzMuehlenhoff) p:Triage→Medium
[12:30:53] CAS-SSO, Infrastructure-Foundations, Phabricator, SRE, User-jbond: Auf logout.d script for Phabricator - https://phabricator.wikimedia.org/T286904 (Majavah) Phabricator doesn't seem to offer this functionality to anyone else than the user itself. Also I don't think there is a way to get this...
[12:34:20] CAS-SSO, Infrastructure-Foundations, Phabricator, SRE, User-jbond: Auf logout.d script for Phabricator - https://phabricator.wikimedia.org/T286904 (Majavah) >>! In T286904#7221038, @Majavah wrote: > Phabricator doesn't seem to offer this functionality to anyone else than the user itself. Tur...
[12:40:36] CAS-SSO, Infrastructure-Foundations, Phabricator, SRE, User-jbond: Add logout.d script for Phabricator - https://phabricator.wikimedia.org/T286904 (RhinosF1)
[13:03:18] Mail, Infrastructure-Foundations: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 (MoritzMuehlenhoff)
[13:39:46] netops, Data-Persistence-Backup, Infrastructure-Foundations, SRE, bacula: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (jcrespo) One interesting effect is that, since the datacenter...
[14:13:41] CAS-SSO, Cloud Services Proposals, Infrastructure-Foundations, LDAP, and 2 others: Create solution for developer account authentication for services hosted in Cloud VPS - https://phabricator.wikimedia.org/T286716 (jbond)
[14:26:26] Mail, Infrastructure-Foundations: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 (herron) +1 for option 2, I think that will be a more straightforward approach overall. In either case let's include a step to route and flush queued mail to the MX in the other DC before erasing/reti...
[14:26:43] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[14:35:10] CFSSL-PKI, Infrastructure-Foundations, SRE, Patch-For-Review, User-jbond: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (jbond) > investigate switching ganeti cluster certificates to cfssl As far as I can tell the only things that use RAPI are netbox and the nrpe check....
[14:36:29] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[14:38:48] netops, Analytics, DBA, Infrastructure-Foundations, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[14:40:21] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[14:52:20] CAS-SSO, Infrastructure-Foundations, SRE, User-jbond: Add logout.d script for Gerrit - https://phabricator.wikimedia.org/T286905 (hashar) We can disable an account over ssh with `gerrit set-account --inactive` or via the REST API https://gerrit.wikimedia.org/r/Documentation/rest-api-accounts.html...
[14:52:29] CAS-SSO, Gerrit, Infrastructure-Foundations, SRE, User-jbond: Add logout.d script for Gerrit - https://phabricator.wikimedia.org/T286905 (hashar)
[16:03:05] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney)
[17:06:16] topranks: asw-a2-codfw is live, only left to re-enable access ports
[17:06:28] ok nice work!
[17:06:34] did you sort the issue with the serial/console server?
[17:07:11] We shut et-0/0/0 in cr1-codfw as well on Friday, did you bring that back up?
[17:07:37] yeah faulty cable
[17:07:47] yep uplink is passing prod traffic
[17:07:49] ok great :)
[17:07:51] cool
[17:08:18] now turning all access ports up except the LVS ones
[17:08:24] so what is the plan for the access ports? I was browsing some of the tasks created re: LVS and authdns but I can't say I'm 100% on what the order of operation should be.
[17:08:24] because of https://phabricator.wikimedia.org/T286921
[17:08:44] ok cool yes
[17:09:11] any other problems apart from the serial cable? notes for the wiki page etc?
[17:09:27] FYI there is also T286914
[17:09:28] T286914: Track actions to perform before repooling authdns2001 - https://phabricator.wikimedia.org/T286914
[17:10:05] volans: cool you can do it in 5min or so
[17:10:46] sure
[17:11:11] alright, hosts should come back up
[17:21:34] netops, DC-Ops, SRE, ops-codfw, Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (Papaul) switch back up online and Netbox updated
[17:31:18] topranks, XioNoX: have you run puppet on all affected hosts?
[17:31:22] we should I guess
[17:32:20] I've not done anything no, XioNoX didn't mention it so I suspect not either.
[17:33:17] it will run in max 30m anyway, it would have reduced the noise a bit, now if there is an alert maybe a puppet run will fix it, maybe not
[17:35:35] Ok. We're almost at that time limit now, not sure if it's worth doing manually?
[17:35:55] ok
[17:42:36] I've double-checked and we've valid ARP entries on CR1 against every MAC that is learnt on the replaced switch right now.
[17:43:03] So I think that should mean all hosts are good in pure reachability terms.
[17:43:09] no lost hosts?
[17:43:11] :)
[17:43:41] yeah, or machines connected to the wrong VLAN or something.
[17:43:47] (where you'd see MAC but no ARP)
[17:44:11] sure
[22:02:58] netops, DC-Ops, SRE, ops-codfw, Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (Papaul) Case Number:2021-0719-0629 created with Juniper
[22:06:52] netbox, Infrastructure-Foundations: Netbox: define strategy to track standard server configurations - https://phabricator.wikimedia.org/T284614 (Papaul) I will go for: PowerEdge R440 - Config A 202107
[23:28:40] netops, Analytics, DBA, Infrastructure-Foundations, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (Bstorm)
[23:52:23] netops, DC-Ops, SRE, ops-codfw, Wikimedia-Incident: asw-a2-codfw unresponsive - https://phabricator.wikimedia.org/T286787 (Papaul) Dear Juniper Networks Customer, A Return to Factory (RTF) RMA has been created. Details of which are provided below. ***** RMA DETAILS ***** RMA Number: R200361...