[00:40:47] serviceops, SRE, Developer Productivity, Performance-Team (Radar), and 2 others: All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) - https://phabricator.wikimedia.org/T214734 (Krinkle)
[00:41:44] serviceops, SRE, Developer Productivity, Performance-Team (Radar), and 2 others: Debug hosts sometimes Fatal error: "The UdpSocket to 127.0.0.1:10514 has been closed" - https://phabricator.wikimedia.org/T214734 (Krinkle)
[07:22:41] hey, in an hour or so I'm going to +2 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/671204/23 & https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/693411/8 so that the renamed chart is properly deployed to the chart repo and make sure the latest patch of the chain becomes green, please let me know if you have any objections with this
[07:30:06] jayme ^
[07:30:13] dcausse: there is a patch following 693411
[07:30:15] ?
[07:30:43] which has chart: wmf-stable/flink-session-cluster in helmfile.yaml?
[07:31:12] I think since we are not creating the namespace yet, it is fine to do it in 693411 too
[07:32:45] effie: I think I wanted to have the CI green on all patches
[07:32:56] ah!
[07:33:02] sorry, now I understand
[07:33:04] my bad
[07:33:08] np :)
[08:05:20] good morning :)
[08:06:20] the ml k8s master nodes can now reach pod ips on worker nodes (yay!) but they cannot reach service ips, and IIUC this needs a special config like https://docs.projectcalico.org/networking/advertise-service-ips
[08:07:11] I double checked on ml-serve1004, where the istiod pod runs (that in turn exposes the webhook etc..) and the service ip is not present in the output of "route"
[08:07:29] (but I can see more IPs related to pods etc.., all with iface calico-blabla)
[08:07:34] does it make sense?
[08:13:43] (also we have empty GlobalNetworkPolicy, might be a good time to start thinking about it)
[08:24:28] (I am also re-installing istio from scratch, it is maybe a problem with old service leftovers and calico)
[08:24:58] yes :)
[08:25:58] of course now the ingress gateway pod is not coming up but this is progress!
[08:52:57] elukey: I can take a look at that later today
[08:57:41] jayme: all working so far, envoy in ingress is now telling me that I am a bad person, but it is a new problem, the routing works :)
[09:00:47] elukey: so you're just advertising service ip's now?
[09:02:14] jayme: I think it was a stale config between istio and calico, the service ip advertised among the routes on the ml-serve node was not the right one. I deleted the istio deployment/pod/service and re-ran istioctl, now it works (so I see the service being advertised)
[09:02:33] ah, okay. Cool!
[09:33:48] re: mwmaint upgrade. one more thing. noc.wikimedia.org is hosted on mwmaint.discovery which still points to mwmaint1002, so if we don't want that to be down (at least it's not the same as config-master anymore) we need to switch that to codfw first
[09:34:21] to check that it works.. make some httpbb checks for it
[09:50:22] https://gerrit.wikimedia.org/r/704297
[09:59:21] arr.. and to make that work also need ferm holes on mwmaint to connect from deploy, as usual
[10:22:07] all green https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/693416, thanks for the merges! :)
[10:22:38] I'll ping you this afternoon to check if it's OK to try to deploy to staging
[10:36:24] mutante: yeah, we should switch noc and ideally have it auto switch with the rest of MediaWiki during the switchover...
[10:37:02] legoktm: I made patches to: - add tests for noc to httpbb, add firewall hole to let us use them, will check mwmaint2001 passes them, switch over,.. upgrade to buster
[10:58:04] https://noc.wikimedia.org backend switched to codfw
[11:12:11] serviceops, SRE, Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[11:12:57] serviceops, SRE, Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn) Stalled→Open Yes, that's correct. We are reimaging eqiad first. Just switched noc.wikimedia.org backend to codfw to avoid any downtime of that. mwmaint2002 will be done o...
[11:13:50] serviceops, SRE, Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mwmaint1002.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202107...
[11:39:49] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Jelto)
[11:40:48] reimaging mwmaint1002 with buster
[11:52:37] serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn)
[11:52:50] serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn) Stalled→Open
[12:02:15] serviceops, SRE, Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mwmaint1002.eqiad.wmnet'] ` and were **ALL** successful.
[12:12:55] serviceops, SRE, Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn) ` [mwmaint1002:~] $ lsb_release -c Codename: buster ` mwmaint1002 is on buster now. puppet runs without errors or warnings. https://noc.wikimedia.org is hosted by mwmaint2002 in...
[12:17:07] serviceops, SRE, Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn)
[12:20:24] serviceops, SRE, Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn) @Legoktm done ^ the noc site is now hosted in codfw (leaving it like that until we switch back, right?). and mwmaint1002 is now on buster and puppet did not show any issues. it ha...
[12:20:46] serviceops, SRE, Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn) Also we have this now which shows the noc site works on both hosts also after reimage: ` [deploy1002:~] $ httpbb /srv/deployment/httpbb-tests/noc/* --hosts mwmaint1002.eqiad.wmne...
[12:38:17] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Jelto)
[17:07:08] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1281.eqiad.wmnet` - m...
[17:20:38] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1282.eqiad.wmnet` - m...
[17:21:46] serviceops, Maps, Patch-For-Review, User-jijiki: Deploy tegola-vector-tiles to kubernetes - https://phabricator.wikimedia.org/T283159 (jijiki)
[17:30:20] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1282.eqiad.wmnet` - m...
[17:57:28] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1283.eqiad.wmnet` - m...
[18:00:52] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn)
[20:17:46] did the bast2002 key change? "Warning: the ECDSA host key for 'bast2002.wikimedia.org' differs from the key for the IP address '2620:0:860:2:208:80:153:54'" but https://wikitech.wikimedia.org/w/index.php?title=Help:SSH_Fingerprints/bast2002.wikimedia.org&action=history doesn't show anything recent ... did someone deploy a change but not update wiki yet?
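Re the host key question above: one way to compare what a client has stored against a published fingerprint is to compute the SHA256 fingerprint from the stored key line. This is only a minimal sketch, not anything from the channel. The function name `sha256_fingerprint` and the placeholder key line are hypothetical, and it assumes the usual OpenSSH convention that the fingerprint is the unpadded base64 of the SHA-256 digest of the decoded key blob.

```python
import base64
import hashlib

# Key type names that can precede the base64 blob in a key line.
KEY_TYPES = {
    "ssh-ed25519",
    "ssh-rsa",
    "ecdsa-sha2-nistp256",
    "ecdsa-sha2-nistp384",
    "ecdsa-sha2-nistp521",
}


def sha256_fingerprint(key_line: str) -> str:
    """Return an OpenSSH-style SHA256 fingerprint for a key line.

    Accepts both "<type> <base64>" (public key file) and
    "<host> <type> <base64>" (known_hosts style) layouts.
    """
    fields = key_line.split()
    for i, field in enumerate(fields):
        if field in KEY_TYPES and i + 1 < len(fields):
            # The field after the key type is the base64-encoded key blob.
            blob = base64.b64decode(fields[i + 1])
            digest = hashlib.sha256(blob).digest()
            # OpenSSH prints the digest as base64 without '=' padding.
            encoded = base64.b64encode(digest).decode("ascii").rstrip("=")
            return "SHA256:" + encoded
    raise ValueError("no supported key type found in line")


if __name__ == "__main__":
    # Placeholder blob, NOT a real host key; substitute a line from known_hosts.
    fake_blob = base64.b64encode(b"placeholder-key-material").decode("ascii")
    print(sha256_fingerprint("host.example ecdsa-sha2-nistp256 " + fake_blob))
```

If the fingerprints computed for the hostname entry and for the IP entry differ, the warning is about a stale entry on the client side rather than necessarily a changed server key.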