[06:57:06] Just a short reminder: we will start re-deploying services in the eqiad Kubernetes cluster soon. Feel free to ping me any time.
[07:22:23] <_joe_> jelto, jayme I have one suggestion about eqiad
[07:22:42] * jelto listening
[07:22:45] <_joe_> do not depool all services, but depool/redeploy/repool each one
[07:22:54] <_joe_> individually
[07:23:21] <_joe_> the reason is, given most services are actually either called by mediawiki (which is in eqiad) or by users in europe/asia at this time of the day
[07:23:37] <_joe_> switching them to codfw will cause a perf degradation
[07:23:52] <_joe_> one we're ok with, but no reason to make it last longer than necessary
[07:24:11] <_joe_> if this is too inconvenient, ofc, just go with depool all, redeploy all, repool all
[07:24:58] I think we would need to lower the ttl in that case to not have to wait 5min for each service
[07:25:04] We could do that, but that would mean the whole process takes even more time because we have to wait 5 minutes before each service, right?
[07:26:18] _joe_: what about doing that for the "big ones" only? sessionstore and mobileapps?
[07:26:38] <_joe_> jayme: mobileapps is actually ok to stay depooled
[07:26:44] <_joe_> I was thinking of eventgate mostly
[07:27:00] <_joe_> jelto: we can reduce the ttl to 10 seconds on all eqiad records for the time of the transition
[07:28:51] <_joe_> but yeah basically as long as we do eventgate-main, sessionstore, echostore individually
[07:28:57] <_joe_> the rest should be ok
[07:29:06] <_joe_> let me take a look at the list of all deployed services
[07:29:08] as the services are depooled already, I'd say we just repool those 3 then
[07:29:28] <_joe_> yep
[07:29:31] or migrate them now and repool?
[07:29:39] like as first ones i mean
[07:29:49] <_joe_> that too
[07:29:54] <_joe_> whatever you prefer
[07:30:16] jelto: then I'd say just migrate those three first
[07:30:19] so eventgate* first? I can do that
[07:30:26] and repool each one directly after migration is done
[07:30:30] <_joe_> eventgate-main
[07:30:32] and sessionstore and echostore
[07:30:34] <_joe_> and sessionstore
[07:30:43] ok
[07:32:30] <_joe_> https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=echostore&var-destination=eventgate-analytics&var-destination=eventgate-analytics-external&var-destination=eventgate-main&var-destination=sessionstore&var-destination=shellbox&var-destination=shellbox-constraints&var-destination=shellbox-syntaxhighlight&var-destination=shellbox-media&var-destination=shellbox-timeline&var-destination=termbox calls from mediawiki to other services
[10:18:22] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto) cc from ops list: The re-deploy for all services in the eqiad Kubernetes cluster was successful. However this time we had an impact on service availability. Planned reduced serv...
[10:18:34] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[10:40:15] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[13:01:25] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 (JMeybohm)
[14:18:06] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 (JMeybohm) I executed the steps outlined in "**4. How to fix the current situation in staging**" for staging-eqiad now to unblock...
[14:23:06] <_joe_> jelto: I think a short-form incident report for the outage of eventgate-main might be a good idea
[14:24:33] <_joe_> jelto: https://wikitech.wikimedia.org/wiki/Incident_status and use the lightweight report template
[14:25:46] <_joe_> if you need help filling it, I'm happy to help ofc
[14:30:16] serviceops, Prod-Kubernetes, Kubernetes: setup/install kubestage100[34] - https://phabricator.wikimedia.org/T293729 (JMeybohm) a:Arnoldokoth @Arnoldokoth the new nodes now have a ipam block assigned (I moved some test workload there to verify). From my POV you can continue with this when you have...
[15:36:58] _joe_: i started the incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-25_eventgate-main_outage I will add some more information and create follow up phab tasks. I will change the status as soon as I'm finished
[15:37:23] <_joe_> ack
[15:37:28] <_joe_> it doesn't need to be long
[15:37:52] <_joe_> it's just that we write somewhere that it happened and why, and if any actionables are there.
[22:33:35] serviceops, Phabricator, Release-Engineering-Team: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Hawkeye7) I'm not sure I understand what is proposed here... If Phabricator is not the authoritative place to host repositories, then what is? How is...
[22:59:46] serviceops, Phabricator, Release-Engineering-Team: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Reedy) >>! In T296022#7530227, @Hawkeye7 wrote: > I'm not sure I understand what is proposed here... If Phabricator is not the authoritative place to...