[06:57:06] <jelto>	 Just a short reminder: we will start re-deploying services in the eqiad Kubernetes cluster soon. Feel free to ping me any time.
[07:22:23] <_joe_>	 jelto, jayme I have one suggestion about eqiad
[07:22:42] * jelto listening
[07:22:45] <_joe_>	 do not depool all services, but depool/redeploy/repool each one
[07:22:54] <_joe_>	 individually
[07:23:21] <_joe_>	 the reason is, given most services are actually either called by mediawiki (which is in eqiad) or by users in europe/asia at this time of the day
[07:23:37] <_joe_>	 switching them to codfw will cause a perf degradation
[07:23:52] <_joe_>	 one we're ok with, but no reason to make it last longer than necessary
[07:24:11] <_joe_>	 if this is too inconvenient, ofc, just go with depool all, redeploy all, repool all
[07:24:58] <jayme>	 I think we would need to lower the ttl in that case to not have to wait 5min for each service
[07:25:04] <jelto>	 We could do that, but that would mean the whole process takes even more time because we have to wait 5 minutes before each service, right?
[07:26:18] <jayme>	 _joe_: what about doing that for the "big ones" only? sessionstore and mobileapps?
[07:26:38] <_joe_>	 jayme: mobileapps is actually ok to stay depooled
[07:26:44] <_joe_>	 I was thinking of eventgate mostly
[07:27:00] <_joe_>	 jelto: we can reduce the ttl to 10 seconds on all eqiad records for the time of the transition
[07:28:51] <_joe_>	 but yeah basically as long as we do eventgate-main, sessionstore, echostore individually
[07:28:57] <_joe_>	 the rest should be ok
[07:29:06] <_joe_>	 let me take a look at the list of all deployed services
[07:29:08] <jayme>	 as the services are depooled already, I'd say we just repool those 3 then
[07:29:28] <_joe_>	 yep
[07:29:31] <jayme>	 or migrate them now and repool?
[07:29:39] <jayme>	 like as first ones i mean
[07:29:49] <_joe_>	 that too
[07:29:54] <_joe_>	 whatever you prefer
[07:30:16] <jayme>	 jelto: then I'd say just migrate those three first
[07:30:19] <jelto>	 so eventgate* first? I can do that
[07:30:26] <jayme>	 and repool each one directly after migration is done
[07:30:30] <_joe_>	 eventgate-main
[07:30:32] <jelto>	 and sessionstore and echostore
[07:30:34] <_joe_>	 and sessionstore
[07:30:43] <jelto>	 ok
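The plan agreed above (migrate eventgate-main, sessionstore and echostore first, repool each one right after its migration, and lower the DNS TTL from 5 minutes to 10 seconds) can be sketched roughly as follows. This is only an illustration: the function names, the example service list, and the assumption of exactly one TTL wait per service are hypothetical and not taken from any real tooling.

```python
# Hypothetical sketch of the ordering and TTL reasoning from the discussion.
# Nothing here calls real depool/repool tooling; it only models the plan.

# Services the discussion singles out for individual migrate-and-repool.
PRIORITY_SERVICES = ["eventgate-main", "sessionstore", "echostore"]


def plan_order(services, priority=PRIORITY_SERVICES):
    """Return services with the priority ones first, in priority order."""
    first = [name for name in priority if name in services]
    rest = [name for name in services if name not in priority]
    return first + rest


def total_dns_wait_seconds(service_count, ttl_seconds):
    """Total waiting time, assuming one full TTL wait per service."""
    return service_count * ttl_seconds


if __name__ == "__main__":
    # Example list only; the service names appear in the conversation.
    services = ["mobileapps", "sessionstore", "termbox",
                "eventgate-main", "echostore"]
    print(plan_order(services))
    # Waiting for the three priority services at a 5 minute vs 10 second TTL.
    print(total_dns_wait_seconds(3, 300), total_dns_wait_seconds(3, 10))
```

Under these assumptions, handling the three priority services one by one costs 900 seconds of waiting at a 5 minute TTL and 30 seconds at a 10 second TTL, which is why lowering the TTL makes the per-service approach practical.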
[07:32:30] <_joe_>	 https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=echostore&var-destination=eventgate-analytics&var-destination=eventgate-analytics-external&var-destination=eventgate-main&var-destination=sessionstore&var-destination=shellbox&var-destination=shellbox-constraints&var-destination=shellbox-syntaxhighlight&var-destination=shellbox-media&var-destination=shellbox-timeline&var-destination=termbox calls from mediawiki to other services
[10:18:22] <wikibugs>	 serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto) cc from ops list:   The re-deploy for all services in the eqiad Kubernetes cluster was successful. However this time we had an impact on service availability. Planned reduced serv...
[10:18:34] <wikibugs>	 serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[10:40:15] <wikibugs>	 serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[13:01:25] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 (JMeybohm)
[14:18:06] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: New Kubernetes nodes may end up with no Pod IPv4 block assigned - https://phabricator.wikimedia.org/T296303 (JMeybohm) I executed the steps outlined in "**4. How to fix the current situation in staging**" for staging-eqiad now to unblock...
[14:23:06] <_joe_>	 jelto: I think a short-form incident report for the outage of eventgate-main might be a good idea
[14:24:33] <_joe_>	 jelto: https://wikitech.wikimedia.org/wiki/Incident_status and use the lightweight report template
[14:25:46] <_joe_>	 if you need help filling it, I'm happy to help ofc
[14:30:16] <wikibugs>	 serviceops, Prod-Kubernetes, Kubernetes: setup/install kubestage100[34] - https://phabricator.wikimedia.org/T293729 (JMeybohm) a:Arnoldokoth @Arnoldokoth the new nodes now have a ipam block assigned (I moved some test workload there to verify). From my POV you can continue with this when you have...
[15:36:58] <jelto>	 _joe_: I started the incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-11-25_eventgate-main_outage I will add some more information and create follow-up phab tasks. I will change the status as soon as I'm finished
[15:37:23] <_joe_>	 ack
[15:37:28] <_joe_>	 it doesn't need to be long
[15:37:52] <_joe_>	 it's just that we write somewhere that it happened and why, and if any actionables are there.
[22:33:35] <wikibugs>	 serviceops, Phabricator, Release-Engineering-Team: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Hawkeye7) I'm not sure I understand what is proposed here... If Phabricator is not the authoritative place to host repositories, then what is? How is...
[22:59:46] <wikibugs>	 serviceops, Phabricator, Release-Engineering-Team: Deprecate git-ssh service on phabricator.wikimedia.org - https://phabricator.wikimedia.org/T296022 (Reedy) >>! In T296022#7530227, @Hawkeye7 wrote: > I'm not sure I understand what is proposed here... If Phabricator is not the authoritative place to...