[07:54:17] <_joe_> ottomata: no it's not nice that you're not linking _tls_helpers. It means you'll always have to play catchup; also I'd like to understand why we didn't follow through with petr's patches.
[07:54:36] <_joe_> In practice, it means that next time we upgrade something I'll open an UBN! task for analytics to catch up.
[07:56:23] <_joe_> also, the "labels" discussion has been settled quite some time ago and it's good enough for everyone, including other projects with multiple deployments like shellbox
[08:28:20] serviceops, Kubernetes: Clarify common k8s label and service conventions in our helm charts - https://phabricator.wikimedia.org/T291848 (Joe) >For eventgate, this makes wmf.releasename == 'eventgate-production' for each eventgate deployment. I learned that this is safe because each deployment is in its o...
[08:57:07] serviceops, Kubernetes: Clarify common k8s label and service conventions in our helm charts - https://phabricator.wikimedia.org/T291848 (Joe) The whole app/chart/release/heritage combo of labels comes from what helm best practices suggest. Specifically: * `app` indicates the general chart we're using....
[10:01:00] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-10), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (Daimona)
[10:15:01] serviceops, SRE, Wikifeeds: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (akosiaris) Open→Resolved a:akosiaris And all of the above is mostly irrelevant and I am mostly blind and chasing ghosts (on the plus side I got more acqua...
[10:57:17] serviceops, Scap, Release-Engineering-Team (Doing): Deploy Scap version 4.0.0 - https://phabricator.wikimedia.org/T291095 (jijiki) @Ladsgroup run into this error: ` 10:39:22 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 352 host(s) 10:39:30 Unhandled error: Traceback (most recen...
[11:23:30] serviceops, MW-on-K8s, SRE: Repartition mediawiki servers - https://phabricator.wikimedia.org/T291918 (jijiki)
[11:44:11] serviceops, MW-on-K8s, Performance-Team, SRE, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (jijiki) >>! In T290536#7376383, @akosiaris wrote: >>>! In T290536#7371552, @jijiki wrote: >>>>! In T290536#7364817, @Joe wrote: >>> We could thus start...
[11:46:46] serviceops, MW-on-K8s, Performance-Team, SRE, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (jijiki)
[11:49:53] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-10), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (Daimona)
[11:50:08] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-10), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (Daimona)
[11:51:18] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-10), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (Daimona) @WDoranWMF, @hnowlan Hello, this is now unblocked and ready to go. Note that the version...
[11:57:45] serviceops, SRE, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (elukey) Thanks to Joe's refactoring (https://gerrit.wikimedia.org/r/c/operations/puppet/+/7234190) we have now a quick way to define -deploy users with separate permissions. I have creat...
[11:59:38] jelto: o/ when you have a moment lemme know what you think about --^
[12:13:00] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-10), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.13.0 - https://phabricator.wikimedia.org/T285857 (WMDE-Fisch)
[12:21:02] elukey: I had to take a look at the kubernetes user refactoring first. But looks good to me, see comment
[12:23:32] jelto: yep yep, I was wondering if we wanted a different file perm set for the new kubeconfig, etc.. long term we (as ML) will create a separate deployment group as well, so we'll keep our clusters separate
[12:26:23] serviceops, MW-on-K8s, Performance-Team, SRE, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (akosiaris) >>! In T290536#7383272, @jijiki wrote: > That is a good idea, I started a different task to discuss our options in partitioning our mediawi...
[12:28:29] elukey: I like the idea to restrict access to the -deployer kubeconfigs using a dedicated group (somewhat similar to the admin kubeconfig). I'll add a comment later to the task so that this idea won't get lost. Thanks!
[12:29:56] jelto: ok so for the moment it may be ok for me to proceed and see what breaks (I am pretty sure I'll find some interesting issues in my set up), and then I'll wait for the final decision by your team (very easy to fix the permissions afterwards)
[12:30:21] serviceops, SRE: restart-php7.2-fpm attempts to run as non-root but can’t actually restart service, leaving instance depooled - https://phabricator.wikimedia.org/T291921 (Lucas_Werkmeister_WMDE)
[12:30:49] serviceops, SRE: restart-php7.2-fpm attempts to run as non-root but can’t actually restart service, leaving instance depooled - https://phabricator.wikimedia.org/T291921 (Lucas_Werkmeister_WMDE) (Just to be clear, in case the task title is ambiguous: I’m aware this is my fault, I’m just suggesting to pre...
[12:31:39] elukey: makes sense. I will chat with other ServiceOps kubernetes folks to check if we want to introduce and manage a dedicated group for the -deployer kubeconfig access or not
[12:33:36] serviceops, Citoid: zotero paging / serving 5xxes after CPU spikes - https://phabricator.wikimedia.org/T291707 (Mvolz) The PDF connection might be a red herring because although that's what happened in the past, attempting to translate those was disabled here: https://github.com/zotero/translation-server...
[13:40:09] Pchelolo: deployed your envoy timeout change in eventgate-main
[13:40:18] great, thank you ottomata
[13:40:31] in a few hours we'll see if that helped
[13:41:01] aye
[13:41:02] !
[14:01:13] serviceops, Citoid: zotero paging / serving 5xxes after CPU spikes - https://phabricator.wikimedia.org/T291707 (akosiaris) >>! In T291707#7383407, @Mvolz wrote: > The PDF connection might be a red herring because although that's what happened in the past, attempting to translate those was disabled here:...
[15:18:20] serviceops, Scap, Release-Engineering-Team (Doing): Deploy Scap version 4.0.1 - https://phabricator.wikimedia.org/T291095 (dancy) >>! In T291095#7383145, @jijiki wrote: > So I rolled back scap on deploy* hosts, until a fix is ready and we can repackage. The fix has been merged (thanks @Majavah and @...
[15:18:36] serviceops, Scap, Release-Engineering-Team (Doing): Deploy Scap version 4.0.1 - https://phabricator.wikimedia.org/T291095 (dancy)
[15:33:51] serviceops, SRE, envoy: The TLS proxy configuration in deployment-charts allows invalid listeners - https://phabricator.wikimedia.org/T291959 (Joe)
[15:34:05] serviceops, SRE, envoy: The TLS proxy configuration in deployment-charts allows invalid listeners - https://phabricator.wikimedia.org/T291959 (Joe) p:Triage→High
[15:36:53] folks I'd need to add an extra deploy-kserve ClusterRole to the helmfile_rbac.yaml config, tried in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/724448
[15:37:07] but now I am wondering if I should guard it with an option to add to Values
[15:37:26] like addKserveClusterRole or similar (and possibly a similar one for deploy-flink?)
[15:37:46] otherwise the new ClusterRole gets deployed to main clusters too, not sure if ok or not
[15:38:35] maybe I can add in Values a new field with a list of extra cluster roles that one want to deploy
[15:40:57] <_joe_> elukey: can't look right now, if you find noone else ping me tomorrow morning
[15:41:41] _joe_ I have an idea, will try to code it and then I'll ping you tomorrow
[15:42:05] <_joe_> elukey: sure I assumed you already had
[15:53:31] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host mw2412.codfw.wmnet
[16:10:49] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage executed with errors: - mw2412 (**FAIL**) - Forced PXE for next reboot - Host rebooted v...
[16:21:22] serviceops, SRE: restart-php7.2-fpm attempts to run as non-root but can’t actually restart service, leaving instance depooled - https://phabricator.wikimedia.org/T291921 (Joe) p:Triage→Medium Lucas is correct, but I think the best fix is to avoid needing the `-i` in the sudo process there, but gi...
[16:43:29] Pchelolo: I haven't noticed any 503s yet!
[16:45:24] <_joe_> ottomata, Pchelolo can you port the patch to common_templates too? It's surely useful
[16:46:10] _joe_: already done. and already merged by Alex https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/724174/2
[16:46:17] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host mw2412.codfw.wmnet
[16:46:26] <_joe_> Pchelolo: <3
[16:55:08] will deploy to rest of eventgate instances later today
[16:55:37] _joe_: thanks for comments on label clarification, those are helpful. will follow up later too
[16:56:20] <_joe_> ottomata: I didn't get into the questions about specific templates, I think those are generally valid but it's better discussed on a patch
[16:58:22] even if some would make sense to change (e.g. wmf.releasename), changing them is so difficult it is probably not worth it. probably just some really good comment docs would do it
[16:58:28] i'll try to do that as the result of that task
[17:00:35] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage completed: - mw2412 (**WARN**) - Downtimed on Icinga - //Unable to disable Puppet, the h...
[17:04:34] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host mw2413.codfw.wmnet
[17:06:16] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Papaul)
[17:23:11] _joe_: what's the next thing that I need to do/find help with on the Toolhub rollout? My near term goal is getting https://toolhub.wikimedia.org routed to the eqiad pods so I can complete testing of the OAuth2 authn configuration.
[17:24:05] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage executed with errors: - mw2413 (**FAIL**) - Forced PXE for next reboot - Host rebooted v...
[17:25:10] <_joe_> bd808: legoktm should fix his dns discovery patch, then he has a traffic and pubdns patch and you're done
[17:25:33] excellent! Thanks for your help today
[17:31:15] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Papaul) @Volans mw2413 failed with the same error
[17:35:52] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin2002 for host mw2413.codfw.wmnet
[17:50:49] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage completed: - mw2413 (**WARN**) - Downtimed on Icinga - //Unable to disable Puppet, the h...
[18:01:59] serviceops, DC-Ops, SRE, ops-codfw, Patch-For-Review: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin1001 for host thumbor2005.codfw.wmnet
[18:04:34] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Papaul)
[18:18:38] serviceops, Continuous-Integration-Infrastructure, DC-Ops, Infrastructure-Foundations, netops: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) - https://phabricator.wikimedia.org/T283582 (Dzahn) @hashar This should be between netops and dcops I think.
[18:21:59] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage executed with errors: - thumbor2005 (**FAIL**) - Forced PXE for next reboot - Host r...
[18:26:17] serviceops, Scap, Release-Engineering-Team (Doing): Deploy Scap version 4.0.1 - https://phabricator.wikimedia.org/T291095 (Majavah)
[18:26:31] more scap 4 release issues sadly: T291095
[18:38:20] serviceops, Kubernetes, Patch-For-Review: Clarify common k8s label and service conventions in our helm charts - https://phabricator.wikimedia.org/T291848 (Ottomata) Ok, added some docs in patch, lemme know what you think. Still not sure what appbaseurl are for, or if we should make chartid also trun...
[19:00:55] serviceops, Scap, Release-Engineering-Team (Doing): Deploy Scap version 4.0.1 - https://phabricator.wikimedia.org/T291095 (dancy) Open→Stalled Holding for additional fixes.
[21:03:18] serviceops, SRE, Continuous-Integration-Config, Regression, Sustainability (Incident Followup): operations-apache-config-lint replacement doesn't check syntax - https://phabricator.wikimedia.org/T114801 (Krinkle)
[21:08:36] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage was started by pt1979@cumin1001 for host thumbor2005.codfw.wmnet
[21:22:22] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (ops-monitoring-bot) Cookbook cookbooks.sre.experimental.reimage completed: - thumbor2005 (**WARN**) - Downtimed on Icinga - //Unable to disable Pupp...
[21:25:13] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (Papaul) @Volans I was able to get thumbor2005 installed without adding the MAC address but the install failed also like mw2413 ` Run Puppet in NOOP mode...
[21:26:32] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install thumbor200[56].codfw.wmnet - https://phabricator.wikimedia.org/T290190 (Papaul)
[21:35:03] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` mw2414.codfw.wmnet ` The log can be found in `/var/...
[21:40:40] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` mw2415.codfw.wmnet ` The log can be found in `/var/...
[21:46:13] serviceops, Icinga, SRE, SRE Observability, and 2 others: incident 20170323-wikibase did not trigger Icinga paging - https://phabricator.wikimedia.org/T161528 (Krinkle)
[21:48:34] serviceops, SRE, Toolhub, Patch-For-Review, Service-deployment-requests: New Service Request Toolhub - https://phabricator.wikimedia.org/T280881 (Legoktm)
[21:59:40] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2414.codfw.wmnet'] ` and were **ALL** successful.
[22:00:00] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` mw2416.codfw.wmnet ` The log can be found in `/var/...
[22:02:15] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Papaul)
[22:06:48] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2415.codfw.wmnet'] ` and were **ALL** successful.
[22:08:05] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` mw2417.codfw.wmnet ` The log can be found in `/var/...
[22:27:13] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2416.codfw.wmnet'] ` and were **ALL** successful.
[22:29:14] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` mw2418.codfw.wmnet ` The log can be found in `/var/...
[22:34:17] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2417.codfw.wmnet'] ` and were **ALL** successful.
[22:40:47] serviceops, Wikifeeds, Sustainability (Incident Followup): Clarify in Wikifeeds documention the request flows - https://phabricator.wikimedia.org/T291912 (Aklapper)
[22:55:25] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2418.codfw.wmnet'] ` and were **ALL** successful.
[22:55:46] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2002.codfw.wmnet for hosts: ` mw2419.codfw.wmnet ` The log can be found in `/var/...
[23:21:53] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2419.codfw.wmnet'] ` and were **ALL** successful.
[23:24:06] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Papaul)