[07:48:30] serviceops: Support Canary releases on Kubernetes - https://phabricator.wikimedia.org/T282148 (Joe) @jijiki I think this task can be closed?
[08:47:44] serviceops, Prod-Kubernetes: Better scaffolding for helm charts / releases - https://phabricator.wikimedia.org/T292818 (Joe)
[08:47:57] serviceops, Prod-Kubernetes: Better scaffolding for helm charts / releases - https://phabricator.wikimedia.org/T292818 (Joe) p:Triage→Medium
[09:27:47] serviceops: Support Canary releases on Kubernetes - https://phabricator.wikimedia.org/T282148 (jijiki) I think what we are missing here is how to get prometheus metrics strictly for the canary deployment. I confess I have not dug deeper into this.
[11:21:38] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki)
[11:41:22] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki) After the last tuning, the results were more promising: {F34678817} On the other hand, we seem to be hitting max accelerate...
[14:04:46] serviceops, SRE, good first task: Upgrade all deployment charts to use the latest version of common_templates - https://phabricator.wikimedia.org/T292390 (jijiki)
[14:12:48] serviceops: Migrate WMF Production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Daimona) >>! In T271736#7391906, @Reedy wrote: > Which has been accepted. Waiting on a changelog update and a release tagging. Seems like it was released 1 hour ago as 6.2.0. Should I hijack T271777 a bit...
[14:19:18] serviceops: Migrate WMF Production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 (Reedy) >>! In T271736#7412247, @Daimona wrote: >>>! In T271736#7391906, @Reedy wrote: >> Which has been accepted. Waiting on a changelog update and a release tagging. > > Seems like it was released 1 hour...
[15:49:57] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Add label kubernetes.io/metadata.name to all namespaces - https://phabricator.wikimedia.org/T290476 (elukey) Open→Resolved Applied also to ml-serve, now the setting is turned on by default in commons.yaml :)
[15:52:10] jayme: I took the liberty to close it, lemme know if it is missing something --^
[19:20:12] serviceops, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1018.eqiad.wmnet ` The log can be found i...
[19:24:05] serviceops, SRE, Datacenter-Switchover: Services without a service IP cannot automatically be switched by the switchdc cookbook - https://phabricator.wikimedia.org/T285707 (BBlack)
[19:30:50] serviceops, MW-on-K8s, SRE, MW-1.37-notes (1.37.0-wmf.20; 2021-08-23), Patch-For-Review: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (Legoktm) After reading https://en.wikipedia.org/wiki/Proxy_server#Transparent_proxy I'm not exactly sure "t...
[19:31:12] serviceops, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1019.eqiad.wmnet ` The log can be found i...
[19:32:06] serviceops, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1020.eqiad.wmnet ` The log can be found i...
[19:37:44] brennen: hey, so the "cas_session_duration" change on gitlab.. that seems pretty urgent.. is that right?
[19:38:11] if you are here I'll merge that. I can't find other examples of it being used though
[19:38:35] mutante: hey hey - i patched it manually, so nothing is actively broken, but it should just recreate the current situation
[19:38:51] as long as deploying it a) changes /etc/gitlab/gitlab.rb and b) runs gitlab-ctl reconfigure
[19:39:11] (if it doesn't run reconfigure, also not a problem, as it'd then be consistent with the actually-deployed config.)
[19:41:02] brennen: so puppet is disabled but if we merge it we can enable puppet again and running it should hopefully be noop. right?
[19:41:16] looking
[19:41:46] puppet looks enabled
[19:42:12] yeah, my understanding was j.elto had re-enabled puppet earlier today. and yeah, running that change should be a no-op.
[19:42:23] but then how could you manually patch it
[19:43:28] brennen: it has the "1" value in /etc/gitlab/gitlab.rb so your fix was reverted by puppet I guess
[19:43:43] works because puppet does not do the restart?
[19:43:51] well, let's merge that then
[19:45:02] serviceops, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1018.eqiad.wmnet'] ` Of which those **FAILED**: ` ['kubernetes1018.eqiad.wmnet'] `
[19:45:11] compiling
[19:46:03] yes, it does edit /etc/gitlab/gitlab.rb
[19:46:50] this gets confusing fast because gitlab.rb gets compiled down to yaml elsewhere by `gitlab-ctl reconfigure`
[19:47:00] brennen: about to merge but technically not a noop, it will change it from 1 to 604800 , so reapplies the fix again
[19:47:08] mutante: that should be ok
[19:47:10] ok
[19:47:48] ah, so you edited the yaml and not the .rb?
[19:47:55] and it works until it regenerates that?
[19:48:33] deploying
[19:49:12] correct
[19:49:15] serviceops, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubernetes1018.eqiad.wmnet ` The log can be found i...
[19:49:18] brennen: done. on 1001 and 2001
[19:49:19] -gitlab_rails['omniauth_cas3_session_duration'] = 1
[19:49:19] +gitlab_rails['omniauth_cas3_session_duration'] = 604800
[19:49:22] lgtm - thanks!
[19:49:27] yep, no
[19:49:29] np
[19:49:52] brennen: Notice: /Stage[main]/Gitlab/Service[gitlab-ce]: Triggered 'refresh' from 1 event
[19:50:04] refresh though not restart ?
[19:50:21] Notice: /Stage[main]/Gitlab/Exec[Reconfigure GitLab]: Triggered 'refresh' from 1 event
[19:50:37] but yea, it does this because the config was edited. there it is
[19:51:30] Notice: /Stage[main]/Gitlab/Exec[Reconfigure GitLab]/returns: Notes:
[19:51:43] Notice: /Stage[main]/Gitlab/Exec[Reconfigure GitLab]/returns: Found old initial root password file at /etc/gitlab/initial_root_password and deleted it.
[19:52:29] Notice: /Stage[main]/Gitlab/Exec[Reconfigure GitLab]/returns: gitlab Reconfigured!
[19:52:38] serviceops, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1019.eqiad.wmnet'] ` and were **ALL** successful.
[19:52:44] ah, but the refresh evidently doesn't trigger a restart?
[19:52:54] this was on 2001
[19:53:19] serviceops, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1020.eqiad.wmnet'] ` and were **ALL** successful.
[19:53:27] now that i think about it, reconfigure does a restart on its own, i believe.
[19:53:36] all sounds like that's probably how it should work.
[19:53:52] serviceops, CirrusSearch, Discovery-Search, Infrastructure-Foundations, and 6 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (BBlack) Update on the ca-certificates end of this: Debian has a patch that wi...
[19:53:54] If you set hasrestart to true, Puppet will use the init script’s restart command.
[19:54:04] ^ with "refresh"
[19:54:23] "You can provide an explicit command for restarting with the restart attribute."
[19:54:29] serviceops, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubestage1003.eqiad...
[19:54:37] "If you do neither, the service’s stop and start commands will be used."
[19:55:09] serviceops, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` kubestage1004.eqiad...
[20:03:31] serviceops, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (Jclark-ctr) Confirmed: Service Request 1072368852 was successfully submitted. for kubernetes1021
[20:15:59] serviceops, DC-Ops, SRE, ops-eqiad: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubestage1003.eqiad.wmnet'] ` and were **ALL** successful.
[20:16:46] serviceops, DC-Ops, SRE, ops-eqiad: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubestage1004.eqiad.wmnet'] ` and were **ALL** successful.
[20:17:59] serviceops, DC-Ops, SRE, ops-eqiad: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (Cmjohnson)
[20:18:06] serviceops, DC-Ops, SRE, ops-eqiad: Q2: (Need By: TBD) rack/setup/install kubestage100[34].eqiad.wmnet - https://phabricator.wikimedia.org/T290894 (Cmjohnson) Open→Resolved
[20:18:42] serviceops, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (Cmjohnson)
[20:19:26] serviceops, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (Cmjohnson) kubernetes1018-1020 are fully installed, once we figure out and fix the issue with 1021 we'll be able to close the task.
[20:23:04] serviceops, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['kubernetes1018.eqiad.wmnet'] ` and were **ALL** successful.
[21:35:41] serviceops, CirrusSearch, Discovery-Search, Infrastructure-Foundations, and 6 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (Legoktm) >>! In T292291#7413420, @BBlack wrote: > Update on the ca-certificat...
[22:09:38] serviceops, Shellbox, User-brennen, Wikimedia-production-error: Shellbox\ShellboxError: Shellbox server returned status code 503 - https://phabricator.wikimedia.org/T292663 (Legoktm) Just saw another one while looking elsewhere in logstash https://logstash.wikimedia.org/app/discover#/doc/logstash...