[00:11:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1006:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [00:38:20] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/950464 [00:38:26] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/950464 (owner: 10TrainBranchBot) [00:46:57] (JobUnavailable) firing: (2) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:52:28] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/950464 (owner: 10TrainBranchBot) [01:15:56] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:30:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:40] !log [WDQS] `ryankemper@wdqs1006:~$ sudo systemctl restart wdqs-blazegraph wdqs-categories` (free allocators decreasing rapidly -> solution is a simple restart of query service on host) [01:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:46:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1006:9193 is 
burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [02:06:41] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:24:53] (ConfdResourceFailed) firing: (120) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [02:31:41] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:51:10] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:59:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:04:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:21:10] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [04:22:50] (03PS2) 10Stevemunene: datahub: Enable OIDC to idp_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) [04:23:51] (03CR) 10CI reject: [V: 04-1] datahub: Enable OIDC to idp_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [04:31:10] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [05:47:30] jouncebot: nowandnext [05:47:30] No deployments scheduled for the foreseeable future! [05:47:30] No deployments scheduled for the foreseeable future! [05:48:50] I'm updating MinT. No major changes. 
[05:49:32] (03PS2) 10KartikMistry: Update MinT to 2023-08-14-091403-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/950063 (https://phabricator.wikimedia.org/T336683) [05:51:33] (03PS1) 10Zabe: add su namespace translations [extensions/ProofreadPage] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950808 (https://phabricator.wikimedia.org/T344314) [05:54:24] (03CR) 10Zabe: [C: 03+2] noc: Disclose langlist-labs to noc.wm.o [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950175 (owner: 10Zabe) [05:55:05] (03Merged) 10jenkins-bot: noc: Disclose langlist-labs to noc.wm.o [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950175 (owner: 10Zabe) [05:58:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [extensions/ProofreadPage] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950808 (https://phabricator.wikimedia.org/T344314) (owner: 10Zabe) [06:03:11] (03CR) 10KartikMistry: [C: 03+2] Update MinT to 2023-08-14-091403-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/950063 (https://phabricator.wikimedia.org/T336683) (owner: 10KartikMistry) [06:04:09] (03Merged) 10jenkins-bot: Update MinT to 2023-08-14-091403-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/950063 (https://phabricator.wikimedia.org/T336683) (owner: 10KartikMistry) [06:04:28] 10SRE, 10SRE-Access-Requests, 10DBA, 10Patch-For-Review: mariadb: grant user 'phstats' additional select on phabricator_repository DB - https://phabricator.wikimedia.org/T344513 (10Marostegui) a:05Aklapper→03Marostegui [06:05:19] (03CR) 10Marostegui: [C: 03+2] mariadb: grant user 'phstats' additional select on phabricator_repository db [puppet] - 10https://gerrit.wikimedia.org/r/950683 (https://phabricator.wikimedia.org/T344513) (owner: 10Aklapper) [06:05:44] (03PS2) 10Marostegui: parsercachepurging.pp: Increase retention back to 30 days [puppet] - 10https://gerrit.wikimedia.org/r/877205 
(https://phabricator.wikimedia.org/T280604) [06:05:46] (03PS1) 10Marostegui: dbproxy1018: Depool clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/950964 (https://phabricator.wikimedia.org/T334651) [06:06:15] (03Abandoned) 10Marostegui: dbproxy1018: Depool clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/950964 (https://phabricator.wikimedia.org/T334651) (owner: 10Marostegui) [06:06:51] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/machinetranslation: apply [06:07:23] (03PS1) 10Marostegui: dbproxy1018: Depool clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/950965 (https://phabricator.wikimedia.org/T334651) [06:09:11] 10SRE, 10SRE-Access-Requests, 10DBA, 10Patch-For-Review: mariadb: grant user 'phstats' additional select on phabricator_repository DB - https://phabricator.wikimedia.org/T344513 (10Marostegui) 05Open→03Resolved Patch merged and grants applied live. Please reopen if you need something else [06:09:21] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply [06:09:31] (03CR) 10Marostegui: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/950965 (https://phabricator.wikimedia.org/T334651) (owner: 10Marostegui) [06:09:46] (03CR) 10CI reject: [V: 04-1] dbproxy1018: Depool clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/950965 (https://phabricator.wikimedia.org/T334651) (owner: 10Marostegui) [06:10:36] (03PS2) 10Marostegui: dbproxy1018: Depool clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/950965 (https://phabricator.wikimedia.org/T334651) [06:11:54] (03Merged) 10jenkins-bot: add su namespace translations [extensions/ProofreadPage] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950808 (https://phabricator.wikimedia.org/T344314) (owner: 10Zabe) [06:12:40] !log zabe@deploy1002 Started scap: Backport for [[gerrit:950808|add su namespace translations (T344314)]] [06:12:44] T344314: Initial configurations for suwikisource - 
https://phabricator.wikimedia.org/T344314 [06:13:00] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [06:15:13] (03CR) 10Marostegui: [C: 03+2] dbproxy1018: Depool clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/950965 (https://phabricator.wikimedia.org/T334651) (owner: 10Marostegui) [06:16:26] (03PS1) 10Marostegui: clouddb1019: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/950973 (https://phabricator.wikimedia.org/T334651) [06:18:59] (03CR) 10Marostegui: [C: 03+2] clouddb1019: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/950973 (https://phabricator.wikimedia.org/T334651) (owner: 10Marostegui) [06:19:03] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [06:19:50] (03PS1) 10Marostegui: Revert "dbproxy1018: Depool clouddb1019" [puppet] - 10https://gerrit.wikimedia.org/r/950809 [06:20:18] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 14 down 2: https://wikitech.wikimedia.org/wiki/HAProxy [06:21:42] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:22:50] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply [06:23:02] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:24:53] (ConfdResourceFailed) firing: (120) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [06:27:18] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [06:27:20] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The 
following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:28:11] !Update MinT to 2023-08-14-091403-production (T336683) [06:28:11] T336683: Enable MinT support for languages with no Wikipedia yet - https://phabricator.wikimedia.org/T336683 [06:28:57] !log Update MinT to 2023-08-14-091403-production (T336683) [06:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:29:01] :/ [06:30:39] !log installing Linux 5.10.191 kernel updates [06:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:52] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1018: Depool clouddb1019" [puppet] - 10https://gerrit.wikimedia.org/r/950809 (owner: 10Marostegui) [06:32:27] (03PS1) 10Marostegui: install_server: Do not reimage db2191 [puppet] - 10https://gerrit.wikimedia.org/r/950975 [06:33:00] (HelmReleaseBadStatus) firing: (3) Helm release mw-misc/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:33:01] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2191 [puppet] - 10https://gerrit.wikimedia.org/r/950975 (owner: 10Marostegui) [06:36:01] 10SRE, 10ops-knams, 10DC-Ops, 10Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (10ayounsi) [06:36:27] yeah, deploying to k8s (mw-debug) failed [06:36:50] (03CR) 10Ayounsi: [C: 03+1] netbox: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/950167 (owner: 10Muehlenhoff) [06:38:00] (HelmReleaseBadStatus) resolved: (4) Helm release mw-misc/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:38:53] !log zabe@deploy1002 Started 
scap: Backport for [[gerrit:950808|add su namespace translations (T344314)]] [06:38:58] T344314: Initial configurations for suwikisource - https://phabricator.wikimedia.org/T344314 [06:41:03] (03PS3) 10Stevemunene: datahub: Enable OIDC to idp_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) [06:41:48] (03CR) 10CI reject: [V: 04-1] datahub: Enable OIDC to idp_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [06:43:01] (03PS1) 10Zabe: Revert "add su namespace translations" [extensions/ProofreadPage] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950810 [06:43:14] (03CR) 10Zabe: [V: 03+2 C: 03+2] Revert "add su namespace translations" [extensions/ProofreadPage] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950810 (owner: 10Zabe) [06:43:27] reverting until k8s deployment gets fixed [06:44:05] (03PS4) 10Stevemunene: datahub: Enable OIDC to idp_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) [06:45:02] (03CR) 10Ayounsi: [C: 03+2] Don't advertise small nets to customers [homer/public] - 10https://gerrit.wikimedia.org/r/948081 (https://phabricator.wikimedia.org/T340448) (owner: 10Ayounsi) [06:45:17] (03PS2) 10Ayounsi: Don't advertise small nets to customers [homer/public] - 10https://gerrit.wikimedia.org/r/948081 (https://phabricator.wikimedia.org/T340448) [06:51:11] (03PS1) 10Ayounsi: Update wheels to pickup Aerleon 1.7.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/951038 (https://phabricator.wikimedia.org/T337082) [06:52:08] (03CR) 10Ayounsi: [C: 03+2] Update wheels to pickup Aerleon 1.7.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/951038 (https://phabricator.wikimedia.org/T337082) (owner: 10Ayounsi) [06:53:16] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:53:24] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Update Homer wheels - ayounsi@cumin1001 [06:55:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Update Homer wheels - ayounsi@cumin1001 [07:01:17] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: ganeti3001.esams.wmnet [07:01:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: ganeti3001.esams.wmnet [07:01:23] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: ganeti3002.esams.wmnet [07:01:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: ganeti3002.esams.wmnet [07:01:28] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: ganeti3003.esams.wmnet [07:01:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: ganeti3003.esams.wmnet [07:03:36] !log prometheus ops eqiad +300G on the filesystem [07:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:18] 10SRE, 10Observability-Metrics, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027 (10fgiunchedi) [07:16:21] (03PS1) 10Muehlenhoff: Add owner contact for role::insetup::serviceops_collab [puppet] - 10https://gerrit.wikimedia.org/r/951039 [07:18:39] (03PS1) 10Muehlenhoff: Add role_contact for role::common::server [puppet] - 10https://gerrit.wikimedia.org/r/951040 [07:37:54] (03CR) 10Btullis: [C: 03+1] "nit: role:common:ceph:server in commit message title." 
[puppet] - 10https://gerrit.wikimedia.org/r/951040 (owner: 10Muehlenhoff) [07:45:38] (03PS1) 10Ayounsi: border-in: remove atlas-esams [homer/public] - 10https://gerrit.wikimedia.org/r/951041 [07:46:13] (03CR) 10Ayounsi: [C: 03+2] border-in: remove atlas-esams [homer/public] - 10https://gerrit.wikimedia.org/r/951041 (owner: 10Ayounsi) [07:46:46] (03Merged) 10jenkins-bot: border-in: remove atlas-esams [homer/public] - 10https://gerrit.wikimedia.org/r/951041 (owner: 10Ayounsi) [07:49:46] !log Draining ml-serve2004 for Kubelet partition resize [07:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:01] (03CR) 10Hashar: [C: 03+1] scap: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/945749 (owner: 10Muehlenhoff) [07:56:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:00:02] 10ops-codfw, 10collaboration-services, 10decommission-hardware: Decommission contint2001.wikimedia.org - https://phabricator.wikimedia.org/T342017 (10Jelto) a:05Arnoldokoth→03None Adjusting tags for DC-Ops (they need `ops-codfw` tag instead of team tag to proceed). 
[08:01:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST deployments) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:02:41] !log Draining ml-serve2005 for Kubelet partition resize [08:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:22] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/951039 (owner: 10Muehlenhoff) [08:08:16] (03Abandoned) 10JMeybohm: mw-on-k8s: Redirect officewiki to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/932857 (https://phabricator.wikimedia.org/T337490) (owner: 10Clément Goubert) [08:08:39] (03CR) 10JMeybohm: [C: 03+1] mediawiki: Reduce memory request [deployment-charts] - 10https://gerrit.wikimedia.org/r/950177 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [08:10:25] !log Draining ml-serve2006 for Kubelet partition resize [08:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:48] (03CR) 10Muehlenhoff: [C: 03+2] Add owner contact for role::insetup::serviceops_collab [puppet] - 10https://gerrit.wikimedia.org/r/951039 (owner: 10Muehlenhoff) [08:10:52] (03PS1) 10Giuseppe Lavagetto: termbox-test: call mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/951043 (https://phabricator.wikimedia.org/T334064) [08:12:03] (03PS2) 10Muehlenhoff: Add role_contact for role::ceph::server [puppet] - 10https://gerrit.wikimedia.org/r/951040 [08:13:03] (03PS2) 10Samtar: wikidiff2: set maxSplitSize = 10 on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950049 (https://phabricator.wikimedia.org/T341754) (owner: 10HMonroy) [08:13:15] (03CR) 10Samtar: [C: 03+1] wikidiff2: set maxSplitSize = 10 on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950049 
(https://phabricator.wikimedia.org/T341754) (owner: 10HMonroy) [08:13:32] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s: Reserve system resources on k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/949843 (https://phabricator.wikimedia.org/T277876) (owner: 10JMeybohm) [08:13:37] (03CR) 10Muehlenhoff: [C: 03+2] Add role_contact for role::ceph::server [puppet] - 10https://gerrit.wikimedia.org/r/951040 (owner: 10Muehlenhoff) [08:14:48] 10SRE, 10Infrastructure-Foundations, 10netops: Default allowed SSH parameters on upgraded Juniper mgmt routers prevent some connections - https://phabricator.wikimedia.org/T320272 (10ayounsi) For the record, another possible workaround: ` mr1-esams> start shell % ssh root@10.80.128.6 -m hmac-... [08:16:09] (03CR) 10Clément Goubert: [C: 03+1] termbox-test: call mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/951043 (https://phabricator.wikimedia.org/T334064) (owner: 10Giuseppe Lavagetto) [08:17:31] 10SRE, 10Infrastructure-Foundations, 10netops: Implement better filter on BGP_Customer_out - https://phabricator.wikimedia.org/T340448 (10ayounsi) 05Open→03Resolved All done. 
[08:19:22] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 15600 [08:19:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15600 [08:22:11] (03PS2) 10Giuseppe Lavagetto: termbox-test: call mw-api-int [deployment-charts] - 10https://gerrit.wikimedia.org/r/951043 (https://phabricator.wikimedia.org/T334064) [08:26:11] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Reduce memory request [deployment-charts] - 10https://gerrit.wikimedia.org/r/950177 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [08:26:56] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1108.eqiad.wmnet with OS bullseye [08:27:16] !log restart prometheus@beta - T344582 [08:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:19] T344582: beta / deployment-prep alerts show up in production alertmanager - https://phabricator.wikimedia.org/T344582 [08:27:31] (03Merged) 10jenkins-bot: mediawiki: Reduce memory request [deployment-charts] - 10https://gerrit.wikimedia.org/r/950177 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [08:30:02] zabe: Just saw you had a k8s deployment fail, do you have logs or anything? [08:31:17] (03CR) 10Btullis: "Thanks Steve. The LDAP and JAAS changes look good, but the networkpolicy still needs a little work." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [08:31:59] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:33:08] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10JMeybohm) [08:36:15] (03PS1) 10Clément Goubert: mediawiki: Allow autocomputing the memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/951045 [08:37:58] (03PS1) 10Giuseppe Lavagetto: Use ClusterConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951046 [08:38:00] (03PS1) 10Giuseppe Lavagetto: ClusterConfig: also allow to return hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951047 [08:38:02] (03PS1) 10Giuseppe Lavagetto: Replace calls to wfHostname with clusterconfig ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951048 [08:38:04] (03PS1) 10Giuseppe Lavagetto: Disable things that don't work on k8s when on k8s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951049 [08:41:52] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1108.eqiad.wmnet with reason: host reimage [08:42:31] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:42:32] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [08:43:18] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [08:43:19] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [08:43:20] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [08:43:20] (03PS1) 10JMeybohm: admin_ng: Lower the minimum pod memory request to 50Mi [deployment-charts] - 10https://gerrit.wikimedia.org/r/951050 (https://phabricator.wikimedia.org/T343978) [08:43:20] !log cgoubert@deploy1002 helmfile [codfw] START 
helmfile.d/services/mw-misc: apply [08:43:54] (03PS2) 10Clément Goubert: mediawiki: Allow autocomputing the memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/951045 (https://phabricator.wikimedia.org/T342748) [08:43:56] (03PS1) 10Clément Goubert: mediawiki: Autocompute requests and limits for all [deployment-charts] - 10https://gerrit.wikimedia.org/r/951051 (https://phabricator.wikimedia.org/T342748) [08:44:23] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1108.eqiad.wmnet with reason: host reimage [08:44:32] (03CR) 10Clément Goubert: [C: 03+1] admin_ng: Lower the minimum pod memory request to 50Mi [deployment-charts] - 10https://gerrit.wikimedia.org/r/951050 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [08:45:05] zabe: disregard, we found logs and cause, we are fixing [08:46:18] (03CR) 10JMeybohm: [C: 03+2] admin_ng: Lower the minimum pod memory request to 50Mi [deployment-charts] - 10https://gerrit.wikimedia.org/r/951050 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [08:48:27] claime: thanks (sorry, I just saw your message) [08:48:42] (03Merged) 10jenkins-bot: admin_ng: Lower the minimum pod memory request to 50Mi [deployment-charts] - 10https://gerrit.wikimedia.org/r/951050 (https://phabricator.wikimedia.org/T343978) (owner: 10JMeybohm) [08:50:06] (03PS2) 10Clément Goubert: mediawiki: Autocompute requests and limits for all [deployment-charts] - 10https://gerrit.wikimedia.org/r/951051 (https://phabricator.wikimedia.org/T342748) [08:50:33] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [08:51:49] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [08:51:58] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [08:52:30] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[08:52:42] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [08:53:23] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [08:54:11] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [08:55:44] (03PS3) 10Clément Goubert: mediawiki: Autocompute requests and limits for all [deployment-charts] - 10https://gerrit.wikimedia.org/r/951051 (https://phabricator.wikimedia.org/T342748) [08:56:16] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:04:41] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [09:05:28] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1108.eqiad.wmnet with OS bullseye [09:06:54] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: Allow autocomputing the memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/951045 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [09:08:43] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-api-int: autocompute memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/951052 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [09:10:16] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I'm not sure this will fit our clusters at the moment, still it seems to go in the right direction." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/951051 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [09:11:44] (03PS1) 10Clément Goubert: admin_ng: Raise max cpu per pod to 10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/951055 (https://phabricator.wikimedia.org/T343978) [09:14:44] (03CR) 10JMeybohm: [C: 03+1] admin_ng: Raise max cpu per pod to 10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/951055 (https://phabricator.wikimedia.org/T343978) (owner: 10Clément Goubert) [09:14:55] (03CR) 10Clément Goubert: [C: 03+2] admin_ng: Raise max cpu per pod to 10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/951055 (https://phabricator.wikimedia.org/T343978) (owner: 10Clément Goubert) [09:15:55] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3066.esams.wmnet [09:17:07] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3074.esams.wmnet [09:17:15] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:17:16] (03Merged) 10jenkins-bot: admin_ng: Raise max cpu per pod to 10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/951055 (https://phabricator.wikimedia.org/T343978) (owner: 10Clément Goubert) [09:17:16] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [09:18:19] !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:19:44] !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:20:04] !log cgoubert@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:21:15] !log cgoubert@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:22:02] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [09:23:36] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:24:14] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[09:24:55] ls -a [09:24:55] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3066.esams.wmnet [09:24:56] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host db1208.eqiad.wmnet [09:25:35] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:25:43] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [09:25:43] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [09:25:44] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [09:25:48] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [09:26:38] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3074.esams.wmnet [09:26:41] (tfw `ls -a` works on IRC /s) [09:29:09] 10SRE, 10SRE-OnFire, 10Incident Tooling, 10Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377 (10SLyngshede-WMF) > Definitely plausible. But it won't be a feature of the new IDM at launch, and we don't want the new IDM to be a blocker for... 
[09:33:26] (03PS1) 10Clément Goubert: admin_ng: Lower the minimum container memory request to 50Mi [deployment-charts] - 10https://gerrit.wikimedia.org/r/951059 (https://phabricator.wikimedia.org/T343978) [09:36:07] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:36:08] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [09:36:26] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [09:36:26] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [09:36:27] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [09:36:37] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host db1208.eqiad.wmnet [09:37:19] PROBLEM - Check systemd state on db1208 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:44] (03CR) 10Clément Goubert: [C: 03+2] admin_ng: Lower the minimum container memory request to 50Mi [deployment-charts] - 10https://gerrit.wikimedia.org/r/951059 (https://phabricator.wikimedia.org/T343978) (owner: 10Clément Goubert) [09:38:41] RECOVERY - Check systemd state on db1208 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:38:48] 10ops-knams: ManagementSSHDown - https://phabricator.wikimedia.org/T344593 (10phaultfinder) [09:40:02] (03Merged) 10jenkins-bot: admin_ng: Lower the minimum container memory request to 50Mi [deployment-charts] - 10https://gerrit.wikimedia.org/r/951059 (https://phabricator.wikimedia.org/T343978) (owner: 10Clément Goubert) [09:40:37] !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:40:46] !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[09:41:13] !log cgoubert@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:41:41] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:42:33] !log cgoubert@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:42:39] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3067.esams.wmnet [09:44:27] !log cgoubert@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:44:58] !log cgoubert@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:45:03] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [09:45:59] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1001.eqiad.wmnet [09:46:19] (03PS5) 10Btullis: Retain yarn logs for 60 days and compress with gzip [puppet] - 10https://gerrit.wikimedia.org/r/950191 (https://phabricator.wikimedia.org/T342923) [09:46:19] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [09:46:26] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:46:43] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/950191 (https://phabricator.wikimedia.org/T342923) (owner: 10Btullis) [09:47:25] !log btullis@cumin1001 START - Cookbook sre.presto.reboot-workers for Presto analytics cluster: Reboot Presto nodes [09:48:23] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[09:48:38] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [09:50:03] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:50:04] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [09:50:45] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:50:46] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [09:51:11] !log restarted prometheus@k8s on prometheus100[56] - T343529 [09:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:15] T343529: Prometheus doesn't reload or alert on expired client certificates - https://phabricator.wikimedia.org/T343529 [09:51:29] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1001.eqiad.wmnet [09:52:07] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3067.esams.wmnet [09:52:10] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3068.esams.wmnet [09:52:31] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add new dummy keytab for install3003 and remove install3002 [labs/private] - 10https://gerrit.wikimedia.org/r/949624 (owner: 10Muehlenhoff) [09:52:37] (03PS17) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [09:52:43] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2003.wikimedia.org [09:53:51] (03CR) 10Muehlenhoff: [C: 03+2] ferm network defs: Add $CLOUD_NETWORKS alias [puppet] - 10https://gerrit.wikimedia.org/r/944870 (owner: 10Muehlenhoff) [09:54:50] (03PS1) 10Jbond: admin: drop ssh key for dartmon [puppet] - 10https://gerrit.wikimedia.org/r/951064 (https://phabricator.wikimedia.org/T342968) [09:57:39] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42927/console" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [09:57:50] (03CR) 10Jbond: [C: 03+2] admin: drop ssh key for dartmon [puppet] - 10https://gerrit.wikimedia.org/r/951064 (https://phabricator.wikimedia.org/T342968) (owner: 10Jbond) [09:58:37] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1109.eqiad.wmnet with OS bullseye [09:58:53] mw-on-k8s deployments still halted while we handle a resource issue [09:59:20] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10jbond) @darthmon_wmde I have removed this ssh key from the production ssh config. please update this task with a new ssh key that is not used in the WMCS... [09:59:28] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2003.wikimedia.org [10:00:28] (03CR) 10Slyngshede: [V: 03+1] C:bigtop::hadoop move net-topology.py to files. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [10:00:37] !log jelto@cumin1001 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner [10:00:38] 10SRE, 10Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515 (10MoritzMuehlenhoff) [10:00:46] jouncebot: nowandnext [10:00:46] No deployments scheduled for the forseeable future! [10:00:46] No deployments scheduled for the forseeable future! [10:01:02] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [10:01:03] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [10:01:04] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [10:01:33] There's something wrong with https://wikitech.wikimedia.org/wiki/Deployments [10:03:51] (03CR) 10Muehlenhoff: [C: 03+2] Add a Firewall::Portrange define [puppet] - 10https://gerrit.wikimedia.org/r/947316 (owner: 10Muehlenhoff) [10:03:58] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1002.eqiad.wmnet [10:04:27] (03PS1) 10JMeybohm: kubernetes::node: Don't reserve CPUs for system [puppet] - 10https://gerrit.wikimedia.org/r/951065 (https://phabricator.wikimedia.org/T277876) [10:06:06] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cp3068.esams.wmnet [10:06:09] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3069.esams.wmnet [10:06:51] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 6 CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42928/console" [puppet] - 10https://gerrit.wikimedia.org/r/951065 (https://phabricator.wikimedia.org/T277876) (owner: 10JMeybohm) [10:06:54] (03CR) 10Btullis: Retain yarn logs for 60 days and compress with gzip (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/950191 
(https://phabricator.wikimedia.org/T342923) (owner: 10Btullis) [10:06:55] PROBLEM - Check systemd state on cp3068 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:57] (03CR) 10Hnowlan: [C: 03+2] thumbor: remove thumbor server configuration [puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [10:07:08] (03Abandoned) 10Muehlenhoff: Extend ganeti Netbox sync for new knams hosts [puppet] - 10https://gerrit.wikimedia.org/r/948130 (owner: 10Muehlenhoff) [10:07:29] (03CR) 10Clément Goubert: [C: 03+1] kubernetes::node: Don't reserve CPUs for system [puppet] - 10https://gerrit.wikimedia.org/r/951065 (https://phabricator.wikimedia.org/T277876) (owner: 10JMeybohm) [10:07:32] (03PS2) 10Muehlenhoff: confd: Explicitly require directory for systemd cleanup timer [puppet] - 10https://gerrit.wikimedia.org/r/949496 [10:08:28] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes::node: Don't reserve CPUs for system [puppet] - 10https://gerrit.wikimedia.org/r/951065 (https://phabricator.wikimedia.org/T277876) (owner: 10JMeybohm) [10:08:49] (03CR) 10Jbond: [C: 04-1] "-1 see inline" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [10:10:05] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1002.eqiad.wmnet [10:12:27] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1109.eqiad.wmnet with reason: host reimage [10:13:12] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [10:14:25] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10jbond) p:05Triage→03Medium [10:14:37] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-conf1003.eqiad.wmnet 
[10:15:14] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3069.esams.wmnet [10:15:17] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3070.esams.wmnet [10:15:22] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1109.eqiad.wmnet with reason: host reimage [10:15:27] (03CR) 10Muehlenhoff: [C: 03+2] confd: Explicitly require directory for systemd cleanup timer [puppet] - 10https://gerrit.wikimedia.org/r/949496 (owner: 10Muehlenhoff) [10:15:51] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:16:04] (03PS5) 10Stevemunene: datahub: Enable OIDC to idp_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) [10:17:11] 10SRE, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Define scap::sources in a way that is shared between prod and beta - https://phabricator.wikimedia.org/T196034 (10jbond) [10:20:22] (03CR) 10Stevemunene: datahub: Enable OIDC to idp_test (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [10:20:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-conf1003.eqiad.wmnet [10:21:47] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for ipmiseld [puppet] - 10https://gerrit.wikimedia.org/r/945755 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:22:44] 10SRE, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Define scap::sources in a way that is shared between prod and beta - https://phabricator.wikimedia.org/T196034 (10jbond) I think nthis request is ultimatly asking for `$_role` support in the wmcs puppet 
hiera c... [10:23:00] (HelmReleaseBadStatus) firing: (2) Helm release mw-web/canary on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:24:00] 10SRE, 10ops-codfw, 10serviceops: Decommission thumbor200[3456] - https://phabricator.wikimedia.org/T344597 (10hnowlan) [10:24:34] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3070.esams.wmnet [10:24:37] ^The above Helm issue is known, and will be fixed in a few minutes as soon as a puppet patch to free some resources is deployed everywhere [10:24:38] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3071.esams.wmnet [10:24:41] 10SRE, 10SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (10jbond) 05In progress→03Resolved Closing this AFAICT all actions have been completed but please reopen if there is still so... 
[10:24:45] This doesn't have direct production impact [10:24:58] (ConfdResourceFailed) firing: (120) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:25:47] 10SRE, 10ops-eqiad, 10serviceops: Decommission thumbor100[1256] - https://phabricator.wikimedia.org/T344598 (10hnowlan) [10:26:52] (03CR) 10Btullis: [C: 03+1] datahub: Enable OIDC to idp_test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [10:27:12] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts thumbor[1001-1002,1005-1006].eqiad.wmnet [10:29:40] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-coord1003.eqiad.wmnet [10:30:23] RECOVERY - Check systemd state on cp3068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:53] (03PS7) 10Muehlenhoff: Extend the firewall::service shim with checks for legacy syntax [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) [10:33:32] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1110.eqiad.wmnet with OS bullseye [10:34:22] 10SRE, 10SRE-Access-Requests: Requesting access to Wiki Replicas end-to-end tiers for dr0ptp4kt - https://phabricator.wikimedia.org/T343039 (10jbond) 05In progress→03Resolved a:03jbond >>! In T343039#9078943, @dr0ptp4kt wrote: > I'm going to open a separate task for the `wmcs-admin` membership, but leave... 
[10:35:02] (03PS1) 10Muehlenhoff: prometheus::haproxy_exporter: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/951070 [10:36:15] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1003.eqiad.wmnet [10:36:28] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-coord1004.eqiad.wmnet [10:37:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:38:06] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951070 (owner: 10Muehlenhoff) [10:38:40] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1109.eqiad.wmnet with OS bullseye [10:38:41] !log btullis@cumin1001 Added views for new wiki: suwikisource T343547 [10:38:41] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [10:38:45] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cp3071.esams.wmnet [10:38:46] T343547: Prepare and check storage layer for suwikisource - https://phabricator.wikimedia.org/T343547 [10:38:49] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3072.esams.wmnet [10:39:04] (03CR) 10Btullis: [C: 03+2] Update the default user role in Superset to be 'WMF Analyst' [puppet] - 10https://gerrit.wikimedia.org/r/950157 (https://phabricator.wikimedia.org/T328457) (owner: 10Btullis) [10:39:29] !log klausman@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-cache1001.eqiad.wmnet [10:39:39] PROBLEM - Check systemd state on cp3071 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:41:45] !log Redeploying mw-on-k8s [10:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:41:55] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:41:59] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:42:00] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:42:03] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:42:04] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [10:42:59] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [10:43:00] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:43:18] (ProbeDown) firing: (2) Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:43:23] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1004.eqiad.wmnet [10:43:44] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:43:45] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [10:44:06] !log hnowlan@cumin1001 START - Cookbook sre.dns.netbox [10:44:47] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [10:44:48] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [10:45:34] !log klausman@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-cache1001.eqiad.wmnet [10:45:42] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [10:45:43] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [10:46:27] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [10:46:28] !log cgoubert@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mw-api-ext: apply [10:46:32] 10SRE, 10MW-on-K8s, 10Observability-Logging, 10serviceops: Keep calculating latencies for MediaWiki requests in the WikiKube environment - https://phabricator.wikimedia.org/T276095 (10kamila) a:03kamila [10:46:41] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:46:49] !log klausman@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Reboot to activate microcode and security updates for T344587 - klausman@cumin1001 [10:47:19] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [10:47:20] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [10:47:43] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [10:47:45] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [10:48:04] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3072.esams.wmnet [10:48:07] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3073.esams.wmnet [10:48:08] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [10:48:11] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1110.eqiad.wmnet with reason: host reimage [10:48:18] (ProbeDown) resolved: (2) Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:48:39] !log mw-on-k8s up to date with bare metal [10:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:06] 
!log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1110.eqiad.wmnet with reason: host reimage [10:51:30] (HelmReleaseBadStatus) resolved: (2) Helm release mw-web/canary on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-web - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:51:36] (ProbeDown) firing: (8) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:51:41] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:52:36] (ProbeDown) firing: (8) Service text-https:443 has failed probes (http_text-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:52:45] Yeah, that page is mw-api-int I think [10:52:51] topranks: eoghan, checking [10:52:54] should we worry? [10:53:10] text-https seems to be constrained to esams [10:53:11] that's for 1% of traffic, right? [10:53:20] and esams is currently depooled [10:53:21] jynus: yeah [10:53:31] hi, this is related to a reboot activity; it should have no impact, but sorry for the alert [10:53:33] (ProbeDown) firing: (9) Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:53:34] I was referring to k8s mw [10:53:43] vgutierrez: Am I reading the graph wrong? 
[10:54:02] Ah no, I got misled by the legend [10:54:04] fabfur: thanks for confirming [10:54:06] Carry on [10:54:12] I see, I was misled too [10:54:18] PROBLEM - PyBal IPVS diff check on lvs3008 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:54:18] PROBLEM - PyBal IPVS diff check on lvs3010 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:54:28] claime, jynus: thanks for jumping on [10:54:37] especially as there was a recent deployment [10:55:10] as a side note, I think the new alerts are great in functionality, but the text is not great [10:55:18] I think the idea is to re-pool esams this afternoon (EU time) once suk.he is online [10:55:28] jynus: Yeah, because we're sending the summary, and it's a bit too summarized [10:55:34] the ones from alertmanager / proving [10:55:49] jynus: I agree on the alerts, hard to drill down on them [10:56:12] fabfur: I will resolve this one, is that ok? 
[10:56:14] we should probably send the description instead of the summary [10:56:16] I mentioned this to obs, maybe we should create a ticket to try to brainstorm better options [10:56:25] or make the summaries more verbose [10:56:44] +1, but maybe it is not that easy due to aggregation [10:56:53] Possibly yeah [10:56:59] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3073.esams.wmnet [10:57:05] but I would like to explore this topic further [10:57:09] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [10:57:11] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Allow autocomputing the memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/951045 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [10:57:34] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-mariadb1001.eqiad.wmnet [10:58:24] (03Merged) 10jenkins-bot: mediawiki: Allow autocomputing the memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/951045 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [10:59:44] !log hnowlan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: thumbor[1001-1002,1005-1006].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin1001" [10:59:49] !log Deploying memory limit autocompute for mw-on-k8s - T342748 [10:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:53] T342748: mw-on-k8s app container CPU throttling at low average load - https://phabricator.wikimedia.org/T342748 [10:59:54] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:59:58] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:59:59] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:00:03] !log cgoubert@deploy1002 
helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:00:04] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [11:00:04] RECOVERY - Check systemd state on cp3071 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:08] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [11:00:09] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [11:00:12] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [11:00:13] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:00:16] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:00:17] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:00:20] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:00:21] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [11:00:25] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [11:00:26] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [11:00:29] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [11:00:30] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [11:00:33] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [11:00:34] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [11:00:37] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [11:00:54] (03CR) 10Clément Goubert: [C: 03+2] mw-api-int: autocompute memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/951052 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [11:01:40] !log jelto@cumin1001 END 
(PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner [11:01:47] (03Merged) 10jenkins-bot: mw-api-int: autocompute memory limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/951052 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [11:02:07] !log Enabling memory limit autocompute for mw-api-int - T342748 [11:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-mariadb1001.eqiad.wmnet [11:02:21] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:02:25] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: thumbor[1001-1002,1005-1006].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin1001" [11:02:25] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:02:26] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts thumbor[1001-1002,1005-1006].eqiad.wmnet [11:02:32] 10SRE, 10ops-eqiad, 10serviceops: Decommission thumbor100[1256] - https://phabricator.wikimedia.org/T344598 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by hnowlan@cumin1001 for hosts: `thumbor[1001-1002,1005-1006].eqiad.wmnet` - thumbor1001.eqiad.wmnet (**PASS**) - Downtimed host on Ic... 
[11:02:44] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:58] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:34] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:38] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:03:50] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:03:52] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:04:06] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:04:06] 10SRE, 10ops-eqiad, 10serviceops: Decommission thumbor100[1256] - https://phabricator.wikimedia.org/T344598 (10hnowlan) [11:04:11] !log hnowlan@cumin1001 START - Cookbook sre.hosts.decommission for hosts thumbor[2003-2006].codfw.wmnet [11:04:21] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:04:29] !log klausman@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Reboot to activate microcode and security updates for T344587 - klausman@cumin1001 [11:05:03] (03PS18) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [11:07:14] (03PS19) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [11:07:20] !log klausman@cumin1001 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:ml-cache-eqiad [11:07:27] (03PS1) 10Muehlenhoff: dragonfly::dfdaemon: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/951079 [11:08:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:11:35] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42929/console" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [11:12:36] (ProbeDown) firing: (10) Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:12:41] (03CR) 10Slyngshede: [V: 03+1] C:bigtop::hadoop move net-topology.py to files. 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [11:12:51] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1110.eqiad.wmnet with OS bullseye [11:13:33] (ProbeDown) firing: (10) Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:13:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:15:42] (03PS3) 10Urbanecm: [beta] Growth: Enable user research opt-in checkbox on few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949849 (https://phabricator.wikimedia.org/T342353) [11:16:41] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:18:35] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [11:19:01] (03CR) 10Sergio Gimeno: [C: 03+1] [beta] Growth: Enable user research opt-in checkbox on few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949849 (https://phabricator.wikimedia.org/T342353) (owner: 10Urbanecm) [11:21:19] jouncebot: nowandnext [11:21:19] No deployments scheduled for the forseeable future! [11:21:19] No deployments scheduled for the forseeable future! [11:21:24] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/951086 [11:21:37] calendar's stuck in the past it seems. /me goes ahead. 
[11:22:07] urbanecm: yeah, i've relayed to releng [11:22:12] ty claime [11:22:20] (03PS2) 10Urbanecm: Revert "Growth: Temporarily disable link-recommendation FE on arwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949585 (https://phabricator.wikimedia.org/T316079) [11:22:23] (03CR) 10Urbanecm: [C: 03+2] Revert "Growth: Temporarily disable link-recommendation FE on arwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949585 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm) [11:22:33] there was an issue with mw-on-k8s deployments this morning, but we've fixed it, so you should be good [11:22:36] (ProbeDown) firing: (12) Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:23:06] great! [11:23:13] (that you fixed it, not that it happened, heh) [11:23:13] (03Merged) 10jenkins-bot: Revert "Growth: Temporarily disable link-recommendation FE on arwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949585 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm) [11:23:18] (03PS1) 10Cathal Mooney: Add static routes for new ns2.wikimedia.org IP in esams [homer/public] - 10https://gerrit.wikimedia.org/r/951106 (https://phabricator.wikimedia.org/T343942) [11:23:30] !log btullis@cumin1001 Added views for new wiki: blkwiktionary T343541 [11:23:30] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [11:23:34] T343541: Prepare and check storage layer for blkwiktionary - https://phabricator.wikimedia.org/T343541 [11:23:36] (ProbeDown) firing: (12) Service ml-cache1001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:23:42] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:949585|Revert "Growth: Temporarily disable link-recommendation FE on arwiki" (T316079)]] [11:23:46] T316079: Bump threshold for confidence score on link recommendation service 
suggestions - https://phabricator.wikimedia.org/T316079 [11:24:12] !log btullis@cumin1001 END (PASS) - Cookbook sre.presto.reboot-workers (exit_code=0) for Presto analytics cluster: Reboot Presto nodes [11:25:15] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:949585|Revert "Growth: Temporarily disable link-recommendation FE on arwiki" (T316079)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [11:25:42] !log urbanecm@deploy1002 urbanecm: Continuing with sync [11:26:34] PROBLEM - Host ps1-oe16-esams is DOWN: PING CRITICAL - Packet loss = 100% [11:26:42] PROBLEM - Host ps1-oe14-esams is DOWN: PING CRITICAL - Packet loss = 100% [11:26:53] (03CR) 10Urbanecm: [C: 03+1] "code LGTM; still pending communication." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950168 (https://phabricator.wikimedia.org/T344319) (owner: 10Sergio Gimeno) [11:26:56] PROBLEM - Host ps1-oe15-esams is DOWN: PING CRITICAL - Packet loss = 100% [11:27:27] ^ what happened here? 
[11:27:38] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Idle - PyBal, AS64600/IPv4: Idle - PyBal, AS64600/IPv4: Idle - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:28:13] XioNoX: ^ [11:29:06] sukhe: those are the old rack's PDUs, so I guess downtime expired and we didn't scrub them from puppet yet [11:29:16] oh thankfully [11:31:09] !log hnowlan@cumin1001 START - Cookbook sre.dns.netbox [11:31:48] https://www.irccloud.com/pastebin/86L0v5m6/ [11:31:51] sukhe: ^ [11:31:55] old IPs too [11:32:09] (03CR) 10Ssingh: [C: 03+1] Add static routes for new ns2.wikimedia.org IP in esams [homer/public] - 10https://gerrit.wikimedia.org/r/951106 (https://phabricator.wikimedia.org/T343942) (owner: 10Cathal Mooney) [11:32:25] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus::haproxy_exporter: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/951070 (owner: 10Muehlenhoff) [11:32:25] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:949585|Revert "Growth: Temporarily disable link-recommendation FE on arwiki" (T316079)]] (duration: 08m 42s) [11:32:26] ah right [11:32:29] T316079: Bump threshold for confidence score on link recommendation service suggestions - https://phabricator.wikimedia.org/T316079 [11:32:32] * urbanecm done [11:32:36] (ProbeDown) firing: (12) Service ml-cache1002-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:33:30] !log hnowlan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: thumbor[2003-2006].codfw.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin1001" [11:34:29] sukhe: should recover [11:34:40] !log klausman@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:ml-cache-eqiad [11:34:53] !log klausman@cumin1001 START - Cookbook sre.cassandra.roll-reboot rolling 
reboot on A:ml-cache-codfw [11:35:20] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: thumbor[2003-2006].codfw.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin1001" [11:35:20] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:35:21] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts thumbor[2003-2006].codfw.wmnet [11:35:37] 10SRE, 10ops-codfw, 10serviceops: Decommission thumbor200[3456] - https://phabricator.wikimedia.org/T344597 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by hnowlan@cumin1001 for hosts: `thumbor[2003-2006].codfw.wmnet` - thumbor2003.codfw.wmnet (**PASS**) - Downtimed host on Icinga/Alert... [11:35:41] 10SRE, 10Infrastructure-Foundations, 10netops: Add non-EVPN L3 Switch routing policy definitions to Homer - https://phabricator.wikimedia.org/T344601 (10cmooney) p:05Triage→03Low [11:35:52] 10SRE, 10Infrastructure-Foundations, 10netops: Add non-EVPN L3 Switch routing policy definitions to Homer - https://phabricator.wikimedia.org/T344601 (10cmooney) [11:35:59] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Consolidate Automation Templates for DC Switches - https://phabricator.wikimedia.org/T312635 (10cmooney) [11:36:35] (03CR) 10Cathal Mooney: [C: 03+2] Add static routes for new ns2.wikimedia.org IP in esams [homer/public] - 10https://gerrit.wikimedia.org/r/951106 (https://phabricator.wikimedia.org/T343942) (owner: 10Cathal Mooney) [11:37:09] (03PS1) 10Ayounsi: Esams: update PDUs in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/951112 [11:37:17] (03CR) 10Muehlenhoff: [C: 03+2] prometheus::haproxy_exporter: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/951070 (owner: 10Muehlenhoff) [11:37:31] (03PS2) 10Ayounsi: Esams: update PDUs in monitoring [puppet] - 
10https://gerrit.wikimedia.org/r/951112 [11:37:36] (ProbeDown) firing: (12) Service ml-cache1002-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:36] (03Merged) 10jenkins-bot: Add static routes for new ns2.wikimedia.org IP in esams [homer/public] - 10https://gerrit.wikimedia.org/r/951106 (https://phabricator.wikimedia.org/T343942) (owner: 10Cathal Mooney) [11:38:33] (ProbeDown) firing: (14) Service ml-cache1002-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:38:53] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/951112 (owner: 10Ayounsi) [11:39:28] (03CR) 10Muehlenhoff: [C: 03+2] Extend the firewall::service shim with checks for legacy syntax [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:39:50] (03CR) 10Ssingh: [C: 03+1] Esams: update PDUs in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/951112 (owner: 10Ayounsi) [11:39:52] (03CR) 10Ayounsi: [C: 03+2] Esams: update PDUs in monitoring [puppet] - 10https://gerrit.wikimedia.org/r/951112 (owner: 10Ayounsi) [11:40:06] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-mariadb1002.eqiad.wmnet [11:42:17] (03PS1) 10Zabe: Revert "Revert "add su namespace translations"" [extensions/ProofreadPage] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950811 [11:42:24] (03CR) 10Zabe: [C: 03+2] Revert "Revert "add su namespace translations"" [extensions/ProofreadPage] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950811 (owner: 10Zabe) [11:42:36] (ProbeDown) firing: (12) Service ml-cache1003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:44:44] !log klausman@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching A:ml-cache-eqiad: Restart to pick up OpenJDK 11 
security updates - klausman@cumin1001 [11:46:41] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-mariadb1002.eqiad.wmnet [11:46:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [extensions/ProofreadPage] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950811 (owner: 10Zabe) [11:47:01] (03CR) 10Jbond: "nit inline otherwise lgtm but im not familiar with what what this is used for" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [11:52:36] (ProbeDown) firing: (14) Service ml-cache1003-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:55:13] !log switch config-master.w.o to config-master hosts [11:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:18] (03Merged) 10jenkins-bot: Revert "Revert "add su namespace translations"" [extensions/ProofreadPage] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950811 (owner: 10Zabe) [11:55:20] (03CR) 10Jbond: [C: 03+2] trafficserver: update config-master to use discovery record [puppet] - 10https://gerrit.wikimedia.org/r/949515 (https://phabricator.wikimedia.org/T341717) (owner: 10Jbond) [11:55:28] !log installing openjdk-11 security updates [11:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:36] !log zabe@deploy1002 Started scap: Backport for [[gerrit:950811|Revert "Revert "add su namespace translations""]] [11:55:38] (03PS20) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. 
[puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) [11:56:12] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache puppetboard.wikimedia.org on all recursors [11:56:12] !log jbond@cumin1001 END (ERROR) - Cookbook sre.dns.wipe-cache (exit_code=97) puppetboard.wikimedia.org on all recursors [11:56:13] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [11:56:23] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache config-master.wikimedia.org on all recursors [11:56:26] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.wikimedia.org on all recursors [11:57:00] !log zabe@deploy1002 zabe: Backport for [[gerrit:950811|Revert "Revert "add su namespace translations""]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [11:57:36] (ProbeDown) firing: (12) Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:57:53] !log zabe@deploy1002 zabe: Continuing with sync [11:59:09] (03PS1) 10Jbond: README: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/951114 [12:00:25] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. 
[12:01:41] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:01:55] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-jumbo1010.eqiad.wmnet [12:02:22] !log klausman@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:ml-cache-eqiad: Restart to pick up OpenJDK 11 security updates - klausman@cumin1001 [12:02:36] (ProbeDown) firing: (12) Service ml-cache2002-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:03:03] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host cephosd1001.eqiad.wmnet [12:03:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:03:23] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951079 (owner: 10Muehlenhoff) [12:03:39] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42930/console" [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [12:03:50] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:950811|Revert "Revert "add su namespace translations""]] (duration: 08m 13s) [12:03:52] * zabe done [12:04:39] !log klausman@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on 
A:ml-cache-codfw [12:04:48] * urbanecm deploys a beta only patch [12:05:14] (03PS4) 10Urbanecm: [beta] Growth: Enable user research opt-in checkbox on few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949849 (https://phabricator.wikimedia.org/T342353) [12:05:16] (03CR) 10Urbanecm: [C: 03+2] [beta] Growth: Enable user research opt-in checkbox on few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949849 (https://phabricator.wikimedia.org/T342353) (owner: 10Urbanecm) [12:05:58] (03Merged) 10jenkins-bot: [beta] Growth: Enable user research opt-in checkbox on few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/949849 (https://phabricator.wikimedia.org/T342353) (owner: 10Urbanecm) [12:06:30] (03CR) 10Slyngshede: C:bigtop::hadoop move net-topology.py to files. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [12:06:41] (JobUnavailable) firing: (7) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:06:44] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [12:07:28] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. 
[12:07:36] (ProbeDown) firing: (12) Service ml-cache2002-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:09:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-jumbo1010.eqiad.wmnet [12:09:15] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-jumbo1011.eqiad.wmnet [12:11:17] * urbanecm done [12:12:39] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1001.eqiad.wmnet [12:12:42] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host cephosd1002.eqiad.wmnet [12:13:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:13:46] !log klausman@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-staging-worker [12:13:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. 
[12:16:36] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-jumbo1011.eqiad.wmnet [12:16:38] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-jumbo1012.eqiad.wmnet [12:16:46] PROBLEM - SSH on gitlab2002 is CRITICAL: connect to address 208.80.153.7 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:22:02] PROBLEM - Host gitlab2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:22:15] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1002.eqiad.wmnet [12:22:18] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host cephosd1003.eqiad.wmnet [12:23:45] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-jumbo1012.eqiad.wmnet [12:23:48] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-jumbo1013.eqiad.wmnet [12:25:52] (03CR) 10Jbond: [C: 03+2] README: test puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/951114 (owner: 10Jbond) [12:27:28] RECOVERY - Host gitlab2002 is UP: PING OK - Packet loss = 0%, RTA = 32.67 ms [12:29:23] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-jumbo1013.eqiad.wmnet [12:29:26] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-jumbo1014.eqiad.wmnet [12:31:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [12:31:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1130.eqiad.wmnet with reason: Maintenance [12:31:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T344589)', diff saved to https://phabricator.wikimedia.org/P50603 and previous config saved to /var/cache/conftool/dbconfig/20230821-123123-ladsgroup.json [12:31:49] !log btullis@cumin1001 END 
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1003.eqiad.wmnet [12:31:52] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host cephosd1004.eqiad.wmnet [12:34:38] PROBLEM - Host gitlab2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:35:03] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-jumbo1014.eqiad.wmnet [12:35:05] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-jumbo1015.eqiad.wmnet [12:35:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [12:35:28] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. [12:35:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2101.codfw.wmnet with reason: Maintenance [12:36:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T344589)', diff saved to https://phabricator.wikimedia.org/P50604 and previous config saved to /var/cache/conftool/dbconfig/20230821-123631-ladsgroup.json [12:38:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [12:39:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2111.codfw.wmnet with reason: Maintenance [12:39:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2111 (T344589)', diff saved to https://phabricator.wikimedia.org/P50605 and previous config saved to /var/cache/conftool/dbconfig/20230821-123906-ladsgroup.json [12:40:35] (03PS1) 10Muehlenhoff: firewall: Add SSH rules in firewall-agnostic form [puppet] - 10https://gerrit.wikimedia.org/r/951118 (https://phabricator.wikimedia.org/T336497) [12:40:44] !log btullis@cumin1001 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host kafka-jumbo1015.eqiad.wmnet [12:40:58] (03CR) 10CI reject: [V: 04-1] firewall: Add SSH rules in firewall-agnostic form [puppet] - 10https://gerrit.wikimedia.org/r/951118 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:41:03] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:41:30] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1004.eqiad.wmnet [12:41:32] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host cephosd1005.eqiad.wmnet [12:41:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. 
[12:41:49] (03PS1) 10Urbanecm: revalidateLinkRecommendations: Load scoreLessThan correctly [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950812 (https://phabricator.wikimedia.org/T316079) [12:42:06] !log enabling BGP over Lumen transport cr2-eqiad to cr1-esams [12:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:13] (03PS1) 10Urbanecm: LinkRecommendationUpdater: Load link-recommendation even if disabled [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950813 (https://phabricator.wikimedia.org/T344343) [12:42:44] (03PS1) 10Muehlenhoff: Remove access for skvjold [puppet] - 10https://gerrit.wikimedia.org/r/951119 [12:43:22] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch1001.eqiad.wmnet [12:44:38] !log jelto@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host gitlab2002.wikimedia.org [12:44:51] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for skvjold [puppet] - 10https://gerrit.wikimedia.org/r/951119 (owner: 10Muehlenhoff) [12:45:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T344589)', diff saved to https://phabricator.wikimedia.org/P50606 and previous config saved to /var/cache/conftool/dbconfig/20230821-124529-ladsgroup.json [12:46:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:46:46] RECOVERY - Host gitlab2002 is UP: PING OK - Packet loss = 0%, RTA = 31.70 ms [12:47:38] RECOVERY - SSH on gitlab2002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:47:48] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL 
- starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:14] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:31] ^ working on gitlab2002 [12:51:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch1001.eqiad.wmnet [12:51:07] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch1002.eqiad.wmnet [12:51:11] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1005.eqiad.wmnet [12:51:30] (03PS1) 10Giuseppe Lavagetto: benthos::instance: remove unused parameter port [puppet] - 10https://gerrit.wikimedia.org/r/951120 [12:51:32] (03PS1) 10Giuseppe Lavagetto: benthos: stop using strings as configuration [puppet] - 10https://gerrit.wikimedia.org/r/951121 [12:51:32] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:51:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P50607 and previous config saved to /var/cache/conftool/dbconfig/20230821-125137-ladsgroup.json [12:51:41] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:52:56] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42931/console" [puppet] - 10https://gerrit.wikimedia.org/r/951120 (owner: 
10Giuseppe Lavagetto) [12:53:09] !log setting Lumen transport esams eqiad to default OSPF cost of 800 (bring circuit into normal usage) [12:53:10] PROBLEM - Host gitlab2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:35] 10SRE, 10Traffic: NIC autonegotiation takes 4s in esams - https://phabricator.wikimedia.org/T344604 (10Fabfur) [12:53:54] RECOVERY - Host gitlab2002 is UP: PING OK - Packet loss = 0%, RTA = 32.29 ms [12:54:57] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42932/console" [puppet] - 10https://gerrit.wikimedia.org/r/951121 (owner: 10Giuseppe Lavagetto) [12:55:53] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1111.eqiad.wmnet with OS bullseye [12:55:59] jouncebot: nowandnext [12:55:59] No deployments scheduled for the forseeable future! [12:55:59] No deployments scheduled for the forseeable future! [12:56:06] nice? [12:56:13] (03CR) 10Ladsgroup: [C: 03+2] manage-dblist: Add lang to langlist if not present [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950132 (owner: 10Zabe) [12:56:20] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1112.eqiad.wmnet with OS bullseye [12:56:27] Amir1: no calendar yet for this week [12:56:49] yeah I knew it already, this is not the first time [12:56:51] 10SRE, 10Traffic: NIC autonegotiation takes 4s in esams - https://phabricator.wikimedia.org/T344604 (10cmooney) Hi @Fabfur it won't be anything related to configuration on either side I would expect. Possibly the different JunOS version or NIC firmware (I'd say latter more likely). Or maybe to do with the pa... 
[12:57:01] (03Merged) 10jenkins-bot: manage-dblist: Add lang to langlist if not present [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950132 (owner: 10Zabe) [12:57:13] !log klausman@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-staging-worker [12:58:17] !log klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2001.codfw.wmnet [12:58:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch1002.eqiad.wmnet [12:58:24] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [12:58:26] (03PS2) 10Muehlenhoff: firewall: Add SSH rules in firewall-agnostic form [puppet] - 10https://gerrit.wikimedia.org/r/951118 (https://phabricator.wikimedia.org/T336497) [13:00:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P50608 and previous config saved to /var/cache/conftool/dbconfig/20230821-130036-ladsgroup.json [13:00:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [13:01:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [13:01:14] (03CR) 10Muehlenhoff: Blacklist exfat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/950145 (owner: 10Muehlenhoff) [13:01:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T342617)', diff saved to https://phabricator.wikimedia.org/P50609 and previous config saved to /var/cache/conftool/dbconfig/20230821-130118-ladsgroup.json [13:01:24] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [13:02:46] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2001.codfw.wmnet [13:03:19] !log 
klausman@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-ctrl2002.codfw.wmnet [13:03:26] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951118 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:03:56] (03CR) 10Ladsgroup: [C: 03+1] Use ClusterConfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951046 (owner: 10Giuseppe Lavagetto) [13:04:21] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:04:24] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:04:55] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [13:05:26] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130', diff saved to https://phabricator.wikimedia.org/P50610 and previous config saved to /var/cache/conftool/dbconfig/20230821-130643-ladsgroup.json [13:07:27] jouncebot: nowandnext [13:07:27] No deployments scheduled for the forseeable future! [13:07:27] No deployments scheduled for the forseeable future! 
[13:07:36] ah, right [13:09:20] !log klausman@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-ctrl2002.codfw.wmnet [13:09:21] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:10:02] (03CR) 10Zabe: [C: 03+2] SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035) (owner: 10Winston Sung) [13:10:40] !log disabling puppet and rebooting cp3075 to test network configuration (T344604) [13:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:44] T344604: NIC autonegotiation takes 4s in esams - https://phabricator.wikimedia.org/T344604 [13:10:44] (03Merged) 10jenkins-bot: SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035) (owner: 10Winston Sung) [13:10:53] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1111.eqiad.wmnet with reason: host reimage [13:10:57] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1112.eqiad.wmnet with reason: host reimage [13:11:02] !log zabe@deploy1002 Started scap: Backport for [[gerrit:884494|SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes (T172035 T111876)]] [13:11:08] T172035: Blockers for Wikimedia wiki domain renaming - https://phabricator.wikimedia.org/T172035 [13:11:08] T111876: Figure out how renaming 
languages without database names is supposed to work with SiteMatrix, and implement for be-x-old -> be-tarask - https://phabricator.wikimedia.org/T111876 [13:11:53] <_joe_> zabe: uhhh right for some reason the deployments page is not updated? [13:12:18] yeah, tyler needs to do that once he's awake [13:12:20] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3075.esams.wmnet [13:12:31] !log zabe@deploy1002 zabe and wsung: Backport for [[gerrit:884494|SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes (T172035 T111876)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:12:58] !log zabe@deploy1002 zabe and wsung: Continuing with sync [13:13:49] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1111.eqiad.wmnet with reason: host reimage [13:15:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111', diff saved to https://phabricator.wikimedia.org/P50611 and previous config saved to /var/cache/conftool/dbconfig/20230821-131542-ladsgroup.json [13:15:44] (PS1) Muehlenhoff: Add a nftables::file::service define to install a custom nftables input rule [puppet] - https://gerrit.wikimedia.org/r/951123 (https://phabricator.wikimedia.org/T336497) [13:16:07] (PS1) JMeybohm: confd: Move -prefix from confd cli to confd::file instances [puppet] - https://gerrit.wikimedia.org/r/951124 (https://phabricator.wikimedia.org/T341554) [13:16:20] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1112.eqiad.wmnet with reason: host reimage [13:16:29] (CR) CI reject: [V: -1] confd: Move -prefix from confd cli to confd::file instances [puppet] - https://gerrit.wikimedia.org/r/951124
(https://phabricator.wikimedia.org/T341554) (owner: JMeybohm) [13:16:46] PROBLEM - puppet last run on gitlab1003 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:16:50] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org [13:17:57] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3075.esams.wmnet [13:18:55] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:884494|SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes (T172035 T111876)]] (duration: 07m 53s) [13:19:00] T172035: Blockers for Wikimedia wiki domain renaming - https://phabricator.wikimedia.org/T172035 [13:19:00] T111876: Figure out how renaming languages without database names is supposed to work with SiteMatrix, and implement for be-x-old -> be-tarask - https://phabricator.wikimedia.org/T111876 [13:19:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:19:42] PROBLEM - purged service on cp3075 is CRITICAL: CRITICAL - Expecting active but unit purged is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:19:56] (PS2) JMeybohm: confd: Move -prefix from confd cli to confd::file instances [puppet] - https://gerrit.wikimedia.org/r/951124 (https://phabricator.wikimedia.org/T341554) [13:21:20] (CR) Ladsgroup: ClusterConfig: also allow to return hostname (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/951047 (owner: Giuseppe Lavagetto) [13:21:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T344589)', diff saved to https://phabricator.wikimedia.org/P50612 and previous config saved
to /var/cache/conftool/dbconfig/20230821-132150-ladsgroup.json [13:21:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [13:21:58] (CR) CI reject: [V: -1] confd: Move -prefix from confd cli to confd::file instances [puppet] - https://gerrit.wikimedia.org/r/951124 (https://phabricator.wikimedia.org/T341554) (owner: JMeybohm) [13:22:07] ops-codfw, serviceops-radar, Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (Jhancock.wm) a:Jhancock.wm [13:22:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance [13:22:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P50613 and previous config saved to /var/cache/conftool/dbconfig/20230821-132214-ladsgroup.json [13:22:30] ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T344269 (Jhancock.wm) a:Jhancock.wm [13:23:22] RECOVERY - puppet last run on gitlab1003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:23:24] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/951118 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [13:23:26] ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T344269 (Jhancock.wm) p:Triage→Medium It doesn't appear that the alert is firing anymore.
Will check back to make sure that it has remained stable by EoD [13:23:34] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org [13:23:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T344589)', diff saved to https://phabricator.wikimedia.org/P50614 and previous config saved to /var/cache/conftool/dbconfig/20230821-132341-ladsgroup.json [13:24:04] (PS3) JMeybohm: confd: Move -prefix from confd cli to confd::file instances [puppet] - https://gerrit.wikimedia.org/r/951124 (https://phabricator.wikimedia.org/T341554) [13:24:34] (PS4) JMeybohm: confd: Move -prefix from confd cli to confd::file instances [puppet] - https://gerrit.wikimedia.org/r/951124 (https://phabricator.wikimedia.org/T341554) [13:24:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:26:42] (CR) JMeybohm: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42936/console" [puppet] - https://gerrit.wikimedia.org/r/951124 (https://phabricator.wikimedia.org/T341554) (owner: JMeybohm) [13:29:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P50615 and previous config saved to /var/cache/conftool/dbconfig/20230821-132945-ladsgroup.json [13:30:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2111 (T344589)', diff saved to https://phabricator.wikimedia.org/P50616 and previous config saved to /var/cache/conftool/dbconfig/20230821-133048-ladsgroup.json [13:30:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason:
Maintenance [13:31:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance [13:31:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2123 (T344589)', diff saved to https://phabricator.wikimedia.org/P50617 and previous config saved to /var/cache/conftool/dbconfig/20230821-133113-ladsgroup.json [13:32:50] PROBLEM - Check systemd state on cp3075 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:34:08] (CR) Bking: Start Blazegraph from systemd unit, without runBlazegraph.sh (1 comment) [puppet] - https://gerrit.wikimedia.org/r/950136 (https://phabricator.wikimedia.org/T342361) (owner: Gehel) [13:35:32] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1111.eqiad.wmnet with OS bullseye [13:38:14] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:38:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T344589)', diff saved to https://phabricator.wikimedia.org/P50618 and previous config saved to /var/cache/conftool/dbconfig/20230821-133849-ladsgroup.json [13:39:44] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1112.eqiad.wmnet with OS bullseye [13:39:56] RECOVERY - purged service on cp3075 is OK: OK - purged is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:40:01] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch2001.codfw.wmnet [13:40:38] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/951123 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [13:41:15] (CR)
Clément Goubert: [C: +2] mediawiki: Autocompute requests and limits for all [deployment-charts] - https://gerrit.wikimedia.org/r/951051 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert) [13:41:34] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:41:44] !log btullis@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:dse-k8s-worker [13:42:01] (Merged) jenkins-bot: mediawiki: Autocompute requests and limits for all [deployment-charts] - https://gerrit.wikimedia.org/r/951051 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert) [13:42:36] ops-codfw, serviceops-radar, Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (Papaul) @Jhancock.wm thanks for working on this. since we have some R440 in storage can you pull the backplane out of one of those servers and try while we waiting for Dell to send us o... [13:42:48] !log Enabling memory limit autocompute for all mw-on-k8s deployments - T342748 [13:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:52] T342748: mw-on-k8s app container CPU throttling at low average load - https://phabricator.wikimedia.org/T342748 [13:42:58] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [13:43:02] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [13:43:41] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [13:44:15] (CR) Joal: [C: +1] "LGTM !
Thanks @btullis :)" [puppet] - https://gerrit.wikimedia.org/r/950191 (https://phabricator.wikimedia.org/T342923) (owner: Btullis) [13:44:29] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [13:44:42] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [13:44:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P50620 and previous config saved to /var/cache/conftool/dbconfig/20230821-134452-ladsgroup.json [13:45:15] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [13:45:38] !log cgoubert@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [13:45:38] !log cgoubert@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [13:45:52] !log cgoubert@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [13:45:53] !log cgoubert@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [13:46:04] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [13:46:19] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [13:46:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:47:10] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch2001.codfw.wmnet [13:47:12] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host kafka-stretch2002.codfw.wmnet [13:47:46] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [13:48:12] ops-codfw, serviceops-radar, Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (Jhancock.wm) I can give
it a shot. Dell confirmed they will be sending a new part this morning. [13:48:38] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [13:49:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T342617)', diff saved to https://phabricator.wikimedia.org/P50621 and previous config saved to /var/cache/conftool/dbconfig/20230821-134926-ladsgroup.json [13:49:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [13:49:31] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [13:49:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [13:49:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3315 (T342617)', diff saved to https://phabricator.wikimedia.org/P50622 and previous config saved to /var/cache/conftool/dbconfig/20230821-134949-ladsgroup.json [13:49:51] ops-codfw, serviceops-radar, Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (Papaul) @Jhancock.wm in that case just wait for the new part. thanks [13:50:00] SRE, SRE-Access-Requests, DBA: mariadb: grant user 'phstats' additional select on phabricator_repository DB - https://phabricator.wikimedia.org/T344513 (Aklapper) Thanks, confirming it works as expected! [13:50:13] SRE, LDAP-Access-Requests, Patch-For-Review: Grant Access to analytics-privatedata-users for ATsay-WMF - https://phabricator.wikimedia.org/T344199 (jbond) In progress→Resolved @ATsay-WMF i have added you to the ldap wmf group which should give you access to superset. If you need additional a...
[13:50:59] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Ricki Jay (WMDE) - https://phabricator.wikimedia.org/T343700 (jbond) In progress→Stalled Setting to stalled until confirmation of NDA [13:51:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:51:54] (PS1) Clément Goubert: mw-misc: Enforce fixed requests and limits [deployment-charts] - https://gerrit.wikimedia.org/r/951125 (https://phabricator.wikimedia.org/T342748) [13:52:56] (CR) Clément Goubert: [C: +2] mw-misc: Enforce fixed requests and limits [deployment-charts] - https://gerrit.wikimedia.org/r/951125 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert) [13:53:12] (PS3) Muehlenhoff: firewall: Add SSH rules in firewall-agnostic form [puppet] - https://gerrit.wikimedia.org/r/951118 (https://phabricator.wikimedia.org/T336497) [13:53:42] (CR) Stevemunene: [C: +2] datahub: Enable OIDC to idp_test [deployment-charts] - https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene) [13:53:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kafka-stretch2002.codfw.wmnet [13:53:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P50623 and previous config saved to /var/cache/conftool/dbconfig/20230821-135355-ladsgroup.json [13:54:22] (Merged) jenkins-bot: mw-misc: Enforce fixed requests and limits [deployment-charts] - https://gerrit.wikimedia.org/r/951125 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert) [13:54:34] (Merged) jenkins-bot: datahub: Enable OIDC to idp_test [deployment-charts] - https://gerrit.wikimedia.org/r/949516 (https://phabricator.wikimedia.org/T305874) (owner:
Stevemunene) [13:55:02] jouncebot: nowandnext [13:55:03] For the next 0 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230821T1300) [13:55:03] In 1 hour(s) and 34 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230821T1530) [13:55:11] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [13:55:16] !log Re-enforcing limits and requests for mw-misc - T342748 [13:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:20] T342748: mw-on-k8s app container CPU throttling at low average load - https://phabricator.wikimedia.org/T342748 [13:55:20] TheresNoTime: Gimme a sec [13:55:36] claime: was just checking, not intending to deploy [13:55:50] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [13:56:43] SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (jbond) Open→Stalled [13:57:45] (PS1) Clément Goubert: mw-misc: Fix bad config key [deployment-charts] - https://gerrit.wikimedia.org/r/951126 [13:57:47] TheresNoTime: ack [13:57:49] thx [13:57:59] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/951118 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [13:59:40] (CR) Clément Goubert: [C: +2] mw-misc: Fix bad config key [deployment-charts] - https://gerrit.wikimedia.org/r/951126 (owner: Clément Goubert) [13:59:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P50624 and previous config saved to /var/cache/conftool/dbconfig/20230821-135958-ladsgroup.json [14:00:16] RECOVERY - Check systemd state on cp3075 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:00:23] (Merged) jenkins-bot: mw-misc: Fix bad config key [deployment-charts] - https://gerrit.wikimedia.org/r/951126 (owner: Clément Goubert) [14:01:10] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [14:01:13] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [14:01:23] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [14:01:26] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [14:02:20] !log killed cebwiki's dump gen on snapshot1010 [14:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:40] apergos: ^ FYI, it was blocking an important maint work [14:02:52] PROBLEM - Check systemd state on dse-k8s-worker1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:15] Amir1: ok, I'll check later to make sure the run picks up again as it should, thanks for the heads up [14:04:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P50625 and previous config saved to /var/cache/conftool/dbconfig/20230821-140433-ladsgroup.json [14:06:41] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:42] RECOVERY - Check systemd state on dse-k8s-worker1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123', diff saved to https://phabricator.wikimedia.org/P50626 and
previous config saved to /var/cache/conftool/dbconfig/20230821-140901-ladsgroup.json [14:10:02] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:30:00 on lvs[3008-3009].esams.wmnet with reason: rebooting for microcode updates [14:10:19] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lvs[3008-3009].esams.wmnet with reason: rebooting for microcode updates [14:11:03] !log hnowlan@cumin1001 START - Cookbook sre.dns.netbox [14:11:16] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs3008.esams.wmnet [14:13:43] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs3009.esams.wmnet [14:14:06] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host lvs3010.esams.wmnet [14:14:21] !log hnowlan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hnowlan: updating records for reuse of thumbor servers for k8s nodes T343996 T343993 - hnowlan@cumin1001" [14:14:26] T343996: Move codfw thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343996 [14:14:27] T343993: Move eqiad thumbor hosts to kubernetes cluster - https://phabricator.wikimedia.org/T343993 [14:15:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P50627 and previous config saved to /var/cache/conftool/dbconfig/20230821-141504-ladsgroup.json [14:15:05] (PS1) Eevans: aqs: upgrade aqs2001 to Cassandra 4.1.1 [puppet] - https://gerrit.wikimedia.org/r/951127 (https://phabricator.wikimedia.org/T339299) [14:15:06] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: hnowlan: updating records for reuse of thumbor servers for k8s nodes T343996 T343993 - hnowlan@cumin1001" [14:15:06] !log hnowlan@cumin1001 END (PASS)
- Cookbook sre.dns.netbox (exit_code=0) [14:15:22] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal, AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:16:26] !log hnowlan@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1058 [14:16:41] (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:12] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/951127 (https://phabricator.wikimedia.org/T339299) (owner: Eevans) [14:17:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T344589)', diff saved to https://phabricator.wikimedia.org/P50628 and previous config saved to /var/cache/conftool/dbconfig/20230821-141719-ladsgroup.json [14:17:43] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1058 [14:17:58] PROBLEM - BGP status on asw1-by27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:18:08] !log hnowlan@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1057 [14:18:24] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [14:19:26] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:19:28] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [14:19:41] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host
kubernetes1057 [14:19:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P50629 and previous config saved to /var/cache/conftool/dbconfig/20230821-141942-ladsgroup.json [14:19:46] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:19:55] !log hnowlan@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes2055 [14:19:57] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [14:20:41] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs3009.esams.wmnet [14:21:09] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes2055 [14:21:36] (ProbeDown) firing: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:21:53] ^ please ACK [14:22:18] !log hnowlan@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes2056 [14:22:19] sukhe: known issue? 
[14:22:26] yes please [14:22:36] (ProbeDown) firing: (12) Service text-https:443 has failed probes (http_text-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:22:43] sukhe: acked [14:22:44] should have downtimed [14:23:11] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host lvs3008.esams.wmnet [14:23:32] PROBLEM - PyBal IPVS diff check on lvs3008 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:23:42] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes2056 [14:24:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2123 (T344589)', diff saved to https://phabricator.wikimedia.org/P50630 and previous config saved to /var/cache/conftool/dbconfig/20230821-142408-ladsgroup.json [14:24:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [14:24:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2128.codfw.wmnet with reason: Maintenance [14:24:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [14:24:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [14:24:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2128 (T344589)', diff saved to https://phabricator.wikimedia.org/P50631 and previous config saved to /var/cache/conftool/dbconfig/20230821-142448-ladsgroup.json [14:24:59] (ConfdResourceFailed) firing: (120) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - 
https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:25:13] (CR) Btullis: [C: +1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/951127 (https://phabricator.wikimedia.org/T339299) (owner: Eevans) [14:26:10] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host lvs3010.esams.wmnet [14:26:20] (PS1) Stevemunene: datahub: fix cidr typo [deployment-charts] - https://gerrit.wikimedia.org/r/951128 (https://phabricator.wikimedia.org/T305874) [14:26:34] PROBLEM - PyBal IPVS diff check on lvs3010 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:26:39] (PS1) Krinkle: search-grafana-dashboards: Make work on plan Node.js [software] - https://gerrit.wikimedia.org/r/951129 [14:27:15] (PS2) Krinkle: search-grafana-dashboards: Make work on plan Node.js [software] - https://gerrit.wikimedia.org/r/951129 [14:27:16] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:27:48] (CR) Krinkle: "I switched this from npm:request to node-fetch a few years ago, that transition is now complete."
[software] - https://gerrit.wikimedia.org/r/951129 (owner: Krinkle) [14:28:22] (PS1) Hnowlan: site: change kubernetes::worker regex to include kubernetes105[78] [puppet] - https://gerrit.wikimedia.org/r/951130 (https://phabricator.wikimedia.org/T343993) [14:28:57] (CR) Eevans: [C: +2] aqs: upgrade aqs2001 to Cassandra 4.1.1 [puppet] - https://gerrit.wikimedia.org/r/951127 (https://phabricator.wikimedia.org/T339299) (owner: Eevans) [14:29:52] (PS1) Clément Goubert: mw-on-k8s: Raise traffic to 2% [puppet] - https://gerrit.wikimedia.org/r/951131 (https://phabricator.wikimedia.org/T341780) [14:30:06] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir3003.esams.wmnet [14:31:21] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host ncredir3004.esams.wmnet [14:31:39] (PS1) Ssingh: esams: remove dns3003 and dns3004 from authdns_servers [puppet] - https://gerrit.wikimedia.org/r/951133 (https://phabricator.wikimedia.org/T344587) [14:32:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T344589)', diff saved to https://phabricator.wikimedia.org/P50632 and previous config saved to /var/cache/conftool/dbconfig/20230821-143221-ladsgroup.json [14:32:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P50633 and previous config saved to /var/cache/conftool/dbconfig/20230821-143226-ladsgroup.json [14:32:33] !log Upgrading aq2001/cassandra-a (canary) to Cassandra 4.1.1 — T339299 [14:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:38] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299 [14:33:57] (CR) Ssingh: [C: +2] esams: remove dns3003 and dns3004 from authdns_servers [puppet] - https://gerrit.wikimedia.org/r/951133 (https://phabricator.wikimedia.org/T344587) (owner: Ssingh) [14:34:04]
!log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir3003.esams.wmnet [14:34:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T342617)', diff saved to https://phabricator.wikimedia.org/P50634 and previous config saved to /var/cache/conftool/dbconfig/20230821-143449-ladsgroup.json [14:34:53] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:35:06] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ncredir3004.esams.wmnet [14:37:51] (CR) Muehlenhoff: "(Acked in SRE meeting since it involves a change to an existing permission group)" [puppet] - https://gerrit.wikimedia.org/r/950194 (https://phabricator.wikimedia.org/T334493) (owner: Btullis) [14:38:29] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host dns3003.wikimedia.org [14:39:24] !log Upgrading aq2001/cassandra-b (canary) to Cassandra 4.1.1 — T339299 [14:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:27] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299 [14:39:38] SRE, SRE-Access-Requests: Requesting access to people.wikimedia.org for Panagiotis Penloglou - https://phabricator.wikimedia.org/T344405 (DBu-WMF) Approved [14:39:50] (PS2) Hnowlan: site: change kubernetes::worker regex to include ex-thumbor hosts [puppet] - https://gerrit.wikimedia.org/r/951130 (https://phabricator.wikimedia.org/T343993) [14:41:37] (CR) Clément Goubert: [C: +1] site: change kubernetes::worker regex to include ex-thumbor hosts [puppet] - https://gerrit.wikimedia.org/r/951130 (https://phabricator.wikimedia.org/T343993) (owner: Hnowlan) [14:42:09] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/951120 (owner: Giuseppe Lavagetto) [14:42:34] PROBLEM - BGP status on asw1-by27-esams.mgmt is
CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:42:38] PROBLEM - BFD status on asw1-by27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:45:12] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host vrts2001.codfw.wmnet [14:45:24] RECOVERY - BGP status on asw1-by27-esams.mgmt is OK: BGP OK - up: 8, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:45:30] RECOVERY - BFD status on asw1-by27-esams.mgmt is OK: UP: 1 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:45:57] jouncebot: nowandnext [14:45:57] No deployments scheduled for the next 0 hour(s) and 44 minute(s) [14:45:57] In 0 hour(s) and 44 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230821T1530) [14:46:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns3003.wikimedia.org [14:46:23] going to sneak in 950806 before I forget [14:46:35] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host dns3004.wikimedia.org [14:47:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950806 (owner: 10Majavah) [14:47:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P50635 and previous config saved to /var/cache/conftool/dbconfig/20230821-144728-ladsgroup.json [14:47:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P50636 and previous config saved to /var/cache/conftool/dbconfig/20230821-144738-ladsgroup.json [14:47:49] (03CR) 10Filippo Giunchedi: [C: 03+1] benthos: stop using strings as configuration 
[puppet] - 10https://gerrit.wikimedia.org/r/951121 (owner: 10Giuseppe Lavagetto) [14:47:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T342617)', diff saved to https://phabricator.wikimedia.org/P50637 and previous config saved to /var/cache/conftool/dbconfig/20230821-144757-ladsgroup.json [14:48:02] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:48:03] (03Merged) 10jenkins-bot: Revert "throttle: add rules for Wikimania 2023" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/950806 (owner: 10Majavah) [14:48:17] !log taavi@deploy1002 Started scap: Backport for [[gerrit:950806|Revert "throttle: add rules for Wikimania 2023"]] [14:48:21] (03CR) 10Hnowlan: [C: 03+2] site: change kubernetes::worker regex to include ex-thumbor hosts [puppet] - 10https://gerrit.wikimedia.org/r/951130 (https://phabricator.wikimedia.org/T343993) (owner: 10Hnowlan) [14:48:35] urbanecm: want me to backport T344495 too? [14:48:35] T344495: Special:GlobalUserRights no longer works on accounts with a space in their username - https://phabricator.wikimedia.org/T344495 [14:48:53] taavi: if you're already backporting, yes please. [14:48:57] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts2001.codfw.wmnet [14:49:17] otherwise i can do it with my pending GE backports later. [14:49:19] I'm deploying the throttle rule revert, so will do [14:49:22] thanks! 
[14:49:27] (03CR) 10Majavah: [C: 03+2] SpecialGlobalGroupMembership: Normalize usernames [extensions/CentralAuth] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950071 (https://phabricator.wikimedia.org/T344495) (owner: 10Urbanecm) [14:49:29] (03PS1) 10Ssingh: interface::noflow: change up to pre-up [puppet] - 10https://gerrit.wikimedia.org/r/951134 (https://phabricator.wikimedia.org/T344604) [14:49:47] !log taavi@deploy1002 taavi: Backport for [[gerrit:950806|Revert "throttle: add rules for Wikimania 2023"]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:49:59] !log taavi@deploy1002 taavi: Continuing with sync [14:50:08] PROBLEM - BGP status on asw1-bw27-esams.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:50:27] (03CR) 10Filippo Giunchedi: [C: 03+2] search-grafana-dashboards: Make work on plan Node.js [software] - 10https://gerrit.wikimedia.org/r/951129 (owner: 10Krinkle) [14:50:29] (03PS1) 10Ssingh: Revert "esams: remove dns3003 and dns3004 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/950815 [14:51:10] PROBLEM - BFD status on asw1-bw27-esams.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:51:29] (03PS1) 10Muehlenhoff: firewall::service: Use correct type for port range [puppet] - 10https://gerrit.wikimedia.org/r/951135 (https://phabricator.wikimedia.org/T336497) [14:52:58] RECOVERY - BGP status on asw1-bw27-esams.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:53:31] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/951134/42937/" [puppet] - 10https://gerrit.wikimedia.org/r/951134 (https://phabricator.wikimedia.org/T344604) (owner: 10Ssingh) [14:53:56] !log 
sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns3004.wikimedia.org [14:53:58] RECOVERY - BFD status on asw1-bw27-esams.mgmt is OK: UP: 1 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:54:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951135 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [14:54:12] (03CR) 10Ssingh: [C: 03+2] Revert "esams: remove dns3003 and dns3004 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/950815 (owner: 10Ssingh) [14:54:14] (03Merged) 10jenkins-bot: SpecialGlobalGroupMembership: Normalize usernames [extensions/CentralAuth] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950071 (https://phabricator.wikimedia.org/T344495) (owner: 10Urbanecm) [14:54:52] (03CR) 10Muehlenhoff: [C: 03+2] autoinstall: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/950143 (owner: 10Muehlenhoff) [14:55:14] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host arclamp1001.eqiad.wmnet [14:55:56] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite2004.codfw.wmnet [14:56:29] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:950806|Revert "throttle: add rules for Wikimania 2023"]] (duration: 08m 11s) [14:56:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:56:42] jouncebot: next [14:56:42] In 0 hour(s) and 33 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230821T1530) [14:56:59] godog: I'm doing some backports [14:57:05] !log taavi@deploy1002 Started scap: Backport for [[gerrit:950071|SpecialGlobalGroupMembership: Normalize usernames (T344495)]] [14:57:09] T344495: Special:GlobalUserRights no longer 
works on accounts with a space in their username - https://phabricator.wikimedia.org/T344495 [14:57:24] taavi: ack, thanks! that's the graphite standby host FWIW [14:58:35] !log taavi@deploy1002 taavi and urbanecm: Backport for [[gerrit:950071|SpecialGlobalGroupMembership: Normalize usernames (T344495)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:59:10] taavi: patch seems to be working :) [14:59:16] yep, I was testing too [14:59:18] deploying [14:59:23] !log taavi@deploy1002 taavi and urbanecm: Continuing with sync [15:00:41] (03PS1) 10Muehlenhoff: apt: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/951136 [15:00:57] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host arclamp1001.eqiad.wmnet [15:01:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:02:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128', diff saved to https://phabricator.wikimedia.org/P50638 and previous config saved to /var/cache/conftool/dbconfig/20230821-150234-ladsgroup.json [15:02:43] (03PS7) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) [15:02:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T344589)', diff saved to https://phabricator.wikimedia.org/P50639 and previous config saved to /var/cache/conftool/dbconfig/20230821-150244-ladsgroup.json [15:02:45] (03PS1) 10Jbond: puppetmaster::fetch_swift_rings: rename profile [puppet] - 10https://gerrit.wikimedia.org/r/951138 (https://phabricator.wikimedia.org/T341056) [15:02:49] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [15:02:54] hmm, trying to resolve bast3006.wikimedia.org but apparently that isn't a thing anymore? [15:03:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [15:03:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P50640 and previous config saved to /var/cache/conftool/dbconfig/20230821-150304-ladsgroup.json [15:03:07] addshore: bast3007 is [15:03:09] addshore: see ops-l [15:03:13] (03CR) 10CI reject: [V: 04-1] puppetmaster::fetch_swift_rings: rename profile [puppet] - 10https://gerrit.wikimedia.org/r/951138 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [15:03:22] (03CR) 10Vgutierrez: [C: 04-1] "This also impacts to interface::rps and interface::txqueuelen, we need to double check that those actions can be performed before bringing" [puppet] - 10https://gerrit.wikimedia.org/r/951134 (https://phabricator.wikimedia.org/T344604) (owner: 10Ssingh) [15:03:41] classic timing, thanks [15:03:45] or you might want to consider Marseille / bast6002 :) [15:03:58] *updates wikitech too* [15:05:52] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:950071|SpecialGlobalGroupMembership: Normalize usernames (T344495)]] (duration: 08m 47s) [15:06:04] aand I'm done [15:06:04] (03PS1) 10Andrew Bogott: backy2: make postgres data dir configurable. 
[puppet] - 10https://gerrit.wikimedia.org/r/951139 (https://phabricator.wikimedia.org/T344065) [15:06:04] T344495: Special:GlobalUserRights no longer works on accounts with a space in their username - https://phabricator.wikimedia.org/T344495 [15:06:14] godog: ^ [15:06:44] urbanecm: but yes, let's try 6002 for a while :D [15:07:15] taavi: cheers [15:07:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [15:07:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [15:07:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:07:44] (03PS2) 10Jbond: puppetmaster::fetch_swift_rings: rename profile [puppet] - 10https://gerrit.wikimedia.org/r/951138 (https://phabricator.wikimedia.org/T341056) [15:07:46] (03PS8) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) [15:07:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:07:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T344589)', diff saved to https://phabricator.wikimedia.org/P50641 and previous config saved to /var/cache/conftool/dbconfig/20230821-150755-ladsgroup.json [15:07:59] can I do some deployments? [15:08:08] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host graphite2004.codfw.wmnet [15:08:28] +1 on my end, FWIW [15:08:29] Amir1: I just finished deploying, so sure I guess [15:08:37] (03CR) 10CI reject: [V: 04-1] backy2: make postgres data dir configurable. 
[puppet] - 10https://gerrit.wikimedia.org/r/951139 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [15:08:40] PROBLEM - Check systemd state on graphite2004 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:06] RECOVERY - Check systemd state on graphite2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:07] awesome [15:10:55] (03PS9) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) [15:10:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 20): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42942/console" [puppet] - 10https://gerrit.wikimedia.org/r/951138 (https://phabricator.wikimedia.org/T341056) (owner: 10Jbond) [15:10:59] (03PS3) 10Ladsgroup: Enable url shortener in sidebar in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947823 (https://phabricator.wikimedia.org/T267921) [15:11:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947823 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup) [15:12:18] (03Merged) 10jenkins-bot: Enable url shortener in sidebar in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947823 (https://phabricator.wikimedia.org/T267921) (owner: 10Ladsgroup) [15:12:27] (03PS2) 10Andrew Bogott: backy2: make postgres data dir configurable. 
[puppet] - 10https://gerrit.wikimedia.org/r/951139 (https://phabricator.wikimedia.org/T344065) [15:12:31] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:947823|Enable url shortener in sidebar in testwiki (T267921)]] [15:12:35] T267921: Roll out the Toolbox link for URL Shortener in Wikimedia sites - https://phabricator.wikimedia.org/T267921 [15:13:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951136 (owner: 10Muehlenhoff) [15:13:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:dse-k8s-worker [15:14:00] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:947823|Enable url shortener in sidebar in testwiki (T267921)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [15:14:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T344589)', diff saved to https://phabricator.wikimedia.org/P50642 and previous config saved to /var/cache/conftool/dbconfig/20230821-151408-ladsgroup.json [15:14:50] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [15:15:51] (03CR) 10Andrew Bogott: [C: 03+2] backy2: make postgres data dir configurable. 
[puppet] - 10https://gerrit.wikimedia.org/r/951139 (https://phabricator.wikimedia.org/T344065) (owner: 10Andrew Bogott) [15:17:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2128 (T344589)', diff saved to https://phabricator.wikimedia.org/P50643 and previous config saved to /var/cache/conftool/dbconfig/20230821-151740-ladsgroup.json [15:17:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [15:17:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [15:18:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P50644 and previous config saved to /var/cache/conftool/dbconfig/20230821-151805-ladsgroup.json [15:18:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P50645 and previous config saved to /var/cache/conftool/dbconfig/20230821-151810-ladsgroup.json [15:18:14] (03PS1) 10Filippo Giunchedi: confd: create run_dir via tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/951141 [15:19:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3315 (T344589)', diff saved to https://phabricator.wikimedia.org/P50646 and previous config saved to /var/cache/conftool/dbconfig/20230821-151931-ladsgroup.json [15:21:34] RECOVERY - PyBal IPVS diff check on lvs3008 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:21:36] (ProbeDown) resolved: (8) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:22:08] RECOVERY - PyBal IPVS diff check on lvs3010 is OK: OK: no 
difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:22:18] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:947823|Enable url shortener in sidebar in testwiki (T267921)]] (duration: 09m 46s) [15:22:25] T267921: Roll out the Toolbox link for URL Shortener in Wikimedia sites - https://phabricator.wikimedia.org/T267921 [15:22:27] (03PS1) 10Ladsgroup: Stop writing to the old columns of extlinks everywhere except s1, s4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951142 (https://phabricator.wikimedia.org/T342683) [15:22:37] (ProbeDown) resolved: (8) Service text-https:443 has failed probes (http_text-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:22:47] jouncebot: nowandnext [15:22:47] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [15:22:47] In 0 hour(s) and 7 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230821T1530) [15:23:41] 10SRE, 10ops-codfw, 10collaboration-services, 10decommission-hardware: Decommission contint2001.wikimedia.org - https://phabricator.wikimedia.org/T342017 (10Jhancock.wm) a:03Jhancock.wm [15:23:44] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 12897 [15:24:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 12897 [15:24:15] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 26801 [15:24:33] (03PS10) 10Jbond: puppetserver: add volatile config [puppet] - 10https://gerrit.wikimedia.org/r/948647 (https://phabricator.wikimedia.org/T341056) [15:24:36] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 26801 [15:24:40] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 
45356 [15:25:09] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45356 [15:25:14] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 141081 [15:25:35] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 141081 [15:26:54] !log cleaning confd stale ncredir errors in config-master2001 [15:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P50647 and previous config saved to /var/cache/conftool/dbconfig/20230821-152657-ladsgroup.json [15:29:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P50648 and previous config saved to /var/cache/conftool/dbconfig/20230821-152914-ladsgroup.json [15:29:44] (03PS1) 10Lucas Werkmeister: tools-static: Further header cleanup [puppet] - 10https://gerrit.wikimedia.org/r/951144 [15:29:54] (ConfdResourceFailed) firing: (120) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:29:58] (03CR) 10Lucas Werkmeister: tools-static: Hide more Cloudflare response headers (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/940506 (owner: 10Lucas Werkmeister) [15:30:04] jan_drewniak: That opportune time is upon us again. Time for a Wikimedia Portals Update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230821T1530). 
[15:32:11] (03PS1) 10Eevans: aqs: upgrade aqs1010 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/951145 (https://phabricator.wikimedia.org/T339299) [15:32:14] (03CR) 10CI reject: [V: 04-1] tools-static: Further header cleanup [puppet] - 10https://gerrit.wikimedia.org/r/951144 (owner: 10Lucas Werkmeister) [15:32:47] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/951145 (https://phabricator.wikimedia.org/T339299) (owner: 10Eevans) [15:33:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T342617)', diff saved to https://phabricator.wikimedia.org/P50649 and previous config saved to /var/cache/conftool/dbconfig/20230821-153316-ladsgroup.json [15:33:21] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [15:34:04] (03CR) 10MSantos: [C: 03+2] wikifeeds: Use GET instead of POST for mwapi requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/949046 (https://phabricator.wikimedia.org/T343950) (owner: 10Jgiannelos) [15:34:28] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951147 (https://phabricator.wikimedia.org/T128546) [15:34:42] 10SRE, 10ops-codfw, 10serviceops: Decommission thumbor200[3456] - https://phabricator.wikimedia.org/T344597 (10Jhancock.wm) @hnowlan Wasn't sure so wanted to reach out. Is it safe to proceed with decommissioning these servers? 
[15:34:55] (ConfdResourceFailed) firing: (120) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:35:08] (03Merged) 10jenkins-bot: wikifeeds: Use GET instead of POST for mwapi requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/949046 (https://phabricator.wikimedia.org/T343950) (owner: 10Jgiannelos) [15:35:10] jan_drewniak: hi, let me know once you're done [15:35:37] Amir1: ok will do :) [15:36:00] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951147 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:36:26] 10SRE-swift-storage, 10collaboration-services: Investigate object storage for Gitlab - https://phabricator.wikimedia.org/T336234 (10eoghan) [15:36:40] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951147 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:39:54] (ConfdResourceFailed) resolved: (120) confd resource _srv_config-master_pybal_codfw_ncredir-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:42:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P50650 and previous config saved to /var/cache/conftool/dbconfig/20230821-154203-ladsgroup.json [15:43:03] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mirror1001.wikimedia.org with reason: New kernel, T344587 [15:43:17] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mirror1001.wikimedia.org with reason: New kernel, T344587 [15:44:21] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P50651 and previous config saved to /var/cache/conftool/dbconfig/20230821-154420-ladsgroup.json [15:44:26] (03PS1) 10Urbanecm: Growth: Remove wgWelcomeSurveyEnableWithHomepage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951151 (https://phabricator.wikimedia.org/T342353) [15:45:08] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: jnl export [15:45:21] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: jnl export [15:45:33] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:951147| Bumping portals to master (T128546)]] (duration: 07m 20s) [15:45:40] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:46:28] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:01] 10SRE, 10Content-Transform-Team-WIP, 10Mobile-Content-Service, 10RESTbase Sunsetting, and 3 others: Setup allowed list for MCS decom - https://phabricator.wikimedia.org/T340036 (10MSantos) [15:47:06] (03PS2) 10Ssingh: interface::noflow: add optional parameter for pre-up [puppet] - 10https://gerrit.wikimedia.org/r/951134 (https://phabricator.wikimedia.org/T344604) [15:47:08] (03PS1) 10Ssingh: hiera: add profile::cache::base::use_iface_preup for esams cp nodes [puppet] - 10https://gerrit.wikimedia.org/r/951152 [15:47:59] (03PS2) 10Urbanecm: Growth: Remove wgWelcomeSurveyEnableWithHomepage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951151 (https://phabricator.wikimedia.org/T342353) [15:48:12] (03CR) 10Xcollazo: [C: 03+1] Retain yarn logs 
for 60 days and compress with gzip [puppet] - 10https://gerrit.wikimedia.org/r/950191 (https://phabricator.wikimedia.org/T342923) (owner: 10Btullis) [15:48:28] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42945/console" [puppet] - 10https://gerrit.wikimedia.org/r/951152 (owner: 10Ssingh) [15:48:39] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: New kernel, T344587 [15:48:52] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: New kernel, T344587 [15:50:38] RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:20] (03CR) 10Vgutierrez: interface::noflow: add optional parameter for pre-up (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/951134 (https://phabricator.wikimedia.org/T344604) (owner: 10Ssingh) [15:53:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:52] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:951147| Bumping portals to master (T128546)]] (duration: 08m 18s) [15:53:56] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:54:16] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: New kernel, T344587 [15:54:29] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: New kernel, T344587 [15:55:17] Amir1: Finally done :) [15:55:34] 
awesome. thanks [15:56:31] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1184 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/951088 (https://phabricator.wikimedia.org/T344621) [15:56:36] (03PS1) 10Gerrit maintenance bot: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/951089 (https://phabricator.wikimedia.org/T344621) [15:57:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P50652 and previous config saved to /var/cache/conftool/dbconfig/20230821-155710-ladsgroup.json [15:57:28] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/951136 (owner: 10Muehlenhoff) [15:57:48] (03PS2) 10Stevemunene: datahub: fix cidr typo [deployment-charts] - 10https://gerrit.wikimedia.org/r/951128 (https://phabricator.wikimedia.org/T305874) [15:58:05] Amir1: oh no! I fetched master but forgot to rebase [15:58:20] Can I do the deployment again? [15:58:33] no worries. 
I'll be in a meeting right now [15:58:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:58:37] take your time [15:59:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T344589)', diff saved to https://phabricator.wikimedia.org/P50653 and previous config saved to /var/cache/conftool/dbconfig/20230821-155927-ladsgroup.json [15:59:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [15:59:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [15:59:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1185 (T344589)', diff saved to https://phabricator.wikimedia.org/P50654 and previous config saved to /var/cache/conftool/dbconfig/20230821-155950-ladsgroup.json [16:00:14] !log herron@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka logging-codfw cluster: Reboot kafka nodes [16:03:30] (03PS1) 10Muehlenhoff: aptrepo: Remove support for stretch [puppet] - 10https://gerrit.wikimedia.org/r/951155 [16:06:47] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:951147| Bumping portals to master (T128546)]] (duration: 07m 18s) [16:06:58] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:07:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:12:16] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T344589)', diff saved to https://phabricator.wikimedia.org/P50655 and previous config saved to /var/cache/conftool/dbconfig/20230821-161216-ladsgroup.json [16:12:34] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:13:59] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:951147| Bumping portals to master (T128546)]] (duration: 07m 11s) [16:14:03] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:14:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315 (T344589)', diff saved to https://phabricator.wikimedia.org/P50656 and previous config saved to /var/cache/conftool/dbconfig/20230821-161432-ladsgroup.json [16:17:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:20:17] SRE, ops-codfw, serviceops: Decommission thumbor200[3456] - https://phabricator.wikimedia.org/T344597 (hnowlan) >>! In T344597#9106635, @Jhancock.wm wrote: > @hnowlan Wasn't sure so wanted to reach out. Is it safe to proceed with decommissioning these servers? Apologies - I should be much clearer i...
[16:21:47] SRE, ops-codfw, serviceops: Decommission thumbor200[34] - https://phabricator.wikimedia.org/T344597 (hnowlan) [16:22:30] SRE, ops-eqiad, serviceops: Decommission thumbor100[12] - https://phabricator.wikimedia.org/T344598 (hnowlan) [16:22:37] (PS3) Ssingh: interface::noflow: add optional parameter for pre-up [puppet] - https://gerrit.wikimedia.org/r/951134 (https://phabricator.wikimedia.org/T344604) [16:28:02] (PS2) Ssingh: hiera: add profile::cache::base::use_iface_preup for esams cp nodes [puppet] - https://gerrit.wikimedia.org/r/951152 [16:28:24] (CR) Ssingh: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/951152 (owner: Ssingh) [16:29:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P50657 and previous config saved to /var/cache/conftool/dbconfig/20230821-162939-ladsgroup.json [16:29:46] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/951135 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff) [16:30:09] (CR) Cathal Mooney: "Makes sense to do this in a "pre-up" command.
Whether it's needed for the 25G DACs we use / PHY I can't say for sure but I'm sure it's no" [puppet] - https://gerrit.wikimedia.org/r/951134 (https://phabricator.wikimedia.org/T344604) (owner: Ssingh) [16:30:24] (CR) Jbond: [C: +1] apt: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/951136 (owner: Muehlenhoff) [16:30:37] (CR) Jbond: [C: +1] aptrepo: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/951155 (owner: Muehlenhoff) [16:33:13] (CR) Jbond: [C: +1] "LGTM thanks" [puppet] - https://gerrit.wikimedia.org/r/951141 (owner: Filippo Giunchedi) [16:33:46] PROBLEM - Check systemd state on kafka-logging2001 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:25] (Abandoned) Ssingh: hiera: add profile::cache::base::use_iface_preup for esams cp nodes [puppet] - https://gerrit.wikimedia.org/r/951152 (owner: Ssingh) [16:37:44] (PS1) Ssingh: hiera: add profile::cache::base::use_perf_iface_preup for esams cp nodes [puppet] - https://gerrit.wikimedia.org/r/951161 [16:39:03] (CR) Ssingh: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42947/console" [puppet] - https://gerrit.wikimedia.org/r/951161 (owner: Ssingh) [16:39:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T344589)', diff saved to https://phabricator.wikimedia.org/P50658 and previous config saved to /var/cache/conftool/dbconfig/20230821-163913-ladsgroup.json [16:43:34] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder) [16:44:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3315', diff saved to https://phabricator.wikimedia.org/P50659 and previous config saved to
/var/cache/conftool/dbconfig/20230821-164445-ladsgroup.json [16:44:48] PROBLEM - Check systemd state on kafka-logging2002 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:51:40] (CR) Eevans: [C: +1] "LGTM; If no one objects, I'll merge & deploy tomorrow" [deployment-charts] - https://gerrit.wikimedia.org/r/913949 (https://phabricator.wikimedia.org/T335691) (owner: Ahmon Dancy) [16:53:01] (PS4) Ssingh: interface::noflow: add optional parameter for pre-up [puppet] - https://gerrit.wikimedia.org/r/951134 (https://phabricator.wikimedia.org/T344604) [16:54:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P50660 and previous config saved to /var/cache/conftool/dbconfig/20230821-165420-ladsgroup.json [16:54:41] (PS5) JMeybohm: confd: Move -prefix from confd cli to confd::file instances [puppet] - https://gerrit.wikimedia.org/r/951124 (https://phabricator.wikimedia.org/T341669) [16:54:44] RECOVERY - Check systemd state on kafka-logging2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:55:12] (PS2) Ssingh: hiera: add profile::cache::base::use_perf_iface_preup for esams cp nodes [puppet] - https://gerrit.wikimedia.org/r/951161 [16:56:43] (Abandoned) Ssingh: hiera: add profile::cache::base::use_perf_iface_preup for esams cp nodes [puppet] - https://gerrit.wikimedia.org/r/951161 (owner: Ssingh) [16:57:57] (PS1) Ssingh: hiera: add profile::cache::base::use_iface_preup only for esams cp nodes [puppet] - https://gerrit.wikimedia.org/r/951164 [16:59:30] (PS2) Ssingh: hiera: add profile::cache::base::use_noflow_iface_preup only for esams cp nodes [puppet] - https://gerrit.wikimedia.org/r/951164 [16:59:52] !log ladsgroup@cumin1001 dbctl commit (dc=all):
'Repooling after maintenance db2137:3315 (T344589)', diff saved to https://phabricator.wikimedia.org/P50661 and previous config saved to /var/cache/conftool/dbconfig/20230821-165951-ladsgroup.json [16:59:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230821T1700) [17:00:05] ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230821T1700). [17:00:08] (CR) Vgutierrez: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/951134 (https://phabricator.wikimedia.org/T344604) (owner: Ssingh) [17:00:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2157.codfw.wmnet with reason: Maintenance [17:00:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2157 (T344589)', diff saved to https://phabricator.wikimedia.org/P50662 and previous config saved to /var/cache/conftool/dbconfig/20230821-170016-ladsgroup.json [17:00:36] RECOVERY - Check systemd state on kafka-logging2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:41] (CR) Ssingh: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42949/console" [puppet] - https://gerrit.wikimedia.org/r/951164 (owner: Ssingh) [17:00:58] (CR) Vgutierrez: [C: +1] hiera: add profile::cache::base::use_noflow_iface_preup only for esams cp nodes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/951164 (owner: Ssingh) [17:01:05] (PS4) Hamish: Add botadmin group on eswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/940486
(https://phabricator.wikimedia.org/T342484) [17:01:48] (PS3) Ssingh: hiera: add profile::cache::base::use_noflow_iface_preup only for esams cp nodes [puppet] - https://gerrit.wikimedia.org/r/951164 (https://phabricator.wikimedia.org/T344604) [17:05:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T344589)', diff saved to https://phabricator.wikimedia.org/P50663 and previous config saved to /var/cache/conftool/dbconfig/20230821-170531-ladsgroup.json [17:08:38] PROBLEM - Check systemd state on kafka-logging2003 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P50664 and previous config saved to /var/cache/conftool/dbconfig/20230821-170926-ladsgroup.json [17:13:40] jouncebot: nowandnext [17:13:40] For the next 0 hour(s) and 46 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230821T1700) [17:13:40] For the next 0 hour(s) and 16 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230821T1700) [17:13:40] In 2 hour(s) and 46 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230821T2000) [17:14:02] (CR) Lucas Werkmeister: "pretty sure I’ve had annoying encounters with GerritMessageValidator before 🙄 I guess I’ll rephrase the commit message to be less machine " [puppet] - https://gerrit.wikimedia.org/r/951144 (owner: Lucas Werkmeister) [17:15:17] (PS2) Lucas Werkmeister: tools-static: Further header cleanup [puppet] - https://gerrit.wikimedia.org/r/951144 [17:15:25] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] -
https://gerrit.wikimedia.org/r/951142 (https://phabricator.wikimedia.org/T342683) (owner: Ladsgroup) [17:16:03] (Merged) jenkins-bot: Stop writing to the old columns of extlinks everywhere except s1, s4 [mediawiki-config] - https://gerrit.wikimedia.org/r/951142 (https://phabricator.wikimedia.org/T342683) (owner: Ladsgroup) [17:16:18] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:951142|Stop writing to the old columns of extlinks everywhere except s1, s4 (T342683)]] [17:16:24] T342683: Stop writing to the old externallinks columns in beta cluster and production - https://phabricator.wikimedia.org/T342683 [17:17:50] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:951142|Stop writing to the old columns of extlinks everywhere except s1, s4 (T342683)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [17:18:26] !log ladsgroup@deploy1002 Sync cancelled.
[17:19:41] (PS1) Ladsgroup: Fix update for WRITE stage of extlinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/951167 [17:19:49] (CR) CI reject: [V: -1] Fix update for WRITE stage of extlinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/951167 (owner: Ladsgroup) [17:20:06] (PS2) Ladsgroup: Fix update for WRITE stage of extlinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/951167 [17:20:17] (CR) Ladsgroup: [C: +2] Fix update for WRITE stage of extlinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/951167 (owner: Ladsgroup) [17:20:30] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/951167 (owner: Ladsgroup) [17:20:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P50665 and previous config saved to /var/cache/conftool/dbconfig/20230821-172037-ladsgroup.json [17:21:08] (Merged) jenkins-bot: Fix update for WRITE stage of extlinks migration [mediawiki-config] - https://gerrit.wikimedia.org/r/951167 (owner: Ladsgroup) [17:21:23] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:951167|Fix update for WRITE stage of extlinks migration]] [17:22:53] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:951167|Fix update for WRITE stage of extlinks migration]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [17:24:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T344589)', diff saved to https://phabricator.wikimedia.org/P50666 and previous config saved to /var/cache/conftool/dbconfig/20230821-172432-ladsgroup.json [17:24:37] !log ladsgroup@cumin1001 START - Cookbook
sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [17:24:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [17:24:51] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [17:24:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1200 (T344589)', diff saved to https://phabricator.wikimedia.org/P50667 and previous config saved to /var/cache/conftool/dbconfig/20230821-172456-ladsgroup.json [17:30:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T344589)', diff saved to https://phabricator.wikimedia.org/P50668 and previous config saved to /var/cache/conftool/dbconfig/20230821-173056-ladsgroup.json [17:31:14] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:951167|Fix update for WRITE stage of extlinks migration]] (duration: 09m 50s) [17:31:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:32:48] (CR) Ssingh: [C: +2] hiera: add profile::cache::base::use_noflow_iface_preup only for esams cp nodes [puppet] - https://gerrit.wikimedia.org/r/951164 (https://phabricator.wikimedia.org/T344604) (owner: Ssingh) [17:32:51] (CR) Ssingh: [C: +2] interface::noflow: add optional parameter for pre-up [puppet] - https://gerrit.wikimedia.org/r/951134 (https://phabricator.wikimedia.org/T344604) (owner: Ssingh) [17:33:56] PROBLEM - Check systemd state on kafka-logging2004 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:35] ops-knams: Port with no description on access switch - https://phabricator.wikimedia.org/T344633
(phaultfinder) [17:35:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P50669 and previous config saved to /var/cache/conftool/dbconfig/20230821-173544-ladsgroup.json [17:36:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:37:26] (PS4) Bartosz Dziewoński: Remove unneeded $wgDefaultUserOptions['visualeditor-enable'] settings [mediawiki-config] - https://gerrit.wikimedia.org/r/933998 (https://phabricator.wikimedia.org/T340696) [17:37:38] (PS2) Bartosz Dziewoński: Move visual editor out of Beta Features (without changing prefs) [mediawiki-config] - https://gerrit.wikimedia.org/r/947015 (https://phabricator.wikimedia.org/T335056) [17:37:46] (PS2) Bartosz Dziewoński: Clarify 2017 wikitext editor's Beta Feature status [mediawiki-config] - https://gerrit.wikimedia.org/r/949588 (https://phabricator.wikimedia.org/T344158) [17:44:10] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde for Ricki Jay (WMDE) - https://phabricator.wikimedia.org/T343700 (KFrancis) The NDA has been signed. Please proceed with next steps.
[17:46:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P50670 and previous config saved to /var/cache/conftool/dbconfig/20230821-174602-ladsgroup.json [17:50:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T344589)', diff saved to https://phabricator.wikimedia.org/P50671 and previous config saved to /var/cache/conftool/dbconfig/20230821-175050-ladsgroup.json [17:50:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [17:51:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [17:51:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3315 (T344589)', diff saved to https://phabricator.wikimedia.org/P50672 and previous config saved to /var/cache/conftool/dbconfig/20230821-175115-ladsgroup.json [17:52:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T344589)', diff saved to https://phabricator.wikimedia.org/P50673 and previous config saved to /var/cache/conftool/dbconfig/20230821-175240-ladsgroup.json [17:57:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T344589)', diff saved to https://phabricator.wikimedia.org/P50674 and previous config saved to /var/cache/conftool/dbconfig/20230821-175746-ladsgroup.json [17:57:50] PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:21] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka logging-codfw cluster: Reboot kafka nodes [18:00:06] RECOVERY - Check systemd state on kafka-logging2003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:00] RECOVERY - Check systemd state on kafka-logging2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P50675 and previous config saved to /var/cache/conftool/dbconfig/20230821-180108-ladsgroup.json [18:03:04] PROBLEM - Check systemd state on kafka-logging2005 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:03:12] (CR) Gmodena: "This change is ready for review." (1 comment) [alerts] - https://gerrit.wikimedia.org/r/939651 (https://phabricator.wikimedia.org/T342258) (owner: Gmodena) [18:05:43] !log reboot cp3066 for T344587 [18:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:54] RECOVERY - Check systemd state on kafka-logging2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:05:54] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3066.esams.wmnet [18:06:07] !log herron@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka logging-eqiad cluster: Reboot kafka nodes [18:12:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P50676 and previous config saved to /var/cache/conftool/dbconfig/20230821-181252-ladsgroup.json [18:14:59] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3066.esams.wmnet [18:16:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T344589)', diff saved to https://phabricator.wikimedia.org/P50677 and previous config saved to
/var/cache/conftool/dbconfig/20230821-181615-ladsgroup.json [18:16:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [18:16:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [18:16:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3315 (T344589)', diff saved to https://phabricator.wikimedia.org/P50678 and previous config saved to /var/cache/conftool/dbconfig/20230821-181629-ladsgroup.json [18:16:41] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:17:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1213:3316 (T344589)', diff saved to https://phabricator.wikimedia.org/P50679 and previous config saved to /var/cache/conftool/dbconfig/20230821-181752-ladsgroup.json [18:17:53] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3081.esams.wmnet [18:18:06] !log fabfur@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cp3081.esams.wmnet [18:18:09] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host durum3003.esams.wmnet [18:19:03] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3081.esams.wmnet [18:19:04] (CR) Majavah: [C: +2] tools-static: Further header cleanup [puppet] - https://gerrit.wikimedia.org/r/951144 (owner: Lucas Werkmeister) [18:19:22] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host durum3004.esams.wmnet [18:19:31] !log reboot cp3081 for T344587 [18:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:34] RECOVERY - Check systemd state on an-presto1002 is
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:21:26] SRE, SRE-OnFire, Incident Tooling, Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377 (RLazarus) >>! In T343377#9105446, @SLyngshede-WMF wrote: > If it's just a matter of managing a LDAP group, then that's perfectly within scope... [18:21:59] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host durum3003.esams.wmnet [18:23:25] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host durum3004.esams.wmnet [18:24:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T344589)', diff saved to https://phabricator.wikimedia.org/P50680 and previous config saved to /var/cache/conftool/dbconfig/20230821-182452-ladsgroup.json [18:27:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315', diff saved to https://phabricator.wikimedia.org/P50681 and previous config saved to /var/cache/conftool/dbconfig/20230821-182759-ladsgroup.json [18:28:02] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3081.esams.wmnet [18:31:03] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3067.esams.wmnet [18:31:19] !log reboot cp[3067-3073].esams.wmnet for T344587 [18:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:52] PROBLEM - Check systemd state on kafka-logging1001 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:40:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P50682 and previous config saved to
/var/cache/conftool/dbconfig/20230821-184000-ladsgroup.json [18:40:14] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3067.esams.wmnet [18:40:17] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3068.esams.wmnet [18:43:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3315 (T344589)', diff saved to https://phabricator.wikimedia.org/P50683 and previous config saved to /var/cache/conftool/dbconfig/20230821-184305-ladsgroup.json [18:43:46] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:45:54] (PS8) Ssingh: knams migration: remove references to old esams [dns] - https://gerrit.wikimedia.org/r/949930 (https://phabricator.wikimedia.org/T329219) [18:46:31] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-worker1001.eqiad.wmnet with reason: New kernel, T344587 [18:46:36] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:46:44] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-worker1001.eqiad.wmnet with reason: New kernel, T344587 [18:47:36] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:48:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T344589)', diff saved to https://phabricator.wikimedia.org/P50684 and previous config saved to /var/cache/conftool/dbconfig/20230821-184818-ladsgroup.json [18:48:34] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:49:14] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host
cp3068.esams.wmnet [18:49:17] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3069.esams.wmnet [18:49:58] RECOVERY - BFD status on cr2-eqsin is OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:50:24] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:50:54] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_eqsin and A:cp [18:54:06] (CR) Eevans: [C: +2] aqs: upgrade aqs1010 to Cassandra 4.1.1 [puppet] - https://gerrit.wikimedia.org/r/951145 (https://phabricator.wikimedia.org/T339299) (owner: Eevans) [18:54:35] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-worker1002.eqiad.wmnet with reason: New kernel, T344587 [18:54:59] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-worker1002.eqiad.wmnet with reason: New kernel, T344587 [18:55:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315', diff saved to https://phabricator.wikimedia.org/P50685 and previous config saved to /var/cache/conftool/dbconfig/20230821-185506-ladsgroup.json [18:56:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:56:34] !log Upgrading aqs1010/cassandra-{a,b} (canary) to Cassandra 4.1.1 — T339299 [18:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:38] T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299 [18:57:38] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single
(exit_code=0) for host cp3069.esams.wmnet [18:57:41] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3070.esams.wmnet [18:59:42] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-ctrl1001.eqiad.wmnet with reason: New kernel, T344587 [18:59:55] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-ctrl1001.eqiad.wmnet with reason: New kernel, T344587 [19:00:04] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-ctrl1002.eqiad.wmnet with reason: New kernel, T344587 [19:00:29] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-ctrl1002.eqiad.wmnet with reason: New kernel, T344587 [19:00:56] RECOVERY - Check systemd state on kafka-logging1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:01:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:02:50] PROBLEM - Check systemd state on kafka-logging1002 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:03:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P50686 and previous config saved to /var/cache/conftool/dbconfig/20230821-190324-ladsgroup.json [19:04:02] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [19:06:16] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - aux-k8s-ctrl_6443: Servers aux-k8s-ctrl1002.eqiad.wmnet are marked 
down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:06:24] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:06:46] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (LIST authorizationpolicies) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-aux - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:06:54] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3070.esams.wmnet [19:06:57] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3071.esams.wmnet [19:07:28] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for wdqs202[3-5] - pt1979@cumin2002" [19:07:40] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:08:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for wdqs202[3-5] - pt1979@cumin2002" [19:08:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:10:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3315 (T344589)', diff saved to https://phabricator.wikimedia.org/P50687 and previous config saved to /var/cache/conftool/dbconfig/20230821-191013-ladsgroup.json [19:10:15] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-etcd1001.eqiad.wmnet with reason: New kernel, T344587 [19:10:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2023.mgmt.codfw.wmnet with reboot policy FORCED [19:10:39] !log jhathaway@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-etcd1001.eqiad.wmnet with reason: New kernel, T344587 [19:10:53] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-etcd1003.eqiad.wmnet with reason: New kernel, T344587 [19:11:06] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-etcd1003.eqiad.wmnet with reason: New kernel, T344587 [19:11:46] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Papaul) [19:11:46] (KubernetesAPILatency) resolved: (11) High Kubernetes API latency (LIST authorizationpolicies) on k8s-aux@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-aux - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:12:02] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:12:34] PROBLEM - Check systemd state on cp5017 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:46] PROBLEM - Check systemd state on kafka-logging1003 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:12:47] ^ brett that might be the reboot? 
[19:12:50] cp5017 [19:12:56] just check if an agent fixes that [19:13:10] yeah, it's cleared up [19:13:18] oops, no it hasn't [19:15:53] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3071.esams.wmnet [19:15:56] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3072.esams.wmnet [19:16:02] !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs1008.eqiad.wmnet [19:16:13] !log bking@cumin1001 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts wdqs1008.eqiad.wmnet [19:16:25] !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs1008.eqiad.wmnet [19:16:35] !log bking@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts wdqs1008.eqiad.wmnet [19:17:03] !log bking@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts wdqs1009.eqiad.wmnet [19:17:16] !log bking@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts wdqs1009.eqiad.wmnet [19:17:18] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:18:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T344589)', diff saved to https://phabricator.wikimedia.org/P50688 and previous config saved to /var/cache/conftool/dbconfig/20230821-191825-ladsgroup.json [19:18:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P50689 and previous config saved to /var/cache/conftool/dbconfig/20230821-191830-ladsgroup.json [19:18:49] Looks like clean-confd-rundir.service had failed because of "/usr/bin/find: ‘/var/run/confd-template’: No such file or directory" [19:19:36] RECOVERY - 
Check systemd state on cp5017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:50] But it's been fixed? https://gerrit.wikimedia.org/r/c/operations/puppet/+/949496 [19:19:51] brett: yeah, we will need to fix this I guess but for now, the additional run fixed it [19:20:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:20:07] Oh, no, that's a require directive [19:20:08] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:23:15] (03CR) 10Ssingh: "Sorry for commenting on an already merged patch, but we are seeing this on cp hosts as well." [puppet] - 10https://gerrit.wikimedia.org/r/949496 (owner: 10Muehlenhoff) [19:25:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:25:09] Is it because of the command not being set to -type f? [19:25:10] "/usr/bin/find ${run_dir} -mtime +30 -delete" [19:25:11] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3072.esams.wmnet [19:25:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2023.mgmt.codfw.wmnet with reboot policy FORCED [19:25:15] !log fabfur@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3073.esams.wmnet [19:25:26] that'd wipe out the dir in question after 30 days too, wouldn't it? 
[19:25:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2024.mgmt.codfw.wmnet with reboot policy FORCED [19:26:13] brett: I am not sure what the intended usecase here is tbh so I have asked. if it's some ordering issue we can fix, we can submit a patch! [19:26:23] brett, sukhe: see https://gerrit.wikimedia.org/r/c/operations/puppet/+/951141 :-) [19:27:14] moritzm: Sweet, thanks! [19:27:17] moritzm: :D [19:27:18] thanks! [19:27:22] Mind if it gets merged? [19:27:29] yeah good ol' systemd::tmpfile [19:27:44] jouncebot: nowandnext [19:27:44] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [19:27:44] In 0 hour(s) and 32 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230821T2000) [19:27:46] moritzm: shouldn't the timer have a require on the systemd tmpfile as well? or redundant? [19:28:47] the tmpfile gets created early on the system startup, so this should ensure its presence [19:28:59] (03CR) 10BCornwall: [C: 03+1] "Nit: Link to bug T321678" [puppet] - 10https://gerrit.wikimedia.org/r/951141 (owner: 10Filippo Giunchedi) [19:28:59] yeah, that makes sense [19:29:04] but best to leave a comment on task, it's no merged after all:-) [19:29:08] !log run extensions/OATHAuth/maintenance/UpdateForMultipleDevicesSupport.php on checkuserwiki, T242031 [19:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:14] T242031: Allow multiple different 2FA devices - https://phabricator.wikimedia.org/T242031 [19:29:22] Probably best to not assume that it will work just for good measure [19:30:17] (03CR) 10Ssingh: confd: create run_dir via tmpfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/951141 (owner: 10Filippo Giunchedi) [19:30:23] I remembered where I did this before [19:30:28] require => Systemd::Tmpfile['esitest'] [19:30:33] but anyway, left it on the task [19:30:36] thanks moritzm! 
[19:30:54] sukhe: yeah, best leave a comment on gerrit, especially if this has hit you before [19:30:56] RECOVERY - Check systemd state on kafka-logging1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:32:15] (03CR) 10BCornwall: [C: 03+1] confd: create run_dir via tmpfile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/951141 (owner: 10Filippo Giunchedi) [19:33:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P50691 and previous config saved to /var/cache/conftool/dbconfig/20230821-193331-ladsgroup.json [19:33:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T344589)', diff saved to https://phabricator.wikimedia.org/P50692 and previous config saved to /var/cache/conftool/dbconfig/20230821-193337-ladsgroup.json [19:33:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [19:33:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [19:34:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2178 (T344589)', diff saved to https://phabricator.wikimedia.org/P50693 and previous config saved to /var/cache/conftool/dbconfig/20230821-193402-ladsgroup.json [19:34:34] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3073.esams.wmnet [19:37:23] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1008.eqiad.wmnet with reason: T343124 [19:37:26] PROBLEM - Check systemd state on kafka-logging1004 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:37:27] T343124: Migrate WDQS and WCQS servers 
to Debian Bullseye - https://phabricator.wikimedia.org/T343124 [19:37:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2024.mgmt.codfw.wmnet with reboot policy FORCED [19:37:36] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1008.eqiad.wmnet with reason: T343124 [19:38:05] !log bking@wdqs1008 'depooling for firmware update T343124' [19:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host wdqs2025.mgmt.codfw.wmnet with reboot policy FORCED [19:39:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T344589)', diff saved to https://phabricator.wikimedia.org/P50694 and previous config saved to /var/cache/conftool/dbconfig/20230821-193914-ladsgroup.json [19:43:20] !log fabfur@cumin1001 conftool action : set/pooled=yes; selector: dc=esams,cluster=cache_text [19:48:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316', diff saved to https://phabricator.wikimedia.org/P50695 and previous config saved to /var/cache/conftool/dbconfig/20230821-194838-ladsgroup.json [19:51:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs2025.mgmt.codfw.wmnet with reboot policy FORCED [19:51:37] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Papaul) [19:53:12] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2023'] [19:54:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P50696 and previous config saved to /var/cache/conftool/dbconfig/20230821-195420-ladsgroup.json [19:54:39] !log pt1979@cumin2002 END (FAIL) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wdqs2023'] [19:57:21] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2023'] [19:58:24] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2024'] [19:59:48] PROBLEM - Check systemd state on aux-k8s-ctrl1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230821T2000). [20:00:06] Hamishcz: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:06] RECOVERY - Check systemd state on kafka-logging1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:11] I can deploy today [20:00:29] yes im here [20:00:34] PROBLEM - Check systemd state on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:00:44] RECOVERY - Check systemd state on kafka-logging1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:00] (03PS5) 10Urbanecm: Add botadmin group on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940486 (https://phabricator.wikimedia.org/T342484) (owner: 10Hamish) [20:01:05] (03CR) 10Urbanecm: [C: 03+2] Add botadmin group on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940486 (https://phabricator.wikimedia.org/T342484) (owner: 10Hamish) 
[20:01:56] (03Merged) 10jenkins-bot: Add botadmin group on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/940486 (https://phabricator.wikimedia.org/T342484) (owner: 10Hamish) [20:02:14] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:940486|Add botadmin group on eswiki (T342484)]] [20:02:18] T342484: Add botadmin group on eswiki - https://phabricator.wikimedia.org/T342484 [20:02:45] (03CR) 10Urbanecm: [C: 03+2] revalidateLinkRecommendations: Load scoreLessThan correctly [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950812 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm) [20:02:51] (03CR) 10Urbanecm: [C: 03+2] LinkRecommendationUpdater: Load link-recommendation even if disabled [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950813 (https://phabricator.wikimedia.org/T344343) (owner: 10Urbanecm) [20:03:14] PROBLEM - Check systemd state on kafka-logging1005 is CRITICAL: CRITICAL - degraded: The following units failed: clean-confd-rundir.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:03:25] (03PS3) 10Urbanecm: Growth: Remove wgWelcomeSurveyEnableWithHomepage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951151 (https://phabricator.wikimedia.org/T342353) [20:03:43] (03CR) 10Urbanecm: [C: 03+2] Growth: Remove wgWelcomeSurveyEnableWithHomepage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951151 (https://phabricator.wikimedia.org/T342353) (owner: 10Urbanecm) [20:03:44] !log urbanecm@deploy1002 hamishz and urbanecm: Backport for [[gerrit:940486|Add botadmin group on eswiki (T342484)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:03:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1213:3316 (T344589)', diff saved 
to https://phabricator.wikimedia.org/P50697 and previous config saved to /var/cache/conftool/dbconfig/20230821-200344-ladsgroup.json [20:03:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [20:04:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [20:04:09] so it's ok now? this is my first time to deploy, so i'm not so familiar with this [20:04:22] Hamishcz: do you know how to test patches on mwdebug1001 please? [20:04:25] (03Merged) 10jenkins-bot: Growth: Remove wgWelcomeSurveyEnableWithHomepage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951151 (https://phabricator.wikimedia.org/T342353) (owner: 10Urbanecm) [20:04:53] (it's okay if not, just asking) [20:05:14] i read about the manual [20:05:23] but not practically used [20:05:24] !log herron@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka logging-eqiad cluster: Reboot kafka nodes [20:07:31] Hamishcz: okay, that's fine. do you have the extension installed please? [20:07:36] yes [20:07:42] PROBLEM - Check whether ferm is active by checking the default input chain on aux-k8s-ctrl1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:07:55] can you enable it, select `mwdebug1001.eqiad.wmnet` as a debug backend, and then verify the patch works? 
[20:08:15] yea im now trying to do [20:08:15] you can do so for example by going to Special:ListGroupRights on eswiki [20:08:19] !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_esams and A:cp [20:08:20] a second pls [20:08:25] sure [20:08:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:08:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [20:09:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2023'] [20:09:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2024'] [20:09:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P50698 and previous config saved to /var/cache/conftool/dbconfig/20230821-200927-ladsgroup.json [20:09:33] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2023'] [20:09:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2023'] [20:11:27] i found a problem. then can I revise the patch directly? [20:11:34] Hamishcz: which kind of a problem please? 
[20:11:46] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2023'] [20:11:49] supressredirect should be suppressredirect [20:11:59] double "p" at the beginning [20:12:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2023'] [20:12:04] oh [20:12:34] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2023'] [20:12:37] I'm really sorry about that [20:12:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2023'] [20:12:43] Hamishcz: no worries, it happens. [20:12:59] Hamishcz: I'll finish the deployment, as it's a relatively minor issue. can you please upload a new patch that fixes the problem? [20:13:03] !log urbanecm@deploy1002 hamishz and urbanecm: Continuing with sync [20:13:16] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2023'] [20:13:22] Hamishcz: it is only possible to edit patches that were not merged. by starting the deployment, i merged the patch, so the patch now cannot be edited. [20:13:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2023'] [20:13:28] and a new one needs to be uploaded, which fixes this issue [20:13:30] does that make sense? 
[20:13:37] yea thanks a lot [20:14:07] doing right now [20:14:43] sounds good [20:17:26] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2024'] [20:18:14] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2023'] [20:18:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2023'] [20:18:51] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2025'] [20:20:22] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:940486|Add botadmin group on eswiki (T342484)]] (duration: 18m 08s) [20:20:26] T342484: Add botadmin group on eswiki - https://phabricator.wikimedia.org/T342484 [20:20:46] (03Merged) 10jenkins-bot: revalidateLinkRecommendations: Load scoreLessThan correctly [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950812 (https://phabricator.wikimedia.org/T316079) (owner: 10Urbanecm) [20:20:49] (03Merged) 10jenkins-bot: LinkRecommendationUpdater: Load link-recommendation even if disabled [extensions/GrowthExperiments] (wmf/1.41.0-wmf.22) - 10https://gerrit.wikimedia.org/r/950813 (https://phabricator.wikimedia.org/T344343) (owner: 10Urbanecm) [20:21:18] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host wdqs1008.eqiad.wmnet with OS bullseye [20:21:27] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:951151|Growth: Remove wgWelcomeSurveyEnableWithHomepage (T342353 T344619)]], [[gerrit:950812|revalidateLinkRecommendations: Load scoreLessThan correctly (T316079)]], [[gerrit:950813|LinkRecommendationUpdater: Load link-recommendation even if disabled (T344343)]] [20:21:36] T316079: Bump threshold for confidence score on link recommendation service suggestions - https://phabricator.wikimedia.org/T316079 [20:21:36] T344619: Remove wgWelcomeSurveyEnableWithHomepage from 
GrowthExperiments - https://phabricator.wikimedia.org/T344619 [20:21:36] T342353: enable opt-in checkbox on the Welcome Survey allowing new account holders to consent to being contacted for design research - https://phabricator.wikimedia.org/T342353 [20:21:37] T344343: revalidateLinkRecommendations.php cannot run if link-recommendation is disabled in Community configuration - https://phabricator.wikimedia.org/T344343 [20:23:02] !log urbanecm@deploy1002 urbanecm: Backport for [[gerrit:951151|Growth: Remove wgWelcomeSurveyEnableWithHomepage (T342353 T344619)]], [[gerrit:950812|revalidateLinkRecommendations: Load scoreLessThan correctly (T316079)]], [[gerrit:950813|LinkRecommendationUpdater: Load link-recommendation even if disabled (T344343)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:23:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2024'] [20:23:59] (03PS1) 10Hamish: Add botadmin group on eswiki, correction patch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951183 (https://phabricator.wikimedia.org/T342484) [20:24:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T344589)', diff saved to https://phabricator.wikimedia.org/P50699 and previous config saved to /var/cache/conftool/dbconfig/20230821-202433-ladsgroup.json [20:25:30] urbanecm, does this look good? 
[20:25:49] let me have a look [20:25:52] !log urbanecm@deploy1002 urbanecm: Continuing with sync [20:26:00] (I'm deploying my own patch in the meantime) [20:26:25] (03PS2) 10Urbanecm: Add botadmin group on eswiki, correction patch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951183 (https://phabricator.wikimedia.org/T342484) (owner: 10Hamish) [20:26:26] it's okay, I will wait for you [20:26:30] (03CR) 10Urbanecm: [C: 03+2] Add botadmin group on eswiki, correction patch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951183 (https://phabricator.wikimedia.org/T342484) (owner: 10Hamish) [20:26:36] looks good to me [20:26:41] will deploy once my patch finishes :) [20:27:10] (03Merged) 10jenkins-bot: Add botadmin group on eswiki, correction patch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/951183 (https://phabricator.wikimedia.org/T342484) (owner: 10Hamish) [20:27:25] so many thanks :) and ltns [20:29:31] Ty, Hamishcz and urbanecm :) [20:29:47] PROBLEM - Check whether ferm is active by checking the default input chain on aux-k8s-ctrl1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [20:30:36] LuchoCR, sorry again :) I was attending without this programming laptop so I missed backport window last week [20:30:45] (03PS1) 10Papaul: Add wdqs202[3-5] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/951184 (https://phabricator.wikimedia.org/T342659) [20:30:59] no prob :) [20:31:01] RECOVERY - Check systemd state on kafka-logging1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:32:30] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:951151|Growth: Remove wgWelcomeSurveyEnableWithHomepage (T342353 T344619)]], [[gerrit:950812|revalidateLinkRecommendations: Load scoreLessThan correctly (T316079)]], 
[[gerrit:950813|LinkRecommendationUpdater: Load link-recommendation even if disabled (T344343)]] (duration: 11m 02s) [20:32:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:32:38] T316079: Bump threshold for confidence score on link recommendation service suggestions - https://phabricator.wikimedia.org/T316079 [20:32:38] T344619: Remove wgWelcomeSurveyEnableWithHomepage from GrowthExperiments - https://phabricator.wikimedia.org/T344619 [20:32:39] T342353: enable opt-in checkbox on the Welcome Survey allowing new account holders to consent to being contacted for design research - https://phabricator.wikimedia.org/T342353 [20:32:39] T344343: revalidateLinkRecommendations.php cannot run if link-recommendation is disabled in Community configuration - https://phabricator.wikimedia.org/T344343 [20:32:51] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:951183|Add botadmin group on eswiki, correction patch (T342484)]] [20:32:54] T342484: Add botadmin group on eswiki - https://phabricator.wikimedia.org/T342484 [20:33:47] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1008.eqiad.wmnet with reason: host reimage [20:34:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2025'] [20:34:19] !log urbanecm@deploy1002 hamishz and urbanecm: Backport for [[gerrit:951183|Add botadmin group on eswiki, correction patch (T342484)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:34:26] Hamishcz: can you test it again please? 
[20:34:28] (on mwdebug1001) [20:34:31] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wdqs2025'] [20:34:52] doing [20:35:22] (03CR) 10Papaul: [C: 03+2] Add wdqs202[3-5] to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/951184 (https://phabricator.wikimedia.org/T342659) (owner: 10Papaul) [20:36:36] yea it works perfectly IMO [20:36:54] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1008.eqiad.wmnet with reason: host reimage [20:37:32] great, proceeding [20:37:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:37:34] !log urbanecm@deploy1002 hamishz and urbanecm: Continuing with sync [20:41:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wdqs2025'] [20:41:18] so the process for me is done now? [20:41:36] Hamishcz: almost. please wait for the deployment to finish :) [20:42:07] np [20:45:13] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:951183|Add botadmin group on eswiki, correction patch (T342484)]] (duration: 12m 22s) [20:45:17] T342484: Add botadmin group on eswiki - https://phabricator.wikimedia.org/T342484 [20:45:20] Hamishcz: and deployed [20:45:22] anything else? :) [20:45:42] no for me, appreciated [20:45:46] :) [20:45:58] happy to help! 
[20:48:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2024.codfw.wmnet with OS bullseye [20:48:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE, 10Patch-For-Review: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye [20:50:07] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE, 10Patch-For-Review: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Papaul) [20:51:22] (03PS1) 10Cathal Mooney: Announce Anycast prefixes from esams [homer/public] - 10https://gerrit.wikimedia.org/r/951186 (https://phabricator.wikimedia.org/T329219) [20:57:51] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2025.codfw.wmnet with OS bullseye [20:58:02] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE, 10Patch-For-Review: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2025.codfw.wmnet with OS bullseye [20:59:06] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1008.eqiad.wmnet with OS bullseye [21:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230821T2100). [21:05:48] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE, 10Patch-For-Review: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10Papaul) @Jhancock.wm when you are back onsite can you please check the network cable for wdqs2023 both 10G nic's are showing down. Thanks [21:09:55] (03CR) 10Ssingh: [C: 03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/951186 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [21:11:33] (03CR) 10Cathal Mooney: [C: 03+2] Announce Anycast prefixes from esams [homer/public] - 10https://gerrit.wikimedia.org/r/951186 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [21:12:07] (03Merged) 10jenkins-bot: Announce Anycast prefixes from esams [homer/public] - 10https://gerrit.wikimedia.org/r/951186 (https://phabricator.wikimedia.org/T329219) (owner: 10Cathal Mooney) [21:14:18] (KubernetesAPILatency) firing: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:19:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:22:24] Hey all - I have a couple of quick security mitigation updates to deploy [21:25:29] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:31:20] !log Deployed updated security mitigation for T336027 [21:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:42] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2024.codfw.wmnet with OS bullseye [21:35:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye executed with errors: - wdqs2... 
[21:36:41] !log mwmaint1002: foreachwikiindblist 'group2 & s6' extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all --touched-after=20230615000000 (T315510; restart, per T315510#9107776) [21:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:50] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [21:37:31] sre-alert-triage, cloud-services-team: Alert triage: Adjust severity of backup_cinder_volumes from critical to warning - https://phabricator.wikimedia.org/T342764 (BCornwall) [21:45:32] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2025.codfw.wmnet with OS bullseye [21:45:37] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2025.codfw.wmnet with OS bullseye executed with errors: - wdqs2... 
[22:16:41] (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:33:13] (PS2) Cwhite: admin: add amyt to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/948686 (https://phabricator.wikimedia.org/T344199) [22:33:57] (CR) CI reject: [V: -1] admin: add amyt to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/948686 (https://phabricator.wikimedia.org/T344199) (owner: Cwhite) [22:35:21] (PS3) Cwhite: admin: add amyt to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/948686 (https://phabricator.wikimedia.org/T344199) [22:36:59] (CR) Cwhite: [C: +2] admin: add amyt to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/948686 (https://phabricator.wikimedia.org/T344199) (owner: Cwhite) [22:41:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2024.codfw.wmnet with OS bullseye [22:41:38] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye [22:48:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2024.codfw.wmnet with reason: host reimage [22:51:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2024.codfw.wmnet with reason: host reimage [22:52:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs2025.codfw.wmnet with OS bullseye [22:52:36] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage 
was started by pt1979@cumin2002 for host wdqs2025.codfw.wmnet with OS bullseye [22:54:33] (RedisMemoryFull) firing: (2) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:58:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2025.codfw.wmnet with reason: host reimage [23:02:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2025.codfw.wmnet with reason: host reimage [23:07:49] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:08:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:08:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2024.codfw.wmnet with OS bullseye [23:08:57] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2024.codfw.wmnet with OS bullseye completed: - wdqs2024 (**PASS... 
[23:16:25] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:22:33] (PS1) BCornwall: sre.cdn.roll-reboot: Reduce min_grace_sleep to 300 [cookbooks] - https://gerrit.wikimedia.org/r/951196 [23:26:07] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [23:39:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:39:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs2025.codfw.wmnet with OS bullseye [23:39:19] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host wdqs2025.codfw.wmnet with OS bullseye completed: - wdqs2025 (**PASS... [23:39:56] SRE, ops-codfw, DC-Ops, Data-Platform-SRE: Q1:rack/setup/install wdqs20[23-25].codfw.wmnet - https://phabricator.wikimedia.org/T342659 (Papaul) [23:51:18] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_esams and A:cp [23:58:03] (CR) Ssingh: [C: +1] sre.cdn.roll-reboot: Reduce min_grace_sleep to 300 [cookbooks] - https://gerrit.wikimedia.org/r/951196 (owner: BCornwall)