[00:00:04] RoanKattouw and Urbanecm: Dear deployers, time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220208T0000).
[00:00:04] No Gerrit patches in the queue for this window AFAICS.
[00:00:38] (PS3) Ryan Kemper: elasticsearch: new masters for psi cluster [puppet] - https://gerrit.wikimedia.org/r/760684 (https://phabricator.wikimedia.org/T294805) (owner: Bking)
[00:00:48] (CR) Ryan Kemper: [V: +2 C: +2] elasticsearch: new masters for psi cluster [puppet] - https://gerrit.wikimedia.org/r/760684 (https://phabricator.wikimedia.org/T294805) (owner: Bking)
[00:05:40] !log T294805 new psi masters `elastic1073`, `elastic1075`, and `elastic1083` are in
[00:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:05:44] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805
[00:10:57] RECOVERY - Check systemd state on elastic1070 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:12:25] !log T294805 old psi masters are out, done with all elastic master operations
[00:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:29] !log T294805 Re-enabling puppet across eqiad elastic fleet: `ryankemper@cumin1001:~$ sudo cumin -b 8 'elastic1*' 'sudo enable-puppet "Add new eqiad replacement hosts elastic10[68-83] - T294805 - root" && sudo run-puppet-agent'` tmux session `elastic`
[00:12:30] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805
[00:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:33] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes
https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:22:53] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[00:23:03] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:25:07] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:29:39] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[00:31:11] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:32:10] (PS2) Cwhite: logstash: use java home from profile::java [puppet] - https://gerrit.wikimedia.org/r/759757 (https://phabricator.wikimedia.org/T300853)
[00:32:14] SRE, serviceops, Patch-For-Review: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 (Dzahn) Open→Resolved
[00:32:37] (CR) Cwhite: logstash: use java home from profile::java (1 comment) [puppet] - https://gerrit.wikimedia.org/r/759757 (https://phabricator.wikimedia.org/T300853) (owner: Cwhite)
[00:35:19] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[00:35:33] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:44:27] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:46:09] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:49:17] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:50:31] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:50:53] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:50:59] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[00:52:53] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:53:17] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:58:01] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[01:00:23] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[01:06:55] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[01:09:15] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[01:12:07] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[01:19:09] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[01:24:07] SRE-Access-Requests, Data-Engineering: Give bmansurov access necessary to support Research Airflow jobs - https://phabricator.wikimedia.org/T301215 (bmansurov)
[01:28:35] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[01:32:41] SRE, ops-codfw, DC-Ops, serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (Papaul)
[01:37:59] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[01:45:03] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[01:56:51] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220208T0200)
[02:03:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:05:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:05] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:07:24] (PS1) TrainBranchBot: Branch commit for wmf/1.38.0-wmf.21 [core] (wmf/1.38.0-wmf.21) - https://gerrit.wikimedia.org/r/760693
[02:07:26] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.38.0-wmf.21 [core] (wmf/1.38.0-wmf.21) - https://gerrit.wikimedia.org/r/760693 (owner: TrainBranchBot)
[02:07:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:59] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[02:18:05] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[02:21:31] (Merged) jenkins-bot: Branch commit for wmf/1.38.0-wmf.21 [core] (wmf/1.38.0-wmf.21) - https://gerrit.wikimedia.org/r/760693 (owner: TrainBranchBot)
[02:28:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:29:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn
[02:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn
[02:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:39:21] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[02:48:55] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[02:55:57] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[02:57:47] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster
https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[02:58:19] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[03:04:51] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[03:10:07] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={CREATE,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[03:19:39] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[03:26:45] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[03:33:21] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[03:37:59] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[03:40:51] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[03:50:21] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[03:55:07] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[03:57:29] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[04:09:23] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[04:10:35] (CR) Andrew Bogott: [C: +2] codfw1dev network tests: Switch to new puppetmaster [puppet] - https://gerrit.wikimedia.org/r/760642 (owner: Andrew Bogott)
[04:11:18] (CR) Andrew Bogott: [C: +2] labspuppetbackend: fix a race condition with logfile ownership [puppet] - https://gerrit.wikimedia.org/r/760632 (owner: Andrew Bogott)
[04:13:06] (PS1) Andrew Bogott: cinder backups: include the maps nfs volume [puppet] - https://gerrit.wikimedia.org/r/760703 (https://phabricator.wikimedia.org/T300694)
[04:13:50] (CR) Andrew Bogott: [C: +2] cinder backups: include the maps nfs volume [puppet] -
https://gerrit.wikimedia.org/r/760703 (https://phabricator.wikimedia.org/T300694) (owner: Andrew Bogott)
[04:15:34] (CR) Andrew Bogott: [C: +2] Revert "nfs-mounts.yaml.erb: temporarily mount 'maps' in cloudinfra-nfs" [puppet] - https://gerrit.wikimedia.org/r/758913 (owner: Andrew Bogott)
[04:15:55] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[04:21:13] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[04:26:03] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[04:30:11] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[04:37:57] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[04:45:01] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[04:51:37] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[04:54:37] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[05:01:11] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[05:04:07] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=UPDATE https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[05:05:57] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28
[05:13:39] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[05:20:45] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[05:27:57] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[05:40:45] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[05:42:17] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[05:44:39] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[05:45:27] <_joe_> someone should take a look at mobileapps and understand what's going on with that response
[05:56:15] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:58:59] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[06:02:25] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[06:03:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove contributions group from s1 eqiad T263127', diff saved to https://phabricator.wikimedia.org/P20236 and previous config saved to /var/cache/conftool/dbconfig/20220208-060310-marostegui.json
[06:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:15] T263127: Remove groups from db configs - https://phabricator.wikimedia.org/T263127
[06:04:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:04:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[06:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:37] !log marostegui@cumin1001 START -
Cookbook sre.hosts.downtime for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[06:09:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[06:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T300775)', diff saved to https://phabricator.wikimedia.org/P20237 and previous config saved to /var/cache/conftool/dbconfig/20220208-060943-marostegui.json
[06:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:48] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[06:13:21] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[06:17:13] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:19:35] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:20:27] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[06:20:31] (PS1) Marostegui: db2134: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/760814 (https://phabricator.wikimedia.org/T300835)
[06:21:27] (CR) Marostegui: [C: +2] db2134: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/760814 (https://phabricator.wikimedia.org/T300835) (owner: Marostegui)
[06:22:08] !log marostegui@cumin1001 START - Cookbook
sre.hosts.reimage for host db2134.codfw.wmnet with OS bullseye
[06:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:56] (PS1) Marostegui: Revert "db2096: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/760556
[06:24:43] (CR) Marostegui: [C: +2] Revert "db2096: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/760556 (owner: Marostegui)
[06:25:00] SRE, ops-codfw, DBA: x1 codfw master crashed due to faulty DIMM - https://phabricator.wikimedia.org/T300965 (Marostegui) Icinga all green for this host, so I have re-enabled notifications
[06:25:15] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[06:25:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[06:25:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance
[06:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance
[06:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance
[06:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:09] PROBLEM - haproxy failover on dbproxy2003 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:29:17] (PS1) Marostegui: misc_multiinstance.my.cnf:
innodb_adaptive_hash_index=OFF [puppet] - https://gerrit.wikimedia.org/r/760815 (https://phabricator.wikimedia.org/T268869)
[06:31:24] (CR) Marostegui: [C: +2] misc_multiinstance.my.cnf: innodb_adaptive_hash_index=OFF [puppet] - https://gerrit.wikimedia.org/r/760815 (https://phabricator.wikimedia.org/T268869) (owner: Marostegui)
[06:33:00] dbproxy alert is expected
[06:48:29] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={LIST,UPDATE} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[06:53:11] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[06:55:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2134.codfw.wmnet with OS bullseye
[06:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:27] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:03:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[07:03:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[07:03:35] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T300402)', diff saved to https://phabricator.wikimedia.org/P20238 and previous config saved to /var/cache/conftool/dbconfig/20220208-070339-marostegui.json
[07:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:43] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402
[07:08:31] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:10:17] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:19:01] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:21:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T300402)', diff saved to https://phabricator.wikimedia.org/P20239 and previous config saved to /var/cache/conftool/dbconfig/20220208-072155-marostegui.json
[07:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:00] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402
[07:23:44]
RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:27:02] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:27:40] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp5013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[07:28:40] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp5013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[07:28:42] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp5013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[07:29:10] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp5013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[07:29:18] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[07:30:46] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp5013 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish
[07:31:33] (CR) Elukey: [V: +1 C: +2] profile::docker::engine: add param to ignore docker storage settings [puppet] - https://gerrit.wikimedia.org/r/759678 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[07:31:50]
RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp5013 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [07:31:50] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp5013 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [07:32:18] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp5013 is OK: HTTP OK: HTTP/1.1 200 OK - 473 bytes in 0.450 second response time https://wikitech.wikimedia.org/wiki/Varnish [07:33:12] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:36:13] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:37:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P20240 and previous config saved to /var/cache/conftool/dbconfig/20220208-073659-marostegui.json [07:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:05] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/759757 (https://phabricator.wikimedia.org/T300853) (owner: 10Cwhite) [07:44:42] (03PS1) 10Marostegui: Revert "db2134: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/760558 [07:44:50] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:45:37] (03CR) 10Marostegui: [C: 03+2] Revert "db2134: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/760558 (owner: 10Marostegui) [07:50:40] (03PS1) 10Majavah: hieradata: cloudinfra: refresh puppetmaster list [puppet] - 10https://gerrit.wikimedia.org/r/760879 [07:52:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P20241 and previous config saved to /var/cache/conftool/dbconfig/20220208-075204-marostegui.json [07:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T300775)', diff saved to https://phabricator.wikimedia.org/P20242 and previous config saved to /var/cache/conftool/dbconfig/20220208-075254-marostegui.json [07:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:58] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [07:57:46] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [07:59:06] (03PS1) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [08:02:01] (03CR) 10jerkins-bot: [V: 04-1] Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [08:07:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T300402)', diff saved to https://phabricator.wikimedia.org/P20243 and previous config saved to /var/cache/conftool/dbconfig/20220208-080709-marostegui.json [08:07:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [08:07:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [08:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:15] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [08:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P20244 and previous config saved to /var/cache/conftool/dbconfig/20220208-080758-marostegui.json [08:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:43] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:13:00] RECOVERY - haproxy failover on 
dbproxy2003 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [08:15:31] (03CR) 10Majavah: Add cookbooks for running maintain-views (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [08:18:05] (03PS1) 10Marostegui: db1115: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/760883 (https://phabricator.wikimedia.org/T297605) [08:18:48] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:19:03] (03CR) 10Marostegui: [C: 03+2] db1115: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/760883 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [08:20:09] !log Stop MySQL on db1115 to backup tendril T297605 [08:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:14] T297605: Shutdown Tendril and dbtree - https://phabricator.wikimedia.org/T297605 [08:23:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P20245 and previous config saved to /var/cache/conftool/dbconfig/20220208-082303-marostegui.json [08:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:26] PROBLEM - SSH on bast3005 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:24:30] RECOVERY - SSH on bast3005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:27:45] (03PS1) 10Marostegui: db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/760884 (https://phabricator.wikimedia.org/T297605) [08:28:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
dbstore1007.eqiad.wmnet with reason: Maintenance [08:28:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [08:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:10] PROBLEM - Check systemd state on dbmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: tendril-5m.service,tendril-queries.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:28:18] ^ expected [08:28:28] (03CR) 10Marostegui: [C: 03+2] db2093: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/760884 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [08:32:09] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [08:32:38] PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:32:56] PROBLEM - Check systemd state on prometheus1005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:33:18] ^ expected [08:33:24] PROBLEM - Check systemd state on prometheus2004 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:34:08] PROBLEM - Check systemd state on prometheus2005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[08:35:55] (03PS1) 10Marostegui: tendril: Disable systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/760885 (https://phabricator.wikimedia.org/T297605) [08:38:02] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [08:38:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T300775)', diff saved to https://phabricator.wikimedia.org/P20246 and previous config saved to /var/cache/conftool/dbconfig/20220208-083808-marostegui.json [08:38:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [08:38:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance [08:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:13] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [08:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T300775)', diff saved to https://phabricator.wikimedia.org/P20247 and previous config saved to /var/cache/conftool/dbconfig/20220208-083815-marostegui.json [08:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:28] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [08:43:20] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:47:12] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [08:47:58] PROBLEM - Check systemd state on prometheus1004 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:48:08] RECOVERY - MariaDB memory on db1115 is OK: OK Memory 2% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:48:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [08:48:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [08:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:51] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Depooling db1112 (T300402)', diff saved to https://phabricator.wikimedia.org/P20248 and previous config saved to /var/cache/conftool/dbconfig/20220208-084851-marostegui.json [08:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:55] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [08:49:39] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] sslcert: additional search paths for certificates [puppet] - 10https://gerrit.wikimedia.org/r/716370 (https://phabricator.wikimedia.org/T290261) (owner: 10Filippo Giunchedi) [08:51:34] (03CR) 10Ladsgroup: [C: 03+1] tendril: Disable systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/760885 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [08:51:50] (03CR) 10Marostegui: [C: 03+2] tendril: Disable systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/760885 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [08:52:13] PROBLEM - Check systemd state on prometheus2003 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:57:33] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:58:15] (03PS1) 10Marostegui: tendril: Remove systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/760888 (https://phabricator.wikimedia.org/T297605) [08:58:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10User-fgiunchedi: Search public keys in additional places for sslcert::certificate - https://phabricator.wikimedia.org/T290261 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is done! `sslcert::certificate` will first search f... 
[09:00:53] (03PS1) 10Ema: Revert "ATS: lower number of allowed Lua states on cp3050" [puppet] - 10https://gerrit.wikimedia.org/r/760889 (https://phabricator.wikimedia.org/T265625) [09:04:41] (03CR) 10Ema: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33609/console" [puppet] - 10https://gerrit.wikimedia.org/r/760889 (https://phabricator.wikimedia.org/T265625) (owner: 10Ema) [09:09:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T300402)', diff saved to https://phabricator.wikimedia.org/P20249 and previous config saved to /var/cache/conftool/dbconfig/20220208-090906-marostegui.json [09:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:11] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [09:10:21] (03PS1) 10Elukey: Add ml-serve2005 to the ml-serve-codfw k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/760892 (https://phabricator.wikimedia.org/T300744) [09:10:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:10:54] (03CR) 10jerkins-bot: [V: 04-1] Add ml-serve2005 to the ml-serve-codfw k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/760892 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [09:10:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:46] (03PS2) 10Elukey: Add ml-serve2005 to the ml-serve-codfw k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/760892 (https://phabricator.wikimedia.org/T300744) [09:13:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1168 (T300775)', diff saved to https://phabricator.wikimedia.org/P20250 and previous config saved to /var/cache/conftool/dbconfig/20220208-091349-marostegui.json [09:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:54] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [09:15:51] (03PS1) 10Ladsgroup: microsites: Remove link to tendril-legacy [puppet] - 10https://gerrit.wikimedia.org/r/760893 (https://phabricator.wikimedia.org/T297605) [09:16:39] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:17:39] (03CR) 10Marostegui: [C: 03+1] microsites: Remove link to tendril-legacy [puppet] - 10https://gerrit.wikimedia.org/r/760893 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [09:18:32] (03CR) 10Ladsgroup: [C: 03+2] tendril: Remove systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/760888 (https://phabricator.wikimedia.org/T297605) (owner: 10Marostegui) [09:20:11] (03PS1) 10Ladsgroup: wikimedia.org: Drop tendril-legacy [dns] - 10https://gerrit.wikimedia.org/r/760894 (https://phabricator.wikimedia.org/T297605) [09:20:28] (03PS2) 10Ladsgroup: microsites: Remove link to tendril-legacy [puppet] - 10https://gerrit.wikimedia.org/r/760893 (https://phabricator.wikimedia.org/T297605) [09:20:35] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] microsites: Remove link to tendril-legacy [puppet] - 10https://gerrit.wikimedia.org/r/760893 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [09:21:50] (03CR) 10Volans: "The vast majority of the cookbook is just printing instruction to the user, instead of actually performing the actions. 
I just reviewed th" [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [09:22:32] ACKNOWLEDGEMENT - Check systemd state on dbmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: tendril-5m.service,tendril-queries.service Marostegui known https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:23:21] (03PS1) 10Ema: Revert "cache: test atskafka webrequest on cp3050" [puppet] - 10https://gerrit.wikimedia.org/r/760895 (https://phabricator.wikimedia.org/T247497) [09:23:25] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:24:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P20251 and previous config saved to /var/cache/conftool/dbconfig/20220208-092410-marostegui.json [09:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:45] RECOVERY - Check systemd state on dbmonitor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:30] (03CR) 10Marostegui: [C: 03+1] wikimedia.org: Drop tendril-legacy [dns] - 10https://gerrit.wikimedia.org/r/760894 (https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [09:28:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P20252 and previous config saved to /var/cache/conftool/dbconfig/20220208-092853-marostegui.json [09:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:25] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wikimedia.org: Drop tendril-legacy [dns] - 10https://gerrit.wikimedia.org/r/760894 
(https://phabricator.wikimedia.org/T297605) (owner: 10Ladsgroup) [09:30:35] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:33:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [09:33:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [09:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T298554)', diff saved to https://phabricator.wikimedia.org/P20253 and previous config saved to /var/cache/conftool/dbconfig/20220208-093315-ladsgroup.json [09:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:19] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [09:33:22] (03CR) 10JMeybohm: Add ingress support to miscweb chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/757935 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [09:37:41] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:39:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P20254 and previous config saved to /var/cache/conftool/dbconfig/20220208-093915-marostegui.json [09:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:25] 
PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:43:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P20255 and previous config saved to /var/cache/conftool/dbconfig/20220208-094358-marostegui.json [09:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:09] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:50:25] <_joe_> !issync [09:50:26] Syncing #wikimedia-operations (requested by joe_oblivian) [09:50:27] Set /cs flags #wikimedia-operations Emperor +Aiotv [09:50:29] Set /cs flags #wikimedia-operations jynus +Aiotv [09:51:47] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [09:54:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T300402)', diff saved to https://phabricator.wikimedia.org/P20256 and previous config saved to /var/cache/conftool/dbconfig/20220208-095420-marostegui.json [09:54:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [09:54:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [09:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:25] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [09:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T300402)', diff saved to https://phabricator.wikimedia.org/P20257 and previous config saved to /var/cache/conftool/dbconfig/20220208-095427-marostegui.json [09:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298554)', diff saved to https://phabricator.wikimedia.org/P20258 and previous config saved to /var/cache/conftool/dbconfig/20220208-095900-ladsgroup.json [09:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:06] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [09:59:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T300775)', diff saved to https://phabricator.wikimedia.org/P20259 and previous config saved to 
/var/cache/conftool/dbconfig/20220208-095909-marostegui.json [09:59:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:59:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:13] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [09:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T300775)', diff saved to https://phabricator.wikimedia.org/P20260 and previous config saved to /var/cache/conftool/dbconfig/20220208-095916-marostegui.json [09:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:31] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:08:15] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:09:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T300775)', diff saved to https://phabricator.wikimedia.org/P20261 and previous config saved to /var/cache/conftool/dbconfig/20220208-100926-marostegui.json [10:09:26] (03PS1) 10Elukey: Add ml-serve2005 to the ml-serve codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/760903 (https://phabricator.wikimedia.org/T300744) [10:09:30] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:31] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [10:09:36] !log updates scap to 4.3.0 on all hosts - T300804 [10:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:40] T300804: Deploy Scap version 4.3.0 - https://phabricator.wikimedia.org/T300804 [10:10:37] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:12:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [10:12:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [10:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T300510)', diff saved to https://phabricator.wikimedia.org/P20262 and previous config saved to /var/cache/conftool/dbconfig/20220208-101238-ladsgroup.json [10:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:42] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [10:13:24] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33612/console" [puppet] - 10https://gerrit.wikimedia.org/r/760895 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [10:14:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P20263 and previous config saved to 
/var/cache/conftool/dbconfig/20220208-101404-ladsgroup.json [10:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:36] (03PS1) 10Ladsgroup: db1162: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/760904 (https://phabricator.wikimedia.org/T300510) [10:15:23] (03PS2) 10Ladsgroup: db1162: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/760904 (https://phabricator.wikimedia.org/T300510) [10:15:29] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1162: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/760904 (https://phabricator.wikimedia.org/T300510) (owner: 10Ladsgroup) [10:16:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T300402)', diff saved to https://phabricator.wikimedia.org/P20264 and previous config saved to /var/cache/conftool/dbconfig/20220208-101631-marostegui.json [10:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:38] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [10:17:01] (03CR) 10Elukey: [V: 03+1 C: 03+1] Revert "cache: test atskafka webrequest on cp3050" [puppet] - 10https://gerrit.wikimedia.org/r/760895 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [10:18:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1162.eqiad.wmnet with OS bullseye [10:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:31] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "could you please drop hieradata/cloud/eqiad1/cloudinfra/hosts/cloud-puppetmaster-04.yaml in a follow up patch?" 
[puppet] - 10https://gerrit.wikimedia.org/r/760879 (owner: 10Majavah) [10:24:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P20265 and previous config saved to /var/cache/conftool/dbconfig/20220208-102430-marostegui.json [10:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P20266 and previous config saved to /var/cache/conftool/dbconfig/20220208-102909-ladsgroup.json [10:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P20267 and previous config saved to /var/cache/conftool/dbconfig/20220208-103137-marostegui.json [10:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:14] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [10:39:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P20268 and previous config saved to /var/cache/conftool/dbconfig/20220208-103935-marostegui.json [10:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:53] (03PS3) 10Zabe: MWMultiVersion: move ombudsmen.wikimedia.org to ombuds.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/756734 (https://phabricator.wikimedia.org/T273323) [10:40:44] (03CR) 10Ladsgroup: admin: Fully deprecate sc-admins group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759219 (owner: 10Ladsgroup) 
[10:43:12] !log update pcc facts [10:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298554)', diff saved to https://phabricator.wikimedia.org/P20269 and previous config saved to /var/cache/conftool/dbconfig/20220208-104414-ladsgroup.json [10:44:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [10:44:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [10:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:19] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [10:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T298554)', diff saved to https://phabricator.wikimedia.org/P20270 and previous config saved to /var/cache/conftool/dbconfig/20220208-104421-ladsgroup.json [10:44:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P20271 and previous config saved to /var/cache/conftool/dbconfig/20220208-104642-marostegui.json [10:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1162.eqiad.wmnet with OS bullseye [10:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:29] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33614/console" [puppet] - 10https://gerrit.wikimedia.org/r/760903 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [10:53:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T300510)', diff saved to https://phabricator.wikimedia.org/P20272 and previous config saved to /var/cache/conftool/dbconfig/20220208-105356-ladsgroup.json [10:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:01] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [10:54:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T300775)', diff saved to https://phabricator.wikimedia.org/P20273 and previous config saved to /var/cache/conftool/dbconfig/20220208-105440-marostegui.json [10:54:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [10:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance [10:54:44] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [10:54:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:54] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T300775)', diff saved to https://phabricator.wikimedia.org/P20274 and previous config saved to /var/cache/conftool/dbconfig/20220208-105453-marostegui.json [10:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:50] (03PS2) 10Elukey: Add ml-serve2005 to the ml-serve codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/760903 (https://phabricator.wikimedia.org/T300744) [10:58:35] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33615/console" [puppet] - 10https://gerrit.wikimedia.org/r/760903 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [10:59:14] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2019.codfw.wmnet with OS buster [10:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T300402)', diff saved to https://phabricator.wikimedia.org/P20275 and previous config saved to /var/cache/conftool/dbconfig/20220208-110147-marostegui.json [11:01:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [11:01:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [11:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:52] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [11:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T300402)', diff saved to https://phabricator.wikimedia.org/P20276 and 
previous config saved to /var/cache/conftool/dbconfig/20220208-110154-marostegui.json [11:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:57] (03PS1) 10Filippo Giunchedi: hieradata: add ntp_servers for Pontoon [puppet] - 10https://gerrit.wikimedia.org/r/760908 [11:03:57] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add ntp_servers for Pontoon [puppet] - 10https://gerrit.wikimedia.org/r/760908 (owner: 10Filippo Giunchedi) [11:06:11] (03PS1) 10Jbond: realm.pp: try and fix populate puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/760909 (https://phabricator.wikimedia.org/T248169) [11:06:44] !log hnowlan@cumin1001 START - Cookbook sre.hosts.reimage for host restbase2020.codfw.wmnet with OS buster [11:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:55] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (10cmooney) [11:09:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P20277 and previous config saved to /var/cache/conftool/dbconfig/20220208-110901-ladsgroup.json [11:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:03] (03CR) 10Jbond: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/760892 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [11:11:14] (03PS1) 10Kormat: prometheus: Temporarily switch to db2093 [puppet] - 10https://gerrit.wikimedia.org/r/760911 (https://phabricator.wikimedia.org/T297605) [11:11:22] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:11:29] (03CR) 10JMeybohm: 
"I think it would be nice to give some guidance on how to properly call the tests with particular arguments (like: "just run for chart X" o" [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 (owner: 10Giuseppe Lavagetto) [11:11:50] (03CR) 10jerkins-bot: [V: 04-1] prometheus: Temporarily switch to db2093 [puppet] - 10https://gerrit.wikimedia.org/r/760911 (https://phabricator.wikimedia.org/T297605) (owner: 10Kormat) [11:12:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33616/console" [puppet] - 10https://gerrit.wikimedia.org/r/760909 (https://phabricator.wikimedia.org/T248169) (owner: 10Jbond) [11:12:47] (03PS2) 10Kormat: prometheus: Temporarily switch to db2093 [puppet] - 10https://gerrit.wikimedia.org/r/760911 (https://phabricator.wikimedia.org/T297605) [11:14:36] (03CR) 10Jbond: [V: 03+1 C: 03+2] realm.pp: try and fix populate puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/760909 (https://phabricator.wikimedia.org/T248169) (owner: 10Jbond) [11:15:16] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:15:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298554)', diff saved to https://phabricator.wikimedia.org/P20278 and previous config saved to /var/cache/conftool/dbconfig/20220208-111540-ladsgroup.json [11:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:46] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [11:16:07] (03PS1) 104nn1l2: cowikimedia: Let admins grant confirmed and accountcreator flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760912 (https://phabricator.wikimedia.org/T300948) [11:18:08] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [11:20:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T300402)', diff saved to https://phabricator.wikimedia.org/P20279 and previous config saved to /var/cache/conftool/dbconfig/20220208-112042-marostegui.json [11:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:47] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [11:22:46] jouncebot: now [11:22:47] No deployments scheduled for the next 0 hour(s) and 37 minute(s) [11:24:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P20280 and previous config saved to /var/cache/conftool/dbconfig/20220208-112406-ladsgroup.json [11:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:37] (03CR) 10Ladsgroup: [C: 03+1] prometheus: Temporarily switch to 
db2093 [puppet] - 10https://gerrit.wikimedia.org/r/760911 (https://phabricator.wikimedia.org/T297605) (owner: 10Kormat) [11:26:58] (03CR) 10Kormat: [C: 03+2] prometheus: Temporarily switch to db2093 [puppet] - 10https://gerrit.wikimedia.org/r/760911 (https://phabricator.wikimedia.org/T297605) (owner: 10Kormat) [11:27:20] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:28:22] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:29:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T300775)', diff saved to https://phabricator.wikimedia.org/P20281 and previous config saved to /var/cache/conftool/dbconfig/20220208-112909-marostegui.json [11:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:13] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [11:30:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P20282 and previous config saved to /var/cache/conftool/dbconfig/20220208-113045-ladsgroup.json [11:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:40] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [11:33:42] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:35:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P20283 and previous config saved to /var/cache/conftool/dbconfig/20220208-113547-marostegui.json [11:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:06] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [11:39:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T300510)', diff saved to https://phabricator.wikimedia.org/P20284 and previous config saved to /var/cache/conftool/dbconfig/20220208-113910-ladsgroup.json [11:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:16] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [11:40:36] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:41:44] (03PS1) 10Ladsgroup: Revert "db1162: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/760561 [11:42:10] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1162: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/760561 (owner: 10Ladsgroup) [11:42:14] (03PS2) 10Ladsgroup: Revert "db1162: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/760561 [11:42:16] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1162: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/760561 (owner: 10Ladsgroup) [11:43:42] RECOVERY - Check systemd state on prometheus1004 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:06] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:44:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P20285 and previous config saved to /var/cache/conftool/dbconfig/20220208-114413-marostegui.json [11:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:21] (03PS1) 10Ladsgroup: db1182: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/760919 (https://phabricator.wikimedia.org/T300510) [11:45:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P20286 and previous config saved to /var/cache/conftool/dbconfig/20220208-114549-ladsgroup.json [11:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:54] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] db1182: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/760919 (https://phabricator.wikimedia.org/T300510) (owner: 10Ladsgroup) [11:46:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:46:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T300510)', diff saved to https://phabricator.wikimedia.org/P20287 and previous config saved to 
/var/cache/conftool/dbconfig/20220208-114639-ladsgroup.json [11:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:43] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [11:47:40] 10SRE, 10Maps: Allow Wikimedia Maps usage on bbcrewind.co.uk - https://phabricator.wikimedia.org/T297968 (10LWyatt) When implemented, could the appropriate attribution text also be provided/clarified too, please. So that the BBC team can implement that from the start. [11:47:42] (03CR) 10JMeybohm: [C: 04-1] "This looks pretty great and I don't feel like I'm able to properly review without investing substantial amounts of time because of my very" [deployment-charts] - 10https://gerrit.wikimedia.org/r/757977 (owner: 10Giuseppe Lavagetto) [11:50:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P20288 and previous config saved to /var/cache/conftool/dbconfig/20220208-115051-marostegui.json [11:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:21] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2020.codfw.wmnet with OS buster [11:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reimage for host db1182.eqiad.wmnet with OS bullseye [11:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:08] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase2019.codfw.wmnet with OS buster [11:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:06] (03CR) 10Klausman: [C: 03+1] Add ml-serve2005 to the ml-serve codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/760903 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [11:59:00] !log 
hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2020.wmnet [11:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:03] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=restbase2019.wmnet [11:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P20289 and previous config saved to /var/cache/conftool/dbconfig/20220208-115918-marostegui.json [11:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1011.eqiad.wmnet with OS buster [11:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:42] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1011.eqiad.wmnet with OS buster [12:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220208T1200). [12:00:05] nn1l2: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:15] I can deploy today [12:00:15] ah [12:00:15] o/ [12:00:20] hi [12:00:23] RECOVERY - Check systemd state on prometheus2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:26] unless Lucas_WMDE wants to :)) [12:00:35] nah, I’m in a meeting [12:00:41] enjoy!
[12:00:43] I can be on standby but if you’re available that sounds great :) [12:00:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298554)', diff saved to https://phabricator.wikimedia.org/P20290 and previous config saved to /var/cache/conftool/dbconfig/20220208-120054-ladsgroup.json [12:00:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [12:00:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1160.eqiad.wmnet with reason: Maintenance [12:00:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:59] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [12:01:01] RECOVERY - Check systemd state on prometheus2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T298554)', diff saved to https://phabricator.wikimedia.org/P20291 and previous config saved to /var/cache/conftool/dbconfig/20220208-120102-ladsgroup.json [12:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:06] (03CR) 10Urbanecm: [C: 03+2] cowikimedia: Let admins grant confirmed and accountcreator flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760912 (https://phabricator.wikimedia.org/T300948) (owner: 104nn1l2) [12:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:49] (03Merged) 10jenkins-bot: cowikimedia: Let admins grant confirmed and accountcreator flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/760912 (https://phabricator.wikimedia.org/T300948) (owner: 104nn1l2) [12:02:25] nn1l2: pulled to mwdebug1001 [12:02:27] 
can you test? [12:02:30] ok [12:02:48] LGTM [12:03:08] syncing [12:04:19] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: d9902a4: cowikimedia: Let admins grant confirmed and accountcreator flags (T300948) (duration: 00m 50s) [12:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:24] T300948: Allow co.wikimedia (chapter wiki) Admins to grant confirmed and accountcreator permission to other users - https://phabricator.wikimedia.org/T300948 [12:04:38] nn1l2: done [12:04:40] anything else? [12:04:47] Thanks! [12:05:31] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [12:05:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T300402)', diff saved to https://phabricator.wikimedia.org/P20292 and previous config saved to /var/cache/conftool/dbconfig/20220208-120556-marostegui.json [12:05:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [12:05:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [12:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:01] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [12:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T300402)', diff saved to https://phabricator.wikimedia.org/P20293 and previous config saved to /var/cache/conftool/dbconfig/20220208-120603-marostegui.json [12:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] 
START helmfile.d/services/mwdebug: apply on pinkunicorn [12:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:23] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:07:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:07:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [12:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [12:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:03] !log Running c-foreach-nt decommission on restbase2010 in advance of decommissioning [12:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:16] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase2010.wmnet [12:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:30] RECOVERY - Check systemd state on prometheus1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:13:31] RECOVERY - Check systemd state on prometheus2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:14:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T300775)', diff saved to https://phabricator.wikimedia.org/P20294 and previous config saved to /var/cache/conftool/dbconfig/20220208-121422-marostegui.json [12:14:24] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1131.eqiad.wmnet with reason: Maintenance [12:14:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1131.eqiad.wmnet with reason: Maintenance [12:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:28] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [12:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T300775)', diff saved to https://phabricator.wikimedia.org/P20295 and previous config saved to /var/cache/conftool/dbconfig/20220208-121430-marostegui.json [12:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:08] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:19:05] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2010.codfw.wmnet with reason: Decommissioning [12:19:07] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2010.codfw.wmnet with reason: Decommissioning [12:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1182.eqiad.wmnet with OS bullseye [12:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:34] 10ops-eqiad, 10DC-Ops: Broken disk on ganeti1011 - 
https://phabricator.wikimedia.org/T301240 (10MoritzMuehlenhoff) [12:22:37] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7238 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:27:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti1011.eqiad.wmnet with OS buster [12:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:33] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate eqiad Ganeti cluster to Buster - https://phabricator.wikimedia.org/T296721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1011.eqiad.wmnet with OS buster executed with errors: - ganeti1011 (*... [12:28:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298554)', diff saved to https://phabricator.wikimedia.org/P20296 and previous config saved to /var/cache/conftool/dbconfig/20220208-122805-ladsgroup.json [12:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:10] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [12:29:13] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:29:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T300510)', diff saved to https://phabricator.wikimedia.org/P20297 and previous config saved to /var/cache/conftool/dbconfig/20220208-122913-ladsgroup.json [12:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:18] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [12:33:01] PROBLEM - etcd request 
latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:34:29] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7311 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:35:29] 10SRE-swift-storage, 10User-fgiunchedi: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10fgiunchedi) I checked `thanos-be-01.swift.eqiad1.wikimedia.cloud` and couldn't find any obvious errors and problems, I'll proceed with reimaging a thanos backend in production [12:37:13] !log filippo@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-be2001.codfw.wmnet with OS bullseye [12:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:16] 10SRE-swift-storage, 10User-fgiunchedi: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host thanos-be2001.codfw.wmnet with OS bullseye [12:37:23] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:38:01] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6963 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [12:39:27] (03PS1) 10Jelto: gitlab: nginx listen on IPv6, refactor variables [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) [12:40:13] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:42:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on dbmonitor1002.wikimedia.org with reason: Host will be shutdown in a week (T297605) [12:42:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on dbmonitor1002.wikimedia.org with reason: Host will be shutdown in a week (T297605) [12:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:21] T297605: Shutdown Tendril and dbtree - https://phabricator.wikimedia.org/T297605 [12:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P20298 and previous config saved to /var/cache/conftool/dbconfig/20220208-124309-ladsgroup.json [12:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:16] !log shut down dbmonitor1002 (T297605) [12:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:21] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:44:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P20299 and previous config saved to /var/cache/conftool/dbconfig/20220208-124418-ladsgroup.json [12:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:53] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:50:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T300775)', diff saved to https://phabricator.wikimedia.org/P20300 and previous config saved to /var/cache/conftool/dbconfig/20220208-125036-marostegui.json [12:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:41] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [12:50:42] (03PS1) 10Arturo Borrero Gonzalez: toolforge: grid: fix condition for legacy ssh_known_host file generation [puppet] - 10https://gerrit.wikimedia.org/r/760933 (https://phabricator.wikimedia.org/T284767) [12:55:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T300402)', diff saved to https://phabricator.wikimedia.org/P20301 and previous config saved to /var/cache/conftool/dbconfig/20220208-125508-marostegui.json [12:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:13] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [12:56:41] (03PS2) 10Jelto: gitlab: nginx listen on IPv6, refactor variables [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) [12:57:17] (03CR) 10jerkins-bot: [V: 04-1] gitlab: nginx listen on IPv6, refactor variables [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) (owner: 10Jelto) [12:58:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P20302 and previous config saved to /var/cache/conftool/dbconfig/20220208-125814-ladsgroup.json [12:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:31] 10SRE-Access-Requests: Access to required prod servers for new member of RelEng - 
https://phabricator.wikimedia.org/T301241 (10jnuche) [12:59:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P20303 and previous config saved to /var/cache/conftool/dbconfig/20220208-125922-ladsgroup.json [12:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:50] PROBLEM - SSH on mw2257.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:04:00] (03PS3) 10Jelto: gitlab: nginx listen on IPv6, refactor variables [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) [13:04:28] (03PS2) 10Muehlenhoff: Add Cumin aliases for edge sites [puppet] - 10https://gerrit.wikimedia.org/r/742686 [13:05:38] (03CR) 10Elukey: [V: 03+1 C: 03+2] Add ml-serve2005 to the ml-serve codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/760903 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [13:05:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P20304 and previous config saved to /var/cache/conftool/dbconfig/20220208-130541-marostegui.json [13:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:00] 10SRE-Access-Requests: Access to required prod servers for new member of RelEng - https://phabricator.wikimedia.org/T301241 (10jnuche) [13:07:56] (03CR) 10Volans: Add Cumin aliases for edge sites (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/742686 (owner: 10Muehlenhoff) [13:09:01] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33618/console" [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) (owner: 10Jelto) [13:10:13] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P20305 and previous config saved to /var/cache/conftool/dbconfig/20220208-131012-marostegui.json [13:10:13] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 11390 MB (32% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:45] (03PS1) 10Elukey: Empty profile::docker::storage settings for ml-serve2005 [puppet] - 10https://gerrit.wikimedia.org/r/760935 (https://phabricator.wikimedia.org/T300744) [13:12:57] (03CR) 10Muehlenhoff: "One final bit, but looks good otherwise. Our lower level system user handling is poorly documented (as in "I don't think it's documented at " [puppet] - 10https://gerrit.wikimedia.org/r/759254 (https://phabricator.wikimedia.org/T295481) (owner: 10Jelto) [13:13:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T298554)', diff saved to https://phabricator.wikimedia.org/P20306 and previous config saved to /var/cache/conftool/dbconfig/20220208-131319-ladsgroup.json [13:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:24] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [13:14:16] (03CR) 10Elukey: [C: 03+2] Empty profile::docker::storage settings for ml-serve2005 [puppet] - 10https://gerrit.wikimedia.org/r/760935 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [13:14:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [13:14:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [13:14:22] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T300510)', diff saved to https://phabricator.wikimedia.org/P20307 and previous config saved to /var/cache/conftool/dbconfig/20220208-131427-ladsgroup.json [13:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T298554)', diff saved to https://phabricator.wikimedia.org/P20308 and previous config saved to /var/cache/conftool/dbconfig/20220208-131430-ladsgroup.json [13:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:38] T300510: Upgrade s2 to Bullseye - https://phabricator.wikimedia.org/T300510 [13:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:06] (03CR) 10Muehlenhoff: Add Cumin aliases for edge sites (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/742686 (owner: 10Muehlenhoff) [13:16:59] (03CR) 10Elukey: [C: 03+2] Add ml-serve2005 to the ml-serve-codfw k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/760892 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [13:18:38] (03CR) 10Volans: Add Cumin aliases for edge sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742686 (owner: 10Muehlenhoff) [13:19:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: grid: fix 
condition for legacy ssh_known_host file generation [puppet] - 10https://gerrit.wikimedia.org/r/760933 (https://phabricator.wikimedia.org/T284767) (owner: 10Arturo Borrero Gonzalez) [13:19:37] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:20:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33619/console" [puppet] - 10https://gerrit.wikimedia.org/r/760933 (https://phabricator.wikimedia.org/T284767) (owner: 10Arturo Borrero Gonzalez) [13:20:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P20309 and previous config saved to /var/cache/conftool/dbconfig/20220208-132045-marostegui.json [13:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:53] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:24:13] (03PS1) 10Elukey: Update elukey's ssh public key [homer/public] - 10https://gerrit.wikimedia.org/r/760937 [13:24:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33620/console" [puppet] - 10https://gerrit.wikimedia.org/r/760911 (https://phabricator.wikimedia.org/T297605) (owner: 10Kormat) [13:25:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P20310 and previous config saved to /var/cache/conftool/dbconfig/20220208-132517-marostegui.json [13:25:19] (03CR) 10Jbond: "I drafted this yesterday, then noticed that there were some issues with PCC which distracted me, so while the comments here should be true" [puppet] - 10https://gerrit.wikimedia.org/r/757700 (owner: 10Ryan Kemper) [13:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:19] 10Puppet, 10Infrastructure-Foundations, 10puppet-compiler, 10User-jbond: puppet populate failing on some nodes - https://phabricator.wikimedia.org/T248169 (10jbond) This is still not working. It seems we get to a state where puppetdb has an empty facts result. From what I can tell, puppet queries puppetdb... 
[13:27:29] (03CR) 10Ottomata: [C: 03+1] Revert "cache: test atskafka webrequest on cp3050" [puppet] - 10https://gerrit.wikimedia.org/r/760895 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [13:31:34] (03PS1) 10Majavah: hieradata: cloudinfra: remove file for non-existent VM [puppet] - 10https://gerrit.wikimedia.org/r/760938 [13:32:35] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:33:45] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 6704 MB (18% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [13:33:52] this is the new host, but it should be solved now --^ (BGP) [13:34:13] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 82, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:35:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T300775)', diff saved to https://phabricator.wikimedia.org/P20312 and previous config saved to /var/cache/conftool/dbconfig/20220208-133550-marostegui.json [13:35:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:35:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:55] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [13:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T300775)', diff saved to https://phabricator.wikimedia.org/P20313 and previous 
config saved to /var/cache/conftool/dbconfig/20220208-133558-marostegui.json [13:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:46] 10Puppet, 10Infrastructure-Foundations, 10puppet-compiler, 10User-jbond: puppet populate failing on some nodes - https://phabricator.wikimedia.org/T248169 (10jbond) currently running a local hack which adds a routes.yaml file ` master: facts: {cache: yaml, terminus: yaml} ` [13:37:27] !log migrating instances off ganeti1021 [13:37:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:59] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:40:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T300402)', diff saved to https://phabricator.wikimedia.org/P20314 and previous config saved to /var/cache/conftool/dbconfig/20220208-134022-marostegui.json [13:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:27] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [13:40:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:40:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [13:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:33] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:41:39] jouncebot: nowandnext [13:41:39] No deployments scheduled for the next 3 hour(s) and 18 minute(s) [13:41:39] In 3 hour(s) and 18 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220208T1700) [13:41:43] I'm quickly deploying a beta-only config change [13:42:04] (03PS2) 10Majavah: beta: WRITE_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759725 (https://phabricator.wikimedia.org/T289068) [13:42:16] (03CR) 10Majavah: [C: 03+2] beta: WRITE_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759725 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [13:42:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298554)', diff saved to https://phabricator.wikimedia.org/P20315 and previous config saved to /var/cache/conftool/dbconfig/20220208-134254-ladsgroup.json [13:42:56] (03Merged) 10jenkins-bot: beta: WRITE_NEW for CentralAuth hidden level migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759725 (https://phabricator.wikimedia.org/T289068) (owner: 10Majavah) [13:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:59] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [13:43:07] * taavi done [13:43:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [13:43:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [13:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:43:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T300402)', diff saved to https://phabricator.wikimedia.org/P20316 and previous config saved to /var/cache/conftool/dbconfig/20220208-134324-marostegui.json [13:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:47] (03PS1) 10Elukey: install_server: set the new k8s overlay recipe for new ml-serve nodes [puppet] - 10https://gerrit.wikimedia.org/r/760940 (https://phabricator.wikimedia.org/T294949) [13:46:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:46:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [13:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:02] (03CR) 10Elukey: [C: 03+2] install_server: set the new k8s overlay recipe for new ml-serve nodes [puppet] - 10https://gerrit.wikimedia.org/r/760940 (https://phabricator.wikimedia.org/T294949) (owner: 10Elukey) [13:47:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [13:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T300402)', diff saved to https://phabricator.wikimedia.org/P20317 and previous config saved to /var/cache/conftool/dbconfig/20220208-134748-marostegui.json [13:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:52] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [13:48:02] 10SRE, 
10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10elukey) Already set the partman recipe (we are going to use a new one). [13:48:12] (03CR) 10Filippo Giunchedi: Add Cumin aliases for edge sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/742686 (owner: 10Muehlenhoff) [13:48:43] (03PS1) 10Arturo Borrero Gonzalez: toolforge: automated-tests: include tests for cron operations [puppet] - 10https://gerrit.wikimedia.org/r/760942 [13:51:05] (03PS1) 10Elukey: install_server: set dse-k8s-worker1* partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/760944 (https://phabricator.wikimedia.org/T291579) [13:52:07] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:52:13] 10SRE, 10ops-ulsfo, 10Traffic: SMART errors on cp4031 - https://phabricator.wikimedia.org/T300493 (10Vgutierrez) this a gentle 7 days reminder :) @RobH. [13:52:23] (03CR) 10Elukey: [C: 03+2] install_server: set dse-k8s-worker1* partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/760944 (https://phabricator.wikimedia.org/T291579) (owner: 10Elukey) [13:53:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10elukey) I've set the partman recipe (we are using a newer one for these nodes). 
[13:54:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q1:(Need By: TBD) rack/setup/install dse-k8s-worker100[1-4] - https://phabricator.wikimedia.org/T291579 (10elukey) [13:54:34] (03PS2) 10Arturo Borrero Gonzalez: toolforge: automated-tests: include tests for cron operations [puppet] - 10https://gerrit.wikimedia.org/r/760942 [13:55:31] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [13:57:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P20318 and previous config saved to /var/cache/conftool/dbconfig/20220208-135758-ladsgroup.json [13:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:09] (03PS2) 10Majavah: P:openstack::galera: drop puppetmaster firewall rules [puppet] - 10https://gerrit.wikimedia.org/r/760643 [14:01:13] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [14:01:25] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:02:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P20319 and previous config saved to /var/cache/conftool/dbconfig/20220208-140252-marostegui.json [14:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:03] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [14:07:23] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [14:07:48] !log update NIC firmware on thanos-be2001 - T288937 [14:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:53] T288937: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 [14:12:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T300775)', diff saved to https://phabricator.wikimedia.org/P20320 and previous config saved to /var/cache/conftool/dbconfig/20220208-141210-marostegui.json [14:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:15] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [14:13:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P20321 and previous config saved to /var/cache/conftool/dbconfig/20220208-141303-ladsgroup.json [14:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:44] (03CR) 10Kormat: wmfdb/cli_admin/db_compare: Add 
db-compare utility. (032 comments) [software/wmfdb] - 10https://gerrit.wikimedia.org/r/759504 (https://phabricator.wikimedia.org/T298236) (owner: 10Kormat) [14:13:52] (03PS3) 10Kormat: wmfdb/cli_admin/db_compare: Add db-compare utility. [software/wmfdb] - 10https://gerrit.wikimedia.org/r/759504 (https://phabricator.wikimedia.org/T298236) [14:15:45] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:17:20] !log update PERC firmware on thanos-be2001 - T288937 [14:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:24] T288937: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 [14:17:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P20322 and previous config saved to /var/cache/conftool/dbconfig/20220208-141757-marostegui.json [14:18:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:39] (03PS1) 10Jbond: populate_puppetdb: Add support for reading facts directly from disk [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/760949 [14:20:53] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:21:41] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [14:22:03] (03CR) 10jerkins-bot: [V: 04-1] populate_puppetdb: Add support for reading facts directly from disk [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/760949 (owner: 10Jbond) [14:23:17] (03CR) 10Volans: [C: 03+1] "LGTM with the proposed 
follow ups." [software/wmfdb] - 10https://gerrit.wikimedia.org/r/759504 (https://phabricator.wikimedia.org/T298236) (owner: 10Kormat) [14:26:38] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2001.codfw.wmnet with OS bullseye [14:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:42] 10SRE-swift-storage, 10User-fgiunchedi: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host thanos-be2001.codfw.wmnet with OS bullseye completed: - thanos-be2001 (**PASS**) - Downtimed on... [14:27:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P20323 and previous config saved to /var/cache/conftool/dbconfig/20220208-142714-marostegui.json [14:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298554)', diff saved to https://phabricator.wikimedia.org/P20324 and previous config saved to /var/cache/conftool/dbconfig/20220208-142808-ladsgroup.json [14:28:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [14:28:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [14:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:12] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [14:28:13] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2001.codfw.wmnet [14:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:16] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Depooling db1146:3314 (T298554)', diff saved to https://phabricator.wikimedia.org/P20325 and previous config saved to /var/cache/conftool/dbconfig/20220208-142815-ladsgroup.json [14:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:00] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:33:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T300402)', diff saved to https://phabricator.wikimedia.org/P20326 and previous config saved to /var/cache/conftool/dbconfig/20220208-143302-marostegui.json [14:33:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [14:33:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [14:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:07] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [14:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:40] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [14:35:37] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2001.codfw.wmnet [14:35:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [14:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [14:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T300402)', diff saved to https://phabricator.wikimedia.org/P20327 and previous config saved to /var/cache/conftool/dbconfig/20220208-143545-marostegui.json [14:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300402)', diff saved to https://phabricator.wikimedia.org/P20328 and previous config saved to /var/cache/conftool/dbconfig/20220208-144011-marostegui.json [14:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:16] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [14:40:46] (03PS1) 10Zabe: wmcs: stop accessing gu_hidden in maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/760953 (https://phabricator.wikimedia.org/T289068) [14:42:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P20329 and previous config saved to /var/cache/conftool/dbconfig/20220208-144219-marostegui.json [14:42:22] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:45] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:51:58] (03PS1) 10Jbond: C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) [14:53:53] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:53:54] (03CR) 10jerkins-bot: [V: 04-1] C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) (owner: 10Jbond) [14:55:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P20330 and previous config saved to /var/cache/conftool/dbconfig/20220208-145516-marostegui.json [14:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298554)', diff saved to https://phabricator.wikimedia.org/P20331 and previous config saved to /var/cache/conftool/dbconfig/20220208-145527-ladsgroup.json [14:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:32] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [14:57:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T300775)', diff saved to https://phabricator.wikimedia.org/P20332 and previous config saved to /var/cache/conftool/dbconfig/20220208-145724-marostegui.json [14:57:25] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance [14:57:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance [14:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:31] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [14:57:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T300775)', diff saved to https://phabricator.wikimedia.org/P20333 and previous config saved to /var/cache/conftool/dbconfig/20220208-145731-marostegui.json [14:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:05] (03PS1) 10Elukey: profile::kubernetes::node: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/760956 (https://phabricator.wikimedia.org/T300744) [15:00:25] (03PS4) 10Jelto: gitlab: nginx listen on IPv6, refactor variables [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) [15:03:29] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:03:51] (03PS2) 10Jbond: C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) [15:05:12] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33623/console" [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) (owner: 10Jelto) [15:05:14] (03CR) 
10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33622/console" [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) (owner: 10Jbond) [15:05:59] (03PS3) 10Jbond: C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) [15:07:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33624/console" [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) (owner: 10Jbond) [15:08:16] (03CR) 10jerkins-bot: [V: 04-1] C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) (owner: 10Jbond) [15:10:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P20334 and previous config saved to /var/cache/conftool/dbconfig/20220208-151020-marostegui.json [15:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P20335 and previous config saved to /var/cache/conftool/dbconfig/20220208-151032-ladsgroup.json [15:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:35] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:11:49] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7250 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:13:25] (03PS2) 10Volans: 
requests: add support for conn/read timeouts [software/pywmflib] - 10https://gerrit.wikimedia.org/r/754888 [15:13:28] (03PS1) 10Volans: setup.py: temporarily add upper limit to dnspython [software/pywmflib] - 10https://gerrit.wikimedia.org/r/760958 [15:15:03] (03PS4) 10Jbond: C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) [15:15:37] 10SRE-swift-storage: Decommission ms-fe200[5-8] - https://phabricator.wikimedia.org/T301251 (10MatthewVernon) [15:16:13] (03CR) 10jerkins-bot: [V: 04-1] C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) (owner: 10Jbond) [15:17:49] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:18:21] !log depooling ms-fe200[5-8] T301251 [15:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:26] T301251: Decommission ms-fe200[5-8] - https://phabricator.wikimedia.org/T301251 [15:19:51] (03PS5) 10Jelto: gitlab: nginx listen on IPv6, refactor variables [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) [15:20:07] (03PS5) 10Jbond: C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) [15:21:51] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:22:29] (03CR) 
10jerkins-bot: [V: 04-1] C:puppetdb::app: update puppet_compiler to scripts [puppet] - 10https://gerrit.wikimedia.org/r/760955 (https://phabricator.wikimedia.org/T248169) (owner: 10Jbond) [15:22:43] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:25:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T300402)', diff saved to https://phabricator.wikimedia.org/P20336 and previous config saved to /var/cache/conftool/dbconfig/20220208-152525-marostegui.json [15:25:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [15:25:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1133.eqiad.wmnet with reason: Maintenance [15:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:31] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [15:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P20337 and previous config saved to /var/cache/conftool/dbconfig/20220208-152536-ladsgroup.json [15:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:35] (03PS1) 10MVernon: hieradata: move codfw swiftrepl host to ms-fe2009 [puppet] - 10https://gerrit.wikimedia.org/r/760961 (https://phabricator.wikimedia.org/T301251) [15:27:01] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2001.codfw.wmnet [15:27:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [15:28:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1134.eqiad.wmnet with reason: Maintenance [15:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T300402)', diff saved to https://phabricator.wikimedia.org/P20338 and previous config saved to /var/cache/conftool/dbconfig/20220208-152812-marostegui.json [15:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T300402)', diff saved to https://phabricator.wikimedia.org/P20339 and previous config saved to /var/cache/conftool/dbconfig/20220208-153251-marostegui.json [15:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:56] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [15:33:04] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2001.codfw.wmnet [15:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:43] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2001.codfw.wmnet [15:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:03] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:39:09] RECOVERY - Check systemd state on prometheus1003 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:40:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298554)', diff saved to https://phabricator.wikimedia.org/P20340 and previous config saved to /var/cache/conftool/dbconfig/20220208-154042-ladsgroup.json [15:40:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [15:40:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [15:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:47] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [15:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298554)', diff saved to https://phabricator.wikimedia.org/P20341 and previous config saved to /var/cache/conftool/dbconfig/20220208-154049-ladsgroup.json [15:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:04] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (including actions in commit message)" [puppet] - 10https://gerrit.wikimedia.org/r/760961 (https://phabricator.wikimedia.org/T301251) (owner: 10MVernon) [15:44:17] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:46:39] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:47:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P20342 and previous config saved to /var/cache/conftool/dbconfig/20220208-154755-marostegui.json [15:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:11] (03PS1) 10JMeybohm: Ensure cleanup of CertificateRequest history [deployment-charts] - 10https://gerrit.wikimedia.org/r/760989 [15:54:27] (03CR) 10Herron: [C: 03+1] logstash: use java home from profile::java [puppet] - 10https://gerrit.wikimedia.org/r/759757 (https://phabricator.wikimedia.org/T300853) (owner: 10Cwhite) [15:54:45] (03PS1) 10Papaul: Add mc203[89] mc204[09] and mc205[05] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/760991 [15:55:14] (03CR) 10Elukey: [C: 03+1] "Left some nits for the commit msg, looks good to me! (checked also the specs for the Certificate CRD, all good)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/760989 (owner: 10JMeybohm) [15:55:16] (03CR) 10Herron: [C: 03+1] opensearch: use java_home from profile::java [puppet] - 10https://gerrit.wikimedia.org/r/759751 (https://phabricator.wikimedia.org/T300853) (owner: 10Cwhite) [15:55:29] (03CR) 10jerkins-bot: [V: 04-1] Add mc203[89] mc204[09] and mc205[05] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/760991 (owner: 10Papaul) [15:59:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] hieradata: cloudinfra: remove file for non-existent VM [puppet] - 10https://gerrit.wikimedia.org/r/760938 (owner: 10Majavah) [15:59:47] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 11275 MB (31% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:59:53] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 
operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:02:15] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:03:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P20343 and previous config saved to /var/cache/conftool/dbconfig/20220208-160300-marostegui.json [16:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:26] (03PS2) 10Papaul: Add mc2038 -mc2055 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/760991 [16:04:11] (03CR) 10jerkins-bot: [V: 04-1] Add mc2038 -mc2055 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/760991 (owner: 10Papaul) [16:05:33] PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:51] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:05:59] PROBLEM - Host thanos-be2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:06:40] (03PS3) 10Papaul: Add mc2038 -mc2055 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/760991 [16:07:19] (03CR) 10jerkins-bot: [V: 04-1] Add mc2038 -mc2055 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/760991 (owner: 10Papaul) [16:07:22] (03PS4) 10Papaul: Add mc2038 -mc2055 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/760991 [16:07:35] RECOVERY - Host thanos-be2001 is UP: PING OK - Packet loss = 0%, RTA = 34.07 ms [16:08:03] (03CR) 10jerkins-bot: [V: 
04-1] Add mc2038 -mc2055 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/760991 (owner: 10Papaul) [16:08:30] (03CR) 10MVernon: [C: 03+2] hieradata: move codfw swiftrepl host to ms-fe2009 [puppet] - 10https://gerrit.wikimedia.org/r/760961 (https://phabricator.wikimedia.org/T301251) (owner: 10MVernon) [16:09:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298554)', diff saved to https://phabricator.wikimedia.org/P20344 and previous config saved to /var/cache/conftool/dbconfig/20220208-160922-ladsgroup.json [16:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:27] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [16:10:39] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:11:29] (03CR) 10Cwhite: [C: 03+2] wikimedia.org: add grafana-next-rw [dns] - 10https://gerrit.wikimedia.org/r/757780 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [16:11:41] (03PS4) 10Cwhite: wikimedia.org: add grafana-next-rw [dns] - 10https://gerrit.wikimedia.org/r/757780 (https://phabricator.wikimedia.org/T282863) [16:11:43] (03PS5) 10Papaul: Add mc2038 -mc2055 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/760991 [16:11:55] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:12:08] (03PS6) 10Papaul: Add mc2038 -mc2055 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/760991 (https://phabricator.wikimedia.org/T294962) [16:13:06] (03CR) 10Papaul: [C: 03+2] Add mc2038 -mc2055 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/760991 (https://phabricator.wikimedia.org/T294962) (owner: 10Papaul) [16:13:20] !log filippo@cumin1001 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2001.codfw.wmnet [16:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:53] (03CR) 10Cwhite: [C: 03+2] hiera: set domainrw to grafana-next-rw in codfw [puppet] - 10https://gerrit.wikimedia.org/r/757774 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [16:15:25] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:16:18] (03CR) 10Cwhite: [C: 03+2] graphite: add grafana-next-rw to cors origins [puppet] - 10https://gerrit.wikimedia.org/r/757775 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [16:16:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2038.codfw.wmnet with OS buster [16:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:37] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2038.codfw.wmnet... 
[16:18:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T300402)', diff saved to https://phabricator.wikimedia.org/P20345 and previous config saved to /var/cache/conftool/dbconfig/20220208-161805-marostegui.json [16:18:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [16:18:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1135.eqiad.wmnet with reason: Maintenance [16:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:10] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [16:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T300402)', diff saved to https://phabricator.wikimedia.org/P20346 and previous config saved to /var/cache/conftool/dbconfig/20220208-161812-marostegui.json [16:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:09] (03CR) 10Cwhite: [C: 03+2] idp, grafana: configure grafana-next-rw for sso [puppet] - 10https://gerrit.wikimedia.org/r/757776 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [16:21:41] jouncebot now [16:21:41] No deployments scheduled for the next 0 hour(s) and 38 minute(s) [16:22:23] I'm going to sneak in a mediawiki-config mod in preparation for the new scap and this week' [16:22:29] ..this week's train [16:22:37] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:22:44] (03CR) 10Ahmon Dancy: [C: 03+2] Choose wikiversions.php file relative to MWMultiVersion.php 
(revived) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759521 (owner: 10Ahmon Dancy) [16:22:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T300402)', diff saved to https://phabricator.wikimedia.org/P20347 and previous config saved to /var/cache/conftool/dbconfig/20220208-162250-marostegui.json [16:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P20348 and previous config saved to /var/cache/conftool/dbconfig/20220208-162427-ladsgroup.json [16:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:01] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:25:30] (03CR) 10Cwhite: [C: 03+2] hiera: add grafana-next and grafana-next-rw to grafana public_aliases [puppet] - 10https://gerrit.wikimedia.org/r/757777 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [16:25:47] (03CR) 10JMeybohm: [C: 03+2] Ensure cleanup of CertificateRequest history [deployment-charts] - 10https://gerrit.wikimedia.org/r/760989 (owner: 10JMeybohm) [16:25:55] 10SRE-swift-storage, 10User-fgiunchedi: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10fgiunchedi) `thanos-be2001` is running Bullseye, the reimage itself went fine and I'll leave the host alone to see if any obvious problems pop up. I've ran into an old/known issue with disk re... 
[16:27:20] (03PS2) 10JMeybohm: Ensure cleanup of CertificateRequest history [deployment-charts] - 10https://gerrit.wikimedia.org/r/760989 [16:27:34] (03PS2) 10Cwhite: hiera: configure mapping and cache rules for grafana-next-rw [puppet] - 10https://gerrit.wikimedia.org/r/757778 (https://phabricator.wikimedia.org/T282863) [16:28:13] (03CR) 10JMeybohm: [C: 03+2] Ensure cleanup of CertificateRequest history (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/760989 (owner: 10JMeybohm) [16:28:55] 10SRE, 10Infrastructure-Foundations: check_user - authorization error - https://phabricator.wikimedia.org/T300193 (10jbond) I realised the script impersonates a super admin account but the account I was using was Harry's, who has left. When running the script impersonating Harry I get the above error e.g. ` $... [16:30:03] (03CR) 10Cwhite: [C: 03+2] hiera: configure mapping and cache rules for grafana-next-rw [puppet] - 10https://gerrit.wikimedia.org/r/757778 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [16:32:02] (03Merged) 10jenkins-bot: Ensure cleanup of CertificateRequest history [deployment-charts] - 10https://gerrit.wikimedia.org/r/760989 (owner: 10JMeybohm) [16:33:54] (03CR) 10JMeybohm: [C: 03+1] profile::kubernetes::node: support bullseye [puppet] - 10https://gerrit.wikimedia.org/r/760956 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [16:34:07] (03PS3) 10Arturo Borrero Gonzalez: toolforge: automated-tests: include tests for cron operations [puppet] - 10https://gerrit.wikimedia.org/r/760942 [16:34:44] 10SRE-swift-storage: Decommission ms-fe200[5-8] - https://phabricator.wikimedia.org/T301251 (10MatthewVernon) Nodes depooled, swiftrepl disabled (including removing the .timer file and swiftrepl.conf) on ms-fe2005. 
[16:35:23] (03CR) 10Majavah: toolforge: automated-tests: include tests for cron operations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/760942 (owner: 10Arturo Borrero Gonzalez) [16:35:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2039.codfw.wmnet with OS buster [16:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:52] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2039.codfw.wmnet with OS buster [16:37:05] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:29] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P20349 and previous config saved to /var/cache/conftool/dbconfig/20220208-163755-marostegui.json [16:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:06] PROBLEM - etcd request latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 operation=list https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:38:56] (03PS3) 10Ahmon Dancy: Choose wikiversions.php file relative to MWMultiVersion.php (revived) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759521 [16:39:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P20350 and previous config saved to 
/var/cache/conftool/dbconfig/20220208-163932-ladsgroup.json [16:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:09] 10SRE, 10Infrastructure-Foundations: check_user - authorization error - https://phabricator.wikimedia.org/T300193 (10jbond) 05Open→03Resolved FYI this is working now. For posterity, the issue was that we impersonate one of the IT services staff members to enable this functionality; however, we were using @H... [16:40:43] (03CR) 10Jbond: [C: 03+1] setup.py: temporarily add upper limit to dnspython [software/pywmflib] - 10https://gerrit.wikimedia.org/r/760958 (owner: 10Volans) [16:40:54] (03CR) 10Ahmon Dancy: Choose wikiversions.php file relative to MWMultiVersion.php (revived) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759521 (owner: 10Ahmon Dancy) [16:40:58] (03CR) 10Ahmon Dancy: [C: 03+2] Choose wikiversions.php file relative to MWMultiVersion.php (revived) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759521 (owner: 10Ahmon Dancy) [16:42:12] (03Merged) 10jenkins-bot: Choose wikiversions.php file relative to MWMultiVersion.php (revived) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/759521 (owner: 10Ahmon Dancy) [16:42:22] RECOVERY - etcd request latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [16:43:45] (03PS4) 10Arturo Borrero Gonzalez: toolforge: automated-tests: include tests for cron operations [puppet] - 10https://gerrit.wikimedia.org/r/760942 [16:45:14] !log dancy@deploy1002 Synchronized multiversion/MWMultiVersion.php: Config: [[gerrit:759521|Choose wikiversions.php file relative to MWMultiVersion.php (revived)]] (duration: 00m 49s) [16:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:09] (03CR) 10Arturo Borrero Gonzalez: toolforge: automated-tests: include tests for cron operations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/760942 (owner: 10Arturo Borrero Gonzalez) [16:46:53] (03PS5) 10Arturo Borrero Gonzalez: toolforge: automated-tests: include tests for cron operations [puppet] - 10https://gerrit.wikimedia.org/r/760942 [16:47:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2038.codfw.wmnet with OS buster [16:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:56] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2038.codfw.wmnet wit... 
[16:48:54] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7235 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [16:49:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:50:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [16:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T300775)', diff saved to https://phabricator.wikimedia.org/P20351 and previous config saved to /var/cache/conftool/dbconfig/20220208-165057-marostegui.json [16:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:02] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [16:51:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2040.codfw.wmnet with OS buster [16:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:17] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2040.codfw.wmnet... 
[16:51:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [16:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:34] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mc2040.codfw.wmnet with OS buster [16:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:39] 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q3:(Need By: TBD) rack/setup/install ganeti2029.codfw.wmnet, ganeti2030.codfw.wmnet - https://phabricator.wikimedia.org/T298998 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2040.codfw.wmnet wit... [16:51:59] (03PS6) 10Jelto: gitlab: nginx listen on IPv6, refactor variables [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) [16:52:41] (03CR) 10jerkins-bot: [V: 04-1] gitlab: nginx listen on IPv6, refactor variables [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) (owner: 10Jelto) [16:52:58] (03PS6) 10Arturo Borrero Gonzalez: toolforge: automated-tests: include tests for cron operations [puppet] - 10https://gerrit.wikimedia.org/r/760942 [16:53:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P20352 and previous config saved to /var/cache/conftool/dbconfig/20220208-165300-marostegui.json [16:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:16] PROBLEM - k8s API server requests latencies on kubestagemaster2001 is CRITICAL: instance=10.192.48.10 verb={DELETE,LIST} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [16:53:35] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 
(10ssingh) We have discussed this in the Traffic team and decided to go with `2001:67c:930::1/128`, mostly because we feel it's easy to memorize/copy (for cases where people want... [16:54:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2040.codfw.wmnet with OS buster [16:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:10] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2040.codfw.wmnet with OS buster [16:54:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298554)', diff saved to https://phabricator.wikimedia.org/P20353 and previous config saved to /var/cache/conftool/dbconfig/20220208-165436-ladsgroup.json [16:54:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [16:54:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [16:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:45] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [16:54:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T298554)', diff saved to https://phabricator.wikimedia.org/P20354 and previous config saved to /var/cache/conftool/dbconfig/20220208-165445-ladsgroup.json [16:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:30] hmm.. DBReadOnlyError's happening more than usual. 
708 in the last 15 minutes. [16:56:38] FIXME: Figure out what "usual" is [16:57:09] 10SRE, 10Infrastructure-Foundations: check_user - authorization error - https://phabricator.wikimedia.org/T300193 (10bcampbell) Hey @jbond awesome news! Glad it's working again. I agree that it would be better to use a service account to prevent this from happening again. We have a super admin service account... [16:57:10] Fading away now [16:57:12] approximately never [16:57:27] hehe. I see them all the time [16:59:15] 10SRE, 10Infrastructure-Foundations: check_user - authorization error - https://phabricator.wikimedia.org/T300193 (10RhinosF1) Side note: If @HMarcus has left, should the phab account be disabled? @bcampbell: is this something ITS have access to do or could you give a +1 for me to use the disable tool? [17:00:05] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220208T1700). [17:00:05] zabe: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:20] looking! [17:00:25] o/ [17:00:45] (03PS7) 10Arturo Borrero Gonzalez: toolforge: automated-tests: include tests for cron operations [puppet] - 10https://gerrit.wikimedia.org/r/760942 [17:00:59] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (10cmooney) > We have discussed this in the Traffic team and decided to go with 2001:67c:930::1/128, mostly because we feel it's easy to memorize/copy (for cases where people want... [17:01:18] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (10cmooney) [17:04:39] RECOVERY - k8s API server requests latencies on kubestagemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [17:05:06] (03PS1) 10JMeybohm: Bump resources of cert-manager and components [deployment-charts] - 10https://gerrit.wikimedia.org/r/761000 [17:05:17] zabe: pardon the delay, the patch looks trivial enough, just checking to make sure it's the right config change for this situation :) you're intentionally adding it as a separate vhost and not as a redirect, right? [17:05:33] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33626/console" [puppet] - 10https://gerrit.wikimedia.org/r/756733 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [17:06:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P20355 and previous config saved to /var/cache/conftool/dbconfig/20220208-170601-marostegui.json [17:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:19] rzl: the plan is to move the wiki to this new url and then change the old url to redirect to the new one [17:06:32] ah okay [17:06:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2039.codfw.wmnet with OS buster [17:06:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:43] so we're following the first part of the "add a new wiki" procedure, at least as far as DNS, Apache, etc [17:06:45] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2039.codfw.wmnet with OS buster completed: - mc20... 
[17:07:32] i would say so [17:07:41] okay, sounds good [17:08:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T300402)', diff saved to https://phabricator.wikimedia.org/P20356 and previous config saved to /var/cache/conftool/dbconfig/20220208-170805-marostegui.json [17:08:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [17:08:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [17:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:10] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [17:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T300402)', diff saved to https://phabricator.wikimedia.org/P20357 and previous config saved to /var/cache/conftool/dbconfig/20220208-170812-marostegui.json [17:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:15] out of an abundance of caution, I'm going to stop puppet on the appservers, merge this, deploy it to mwdebug, and make sure it looks as expected before resuming puppet everywhere [17:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:22] probably overkill for this patch but better over than under :) [17:08:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2041.codfw.wmnet with OS buster [17:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:37] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for 
host mc2041.codfw.wmnet with OS buster [17:09:08] ok :) [17:09:10] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@88cdfdc]: Deploy rdf-streaming-updater reconcilliation job [17:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:11] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@88cdfdc]: Deploy rdf-streaming-updater reconcilliation job (duration: 02m 01s) [17:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:55] !log rzl@cumin1001:~$ sudo cumin A:mw "disable-puppet T273323" [17:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:59] T273323: Rename private "ombudsmenwiki" to "ombudswiki" and change the logo - https://phabricator.wikimedia.org/T273323 [17:12:13] (03CR) 10RLazarus: [V: 03+1 C: 03+2] Add ombuds.wikimedia.org to mediawiki.yaml [puppet] - 10https://gerrit.wikimedia.org/r/756733 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [17:12:49] (03PS1) 10Ladsgroup: admin: Completely remove sc-admins group [puppet] - 10https://gerrit.wikimedia.org/r/761001 [17:12:56] (03PS9) 10JMeybohm: Add LVS service k8s-ingress-staging [puppet] - 10https://gerrit.wikimedia.org/r/759260 (https://phabricator.wikimedia.org/T300740) [17:13:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T300402)', diff saved to https://phabricator.wikimedia.org/P20358 and previous config saved to /var/cache/conftool/dbconfig/20220208-171323-marostegui.json [17:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:28] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [17:14:17] (03CR) 10Ladsgroup: admin: Fully deprecate sc-admins group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/759219 (owner: 10Ladsgroup) [17:14:39] zabe: applied at mwdebug1001, is there anything for you to test? 
[17:15:57] rzl: not really, when I go there it says 'No wiki found', but that is expected [17:16:32] good enough for me :) and the other wikis seem to still be there too [17:16:38] always a pleasant surprise [17:16:59] !log rzl@cumin1001:~$ sudo cumin A:mw "enable-puppet T273323" [17:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:04] T273323: Rename private "ombudsmenwiki" to "ombudswiki" and change the logo - https://phabricator.wikimedia.org/T273323 [17:17:10] zabe: thanks! should be all set [17:17:23] rzl: thanks for your help :) [17:17:26] 10SRE, 10serviceops, 10GitLab (Infrastructure), 10Patch-For-Review: gitlab: enable IPv6 for https - https://phabricator.wikimedia.org/T300816 (10Jelto) 05Open→03In progress p:05Triage→03Medium a:03Jelto [17:21:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P20359 and previous config saved to /var/cache/conftool/dbconfig/20220208-172106-marostegui.json [17:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:33] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:22:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298554)', diff saved to https://phabricator.wikimedia.org/P20360 and previous config saved to /var/cache/conftool/dbconfig/20220208-172248-ladsgroup.json [17:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:53] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [17:23:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2040.codfw.wmnet with OS buster [17:23:52] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:56] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2040.codfw.wmnet with OS buster completed: - mc2040 (**PASS**) - Remo... [17:25:07] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:25:53] (03PS1) 10Hnowlan: restbase: remove restbase2010 [puppet] - 10https://gerrit.wikimedia.org/r/761006 (https://phabricator.wikimedia.org/T295375) [17:27:39] PROBLEM - Cassandra instance data free space on restbase2012 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7253 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [17:28:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P20361 and previous config saved to /var/cache/conftool/dbconfig/20220208-172827-marostegui.json [17:28:29] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2042.codfw.wmnet with OS buster [17:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:10] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2042.codfw.wmnet with OS buster [17:32:17] 10SRE, 10ops-ulsfo, 10Traffic: SMART errors on cp4031 - https://phabricator.wikimedia.org/T300493 (10RobH) So checking on this and we have a few things: * a bad sda that is 
failing (SSD) * dimm b3 memory errors in dell service event log * system is out of warranty, and will be 5 years old on 2022-04-07 ** i... [17:32:18] RECOVERY - Cassandra instance data free space on restbase2012 is OK: DISK OK - free space: /srv/cassandra/instance-data 12850 MB (36% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [17:36:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T300775)', diff saved to https://phabricator.wikimedia.org/P20362 and previous config saved to /var/cache/conftool/dbconfig/20220208-173611-marostegui.json [17:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:16] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [17:36:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [17:36:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance [17:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 8 hosts with reason: Maintenance [17:36:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 8 hosts with reason: Maintenance [17:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P20363 and previous config saved to /var/cache/conftool/dbconfig/20220208-173753-ladsgroup.json [17:37:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2041.codfw.wmnet with OS buster [17:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:03] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2041.codfw.wmnet with OS buster completed: - mc2041 (**PASS**) - Remo... [17:40:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2043.codfw.wmnet with OS buster [17:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:15] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2043.codfw.wmnet with OS buster [17:43:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P20364 and previous config saved to /var/cache/conftool/dbconfig/20220208-174332-marostegui.json [17:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:38] (03PS1) 10Zabe: httpbb: Update tests to reflect rename from ombudsmen to ombuds [puppet] - 10https://gerrit.wikimedia.org/r/761009 (https://phabricator.wikimedia.org/T273323) [17:44:59] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add ingress.staging switch [deployment-charts] - 10https://gerrit.wikimedia.org/r/759727 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [17:46:00] (03CR) 10Zabe: [C: 04-1] "not yet" [puppet] - 10https://gerrit.wikimedia.org/r/761009 (https://phabricator.wikimedia.org/T273323) (owner: 10Zabe) [17:46:20] PROBLEM - Mobileapps LVS codfw on 
mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/metadata/{title} (retrieve extended metadata for Video article on English Wikipedia) is WARNING: Test retrieve extended metadata for Video article on English Wikipedia responds with unexpected value at path /protection = Missing keys: [edit, move] https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [17:51:16] (03CR) 10Dzahn: "I am surprised by Jenkins saying " profile::gitlab not in autoload module layout ". This normally just happens when a class name does not " [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) (owner: 10Jelto) [17:51:43] (03PS2) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [17:51:51] 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (10RobH) [17:52:03] 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (10RobH) [17:52:09] 10SRE, 10ops-ulsfo, 10Traffic: SMART errors on cp4031 - https://phabricator.wikimedia.org/T300493 (10RobH) [17:52:14] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P20365 and previous config saved to /var/cache/conftool/dbconfig/20220208-175258-ladsgroup.json [17:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:53:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [17:53:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:03] (03CR) 10Razzi: "Thanks for the input @Volans and @Majavah. I'll take a look at the CI errors as well, then hopefully we can get this merged and I can test" [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [17:54:27] (03CR) 10jerkins-bot: [V: 04-1] Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) (owner: 10Razzi) [17:54:29] (03PS7) 10Dzahn: gitlab: nginx listen on IPv6, refactor variables [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) (owner: 10Jelto) [17:54:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [17:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:43] 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (10RobH) [17:54:56] 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (10RobH) [17:55:18] (03PS1) 10Zabe: Add redirect from ombudsmen.wm.o to ombuds.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/761011 (https://phabricator.wikimedia.org/T273323) [17:56:20] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@79cb98e]: move query clicks from oozie to airflow [17:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:32] (03PS2) 10Zabe: Add redirect from ombudsmen.wm.o to ombuds.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/761011 (https://phabricator.wikimedia.org/T273323) [17:56:47] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp4031.ulsfo.wmnet [17:56:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:00] 10SRE, 10ops-ulsfo, 10Traffic: SMART errors on cp4031 - https://phabricator.wikimedia.org/T300493 (10RobH) 05Open→03Resolved IRC Update: Discussed this with @bblack in IRC and the call on this is to decom cp4031, and ensure the planned refresh for the ulsfo batch of everything (except the 4 new cp hosts... [17:57:40] (03CR) 10Dzahn: "well.. oddly enough a rebase fixed that" [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) (owner: 10Jelto) [17:58:22] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@79cb98e]: move query clicks from oozie to airflow (duration: 02m 01s) [17:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T300402)', diff saved to https://phabricator.wikimedia.org/P20366 and previous config saved to /var/cache/conftool/dbconfig/20220208-175837-marostegui.json [17:58:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [17:58:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1184.eqiad.wmnet with reason: Maintenance [17:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:43] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [17:58:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T300402)', diff saved to https://phabricator.wikimedia.org/P20367 and previous config saved to /var/cache/conftool/dbconfig/20220208-175844-marostegui.json [17:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:30] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2042.codfw.wmnet with OS buster [17:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:35] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2042.codfw.wmnet with OS buster completed: - mc2042 (**PASS**) - Remo... [18:00:05] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220208T1800). Please do the needful. [18:02:13] (03PS1) 10BBlack: Remove cp4031 from cluster data [puppet] - 10https://gerrit.wikimedia.org/r/761012 (https://phabricator.wikimedia.org/T301269) [18:02:21] 10SRE, 10ops-eqiad, 10DC-Ops: Broken disk on ganeti1011 - https://phabricator.wikimedia.org/T301240 (10wiki_willy) a:03Cmjohnson [18:03:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T300402)', diff saved to https://phabricator.wikimedia.org/P20368 and previous config saved to /var/cache/conftool/dbconfig/20220208-180316-marostegui.json [18:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:18] 10ops-ulsfo, 10Traffic, 10decommission-hardware, 10Patch-For-Review: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (10RobH) https://gerrit.wikimedia.org/r/c/operations/puppet/+/761012 has the puppet bits that should happen before true decom of its existence it will fail its own puppetizat... 
[18:04:25] RECOVERY - SSH on mw2257.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:04:36] 10SRE, 10ops-eqiad, 10DC-Ops: Broken disk on ganeti1011 - https://phabricator.wikimedia.org/T301240 (10wiki_willy) In warranty through May 2022 [18:05:04] 10ops-ulsfo, 10Traffic, 10decommission-hardware, 10Patch-For-Review: decommission cp4031 - https://phabricator.wikimedia.org/T301269 (10RobH) a:05BBlack→03RobH [18:08:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298554)', diff saved to https://phabricator.wikimedia.org/P20369 and previous config saved to /var/cache/conftool/dbconfig/20220208-180803-ladsgroup.json [18:08:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [18:08:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [18:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:08] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [18:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T298554)', diff saved to https://phabricator.wikimedia.org/P20370 and previous config saved to /var/cache/conftool/dbconfig/20220208-180810-ladsgroup.json [18:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:20] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "compiled and output looks good to me. 
parameter names change but not the content of the actual config, with the exception of adding both l" [puppet] - 10https://gerrit.wikimedia.org/r/760930 (https://phabricator.wikimedia.org/T300816) (owner: 10Jelto) [18:11:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2043.codfw.wmnet with OS buster [18:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:20] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2043.codfw.wmnet with OS buster completed: - mc2043 (**PASS**) - Remo... [18:13:16] !log installing expat security updates [18:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:25] 10SRE, 10Infrastructure-Foundations: check_user - authorization error - https://phabricator.wikimedia.org/T300193 (10jbond) 05Resolved→03Open >>! In T300193#7693835, @bcampbell wrote: > Hey @jbond awesome news! Glad it's working again. I agree that it would be better to use a service account to prevent thi... [18:14:28] 10SRE, 10Infrastructure-Foundations: check_user - authorization error - https://phabricator.wikimedia.org/T300193 (10jbond) 05Open→03Resolved [18:15:27] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (10ssingh) >>! In T301165#7693858, @cmooney wrote: >> We have discussed this in the Traffic team and decided to go with 2001:67c:930::1/128, mostly because we feel it's easy to me... [18:15:45] 10SRE, 10Infrastructure-Foundations: check_user - authorization error - https://phabricator.wikimedia.org/T300193 (10RhinosF1) @jbond: I can't see that task. 
[18:16:50] (03CR) 10RhinosF1: [C: 04-1] Add drmrs to Hiera list of datacentres (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737328 (owner: 10Muehlenhoff) [18:18:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P20371 and previous config saved to /var/cache/conftool/dbconfig/20220208-181823-marostegui.json [18:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:10] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@ceff02f]: query_clicks: adjust start_date and catchup [18:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2044.codfw.wmnet with OS buster [18:20:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:39] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2044.codfw.wmnet with OS buster [18:21:00] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2045.codfw.wmnet with OS buster [18:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:05] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2045.codfw.wmnet with OS buster [18:22:07] (03CR) 10Zabe: Add drmrs to Hiera list of datacentres (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/737328 (owner: 10Muehlenhoff) [18:22:13] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@ceff02f]: query_clicks: adjust start_date and catchup (duration: 02m 03s) 
[18:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install conf100[789] - https://phabricator.wikimedia.org/T301272 (10RobH) [18:25:53] PROBLEM - Check that envoy is running on grafana2001 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [18:26:33] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana2001 is CRITICAL: connect to address 10.192.0.160 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:26:35] PROBLEM - grafana codfw port 443/tcp - Graphing and dashboarding IPv4 on grafana2001 is CRITICAL: connect to address 10.192.0.160 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [18:26:43] PROBLEM - Check systemd state on grafana2001 is CRITICAL: CRITICAL - degraded: The following units failed: envoyproxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:27:01] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana2001 is CRITICAL: connect to address 10.192.0.160 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:29:04] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.38.0-wmf.21 refs T300197 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761019 [18:29:06] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.38.0-wmf.21 refs T300197 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761019 (owner: 10Jeena Huneidi) [18:29:43] (03Merged) 10jenkins-bot: testwikis wikis to 1.38.0-wmf.21 refs T300197 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761019 (owner: 10Jeena Huneidi) [18:29:47] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.38.0-wmf.21 refs T300197 [18:29:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:52] T300197: 1.38.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T300197 [18:30:10] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (10cmooney) ssingh: thanks! Yeah I'm not aware of any reason not to just match what was done with the IPv4, even if there are other options in this case. I've gone and added 3 I... [18:30:36] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/761001 (owner: 10Ladsgroup) [18:31:56] (03PS1) 10Cwhite: ssl: add regenerated grafana cert [puppet] - 10https://gerrit.wikimedia.org/r/761022 (https://phabricator.wikimedia.org/T282863) [18:32:17] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [18:32:25] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Enable IPv6 for Wikidough - https://phabricator.wikimedia.org/T301165 (10ssingh) >>! In T301165#7694515, @cmooney wrote: > ssingh: thanks! Yeah I'm not aware of any reason not to just match what was done with the IPv4, even if there are other optio... 
[18:32:55] (03CR) 10Cwhite: [C: 03+2] ssl: add regenerated grafana cert [puppet] - 10https://gerrit.wikimedia.org/r/761022 (https://phabricator.wikimedia.org/T282863) (owner: 10Cwhite) [18:33:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P20372 and previous config saved to /var/cache/conftool/dbconfig/20220208-183328-marostegui.json [18:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:53] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:33:59] (03PS3) 10Razzi: Add cookbooks for running maintain-views [cookbooks] - 10https://gerrit.wikimedia.org/r/760880 (https://phabricator.wikimedia.org/T297026) [18:34:01] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana2001 is OK: OK - Certificate grafana.discovery.wmnet will expire on Sun 07 Feb 2027 06:17:23 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:34:47] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana2001 is OK: OK - Certificate grafana.discovery.wmnet will expire on Sun 07 Feb 2027 06:17:23 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:34:49] RECOVERY - grafana codfw port 443/tcp - Graphing and dashboarding IPv4 on grafana2001 is OK: OK - Certificate grafana.discovery.wmnet will expire on Sun 07 Feb 2027 06:17:23 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org [18:34:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:07] RECOVERY - Check systemd state on grafana2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:35:29] RECOVERY - Check that envoy is running on grafana2001 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [18:35:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298554)', diff saved to https://phabricator.wikimedia.org/P20373 and previous config saved to /var/cache/conftool/dbconfig/20220208-183532-ladsgroup.json [18:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:37] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [18:36:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:36:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [18:37:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:07] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:39:56] (03CR) 10Volans: [C: 03+2] setup.py: temporarily add upper limit to dnspython [software/pywmflib] - 10https://gerrit.wikimedia.org/r/760958 (owner: 10Volans) [18:40:01] (03CR) 10Volans: [C: 03+2] requests: add support for conn/read timeouts [software/pywmflib] - 10https://gerrit.wikimedia.org/r/754888 (owner: 10Volans) [18:42:36] (03Merged) 10jenkins-bot: setup.py: temporarily add upper limit to dnspython [software/pywmflib] - 10https://gerrit.wikimedia.org/r/760958 (owner: 10Volans) [18:42:38] (03Merged) 10jenkins-bot: requests: add support for conn/read timeouts [software/pywmflib] - 10https://gerrit.wikimedia.org/r/754888 (owner: 10Volans) [18:44:46] 10SRE, 10Infrastructure-Foundations: check_user - authorization error - https://phabricator.wikimedia.org/T300193 (10jbond) >>! In T300193#7694445, @RhinosF1 wrote: > @jbond: I can't see that task. oh sorry, tl;dr was suggested that harries account has been disabled now [18:45:42] 10SRE, 10Infrastructure-Foundations: check_user - authorization error - https://phabricator.wikimedia.org/T300193 (10RhinosF1) @HMarcus seems to still be enabled, no problem though about access to task! 
[18:48:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T300402)', diff saved to https://phabricator.wikimedia.org/P20374 and previous config saved to /var/cache/conftool/dbconfig/20220208-184832-marostegui.json [18:48:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:48:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:48:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:38] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [18:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P20375 and previous config saved to /var/cache/conftool/dbconfig/20220208-185037-ladsgroup.json [18:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [18:51:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [18:51:22] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2044.codfw.wmnet with OS buster [18:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [18:51:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:28] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2044.codfw.wmnet with OS buster completed: - mc2044 (**PASS**) - Remo... [18:51:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [18:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:12] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2045.codfw.wmnet with OS buster [18:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:18] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2045.codfw.wmnet with OS buster completed: - mc2045 (**PASS**) - Remo... 
[18:53:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2046.codfw.wmnet with OS buster [18:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:10] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2046.codfw.wmnet with OS buster [18:54:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [18:54:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [18:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T300402)', diff saved to https://phabricator.wikimedia.org/P20376 and previous config saved to /var/cache/conftool/dbconfig/20220208-185420-marostegui.json [18:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:24] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [18:54:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2047.codfw.wmnet with OS buster 
[18:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:32] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2047.codfw.wmnet with OS buster [18:56:54] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@49ba844]: query_clicks: resolve parse error in comment [18:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [18:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:41] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:58:56] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@49ba844]: query_clicks: resolve parse error in comment (duration: 02m 02s) [18:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] Deploy window Pre MediaWiki train break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220208T1900) [19:00:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T300402)', diff saved to https://phabricator.wikimedia.org/P20377 and previous config saved to /var/cache/conftool/dbconfig/20220208-190006-marostegui.json [19:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:13] T300402: Add namespace column to Linter table - 
https://phabricator.wikimedia.org/T300402 [19:03:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:03:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:31] (03PS1) 10Cwhite: grafana: add configurable execute_alerts option [puppet] - 10https://gerrit.wikimedia.org/r/761026 (https://phabricator.wikimedia.org/T300997) [19:05:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P20378 and previous config saved to /var/cache/conftool/dbconfig/20220208-190542-ladsgroup.json [19:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:33] 10SRE, 10SRE-Access-Requests: Access to required prod servers for new member of RelEng - https://phabricator.wikimedia.org/T301241 (10Arnoldokoth) [19:09:21] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.38.0-wmf.21 refs T300197 (duration: 39m 34s) [19:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:26] T300197: 1.38.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T300197 [19:09:35] 10SRE, 10SRE-Access-Requests: Access to required prod servers for new member of RelEng - https://phabricator.wikimedia.org/T301241 (10Arnoldokoth) [19:09:58] (03PS1) 10Reedy: Revert "Add submodule for new-lexeme-special-page" [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/760979 (https://phabricator.wikimedia.org/T301273) [19:10:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:10:21] thcipriani: jeena ^ as it's unused, just revert it [19:10:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:39] thanks Reedy [19:11:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [19:11:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [19:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:14] (03CR) 10Reedy: [C: 03+2] Revert "Add submodule for new-lexeme-special-page" [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/760979 (https://phabricator.wikimedia.org/T301273) (owner: 10Reedy) [19:12:44] !log jhuneidi@deploy1002 Pruned MediaWiki: 1.38.0-wmf.19 (duration: 03m 12s) [19:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P20379 and previous config saved to /var/cache/conftool/dbconfig/20220208-191511-marostegui.json [19:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:22] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:15:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:07] 10SRE, 10SRE-Access-Requests: Access to required prod servers for new member of RelEng - https://phabricator.wikimedia.org/T301241 (10Arnoldokoth) 05Open→03In progress [19:19:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:19:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:21] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298554)', diff saved to https://phabricator.wikimedia.org/P20380 and previous config saved to /var/cache/conftool/dbconfig/20220208-192047-ladsgroup.json [19:20:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [19:20:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [19:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:52] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [19:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T298554)', diff saved to https://phabricator.wikimedia.org/P20381 and previous config saved to /var/cache/conftool/dbconfig/20220208-192055-ladsgroup.json [19:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2046.codfw.wmnet with OS buster [19:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:16] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2046.codfw.wmnet with OS buster completed: - mc2046 (**PASS**) - Remo... 
[19:25:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:55] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2048.codfw.wmnet with OS buster [19:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:03] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2048.codfw.wmnet with OS buster [19:26:11] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10jhathaway) I was able to confirm that the problem is due to https://salsa.debian.org/apt-team/apt/-/commit/fa375493c5a4ed9c10d4e5257ac82c6e687862d3, as mentioned in this bug re... [19:26:11] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2047.codfw.wmnet with OS buster [19:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:19] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2047.codfw.wmnet with OS buster completed: - mc2047 (**PASS**) - Remo... 
[19:26:28] (03Merged) 10jenkins-bot: Revert "Add submodule for new-lexeme-special-page" [extensions/WikibaseLexeme] (wmf/1.38.0-wmf.21) - 10https://gerrit.wikimedia.org/r/760979 (https://phabricator.wikimedia.org/T301273) (owner: 10Reedy) [19:28:26] (03PS1) 10Jbond: P:sre::check_user: add support for namely API [puppet] - 10https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) [19:29:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/33628/console" [puppet] - 10https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) (owner: 10Jbond) [19:29:38] (03CR) 10Jbond: [V: 03+1 C: 04-1] P:sre::check_user: add support for namely API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) (owner: 10Jbond) [19:30:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P20382 and previous config saved to /var/cache/conftool/dbconfig/20220208-193016-marostegui.json [19:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:30] jeena: Should be good to pull that down now [19:30:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:46] 👍 will do [19:30:56] 10SRE: mirrors.wikimedia.org debian repository fails to serve packages from time to time - https://phabricator.wikimedia.org/T300985 (10jhathaway) >>! In T300985#7689441, @MoritzMuehlenhoff wrote: > Two things/tests here which came to my mind: Thanks for the suggestions! > 1. The reproducer pulls packages from... 
[19:31:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:31:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [19:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:47] 10SRE, 10SRE-Access-Requests: Access to required prod servers for new member of RelEng - https://phabricator.wikimedia.org/T301241 (10Arnoldokoth) [19:32:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [19:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2049.codfw.wmnet with OS buster [19:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:59] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2049.codfw.wmnet with OS buster [19:37:41] (03PS1) 10Andrew Bogott: cloud-vps nfs mounts: remove Snuggle project [puppet] - 10https://gerrit.wikimedia.org/r/761033 [19:44:37] (03CR) 10Andrew Bogott: [C: 03+2] cloud-vps nfs mounts: remove Snuggle project [puppet] - 10https://gerrit.wikimedia.org/r/761033 (owner: 10Andrew Bogott) [19:45:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T300402)', diff saved to https://phabricator.wikimedia.org/P20383 and previous config saved to /var/cache/conftool/dbconfig/20220208-194520-marostegui.json [19:45:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [19:45:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [19:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:26] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [19:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T300402)', diff saved to https://phabricator.wikimedia.org/P20384 and previous config saved to /var/cache/conftool/dbconfig/20220208-194528-marostegui.json [19:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T300402)', diff saved to https://phabricator.wikimedia.org/P20385 and previous config saved to /var/cache/conftool/dbconfig/20220208-195007-marostegui.json [19:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298554)', diff saved to https://phabricator.wikimedia.org/P20386 and previous config saved to /var/cache/conftool/dbconfig/20220208-195115-ladsgroup.json [19:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:24] (03CR) 10Eevans: [C: 03+1] restbase: remove restbase2010 [puppet] - 10https://gerrit.wikimedia.org/r/761006 (https://phabricator.wikimedia.org/T295375) (owner: 10Hnowlan) [19:51:26] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [19:55:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2048.codfw.wmnet with OS buster [19:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:23] 10SRE, 10ops-codfw, 10DC-Ops, 
10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2048.codfw.wmnet with OS buster completed: - mc2048 (**PASS**) - Remo... [19:58:13] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2050.codfw.wmnet with OS buster [19:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:19] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2050.codfw.wmnet with OS buster [20:00:05] jeena and dancy: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220208T2000). [20:00:39] I'll deploy to group 0 in a few minutes [20:04:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2049.codfw.wmnet with OS buster [20:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:07] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2049.codfw.wmnet with OS buster completed: - mc2049 (**PASS**) - Remo... 
[20:05:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P20387 and previous config saved to /var/cache/conftool/dbconfig/20220208-200512-marostegui.json [20:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:25] (03Abandoned) 10JHathaway: ferm: replace systemd unit to ensure success on boot [puppet] - 10https://gerrit.wikimedia.org/r/758548 (owner: 10JHathaway) [20:06:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P20388 and previous config saved to /var/cache/conftool/dbconfig/20220208-200621-ladsgroup.json [20:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2051.codfw.wmnet with OS buster [20:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:13] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2051.codfw.wmnet with OS buster [20:14:16] (03PS1) 10Jeena Huneidi: group0 wikis to 1.38.0-wmf.21 refs T300197 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761042 [20:14:21] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.38.0-wmf.21 refs T300197 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761042 (owner: 10Jeena Huneidi) [20:15:59] (03Merged) 10jenkins-bot: group0 wikis to 1.38.0-wmf.21 refs T300197 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/761042 (owner: 10Jeena Huneidi) [20:17:40] (03PS2) 10Jbond: P:sre::check_user: add support for namely API [puppet] - 10https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) [20:17:41] !log jhuneidi@deploy1002 rebuilt and 
synchronized wikiversions files: group0 wikis to 1.38.0-wmf.21 refs T300197 [20:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:45] T300197: 1.38.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T300197 [20:17:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:58] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:18:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:39] 10SRE, 10SRE-Access-Requests: Access to required prod servers for new member of RelEng - https://phabricator.wikimedia.org/T301241 (10Arnoldokoth) Hi @jnuche Kindly read through and sign the document linked below: https://phabricator.wikimedia.org/L3 @LSobanski / @akosiaris Kindly approve for the group `gi... 
[20:20:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P20389 and previous config saved to /var/cache/conftool/dbconfig/20220208-202016-marostegui.json [20:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P20390 and previous config saved to /var/cache/conftool/dbconfig/20220208-202127-ladsgroup.json [20:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:55] (03CR) 10Jbond: [C: 04-1] P:sre::check_user: add support for namely API [puppet] - 10https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) (owner: 10Jbond) [20:22:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:38] ls [20:24:45] oops :P [20:26:27] 10SRE, 10SRE-Access-Requests: Access to required prod servers for new member of RelEng - https://phabricator.wikimedia.org/T301241 (10Arnoldokoth) a:03Arnoldokoth [20:28:32] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2050.codfw.wmnet with OS buster [20:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:38] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2050.codfw.wmnet with OS buster completed: - mc2050 (**PASS**) - Remo... 
[20:30:12] 10SRE, 10SRE-Access-Requests: Access to required prod servers for new member of RelEng - https://phabricator.wikimedia.org/T301241 (10Arnoldokoth) a:05Arnoldokoth→03jnuche [20:33:34] !log T294805 Banned `elastic10[32-47]` from main, omega, and psi elasticsearch clusters. Shards are relocating on main and omega clusters as expected, but they don't seem to be moving on psi. Investigating that currently. Might have to do with row allocation constraints, but unsure currently [20:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:39] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805 [20:35:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T300402)', diff saved to https://phabricator.wikimedia.org/P20391 and previous config saved to /var/cache/conftool/dbconfig/20220208-203521-marostegui.json [20:35:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [20:35:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [20:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:26] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [20:35:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T300402)', diff saved to https://phabricator.wikimedia.org/P20392 and previous config saved to /var/cache/conftool/dbconfig/20220208-203529-marostegui.json [20:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:14] !log jhuneidi@deploy1002 Started scap: sync again in attempt to deploy 
1.38.0-wmf.21 to group0 [20:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298554)', diff saved to https://phabricator.wikimedia.org/P20393 and previous config saved to /var/cache/conftool/dbconfig/20220208-203634-ladsgroup.json [20:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:39] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [20:38:33] (03PS2) 10Jbond: populate_puppetdb: Add support for reading facts directly from disk [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/760949 [20:40:12] (03CR) 10jerkins-bot: [V: 04-1] populate_puppetdb: Add support for reading facts directly from disk [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/760949 (owner: 10Jbond) [20:40:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T300402)', diff saved to https://phabricator.wikimedia.org/P20394 and previous config saved to /var/cache/conftool/dbconfig/20220208-204036-marostegui.json [20:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:41] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [20:43:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:26] (03CR) 10Cwhite: [C: 03+2] logstash: use java home from profile::java [puppet] - 10https://gerrit.wikimedia.org/r/759757 (https://phabricator.wikimedia.org/T300853) (owner: 10Cwhite) [20:43:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2051.codfw.wmnet with OS buster [20:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[20:44:03] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2051.codfw.wmnet with OS buster completed: - mc2051 (**PASS**) - Remo... [20:47:55] (03PS3) 10Jbond: P:sre::check_user: add support for namely API [puppet] - 10https://gerrit.wikimedia.org/r/761029 (https://phabricator.wikimedia.org/T255750) [20:49:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: sync on pinkunicorn [20:50:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply on pinkunicorn [20:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:31] !log jhuneidi@deploy1002 Finished scap: sync again in attempt to deploy 1.38.0-wmf.21 to group0 (duration: 16m 17s) [20:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [20:54:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [20:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P20395 and previous config saved to /var/cache/conftool/dbconfig/20220208-205541-marostegui.json [20:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: 
sync on pinkunicorn [20:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:36] (03PS1) 10Dzahn: add content of the annual report 2015 site, aka 15.wikipedia.org [container/miscweb] - 10https://gerrit.wikimedia.org/r/761049 (https://phabricator.wikimedia.org/T300171) [21:04:22] (03PS3) 10Jbond: populate_puppetdb: Add support for reading facts directly from disk [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/760949 [21:10:46] (03PS1) 10Dzahn: add httpd config for 15.wikipedia.org [container/miscweb] - 10https://gerrit.wikimedia.org/r/761050 (https://phabricator.wikimedia.org/T300171) [21:10:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P20396 and previous config saved to /var/cache/conftool/dbconfig/20220208-211046-marostegui.json [21:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:28] (03CR) 10Dzahn: [C: 03+2] "None of this is new code, this is stuff from 2015 that was reviewed back then, just moving to a different repo." 
[container/miscweb] - 10https://gerrit.wikimedia.org/r/761049 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [21:12:38] (03PS2) 10Dzahn: add content of the annual report 2015 site, aka 15.wikipedia.org [container/miscweb] - 10https://gerrit.wikimedia.org/r/761049 (https://phabricator.wikimedia.org/T300171) [21:22:41] (03PS4) 10Jbond: populate_puppetdb: Add support for reading facts directly from disk [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/760949 (https://phabricator.wikimedia.org/T248169) [21:24:07] (03PS2) 10Dzahn: add httpd config for 15.wikipedia.org [container/miscweb] - 10https://gerrit.wikimedia.org/r/761050 (https://phabricator.wikimedia.org/T300171) [21:24:48] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:25:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T300402)', diff saved to https://phabricator.wikimedia.org/P20397 and previous config saved to /var/cache/conftool/dbconfig/20220208-212550-marostegui.json [21:25:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [21:25:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [21:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:56] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [21:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1164 (T300402)', diff saved to https://phabricator.wikimedia.org/P20398 and previous config saved to /var/cache/conftool/dbconfig/20220208-212558-marostegui.json [21:25:59] (03PS1) 10Jbond: 2.1.0: prepare for release 
[software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/761051 [21:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T300402)', diff saved to https://phabricator.wikimedia.org/P20399 and previous config saved to /var/cache/conftool/dbconfig/20220208-213031-marostegui.json [21:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:35] (03CR) 10Dzahn: [C: 03+2] add httpd config for 15.wikipedia.org [container/miscweb] - 10https://gerrit.wikimedia.org/r/761050 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [21:36:52] PROBLEM - Check systemd state on elastic1047 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:38:33] (03Merged) 10jenkins-bot: add httpd config for 15.wikipedia.org [container/miscweb] - 10https://gerrit.wikimedia.org/r/761050 (https://phabricator.wikimedia.org/T300171) (owner: 10Dzahn) [21:41:33] (03CR) 10Herron: [C: 03+1] grafana: add configurable execute_alerts option [puppet] - 10https://gerrit.wikimedia.org/r/761026 (https://phabricator.wikimedia.org/T300997) (owner: 10Cwhite) [21:44:18] (03CR) 10Dzahn: [C: 03+1] Add ingress support to miscweb chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/757935 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [21:45:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P20400 and previous config saved to /var/cache/conftool/dbconfig/20220208-214536-marostegui.json [21:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:48] !log [Elastic] `ryankemper@elastic1081:~$ 
sudo systemctl restart elasticsearch_6*psi*` (9600 but not 9200 seemed to be having connectivity issues) [21:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:13] 10SRE, 10Wikimedia-Etherpad, 10serviceops, 10vm-requests: create bullseye VM for Etherpad upgrade (and upgrade it:) - https://phabricator.wikimedia.org/T300568 (10Dzahn) This is scheduled for Thursday, Feb 10 2022, 9 to 10.30 UTC. Added to SRE calendar "vendor maintenance", mailed ops-l and reached out to... [21:52:17] !log ryankemper@puppetmaster1001 conftool action : GET; selector: service=search [21:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:28] !log ryankemper@puppetmaster1001 conftool action : GET; selector: service=search [21:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:49] ^ wrong syntax, ignore that [21:58:37] !log ryankemper@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: cluster=elasticsearch,name=elastic1* [21:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:10] !log ryankemper@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: cluster=elasticsearch [21:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:51] !log T294805 elastic10[68-83] erroneously weren't in pybal, added them just now: `sudo confctl select 'cluster=elasticsearch' set/pooled=yes:weight=10` (there's no hosts in the `conftool-data` list that we want depooled so we're okay setting all to pooled w/ equal weight) [21:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:55] T294805: Service implementation for elastic10[68-83].eqiad.wmnet - https://phabricator.wikimedia.org/T294805 [22:00:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164', diff saved to https://phabricator.wikimedia.org/P20401 and previous config 
saved to /var/cache/conftool/dbconfig/20220208-220041-marostegui.json [22:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:12] !log doing planned 1-by-1 shutdown of ports xe-0/1/1, xe-0/1/2 and xe-0/1/9 on cr2-esams, to test reliability of each following user reports of issues at AMS-IX. [22:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1164 (T300402)', diff saved to https://phabricator.wikimedia.org/P20402 and previous config saved to /var/cache/conftool/dbconfig/20220208-221545-marostegui.json [22:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:51] T300402: Add namespace column to Linter table - https://phabricator.wikimedia.org/T300402 [22:18:18] (03PS2) 10Zabe: graphite: whisper_cleanup: migrate cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) [22:19:34] (03CR) 10jerkins-bot: [V: 04-1] graphite: whisper_cleanup: migrate cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [22:21:04] (03PS3) 10Zabe: graphite: whisper_cleanup: migrate cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) [22:25:13] (03PS1) 10Dzahn: miscweb: bump staging to 2022-02-08-214018-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/761058 [22:25:24] (03CR) 10jerkins-bot: [V: 04-1] miscweb: bump staging to 2022-02-08-214018-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/761058 (owner: 10Dzahn) [22:28:02] (03PS2) 10Dzahn: miscweb: bump staging to 2022-02-08-214018-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/761058 [22:28:20] (03PS4) 10Zabe: graphite: whisper_cleanup: migrate cron to systemd timer job [puppet] - 
10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) [22:29:54] (03PS1) 10Arlolra: Remove case for installing nodejs 10 on stretch for testreduce [puppet] - 10https://gerrit.wikimedia.org/r/761059 [22:30:14] (03PS1) 10Dzahn: microsites: remove 15.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/761060 [22:31:52] (03CR) 10Subramanya Sastry: [C: 03+1] Remove case for installing nodejs 10 on stretch for testreduce [puppet] - 10https://gerrit.wikimedia.org/r/761059 (owner: 10Arlolra) [22:32:02] (03PS1) 10Dzahn: trafficserver: switch 15.wikipedia.org backend [puppet] - 10https://gerrit.wikimedia.org/r/761062 [22:33:44] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:34:18] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q3): Switch Logstash/apifeatureusage to use the system OpenJDK 11 - https://phabricator.wikimedia.org/T300853 (10colewhite) 05In progress→03Resolved Logstash is now using the system Java runtime. [22:34:23] (03PS1) 10Dzahn: httpbb: move tests for 15.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/761063 [22:34:54] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:34:56] (03CR) 10Dzahn: [C: 03+2] miscweb: bump staging to 2022-02-08-214018-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/761058 (owner: 10Dzahn) [22:36:26] (03CR) 10Dzahn: [C: 03+2] "Thank you! 
Compiled on "C:testreduce" and this is the only host and no diff: https://puppet-compiler.wmflabs.org/pcc-worker1003/33633/" [puppet] - 10https://gerrit.wikimedia.org/r/761059 (owner: 10Arlolra) [22:39:07] (03Merged) 10jenkins-bot: miscweb: bump staging to 2022-02-08-214018-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/761058 (owner: 10Dzahn) [22:40:58] (03CR) 10Dzahn: "confirmed noop on testreduce1001" [puppet] - 10https://gerrit.wikimedia.org/r/761059 (owner: 10Arlolra) [22:41:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2052.codfw.wmnet with OS buster [22:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:49] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2052.codfw.wmnet with OS buster [22:42:17] !log dzahn@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply on main [22:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:10] !log dzahn@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: sync on main [22:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:58] (03PS1) 10Herron: watchrat: route alerts to irc and noc@ [puppet] - 10https://gerrit.wikimedia.org/r/761064 (https://phabricator.wikimedia.org/T299147) [22:50:07] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2053.codfw.wmnet with OS buster [22:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:13] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2053.codfw.wmnet with OS buster 
[22:53:08] (03CR) 10Zabe: "PCC fails: https://puppet-compiler.wmflabs.org/pcc-worker1001/33632/graphite2003.codfw.wmnet/change.graphite2003.codfw.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/751470 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [23:01:27] 10SRE, 10Parsoid, 10serviceops: Move testreduce to nodejs 12 - https://phabricator.wikimedia.org/T301303 (10Arlolra) [23:02:47] PROBLEM - MariaDB Replica IO: s2 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2104.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:03:47] RECOVERY - MariaDB Replica IO: s2 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:12:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2052.codfw.wmnet with OS buster [23:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:11] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2052.codfw.wmnet with OS buster completed: - mc2052 (**PASS**) - Remo... 
[23:17:21] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2054.codfw.wmnet with OS buster [23:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:27] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2054.codfw.wmnet with OS buster [23:19:10] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:20:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2053.codfw.wmnet with OS buster [23:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:50] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2053.codfw.wmnet with OS buster completed: - mc2053 (**PASS**) - Remo... 
[23:21:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host mc2055.codfw.wmnet with OS buster [23:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:18] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host mc2055.codfw.wmnet with OS buster [23:22:33] !log removing 1 file for legal compliance [23:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:36] PROBLEM - MariaDB Replica IO: s5 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:34:26] RECOVERY - MariaDB Replica IO: s5 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:35:06] PROBLEM - SSH on wtp1027.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:42:18] (03PS1) 10Ebernhardson: Provide jwt secret to blazegraph for logging [puppet] - 10https://gerrit.wikimedia.org/r/761075 (https://phabricator.wikimedia.org/T293462) [23:47:51] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:48:25] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2054.codfw.wmnet with OS buster [23:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:29] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need 
By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2054.codfw.wmnet with OS buster completed: - mc2054 (**PASS**) - Remo... [23:52:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc2055.codfw.wmnet with OS buster [23:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:02] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host mc2055.codfw.wmnet with OS buster completed: - mc2055 (**PASS**) - Remo... [23:54:33] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10Papaul) [23:55:16] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:(Need By: TBD) rack/setup/install mc20[38-55] - https://phabricator.wikimedia.org/T294962 (10Papaul) 05Open→03Resolved This is complete