[00:00:05] RoanKattouw and Urbanecm: #bothumor I ♥ Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211125T0000).
[00:00:05] No Gerrit patches in the queue for this window AFAICS.
[00:00:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:10:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:12:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:14:56] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 91.9% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[00:19:08] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[00:20:44] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:22:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:29:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:31:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:35:49] mutante: back around now, thanks for restarting that blazegraph instance! will also take a look at the docs and see if there's some more context I can add for the future
[00:37:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2066.codfw.wmnet with OS buster
[00:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:10] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2066.codfw.wmnet with OS buster comp...
[00:39:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:44:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:50:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:00:05] twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211125T0100).
[01:04:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2067.codfw.wmnet with OS buster
[01:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:20] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2067.codfw.wmnet with OS buster
[01:05:08] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (Papaul)
[01:13:02] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:18:54] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:19:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2068.codfw.wmnet with OS buster
[01:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:49] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2068.codfw.wmnet with OS buster
[01:20:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:23:10] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:28:06] PROBLEM - cassandra CQL 10.64.16.27:9042 on maps1008 is CRITICAL: connect to address 10.64.16.27 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[01:29:00] PROBLEM - Check systemd state on maps1008 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:29:30] PROBLEM - cassandra service on maps1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:34:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2067.codfw.wmnet with OS buster
[01:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:07] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2067.codfw.wmnet with OS buster comp...
[01:36:31] RECOVERY - Check systemd state on maps1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:46] RECOVERY - cassandra CQL 10.64.16.27:9042 on maps1008 is OK: TCP OK - 0.000 second response time on 10.64.16.27 port 9042 https://phabricator.wikimedia.org/T93886
[01:37:14] RECOVERY - cassandra service on maps1008 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:49:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:49:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2068.codfw.wmnet with OS buster
[01:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:00] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2068.codfw.wmnet with OS buster comp...
[01:54:01] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2070.codfw.wmnet with OS buster
[01:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:06] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2070.codfw.wmnet with OS buster
[01:56:32] SRE, LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (Daimona) As I said, if my wikimedia email needs to be in the puppet file, that's fine. I do prefer not to use my real name publicly, but I believe this particular instance to be acceptable (as in, not...
[02:04:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2071.codfw.wmnet with OS buster
[02:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:04:59] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2071.codfw.wmnet with OS buster
[02:05:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:08:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:14:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:16:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:16:49] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:18:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:19:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:23:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:23:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2070.codfw.wmnet with OS buster
[02:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:45] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2070.codfw.wmnet with OS buster comp...
[02:26:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:34:46] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2071.codfw.wmnet with OS buster
[02:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:51] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2071.codfw.wmnet with OS buster comp...
[02:36:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,swagger_check_citoid_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:38:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:39:12] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[02:42:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2072.codfw.wmnet with OS buster
[02:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:22] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2072.codfw.wmnet with OS buster
[02:44:42] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:07:15] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:12:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2072.codfw.wmnet with OS buster
[03:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:29] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2072.codfw.wmnet with OS buster comp...
[03:12:33] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:12:36] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (Papaul)
[03:17:22] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (Papaul) @RKemper @Gehel all the servers are ready to put in service but not elastic2069 for some reason i can not login to it so I will h...
[03:17:25] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:19:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:22:34] RECOVERY - Check systemd state on ms-fe2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:25:54] PROBLEM - Check systemd state on ms-fe2011 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:15] SRE, SRE-Access-Requests, Community-Tech: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (Samwilson) hehe I was too impatient! :-) Thanks for the explanation.
[03:48:10] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.4% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:52:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:17:45] SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (RKemper) >>! In T294154#7528550, @Papaul wrote: > @RKemper @Gehel all the servers are ready to put in service but not elastic2069 for som...
[04:20:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:22:04] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:24:08] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:25:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:25:38] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.93`. Pre-deploy tests passing on canary `wdqs1003`
[04:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:51] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@29c5cd7]: 0.3.93
[04:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:27:01] !log [WDQS Deploy] Tests passing following deploy of `0.3.93` on canary `wdqs1003`; proceeding to rest of fleet
[04:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:29:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:30:34] !log [Elastic] Cleaning up dangling apt packages: `ryankemper@cumin1001:~$ sudo cumin -b 4 'elastic*' 'sudo apt autoremove -y'`
[04:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:34:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:35:14] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@29c5cd7]: 0.3.93 (duration: 09m 23s)
[04:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:35:44] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:38:46] !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[04:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:38:49] !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[04:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:39:01] !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[04:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:04] !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@29c5cd7] (wcqs): Deploy 0.3.93 to WCQS
[04:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:43:57] !log [WCQS Deploy] Tests look good following deploy of `0.3.93` to canary `wcqs1002.eqiad.wmnet`, proceeding to rest of fleet
[04:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:32] !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@29c5cd7] (wcqs): Deploy 0.3.93 to WCQS (duration: 05m 27s)
[04:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:38] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:33:20] PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.1% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:36:48] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:51:55] !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there's no relevant criticals in Icinga, and Grafana looks good
[05:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:07] (PS4) Majavah: P::doc: sync data to non-active servers [puppet] - https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653)
[06:16:00] (CR) Majavah: P::doc: sync data to non-active servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: Majavah)
[06:21:54] (PS5) Majavah: P::doc: sync data to non-active servers [puppet] - https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653)
[06:30:19] (PS6) Majavah: P::doc: sync data to non-active servers [puppet] - https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653)
[06:31:26] !log Restart tendril's DB
[06:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:44] RECOVERY - MariaDB memory on db1115 is OK: OK Memory 59% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[06:57:22] Just a short reminder: we will start re-deploying services in the eqiad Kubernetes cluster soon. Feel free to ping me any time.
[06:58:46] RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:05:22] PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:09:49] !log start re-deploy procedure in eqiad Kubernetes T251305
[07:09:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:56] T251305: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305
[07:10:00] !log downtime PyBal backends health check on lvs1015 and lvs1016 for helm3 re-deploy T251305. I'm keeping an eye on Icinga and will remove the downtime as soon as I'm finished
[07:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:09] (CR) Giuseppe Lavagetto: [C: -1] "Unless you also submit a patch to add php-yaml to the php7.X-fpm-multiversion-base images, this can't be merged." [puppet] - https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: Dduvall)
[07:17:23] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 32 hosts with reason: helm3 de-deploy T251305
[07:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:27] T251305: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305
[07:17:46] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 32 hosts with reason: helm3 de-deploy T251305
[07:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:43] !log jelto@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=(apertium|api-gateway|apple-search|blubberoid|citoid|cxserver|echostore|eventgate-analytics|eventgate-analytics-external|eventgate-logging-external|eventstreams|eventstreams-internal|linkrecommendation|mathoid|mobileapps|proton|push-notifications|recommendation-api|sessionstore|shellbox|shellbox-constraints|shellbox-media|shellbox-syntax
[07:20:43] highlight|shellbox-timeline|similar-users|tegola-vector-tiles|termbox|wikifeeds|zotero)
[07:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:47] (PS1) Marostegui: db1128: Move it to test-s1 [puppet] - https://gerrit.wikimedia.org/r/741754 (https://phabricator.wikimedia.org/T295965)
[07:23:21] (CR) jerkins-bot: [V: -1] db1128: Move it to test-s1 [puppet] - https://gerrit.wikimedia.org/r/741754 (https://phabricator.wikimedia.org/T295965) (owner: Marostegui)
[07:23:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1145.eqiad.wmnet with reason: Maintenance T296143
[07:23:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1145.eqiad.wmnet with reason: Maintenance T296143
[07:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:46] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[07:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:36] (PS2) Marostegui: db1128: Move it to test-s1 [puppet] - https://gerrit.wikimedia.org/r/741754 (https://phabricator.wikimedia.org/T295965)
[07:26:44] (CR) Marostegui: [C: +2] db1128: Move it to test-s1 [puppet] - https://gerrit.wikimedia.org/r/741754 (https://phabricator.wikimedia.org/T295965) (owner: Marostegui)
[07:27:29] marostegui: I'm running the schema change on db1145:3314 without depooling, because it's not pooled. Is that correct? https://noc.wikimedia.org/dbconfig/eqiad.json
[07:27:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1128.eqiad.wmnet with OS bullseye
[07:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:10] SRE, RESTBase-Cassandra: Restbase/Cassandra TLS cert expiration warnings - https://phabricator.wikimedia.org/T296448 (elukey)
[07:28:37] Amir1: yeah, it is a backup source
[07:28:54] cool
[07:29:53] !log elukey@mwdebug2002:~$ sudo systemctl reset-failed ifup@ens5.service
[07:29:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:20] RECOVERY - Check systemd state on mwdebug2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:32:51] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[07:32:51] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
[07:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:18] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 2521 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:35:21] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[07:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventgate_main_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:37:32] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:38:06] remaining active connections to eventgate, potentially
[07:38:20] I forgot to depool eventgate-main in my list. Is this a big problem? It was pooled during the re-deploy, so some requests hit the re-deploy :/
[07:38:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:39:05] jelto: ah...shit. We should double check the list then
[07:39:35] but for now, no longer an issue I guess as it is available again now
[07:45:31] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'production' .
[07:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:22] !log elevated MediaWiki exceptions and fatals (from ~07:35) due to a mistake during re-deploy of eventgate-main
[07:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:16] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
[07:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:14] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
[07:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:43] !log Stop mysql on db1133 to clone db1128 as a test host T295965
[07:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:47] T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965
[07:51:21] !log jelto@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=(echostore|sessionstore)
[07:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1128.eqiad.wmnet with OS bullseye
[07:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-test site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:56:32] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' .
[07:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:31] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' .
[07:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:02] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[08:00:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211125T0800)
[08:02:05] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apple-search' for release 'main' .
[08:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:15] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apple-search' for release 'main' .
[08:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:45] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[08:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:14] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[08:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:05] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[08:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:27] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[08:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:40] (03PS1) 10Elukey: kserve: fix a typo in the inference service config-map [deployment-charts] - 10https://gerrit.wikimedia.org/r/741844 [08:14:06] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [08:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:19] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . [08:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:23] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . [08:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:44] (03CR) 10Elukey: [C: 03+2] kserve: fix a typo in the inference service config-map [deployment-charts] - 10https://gerrit.wikimedia.org/r/741844 (owner: 10Elukey) [08:21:27] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [08:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:32] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . [08:21:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:47] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [08:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:16] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:35] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[08:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:21] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 79 probes of 640 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:25:20] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [08:25:20] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [08:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:41] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [08:28:41] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [08:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:33] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 44 probes of 640 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:31:14] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [08:31:14] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . 
[08:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:38] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [08:34:38] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [08:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:07] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [08:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143 [08:39:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143 [08:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:00] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [08:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:03] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' . [08:40:03] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . 
[08:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143 [08:41:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143 [08:41:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143 [08:42:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143 [08:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:51] sorry for too many downtimes, I'm debugging something [08:43:02] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: Update address for perf-team alerts [puppet] - 10https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368) (owner: 10Krinkle) [08:43:17] PROBLEM - cassandra-c CQL 10.192.48.144:9042 on restbase2023 is CRITICAL: connect to address 10.192.48.144 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [08:43:21] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' . 
[08:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143 [08:44:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143 [08:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:30] ignore these times [08:44:35] *downtimes [08:45:27] RECOVERY - cassandra-c CQL 10.192.48.144:9042 on restbase2023 is OK: TCP OK - 0.033 second response time on 10.192.48.144 port 9042 https://phabricator.wikimedia.org/T93886 [08:46:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143 [08:46:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143 [08:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:06] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [08:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:44] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' . 
[08:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143 [08:48:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143 [08:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17837 and previous config saved to /var/cache/conftool/dbconfig/20211125-084834-ladsgroup.json [08:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:54] okay fixed now, this is supposed to be the last downtime [08:50:30] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [08:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:34] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [08:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:47] (03CR) 10Filippo Giunchedi: alertmanager: Update address for perf-team alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368) (owner: 10Krinkle) [08:58:43] (03CR) 10Volans: [V: 03+2 C: 03+2] netbox - cas: allow users with active=False [software/netbox] - 10https://gerrit.wikimedia.org/r/739309 (https://phabricator.wikimedia.org/T295148) (owner: 10Volans) [08:59:57] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[08:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:01] PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [09:02:25] (03PS3) 10Vgutierrez: cache::haproxy: Disable PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/741693 (https://phabricator.wikimedia.org/T290005) [09:02:27] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . [09:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:00] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [09:05:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:13] RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.78 ms [09:10:12] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [09:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:16:44] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [09:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:29] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox' for release 'main' . [09:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:30] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' . 
[09:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:36] (03PS1) 10Elukey: pontoon: add profile::base::certificates basic config [puppet] - 10https://gerrit.wikimedia.org/r/741847 [09:23:14] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-media' for release 'main' . [09:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:00] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-media' for release 'main' . [09:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:29] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:26:19] 10SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (10elukey) Another use case, brought up this morning, is Pontoon - we should try to keep consistency in there too, and all environments have their own puppet master CAs. [09:27:03] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' . [09:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:20] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' . [09:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:00] (03PS2) 10Elukey: pontoon: add profile::base::certificates basic config [puppet] - 10https://gerrit.wikimedia.org/r/741847 [09:31:28] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[09:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:16] (03CR) 10Elukey: [C: 03+2] pontoon: add profile::base::certificates basic config [puppet] - 10https://gerrit.wikimedia.org/r/741847 (owner: 10Elukey) [09:34:11] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [09:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:15] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' . [09:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:31] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' . [09:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:09] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [09:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:46] !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' . 
[09:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:46] !log jelto@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=(apertium|api-gateway|apple-search|blubberoid|citoid|cxserver|echostore|eventgate-analytics|eventgate-analytics-external|eventgate-logging-external|eventstreams|eventstreams-internal|linkrecommendation|mathoid|mobileapps|proton|push-notifications|recommendation-api|sessionstore|shellbox|shellbox-constraints|shellbox-media|shellbox-syntaxhighlight|shellbox-timeline|similar-users|tegola-vector-tiles|termbox|wikifeeds|zotero) [09:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:57] (03PS1) 10David Caro: timesyncd: add package requirement [puppet] - 10https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) [09:58:28] (03PS1) 10Jbond: O:puppet_compiler::puppetdb: Add role for puppetdb compiler (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/741850 [09:59:06] (03CR) 10jerkins-bot: [V: 04-1] O:puppet_compiler::puppetdb: Add role for puppetdb compiler (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/741850 (owner: 10Jbond) [10:02:52] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32626/console" [puppet] - 10https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: 10David Caro) [10:02:55] 10SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (10JMeybohm) >>! In T296089#7527221, @elukey wrote: > A simplification would be to avoid the install check and create the pem bundle at build time as well, but there are probably some use cases that I don't have in m... 
[10:05:38] (03CR) 10Jelto: [C: 03+2] hiera::role::common::deployment_server update helmBinary eqiad [puppet] - 10https://gerrit.wikimedia.org/r/741681 (https://phabricator.wikimedia.org/T251305) (owner: 10Jelto) [10:07:56] 10SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (10elukey) >>! In T296089#7528885, @JMeybohm wrote: >>>! In T296089#7527221, @elukey wrote: >> If we had a way to generate multiple package from the same debian source (IIRC there should be the possibility), we could... [10:18:20] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) cc from ops list: The re-deploy for all services in the eqiad Kubernetes cluster was successful. However this time we had an impact on service availability. Planned reduced serv... [10:18:32] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) [10:19:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1146:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17840 and previous config saved to /var/cache/conftool/dbconfig/20211125-101921-ladsgroup.json [10:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:26] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [10:21:22] 10SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (10jbond) Just some early notes ill follow up more in a bit > p12/jks bundles In this method we would still do the jks/p12 generation in puppet > As described above, the wmf-certificates package checks in /etc/ca-c... 
[10:24:59] (03PS1) 10Filippo Giunchedi: pontoon: add prometheus-02 (Bullseye instance) to o11y [puppet] - 10https://gerrit.wikimedia.org/r/741855 [10:25:01] (03PS1) 10Filippo Giunchedi: prometheus: fix blackbox-exporter config syntax for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/741856 [10:25:03] (03PS1) 10Filippo Giunchedi: pontoon: fix o11y stack for recent changes [puppet] - 10https://gerrit.wikimedia.org/r/741857 [10:25:11] !log rolling restart of varnish and HAProxy on cp2042.codfw.wmnet,cp1090.eqiad.wmnet,cp[5012].eqsin.wmnet,cp3065.esams.wmnet,cp[4026,4032].ulsfo.wmnet' to disable PROXY protocol - T290005 [10:25:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:15] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:25:21] (03CR) 10Vgutierrez: [C: 03+2] cache::haproxy: Disable PROXY protocol [puppet] - 10https://gerrit.wikimedia.org/r/741693 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [10:25:39] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:27:33] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add prometheus-02 (Bullseye instance) to o11y [puppet] - 10https://gerrit.wikimedia.org/r/741855 (owner: 10Filippo Giunchedi) [10:33:01] seeking soul(s) for a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/741856 [10:33:19] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: fix o11y stack for recent changes [puppet] - 10https://gerrit.wikimedia.org/r/741857 (owner: 10Filippo Giunchedi) [10:33:25] (03PS2) 10Filippo Giunchedi: pontoon: fix o11y stack for recent changes [puppet] - 10https://gerrit.wikimedia.org/r/741857 [10:34:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1146:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17841 and previous config saved to 
/var/cache/conftool/dbconfig/20211125-103425-ladsgroup.json [10:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:30] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [10:37:49] godog: LGTM but is it possible to have a pcc to confirm? [10:37:54] maybe buster vs bullseye [10:39:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:40:13] 10SRE, 10serviceops, 10Kubernetes, 10Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (10Jelto) [10:40:45] elukey: mhhh I don't have a bullseye prometheus host available to pcc yet I think, I can do buster though [10:41:52] ah yes yes ok [10:41:56] just a quick check [10:41:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:42:53] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:44:32] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32627/console" [puppet] - 10https://gerrit.wikimedia.org/r/741856 (owner: 10Filippo Giunchedi) [10:44:44] elukey: SGTM, done ^ [10:49:07] (03CR) 10Elukey: [C: 03+1] prometheus: fix blackbox-exporter config syntax for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/741856 (owner: 10Filippo Giunchedi) [10:49:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1146:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17842 and previous config saved to /var/cache/conftool/dbconfig/20211125-104930-ladsgroup.json [10:49:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:35] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [10:52:20] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: fix blackbox-exporter config syntax for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/741856 (owner: 10Filippo Giunchedi) [10:52:26] nice, thanks elukey [11:04:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:04:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1146:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17843 and previous config saved to /var/cache/conftool/dbconfig/20211125-110435-ladsgroup.json [11:04:37] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 4:00:00 on db1147.eqiad.wmnet with reason: Maintenance T296143 [11:04:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1147.eqiad.wmnet with reason: Maintenance T296143 [11:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:40] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143 [11:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T296143)', diff saved to https://phabricator.wikimedia.org/P17844 and previous config saved to /var/cache/conftool/dbconfig/20211125-110443-ladsgroup.json [11:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:51] 10SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (10elukey) >>! In T296089#7528900, @jbond wrote: >> Another use case, brought up this morning, is Pontoon - we should try to keep consistency in there too, > In relation to this, I want to say that imo having change... 
[11:13:02] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add missing termbox codes from Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331) [11:19:05] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:01] PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:28:34] (03PS1) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [11:29:08] (03CR) 10jerkins-bot: [V: 04-1] P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [11:29:20] (03PS1) 10Hnowlan: cassandra: correct check notes URL [puppet] - 10https://gerrit.wikimedia.org/r/741868 [11:29:53] 10SRE, 10RESTBase-Cassandra, 10Platform Team Workboards (Platform Engineering Reliability): Restbase/Cassandra TLS cert expiration warnings - https://phabricator.wikimedia.org/T296448 (10hnowlan) a:03hnowlan [11:30:51] (03CR) 10David Caro: [V: 03+1 C: 04-1] "We have an issue here." [puppet] - 10https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: 10David Caro) [11:31:27] RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:32:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:33:40] 10SRE, 10Commons, 10Data-Persistence (Consultation), 10MediaWiki-extensions-WikibaseClient, and 4 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (10Manuel) [11:37:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:37:31] PROBLEM - Check systemd state on db1139 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:43:57] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:49:15] (03PS2) 10David Caro: timesyncd: handle bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) [11:51:17] (03PS2) 10Jbond: P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) [11:51:56] (03CR) 10jerkins-bot: [V: 04-1] P:base::certificates: update support for trusted CA [puppet] - 10https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: 10Jbond) [11:55:13] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32630/console" [puppet] - 10https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: 10David Caro) [11:56:02] !log jynus@cumin1001 dbctl commit (dc=all): 'Reduce db1163 load', diff saved to https://phabricator.wikimedia.org/P17845 and previous config saved to 
/var/cache/conftool/dbconfig/20211125-115602-jynus.json [11:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:13] (03CR) 10David Caro: [V: 03+1] "PCC just shows the notify changed (from a string to a list). Looks ok, will try to get a bullseye host." [puppet] - 10https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: 10David Caro) [11:56:31] !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase202[1-3].codfw.wmnet: Restarting for certificate updates - hnowlan@cumin1001 [11:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:49] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32631/console" [puppet] - 10https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: 10David Caro) [11:59:13] (03PS3) 10David Caro: timesyncd: handle bullseye ntp hosts [puppet] - 10https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) [12:01:26] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: add ceph packages in the octopus/bullseye combo [puppet] - 10https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175) (owner: 10Arturo Borrero Gonzalez) [12:03:46] no deploys all day, this includes the backport window that would normally happen at this time. [12:04:09] carry on! [12:04:35] !log jynus@cumin1001 dbctl commit (dc=all): 'Reduce db1163 load even more', diff saved to https://phabricator.wikimedia.org/P17846 and previous config saved to /var/cache/conftool/dbconfig/20211125-120435-jynus.json [12:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:06:49] (PS1) Arturo Borrero Gonzalez: aptrepo: fix duplicate update name [puppet] - https://gerrit.wikimedia.org/r/741870 (https://phabricator.wikimedia.org/T296175)
[12:09:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:11:14] (CR) Arturo Borrero Gonzalez: [C: +1] timesyncd: handle bullseye ntp hosts [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: David Caro)
[12:11:38] !log jynus@cumin1001 dbctl commit (dc=all): 'Temp. depool db1163 fully', diff saved to https://phabricator.wikimedia.org/P17847 and previous config saved to /var/cache/conftool/dbconfig/20211125-121138-jynus.json
[12:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:43] (CR) Arturo Borrero Gonzalez: [C: +2] aptrepo: fix duplicate update name [puppet] - https://gerrit.wikimedia.org/r/741870 (https://phabricator.wikimedia.org/T296175) (owner: Arturo Borrero Gonzalez)
[12:14:06] !log disable temp.
gtid on db1163
[12:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:38] !log update repo bullseye-wikimedia/thirdparty/ceph-octopus (T296175)
[12:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:42] T296175: cloudcephosd1021 is using an old ceph version because its running debian bullseye instead of buster - https://phabricator.wikimedia.org/T296175
[12:20:06] (PS1) Arturo Borrero Gonzalez: ceph: common: support both buster & bullseye [puppet] - https://gerrit.wikimedia.org/r/741883 (https://phabricator.wikimedia.org/T296175)
[12:24:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:27:10] (CR) Arturo Borrero Gonzalez: [V: +2] "PCC as expected: https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/32633/console" [puppet] - https://gerrit.wikimedia.org/r/741883 (https://phabricator.wikimedia.org/T296175) (owner: Arturo Borrero Gonzalez)
[12:27:22] (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] ceph: common: support both buster & bullseye [puppet] - https://gerrit.wikimedia.org/r/741883 (https://phabricator.wikimedia.org/T296175) (owner: Arturo Borrero Gonzalez)
[12:27:50] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase202[1-3].codfw.wmnet: Restarting for certificate updates - hnowlan@cumin1001
[12:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:03] RECOVERY - Check systemd state on db1139 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:03] (CR) Elukey: [C: +1] "Thanks!"
[puppet] - https://gerrit.wikimedia.org/r/741868 (owner: Hnowlan)
[12:31:49] (CR) Hnowlan: [C: +2] cassandra: correct check notes URL [puppet] - https://gerrit.wikimedia.org/r/741868 (owner: Hnowlan)
[12:32:38] SRE, RESTBase-Cassandra, Platform Team Workboards (Platform Engineering Reliability): Restbase/Cassandra TLS cert expiration warnings - https://phabricator.wikimedia.org/T296448 (hnowlan) Open→Resolved
[12:32:48] SRE, RESTBase-Cassandra, Platform Team Workboards (Platform Engineering Reliability): Restbase/Cassandra TLS cert expiration warnings - https://phabricator.wikimedia.org/T296448 (hnowlan) Thanks for reporting this !
[12:34:40] hnowlan: o/ thanks for the new docs - IIUC in this use case we'd need to rm the keys from the private repo (for the three hosts) and then re-run the script to generate the new keys (and then commit and let puppet run etc..)
[12:35:09] ah I see you already done it probably :D
[12:36:06] elukey: yeah :) rm the files, run cassandra-ca-manager, commit, let puppet run and then do a roll-restart
[12:36:47] ack thanks :)
[12:48:16] SRE, Patch-For-Review: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (jbond) Im not sure this is the place to have this discussion, perhaps we should fork to another task? > I disagree with this John, Pontoon was a big effort to allow reusable testing environ...
[12:50:53] (PS3) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[12:51:29] (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[12:57:07] (PS1) Ayounsi: Prepare site.pp for new ping VMs [puppet] - https://gerrit.wikimedia.org/r/741912 (https://phabricator.wikimedia.org/T295767)
[12:57:43] (CR) David Caro: [C: +1] "LGTM" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738452 (owner: Jbond)
[12:59:10] (PS2) David Caro: WIP cli: add --fail-fast flag and behavior [software/puppet-compiler] - https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028)
[12:59:12] (CR) David Caro: WIP cli: add --fail-fast flag and behavior (16 comments) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: David Caro)
[12:59:36] (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32636/console" [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: David Caro)
[12:59:45] (CR) David Caro: "I have not yet fixed the tests, and have to run some tests locally, but mypy/flake8 pass now."
[software/puppet-compiler] - https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: David Caro)
[13:01:09] (CR) jerkins-bot: [V: -1] WIP cli: add --fail-fast flag and behavior [software/puppet-compiler] - https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: David Caro)
[13:05:17] !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host ping3002.esams.wmnet
[13:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:55] (CR) Ayounsi: [C: +2] Prepare site.pp for new ping VMs [puppet] - https://gerrit.wikimedia.org/r/741912 (https://phabricator.wikimedia.org/T295767) (owner: Ayounsi)
[13:06:45] (CR) Jbond: timesyncd: handle bullseye ntp hosts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: David Caro)
[13:07:43] (CR) Jbond: [C: +2] Add Typing: And fix other minor lint issues [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738452 (owner: Jbond)
[13:09:36] (Merged) jenkins-bot: Add Typing: And fix other minor lint issues [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738452 (owner: Jbond)
[13:10:06] (PS4) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[13:10:41] (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:14:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ping3002.esams.wmnet
[13:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:28] (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 1):
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32642/console" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: David Caro)
[13:20:31] !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host ping2002.codfw.wmnet
[13:20:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:20] (PS5) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[13:23:58] (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:26:13] (PS6) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[13:26:48] (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:28:46] !log killing lingering process from mwmaint to depooled db1147
[13:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ping2002.codfw.wmnet
[13:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:02] !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host ping1002.eqiad.wmnet
[13:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:02] (CR) Jbond: [C: +1] "LGTM, possible im being to picky on the comment so feel free to merge as is" [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: David Caro)
[13:39:33] (PS2) Jelto:
miscweb: fix whitespace for affinity [deployment-charts] - https://gerrit.wikimedia.org/r/741149
[13:39:55] (PS7) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[13:40:31] (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:40:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ping1002.eqiad.wmnet
[13:40:37] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32646/console" [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:12] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32647/console" [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:43:08] (PS1) Ayounsi: Add new ping VMs to DHCP [puppet] - https://gerrit.wikimedia.org/r/741916 (https://phabricator.wikimedia.org/T295767)
[13:44:25] (PS8) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[13:44:59] (CR) Ayounsi: [C: +2] Add new ping VMs to DHCP [puppet] - https://gerrit.wikimedia.org/r/741916 (https://phabricator.wikimedia.org/T295767) (owner: Ayounsi)
[13:45:10] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32648/console" [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:45:55] (PS2)
Jbond: profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[13:46:17] (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:46:46] (CR) Jelto: [C: +2] miscweb: fix whitespace for affinity [deployment-charts] - https://gerrit.wikimedia.org/r/741149 (owner: Jelto)
[13:47:31] (PS2) Jelto: helmfile.d:miscweb add node affinity to ssd nodes [deployment-charts] - https://gerrit.wikimedia.org/r/741124
[13:47:36] (CR) jerkins-bot: [V: -1] profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[13:49:03] (CR) jerkins-bot: [V: -1] helmfile.d:miscweb add node affinity to ssd nodes [deployment-charts] - https://gerrit.wikimedia.org/r/741124 (owner: Jelto)
[13:49:42] (PS3) Jelto: miscweb: fix whitespace for affinity [deployment-charts] - https://gerrit.wikimedia.org/r/741149
[13:52:58] (PS3) Jbond: profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[13:54:19] (CR) Jelto: [C: +2] miscweb: fix whitespace for affinity [deployment-charts] - https://gerrit.wikimedia.org/r/741149 (owner: Jelto)
[13:54:48] (CR) jerkins-bot: [V: -1] profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[13:54:50] (CR) Giuseppe Lavagetto: [C: +1] Remove search.wikimedia.org files [mediawiki-config] - https://gerrit.wikimedia.org/r/741115 (https://phabricator.wikimedia.org/T289224) (owner: Majavah)
[13:55:47]
(CR) Giuseppe Lavagetto: "Wait a sec, how large is the miscweb image?" [deployment-charts] - https://gerrit.wikimedia.org/r/741124 (owner: Jelto)
[13:56:23] (PS1) Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089)
[13:57:27] (PS1) Ayounsi: Set flat partman receipe for all ping hosts [puppet] - https://gerrit.wikimedia.org/r/741918 (https://phabricator.wikimedia.org/T295767)
[13:57:50] (PS2) Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089)
[13:58:02] (Merged) jenkins-bot: miscweb: fix whitespace for affinity [deployment-charts] - https://gerrit.wikimedia.org/r/741149 (owner: Jelto)
[13:58:47] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32651/console" [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:59:32] (CR) Ayounsi: [C: +2] Set flat partman receipe for all ping hosts [puppet] - https://gerrit.wikimedia.org/r/741918 (https://phabricator.wikimedia.org/T295767) (owner: Ayounsi)
[13:59:59] (CR) jerkins-bot: [V: -1] P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[14:00:13] (PS9) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[14:00:49] (CR) Majavah: P:cache::kafka::Webrequest: use cert defined in P:certificates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[14:02:14] (CR) Jelto: helmfile.d:miscweb add node affinity to ssd nodes
(1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/741124 (owner: Jelto)
[14:02:16] (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[14:02:27] (PS3) Jelto: helmfile.d:miscweb add node affinity to ssd nodes [deployment-charts] - https://gerrit.wikimedia.org/r/741124
[14:09:11] (PS1) Filippo Giunchedi: rancid: add ability to disable emails [puppet] - https://gerrit.wikimedia.org/r/741919
[14:10:43] (CR) jerkins-bot: [V: -1] rancid: add ability to disable emails [puppet] - https://gerrit.wikimedia.org/r/741919 (owner: Filippo Giunchedi)
[14:11:22] ORLY?
[14:12:20] looks like an unrelated failure
[14:12:22] 15:10:23 error during compilation: Evaluation Error: Error while evaluating a Function Call, node codename does not meet requirement `stretch >= buster` (file: /srv/workspace/puppet/modules/debian/functions/codename/require.pp, line: 22, column: 9) on node 89b14fc12ee3.integration.eqiad.wmflabs
[14:12:53] (CR) Filippo Giunchedi: "CI failure is unrelated" [puppet] - https://gerrit.wikimedia.org/r/741919 (owner: Filippo Giunchedi)
[14:12:55] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[14:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:18] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[14:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:16] !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[14:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:26] (PS10) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[14:18:28] (PS1) Jbond: P:wmcs::backy2: fix spec tests [puppet] - https://gerrit.wikimedia.org/r/741920
[14:18:48] (CR) Jbond: [V: +2 C: +2] "FYI" [puppet] - https://gerrit.wikimedia.org/r/741920 (owner: Jbond)
[14:19:19] (PS4) Jbond: profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[14:19:26] (PS3) Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089)
[14:19:50] (PS1) Kormat: Initial structure and configs. [software/wmfdb] - https://gerrit.wikimedia.org/r/741921
[14:19:55] (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[14:20:48] (CR) Kormat: [V: +2 C: +2] Initial structure and configs. [software/wmfdb] - https://gerrit.wikimedia.org/r/741921 (owner: Kormat)
[14:21:29] jbond: thanks for the puppet rspec CI fixes
[14:23:03] SRE, Patch-For-Review: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (jbond) >>! In T296089#7522709, @elukey wrote: > I am wondering what is best to do for use cases like: > > * https://gerrit.wikimedia.org/r/c/operations/puppet/+/739463 (not merged yet) > * h...
[14:24:55] (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[14:24:59] (PS4) Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089)
[14:25:08] !log uncordoned kubestage1003.eqiad.wmnet kubestage1004.eqiad.wmnet - T293729
[14:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:13] T293729: setup/install kubestage100[34] - https://phabricator.wikimedia.org/T293729
[14:25:19] (CR) Jbond: "updated thanks" [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[14:28:58] (CR) Giuseppe Lavagetto: [C: -1] "Remove debian/watch and fix the distribution in the changelog; otherwise lgtm." [docker-images/imagecatalog] (debian) - https://gerrit.wikimedia.org/r/738500 (owner: RLazarus)
[14:30:03] (CR) Jbond: WIP cli: add --fail-fast flag and behavior (3 comments) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: David Caro)
[14:38:13] (PS1) Ayounsi: Move ping offload to new ping VMs [homer/public] - https://gerrit.wikimedia.org/r/741923 (https://phabricator.wikimedia.org/T295767)
[14:38:40] PROBLEM - Check size of conntrack table on ping3002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.20.0.8: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[14:39:27] (CR) Jbond: [C: +2] interfaces: remove ethtool configueration [puppet] - https://gerrit.wikimedia.org/r/662699 (https://phabricator.wikimedia.org/T236208) (owner: Jbond)
[14:39:42] RECOVERY - Check size of conntrack table on ping3002 is OK: OK: nf_conntrack is 0 % full
https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[14:40:42] (PS1) Klausman: Add inference codfw service record [dns] - https://gerrit.wikimedia.org/r/741924 (https://phabricator.wikimedia.org/T289835)
[14:42:16] (CR) Ayounsi: [C: +2] Move ping offload to new ping VMs [homer/public] - https://gerrit.wikimedia.org/r/741923 (https://phabricator.wikimedia.org/T295767) (owner: Ayounsi)
[14:42:53] !log Update ping redirect to point to new ping VMs - T295767
[14:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:57] (Merged) jenkins-bot: Move ping offload to new ping VMs [homer/public] - https://gerrit.wikimedia.org/r/741923 (https://phabricator.wikimedia.org/T295767) (owner: Ayounsi)
[14:42:57] T295767: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767
[14:43:40] (CR) Elukey: Add inference codfw service record (1 comment) [dns] - https://gerrit.wikimedia.org/r/741924 (https://phabricator.wikimedia.org/T289835) (owner: Klausman)
[14:43:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1147 (T296143)', diff saved to https://phabricator.wikimedia.org/P17849 and previous config saved to /var/cache/conftool/dbconfig/20211125-144344-ladsgroup.json
[14:43:48] (PS5) Jbond: (WIP) interface: try to update the numa integrations [puppet] - https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208)
[14:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:49] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[14:44:36] (CR) jerkins-bot: [V: -1] (WIP) interface: try to update the numa integrations [puppet] - https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208) (owner: Jbond)
[14:45:14] (PS2) Klausman: Add inference codfw service record [dns] - https://gerrit.wikimedia.org/r/741924
(https://phabricator.wikimedia.org/T289835)
[14:45:22] (CR) Klausman: Add inference codfw service record (1 comment) [dns] - https://gerrit.wikimedia.org/r/741924 (https://phabricator.wikimedia.org/T289835) (owner: Klausman)
[14:47:05] SRE-tools, Infrastructure-Foundations, User-jbond: Create base cook book for rebooting/restarting servers/daemons - https://phabricator.wikimedia.org/T284079 (jbond) Open→Resolved
[14:47:30] (CR) Elukey: [C: +1] Add inference codfw service record [dns] - https://gerrit.wikimedia.org/r/741924 (https://phabricator.wikimedia.org/T289835) (owner: Klausman)
[14:48:15] SRE, Infrastructure-Foundations, CAS-SSO, Patch-For-Review, User-jbond: Cookbook for centralised logouts and session status queries - https://phabricator.wikimedia.org/T283242 (jbond) Open→Resolved a:jbond this has now been implmented
[14:49:42] (CR) Vgutierrez: [C: +1] Add inference codfw service record [dns] - https://gerrit.wikimedia.org/r/741924 (https://phabricator.wikimedia.org/T289835) (owner: Klausman)
[14:49:50] (CR) Klausman: [C: +2] Add inference codfw service record [dns] - https://gerrit.wikimedia.org/r/741924 (https://phabricator.wikimedia.org/T289835) (owner: Klausman)
[14:52:45] (PS1) Klausman: conftool-data: add new inference discovery service in codfw [puppet] - https://gerrit.wikimedia.org/r/741925 (https://phabricator.wikimedia.org/T289835)
[14:53:05] SRE, Infrastructure-Foundations, Security-Team, CAS-SSO, User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (jbond)
[14:53:19] (CR) Filippo Giunchedi: "recheck" [puppet] - https://gerrit.wikimedia.org/r/741919 (owner: Filippo Giunchedi)
[14:53:54] SRE, Infrastructure-Foundations, CAS-SSO, Patch-For-Review, User-jbond: Investigate how automated tasks can authenticate against CAS - https://phabricator.wikimedia.org/T239323 (jbond) Open→Resolved a:jbond Resolving
this ultimatly we have decided that we will bypass SSO for autom...
[14:54:25] (CR) Elukey: [C: +1] conftool-data: add new inference discovery service in codfw [puppet] - https://gerrit.wikimedia.org/r/741925 (https://phabricator.wikimedia.org/T289835) (owner: Klausman)
[14:54:30] !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts ping2001.codfw.wmnet
[14:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:59] Puppet, Infrastructure-Foundations, User-jbond: Fix Puppet CA expired certs - https://phabricator.wikimedia.org/T286229 (jbond) Open→Resolved
[14:57:24] (CR) Klausman: [C: +2] conftool-data: add new inference discovery service in codfw [puppet] - https://gerrit.wikimedia.org/r/741925 (https://phabricator.wikimedia.org/T289835) (owner: Klausman)
[14:57:50] (PS1) Ayounsi: Remove old ping servers from DHCP [puppet] - https://gerrit.wikimedia.org/r/741926 (https://phabricator.wikimedia.org/T295767)
[14:58:02] (CR) Giuseppe Lavagetto: sre.discovery: use CNAME records for swift dns lookup (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/730692 (owner: Giuseppe Lavagetto)
[14:58:47] (CR) Ayounsi: [C: +2] Remove old ping servers from DHCP [puppet] - https://gerrit.wikimedia.org/r/741926 (https://phabricator.wikimedia.org/T295767) (owner: Ayounsi)
[14:58:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1147 (T296143)', diff saved to https://phabricator.wikimedia.org/P17850 and previous config saved to /var/cache/conftool/dbconfig/20211125-145849-ladsgroup.json
[14:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:53] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[15:01:13] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32652/console" [puppet] -
https://gerrit.wikimedia.org/r/741919 (owner: Filippo Giunchedi)
[15:04:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ping2001.codfw.wmnet
[15:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:31] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `ping2001.codfw.wmnet` - ping2001.codfw.wmnet (**PASS**) - Dow...
[15:05:13] !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts ping3001.esams.wmnet
[15:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:38] SRE, Traffic, Wikimedia Enterprise, Wikimedia Enterprise Discussion: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (MNadrofsky) Adding to the Foundational Tech Requests board for Steering Committee intake. This will help us prioritize/resource this work effectively.
[15:10:28] SRE, Foundational Technology Requests, Traffic, Wikimedia Enterprise, Wikimedia Enterprise Discussion: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (MNadrofsky) a:MNadrofsky
[15:12:47] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ping3001.esams.wmnet
[15:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:57] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `ping3001.esams.wmnet` - ping3001.esams.wmnet (**PASS**) - Dow...
[15:13:14] !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts ping1001.eqiad.wmnet
[15:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1147 (T296143)', diff saved to https://phabricator.wikimedia.org/P17851 and previous config saved to /var/cache/conftool/dbconfig/20211125-151354-ladsgroup.json
[15:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:58] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[15:19:44] !log klausman@cumin1001 conftool action : set/pooled=yes:weight=1; selector: cluster=ml_serve,service=kubesvc
[15:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:31] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ping1001.eqiad.wmnet
[15:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:41] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `ping1001.eqiad.wmnet` - ping1001.eqiad.wmnet (**PASS**) - Dow...
[15:28:20] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (ayounsi) a:ayounsi All 3 VMs got rebuilt with larger disks, but with the default Debian Buster. @MoritzMuehlenhoff let me know if they need to be re-rebu...
[15:28:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1147 (T296143)', diff saved to https://phabricator.wikimedia.org/P17852 and previous config saved to /var/cache/conftool/dbconfig/20211125-152858-ladsgroup.json
[15:29:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1148.eqiad.wmnet with reason: Maintenance T296143
[15:29:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1148.eqiad.wmnet with reason: Maintenance T296143
[15:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:03] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[15:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T296143)', diff saved to https://phabricator.wikimedia.org/P17853 and previous config saved to /var/cache/conftool/dbconfig/20211125-152906-ladsgroup.json
[15:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:19] (PS1) Klausman: role::ml_k8s::worker: Activate LVS config for inference in codfw [puppet] - https://gerrit.wikimedia.org/r/741934 (https://phabricator.wikimedia.org/T289835)
[15:38:10] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32653/console" [puppet] - https://gerrit.wikimedia.org/r/741934 (https://phabricator.wikimedia.org/T289835) (owner: Klausman)
[15:38:38] (CR) Vgutierrez: [V: +1 C: +1] role::ml_k8s::worker: Activate LVS config for inference in codfw [puppet] - https://gerrit.wikimedia.org/r/741934 (https://phabricator.wikimedia.org/T289835) (owner: Klausman)
[15:39:52] (CR) Klausman: [C: +2] role::ml_k8s::worker: Activate LVS config for inference
in codfw [puppet] - https://gerrit.wikimedia.org/r/741934 (https://phabricator.wikimedia.org/T289835) (owner: Klausman)
[15:47:19] !log reenable gtid on db1163
[15:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:30] !loh restarting pybal on lvs2010 T289835
[15:52:30] T289835: Create a LB service for inference.discovery.wmnet - https://phabricator.wikimedia.org/T289835
[15:55:38] !log jynus@cumin1001 dbctl commit (dc=all): 'Slowly repool db1163', diff saved to https://phabricator.wikimedia.org/P17856 and previous config saved to /var/cache/conftool/dbconfig/20211125-155538-jynus.json
[15:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:03] !log restarting pybal on lvs2010 - T289835
[15:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:09] (PS1) Volans: Update to v2.10.4-wmf6 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/741936
[16:10:19] !log restarting pybal on lvs2009 T289835
[16:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:23] T289835: Create a LB service for inference.discovery.wmnet - https://phabricator.wikimedia.org/T289835
[16:11:56] (PS1) Hnowlan: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956)
[16:14:05] !log jynus@cumin1001 dbctl commit (dc=all): 'Slowly repool db1163+', diff saved to https://phabricator.wikimedia.org/P17859 and previous config saved to /var/cache/conftool/dbconfig/20211125-161404-jynus.json
[16:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:53] hnowlan: <3 <3 <3 <3 <3 <3
[16:15:29] (CR) Ayounsi: [C: +1] Update to v2.10.4-wmf6 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/741936 (owner: Volans)
[16:15:52] (CR) jerkins-bot: [V: -1] api-gateway: allow discovery
services to set custom rate limits [deployment-charts] - https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: Hnowlan)
[16:15:57] lmao ;_;
[16:16:17] elukey: tbh I am not 100% certain this is the right approach, I suspect petr will put me right though :)
[16:18:34] !log jynus@cumin1001 dbctl commit (dc=all): 'Slowly repool db1163++', diff saved to https://phabricator.wikimedia.org/P17860 and previous config saved to /var/cache/conftool/dbconfig/20211125-161833-jynus.json
[16:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:08] PROBLEM - puppet last run on ms-backup1001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:25:09] RECOVERY - puppet last run on ms-backup1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:26:38] (PS1) Elukey: Add discovery record support for the inference LVS [dns] - https://gerrit.wikimedia.org/r/741939 (https://phabricator.wikimedia.org/T289835)
[16:29:32] (PS1) Elukey: service::catalog: set inference as active-active [puppet] - https://gerrit.wikimedia.org/r/741940 (https://phabricator.wikimedia.org/T289835)
[16:29:33] PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:30:03] (PS2) Elukey: service::catalog: set inference as active-active [puppet] - https://gerrit.wikimedia.org/r/741940 (https://phabricator.wikimedia.org/T289835)
[16:31:50] (CR) Volans: [V: +2 C: +2] Update to v2.10.4-wmf6 [software/netbox-deploy] - https://gerrit.wikimedia.org/r/741936 (owner: Volans)
[16:32:28] (CR) Hnowlan: [C: -1] "This is the wrong approach in terms of syntax - _ratelimit.yaml needs to be a generic template and we need to write n+1
configs where n is" [deployment-charts] - https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: Hnowlan)
[16:32:51] (CR) Klausman: [C: +1] service::catalog: set inference as active-active [puppet] - https://gerrit.wikimedia.org/r/741940 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[16:32:57] (CR) Klausman: [C: +1] Add discovery record support for the inference LVS [dns] - https://gerrit.wikimedia.org/r/741939 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[16:41:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1148 (T296143)', diff saved to https://phabricator.wikimedia.org/P17861 and previous config saved to /var/cache/conftool/dbconfig/20211125-164153-ladsgroup.json
[16:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:59] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[16:45:53] !log volans@deploy1002 Started deploy [netbox/deploy@87a36a7]: Test v2.10.4-wmf6 on netbox-next
[16:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:58] !log volans@deploy1002 Finished deploy [netbox/deploy@87a36a7]: Test v2.10.4-wmf6 on netbox-next (duration: 01m 04s)
[16:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:41] !log jynus@cumin1001 dbctl commit (dc=all): 'Fully repool db1163', diff saved to https://phabricator.wikimedia.org/P17862 and previous config saved to /var/cache/conftool/dbconfig/20211125-164941-jynus.json
[16:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:54] !log volans@deploy1002 Started deploy [netbox/deploy@87a36a7]: Deploy v2.10.4-wmf6
[16:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:50] (CR) Vgutierrez: [C: +1] Add discovery record support for the inference LVS [dns] -
https://gerrit.wikimedia.org/r/741939 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[16:56:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1148 (T296143)', diff saved to https://phabricator.wikimedia.org/P17863 and previous config saved to /var/cache/conftool/dbconfig/20211125-165657-ladsgroup.json
[16:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:03] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[16:57:54] !log volans@deploy1002 Finished deploy [netbox/deploy@87a36a7]: Deploy v2.10.4-wmf6 (duration: 06m 59s)
[16:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:25] (CR) Elukey: [C: +2] service::catalog: set inference as active-active [puppet] - https://gerrit.wikimedia.org/r/741940 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[17:05:24] (PS3) David Caro: WIP cli: add --fail-fast flag and behavior [software/puppet-compiler] - https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028)
[17:06:43] (CR) jerkins-bot: [V: -1] WIP cli: add --fail-fast flag and behavior [software/puppet-compiler] - https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: David Caro)
[17:09:37] (CR) Elukey: [C: +2] Add discovery record support for the inference LVS [dns] - https://gerrit.wikimedia.org/r/741939 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[17:12:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1148 (T296143)', diff saved to https://phabricator.wikimedia.org/P17864 and previous config saved to /var/cache/conftool/dbconfig/20211125-171202-ladsgroup.json
[17:12:03] !log elukey@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=inference
[17:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:07] T296143: Optimize
commonswiki image table - https://phabricator.wikimedia.org/T296143
[17:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:05] (PS4) Jbond: puppetmaster - hiera: order site after role [puppet] - https://gerrit.wikimedia.org/r/740141 (owner: Arturo Borrero Gonzalez)
[17:16:17] (PS5) Jbond: puppetmaster - hiera: order site after role [puppet] - https://gerrit.wikimedia.org/r/740141 (owner: Arturo Borrero Gonzalez)
[17:16:26] (CR) Jbond: puppetmaster - hiera: order site after role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/740141 (owner: Arturo Borrero Gonzalez)
[17:17:38] (PS1) Elukey: Revert "Add discovery record support for the inference LVS" [dns] - https://gerrit.wikimedia.org/r/741904
[17:18:35] (CR) Elukey: "This led to an error: https://phabricator.wikimedia.org/P17865" [puppet] - https://gerrit.wikimedia.org/r/741940 (https://phabricator.wikimedia.org/T289835) (owner: Elukey)
[17:18:45] (CR) Elukey: [C: +2] Revert "Add discovery record support for the inference LVS" [dns] - https://gerrit.wikimedia.org/r/741904 (owner: Elukey)
[17:20:31] PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:27:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1148 (T296143)', diff saved to https://phabricator.wikimedia.org/P17866 and previous config saved to /var/cache/conftool/dbconfig/20211125-172707-ladsgroup.json
[17:27:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1149.eqiad.wmnet with reason: Maintenance T296143
[17:27:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1149.eqiad.wmnet with reason: Maintenance T296143
[17:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:13] T296143: Optimize commonswiki image
table - https://phabricator.wikimedia.org/T296143
[17:27:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T296143)', diff saved to https://phabricator.wikimedia.org/P17867 and previous config saved to /var/cache/conftool/dbconfig/20211125-172714-ladsgroup.json
[17:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:45] RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:30:35] RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:35:07] PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:13:46] SRE, MediaWiki-Core-Snapshots, Wikimedia-Site-requests: Transwiki import not working in production - https://phabricator.wikimedia.org/T140206 (Stang)
[18:21:27] RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:34:09] PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:43:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1149 (T296143)', diff saved to https://phabricator.wikimedia.org/P17868 and previous config saved to /var/cache/conftool/dbconfig/20211125-184336-ladsgroup.json
[18:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:42] T296143: Optimize
commonswiki image table - https://phabricator.wikimedia.org/T296143
[18:46:29] (CR) Arturo Borrero Gonzalez: "Your initial PCC runs were basically NOOP. How do you feel about merging this?" [puppet] - https://gerrit.wikimedia.org/r/740141 (owner: Arturo Borrero Gonzalez)
[18:48:45] (CR) Arturo Borrero Gonzalez: "good catch. Sorry about that." [puppet] - https://gerrit.wikimedia.org/r/741920 (owner: Jbond)
[18:51:41] (CR) Jbond: puppetmaster - hiera: order site after role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/740141 (owner: Arturo Borrero Gonzalez)
[18:58:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1149 (T296143)', diff saved to https://phabricator.wikimedia.org/P17869 and previous config saved to /var/cache/conftool/dbconfig/20211125-185841-ladsgroup.json
[18:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:45] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[19:13:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1149 (T296143)', diff saved to https://phabricator.wikimedia.org/P17870 and previous config saved to /var/cache/conftool/dbconfig/20211125-191345-ladsgroup.json
[19:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:51] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[19:28:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1149 (T296143)', diff saved to https://phabricator.wikimedia.org/P17871 and previous config saved to /var/cache/conftool/dbconfig/20211125-192850-ladsgroup.json
[19:28:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance T296143
[19:28:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance T296143
[19:28:54] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:55] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[19:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:23:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:43:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1160.eqiad.wmnet with reason: Maintenance T296143
[20:43:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1160.eqiad.wmnet with reason: Maintenance T296143
[20:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:56] T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[20:43:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T296143)', diff saved to https://phabricator.wikimedia.org/P17872 and previous config saved to /var/cache/conftool/dbconfig/20211125-204357-ladsgroup.json
[20:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:40] (PS1) 4nn1l2: Add templateeditor group and protection level at viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/741972 (https://phabricator.wikimedia.org/T296154)
[22:07:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL:
job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:09:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:14:40] Do we have another branch called refs/for/master besides master?
[22:15:06] Here I mean: https://phabricator.wikimedia.org/source/mediawiki-config/branches/master/
[22:16:33] I want to know why I should use "git push origin HEAD:refs/for/master" instead of "git push origin HEAD:master" when pushing commits.
[22:20:11] Because gerrit
[22:20:22] I know we should have that because only sysadmins should be allowed to push to the *original* master branch, and mere volunteers such as me should push somewhere else, but why can't I see that "experimental" branch?
[22:20:29] You're not pushing to the branch, you're pushing to basically a review queue
[22:21:33] Thanks Reedy
[23:29:28] (PS1) 4nn1l2: Add planet4589.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/741980
[23:38:13] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[23:39:00] (PS2) 4nn1l2: Add planet4589.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/741980 (https://phabricator.wikimedia.org/T296136)
[23:40:21] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[23:49:28] How do you make
jenkins-bot to test the pushed patch?
[23:50:01] For example, compare https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/741097 with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/741980