[00:00:05] <jouncebot>	 RoanKattouw and Urbanecm: #bothumor I � Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211125T0000).
[00:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[00:00:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:10:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:12:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:14:56] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 91.9% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[00:19:08] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[00:20:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:22:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:29:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:31:53] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:35:49] <ryankemper>	 mutante: back around now, thanks for restarting that blazegraph instance! will also take a look at the docs and see if there's some more context I can add for the future
[00:37:04] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2066.codfw.wmnet with OS buster
[00:37:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:10] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2066.codfw.wmnet with OS buster comp...
[00:39:47] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:44:33] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:50:11] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:52:36] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:00:05] <jouncebot>	 twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211125T0100).
[01:04:14] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2067.codfw.wmnet with OS buster
[01:04:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:20] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2067.codfw.wmnet with OS buster
[01:05:08] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (Papaul)
[01:13:02] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:18:54] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:19:44] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2068.codfw.wmnet with OS buster
[01:19:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:49] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2068.codfw.wmnet with OS buster
[01:20:06] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:23:10] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[01:28:06] <icinga-wm>	 PROBLEM - cassandra CQL 10.64.16.27:9042 on maps1008 is CRITICAL: connect to address 10.64.16.27 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[01:29:00] <icinga-wm>	 PROBLEM - Check systemd state on maps1008 is CRITICAL: CRITICAL - degraded: The following units failed: cassandra.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:29:30] <icinga-wm>	 PROBLEM - cassandra service on maps1008 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:34:01] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2067.codfw.wmnet with OS buster
[01:34:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:07] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2067.codfw.wmnet with OS buster comp...
[01:36:31] <icinga-wm>	 RECOVERY - Check systemd state on maps1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:46] <icinga-wm>	 RECOVERY - cassandra CQL 10.64.16.27:9042 on maps1008 is OK: TCP OK - 0.000 second response time on 10.64.16.27 port 9042 https://phabricator.wikimedia.org/T93886
[01:37:14] <icinga-wm>	 RECOVERY - cassandra service on maps1008 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:49:49] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:49:55] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2068.codfw.wmnet with OS buster
[01:49:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:00] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2068.codfw.wmnet with OS buster comp...
[01:54:01] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2070.codfw.wmnet with OS buster
[01:54:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:06] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2070.codfw.wmnet with OS buster
[01:56:32] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Daimona - https://phabricator.wikimedia.org/T295993 (Daimona) As I said, if my wikimedia email needs to be in the puppet file, that's fine. I do prefer not to use my real name publicly, but I believe this particular instance to be acceptable (as in, not...
[02:04:53] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2071.codfw.wmnet with OS buster
[02:04:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:04:59] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2071.codfw.wmnet with OS buster
[02:05:09] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:08:45] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:14:03] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:16:11] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:16:49] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:18:19] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:19:27] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:23:29] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:23:39] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2070.codfw.wmnet with OS buster
[02:23:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:45] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2070.codfw.wmnet with OS buster comp...
[02:26:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:34:46] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2071.codfw.wmnet with OS buster
[02:34:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:34:51] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2071.codfw.wmnet with OS buster comp...
[02:36:30] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,swagger_check_citoid_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:38:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:39:12] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[02:42:16] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2072.codfw.wmnet with OS buster
[02:42:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:42:22] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host elastic2072.codfw.wmnet with OS buster
[02:44:42] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:07:15] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:12:24] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2072.codfw.wmnet with OS buster
[03:12:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:29] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host elastic2072.codfw.wmnet with OS buster comp...
[03:12:33] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.0% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:12:36] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (Papaul)
[03:17:22] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (Papaul) @RKemper @Gehel all the servers are ready to put in service but not elastic2069 for some reason i can not login to it so I will h...
[03:17:25] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:19:37] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:22:34] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:25:54] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2011 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:15] <wikibugs>	 SRE, SRE-Access-Requests, Community-Tech: Grant Access to PII in Superset for samwilson - https://phabricator.wikimedia.org/T296161 (Samwilson) hehe I was too impatient! :-) Thanks for the explanation.
[03:48:10] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.4% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:52:26] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:17:45] <wikibugs>	 SRE, ops-codfw, DC-Ops, Discovery-Search, Elasticsearch: Q2:(Need By: TBD) rack/setup/install elastic20[61-72] - https://phabricator.wikimedia.org/T294154 (RKemper) >>! In T294154#7528550, @Papaul wrote: > @RKemper @Gehel all the servers are ready to put in service but not elastic2069 for som...
[04:20:54] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:22:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:24:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:25:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:25:38] <ryankemper>	 !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.93`. Pre-deploy tests passing on canary `wdqs1003`
[04:25:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:51] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@29c5cd7]: 0.3.93
[04:25:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:27:01] <ryankemper>	 !log [WDQS Deploy] Tests passing following deploy of `0.3.93` on canary `wdqs1003`; proceeding to rest of fleet
[04:27:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:29:36] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:30:34] <ryankemper>	 !log [Elastic] Cleaning up dangling apt packages: `ryankemper@cumin1001:~$ sudo cumin -b 4 'elastic*' 'sudo apt autoremove -y'`
[04:30:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:34:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:35:14] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@29c5cd7]: 0.3.93 (duration: 09m 23s)
[04:35:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:35:44] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:38:46] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`
[04:38:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:38:49] <ryankemper>	 !log [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`
[04:38:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:39:01] <ryankemper>	 !log [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'`
[04:39:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:40:04] <logmsgbot>	 !log ryankemper@deploy1002 Started deploy [wdqs/wdqs@29c5cd7] (wcqs): Deploy 0.3.93 to WCQS
[04:40:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:43:57] <ryankemper>	 !log [WCQS Deploy] Tests look good following deploy of `0.3.93` to canary `wcqs1002.eqiad.wmnet`, proceeding to rest of fleet
[04:43:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:45:32] <logmsgbot>	 !log ryankemper@deploy1002 Finished deploy [wdqs/wdqs@29c5cd7] (wcqs): Deploy 0.3.93 to WCQS (duration: 05m 27s)
[04:45:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:38] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:33:20] <icinga-wm>	 PROBLEM - MariaDB memory on db1115 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (11742) = 92.1% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[05:36:48] <icinga-wm>	 RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:51:55] <ryankemper>	 !log [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there are no relevant criticals in Icinga, and Grafana looks good
[05:51:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:07] <wikibugs>	 (PS4) Majavah: P::doc: sync data to non-active servers [puppet] - https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653)
[06:16:00] <wikibugs>	 (CR) Majavah: P::doc: sync data to non-active servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653) (owner: Majavah)
[06:21:54] <wikibugs>	 (PS5) Majavah: P::doc: sync data to non-active servers [puppet] - https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653)
[06:30:19] <wikibugs>	 (PS6) Majavah: P::doc: sync data to non-active servers [puppet] - https://gerrit.wikimedia.org/r/741713 (https://phabricator.wikimedia.org/T247653)
[06:31:26] <marostegui>	 !log Restart tendril's DB 
[06:31:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:44] <icinga-wm>	 RECOVERY - MariaDB memory on db1115 is OK: OK Memory 59% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[06:57:22] <jelto>	 Just a short reminder: we will start re-deploying services in the eqiad Kubernetes cluster soon. Feel free to ping me any time.
[06:58:46] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:05:22] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:09:49] <jelto>	 !log start re-deploy procedure in eqiad Kubernetes T251305
[07:09:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:56] <stashbot>	 T251305: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305
[07:10:00] <jelto>	 !log downtime PyBal backends health check on lvs1015 and lvs1016 for helm3 re-deploy T251305. I'm keeping an eye on Icinga and will remove the downtime as soon as I'm finished
[07:10:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:09] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "Unless you also submit a patch to add php-yaml to the php7.X-fpm-multiversion-base images, this can't be merged." [puppet] - https://gerrit.wikimedia.org/r/740927 (https://phabricator.wikimedia.org/T296331) (owner: Dduvall)
[07:17:23] <logmsgbot>	 !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 32 hosts with reason: helm3 de-deploy T251305
[07:17:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:27] <stashbot>	 T251305: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305
[07:17:46] <logmsgbot>	 !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 32 hosts with reason: helm3 de-deploy T251305
[07:17:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:43] <logmsgbot>	 !log jelto@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=(apertium|api-gateway|apple-search|blubberoid|citoid|cxserver|echostore|eventgate-analytics|eventgate-analytics-external|eventgate-logging-external|eventstreams|eventstreams-internal|linkrecommendation|mathoid|mobileapps|proton|push-notifications|recommendation-api|sessionstore|shellbox|shellbox-constraints|shellbox-media|shellbox-syntaxhighlight|shellbox-timeline|similar-users|tegola-vector-tiles|termbox|wikifeeds|zotero)
[07:20:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:47] <wikibugs>	 (PS1) Marostegui: db1128: Move it to test-s1 [puppet] - https://gerrit.wikimedia.org/r/741754 (https://phabricator.wikimedia.org/T295965)
[07:23:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] db1128: Move it to test-s1 [puppet] - https://gerrit.wikimedia.org/r/741754 (https://phabricator.wikimedia.org/T295965) (owner: Marostegui)
[07:23:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1145.eqiad.wmnet with reason: Maintenance T296143
[07:23:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1145.eqiad.wmnet with reason: Maintenance T296143
[07:23:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:46] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[07:23:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:36] <wikibugs>	 (PS2) Marostegui: db1128: Move it to test-s1 [puppet] - https://gerrit.wikimedia.org/r/741754 (https://phabricator.wikimedia.org/T295965)
[07:26:44] <wikibugs>	 (CR) Marostegui: [C: +2] db1128: Move it to test-s1 [puppet] - https://gerrit.wikimedia.org/r/741754 (https://phabricator.wikimedia.org/T295965) (owner: Marostegui)
[07:27:29] <Amir1>	 marostegui: I'm running the schema change on db1145:3314 without depooling, because it's not pooled. Is that correct? https://noc.wikimedia.org/dbconfig/eqiad.json
[07:27:51] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1128.eqiad.wmnet with OS bullseye
[07:27:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:10] <wikibugs>	 SRE, RESTBase-Cassandra: Restbase/Cassandra TLS cert expiration warnings - https://phabricator.wikimedia.org/T296448 (elukey)
[07:28:37] <marostegui>	 Amir1: yeah, it is a backup source
[07:28:54] <Amir1>	 cool
[07:29:53] <elukey_>	 !log elukey@mwdebug2002:~$ sudo systemctl reset-failed ifup@ens5.service
[07:29:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:20] <icinga-wm>	 RECOVERY - Check systemd state on mwdebug2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:32:51] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[07:32:51] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
[07:32:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:18] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 2521 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:35:21] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[07:35:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_eventgate_main_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:37:32] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:38:06] <jayme>	 remaining active connections to eventgate, potentially
[07:38:20] <jelto>	 i forgot to depool eventgate-main in my list. Is this a big problem? So it was pooled during the re-deploy and some requests hit the redeploy :/
[07:38:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:39:05] <jayme>	 jelto: ah...shit. We should double check the list then
[07:39:35] <jayme>	 but for now, no longer an issue I guess as it is available again now
[07:45:31] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'production' .
[07:45:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:22] <jayme>	 !log elevated MediaWiki exceptions and fatals (from ~07:35) due to a mistake during re-deploy of eventgate-main
[07:47:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:16] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
[07:48:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:14] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
[07:49:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:43] <marostegui>	 !log Stop mysql on db1133 to clone db1128 as a test host T295965
[07:49:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:47] <stashbot>	 T295965: Test MariaDB 10.4 with Bullseye - https://phabricator.wikimedia.org/T295965
[07:51:21] <logmsgbot>	 !log jelto@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=(echostore|sessionstore)
[07:51:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:15] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1128.eqiad.wmnet with OS bullseye
[07:53:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:13] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=mysql-test site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:56:32] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' .
[07:56:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:31] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' .
[07:57:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:02] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[08:00:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211125T0800)
[08:02:05] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apple-search' for release 'main' .
[08:02:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:15] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apple-search' for release 'main' .
[08:03:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:45] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[08:05:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:14] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[08:08:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:05] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[08:09:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:27] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[08:13:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:40] <wikibugs>	 (PS1) Elukey: kserve: fix a typo in the inference service config-map [deployment-charts] - https://gerrit.wikimedia.org/r/741844
[08:14:06] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[08:14:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:19] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' .
[08:17:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:23] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' .
[08:18:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:44] <wikibugs>	 (CR) Elukey: [C: +2] kserve: fix a typo in the inference service config-map [deployment-charts] - https://gerrit.wikimedia.org/r/741844 (owner: Elukey)
[08:21:27] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[08:21:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:32] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[08:21:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:47] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[08:21:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[08:22:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:35] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[08:22:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:21] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 79 probes of 640 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:25:20] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[08:25:20] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
[08:25:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:41] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
[08:28:41] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[08:28:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:33] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 44 probes of 640 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:31:14] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[08:31:14] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[08:31:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:38] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
[08:34:38] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
[08:34:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:07] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
[08:37:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143
[08:39:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143
[08:39:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:00] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[08:40:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:03] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'internal' .
[08:40:03] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[08:40:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143
[08:41:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143
[08:41:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143
[08:42:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143
[08:42:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:51] <Amir1>	 sorry for too many downtimes, I'm debugging something
[08:43:02] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] alertmanager: Update address for perf-team alerts [puppet] - https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368) (owner: Krinkle)
[08:43:17] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.48.144:9042 on restbase2023 is CRITICAL: connect to address 10.192.48.144 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[08:43:21] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' .
[08:43:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143
[08:44:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143
[08:44:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:30] <Amir1>	 ignore these times
[08:44:35] <Amir1>	 *downtimes
[08:45:27] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.48.144:9042 on restbase2023 is OK: TCP OK - 0.033 second response time on 10.192.48.144 port 9042 https://phabricator.wikimedia.org/T93886
[08:46:02] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143
[08:46:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143
[08:46:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:06] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[08:46:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:44] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'miscweb' for release 'main' .
[08:47:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:29] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143
[08:48:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1146.eqiad.wmnet with reason: Maintenance T296143
[08:48:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17837 and previous config saved to /var/cache/conftool/dbconfig/20211125-084834-ladsgroup.json
[08:48:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:54] <Amir1>	 okay fixed now, this is supposed to be the last downtime
[08:50:30] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[08:50:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:34] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[08:51:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:47] <wikibugs>	 (CR) Filippo Giunchedi: alertmanager: Update address for perf-team alerts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/740963 (https://phabricator.wikimedia.org/T296368) (owner: Krinkle)
[08:58:43] <wikibugs>	 (CR) Volans: [V: +2 C: +2] netbox - cas: allow users with active=False [software/netbox] - https://gerrit.wikimedia.org/r/739309 (https://phabricator.wikimedia.org/T295148) (owner: Volans)
[08:59:57] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[08:59:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:01] <icinga-wm>	 PROBLEM - Host logstash2028.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[09:02:25] <wikibugs>	 (PS3) Vgutierrez: cache::haproxy: Disable PROXY protocol [puppet] - https://gerrit.wikimedia.org/r/741693 (https://phabricator.wikimedia.org/T290005)
[09:02:27] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' .
[09:02:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:00] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
[09:05:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:13] <icinga-wm>	 RECOVERY - Host logstash2028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.78 ms
[09:10:12] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' .
[09:10:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:13] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:16:44] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
[09:16:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:29] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox' for release 'main' .
[09:19:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:30] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-constraints' for release 'main' .
[09:21:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:36] <wikibugs>	 (PS1) Elukey: pontoon: add profile::base::certificates basic config [puppet] - https://gerrit.wikimedia.org/r/741847
[09:23:14] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-media' for release 'main' .
[09:23:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:00] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-media' for release 'main' .
[09:24:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:29] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:26:19] <wikibugs>	 SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (elukey) Another use case, brought up this morning, is Pontoon - we should try to keep consistency in there too, and all environments have their own puppet master CAs.
[09:27:03] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-syntaxhighlight' for release 'main' .
[09:27:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:20] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'shellbox-timeline' for release 'main' .
[09:29:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:00] <wikibugs>	 (PS2) Elukey: pontoon: add profile::base::certificates basic config [puppet] - https://gerrit.wikimedia.org/r/741847
[09:31:28] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[09:31:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:16] <wikibugs>	 (CR) Elukey: [C: +2] pontoon: add profile::base::certificates basic config [puppet] - https://gerrit.wikimedia.org/r/741847 (owner: Elukey)
[09:34:11] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' .
[09:34:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:15] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' .
[09:37:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:31] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'toolhub' for release 'main' .
[09:39:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:09] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[09:43:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:46] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' .
[09:45:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:46] <logmsgbot>	 !log jelto@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=(apertium|api-gateway|apple-search|blubberoid|citoid|cxserver|echostore|eventgate-analytics|eventgate-analytics-external|eventgate-logging-external|eventstreams|eventstreams-internal|linkrecommendation|mathoid|mobileapps|proton|push-notifications|recommendation-api|sessionstore|shellbox|shellbox-constraints|shellbox-media|shellbox-syntaxhighlight|shellbox-timeline|similar-users|tegola-vector-tiles|termbox|wikifeeds|zotero)
[09:55:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:57] <wikibugs>	 (PS1) David Caro: timesyncd: add package requirement [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456)
[09:58:28] <wikibugs>	 (PS1) Jbond: O:puppet_compiler::puppetdb: Add role for puppetdb compiler (WIP) [puppet] - https://gerrit.wikimedia.org/r/741850
[09:59:06] <wikibugs>	 (CR) jerkins-bot: [V: -1] O:puppet_compiler::puppetdb: Add role for puppetdb compiler (WIP) [puppet] - https://gerrit.wikimedia.org/r/741850 (owner: Jbond)
[10:02:52] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32626/console" [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: David Caro)
[10:02:55] <wikibugs>	 SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (JMeybohm) >>! In T296089#7527221, @elukey wrote: > A simplification would be to avoid the install check and create the pem bundle at build time as well, but there are probably some use cases that I don't have in m...
[10:05:38] <wikibugs>	 (CR) Jelto: [C: +2] hiera::role::common::deployment_server update helmBinary eqiad [puppet] - https://gerrit.wikimedia.org/r/741681 (https://phabricator.wikimedia.org/T251305) (owner: Jelto)
[10:07:56] <wikibugs>	 SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (elukey) >>! In T296089#7528885, @JMeybohm wrote: >>>! In T296089#7527221, @elukey wrote: >> If we had a way to generate multiple package from the same debian source (IIRC there should be the possibility), we could...
[10:18:20] <wikibugs>	 SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto) cc from ops list:   The re-deploy for all services in the eqiad Kubernetes cluster was successful. However this time we had an impact on service availability. Planned reduced serv...
[10:18:32] <wikibugs>	 SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[10:19:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1146:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17840 and previous config saved to /var/cache/conftool/dbconfig/20211125-101921-ladsgroup.json
[10:19:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:26] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[10:21:22] <wikibugs>	 SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (jbond) Just some early notes ill follow up more in a bit  > p12/jks bundles In this method we would still do the jks/p12 generation in puppet  > As described above, the wmf-certificates package checks in /etc/ca-c...
[10:24:59] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: add prometheus-02 (Bullseye instance) to o11y [puppet] - https://gerrit.wikimedia.org/r/741855
[10:25:01] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: fix blackbox-exporter config syntax for Bullseye [puppet] - https://gerrit.wikimedia.org/r/741856
[10:25:03] <wikibugs>	 (PS1) Filippo Giunchedi: pontoon: fix o11y stack for recent changes [puppet] - https://gerrit.wikimedia.org/r/741857
[10:25:11] <vgutierrez>	 !log rolling restart of varnish and HAProxy on cp2042.codfw.wmnet,cp1090.eqiad.wmnet,cp[5012].eqsin.wmnet,cp3065.esams.wmnet,cp[4026,4032].ulsfo.wmnet to disable PROXY protocol - T290005
[10:25:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:15] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[10:25:21] <wikibugs>	 (CR) Vgutierrez: [C: +2] cache::haproxy: Disable PROXY protocol [puppet] - https://gerrit.wikimedia.org/r/741693 (https://phabricator.wikimedia.org/T290005) (owner: Vgutierrez)
[10:25:39] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:27:33] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] pontoon: add prometheus-02 (Bullseye instance) to o11y [puppet] - https://gerrit.wikimedia.org/r/741855 (owner: Filippo Giunchedi)
[10:33:01] <godog>	 seeking soul(s) for a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/741856
[10:33:19] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] pontoon: fix o11y stack for recent changes [puppet] - https://gerrit.wikimedia.org/r/741857 (owner: Filippo Giunchedi)
[10:33:25] <wikibugs>	 (PS2) Filippo Giunchedi: pontoon: fix o11y stack for recent changes [puppet] - https://gerrit.wikimedia.org/r/741857
[10:34:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1146:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17841 and previous config saved to /var/cache/conftool/dbconfig/20211125-103425-ladsgroup.json
[10:34:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:30] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[10:37:49] <elukey>	 godog: LGTM but is it possible to have a pcc to confirm?
[10:37:54] <elukey>	 maybe buster vs bullseye
[10:39:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:40:13] <wikibugs>	 SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (Jelto)
[10:40:45] <godog>	 elukey: mhhh I don't have a bullseye prometheus host available to pcc yet I think, I can do buster though
[10:41:52] <elukey>	 ah yes yes ok
[10:41:56] <elukey>	 just a quick check
[10:41:59] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:42:53] <icinga-wm>	 PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:44:32] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32627/console" [puppet] - https://gerrit.wikimedia.org/r/741856 (owner: Filippo Giunchedi)
[10:44:44] <godog>	 elukey: SGTM, done ^
[10:49:07] <wikibugs>	 (CR) Elukey: [C: +1] prometheus: fix blackbox-exporter config syntax for Bullseye [puppet] - https://gerrit.wikimedia.org/r/741856 (owner: Filippo Giunchedi)
[10:49:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1146:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17842 and previous config saved to /var/cache/conftool/dbconfig/20211125-104930-ladsgroup.json
[10:49:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:35] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[10:52:20] <wikibugs>	 (CR) Filippo Giunchedi: [V: +1 C: +2] prometheus: fix blackbox-exporter config syntax for Bullseye [puppet] - https://gerrit.wikimedia.org/r/741856 (owner: Filippo Giunchedi)
[10:52:26] <godog>	 nice, thanks elukey 
[11:04:01] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:04:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1146:3314 (T296143)', diff saved to https://phabricator.wikimedia.org/P17843 and previous config saved to /var/cache/conftool/dbconfig/20211125-110435-ladsgroup.json
[11:04:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1147.eqiad.wmnet with reason: Maintenance T296143
[11:04:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1147.eqiad.wmnet with reason: Maintenance T296143
[11:04:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:40] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[11:04:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T296143)', diff saved to https://phabricator.wikimedia.org/P17844 and previous config saved to /var/cache/conftool/dbconfig/20211125-110443-ladsgroup.json
[11:04:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:51] <wikibugs>	 SRE: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (elukey) >>! In T296089#7528900, @jbond wrote: >> Another use case, brought up this morning, is Pontoon - we should try to keep consistency in there too,  > In relation to this, I want to say that imo having change...
[11:13:02] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] Add missing termbox codes from Wikibase (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: Mbch331)
[11:19:05] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:27:01] <icinga-wm>	 PROBLEM - SSH on bast5002 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:28:34] <wikibugs>	 (PS1) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[11:29:08] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[11:29:20] <wikibugs>	 (PS1) Hnowlan: cassandra: correct check notes URL [puppet] - https://gerrit.wikimedia.org/r/741868
[11:29:53] <wikibugs>	 SRE, RESTBase-Cassandra, Platform Team Workboards (Platform Engineering Reliability): Restbase/Cassandra TLS cert expiration warnings - https://phabricator.wikimedia.org/T296448 (hnowlan) a:hnowlan
[11:30:51] <wikibugs>	 (CR) David Caro: [V: +1 C: -1] "We have an issue here." [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: David Caro)
[11:31:27] <icinga-wm>	 RECOVERY - SSH on bast5002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:32:37] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:33:40] <wikibugs>	 SRE, Commons, Data-Persistence (Consultation), MediaWiki-extensions-WikibaseClient, and 4 others: Enable statement usage tracking on Commons and Co - https://phabricator.wikimedia.org/T188730 (Manuel)
[11:37:03] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:37:31] <icinga-wm>	 PROBLEM - Check systemd state on db1139 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:57] <icinga-wm>	 RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:49:15] <wikibugs>	 (PS2) David Caro: timesyncd: handle bullseye hosts [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456)
[11:51:17] <wikibugs>	 (PS2) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[11:51:56] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[11:55:13] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32630/console" [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: David Caro)
[11:56:02] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Reduce db1163 load', diff saved to https://phabricator.wikimedia.org/P17845 and previous config saved to /var/cache/conftool/dbconfig/20211125-115602-jynus.json
[11:56:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:13] <wikibugs>	 (CR) David Caro: [V: +1] "PCC just shows the notify changed (from a string to a list). Looks ok, will try to get a bullseye host." [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: David Caro)
[11:56:31] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase202[1-3].codfw.wmnet: Restarting for certificate updates - hnowlan@cumin1001
[11:56:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:49] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32631/console" [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: David Caro)
[11:59:13] <wikibugs>	 (PS3) David Caro: timesyncd: handle bullseye ntp hosts [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456)
[12:01:26] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] aptrepo: add ceph packages in the octopus/bullseye combo [puppet] - https://gerrit.wikimedia.org/r/741113 (https://phabricator.wikimedia.org/T296175) (owner: Arturo Borrero Gonzalez)
[12:03:46] <apergos>	 no deploys all day, this includes the backport window that would normally happen at this time. 
[12:04:09] <apergos>	 carry on!
[12:04:35] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Reduce db1163 load even more', diff saved to https://phabricator.wikimedia.org/P17846 and previous config saved to /var/cache/conftool/dbconfig/20211125-120435-jynus.json
[12:04:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:44] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:06:49] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: aptrepo: fix duplicate update name [puppet] - https://gerrit.wikimedia.org/r/741870 (https://phabricator.wikimedia.org/T296175)
[12:09:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:11:14] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] timesyncd: handle bullseye ntp hosts [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: David Caro)
[12:11:38] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Temp. depool db1163 fully', diff saved to https://phabricator.wikimedia.org/P17847 and previous config saved to /var/cache/conftool/dbconfig/20211125-121138-jynus.json
[12:11:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:43] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] aptrepo: fix duplicate update name [puppet] - https://gerrit.wikimedia.org/r/741870 (https://phabricator.wikimedia.org/T296175) (owner: Arturo Borrero Gonzalez)
[12:14:06] <jynus>	 !log disable temp. gtid on db1163
[12:14:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:38] <arturo>	 !log update repo bullseye-wikimedia/thirdparty/ceph-octopus (T296175)
[12:14:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:42] <stashbot>	 T296175: cloudcephosd1021 is using an old ceph version because its running debian bullseye instead of buster - https://phabricator.wikimedia.org/T296175
[12:20:06] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: ceph: common: support both buster & bullseye [puppet] - https://gerrit.wikimedia.org/r/741883 (https://phabricator.wikimedia.org/T296175)
[12:24:25] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:27:10] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +2] "PCC as expected: https://integration.wikimedia.org/ci/view/operations/job/operations-puppet-catalog-compiler/32633/console" [puppet] - https://gerrit.wikimedia.org/r/741883 (https://phabricator.wikimedia.org/T296175) (owner: Arturo Borrero Gonzalez)
[12:27:22] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +2 C: +2] ceph: common: support both buster & bullseye [puppet] - https://gerrit.wikimedia.org/r/741883 (https://phabricator.wikimedia.org/T296175) (owner: Arturo Borrero Gonzalez)
[12:27:50] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase202[1-3].codfw.wmnet: Restarting for certificate updates - hnowlan@cumin1001
[12:27:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:03] <icinga-wm>	 RECOVERY - Check systemd state on db1139 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:03] <wikibugs>	 (CR) Elukey: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/741868 (owner: Hnowlan)
[12:31:49] <wikibugs>	 (CR) Hnowlan: [C: +2] cassandra: correct check notes URL [puppet] - https://gerrit.wikimedia.org/r/741868 (owner: Hnowlan)
[12:32:38] <wikibugs>	 SRE, RESTBase-Cassandra, Platform Team Workboards (Platform Engineering Reliability): Restbase/Cassandra TLS cert expiration warnings - https://phabricator.wikimedia.org/T296448 (hnowlan) Open→Resolved
[12:32:48] <wikibugs>	 SRE, RESTBase-Cassandra, Platform Team Workboards (Platform Engineering Reliability): Restbase/Cassandra TLS cert expiration warnings - https://phabricator.wikimedia.org/T296448 (hnowlan) Thanks for reporting this !
[12:34:40] <elukey>	 hnowlan: o/ thanks for the new docs - IIUC in this use case we'd need to rm the keys from the private repo (for the three hosts) and then re-run the script to generate the new keys (and then commit and let puppet run etc..)
[12:35:09] <elukey>	 ah I see you already done it probably :D
[12:36:06] <hnowlan>	 elukey: yeah :) rm the files, run cassandra-ca-manager, commit, let puppet run and then do a roll-restart 
[12:36:47] <elukey>	 ack thanks :)
[12:48:16] <wikibugs>	 SRE, Patch-For-Review: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (jbond) Im not sure this is the place to have this discussion, perhaps we should fork to another task?   > I disagree with this John, Pontoon was a big effort to allow reusable testing environ...
[12:50:53] <wikibugs>	 (PS3) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[12:51:29] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[12:57:07] <wikibugs>	 (PS1) Ayounsi: Prepare site.pp for new ping VMs [puppet] - https://gerrit.wikimedia.org/r/741912 (https://phabricator.wikimedia.org/T295767)
[12:57:43] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738452 (owner: Jbond)
[12:59:10] <wikibugs>	 (PS2) David Caro: WIP cli: add --fail-fast flag and behavior [software/puppet-compiler] - https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028)
[12:59:12] <wikibugs>	 (CR) David Caro: WIP cli: add --fail-fast flag and behavior (16 comments) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: David Caro)
[12:59:36] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32636/console" [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: David Caro)
[12:59:45] <wikibugs>	 (CR) David Caro: "I have not yet fixed the tests, and have to run some tests locally, but mypy/flake8 pass now." [software/puppet-compiler] - https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: David Caro)
[13:01:09] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP cli: add --fail-fast flag and behavior [software/puppet-compiler] - https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: David Caro)
[13:05:17] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host ping3002.esams.wmnet
[13:05:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:05:55] <wikibugs>	 (CR) Ayounsi: [C: +2] Prepare site.pp for new ping VMs [puppet] - https://gerrit.wikimedia.org/r/741912 (https://phabricator.wikimedia.org/T295767) (owner: Ayounsi)
[13:06:45] <wikibugs>	 (CR) Jbond: timesyncd: handle bullseye ntp hosts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: David Caro)
[13:07:43] <wikibugs>	 (CR) Jbond: [C: +2] Add Typing: And fix other minor lint issues [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738452 (owner: Jbond)
[13:09:36] <wikibugs>	 (Merged) jenkins-bot: Add Typing: And fix other minor lint issues [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738452 (owner: Jbond)
[13:10:06] <wikibugs>	 (PS4) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[13:10:41] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:14:01] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ping3002.esams.wmnet
[13:14:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:28] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32642/console" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: David Caro)
[13:20:31] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host ping2002.codfw.wmnet
[13:20:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:20] <wikibugs>	 (PS5) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[13:23:58] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:26:13] <wikibugs>	 (PS6) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[13:26:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:28:46] <Amir1>	 !log killing lingering process from mwmaint to depooled db1147
[13:28:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:15] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ping2002.codfw.wmnet
[13:30:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:02] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host ping1002.eqiad.wmnet
[13:32:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:02] <wikibugs>	 (CR) Jbond: [C: +1] "LGTM, possible im being to picky on the comment so feel free to merge as is" [puppet] - https://gerrit.wikimedia.org/r/741849 (https://phabricator.wikimedia.org/T296456) (owner: David Caro)
[13:39:33] <wikibugs>	 (PS2) Jelto: miscweb: fix whitespace for affinity [deployment-charts] - https://gerrit.wikimedia.org/r/741149
[13:39:55] <wikibugs>	 (PS7) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[13:40:31] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:40:37] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ping1002.eqiad.wmnet
[13:40:37] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32646/console" [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:40:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:12] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32647/console" [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:43:08] <wikibugs>	 (PS1) Ayounsi: Add new ping VMs to DHCP [puppet] - https://gerrit.wikimedia.org/r/741916 (https://phabricator.wikimedia.org/T295767)
[13:44:25] <wikibugs>	 (PS8) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[13:44:59] <wikibugs>	 (CR) Ayounsi: [C: +2] Add new ping VMs to DHCP [puppet] - https://gerrit.wikimedia.org/r/741916 (https://phabricator.wikimedia.org/T295767) (owner: Ayounsi)
[13:45:10] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32648/console" [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:45:55] <wikibugs>	 (PS2) Jbond: profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[13:46:17] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:46:46] <wikibugs>	 (CR) Jelto: [C: +2] miscweb: fix whitespace for affinity [deployment-charts] - https://gerrit.wikimedia.org/r/741149 (owner: Jelto)
[13:47:31] <wikibugs>	 (PS2) Jelto: helmfile.d:miscweb add node affinity to ssd nodes [deployment-charts] - https://gerrit.wikimedia.org/r/741124
[13:47:36] <wikibugs>	 (CR) jerkins-bot: [V: -1] profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[13:49:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] helmfile.d:miscweb add node affinity to ssd nodes [deployment-charts] - https://gerrit.wikimedia.org/r/741124 (owner: Jelto)
[13:49:42] <wikibugs>	 (PS3) Jelto: miscweb: fix whitespace for affinity [deployment-charts] - https://gerrit.wikimedia.org/r/741149
[13:52:58] <wikibugs>	 (PS3) Jbond: profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[13:54:19] <wikibugs>	 (CR) Jelto: [C: +2] miscweb: fix whitespace for affinity [deployment-charts] - https://gerrit.wikimedia.org/r/741149 (owner: Jelto)
[13:54:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[13:54:50] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] Remove search.wikimedia.org files [mediawiki-config] - https://gerrit.wikimedia.org/r/741115 (https://phabricator.wikimedia.org/T289224) (owner: Majavah)
[13:55:47] <wikibugs>	 (CR) Giuseppe Lavagetto: "Wait a sec, how large is the miscweb image?" [deployment-charts] - https://gerrit.wikimedia.org/r/741124 (owner: Jelto)
[13:56:23] <wikibugs>	 (PS1) Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089)
[13:57:27] <wikibugs>	 (PS1) Ayounsi: Set flat partman receipe for all ping hosts [puppet] - https://gerrit.wikimedia.org/r/741918 (https://phabricator.wikimedia.org/T295767)
[13:57:50] <wikibugs>	 (PS2) Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089)
[13:58:02] <wikibugs>	 (Merged) jenkins-bot: miscweb: fix whitespace for affinity [deployment-charts] - https://gerrit.wikimedia.org/r/741149 (owner: Jelto)
[13:58:47] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32651/console" [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[13:59:32] <wikibugs>	 (CR) Ayounsi: [C: +2] Set flat partman receipe for all ping hosts [puppet] - https://gerrit.wikimedia.org/r/741918 (https://phabricator.wikimedia.org/T295767) (owner: Ayounsi)
[13:59:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[14:00:13] <wikibugs>	 (PS9) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[14:00:49] <wikibugs>	 (CR) Majavah: P:cache::kafka::Webrequest: use cert defined in P:certificates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[14:02:14] <wikibugs>	 (CR) Jelto: helmfile.d:miscweb add node affinity to ssd nodes (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/741124 (owner: Jelto)
[14:02:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[14:02:27] <wikibugs>	 (PS3) Jelto: helmfile.d:miscweb add node affinity to ssd nodes [deployment-charts] - https://gerrit.wikimedia.org/r/741124
[14:09:11] <wikibugs>	 (PS1) Filippo Giunchedi: rancid: add ability to disable emails [puppet] - https://gerrit.wikimedia.org/r/741919
[14:10:43] <wikibugs>	 (CR) jerkins-bot: [V: -1] rancid: add ability to disable emails [puppet] - https://gerrit.wikimedia.org/r/741919 (owner: Filippo Giunchedi)
[14:11:22] <godog>	 ORLY?
[14:12:20] <godog>	 looks like an unrelated failure
[14:12:22] <godog>	 15:10:23        error during compilation: Evaluation Error: Error while evaluating a Function Call, node codename does not meet requirement `stretch >= buster` (file: /srv/workspace/puppet/modules/debian/functions/codename/require.pp, line: 22, column: 9) on node 89b14fc12ee3.integration.eqiad.wmflabs
[14:12:53] <wikibugs>	 (CR) Filippo Giunchedi: "CI failure is unrelated" [puppet] - https://gerrit.wikimedia.org/r/741919 (owner: Filippo Giunchedi)
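Note on the CI error quoted above: the message `stretch >= buster` reads as a release-order check that failed because the CI node reported an older Debian codename than the manifest requires. The sketch below is only an illustration of that kind of comparison. It is not the code in `modules/debian/functions/codename/require.pp`, and the list `DEBIAN_RELEASES` and the function `codename_meets` are hypothetical names introduced here.

```python
# Illustrative sketch only; names are hypothetical and this is not the
# Puppet function referenced in the error message.

# Debian releases ordered oldest to newest (9 = stretch, 10 = buster, 11 = bullseye).
DEBIAN_RELEASES = ["jessie", "stretch", "buster", "bullseye"]


def codename_meets(current: str, required: str) -> bool:
    """Return True if `current` is the same release as `required` or a newer one."""
    return DEBIAN_RELEASES.index(current) >= DEBIAN_RELEASES.index(required)


if __name__ == "__main__":
    # A stretch node does not satisfy a "buster or newer" requirement.
    print(codename_meets("stretch", "buster"))
    # A bullseye node does.
    print(codename_meets("bullseye", "buster"))
```

Under this reading, the check fails for reasons tied to the CI node's distribution rather than to the patch under review, which is consistent with the "unrelated failure" remark.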
[14:12:55] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[14:12:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:18] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[14:16:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:16] <logmsgbot>	 !log klausman@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[14:17:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:26] <wikibugs>	 (PS10) Jbond: P:base::certificates: update support for trusted CA [puppet] - https://gerrit.wikimedia.org/r/741867 (https://phabricator.wikimedia.org/T296089)
[14:18:28] <wikibugs>	 (PS1) Jbond: P:wmcs::backy2: fix spec tests [puppet] - https://gerrit.wikimedia.org/r/741920
[14:18:48] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] "FYI" [puppet] - https://gerrit.wikimedia.org/r/741920 (owner: Jbond)
[14:19:19] <wikibugs>	 (PS4) Jbond: profile::rsyslog: move Kafka TLS CA settings to the new bundle [puppet] - https://gerrit.wikimedia.org/r/739463 (https://phabricator.wikimedia.org/T291905) (owner: Elukey)
[14:19:26] <wikibugs>	 (PS3) Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089)
[14:19:50] <wikibugs>	 (PS1) Kormat: Initial structure and configs. [software/wmfdb] - https://gerrit.wikimedia.org/r/741921
[14:19:55] <jinxer-wm>	 (LogstashIndexingFailures) firing: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[14:20:48] <wikibugs>	 (CR) Kormat: [V: +2 C: +2] Initial structure and configs. [software/wmfdb] - https://gerrit.wikimedia.org/r/741921 (owner: Kormat)
[14:21:29] <godog>	 jbond: thanks for the puppet rspec CI fixes
[14:23:03] <wikibugs>	 SRE, Patch-For-Review: Unify WMF internal CA certs bundle generation - https://phabricator.wikimedia.org/T296089 (jbond) >>! In T296089#7522709, @elukey wrote: > I am wondering what is best to do for use cases like: >  > * https://gerrit.wikimedia.org/r/c/operations/puppet/+/739463 (not merged yet) > * h...
[14:24:55] <jinxer-wm>	 (LogstashIndexingFailures) resolved: Logstash Elasticsearch indexing errors - https://wikitech.wikimedia.org/wiki/Logstash#Indexing_errors - https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=40 - https://alerts.wikimedia.org
[14:24:59] <wikibugs>	 (PS4) Jbond: P:cache::kafka::Webrequest: use cert defined in P:certificates [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089)
[14:25:08] <jayme>	 !log uncordoned kubestage1003.eqiad.wmnet kubestage1004.eqiad.wmnet - T293729
[14:25:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:13] <stashbot>	 T293729: setup/install kubestage100[34] - https://phabricator.wikimedia.org/T293729
[14:25:19] <wikibugs>	 (CR) Jbond: "updated thanks" [puppet] - https://gerrit.wikimedia.org/r/741917 (https://phabricator.wikimedia.org/T296089) (owner: Jbond)
[14:28:58] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "Remove debian/watch and fix the distribution in the changelog; otherwise lgtm." [docker-images/imagecatalog] (debian) - https://gerrit.wikimedia.org/r/738500 (owner: RLazarus)
[14:30:03] <wikibugs>	 (CR) Jbond: WIP cli: add --fail-fast flag and behavior (3 comments) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: David Caro)
[14:38:13] <wikibugs>	 (PS1) Ayounsi: Move ping offload to new ping VMs [homer/public] - https://gerrit.wikimedia.org/r/741923 (https://phabricator.wikimedia.org/T295767)
[14:38:40] <icinga-wm>	 PROBLEM - Check size of conntrack table on ping3002 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.20.0.8: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[14:39:27] <wikibugs>	 (CR) Jbond: [C: +2] interfaces: remove ethtool configueration [puppet] - https://gerrit.wikimedia.org/r/662699 (https://phabricator.wikimedia.org/T236208) (owner: Jbond)
[14:39:42] <icinga-wm>	 RECOVERY - Check size of conntrack table on ping3002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[14:40:42] <wikibugs>	 (PS1) Klausman: Add inference codfw service record [dns] - https://gerrit.wikimedia.org/r/741924 (https://phabricator.wikimedia.org/T289835)
[14:42:16] <wikibugs>	 (CR) Ayounsi: [C: +2] Move ping offload to new ping VMs [homer/public] - https://gerrit.wikimedia.org/r/741923 (https://phabricator.wikimedia.org/T295767) (owner: Ayounsi)
[14:42:53] <XioNoX>	 !log Update ping redirect to point to new ping VMs - T295767
[14:42:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:57] <wikibugs>	 (Merged) jenkins-bot: Move ping offload to new ping VMs [homer/public] - https://gerrit.wikimedia.org/r/741923 (https://phabricator.wikimedia.org/T295767) (owner: Ayounsi)
[14:42:57] <stashbot>	 T295767: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767
[14:43:40] <wikibugs>	 (CR) Elukey: Add inference codfw service record (1 comment) [dns] - https://gerrit.wikimedia.org/r/741924 (https://phabricator.wikimedia.org/T289835) (owner: Klausman)
[14:43:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1147 (T296143)', diff saved to https://phabricator.wikimedia.org/P17849 and previous config saved to /var/cache/conftool/dbconfig/20211125-144344-ladsgroup.json
[14:43:48] <wikibugs>	 (PS5) Jbond: (WIP) interface: try to update the numa integrations [puppet] - https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208)
[14:43:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:49] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[14:44:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] (WIP) interface: try to update the numa integrations [puppet] - 10https://gerrit.wikimedia.org/r/662751 (https://phabricator.wikimedia.org/T236208) (owner: 10Jbond)
[14:45:14] <wikibugs>	 (03PS2) 10Klausman: Add inference codfw service record [dns] - 10https://gerrit.wikimedia.org/r/741924 (https://phabricator.wikimedia.org/T289835)
[14:45:22] <wikibugs>	 (03CR) 10Klausman: Add inference codfw service record (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/741924 (https://phabricator.wikimedia.org/T289835) (owner: 10Klausman)
[14:47:05] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10User-jbond: Create base cook book for rebooting/restarting servers/daemons - https://phabricator.wikimedia.org/T284079 (10jbond) 05Open→03Resolved
[14:47:30] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Add inference codfw service record [dns] - 10https://gerrit.wikimedia.org/r/741924 (https://phabricator.wikimedia.org/T289835) (owner: 10Klausman)
[14:48:15] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Cookbook for centralised logouts and session status queries - https://phabricator.wikimedia.org/T283242 (10jbond) 05Open→03Resolved a:03jbond this has now been implemented
[14:49:42] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Add inference codfw service record [dns] - 10https://gerrit.wikimedia.org/r/741924 (https://phabricator.wikimedia.org/T289835) (owner: 10Klausman)
[14:49:50] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] Add inference codfw service record [dns] - 10https://gerrit.wikimedia.org/r/741924 (https://phabricator.wikimedia.org/T289835) (owner: 10Klausman)
[14:52:45] <wikibugs>	 (03PS1) 10Klausman: conftool-data: add new inference discovery service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/741925 (https://phabricator.wikimedia.org/T289835)
[14:53:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Security-Team, 10CAS-SSO, 10User-jbond: Further steps for CAS/web SSO - https://phabricator.wikimedia.org/T233921 (10jbond)
[14:53:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/741919 (owner: 10Filippo Giunchedi)
[14:53:54] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review, 10User-jbond: Investigate how automated tasks can authenticate against CAS - https://phabricator.wikimedia.org/T239323 (10jbond) 05Open→03Resolved a:03jbond Resolving this; ultimately we have decided that we will bypass SSO for autom...
[14:54:25] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] conftool-data: add new inference discovery service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/741925 (https://phabricator.wikimedia.org/T289835) (owner: 10Klausman)
[14:54:30] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts ping2001.codfw.wmnet
[14:54:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:59] <wikibugs>	 10Puppet, 10Infrastructure-Foundations, 10User-jbond: Fix Puppet CA expired certs - https://phabricator.wikimedia.org/T286229 (10jbond) 05Open→03Resolved
[14:57:24] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] conftool-data: add new inference discovery service in codfw [puppet] - 10https://gerrit.wikimedia.org/r/741925 (https://phabricator.wikimedia.org/T289835) (owner: 10Klausman)
[14:57:50] <wikibugs>	 (03PS1) 10Ayounsi: Remove old ping servers from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/741926 (https://phabricator.wikimedia.org/T295767)
[14:58:02] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: sre.discovery: use CNAME records for swift dns lookup (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/730692 (owner: 10Giuseppe Lavagetto)
[14:58:47] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Remove old ping servers from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/741926 (https://phabricator.wikimedia.org/T295767) (owner: 10Ayounsi)
[14:58:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1147 (T296143)', diff saved to https://phabricator.wikimedia.org/P17850 and previous config saved to /var/cache/conftool/dbconfig/20211125-145849-ladsgroup.json
[14:58:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:53] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[15:01:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32652/console" [puppet] - 10https://gerrit.wikimedia.org/r/741919 (owner: 10Filippo Giunchedi)
[15:04:21] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ping2001.codfw.wmnet
[15:04:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `ping2001.codfw.wmnet` - ping2001.codfw.wmnet (**PASS**)   - Dow...
[15:05:13] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts ping3001.esams.wmnet
[15:05:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:09:38] <wikibugs>	 10SRE, 10Traffic, 10Wikimedia Enterprise, 10Wikimedia Enterprise Discussion: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (10MNadrofsky) Adding to the Foundational Tech Requests board for Steering Committee intake. This will help us prioritize/resource this work effectively.
[15:10:28] <wikibugs>	 10SRE, 10Foundational Technology Requests, 10Traffic, 10Wikimedia Enterprise, 10Wikimedia Enterprise Discussion: Allow-Listing for Enterprise IPs - https://phabricator.wikimedia.org/T294798 (10MNadrofsky) a:03MNadrofsky
[15:12:47] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ping3001.esams.wmnet
[15:12:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `ping3001.esams.wmnet` - ping3001.esams.wmnet (**PASS**)   - Dow...
[15:13:14] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.decommission for hosts ping1001.eqiad.wmnet
[15:13:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1147 (T296143)', diff saved to https://phabricator.wikimedia.org/P17851 and previous config saved to /var/cache/conftool/dbconfig/20211125-151354-ladsgroup.json
[15:13:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:58] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[15:19:44] <logmsgbot>	 !log klausman@cumin1001 conftool action : set/pooled=yes:weight=1; selector: cluster=ml_serve,service=kubesvc
[15:19:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
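Note on the conftool line above: the logged message packs three things into one string, namely the verb (set), the values being assigned (pooled=yes, weight=1), and the selector that picks which objects are touched (cluster=ml_serve, service=kubesvc). A minimal sketch of how that string could be split apart, assuming the exact "conftool action : ...; selector: ..." layout seen in this log. The function name parse_conftool_action is hypothetical and is not part of conftool itself.

```python
def parse_conftool_action(message: str) -> dict:
    """Split a logged 'conftool action' message into verb, values and selector.

    Assumes the layout seen in this log:
    'conftool action : <verb>/<k>=<v>[:<k>=<v>...]; selector: <k>=<v>[,<k>=<v>...]'
    """
    action_part, selector_part = message.split("; selector: ", 1)
    # Drop the leading 'conftool action :' label.
    action = action_part.split(":", 1)[1].strip()
    verb, assignments = action.split("/", 1)
    # Assigned values are colon-separated key=value pairs.
    values = dict(pair.split("=", 1) for pair in assignments.split(":"))
    # Selector terms are comma-separated key=value pairs.
    selector = dict(pair.split("=", 1) for pair in selector_part.split(","))
    return {"verb": verb, "values": values, "selector": selector}


parsed = parse_conftool_action(
    "conftool action : set/pooled=yes:weight=1; "
    "selector: cluster=ml_serve,service=kubesvc"
)
print(parsed)
```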
[15:22:31] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ping1001.eqiad.wmnet
[15:22:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:41] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1001 for hosts: `ping1001.eqiad.wmnet` - ping1001.eqiad.wmnet (**PASS**)   - Dow...
[15:28:20] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Rebuild ping* hosts with 10G disks - https://phabricator.wikimedia.org/T295767 (10ayounsi) a:03ayounsi All 3 VMs got rebuilt with larger disks, but with the default Debian Buster.  @MoritzMuehlenhoff let me know if they need to be re-rebu...
[15:28:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1147 (T296143)', diff saved to https://phabricator.wikimedia.org/P17852 and previous config saved to /var/cache/conftool/dbconfig/20211125-152858-ladsgroup.json
[15:29:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1148.eqiad.wmnet with reason: Maintenance T296143
[15:29:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1148.eqiad.wmnet with reason: Maintenance T296143
[15:29:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:03] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[15:29:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T296143)', diff saved to https://phabricator.wikimedia.org/P17853 and previous config saved to /var/cache/conftool/dbconfig/20211125-152906-ladsgroup.json
[15:29:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:19] <wikibugs>	 (03PS1) 10Klausman: role::ml_k8s::worker: Activate LVS config for inference in codfw [puppet] - 10https://gerrit.wikimedia.org/r/741934 (https://phabricator.wikimedia.org/T289835)
[15:38:10] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32653/console" [puppet] - 10https://gerrit.wikimedia.org/r/741934 (https://phabricator.wikimedia.org/T289835) (owner: 10Klausman)
[15:38:38] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] role::ml_k8s::worker: Activate LVS config for inference in codfw [puppet] - 10https://gerrit.wikimedia.org/r/741934 (https://phabricator.wikimedia.org/T289835) (owner: 10Klausman)
[15:39:52] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] role::ml_k8s::worker: Activate LVS config for inference in codfw [puppet] - 10https://gerrit.wikimedia.org/r/741934 (https://phabricator.wikimedia.org/T289835) (owner: 10Klausman)
[15:47:19] <jynus>	 !log reenable gtid on db1163
[15:47:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:52:30] <klausman>	 !loh restarting pybal on lvs2010  T289835
[15:52:30] <stashbot>	 T289835: Create a LB service for inference.discovery.wmnet - https://phabricator.wikimedia.org/T289835
[15:55:38] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Slowly repool db1163', diff saved to https://phabricator.wikimedia.org/P17856 and previous config saved to /var/cache/conftool/dbconfig/20211125-155538-jynus.json
[15:55:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:57:03] <vgutierrez>	 !log restarting pybal  on lvs2010 - T289835
[15:57:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:09] <wikibugs>	 (03PS1) 10Volans: Update to v2.10.4-wmf6 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/741936
[16:10:19] <klausman>	 !log restarting pybal on lvs2009 T289835
[16:10:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:23] <stashbot>	 T289835: Create a LB service for inference.discovery.wmnet - https://phabricator.wikimedia.org/T289835
[16:11:56] <wikibugs>	 (03PS1) 10Hnowlan: api-gateway: allow discovery services to set custom rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956)
[16:14:05] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Slowly repool db1163+', diff saved to https://phabricator.wikimedia.org/P17859 and previous config saved to /var/cache/conftool/dbconfig/20211125-161404-jynus.json
[16:14:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:53] <elukey>	 hnowlan: <3 <3 <3 <3 <3 <3
[16:15:29] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Update to v2.10.4-wmf6 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/741936 (owner: 10Volans)
[16:15:52] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] api-gateway: allow discovery services to set custom rate limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan)
[16:15:57] <hnowlan>	 lmao ;_;
[16:16:17] <hnowlan>	 elukey: tbh I am not 100% certain this is the right approach, I suspect petr will put me right though :) 
[16:18:34] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Slowly repool db1163++', diff saved to https://phabricator.wikimedia.org/P17860 and previous config saved to /var/cache/conftool/dbconfig/20211125-161833-jynus.json
[16:18:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:08] <icinga-wm>	 PROBLEM - puppet last run on ms-backup1001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:25:09] <icinga-wm>	 RECOVERY - puppet last run on ms-backup1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[16:26:38] <wikibugs>	 (03PS1) 10Elukey: Add discovery record support for the inference LVS [dns] - 10https://gerrit.wikimedia.org/r/741939 (https://phabricator.wikimedia.org/T289835)
[16:29:32] <wikibugs>	 (03PS1) 10Elukey: service::catalog: set inference as active-active [puppet] - 10https://gerrit.wikimedia.org/r/741940 (https://phabricator.wikimedia.org/T289835)
[16:29:33] <icinga-wm>	 PROBLEM - SSH on kubernetes1004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:30:03] <wikibugs>	 (03PS2) 10Elukey: service::catalog: set inference as active-active [puppet] - 10https://gerrit.wikimedia.org/r/741940 (https://phabricator.wikimedia.org/T289835)
[16:31:50] <wikibugs>	 (03CR) 10Volans: [V: 03+2 C: 03+2] Update to v2.10.4-wmf6 [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/741936 (owner: 10Volans)
[16:32:28] <wikibugs>	 (03CR) 10Hnowlan: [C: 04-1] "This is the wrong approach in terms of syntax - _ratelimit.yaml needs to be a generic template and we need to write n+1 configs where n is" [deployment-charts] - 10https://gerrit.wikimedia.org/r/741937 (https://phabricator.wikimedia.org/T295956) (owner: 10Hnowlan)
[16:32:51] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] service::catalog: set inference as active-active [puppet] - 10https://gerrit.wikimedia.org/r/741940 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey)
[16:32:57] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Add discovery record support for the inference LVS [dns] - 10https://gerrit.wikimedia.org/r/741939 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey)
[16:41:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1148 (T296143)', diff saved to https://phabricator.wikimedia.org/P17861 and previous config saved to /var/cache/conftool/dbconfig/20211125-164153-ladsgroup.json
[16:41:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:41:59] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[16:45:53] <logmsgbot>	 !log volans@deploy1002 Started deploy [netbox/deploy@87a36a7]: Test v2.10.4-wmf6 on netbox-next
[16:45:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:58] <logmsgbot>	 !log volans@deploy1002 Finished deploy [netbox/deploy@87a36a7]: Test v2.10.4-wmf6 on netbox-next (duration: 01m 04s)
[16:47:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:41] <logmsgbot>	 !log jynus@cumin1001 dbctl commit (dc=all): 'Fully repool db1163', diff saved to https://phabricator.wikimedia.org/P17862 and previous config saved to /var/cache/conftool/dbconfig/20211125-164941-jynus.json
[16:49:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:54] <logmsgbot>	 !log volans@deploy1002 Started deploy [netbox/deploy@87a36a7]: Deploy v2.10.4-wmf6
[16:50:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:50] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Add discovery record support for the inference LVS [dns] - 10https://gerrit.wikimedia.org/r/741939 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey)
[16:56:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1148 (T296143)', diff saved to https://phabricator.wikimedia.org/P17863 and previous config saved to /var/cache/conftool/dbconfig/20211125-165657-ladsgroup.json
[16:57:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:03] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[16:57:54] <logmsgbot>	 !log volans@deploy1002 Finished deploy [netbox/deploy@87a36a7]: Deploy v2.10.4-wmf6 (duration: 06m 59s)
[16:57:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:25] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] service::catalog: set inference as active-active [puppet] - 10https://gerrit.wikimedia.org/r/741940 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey)
[17:05:24] <wikibugs>	 (03PS3) 10David Caro: WIP cli: add --fail-fast flag and behavior [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028)
[17:06:43] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP cli: add --fail-fast flag and behavior [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/740539 (https://phabricator.wikimedia.org/T295028) (owner: 10David Caro)
[17:09:37] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add discovery record support for the inference LVS [dns] - 10https://gerrit.wikimedia.org/r/741939 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey)
[17:12:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1148 (T296143)', diff saved to https://phabricator.wikimedia.org/P17864 and previous config saved to /var/cache/conftool/dbconfig/20211125-171202-ladsgroup.json
[17:12:03] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=inference
[17:12:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:07] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[17:12:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:05] <wikibugs>	 (03PS4) 10Jbond: puppetmaster - hiera: order site after role [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez)
[17:16:17] <wikibugs>	 (03PS5) 10Jbond: puppetmaster - hiera: order site after role [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez)
[17:16:26] <wikibugs>	 (03CR) 10Jbond: puppetmaster - hiera: order site after role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez)
[17:17:38] <wikibugs>	 (03PS1) 10Elukey: Revert "Add discovery record support for the inference LVS" [dns] - 10https://gerrit.wikimedia.org/r/741904
[17:18:35] <wikibugs>	 (03CR) 10Elukey: "This led to an error: https://phabricator.wikimedia.org/P17865" [puppet] - 10https://gerrit.wikimedia.org/r/741940 (https://phabricator.wikimedia.org/T289835) (owner: 10Elukey)
[17:18:45] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Revert "Add discovery record support for the inference LVS" [dns] - 10https://gerrit.wikimedia.org/r/741904 (owner: 10Elukey)
[17:20:31] <icinga-wm>	 PROBLEM - SSH on mw2252.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:27:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1148 (T296143)', diff saved to https://phabricator.wikimedia.org/P17866 and previous config saved to /var/cache/conftool/dbconfig/20211125-172707-ladsgroup.json
[17:27:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1149.eqiad.wmnet with reason: Maintenance T296143
[17:27:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1149.eqiad.wmnet with reason: Maintenance T296143
[17:27:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:13] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[17:27:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T296143)', diff saved to https://phabricator.wikimedia.org/P17867 and previous config saved to /var/cache/conftool/dbconfig/20211125-172714-ladsgroup.json
[17:27:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:45] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:30:35] <icinga-wm>	 RECOVERY - SSH on kubernetes1004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:35:07] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:13:46] <wikibugs>	 10SRE, 10MediaWiki-Core-Snapshots, 10Wikimedia-Site-requests: Transwiki import not working in production - https://phabricator.wikimedia.org/T140206 (10Stang)
[18:21:27] <icinga-wm>	 RECOVERY - SSH on mw2252.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:34:09] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:43:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1149 (T296143)', diff saved to https://phabricator.wikimedia.org/P17868 and previous config saved to /var/cache/conftool/dbconfig/20211125-184336-ladsgroup.json
[18:43:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:42] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[18:46:29] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "Your initial PCC runs were basically NOOP. How do you feel about merging this?" [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez)
[18:48:45] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "good catch. Sorry about that." [puppet] - 10https://gerrit.wikimedia.org/r/741920 (owner: 10Jbond)
[18:51:41] <wikibugs>	 (03CR) 10Jbond: puppetmaster - hiera: order site after role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/740141 (owner: 10Arturo Borrero Gonzalez)
[18:58:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1149 (T296143)', diff saved to https://phabricator.wikimedia.org/P17869 and previous config saved to /var/cache/conftool/dbconfig/20211125-185841-ladsgroup.json
[18:58:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:58:45] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[19:13:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1149 (T296143)', diff saved to https://phabricator.wikimedia.org/P17870 and previous config saved to /var/cache/conftool/dbconfig/20211125-191345-ladsgroup.json
[19:13:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:13:51] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[19:28:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'After maintenance db1149 (T296143)', diff saved to https://phabricator.wikimedia.org/P17871 and previous config saved to /var/cache/conftool/dbconfig/20211125-192850-ladsgroup.json
[19:28:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance T296143
[19:28:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1150.eqiad.wmnet with reason: Maintenance T296143
[19:28:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:55] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[19:28:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:23:31] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:43:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1160.eqiad.wmnet with reason: Maintenance T296143
[20:43:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1160.eqiad.wmnet with reason: Maintenance T296143
[20:43:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:56] <stashbot>	 T296143: Optimize commonswiki image table - https://phabricator.wikimedia.org/T296143
[20:43:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T296143)', diff saved to https://phabricator.wikimedia.org/P17872 and previous config saved to /var/cache/conftool/dbconfig/20211125-204357-ladsgroup.json
[20:43:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:44:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:40] <wikibugs>	 (03PS1) 104nn1l2: Add templateeditor group and protection level at viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741972 (https://phabricator.wikimedia.org/T296154)
[22:07:13] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:09:23] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:14:40] <nn1l2>	 Do we have another branch called refs/for/master besides master?
[22:15:06] <nn1l2>	 Here I mean: https://phabricator.wikimedia.org/source/mediawiki-config/branches/master/ 
[22:16:33] <nn1l2>	 I want to know why I should use "git push origin HEAD:refs/for/master" instead of "git push origin HEAD:master" when pushing commits.
[22:20:11] <Reedy>	 Because gerrit
[22:20:22] <nn1l2>	 I know we should have that because only sysadmins should be allowed to push to the *original* master branch, and mere volunteers such as me should push somewhere else, but why can't I see that "experimental" branch?
[22:20:29] <Reedy>	 You're not pushing to the branch, you're pushing to basically a review queue
[22:21:33] <nn1l2>	 Thanks Reedy
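Note on the exchange above: in Gerrit, refs/for/<branch> is not a branch but a special ref that turns the pushed commits into a change awaiting review for that branch, which is why it never shows up in the branch list. A minimal sketch of the two push targets being compared. The helper names review_refspec, direct_refspec and push_command are hypothetical, and the code only builds the command line without running git.

```python
def review_refspec(branch: str, source: str = "HEAD") -> str:
    """Refspec that sends `source` to Gerrit's review queue for `branch`."""
    return f"{source}:refs/for/{branch}"


def direct_refspec(branch: str, source: str = "HEAD") -> str:
    """Refspec that would update `branch` itself, bypassing review."""
    return f"{source}:{branch}"


def push_command(remote: str, refspec: str) -> list[str]:
    """Build (but do not execute) the git push argument list."""
    return ["git", "push", remote, refspec]


# The form recommended in the channel: creates or updates a change for review.
print(" ".join(push_command("origin", review_refspec("master"))))
# The form nn1l2 asked about: a direct branch update, normally restricted.
print(" ".join(push_command("origin", direct_refspec("master"))))
```

Only the first form results in a change that reviewers and CI can vote on. Whether the second is accepted at all depends on the repository's access settings.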
[23:29:28] <wikibugs>	 (03PS1) 104nn1l2: Add planet4589.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741980
[23:38:13] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /robots.txt (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 503 (expecting: 200): / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[23:39:00] <wikibugs>	 (03PS2) 104nn1l2: Add planet4589.org to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/741980 (https://phabricator.wikimedia.org/T296136)
[23:40:21] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[23:49:28] <nn1l2>	 How do you make jenkins-bot test the pushed patch?
[23:50:01] <nn1l2>	 For example, compare https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/741097 with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/741980